
Building an ML Options Arbitrage Detection Model with Python

  • Writer: Nikhil Adithyan
  • Feb 21
  • 6 min read

A guide to combining options and stock data for ML-driven arbitrage analysis



Written in collaboration with Gaille Amolong


Options arbitrage opportunities often arise from pricing inefficiencies between the bid and ask of an options contract and the underlying asset’s price movements. With the right data and machine learning techniques, traders can spot these discrepancies faster and more accurately than manual observation.


In this article, we walk through how we leveraged EODHD’s US Stock Options Data API and End-of-Day Historical Stock Market Data API to build and train a model that detects potential arbitrage. We will showcase the dataset, the complete Python script, and the final model’s performance results, along with a discussion of what the results mean and potential improvements.


Table of Contents

  1. Introduction & Motivation


  2. Overview of the EODHD APIs

    - Options Contracts Endpoint

    - Historical Stock Data Endpoint


  3. Data Extraction and Preprocessing

    - Fetching Options Data

    - Fetching Historical Stock Prices

    - Data Merging & Feature Engineering


  4. Building the Machine Learning Model


  5. Results & Visualization


  6. Conclusion & Next Steps


1. Introduction & Motivation

When trading stock options, arbitrage opportunities appear when the market's pricing of calls, puts, or the underlying asset falls out of line. These mismatches can be fleeting, but a machine learning approach can monitor many contracts at once, analyze pricing trends, and flag potential discrepancies much faster than manual observation.


Goal:

  • Collect real options data (bid, ask, implied volatility, Delta, Gamma) from EODHD’s API

  • Combine with historical stock prices

  • Engineer features (e.g., spread, midpoint, relative spread)

  • Train a model to detect whether an option’s price is “out of line” with the underlying stock — a proxy for arbitrage


2. Overview of the EODHD APIs

EODHD provides comprehensive market data APIs that cover both options and stock price histories. In this project, we specifically utilized two endpoints:


Options Contracts Endpoint



GET https://eodhd.com/api/mp/unicornbay/options/contracts
    ?filter[underlying_symbol]=AAPL
    &fields[options-contracts]=bid,ask,bid_date,contract,exp_date,strike,delta,gamma,volatility
    &api_token=demo

This endpoint allowed us to specify which fields we wanted, ensuring we could capture the necessary metrics (bid, ask, Delta, Gamma, volatility, etc.) for each options contract.
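
As a small illustration of how the query string above is put together, here is a minimal sketch that builds the same kind of URL with Python's standard library. The helper name build_options_url is ours, not part of the EODHD API, and we assume the server accepts percent-encoded brackets and commas, as most HTTP servers do.

```python
from urllib.parse import urlencode

BASE_URL = "https://eodhd.com/api/mp/unicornbay/options/contracts"


def build_options_url(ticker, fields, api_token="demo"):
    """Hypothetical helper: return the contracts URL with encoded query parameters."""
    params = {
        "filter[underlying_symbol]": ticker,        # restrict to one underlying
        "fields[options-contracts]": ",".join(fields),  # comma-separated field list
        "api_token": api_token,
    }
    # urlencode percent-encodes the brackets and commas in keys and values
    return f"{BASE_URL}?{urlencode(params)}"


url = build_options_url("AAPL", ["bid", "ask", "strike"])
print(url)
```

Keeping the field list as a Python list makes it easy to add or drop fields such as delta or gamma without editing a long string by hand.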


Historical Stock Data Endpoint

The End-of-Day Historical Stock Market Data API returns OHLC (open, high, low, close) and volume data for a specified ticker over time. In our case, we used AAPL.US:



GET https://eodhd.com/api/eod/AAPL.US?api_token=demo&fmt=json

This gives us a chronological list of historical price records, which we can merge with our options data to see how the underlying stock’s price lines up with each option’s bid and ask quotes.


3. Data Extraction and Preprocessing


3.1 Fetching Options Data

Below is a snippet demonstrating how we pulled down AAPL options data using the demo token. Notice we specifically request bid, ask, bid_date, contract, exp_date, strike, delta, gamma, and volatility:



import pandas as pd
import requests


def fetch_options_data(api_token, ticker="AAPL"):
    """
    Fetch options contracts data from the EODHD API using the "Get options contracts" endpoint.
    Using the demo API token for AAPL.
    """
    # Build URL with a filter for the underlying symbol and request additional fields
    url = (
        f"https://eodhd.com/api/mp/unicornbay/options/contracts"
        f"?filter[underlying_symbol]={ticker}"
        f"&fields[options-contracts]=bid,ask,bid_date,contract,exp_date,strike,delta,gamma,volatility"
        f"&api_token={api_token}"
    )
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception("Error fetching options data: " + response.text)
    
    data = response.json()
    # Expecting data in the "data" field with each item containing an "attributes" dict.
    options_list = data.get("data", [])
    records = [item.get("attributes", {}) for item in options_list]
    options_df = pd.DataFrame(records)
    
    # Convert bid_date to string (if exists) and later to datetime.
    if "bid_date" in options_df.columns:
        options_df["bid_date"] = options_df["bid_date"].astype(str)
    return options_df


When we print the first few rows, we see columns like contract, exp_date, strike, bid, ask, volatility, delta, gamma, and bid_date. These fields form the basis of our options dataset.
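
To make the parsing step concrete, here is a minimal sketch that runs the same "data" / "attributes" extraction on a made-up payload, so it works without a network call. The contract names and quote values are invented for illustration, and the third item (with no "attributes" key) is an assumption about what a malformed entry might look like.

```python
import pandas as pd

# Made-up payload mimicking the {"data": [{"attributes": {...}}]} shape
sample_response = {
    "data": [
        {"attributes": {"contract": "EXAMPLE_CALL_1", "strike": 230.0,
                        "bid": 4.10, "ask": 4.25,
                        "bid_date": "2025-02-14 20:59:59"}},
        {"attributes": {"contract": "EXAMPLE_PUT_1", "strike": 230.0,
                        "bid": 3.80, "ask": 3.95,
                        "bid_date": "2025-02-14 20:59:59"}},
        {"id": "entry-without-attributes"},  # hypothetical malformed item
    ]
}

# Same extraction as in fetch_options_data
records = [item.get("attributes", {}) for item in sample_response.get("data", [])]
options_df = pd.DataFrame(records)

# An item without attributes becomes an all-NaN row, so drop such rows
options_df = options_df.dropna(how="all").reset_index(drop=True)
print(options_df)
```

Dropping the all-NaN rows early keeps empty records from reaching the merge and feature steps later on.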


3.2 Fetching Historical Stock Prices

To complement the options quotes, we fetch the underlying stock’s daily price data:



def fetch_stock_data(api_token, ticker="AAPL.US"):
    """
    Fetch historical stock data from the EODHD End-of-Day Historical Stock Market Data API.
    """
    url = f"https://eodhd.com/api/eod/{ticker}?api_token={api_token}&fmt=json"
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception("Error fetching stock data: " + response.text)
    
    stock_data = response.json()
    stock_df = pd.DataFrame(stock_data)
    
    if "date" in stock_df.columns:
        stock_df["date"] = pd.to_datetime(stock_df["date"], errors="coerce")
    for col in ['close']:
        if col in stock_df.columns:
            stock_df[col] = pd.to_numeric(stock_df[col], errors="coerce")
    
    return stock_df


This endpoint returns records with fields such as date, open, high, low, close, and volume, typically going back years. We only need the close price for this particular project.


3.3 Data Merging & Feature Engineering

After retrieving both DataFrames, we must align them by date. Our options data has a bid_date column that typically includes a date and time stamp (e.g., 2025-02-14 20:59:59). We strip out the time portion so we can merge on the date column from the stock DataFrame.



def preprocess_data(options_df, stock_df):
    """
    Preprocess and merge options and stock data.
    - Extract date from options' bid_date.
    - Merge on matching dates.
    - Engineer features such as bid-ask spread.
    - Create a simple target label.
    """
    # Check that the options DataFrame contains the expected 'bid_date' column
    if "bid_date" not in options_df.columns:
        raise KeyError("The options data does not contain the 'bid_date' field. Please check the API response.")
    
    # Extract the date part from bid_date (assumes format like "2025-01-24 20:59:59")
    options_df["bid_date_only"] = options_df["bid_date"].apply(
        lambda x: x.split(" ")[0] if isinstance(x, str) and " " in x else x
    )
    options_df["bid_date_only"] = pd.to_datetime(options_df["bid_date_only"], errors="coerce")
    
    # Merge options and stock data on matching dates
    merged_df = pd.merge(options_df, stock_df, left_on="bid_date_only", right_on="date", how="inner")
    
    if merged_df.empty:
        print("Warning: No overlapping dates found between options bid_date and stock data date.")
        return merged_df

    # If bid and ask are missing, you can fill with dummy values (only if necessary)
    if "bid" not in merged_df.columns or merged_df["bid"].isnull().all():
        merged_df["bid"] = 1.0  
    if "ask" not in merged_df.columns or merged_df["ask"].isnull().all():
        merged_df["ask"] = 1.05 
    
    # Compute derived features
    merged_df["spread"] = merged_df["ask"] - merged_df["bid"]
    merged_df["midpoint"] = (merged_df["ask"] + merged_df["bid"]) / 2.0
    merged_df["relative_spread"] = merged_df["spread"] / merged_df["close"]
    
    # Create a simple target variable: flag as arbitrage if relative spread exceeds a threshold (e.g., 2%)
    threshold = 0.02
    merged_df["arbitrage"] = merged_df["relative_spread"].apply(lambda x: 1 if x > threshold else 0)
    
    return merged_df
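
An inner merge silently drops option rows whose date has no matching stock record. The following sketch shows one possible way to see which rows would be lost, using a left merge with pandas' indicator flag, and it normalizes timestamps with .dt.normalize() instead of string splitting. All contract names and prices here are made up for illustration.

```python
import pandas as pd

# Toy options quotes; the last one is stamped on a Saturday
options_df = pd.DataFrame({
    "contract": ["EXAMPLE_A", "EXAMPLE_B", "EXAMPLE_C"],
    "bid_date": ["2025-02-14 20:59:59", "2025-02-13 20:59:58", "2025-02-15 10:00:00"],
})
# Toy daily closes (invented values) for the two trading days
stock_df = pd.DataFrame({
    "date": pd.to_datetime(["2025-02-13", "2025-02-14"]),
    "close": [241.0, 244.0],
})

# Parse the timestamp, then set the time part to midnight
options_df["bid_date_only"] = pd.to_datetime(
    options_df["bid_date"], errors="coerce"
).dt.normalize()

# A left merge keeps every option row; "_merge" records whether a match was found
checked = pd.merge(
    options_df, stock_df,
    left_on="bid_date_only", right_on="date",
    how="left", indicator=True,
)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["contract", "bid_date"]])
```

Inspecting the unmatched rows before switching back to an inner merge makes it clear whether an empty result comes from a date format problem or from quotes stamped on non-trading days.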

Why These Features?


  • spread: The difference between ask and bid can reflect market inefficiencies. A larger-than-usual spread might indicate liquidity issues or a mispricing.

  • midpoint: The average of bid and ask, often used as a more “fair” price estimate.

  • relative_spread: Dividing by the underlying close price normalizes the spread across different stock price levels.

  • arbitrage (label): A simple rule-based flag that marks an option as mispriced if the spread is large relative to the stock’s price.
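
As a worked example of the features above, here is a minimal sketch on three made-up quotes against a stock that closed at 200. The numbers are invented; only the formulas and the 2% threshold come from preprocess_data.

```python
import pandas as pd

# Three invented quotes for an underlying that closed at 200
quotes = pd.DataFrame({
    "bid": [4.00, 1.00, 10.00],
    "ask": [4.20, 7.00, 10.50],
    "close": [200.0, 200.0, 200.0],
})

quotes["spread"] = quotes["ask"] - quotes["bid"]
quotes["midpoint"] = (quotes["ask"] + quotes["bid"]) / 2.0
quotes["relative_spread"] = quotes["spread"] / quotes["close"]

# Vectorized version of the rule-based label (threshold of 2%)
threshold = 0.02
quotes["arbitrage"] = (quotes["relative_spread"] > threshold).astype(int)
print(quotes)
```

Only the second quote is flagged: its spread of 6.00 is 3% of the stock price. With a 200-dollar underlying, a spread has to exceed 4 dollars to cross the 2% threshold, which helps explain why few contracts receive the label 1.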


4. Building the Machine Learning Model

We used a Random Forest classifier to predict whether an option's current market data suggests an arbitrage opportunity. The features were [bid, ask, spread, midpoint, close], and the label was the binary arbitrage flag.



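from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_model(df):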
    """
    Train a Random Forest classifier to predict arbitrage opportunities.
    """
    feature_cols = ["bid", "ask", "spread", "midpoint", "close"]
    X = df[feature_cols]
    y = df["arbitrage"]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"Model Accuracy: {acc:.2f}")
    
    importances = pd.Series(model.feature_importances_, index=feature_cols).sort_values(ascending=False)
    return model, importances

Here, we focus on five features: four derived from the quotes (bid, ask, spread, midpoint) plus the stock's close price. We then evaluate how well the model can predict the binary arbitrage label.
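
Because the label can be heavily imbalanced, it may help to check the class counts before splitting and to stratify the split so both classes appear in train and test. The sketch below is one possible way to do that; split_with_label_check is a hypothetical helper, not part of the original script, and the toy data is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_with_label_check(df, feature_cols, label_col="arbitrage", test_size=0.3):
    """Hypothetical helper: refuse single-class data, then split with stratification."""
    counts = df[label_col].value_counts()
    if len(counts) < 2:
        raise ValueError(f"Only one class present in '{label_col}': {counts.to_dict()}")
    # stratify keeps the label proportions similar in train and test
    return train_test_split(
        df[feature_cols], df[label_col],
        test_size=test_size, random_state=42, stratify=df[label_col],
    )


# Invented toy data: three wide spreads, seven narrow ones
toy = pd.DataFrame({
    "spread": [0.2, 0.3, 6.0, 0.1, 5.5, 0.4, 0.2, 7.0, 0.3, 0.5],
    "close": [200.0] * 10,
})
toy["arbitrage"] = (toy["spread"] / toy["close"] > 0.02).astype(int)

X_train, X_test, y_train, y_test = split_with_label_check(toy, ["spread", "close"])
print(y_train.value_counts().to_dict(), y_test.value_counts().to_dict())
```

Raising an error on single-class data is a design choice: it surfaces the problem before a model is trained on labels it cannot learn anything from.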


5. Results & Visualization


Merged Data Sample

After running the script, we get a merged DataFrame that shows columns for contract, exp_date, strike, bid, ask, volatility, delta, gamma, and the underlying’s close price. We also see our engineered features (spread, midpoint, relative_spread) and the final arbitrage label:



Model Accuracy & Feature Importances

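When we trained the model on this particular dataset, we observed a very high accuracy score (e.g., 1.00) on our test split. However, in this example all the option records ended up labeled as non-arbitrage (0), so the model effectively learned the trivial solution of always predicting 0. Note also that the label is computed directly from spread and close, both of which are model inputs, so even with both classes present the model would mostly be recovering the 2% rule rather than discovering mispricing. In a more varied dataset with labels based on genuinely mispriced contracts, we would expect a more informative accuracy measure.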
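
One way to put an accuracy figure in context is to compare it with the accuracy of always predicting the most frequent label. The following is a minimal sketch with made-up label series; majority_baseline_accuracy is a hypothetical helper name.

```python
import pandas as pd


def majority_baseline_accuracy(y):
    """Hypothetical helper: accuracy of always predicting the most frequent label."""
    # value_counts sorts by frequency, so the first entry is the majority share
    return float(y.value_counts(normalize=True).iloc[0])


y_all_zero = pd.Series([0] * 20)          # every record labeled non-arbitrage
y_mixed = pd.Series([0] * 17 + [1] * 3)   # a few flagged records

print(majority_baseline_accuracy(y_all_zero))
print(majority_baseline_accuracy(y_mixed))
```

When every label is 0, the baseline is already 1.0, so a model accuracy of 1.00 carries no information. A model is only interesting if it clearly beats this baseline on a dataset that contains both classes.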




  • The feature importances reveal that spread is the most critical predictor in this baseline approach, followed by ask, bid, and midpoint, while close contributed very little — likely because it was constant across the merged sample in our demonstration data.


(Note: The exact bar chart in your environment may vary slightly, but the principle remains the same.)


6. Conclusion & Next Steps

By bringing together EODHD's US Stock Options Data and End-of-Day Historical Stock Market Data, we've demonstrated a straightforward way to flag potential mispricings in the options market. Our approach involved fetching detailed option quotes (bid, ask, implied volatility, Delta, and Gamma), then merging them with the stock's closing price. From there, we created simple features, such as the bid-ask spread and the spread relative to the stock's price (relative_spread), and trained a Random Forest model.


Although our small dataset resulted in a somewhat trivial 100% accuracy, the process shows how easy it is to set up a pipeline that systematically flags potentially mispriced options.


Key Takeaways


  • Data Integration: Combining options data with stock prices provides a more complete view of market conditions.


  • Feature Engineering: Even simple calculations like spread and relative_spread can highlight early signals of mispricing.


  • Modeling: A Random Forest classifier is a good starting point, giving insights into which features matter most for detecting potential arbitrage.


Potential Enhancements


  • Refining the Target Label: Instead of a fixed 2% threshold, you could use more advanced metrics (like put-call parity or volatility skew) or track real-time price changes.


  • Expanding Features: Consider adding more “Greeks” (e.g., Theta, Vega), market sentiment data, or time-based trends.


  • Time-Series Analysis: Look at each contract’s price over multiple dates to identify persistent mispricing patterns.


  • Hyperparameter Tuning: Use methods like grid search or Bayesian optimization to boost model accuracy.
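
To illustrate the first enhancement, here is a minimal sketch of a put-call parity check that could replace the fixed spread threshold as a labeling signal. It assumes European-style options on a stock that pays no dividends, for which C - P = S - K * exp(-r * T). AAPL options are American-style and the stock pays dividends, so the gap is only a rough screen there. The function name parity_gap and all input values are made up.

```python
import math


def parity_gap(call_mid, put_mid, spot, strike, rate, years_to_expiry):
    """Hypothetical helper: deviation from European put-call parity.

    Returns (C - P) - (S - K * exp(-r * T)); zero when parity holds exactly.
    Assumes no dividends and the same strike and expiry for both options.
    """
    discounted_strike = strike * math.exp(-rate * years_to_expiry)
    return (call_mid - put_mid) - (spot - discounted_strike)


# Invented inputs: at-the-money pair, 5% rate, six months to expiry
spot, strike, rate, t = 100.0, 100.0, 0.05, 0.5
put_mid = 3.0
fair_call = put_mid + (spot - strike * math.exp(-rate * t))  # parity-consistent call

print(parity_gap(fair_call, put_mid, spot, strike, rate, t))        # close to 0
print(parity_gap(fair_call + 0.5, put_mid, spot, strike, rate, t))  # close to 0.5
```

A gap is only tradable if it exceeds the combined bid-ask spreads and transaction costs, so in practice the comparison would use bid and ask quotes rather than midpoints.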


In essence, machine learning can help traders spot arbitrage opportunities more efficiently by scanning the market faster than any individual can. With EODHD’s rich data coverage and versatile endpoints, anyone can build a similar pipeline to automate the hunt for mispriced contracts — and potentially level up their options trading strategy.


© 2023 by InsightBig. Powered and secured by Wix
