Comparing 3 LLMs for Generating Profitable Trading Strategies
Nikhil Adithyan · Oct 9 · 9 min read
A guide to choosing the right LLM for the right purpose

Introduction
LLMs have been used to write code, summarize earnings calls, and even assist with debugging. But can they build actual trading strategies?
That’s the question we’re exploring in this article. We’ll feed structured financial data into three leading LLMs:
ChatGPT
Gemini
Perplexity
Each model will be asked to generate a trading strategy based on the same input.
The input includes historical stock prices, technical indicators, and core fundamentals. All of this is extracted from EODHD’s API endpoints to ensure consistency and accuracy.
We’re not looking for vague advice like “buy the dip.” Each model must provide specific entry and exit logic that can be independently backtested.
Once the strategies are in place, we’ll run backtests, compare their results, and evaluate each model on multiple dimensions: returns, interpretability, and originality.
This is not a general showcase of LLMs. It’s a focused experiment on whether they can reason over financial data and suggest trading strategies that hold up under scrutiny.
The Setup
To compare how well each LLM generates trading strategies, we first need to give them the same foundation: consistent, structured data. The input must be rich enough to support actual analysis but simple enough to be interpretable by a language model.
We’ll extract three datasets for a single stock (AAPL): historical prices, technical indicators, and core fundamentals. These represent the three pillars of most trading strategies: price trends, technical momentum, and financial health.
Each dataset is saved into a CSV or JSON file. These files are then used as input to the LLMs in the next stage.
Extracting and Saving Financial Data
We use EODHD’s API to pull three distinct types of data for Apple (AAPL):
import requests
import pandas as pd
import json
api_key = 'YOUR EODHD API KEY'
from_date = '2020-01-01'
to_date = '2025-06-01'
# 1. Historical price data
url_price = f'https://eodhd.com/api/eod/AAPL.US?api_token={api_key}&fmt=json&from={from_date}&to={to_date}'
price_df = pd.DataFrame(requests.get(url_price).json())
price_df.to_csv('aapl_price.csv', index=False)
# 2. Fundamentals
fundamentals = requests.get(f'https://eodhd.com/api/fundamentals/AAPL.US?api_token={api_key}&fmt=json').json()
with open('aapl_fundamentals.json', 'w') as f:
    json.dump(fundamentals, f, indent=2)
# 3. Technical Indicators (order=d returns newest-first, so each frame is reversed to chronological order)
# 3.1 RSI
rsi_url = f'https://eodhd.com/api/technical/AAPL.US?api_token={api_key}&order=d&function=rsi&period=14&from={from_date}&to={to_date}&fmt=json'
rsi_df = pd.DataFrame(requests.get(rsi_url).json()).iloc[::-1]
rsi_df.to_csv('aapl_rsi.csv', index=False)
# 3.2 SMA
sma_url = f'https://eodhd.com/api/technical/AAPL.US?api_token={api_key}&order=d&function=sma&period=50&from={from_date}&to={to_date}&fmt=json'
sma_df = pd.DataFrame(requests.get(sma_url).json()).iloc[::-1]
sma_df.to_csv('aapl_sma.csv', index=False)
# 3.3 MACD
macd_url = f'https://eodhd.com/api/technical/AAPL.US?api_token={api_key}&order=d&function=macd&from={from_date}&to={to_date}&fmt=json'
macd_df = pd.DataFrame(requests.get(macd_url).json()).iloc[::-1]
macd_df.to_csv('aapl_macd.csv', index=False)
The aapl_price.csv file contains daily OHLCV data. This gives the LLMs a timeline of how the stock moved.
The aapl_fundamentals.json file holds key financial data like earnings, margins, cash flow, and more. This helps models reason about the business behind the price.
The three technical datasets (RSI, SMA, MACD) give signals used in many rule-based strategies. They are saved as separate CSV files and can be merged if needed, as sketched below.
All three datasets will be summarized and passed to each LLM in a controlled prompt format. This ensures the comparison is fair, consistent, and repeatable.
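For readers who want to combine the files first, here is a minimal sketch of how the price and indicator CSVs could be merged before summarizing. It assumes the filenames written above and the 'date' column that each EODHD response includes; the exact indicator column names depend on the endpoint, so treat them as assumptions.
import pandas as pd
# Load the files saved in the extraction step (filenames assumed from the code above)
price = pd.read_csv('aapl_price.csv')
rsi = pd.read_csv('aapl_rsi.csv')
sma = pd.read_csv('aapl_sma.csv')
macd = pd.read_csv('aapl_macd.csv')
# Each response includes a 'date' column, so the frames can be joined on it;
# suffixes keep any overlapping column names distinguishable after the merge
merged = (
    price
    .merge(rsi, on='date', how='left', suffixes=('', '_rsi'))
    .merge(sma, on='date', how='left', suffixes=('', '_sma'))
    .merge(macd, on='date', how='left', suffixes=('', '_macd'))
)
# A compact tail like this is roughly what gets summarized for the prompts
print(merged.tail(10).to_string(index=False))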
The Prompting Framework
Each LLM receives the exact same set of data inputs — no memory, no prior chat history, and no outside references. Just a consistent, controlled environment designed to test reasoning, not recall.
But instead of one big prompt, the task is split into a series of instructions. This allows each model to build its logic step by step, closely simulating how a human might approach strategy design.
Here’s the actual prompting sequence used for each model:
Analyze the Data
The first prompt asks the LLM to review the provided historical prices (CSV), technical indicators (CSV), and fundamentals (JSON) to identify key patterns or takeaways.
Propose a Strategy
Once the model has completed its analysis, it is asked to generate a trading strategy. The instruction explicitly demands:
Clear entry and exit logic
Use of data in a reasoned way
No vague phrasing like “buy when it looks strong”
A structure that can be realistically backtested (see the sketch below)
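To make that last requirement concrete, here is a purely hypothetical illustration of the rule shape being asked for. The column names (pe_ratio, eps_growth_yoy) and thresholds are invented for illustration and are not any model's actual strategy.
import pandas as pd

def entry_signal(df: pd.DataFrame) -> pd.Series:
    # Hypothetical entry rule: PE below its rolling 1-year median and
    # double-digit YoY EPS growth. Columns are assumed, not from the article.
    return (df['pe_ratio'] < df['pe_ratio'].rolling(252).median()) & (df['eps_growth_yoy'] > 0.10)

MAX_HOLDING_DAYS = 60   # hypothetical time-based exit
STOP_LOSS_PCT = 0.08    # hypothetical exit if price drops 8% below entry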
Backtest the Strategy
The next step asks the model to backtest the proposed strategy. Although precise backtesting can’t be expected from the LLM, this prompt is mainly framed to test how the LLM interprets its own strategy when hypothetically applied to the dataset.
Interpret the Results
Finally, the LLM is asked to reflect on the outcome of its strategy. This includes drawing conclusions about performance, possible refinements, or limitations.
This iterative structure allows us to measure not just the quality of the strategy, but also how well the model reasons through the process.
The Initial Prompt
Before testing the reasoning power of each LLM, we start with a fixed initial prompt.
This first message acts as the foundation. It puts the model on the right path and defines what a “valid strategy” looks like. Every LLM receives the exact same input: the same data files, the same wording, the same rules.
This isn’t a one-shot instruction. The rest of the process follows a step-by-step dialogue. But the opening prompt is crucial for interpreting the data and crafting the trading strategy.
Here’s the exact message sent to all models:
You are given three datasets for the stock AAPL.US:
- Historical daily price data (2020–2025)
- A few basic technical indicators (RSI, MACD, SMA)
- A detailed set of fundamental metrics (valuation, profitability, cash flow, etc.)
These are provided only for additional context.
Your task is to analyze the price and fundamentals to create a trading strategy based on insights such as:
- Valuation shifts (e.g., PE compression)
- Growth acceleration or deceleration
- Financial stress or momentum trends
- Price reactions to earnings or revenue changes
The strategy should not be based directly on RSI, MACD, or SMA signals. Use them only if they add supporting insight.
Your strategy must have:
- Clear entry and exit logic
- Defined time frame (swing, positional, long-term)
- A short rationale based on the data
Avoid vague advice. The strategy should be testable using only the given data.
This prompt does a few important things:
It clarifies that fundamentals and price behavior are the core inputs, not technical indicators.
It demands specificity. No vague phrases, no abstract trends. The output must be concrete, structured, and testable.
It leaves room for creativity, allowing the models to interpret the data in their own style.
From here, the rest of the conversation flows as outlined in the prompting framework: the LLM analyzes the data, proposes a strategy, backtests it, and interprets the results.
LLM #1: Google Gemini
Gemini gave me a very elaborate response, and it was interesting to watch it show its thinking along the way. It produced three strategies in one go, each a revision of the previous one. This was the final strategy proposed by Gemini:
I personally felt the strategy looked pretty good, but it had a problem: it required data beyond what we already had. So I asked it to propose a strategy that could be implemented using the available data, and this is what it gave:
At a glance, the strategy looked a bit complex, with multiple conditions to be satisfied for both entry and exit, but I decided to go ahead with a backtest since all the required data was readily available. This was my prompt for backtesting the strategy:
Backtest this strategy and calculate the following metrics:
- Cumulative Return
- Annualized Sharpe Ratio (risk-free rate = 0)
- Max Drawdown
- Win/Loss Ratio
- Average Trade Duration (in days)
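Because the model can only simulate a backtest in text, it helps to know how these five metrics could be verified independently. Here's a minimal sketch of the standard calculations on a hypothetical daily strategy-return series and trade log; the inputs (strategy_returns, plus a trades table with 'pnl' and 'holding_days' columns) are illustrative stand-ins, not part of the article's pipeline.
import numpy as np
import pandas as pd

def evaluate(strategy_returns: pd.Series, trades: pd.DataFrame) -> dict:
    # strategy_returns: daily returns of the strategy (0 when flat)
    # trades: one row per closed trade, with 'pnl' and 'holding_days' columns
    equity = (1 + strategy_returns).cumprod()

    cumulative_return = equity.iloc[-1] - 1
    # Annualized Sharpe ratio with risk-free rate = 0, assuming 252 trading days
    sharpe = np.sqrt(252) * strategy_returns.mean() / strategy_returns.std()
    # Max drawdown: worst peak-to-trough decline of the equity curve
    max_drawdown = (equity / equity.cummax() - 1).min()

    wins = (trades['pnl'] > 0).sum()
    losses = (trades['pnl'] <= 0).sum()
    win_loss_ratio = wins / losses if losses else float('inf')
    avg_duration = trades['holding_days'].mean()

    return {
        'Cumulative Return': cumulative_return,
        'Sharpe Ratio': sharpe,
        'Max Drawdown': max_drawdown,
        'Win/Loss Ratio': win_loss_ratio,
        'Avg Trade Duration (days)': avg_duration,
    }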
These were the backtesting results given by Gemini:

The metrics were pretty decent, but one of the objectives of this article is to explore how the model interprets the strategy’s performance. I used this prompt to get Gemini’s take on the strategy’s metrics:
Give me a brief and concise interpretation of the strategy's backtesting results and provide a critical analysis
This was Gemini’s response:

I think it's fair to say that Gemini did a good job of interpreting the metrics and providing a critical analysis. Although it failed to mention any immediate next steps, the response is still a very good starting point for understanding and tuning the strategy.
LLM #2: ChatGPT
Now let's see how the most anticipated LLM of the three, ChatGPT, performs.
This is what it proposed:
There wasn't anything wrong with this strategy, but I felt the entry conditions were very strict and doubted it would generate a decent number of trades.
I still went ahead with the backtest as an experiment, and the response was what I had predicted:

As I suspected, the strategy's strict rules turned out to be the main reason no trades were generated. Although the LLM offered options for fine-tuning, none of them convinced me to stick with this strategy.
So I asked for a revised strategy with less strict trade logic than the previous one, using this prompt:
Give me a revised strategy that doesn't have very strict entry and exit conditions
This was the revised strategy proposed by ChatGPT:

This strategy definitely has more relaxed rules than the previous one, yet I wasn't satisfied with it because something seemed off. I didn't ask for another revision, to avoid going around in circles. So I asked the model to backtest the strategy, and these were the results it produced:

At a glance, the strategy performed very poorly. The first metric, cumulative return, is alone enough to establish its underwhelming performance.
One aspect that still puzzles me is the primary driver of this poor performance. It could be one of two things: the model's inability to backtest the strategy, or a flaw in the strategy itself. Either way, ChatGPT proved "not so effective" at generating a viable trading strategy.
What surprised me the most was its take on the strategy's performance:

The second point listed as an insight effectively calls the entire strategy into question, yet the model placed no emphasis on it.
LLM #3: Perplexity
The last model on our list is Perplexity. I'm personally not a huge fan of Perplexity's UI/UX, so I was pretty biased from the start. However, the results proved me wrong.
This was the proposed strategy:

The proposed strategy looked clean and had no notable flaws. By now, you may have noticed that the strategies proposed by the different LLMs are quite similar, with repeated reliance on the PE ratio and quarterly earnings growth. But that's a topic for another day.
These were the backtesting results generated by Perplexity:

At a glance, the numbers look solid. The strategy delivered a cumulative return of nearly 90%, with a Sharpe ratio of 1.87, and didn’t take a single loss across all three trades. Even the drawdown is decently contained, sitting at just -13.1%. With an average trade duration of 67 days, the strategy aligns well with the typical swing/positional timeframe.
Just like with the other models, I wanted to see how Perplexity interprets its own results. So I asked it for a brief analysis of the performance and a critical review of the strategy. Here’s what it said:
The first part of the response was a performance summary, and honestly, it hit the key points. Strong returns, no losses, controlled risk. It neatly checked all the boxes without over-explaining.
Then came the critical analysis section, and I have to say, I was impressed. The response was structured and self-aware. It mentioned both the strengths and limitations without sugarcoating anything.
The highlight was this line:
“The strategy demonstrates strong historical efficacy, but its limited number of trades and reliance on specific fundamental triggers mean further testing and monitoring are warranted before full-scale deployment.”
That's exactly what I was hoping to see: not just cheerleading the metrics, but acknowledging real-world caveats. In fact, among all three models, Perplexity's post-analysis felt the most level-headed and realistic. It didn't just celebrate the outcome; it also questioned its robustness.
For a model I had low expectations of, this was a surprisingly well-rounded finish.
Performance Comparison
With all three LLMs having completed their tasks, from analyzing the data to proposing and backtesting a strategy, it’s time to put their outputs side by side.
This comparison isn’t just about which model had the highest returns. It’s about understanding the tradeoffs: interpretability, logic quality, use of data, and the model’s ability to self-reflect on the results.
Here’s a quick look at the backtest metrics:

At a glance, Gemini and Perplexity tie at the top with identical metrics across the board. This suggests the underlying logic of their strategies was likely similar, even if phrased differently. ChatGPT, on the other hand, struggled both in strategy design and interpretation.
But numbers alone don’t tell the whole story.
Gemini stood out in its reasoning. It gave layered responses, progressively improved its logic, and handled feedback well. However, it required a second prompt to adapt the strategy to the available data, which slightly dents its reliability on the first try.
ChatGPT failed to deliver a usable strategy in the first attempt. The real issue was the disconnect between its backtest results and its interpretation. Though it sort of acknowledged weaknesses, it didn’t act on them or suggest any practical changes.
Perplexity was the surprise winner. It produced a clean, backtestable strategy right away, delivered strong results, and offered the most balanced post-analysis. It didn’t oversell, and it didn’t dodge the limitations either.
Overall, Gemini and Perplexity share the top spot in terms of performance. But if you’re looking for clarity and readiness on the first attempt, Perplexity edges ahead.
Final Thoughts
If this experiment proved anything, it’s that LLMs can build trading strategies, just not all equally well.
Gemini showed solid reasoning but needed nudging. ChatGPT had potential but lacked consistency. Perplexity, surprisingly, nailed both logic and humility in one go.
The key enabler here wasn’t just the models. It was the structured, reliable data from EODHD that made the whole process viable. Without consistent inputs, even the smartest models wouldn’t have stood a chance.
Are LLMs ready to replace quants? Not yet. But with the right data and the right framing, they’re definitely ready to join the workflow.