Predicting Stock Movements using News Data and ML

Nikhil Adithyan
Jul 28, 2025

A Python guide to combine news + AI for useful insights

Intro

News sentiment analysis is one of those things that feels exciting to talk about, but when you actually sit down to build something with it, it gets messy real fast. Most tutorials either stop at assigning a sentiment score or try to predict a stock’s movement using the headline alone, without checking if the stock actually moved in the first place.

This project came out of that frustration. I didn’t want to predict. I just wanted to see if the sentiment of a news article aligns with how the stock actually moved afterward, and then train a model to classify that relationship.

So this isn’t about forecasting future prices. It’s about labeling historical news based on what happened next and using that as ground truth. Then we build a simple text classifier to predict whether a news article is “positive,” “negative,” or “neutral” based on the stock’s real intraday movement after the news dropped.

That’s it. Nothing fancy. But everything is real and grounded in news and price data.

The Idea

The goal is simple:

Take a bunch of real news articles, look at how the mentioned stock moves in the hour after the news drops, and label each article based on that movement.

If the stock goes up meaningfully, it’s “positive.”
If it drops, it’s “negative.”
If it barely moves, it’s “neutral.”

That’s the only labeling logic. No manual interpretation, no vague sentiment scores. Just let the price do the talking.

To pull this off, we need two things:

A clean dataset of timestamped news articles linked to specific stocks. This will be extracted using Alpha News Stream’s (ANS) API endpoints.
Intraday price data around each news timestamp, ideally minute-level. For this, we can use the yfinance library to keep things simple.

Once we get this data in shape, the rest follows: cleaning, matching, calculating % change, and training a classifier.

After the labels are ready, we extract features from the news headlines using TF-IDF and train a multi-class classifier to see if it can learn to spot positive, negative, and neutral news on its own.

Python Implementation

Now that we have a clear understanding of the idea, let’s start to actually work on it. We’ll build an entire pipeline from extracting the required data to labeling it and building the ML model.

Importing Packages

The first and foremost step of any project is to import all the required packages at once, at least that’s what I believe. Importing packages as we go makes it really hard to keep track, leading to confusion.


import pandas as pd
import requests
from datetime import datetime, timedelta
import yfinance as yf
import time
import pytz

from scipy.sparse import hstack
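import sklearn
# import the submodules explicitly; a bare "import sklearn" does not guarantee they are loaded
import sklearn.linear_model
import sklearn.model_selection
import sklearn.preprocessing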
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix, f1_score

import matplotlib.pyplot as plt
import seaborn as sns

As you can see, no fancy packages but just the usual ones like pandas, yfinance and scikit-learn. Make sure to install all these packages before importing them into the environment.

Extracting News Data

For this project, I’m using Alpha News Stream as the source for news headlines. Their API is simple to use and gives access to 15,000 to 20,000 financial articles daily, complete with headlines and summaries.

This is the code to extract news data for roughly the last two weeks:


# 2. EXTRACTING NEWS DATA

API_KEY = "YOUR ANS API KEY"
BASE_URL = "https://api.alphanewsstream.com/v1/news"

def fetch_headlines(start_date, end_date, count=1000):
    headers = {
        "X-Api-Key": f"{API_KEY}"
    }
    params = {
        "start_date": start_date,
        "end_date": end_date,
        "count": count,
        "summary": "yes"
    }

    response = requests.get(BASE_URL, headers=headers, params=params)

    if response.status_code == 200:
        return response.json()
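    else:
        print(f"Error: {response.status_code} - {response.text}")
        # Return the same shape as a successful response so the ['articles'] lookup below still works
        return {"articles": []}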

#format_string = "%Y-%m-%d"
#today = datetime.strptime('2025-07-01', format_string).date()
today = datetime.today().date()

chunks = [
    (today - timedelta(days=i + 1), today - timedelta(days=i))
    for i in range(13, 0, -1)
]

all_articles = []

for start, end in chunks:
    print(f"Fetching from {start} to {end}")
    articles = fetch_headlines(start.isoformat(), end.isoformat())
    all_articles.extend(articles['articles'])
    time.sleep(1)

# Convert to DataFrame
df = pd.DataFrame(all_articles)
print(f"Total articles collected: {len(df)}")

In the above code, I’m first defining a function named fetch_headlines that hits the API with the right parameters and returns the results in JSON. Then, we’re creating 13 one-day windows, each running from one date to the next, and looping through them to collect the news in batches of up to 1,000 articles.
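To see exactly which windows that list comprehension produces, here is a small standalone sketch with a fixed reference date. The date and the helper name build_chunks are made up for illustration; the logic mirrors the comprehension above.

```python
from datetime import date, timedelta

def build_chunks(today, n_days=13):
    """Return (start, end) date pairs, oldest first, one pair per day."""
    return [
        (today - timedelta(days=i + 1), today - timedelta(days=i))
        for i in range(n_days, 0, -1)
    ]

chunks = build_chunks(date(2025, 7, 28))
print(len(chunks))   # number of API requests
print(chunks[0])     # oldest window
print(chunks[-1])    # most recent window
```

Consecutive windows share a boundary date. Whether the API treats end_date as inclusive isn’t covered here, so it may be worth checking the collected articles for duplicates (for example on the id column) before moving on.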

So we now have 13,000 articles in total, which is a solid amount to work with.

Before moving any further, let’s drop the null values from the created dataset:


df = df.dropna().reset_index(drop = True)
print(f'Dataframe length after dropping null values: {len(df)} rows')

Glad that we didn't lose any useful data.

Data Clean-Up

Now that the full dataset is in place, the next step is to clean it up a bit so it’s ready for further processing.


n_df = df.copy()

def first_symbol(tags):
    # tags is usually a dict holding a 'symbols' list; keep the first symbol, if any
    if isinstance(tags, dict):
        symbols = tags.get('symbols') or []
        return symbols[0] if symbols else None
    return None

n_df['symbol'] = n_df['tags'].apply(first_symbol)

n_df = n_df.dropna().reset_index(drop = True)
# build the timestamp from n_df (not df): after the reset, df's index no longer lines up with n_df's
n_df['timestamp'] = n_df['date'] + ' ' + n_df['time']
n_df = n_df.drop(['url','images','id'], axis = 1)

n_df.head()

First, I created a copy of the original dataframe just to be safe. Then I extracted the associated stock symbols from the tags column, which is a dictionary in most cases. Since a single article can have multiple symbols, I just picked the first one.

After that, I dropped all the rows with missing values and merged the date and time columns into a single timestamp field for later use. Finally, we’re removing some columns that are not relevant.

This is the final dataframe:

This gives us a much cleaner dataset to move forward with.
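Picking only the first symbol is a simplification. If you’d rather keep every symbol an article mentions, one possible alternative (not what this project does) is to create one row per symbol with pandas’ explode. The rows and tag values below are made up for illustration.

```python
import pandas as pd

# Hypothetical rows, made up for illustration
toy = pd.DataFrame({
    "headline": ["Chipmakers rally", "Retail sales dip", "Fed holds rates"],
    "tags": [
        {"symbols": ["NASDAQ:NVDA", "NASDAQ:AMD"]},
        {"symbols": ["NYSE:WMT"]},
        {"symbols": []},
    ],
})

# Pull out the list of symbols (empty list if the tags are missing or malformed)
toy["symbol"] = toy["tags"].apply(
    lambda t: t.get("symbols", []) if isinstance(t, dict) else []
)

# explode() emits one row per list element; empty lists become NaN and are dropped
per_symbol = (
    toy.explode("symbol")
       .dropna(subset=["symbol"])
       .drop(columns="tags")
       .reset_index(drop=True)
)
print(per_symbol)
```

The trade-off is that the same headline can then show up several times with different labels, which makes the classification task noisier.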

Calculating % Change

The idea behind this step is to:

Extract 1-minute intraday data for the stock mentioned in the news
Calculate the percentage change in stock price

Here’s the code to do the same:


eastern = pytz.timezone('America/New_York')

def fetch_price_change(symbol, timestamp_local):
    # Treat the article timestamp as Eastern Time
    t0 = pd.to_datetime(timestamp_local).tz_localize(eastern)
    # After the 4 PM close, use the next day's 9:30 AM open as the reference point
    if t0.hour >= 16:
        t0 = (t0 + pd.Timedelta(days=1)).replace(hour=9, minute=30, second=0)
    
    # Request one full calendar day of 1-minute bars
    day = t0.date()
    start = pd.Timestamp(day).tz_localize(eastern)
    end = start + pd.Timedelta(days=1)

    try:
        prices = yf.download(
            symbol.split(':')[1],  # keep the part after the colon
            start=start.tz_convert(None).strftime('%Y-%m-%d'),
            end=end.tz_convert(None).strftime('%Y-%m-%d'),
            interval='1m',
            progress=False
        )

        close = prices['Close']
        # Some yfinance versions return one 'Close' column per ticker; keep the first
        if isinstance(close, pd.DataFrame):
            close = close.iloc[:, 0]
        # Convert the bar timestamps to Eastern
        close = close.tz_convert(eastern)

        before_t0 = close[close.index < t0]
        after_t1 = close[close.index >= t0 + pd.Timedelta(minutes=60)]

        if len(before_t0) == 0:
            # No bar before t0 (e.g. news before the open): first bar vs. last bar of the day
            price_t0 = close.iloc[0]
            price_t1 = after_t1.iloc[-1]
        else:
            # Last bar before the news vs. first bar at least 60 minutes later
            price_t0 = before_t0.iloc[-1]
            price_t1 = after_t1.iloc[0]

        pct_change = (price_t1 - price_t0) / price_t0 * 100
        return float(price_t0), float(price_t1), float(pct_change)
     
    except Exception:
        # Unknown ticker, no bars for that day (weekend, holiday), or too few bars
        return None, None, None

The above-created function uses yfinance to pull 1-minute data for the stock symbol covered in the news. I converted everything to Eastern Time to make sure that if a headline comes after 4 PM, the reference point is moved to the next trading day’s open.
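One gap worth knowing about: the shift above moves to the next calendar day, which can be a Saturday or Sunday with no bars at all, so those articles end up being dropped. Below is a minimal sketch of a stricter rule that rolls forward to the next weekday’s open. The function name reference_time is hypothetical, it also snaps pre-market news to the same day’s open (which the function above does not do), and it ignores exchange holidays.

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")
MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)

def reference_time(ts):
    """Map a timezone-aware timestamp to the moment we start measuring from.

    During regular hours on a weekday the timestamp is returned unchanged.
    Otherwise it rolls forward to the next 9:30 AM open on a weekday.
    Exchange holidays are NOT handled in this sketch.
    """
    ts = ts.astimezone(EASTERN)
    is_weekday = ts.weekday() < 5
    if is_weekday and MARKET_OPEN <= ts.time() < MARKET_CLOSE:
        return ts
    if is_weekday and ts.time() < MARKET_OPEN:
        # Before the open on a weekday: use the same day's open
        return datetime.combine(ts.date(), MARKET_OPEN, tzinfo=EASTERN)
    # After the close or on a weekend: walk forward to the next weekday
    day = ts.date() + timedelta(days=1)
    while day.weekday() >= 5:
        day += timedelta(days=1)
    return datetime.combine(day, MARKET_OPEN, tzinfo=EASTERN)

# Friday 5 PM rolls to Monday's open; a mid-session timestamp is left alone
print(reference_time(datetime(2025, 7, 25, 17, 0, tzinfo=EASTERN)))
print(reference_time(datetime(2025, 7, 24, 11, 15, tzinfo=EASTERN)))
```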

I then grab the last price before the news hits and the first available price at least one hour later, and compute the percent change. This is important because, although we download the full day of 1-minute bars, we only measure the move across that one-hour window to gauge the momentum change accurately.
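To make the window logic concrete without hitting yfinance, here is a toy version on synthetic 1-minute closes. The prices and timestamps are made up; only the slicing mirrors the function above.

```python
import pandas as pd

# Synthetic 1-minute closes from 09:30 to 11:29, rising by 0.01 per minute (made-up data)
idx = pd.date_range("2025-07-24 09:30", periods=120, freq="min", tz="America/New_York")
close = pd.Series([100 + 0.01 * i for i in range(120)], index=idx)

# Pretend the article was published at 10:00 AM Eastern
t0 = pd.Timestamp("2025-07-24 10:00", tz="America/New_York")

# Last bar strictly before the news (the 09:59 bar)
price_t0 = close[close.index < t0].iloc[-1]
# First bar at or after t0 + 60 minutes (the 11:00 bar)
price_t1 = close[close.index >= t0 + pd.Timedelta(minutes=60)].iloc[0]

pct_change = (price_t1 - price_t0) / price_t0 * 100
print(price_t0, price_t1, round(pct_change, 3))
```

Note that the measurement actually runs from the 09:59 bar to the 11:00 bar, so the window is one bar wider than a strict 60 minutes.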

Now we’ll apply this function to every article in the dataset:


results = []

for _, row in n_df.iterrows():
    p0, p1, pct = fetch_price_change(row['symbol'], row['timestamp'])
    results.append((p0, p1, pct))

cols = ['price_t0', 'price_t1', 'pct_change']
n_df[cols] = pd.DataFrame(results, columns=cols, index=n_df.index)
n_df = n_df.dropna().reset_index(drop = True)
n_df.head()

This part takes a while to run, depending on the number of headlines, since each row is making an individual API call to yfinance.
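Many articles mention the same ticker on the same day, so a lot of those downloads are repeats. One way to cut them down, sketched below, is to cache the result per (ticker, day). Here download_day is a hypothetical stand-in for the yf.download call, so the example runs without any network access.

```python
from functools import lru_cache

calls = []  # records every simulated network request

def download_day(ticker, day):
    """Hypothetical stand-in for the yf.download(...) call in fetch_price_change."""
    calls.append((ticker, day))
    return {"ticker": ticker, "day": day}  # placeholder for a DataFrame of bars

@lru_cache(maxsize=None)
def cached_day(ticker, day):
    # Arguments must be hashable, so pass the day as a string or a datetime.date
    return download_day(ticker, day)

# Three articles, two of them about the same ticker on the same day
for ticker, day in [("AAPL", "2025-07-24"), ("AAPL", "2025-07-24"), ("MSFT", "2025-07-24")]:
    cached_day(ticker, day)

print(len(calls))  # requests actually made
```

If the cached value is a DataFrame, avoid modifying it in place, since every caller receives the same object.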

Here’s what the updated dataframe looks like after calculating the intraday movement:

And finally, let’s save the dataset to a CSV to avoid re-running this process every single time:


n_df.to_csv('p_change.csv')

Labeling

We’ve got the price change. Now it’s time to assign a label to each article so that we can later train a model to classify headlines based on their impact.

Here’s the basic idea I’m using:

If the price change is greater than or equal to +1%, I label it as 1 (positive impact).
If it’s less than or equal to -1%, I label it as -1 (negative impact).
And if it’s in between, I label it as 0 (neutral).

Here’s the code to apply this logic:


df = pd.read_csv('p_change.csv').drop('Unnamed: 0', axis = 1)

def label_pct_change(pct):
    if pct >= 1:
        return 1
    elif pct <= -1:
        return -1
    else:
        return 0

df['label'] = df['pct_change'].apply(label_pct_change)
df.head()

The label_pct_change function is pretty self-explanatory. It’s just a simple threshold-based labeling rule. I chose ±1% as the cutoff to make sure we filter out tiny fluctuations that don’t really reflect any meaningful market reaction.
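The same rule can also be written in vectorized form, which makes it easy to check how the classes are distributed for a given cutoff. This is a small sketch on made-up percentage changes, not the project’s data.

```python
import numpy as np
import pandas as pd

# Made-up percentage changes for illustration
pct = pd.Series([2.3, 1.0, 0.4, -0.2, -1.0, -3.1])

# Same thresholds as label_pct_change: >= +1 is positive, <= -1 is negative, else neutral
labels = pd.Series(
    np.select([(pct >= 1).to_numpy(), (pct <= -1).to_numpy()], [1, -1], default=0),
    index=pct.index,
)

print(labels.tolist())                      # [1, 1, 0, 0, -1, -1]
print(labels.value_counts().sort_index())   # class counts for this cutoff
```

Looking at the class counts before training is a quick way to see how imbalanced a given threshold makes the dataset.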

Here’s what the labeled data looks like:

Feature Engineering

This step is where we try to capture every bit of useful signal from the raw data, such as time, source, text, and symbol, and turn it into something an ML model can understand.


# date-time
df['datetime'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek

# symbol, source encoding
symbol_dummies = pd.get_dummies(df['symbol'], prefix='sym')
source_dummies = pd.get_dummies(df['source'], prefix='src')

# text vectorization
df['text'] = df['headline'].fillna('') + ' ' + df['summary'].fillna('')
vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1,2), stop_words='english')
X_text = vectorizer.fit_transform(df['text'])

# scale nums
num_features = df[['hour', 'day_of_week']]
scaler = sklearn.preprocessing.StandardScaler()
X_num = scaler.fit_transform(num_features)

# combine everything
X_final = hstack([X_text, X_num, symbol_dummies.values, source_dummies.values])
y_final = df['label']

Here’s what we’re doing:

Extracted the hour of the day and the day of the week from the timestamp.
One-hot encoded the symbol and source fields to help the model learn which stocks or sources typically move more.
Combined the headline and summary into a new text column and used TF-IDF to vectorize it with unigrams and bigrams.
Scaled the numeric features using StandardScaler.
Finally, stacked everything horizontally, which includes the vectorized text, the scaled numeric features, and the one-hot encodings, to build the final feature matrix X_final.

With that, all the raw data has now been converted into clean, usable input for training.
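One caveat: the vectorizer and scaler above are fit on the full dataset before any train/test split, so vocabulary and scaling statistics from the test rows leak into the features. A possible alternative, sketched here on a tiny made-up dataset, is to wrap the same steps in a scikit-learn Pipeline so they are fit on the training rows only. The toy rows and the wire_a / wire_b source names are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up dataset using the same column names as the article
toy = pd.DataFrame({
    "text": ["shares surge on earnings beat", "stock plunges after guidance cut",
             "company schedules annual meeting", "profit jumps as sales surge",
             "shares plunge on lawsuit news", "board announces meeting date"] * 5,
    "hour": [9, 10, 14, 11, 15, 13] * 5,
    "day_of_week": [0, 1, 2, 3, 4, 0] * 5,
    "symbol": ["NASDAQ:AAPL", "NYSE:GE", "NYSE:IBM"] * 10,
    "source": ["wire_a", "wire_b"] * 15,
    "label": [1, -1, 0, 1, -1, 0] * 5,
})

features = ColumnTransformer([
    # A plain string (not a list) hands the vectorizer a 1-D column of text
    ("text", TfidfVectorizer(ngram_range=(1, 2), stop_words="english"), "text"),
    ("num", StandardScaler(), ["hour", "day_of_week"]),
    # handle_unknown="ignore" copes with symbols or sources unseen during training
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["symbol", "source"]),
])

clf = Pipeline([
    ("features", features),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns="label"), toy["label"],
    test_size=0.2, stratify=toy["label"], random_state=42,
)

clf.fit(X_train, y_train)          # TF-IDF, scaler and encoder see training rows only
print(clf.score(X_test, y_test))   # accuracy on the held-out rows
```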

Model Training & Evaluation

We’ve now come to the last and most important step, which is training the ML model to classify the sentiment based on the news data.

I’m going with a basic Logistic Regression model here. It’s fast, works surprisingly well for text-based classification, and is a good first baseline. Also, it supports multiclass classification with the multinomial setting, which is what we need for our -1 / 0 / 1 label setup.


# train/test split
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X_final, y_final, 
    test_size=0.2, 
    stratify=y_final, 
    random_state=42
)

# model training
model = sklearn.linear_model.LogisticRegression(
    class_weight='balanced', 
    max_iter=1000,
    # multi_class is deprecated in recent scikit-learn; saga already fits a multinomial model for 3+ classes
    solver='saga',
    n_jobs=-1
)
model.fit(X_train, y_train)

# model eval
y_pred = model.predict(X_test)

print("Classification Report:")
print(classification_report(y_test, y_pred, digits=3))

print("Macro F1 Score:", f1_score(y_test, y_pred, average='macro'))

cm = confusion_matrix(y_test, y_pred, labels=[-1, 0, 1])
sns.heatmap(cm, annot=True, fmt='d', xticklabels=['-1', '0', '1'], yticklabels=['-1', '0', '1'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

And here’s how the model performed:

(i) Classification report

(ii) Confusion matrix

Not bad for a first run.

The model’s doing decently on the neutral (0) class with an F1 of 0.701, which makes sense because that class had the most support. The positive class (1) is where it struggles the most, with an F1 of just 0.577, indicating it’s still confusing positive headlines with neutrals quite a bit.

Overall macro F1 score is around 0.627, which isn’t amazing but definitely usable for further experiments or downstream tasks. There’s a lot we could do to improve this, from better labeling to trying out transformer-based models, but as a baseline ML pipeline, this provides a good benchmark.
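A macro F1 figure is easier to judge next to a trivial baseline. The sketch below scores two scikit-learn DummyClassifier strategies on made-up, imbalanced labels; swapping in the real y_train and y_test would give the number to beat for this dataset. The class proportions here are assumptions, not the project’s actual distribution.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)

# Made-up, imbalanced labels standing in for y_train / y_test
y_train = rng.choice([-1, 0, 1], size=800, p=[0.25, 0.5, 0.25])
y_test = rng.choice([-1, 0, 1], size=200, p=[0.25, 0.5, 0.25])
X_train = np.zeros((len(y_train), 1))  # dummy models ignore the features
X_test = np.zeros((len(y_test), 1))

scores = {}
for strategy in ["most_frequent", "stratified"]:
    baseline = DummyClassifier(strategy=strategy, random_state=42)
    baseline.fit(X_train, y_train)
    scores[strategy] = f1_score(
        y_test, baseline.predict(X_test), average="macro", zero_division=0
    )
    print(f"{strategy}: macro F1 = {scores[strategy]:.3f}")
```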

Thoughts on Performance Improvement

The current setup is basic by design, just to get a minimum working pipeline going. But there’s a lot of low-hanging fruit that could push this F1 score higher.

The biggest limitation right now is the model itself. Logistic Regression isn’t built to capture complex semantic patterns in language. It’s fast and interpretable, but also shallow. A tree-based model like XGBoost might squeeze out a bit more performance with the same features, but if I really want to move the needle, I need to bring in deep learning.

That means either:

A transformer model like BERT, fine-tuned on financial news headlines
Or at least using BERT embeddings as input features for a better classifier

Another major factor is the quality of the labels. Right now, I’m assuming that the symbol tags are accurate and that each one-hour price move reflects the article itself, but there’s likely noise in both. Especially in a multi-class setup like this, even 10–15% bad labels can mess with recall and skew the learning.

Then there’s class imbalance. While I did set class_weight='balanced' in the Logistic Regression, more targeted sampling strategies like upsampling the minority classes or experimenting with focal loss could help the model pay more attention to those underrepresented sentiment types.
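As a rough idea of what upsampling could look like, here is a minimal sketch that tops up each minority class with randomly repeated rows until it matches the largest class. The data is made up, and in practice this should be applied to the training split only, after the train/test split.

```python
import pandas as pd

# Made-up imbalanced training set
train = pd.DataFrame({
    "text": [f"headline {i}" for i in range(10)],
    "label": [0] * 6 + [1] * 3 + [-1] * 1,
})

target = train["label"].value_counts().max()  # size of the largest class

parts = []
for _, group in train.groupby("label"):
    # Keep every original row, then add random repeats up to the target size
    extra = group.sample(n=target - len(group), replace=True, random_state=42)
    parts.append(pd.concat([group, extra]))

balanced = pd.concat(parts, ignore_index=True)
print(balanced["label"].value_counts().sort_index())
```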

Lastly, we could bring in more context. This version only looks at the news headline and summary. But with access to full article text, metadata (e.g., sector, earnings flag), and maybe historical market behavior around similar headlines, the feature space becomes a lot richer.

In short, the current numbers aren’t the ceiling; they’re just the floor. And there’s plenty of room to climb.

Wrapping Up

This pipeline shows how real-time financial news can be converted into actionable labels and fed into a machine learning model with minimal overhead. Even with a lightweight setup, the classifier delivers a solid macro F1 score of 0.627.

What makes this more relevant is that the data comes directly from an API stream of real market headlines, the kind you’d actually rely on in a live system. Alpha News Stream’s founder, Frank Cioffi, tells me his feeds have been used by clients as the basis for sentiment indicators. And this project is a small-scale version of that same philosophy.

Plenty of room remains to improve performance: refining the labels, switching to transformer-based embeddings, and adding more context features.

But the base structure is here, and it works, thanks in part to having a consistent, high-volume news source to build on. With that being said, you’ve reached the end of the article. Hope you learned something new and useful.
