How Python + GraphDB Transforms ETF and Company Relationship Analysis
- Nikhil Adithyan
- Sep 7
- 9 min read
Uncovering hidden connections in financial data

A widely accepted fact, and by now almost a stereotype, is that data is king. When it comes to investing, it is clear why data is crucial in the quest to discover our strategy’s edge.
In this article, we will not discuss how to acquire data, as it is quite simple. At the start of your journey, you can easily browse the internet to find datasets that suit your needs. When these are not sufficient, you can obtain data from specialised providers like FinancialModelingPrep. So, let’s assume you already have the data. What’s next? You need to store it in a way that helps you get answers and visualise it to understand patterns and trends.
The most common ways to store the data are:
Relational databases are like Excel spreadsheets with multiple tables that can be interconnected. Their advantage lies in these links, but a disadvantage is the maintenance they require: tables and schemas must be updated whenever a new kind of information is added.
NoSQL databases are like storage for collections of JSON documents. With the right indexing, you can retrieve all documents that share certain attributes. The advantage is that adding new information is straightforward: you just insert it into the document, saving time compared to relational databases. However, a drawback is that relationships between data are less explicit, and it is easy to get lost in a maze of millions of documents.
And after that, we have the GraphDBs!
What is a GraphDB, and how can it be used in investments
GraphDBs are emerging as a solution to the challenges of both approaches. They store entities as nodes and relationships as edges, organising information and its connections clearly in a multi-layered structure. This makes both querying and visualising those connections efficient.
A simple example, like the one we will build later in this article, makes this easier to grasp: visualising the assets that ETFs hold and where they overlap. We will create ETF and COMPANY nodes and establish a relationship called HOLDS, pointing from the ETF to the COMPANY. Each node will carry some basic information: ETF nodes will store their market cap, while COMPANY nodes will store their market cap and sector. Additionally, on the HOLDS relationship itself, we will record the percentage of the ETF that is invested in each specific asset.
Before we go to the actual example, let me point out other cases where a GraphDB will be interesting:
As with the ETF holdings above, we can record regional or industry exposures, so we can identify concentration risk,
Upload regional sales so you can see how concentrated a company’s sales are in a specific region, or how dependent a region is on particular companies.
Neo4J
In this article, we are going to use one of the most common platforms for GraphDBs, which is Neo4J. The reasons for this choice are various:
It is one of the most mature graph databases
It has a free cloud solution that we will use for this article
It has a Python connector
It uses Cypher language to query the database
To follow along with this article, you simply need to start a free Neo4J AuraDB instance.
Gather the data
To upload meaningful data to Neo4J, we first need to gather and transform it into a format that will be easier to upload later. For this, we will use the FMP’s Stock Screener API.
Even though the idea of this exercise is to upload as many ETFs and their assets as possible, this article is meant to show you the way, so in practice we will limit the ETFs and their assets to a reasonable amount, which keeps the visualisation easy to read.
Let’s do the imports first:
import requests
import pandas as pd
import json
from typing import List, Dict, Any
from tqdm import tqdm
token = 'YOUR FMP TOKEN HERE'
Then, using the FMP screener, we will identify the ETFs with assets under management exceeding 50 billion USD. Next, we will retrieve sector exposure data from the ETF information API and filter for ETFs that have over 30% exposure in the Technology sector.
screener_url = 'https://financialmodelingprep.com/api/v3/stock-screener'
params = {
    'apikey': token,
    'country': 'US',
    'isEtf': True,
    'marketCapMoreThan': 50_000_000_000
}
resp = requests.get(screener_url, params=params).json()
df_etfs = pd.DataFrame(resp)
def get_etf_industry_exposure(symbol: str, token: str) -> dict:
    ret = {}
    try:
        url_info = f'https://financialmodelingprep.com/api/v4/etf-info?symbol={symbol}'
        params_local = {'apikey': token}
        r = requests.get(url_info, params=params_local)
        data = r.json()
        exposures = data[0]['sectorsList']
        for exp in exposures:
            ret[exp['industry']] = exp['exposure']
    except Exception as e:
        print(f'Error getting etf-information for {symbol}: {e}')
    return ret
all_industries = set()
etf_to_industry = {}
symbols = df_etfs['symbol'].dropna().astype(str).tolist()

for symb in symbols:
    exp = get_etf_industry_exposure(symb, token)
    etf_to_industry[symb] = exp
    all_industries.update(exp.keys())

# Add columns for each industry exposure, prefixed to avoid collisions
industry_cols = [f'Industry_{ind}' for ind in sorted(all_industries)]
for col in industry_cols:
    df_etfs[col] = 0.0

# Fill exposures
for idx, row in df_etfs.iterrows():
    symb = row['symbol']
    exp = etf_to_industry.get(symb, {})
    for ind, val in exp.items():
        col = f'Industry_{ind}'
        if col in df_etfs.columns:
            df_etfs.at[idx, col] = float(val)

df_etfs = df_etfs[df_etfs['Industry_Technology'] > 30]
The next step is to retrieve the assets underlying the ETF using the respective FMP API. To ensure relevance, we will only include assets that constitute more than 0.5% of the ETF’s portfolio.
holdings_all: List[dict] = []
etf_holder_base = 'https://financialmodelingprep.com/api/v3/etf-holder/'

for sym in df_etfs['symbol'].dropna().astype(str).tolist():
    url = f'{etf_holder_base}{sym}'
    data = requests.get(url, params={'apikey': token}).json()
    for item in data:
        rec = dict(item) if isinstance(item, dict) else {'raw': item}
        rec['etfSymbol'] = sym
        holdings_all.append(rec)

df_holdings = pd.DataFrame(holdings_all)
df_holdings = df_holdings[df_holdings['weightPercentage'] > 0.5]
The last step is to get the details of each asset from the FMP profile API, which returns the company profile.
assert 'df_holdings' in globals(), "df_holdings not found. Run the holdings cell first."

company_info_url = 'https://financialmodelingprep.com/api/v3/profile/{}'

unique_assets = (
    df_holdings.get('asset')
    .dropna()
    .astype(str)
    .str.strip()
    .unique()
    .tolist()
)
print(f'Unique assets: {len(unique_assets)}')

records = []
for sym in unique_assets:
    try:
        r = requests.get(company_info_url.format(sym), params={'apikey': token})
        data = r.json()
        if isinstance(data, list) and data:
            d = data[0]
        elif isinstance(data, dict):
            d = data
        else:
            d = {}
        # Normalize keys and handle both mktCap/marketCap variants
        rec = {
            'symbol': d.get('symbol', sym),
            'mktCap': d.get('mktCap', d.get('marketCap')),
            'companyName': d.get('companyName'),
            'sector': d.get('sector'),
        }
    except Exception:
        rec = {'symbol': sym, 'mktCap': None, 'companyName': None, 'sector': None}
    records.append(rec)

df_company_info = pd.DataFrame(records)
df_company_info = df_company_info.drop_duplicates(subset=['symbol']).reset_index(drop=True)
This way, we have all the data we need to upload to the Neo4J DB:
Dataframe df_etfs, which holds the ETFs
Dataframe df_holdings, which links each ETF to the stocks it holds
Dataframe df_company_info, which holds the details of the stocks
Before we move on to uploading the data to Neo4J, we need one final touch. We should:
Convert fields such as market cap and shares held to numeric types.
Replace special characters and spaces in the symbols, since the symbols will be used as the keys that tie the nodes and relationships together.
def sanitize_symbol(series: pd.Series) -> pd.Series:
    # Replace '/', '.', ' ', '\' and '-' with underscores so symbols are safe to use as keys
    return series.str.replace(r'[\\/. \-]', '_', regex=True)

df_etfs['symbol'] = sanitize_symbol(df_etfs['symbol'])
df_etfs['marketCap'] = pd.to_numeric(df_etfs['marketCap'], errors='coerce').fillna(0)

df_holdings['etfSymbol'] = sanitize_symbol(df_holdings['etfSymbol'])
df_holdings['asset'] = sanitize_symbol(df_holdings['asset'])
df_holdings['sharesNumber'] = pd.to_numeric(df_holdings['sharesNumber'], errors='coerce').fillna(0)
df_holdings['weightPercentage'] = pd.to_numeric(df_holdings['weightPercentage'], errors='coerce').fillna(0)

df_company_info['symbol'] = sanitize_symbol(df_company_info['symbol'])
df_company_info['mktCap'] = pd.to_numeric(df_company_info['mktCap'], errors='coerce').fillna(0)
Upload to Neo4J
Now we have all the data ready for upload. We should install the Python module first, which will handle the connection to the database for us.
pip install neo4j
Back in Python, we now connect to the database. You should already have the URI, instance name, and password from when you created the instance on Neo4J.
from neo4j import GraphDatabase

URI = "YOUR INSTANCE URL"
AUTH = ("neo4j", "YOUR PASSWORD FROM NEO4J")

# Keep the driver open; the following cells reuse it
driver = GraphDatabase.driver(URI, auth=AUTH)
driver.verify_connectivity()
First, we will upload the ETFs
for index, row in tqdm(df_etfs.iterrows(), total=len(df_etfs)):
    try:
        query = f'CREATE (e_{row["symbol"]}:ETF{{symbol:"{row["symbol"]}", name:"{row["companyName"]}", marketCap:{row["marketCap"]}}})'
        driver.execute_query(query, database_="neo4j")
    except Exception as e:
        print(query)
        print(e)
        break
A sample of the Cypher query that will be executed on the database is as follows:
CREATE (e_QUAL:ETF{symbol:"QUAL", name:"iShares MSCI USA Quality Factor ETF", marketCap:54236893281})
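As a side note, interpolating values into the Cypher string with f-strings works here because we sanitised the symbols, but the driver also supports query parameters, which avoids quoting and escaping issues. Below is a minimal sketch of the same upload using parameters (an alternative to the loop above, not something to run in addition to it, or you will create duplicate nodes):
# Hypothetical alternative: let the driver substitute the values as query parameters
create_etf_query = 'CREATE (e:ETF {symbol: $symbol, name: $name, marketCap: $marketCap})'
for _, row in tqdm(df_etfs.iterrows(), total=len(df_etfs)):
    driver.execute_query(
        create_etf_query,
        symbol=row['symbol'],
        name=row['companyName'],
        marketCap=float(row['marketCap']),
        database_="neo4j",
    )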
Now we should upload the stocks:
for index, row in tqdm(df_company_info.iterrows(), total=len(df_company_info)):
    try:
        query = f'CREATE (c_{row["symbol"]}:COMPANY{{symbol:"{row["symbol"]}", name:"{row["companyName"]}", sector:"{row["sector"]}", marketCap:{row["mktCap"]}}})'
        driver.execute_query(query, database_="neo4j")
    except Exception as e:
        print(query)
        print(e)
        break
And finally, the relationships (edges):
for index, row in tqdm(df_holdings.iterrows(), total=len(df_holdings)):
    try:
        # Match the existing ETF and COMPANY nodes by symbol, then create the HOLDS edge between them.
        # (A bare CREATE with unbound variables would spawn new, unlabelled nodes instead,
        # because Cypher variables do not carry over between separate queries.)
        query = (
            f'MATCH (e:ETF {{symbol:"{row["etfSymbol"]}"}}), (c:COMPANY {{symbol:"{row["asset"]}"}}) '
            f'CREATE (e)-[:HOLDS {{sharesNumber:{row["sharesNumber"]}, weightPercentage:{row["weightPercentage"]}}}]->(c)'
        )
        driver.execute_query(query, database_="neo4j")
    except Exception as e:
        print(query)
        print(e)
        break
The sample to create a relationship is below:
MATCH (e:ETF {symbol:"QUAL"}), (c:COMPANY {symbol:"ED"}) CREATE (e)-[:HOLDS {sharesNumber:2786496.0, weightPercentage:0.522}]->(c)
Notice the difference here: we explicitly connect two existing nodes, while also being able to attach data, such as the weight percentage, at the level of the relationship.
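One design note: CREATE will happily add the same relationship a second time if you re-run the cell. If you expect to re-run the upload, MERGE is the idempotent alternative; here is a sketch under the same data model (the query and parameter names are illustrative):
# Hypothetical re-runnable version: MERGE creates the relationship only if it does not already exist
merge_holds_query = (
    'MATCH (e:ETF {symbol: $etf}), (c:COMPANY {symbol: $asset}) '
    'MERGE (e)-[h:HOLDS]->(c) '
    'SET h.sharesNumber = $shares, h.weightPercentage = $weight'
)
for _, row in df_holdings.iterrows():
    driver.execute_query(
        merge_holds_query,
        etf=row['etfSymbol'],
        asset=row['asset'],
        shares=float(row['sharesNumber']),
        weight=float(row['weightPercentage']),
        database_="neo4j",
    )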
Visualisation of the data in Neo4J
Now, you should return to the Neo4J platform on your free instance, and under the Query menu, you should be able to see the following.

As you will notice, the Nodes and Relationships should be uploaded!
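If you prefer to double-check from Python before opening the UI, a quick count query (a minimal sketch, reusing the driver we already opened) confirms that the nodes and relationships landed:
# Count the uploaded nodes and HOLDS relationships
records, _, _ = driver.execute_query('MATCH (n) RETURN count(n) AS nodes', database_="neo4j")
print('Nodes:', records[0]['nodes'])
records, _, _ = driver.execute_query('MATCH ()-[r:HOLDS]->() RETURN count(r) AS rels', database_="neo4j")
print('HOLDS relationships:', records[0]['rels'])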
Now, let’s observe how the visualisation will work. Let’s retrieve all the nodes (ETF and Companies) using Cypher language.
MATCH (n) RETURN n;
Practically, this retrieves all the nodes (n) and returns them. It is the Cypher equivalent of SQL’s SELECT * FROM, and we all know how that usually ends up. This query will show all the nodes and relationships, like below:

The first thing we notice is the dots that have no relationships. For some reason, those ETFs did not have any assets returned. Let’s see whether we can isolate them with Cypher.
MATCH (e:ETF) WHERE NOT (e)--() RETURN e
This will visualise the following:

We can also visualise the holdings of a specific ETF. In our case, let’s try for the VOO (Vanguard S&P 500 ETF).
MATCH (etf:ETF {symbol: "VOO"})-[:HOLDS]->(company:COMPANY) RETURN etf, company
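If you also want the actual weights rather than just the picture, a small variation returns a table ordered by weight. A sketch run through the Python driver (same labels and HOLDS property as above):
# Top VOO holdings by weight, as plain rows instead of a graph
records, _, _ = driver.execute_query(
    'MATCH (etf:ETF {symbol: "VOO"})-[h:HOLDS]->(c:COMPANY) '
    'RETURN c.name AS company, h.weightPercentage AS weight '
    'ORDER BY weight DESC LIMIT 10',
    database_="neo4j",
)
for rec in records:
    print(f"{rec['company']}: {rec['weight']}%")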

Now let’s explore more complex visualisations. I will write a query that includes all the ETFs holding Nvidia. I will use the name “NVIDIA Corporation” to show that we can filter on any property of the node.
MATCH (company:COMPANY {name: "NVIDIA Corporation"})<-[:HOLDS]-(etf:ETF)-[:HOLDS]->(heldCompany:COMPANY)
RETURN DISTINCT etf, heldCompany

That is an interesting graph. Go and “play” with it by zooming in and out, or try moving the nodes. Trust me, it will be fun. However, let’s see what we can point out immediately from this visualisation, something that would require significant effort, with questionable results, using any other method.
There are two companies at the bottom left that are linked to only one ETF. Zooming in will make it clearer what I am talking about.

What we can see is that among all the ETFs with Nvidia in their portfolio, only one (Amundi MSCI) holds Taiwan Semiconductor and ASML Holding, while the rest of the stocks are held by most of the other ETFs. For me, that warrants some investigation into these two companies. What do you think?
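If you want the database to surface such cases instead of eyeballing the graph, a query that lists companies held by exactly one ETF does the job. A minimal sketch, using the same labels:
# Companies that appear in only a single ETF in our dataset
records, _, _ = driver.execute_query(
    'MATCH (c:COMPANY)<-[:HOLDS]-(e:ETF) '
    'WITH c, collect(e.symbol) AS etfs '
    'WHERE size(etfs) = 1 '
    'RETURN c.name AS company, etfs[0] AS only_etf',
    database_="neo4j",
)
for rec in records:
    print(f"{rec['company']} is held only by {rec['only_etf']}")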
Let’s do one final example. Let’s show the top 5 ETFs in terms of market cap, with their holdings.
MATCH (e:ETF)-[:HOLDS]->(c:COMPANY)
WITH e, collect(c) AS companies
ORDER BY e.marketCap DESC
LIMIT 5
RETURN e, companies

If this were SQL, it would produce a table with all the companies, and we would need to run some counts and order-bys to truly understand what is happening. What makes this visualisation unique is that with this simple query, you can immediately see that the two blue ETF nodes in the centre are the ones with the most holdings, while the others on the left have the least.
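The same observation can be quantified: counting HOLDS relationships per ETF gives the exact numbers behind what the picture suggests. A sketch, again via the Python driver:
# Number of holdings per ETF, largest first
records, _, _ = driver.execute_query(
    'MATCH (e:ETF)-[:HOLDS]->(c:COMPANY) '
    'RETURN e.symbol AS etf, count(c) AS holdings '
    'ORDER BY holdings DESC',
    database_="neo4j",
)
for rec in records:
    print(f"{rec['etf']}: {rec['holdings']} holdings")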
Besides Visualisation?
A simple example, as above, has demonstrated the power of visualising with GraphDB. But is this the only benefit? Definitely not.

The above graph, which we obtained from the Neo4J website, shows the difference in performance based on the “hops” of a query. A hop is how many times we “jump” from one record to another. For example, finding my friend’s friend’s friend involves 3 hops for each record. Now imagine how much faster this would be in a graph database compared to a relational database, with thousands of stocks and ETFs as in our previous example.
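To make the hop idea concrete with our data: asking which ETFs overlap most with VOO through shared holdings is a two-hop traversal, from the ETF to a company and back out to another ETF. A sketch of how that might look:
# ETFs ranked by how many holdings they share with VOO (a two-hop pattern)
records, _, _ = driver.execute_query(
    'MATCH (:ETF {symbol: "VOO"})-[:HOLDS]->(c:COMPANY)<-[:HOLDS]-(other:ETF) '
    'RETURN other.symbol AS etf, count(c) AS shared_holdings '
    'ORDER BY shared_holdings DESC',
    database_="neo4j",
)
for rec in records:
    print(f"{rec['etf']} shares {rec['shared_holdings']} holdings with VOO")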
The Python driver has you covered here: you can execute the queries as usual and work through the results, though do not expect them to come back as tables or dataframes. Let’s see a Python example for the previous query of the top 5 ETFs.
query = """
MATCH (e:ETF)-[:HOLDS]->(c:COMPANY)
WITH e, collect(c) AS companies
ORDER BY e.marketCap DESC
LIMIT 5
RETURN e, companies
"""
def get_top_etfs_with_companies(tx):
    result = tx.run(query)
    for record in result:
        etf = record["e"]
        companies = record["companies"]
        print(f"ETF Symbol: {etf.get('symbol')}, Market Cap: {etf.get('marketCap')}")
        print("Holds Companies:")
        for company in companies:
            print(f" - {company.get('name')}")
        print()

with driver.session() as session:
    session.execute_read(get_top_etfs_with_companies)
This will return the following:
ETF Symbol: SEMG_L, Market Cap: 7316752931120
Holds Companies:
- NVIDIA Corporation
- Broadcom Inc.
- Taiwan Semiconductor Manufacturing Company Limited
- ASML Holding N.V.
- Advanced Micro Devices, Inc.
- Texas Instruments Incorporated
- QUALCOMM Incorporated
- Applied Materials, Inc.
- Micron Technology, Inc.
- Lam Research Corporation
ETF Symbol: VOO, Market Cap: 1326566414232
Holds Companies:
- NVIDIA Corporation
- Broadcom Inc.
- Advanced Micro Devices, Inc.
ETF Symbol: VFIAX, Market Cap: 1325308446655
Holds Companies:
- NVIDIA Corporation
- Broadcom Inc.
- Advanced Micro Devices, Inc.
ETF Symbol: JURE_L, Market Cap: 945886890977
Holds Companies:
- NVIDIA Corporation
- Texas Instruments Incorporated
- Lam Research Corporation
ETF Symbol: IVV, Market Cap: 659327809893
Holds Companies:
- NVIDIA Corporation
- Broadcom Inc.
Conclusion
This article is mainly aimed at providing a high-level overview of the possibilities of a graph database and how we can combine the extensive data from FMP APIs with a Neo4J GraphDB. You can spend hours examining datasets that have complicated relationships.
GraphDBs make it easier to get answers through visualisation, and their performance advantage on relationship-heavy queries is apparent. So if “a picture’s worth a thousand words” and “time is money”, then graph databases must be the best storytellers, saving time, money, and sparing us from boring spreadsheets forever ;)
With that being said, you’ve reached the end of the article. Hope you learned something new and useful today. Thank you for your time.