# Creating a Diversified Portfolio with Correlation Matrix in Python

**A must-know process for anyone investing in the equity market**

**Introduction**

Investment in tradeable assets is not restricted to institutional or professional traders; many ordinary people invest to earn a side income over the long term. Professional and institutional traders aim to make substantial money from the market quickly and are therefore willing to take on large risks. Ordinary investors are the opposite: their goal is not to make a fortune overnight but to build a steady, reliable income that grows gradually over time. Above all, they dislike risk and value consistent returns.

To satisfy both conditions at once, investors use portfolio diversification: the practice of holding a varied set of stocks in one's portfolio, which reduces overall risk and increases the likelihood of steady returns.

The stocks in a diversified portfolio are not chosen at random; specific steps or approaches are followed. In this article, we are going to follow a statistical approach: using the correlation matrix to pick the right stocks to hold in a diversified portfolio. Before that, what is correlation? Correlation measures the strength of the relationship between two or more variables. It can be classified into three types: positive correlation, where the relationship between the variables is greater than zero (> 0); negative correlation, where it is less than zero (< 0); and no correlation, where the reading equals zero (= 0).
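These three cases are easy to see with a quick sketch. The series below are synthetic (not market data), chosen so each pair lands cleanly in one category:

```
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Positive correlation: y rises whenever x rises
pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# Negative correlation: y falls whenever x rises
neg = np.corrcoef(x, -3 * x + 10)[0, 1]

# No correlation: y has no linear relationship with x's trend
y = np.array([1.0, -1.0, 0.0, -1.0, 1.0])
zero = np.corrcoef(x, y)[0, 1]

print(pos, neg, zero)  # close to 1, -1, and 0 respectively
```
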

The ultimate goal of this article is to find the two best stocks among FAANG (an acronym for Facebook, Amazon, Apple, Netflix, Google) to hold in a diversified portfolio that achieves lower risk along with gradual, steady income. With that said, let's code the approach in Python.

Before moving on, a disclaimer: this article's sole purpose is to educate, and its contents should be treated as information only, not as investment advice.

**Implementation in Python**

The coding part can be separated into various steps as follows:

1. Importing Packages
2. Extracting Stock Data from Twelve Data
3. Calculating Returns
4. Creating and Analyzing the Correlation Matrix
5. Backtesting
6. Volatility Comparison

We will follow the order in the list above, so buckle up and follow along with each coding part.

**Step-1: Importing Packages**

Importing the required packages into the Python environment is a non-skippable step. The primary packages are Pandas for working with data, NumPy for arrays and numerical functions, Matplotlib and Seaborn for plotting, and Requests for making API calls. The secondary packages are Math for mathematical functions and Termcolor for font customization (optional).

**Python Implementation:**

```
# IMPORTING PACKAGES

import pandas as pd
import requests
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from math import floor
from termcolor import colored as cl
from matplotlib import style
from matplotlib import rcParams

style.use('fivethirtyeight')
rcParams['figure.figsize'] = (20, 10)
```

Now that we have imported all the required packages into our Python environment, let's pull the historical data of FAANG with Twelve Data's API endpoint.

**Step-2: Extracting Stock Data from Twelve Data**

In this step, we are going to pull the historical stock data of FAANG using an API endpoint provided by twelvedata.com. A note on the provider: Twelve Data is one of the leading market data providers, with an enormous number of API endpoints covering all types of market data. Its APIs are easy to work with and exceptionally well documented. Also, make sure you have an account on twelvedata.com; only then will you be able to access your API key (a vital element for extracting data through the API).

**Python Implementation:**

```
# EXTRACTING STOCKS DATA

def get_historical_data(symbol, start_date, end_date):
    api_key = 'YOUR API KEY'
    api_url = f'https://api.twelvedata.com/time_series?symbol={symbol}&interval=1day&outputsize=5000&apikey={api_key}'
    raw_df = requests.get(api_url).json()
    df = pd.DataFrame(raw_df['values']).iloc[::-1].set_index('datetime').astype(float)
    df = df[df.index >= start_date]
    df = df[df.index <= end_date]
    df.index = pd.to_datetime(df.index)
    return df

fb = get_historical_data('FB', '2020-01-01', '2021-01-01')
amzn = get_historical_data('AMZN', '2020-01-01', '2021-01-01')
aapl = get_historical_data('AAPL', '2020-01-01', '2021-01-01')
nflx = get_historical_data('NFLX', '2020-01-01', '2021-01-01')
googl = get_historical_data('GOOGL', '2020-01-01', '2021-01-01')
```

**Code Explanation:** First, we define a function named 'get_historical_data' that takes the stock's symbol ('symbol'), the starting date ('start_date'), and the ending date ('end_date') of the historical data as parameters. Inside the function, we define the API key and the URL and store them in their respective variables. Next, we extract the historical data in JSON format using the 'get' function and store it in the 'raw_df' variable. After cleaning and formatting the raw JSON data, we return it as a clean Pandas dataframe. Finally, we call the function to pull the historical data of FAANG from the beginning of 2020 and store it in the respective variables ('fb', 'amzn', 'aapl', 'nflx', 'googl').
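The cleaning steps described above can be exercised on a hand-made sample. The dictionary below is a hypothetical, simplified stand-in that mirrors the shape the parsing code relies on (a 'values' list of bars, newest first, with a 'datetime' key and string-valued prices); it is not a verbatim Twelve Data response:

```
import pandas as pd

# Hypothetical stand-in for the JSON payload (newest bar first, string prices)
sample_json = {
    'values': [
        {'datetime': '2020-01-03', 'open': '74.3', 'high': '75.1', 'low': '73.2', 'close': '74.4', 'volume': '146322800'},
        {'datetime': '2020-01-02', 'open': '74.1', 'high': '75.2', 'low': '73.8', 'close': '75.1', 'volume': '135480400'},
    ]
}

# Same cleaning steps as in the function: reverse into chronological order,
# index by date, and cast the string fields to floats
df = pd.DataFrame(sample_json['values']).iloc[::-1].set_index('datetime').astype(float)
df.index = pd.to_datetime(df.index)

print(df['close'])  # 2020-01-02 -> 75.1, then 2020-01-03 -> 74.4
```
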

**Step-3: Calculating Returns**

In this step, we are going to calculate the cumulative returns of all FAANG stocks and plot them to observe the correlation between each of them.

**Python Implementation:**

```
# CALCULATING RETURNS

fb_rets, fb_rets.name = fb['close'] / fb['close'].iloc[0], 'fb'
amzn_rets, amzn_rets.name = amzn['close'] / amzn['close'].iloc[0], 'amzn'
aapl_rets, aapl_rets.name = aapl['close'] / aapl['close'].iloc[0], 'aapl'
nflx_rets, nflx_rets.name = nflx['close'] / nflx['close'].iloc[0], 'nflx'
googl_rets, googl_rets.name = googl['close'] / googl['close'].iloc[0], 'googl'

plt.plot(fb_rets, label = 'FB')
plt.plot(amzn_rets, label = 'AMZN')
plt.plot(aapl_rets, label = 'AAPL')
plt.plot(nflx_rets, label = 'NFLX')
plt.plot(googl_rets, label = 'GOOGL', color = 'purple')
plt.legend(fontsize = 16)
plt.title('FAANG CUMULATIVE RETURNS')
plt.show()
```

**Output:**

**Code Explanation:** In the first few lines, we calculate the cumulative returns of each FAANG stock by dividing the current closing price by the stock's initial closing price. We then plot the returns with Matplotlib, producing the chart above. Returns could also be calculated on a daily timeframe, but cumulative returns make it much easier to spot correlations between stocks when plotted. For example, from the chart above, we can see that a strong correlation exists among all five stocks, since they show similar fluctuations in price. In contrast, such movements are nearly impossible to observe in a daily returns plot, where the lines overlap one another.
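To make the link between the two return types concrete, here is a small sketch with synthetic prices: compounding the daily percentage changes rebuilds exactly the cumulative return series plotted above.

```
import numpy as np
import pandas as pd

# Synthetic closing prices (not real market data)
close = pd.Series([100.0, 102.0, 99.0, 105.0])

# Cumulative returns, as computed in this article
cumulative = close / close.iloc[0]

# Daily returns: percentage change from one day to the next
daily = close.pct_change()

# Compounding the daily returns recovers the cumulative series
rebuilt = (1 + daily.fillna(0)).cumprod()

print(np.allclose(cumulative, rebuilt))  # True
```
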

**Step-4: Creating and Analyzing the Correlation Matrix**

Out of all the steps, this is the most interesting one: we construct a correlation matrix from the returns we calculated earlier and analyze it to see which stocks are best suited for our diversified portfolio.

**Python Implementation:**

```
# CREATING THE CORRELATION MATRIX

rets = [fb_rets, amzn_rets, aapl_rets, nflx_rets, googl_rets]
rets_df = pd.DataFrame(rets).T.dropna()
rets_corr = rets_df.corr()

plt.style.use('default')
sns.heatmap(rets_corr, annot = True, linewidths = 0.5)
plt.show()
```

**Output:**

**Code Explanation:** First, we create a variable named 'rets' to store all the returns we calculated earlier, and then build a dataframe from it. To calculate the correlation between the stocks, we use the 'corr' function provided by Pandas and store the matrix in the 'rets_corr' variable. A correlation matrix is hard to interpret until it is plotted as a heatmap.

Heatmaps can be plotted with Matplotlib alone, but only with considerable effort. This is where Seaborn comes into play. Seaborn is a Python package that provides an extensive set of functions for statistical graphics, so we use its 'heatmap' function to plot the correlation matrix as a heatmap.

Now, let's analyze the heatmap. The values inside the plot are the correlation scores. The correlation between Google and Facebook is 0.91, indicating a strong relationship: the two stocks tend to move in the same direction most of the time (note that correlation measures the strength of the co-movement, not the exact size of the moves). Likewise, the correlation between Google and Netflix is 0.69, the weakest relationship in the matrix; their price movements are still linked, but much more loosely (though not contrary to each other, since the correlation is not negative).

The main goal of a diversified portfolio is to keep risk as low as possible, and to achieve this we want to hold stocks with as little correlation to each other as possible. If one of the portfolio's stocks falls, the other, weakly correlated stock is less likely to fall with it, cushioning the loss. From the correlation matrix above, Google and Netflix are the least correlated of the group, so they are the best candidates to hold in our diversified portfolio.
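Reading the least-correlated pair off the heatmap by eye works fine for five stocks, but the same pick can be automated. A minimal sketch, using a small synthetic returns frame as a stand-in for this article's 'rets_df':

```
import numpy as np
import pandas as pd

# Synthetic daily returns: 'b' closely tracks 'a', while 'c' is independent
rng = np.random.default_rng(0)
base = rng.normal(0, 0.01, 250)
rets_df = pd.DataFrame({
    'a': base,
    'b': base + rng.normal(0, 0.002, 250),
    'c': rng.normal(0, 0.01, 250),
})

corr = rets_df.corr()

# Keep only the strictly lower triangle so each pair appears once,
# then pick the entry with the smallest correlation
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))
pair = lower.stack().idxmin()

print(pair)  # the least-correlated pair; here it involves 'c'
```
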

**Step-5: Backtesting**

Now that we have decided which stocks to hold in our diversified portfolio, let's run a backtest to see how well the portfolio performs. This step is essential not only when diversifying a portfolio but in any financial research, since it gives insight into how well an investment strategy would have performed.

**Python Implementation:**

```
# BACKTESTING

investment_value = 100000
N = 2
nflx_allocation = investment_value / N
googl_allocation = investment_value / N

nflx_stocks = floor(nflx_allocation / nflx['close'].iloc[0])
googl_stocks = floor(googl_allocation / googl['close'].iloc[0])

nflx_investment_rets = nflx_rets * nflx_stocks
googl_investment_rets = googl_rets * googl_stocks
total_rets = round(sum(((nflx_investment_rets + googl_investment_rets) / 2).dropna()), 3)
total_rets_pct = round((total_rets / investment_value) * 100, 3)

print(cl(f'Profit gained from the investment : {total_rets} USD', attrs = ['bold']))
print(cl(f'Profit percentage of our investment : {total_rets_pct}%', attrs = ['bold']))
```

**Output:**

```
Profit gained from the investment : 30428.957 USD
Profit percentage of our investment : 30.429%
```

**Code Explanation:** First, we create a variable named 'investment_value' to store the total capital we want to invest: one hundred thousand dollars. Next, we allocate the capital equally between the two stocks; a portfolio with an equal allocation of capital is known as an equal-weights portfolio. Investors sometimes assign unique weights to each stock based on various factors, but that is beyond the scope of this article. After that, we create two variables, 'nflx_stocks' and 'googl_stocks', to store the number of shares we can buy with the allocated capital. Then comes the investment return calculation: we compute each stock's contribution by multiplying the number of shares bought by the stock's returns calculated earlier.

After that, to calculate the total returns of our diversified portfolio, we sum the average of the two stocks' returns. We also calculate the profit percentage by dividing the total investment returns by the invested capital and multiplying by 100. From the output, we can see that our portfolio made an approximate profit of thirty thousand four hundred dollars, a profit percentage of 30.429% in one year. That's not bad!
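The equal split above is just the simplest weighting scheme. For completeness, here is a hedged sketch of how a custom-weighted allocation could look; the 60/40 weights and the first-day closing prices below are hypothetical stand-ins, not values taken from the data above:

```
from math import floor

investment_value = 100000

# Hypothetical weights (must sum to 1) instead of the 50/50 equal split
weights = {'nflx': 0.6, 'googl': 0.4}

# Hypothetical first-day closing prices standing in for nflx['close'].iloc[0], etc.
first_close = {'nflx': 329.81, 'googl': 1368.68}

# Whole shares affordable per ticker, mirroring the floor() step above
shares = {t: floor(investment_value * w / first_close[t]) for t, w in weights.items()}

# Cash left uninvested because shares only come in whole units
leftover = investment_value - sum(shares[t] * first_close[t] for t in shares)

print(shares)              # {'nflx': 181, 'googl': 29}
print(round(leftover, 2))  # 612.67
```
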

**Step-6: Volatility Comparison**

As mentioned before, the ultimate goal of holding a diversified portfolio is to reduce risk as much as possible; generating income is secondary. Now that we know our portfolio made a profit, let's see whether we also managed to reduce the risk. There are many metrics that can be used to gauge risk, and in this article we will use volatility. For those unfamiliar with the term, volatility measures how widely an investment's returns swing, and investors use it as a proxy for the risk associated with the investment.

**Python Implementation:**

```
# VOLATILITY CALCULATION

rets_df['Portfolio'] = (rets_df[['googl', 'nflx']].sum(axis = 1)) / 2
daily_pct_change = rets_df.pct_change()
volatility = round(np.log(daily_pct_change + 1).std() * np.sqrt(252), 5)

# Labels follow the column order of rets_df: fb, amzn, aapl, nflx, googl, Portfolio
companies = ['FB', 'AMZN', 'AAPL', 'NFLX', 'GOOGL', 'PORTFOLIO']
for i in range(len(volatility)):
    if i == 5:
        print(cl(f'{companies[i]} VOLATILITY : {volatility.iloc[i]}', attrs = ['bold'], color = 'green'))
    else:
        print(cl(f'{companies[i]} VOLATILITY : {volatility.iloc[i]}', attrs = ['bold']))
```

**Output:**

```
FB VOLATILITY : 0.46539
AMZN VOLATILITY : 0.38944
AAPL VOLATILITY : 0.47043
NFLX VOLATILITY : 0.46069
GOOGL VOLATILITY : 0.3881
PORTFOLIO VOLATILITY : 0.37843
```

**Code Explanation:** First, we create a new column 'Portfolio' in the 'rets_df' dataframe (used previously) to store the returns of our diversified portfolio. Then we use the 'pct_change' function from Pandas to compute the percentage change between each reading and the prior one for every column in 'rets_df', storing the result in the 'daily_pct_change' variable. Then comes the volatility calculation. Before walking through the code, keep in mind the formula for annualized volatility:

```
VOLATILITY = STD[ LOG( PCT CHANGE + 1 ) ] * SQRT(252)

where,
LOG( PCT CHANGE + 1 ) = natural log of one plus the daily percentage change (the daily log return)
STD = standard deviation taken across all trading days
SQRT(252) = square root of 252, the typical number of trading days in a year, used to annualize the daily figure
```

We translate the above formula into code to calculate the volatility of each stock and store the results in the 'volatility' variable. From the output, we can see that our diversified portfolio achieved the lowest volatility of all, lower than any individual FAANG stock. That's great!
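As a sanity check on the formula, consider two synthetic price series: one that grows by exactly 1% every day (its daily log returns are all identical, so their standard deviation, and hence the volatility, is essentially zero) and one whose daily moves alternate between +3% and -2% (that dispersion shows up as positive volatility):

```
import numpy as np
import pandas as pd

def annualized_volatility(close):
    # Same formula as above: std of log(1 + daily pct change), scaled by sqrt(252)
    daily_pct_change = close.pct_change()
    return np.log(daily_pct_change + 1).std() * np.sqrt(252)

# Steady series: +1% every day -> constant log returns -> volatility ~ 0
steady = pd.Series([100 * 1.01 ** i for i in range(60)])

# Choppy series: alternating +3% / -2% days -> positive volatility
choppy = pd.Series(100 * np.cumprod([1.03 if i % 2 == 0 else 0.98 for i in range(60)]))

print(annualized_volatility(steady))  # effectively 0
print(annualized_volatility(choppy))  # clearly positive
```
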

**Final Thoughts!**

After a long programming session, we have successfully built a diversified portfolio that keeps risk low while still making a profit. I also compared our portfolio's performance with that of the SPY ETF (an ETF designed to track the S&P 500 market index), and we outperformed it by a small margin. Now, let's talk about improvements.

The first improvement is to consider a larger universe of stocks. In this article, we considered only five stocks and picked the two that were least correlated with the rest. But portfolio diversification works best when holding a large number of uncorrelated stocks. For example, we could consider every stock in the S&P 500 market index and pick the most uncorrelated among them. By doing this, we achieve two important things.

Firstly, we would find stocks that are far less correlated. In this article, the lowest correlation score is 0.69 (which is still a fairly strong positive relationship) because only a few stocks were at our disposal. With a large pool of stocks spanning various sectors, we could find pairs with much lower, or even negative, correlation, which would strengthen the portfolio's diversification. Secondly, we would be able to reduce risk more effectively.

Another improvement is to hold not only stocks but other assets as well. The first goal of a diversified portfolio is to reduce risk; the second is to generate steady income. The certainty of regular returns from stocks alone is limited, since stocks carry a level of volatility that can never be eliminated entirely, whereas opening the portfolio to a broader range of assets, such as bonds and ETFs, raises the probability of steady income.

With that, you've reached the end of the article. If you missed any of the coding parts, don't worry: the full source code is provided below. I hope you learned something new and useful from this article.

**Full code:**

```
# IMPORTING PACKAGES

import pandas as pd
import requests
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from math import floor
from termcolor import colored as cl
from matplotlib import style
from matplotlib import rcParams

style.use('fivethirtyeight')
rcParams['figure.figsize'] = (20, 10)

# EXTRACTING STOCKS DATA

def get_historical_data(symbol, start_date, end_date):
    api_key = 'YOUR API KEY'
    api_url = f'https://api.twelvedata.com/time_series?symbol={symbol}&interval=1day&outputsize=5000&apikey={api_key}'
    raw_df = requests.get(api_url).json()
    df = pd.DataFrame(raw_df['values']).iloc[::-1].set_index('datetime').astype(float)
    df = df[df.index >= start_date]
    df = df[df.index <= end_date]
    df.index = pd.to_datetime(df.index)
    return df

fb = get_historical_data('FB', '2020-01-01', '2021-01-01')
amzn = get_historical_data('AMZN', '2020-01-01', '2021-01-01')
aapl = get_historical_data('AAPL', '2020-01-01', '2021-01-01')
nflx = get_historical_data('NFLX', '2020-01-01', '2021-01-01')
googl = get_historical_data('GOOGL', '2020-01-01', '2021-01-01')

# CALCULATING RETURNS

fb_rets, fb_rets.name = fb['close'] / fb['close'].iloc[0], 'fb'
amzn_rets, amzn_rets.name = amzn['close'] / amzn['close'].iloc[0], 'amzn'
aapl_rets, aapl_rets.name = aapl['close'] / aapl['close'].iloc[0], 'aapl'
nflx_rets, nflx_rets.name = nflx['close'] / nflx['close'].iloc[0], 'nflx'
googl_rets, googl_rets.name = googl['close'] / googl['close'].iloc[0], 'googl'

plt.plot(fb_rets, label = 'FB')
plt.plot(amzn_rets, label = 'AMZN')
plt.plot(aapl_rets, label = 'AAPL')
plt.plot(nflx_rets, label = 'NFLX')
plt.plot(googl_rets, label = 'GOOGL', color = 'purple')
plt.legend(fontsize = 16)
plt.title('FAANG CUMULATIVE RETURNS')
plt.show()

# CREATING THE CORRELATION MATRIX

rets = [fb_rets, amzn_rets, aapl_rets, nflx_rets, googl_rets]
rets_df = pd.DataFrame(rets).T.dropna()
rets_corr = rets_df.corr()

plt.style.use('default')
sns.heatmap(rets_corr, annot = True, linewidths = 0.5)
plt.show()

# BACKTESTING

investment_value = 100000
N = 2
nflx_allocation = investment_value / N
googl_allocation = investment_value / N
nflx_stocks = floor(nflx_allocation / nflx['close'].iloc[0])
googl_stocks = floor(googl_allocation / googl['close'].iloc[0])

nflx_investment_rets = nflx_rets * nflx_stocks
googl_investment_rets = googl_rets * googl_stocks
total_rets = round(sum(((nflx_investment_rets + googl_investment_rets) / 2).dropna()), 3)
total_rets_pct = round((total_rets / investment_value) * 100, 3)

print(cl(f'Profit gained from the investment : {total_rets} USD', attrs = ['bold']))
print(cl(f'Profit percentage of our investment : {total_rets_pct}%', attrs = ['bold']))

# VOLATILITY CALCULATION

rets_df['Portfolio'] = (rets_df[['googl', 'nflx']].sum(axis = 1)) / 2
daily_pct_change = rets_df.pct_change()
volatility = round(np.log(daily_pct_change + 1).std() * np.sqrt(252), 5)

# Labels follow the column order of rets_df: fb, amzn, aapl, nflx, googl, Portfolio
companies = ['FB', 'AMZN', 'AAPL', 'NFLX', 'GOOGL', 'PORTFOLIO']
for i in range(len(volatility)):
    if i == 5:
        print(cl(f'{companies[i]} VOLATILITY : {volatility.iloc[i]}', attrs = ['bold'], color = 'green'))
    else:
        print(cl(f'{companies[i]} VOLATILITY : {volatility.iloc[i]}', attrs = ['bold']))
```