Fit Multiple Distributions to Asset Returns!
March 28, 2021 by Chris
Earlier, in Are Stock Returns Normally Distributed?, we went through different ways to test whether asset returns are normally distributed. While we used only one stock to show that stock returns are not normally distributed, the phenomenon applies, in general, to any volatile asset.
The question, though, is: which distribution do returns really follow?
Below we load the MSFT stock returns as we did before!
Package Installation
%%capture
%pip install yahoofinancials

from yahoofinancials import YahooFinancials
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import dateutil.parser
def retrieve_stock_data(ticker, start, end):
    json = YahooFinancials(ticker).get_historical_price_data(start, end, "daily")
    columns = ["adjclose"]  # ["open", "close", "adjclose"]
    df = pd.DataFrame(columns=columns)
    for row in json[ticker]["prices"]:
        d = dateutil.parser.isoparse(row["formatted_date"])
        df.loc[d] = [row["adjclose"]]  # [row["open"], row["close"], row["adjclose"]]
    df.index.name = "date"
    return df

def normal_rets(S):
    return S.pct_change().dropna()

stock_prices = retrieve_stock_data("MSFT", "2019-01-01", "2020-01-01")
rets = normal_rets(stock_prices).dropna()
rets.columns = ['returns']
Fit all known distributions
Instead of trying to fit only the Gaussian distribution to our data, we will try to fit all the distributions implemented in scipy.stats and see which one (or ones) fits the data best.
import numpy as np
from scipy.stats._continuous_distns import _distn_names
import scipy.stats as st
import warnings

def fit_all_distributions(data):
    """
    Returns: Dict of the density function and the Sum of Squared Errors (SSE)
    """
    # Get histogram of original data. First, get the x for the pdf,
    # second, get y to calculate the distance between the
    # distribution and the real data
    y, x = np.histogram(data, bins='auto', density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    dist_fit = {}

    # Estimate distribution parameters from data
    for distname in _distn_names:
        distribution = getattr(st, distname)

        # Try to fit the distribution
        try:
            # Ignore warnings for data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')

                # fit dist to data
                params = distribution.fit(data)

                # Calculate fitted PDF and error
                pdf = distribution.pdf(x, *params)
                sse = np.sum(np.power(y - pdf, 2.0))

                dist_fit[distname] = {}
                dist_fit[distname]['pdf'] = pdf
                dist_fit[distname]['sse'] = sse
        except Exception:
            pass

    return dist_fit
dist_fit = fit_all_distributions(rets.returns)
len(dist_fit)
99
What? 99? :) Well, it should not come as a surprise that several distributions can be fit to the data. But the "fit" could be really off :) For example, an exponential or even a uniform distribution can be fit to the data, but the squared error (the distance between the actual data points and the respective distribution points) would be very large compared to distributions that actually approximate the histogram of our data.
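To make that concrete, here is a quick sanity check of our own (a sketch only, reusing the dist_fit dictionary built above): distributions whose shape clearly does not match the histogram should show a much larger SSE than the better candidates.

```python
# A quick sanity check (assumes the dist_fit dict built above).
# Poorly matching shapes (e.g. uniform, exponential) should come
# with a noticeably larger SSE than the better candidates.
for name in ("uniform", "expon", "laplace"):
    if name in dist_fit:
        print(name, dist_fit[name]["sse"])
```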
Let's get the 15 distributions with the lowest SSE (the ones that best fit the data).
dname_list = []
sse_list = []
pdf = []

for dname in dist_fit:
    dname_list.append(dname)
    sse_list.append(dist_fit[dname]['sse'])
    pdf.append(dist_fit[dname]['pdf'])

df = pd.DataFrame({'dist': dname_list, 'sse': sse_list, 'pdf': pdf})
df = df.sort_values('sse')
df_15 = df[:15]
df_15
| | dist | sse | pdf |
|---|---|---|---|
| 52 | laplace | 148.651407 | [1.1152568034938246, 1.787521047421414, 2.8650... |
| 95 | gennorm | 164.078771 | [0.8367820735065007, 1.5154080142470994, 2.686... |
| 17 | dweibull | 176.635717 | [0.8793256263714793, 1.5296378660593237, 2.635... |
| 16 | dgamma | 191.050465 | [0.8552917947109029, 1.4658525793451274, 2.502... |
| 44 | hypsecant | 213.097016 | [0.7998518705010982, 1.383459240576739, 2.3919... |
| 48 | norminvgauss | 230.918089 | [0.9381562968673568, 1.5495654573259756, 2.580... |
| 89 | tukeylambda | 232.611350 | [0.7436837695141802, 1.2913003803519416, 2.270... |
| 51 | johnsonsu | 241.389639 | [0.9016851129004099, 1.4956683047905153, 2.519... |
| 69 | t | 243.179748 | [0.715121368715189, 1.2350959953532723, 2.1881... |
| 70 | nct | 250.858249 | [0.8688042602613769, 1.4480199008827952, 2.464... |
| 30 | genlogistic | 260.651315 | [0.8748265286114669, 1.5253897598139983, 2.650... |
| 56 | logistic | 263.548520 | [0.6451586874653275, 1.2223472162162576, 2.299... |
| 10 | burr12 | 271.252587 | [0.7636589016658705, 1.3949317532979122, 2.529... |
| 12 | cauchy | 272.502209 | [1.4774507773037833, 1.8876747700442502, 2.491... |
| 55 | levy_stable | 285.357222 | [0.7622030792115081, 1.252074636445709, 2.2016... |
Interestingly, laplace is the distribution that best approximates the data. However, several other distributions are not far behind laplace in terms of SSE. To showcase that, let's see how these distributions visually compare with the actual data.
y, x = np.histogram(rets.returns, bins=19, density=True)
x = (x + np.roll(x, -1))[:-1] / 2.0

ax = rets.returns.plot(kind='hist', bins=19, figsize=(14,7), density=True, alpha=0.5,
                       color=list(matplotlib.rcParams['axes.prop_cycle'])[1]['color'])

for index, row in df_15.iterrows():
    pd.Series(row.pdf, x).plot(ax=ax, label=row.dist)

ax.set_title(u'All Fitted Distributions')
ax.set_xlabel(u'Returns')
ax.set_ylabel('Frequency')
plt.legend(loc="upper left")
ax.plot()

An interesting observation from the graph above is that the distributions almost coincide in their legs and tails. They agree less on the height of the central (mean) bar, although all of them approximate it reasonably well.
So, are we done yet? Shall we, from now on, use laplace as the distribution to represent asset returns?
Unfortunately, no :(! The reason is that several other factors can easily interfere with what we expect:
- The sampling frequency of the data points (daily vs. minute vs. monthly returns)
- The selected period (the last 3 months will give different results compared to the last 6 months, 1 year, or more; see the sketch after this list)
- The reason we use a distribution in the first place (to measure risk? to predict returns? etc.)
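As a rough illustration of the second point (a sketch only, reusing the retrieve_stock_data, normal_rets and fit_all_distributions helpers from above; the date windows are arbitrary), we can refit on two different windows and compare which distribution ends up with the lowest SSE:

```python
# Sketch: refit on two different (arbitrary) windows of MSFT prices and
# check which distribution wins by SSE in each case. The "winner" will
# often differ between windows.
for start, end in [("2019-10-01", "2020-01-01"), ("2019-01-01", "2020-01-01")]:
    window_rets = normal_rets(retrieve_stock_data("MSFT", start, end))
    fits = fit_all_distributions(window_rets.iloc[:, 0])
    best = min(fits, key=lambda name: fits[name]["sse"])
    print(f"{start} .. {end}: best fit = {best} (SSE = {fits[best]['sse']:.2f})")
```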
Why is the Student's t distribution often used?
You might have wondered why so many people use the Student's t distribution when analyzing stock return data!
The answer has to do with the amount of risk an investor can accept when putting money into a risky asset. There are different ways to measure risk, and one of them is risk estimation based on a fitted distribution (model-based risk).
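As a quick aside (our own sketch, not part of the original analysis), a common statistical motivation for reaching for the Student's t is that daily returns tend to have heavier tails than the normal distribution, which we can eyeball through the sample's excess kurtosis:

```python
import scipy.stats as st

# Excess kurtosis of the sample returns; a normal distribution has 0,
# positive values indicate heavier tails (assumes rets from above).
print("excess kurtosis of MSFT daily returns:", st.kurtosis(rets.returns))
```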
Say, for example, in the MSFT scenario above, that we would like to avoid any daily price drop of more than 2%. We then ask: how confident are we that there will be no drop of more than 2% tomorrow (since we are working with daily returns)?
The answer to that question can be derived from the CDF (Cumulative Distribution Function) of a distribution, which is the area under the PDF (Probability Density Function).
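As a side note (a sketch of our own, not how the analysis below proceeds), scipy.stats can evaluate the CDF of a fitted distribution directly, which avoids the histogram-based approximation we will use in a moment:

```python
import scipy.stats as st

# Fit the two distributions explicitly and evaluate their CDFs at -2%
# (assumes rets from above; not part of the original notebook flow).
norm_params = st.norm.fit(rets.returns)
t_params = st.t.fit(rets.returns)
print("P(return <= -0.02), normal:   ", st.norm.cdf(-0.02, *norm_params))
print("P(return <= -0.02), Student t:", st.t.cdf(-0.02, *t_params))
```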
Here is how the normal and Student's t PDFs look for the MSFT returns over the period we examine.
ax = rets.returns.plot(kind='hist', bins=19, figsize=(14,7), density=True, alpha=0.5,
                       color=list(matplotlib.rcParams['axes.prop_cycle'])[1]['color'])

student_t_pdf = df.loc[df['dist'] == 't'].pdf.values[0]
pd.Series(student_t_pdf, x).plot(ax=ax, label='t')

normal_pdf = df.loc[df['dist'] == 'norm'].pdf.values[0]
pd.Series(normal_pdf, x).plot(ax=ax, label='normal')

ax.set_title(u'Student-T and Normal Distributions')
ax.set_xlabel(u'Returns')
ax.set_ylabel('Frequency')
plt.legend(loc="upper left")
ax.plot()

Back to our question now: how confident are we that the price will not drop by more than 2% tomorrow (a return below -0.02)?
norm_cdf = np.cumsum(normal_pdf)
t_cdf = np.cumsum(student_t_pdf)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14,7))
fig.suptitle('CDF for Normal and T dists')
ax1.plot(x, norm_cdf)
ax2.plot(x, t_cdf)
plt.show()

norm_cdf[np.where(x <= -0.02)][-1]
9.876651890962592
t_cdf[np.where(x <= -0.02)][-1]
8.086428763077489
According to the normal distribution there is a 9.9% probability that the price will drop by more than 2%, while the Student's t estimate is only 8.1%. Depending on their risk tolerance, someone might be more willing to put in additional money knowing that the estimate is much closer to reality (well, historical reality)!
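For completeness (a small sketch of our own, not in the original flow), that "historical reality" can be read straight off the sample as the fraction of days in the period with a drop of more than 2%:

```python
# Empirical share of days with a return of -2% or worse (assumes rets from above).
empirical = (rets.returns <= -0.02).mean()
print(f"Empirical P(daily return <= -2%): {empirical:.3f}")
```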
But, again, it is really up to us to use whatever distribution we feel comfortable with, based on the data we have, our risk appetite, and of course (when it comes to statistics) the tools that make our analysis easier.