MLearn – Distribution

Data distributions can be viewed using the ‘Distribution’ tab of MLearn. Available plot options are density, histogram, boxplot, violin and correlation. To start, download the following dataset, named ‘returns.csv’, to your local drive. Right-click the link below and select ‘download linked file’.

returns

The following is a preview of the data. It contains the daily returns of Microsoft (MSFT), Amazon (AMZN), Apple (AAPL) and Google (GOOG) stock since December 2014. The returns were calculated from adjusted closing prices obtained from Yahoo Finance, using the following formula.

Return = (Stock Price Today – Stock Price Previous Day) / Stock Price Previous Day
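As a minimal sketch of this calculation, the snippet below computes simple daily returns from a list of prices. It assumes the conventional definition of a simple return (today's price minus the previous day's price, divided by the previous day's price). The function name and the prices are hypothetical and are not taken from returns.csv.

```python
def daily_returns(prices):
    """Simple returns: (today - previous day) / previous day."""
    return [
        (today - prev) / prev
        for prev, today in zip(prices, prices[1:])
    ]


# Hypothetical adjusted closing prices, for illustration only.
prices = [100.0, 102.0, 99.96]
returns = daily_returns(prices)
print([round(r, 3) for r in returns])
```

A series of n prices yields n - 1 returns, because the first day has no previous price to compare against.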

 

Date MSFT AMZN AAPL GOOG
12/10/14 -0.014 -0.021 -0.019 -0.014
12/11/14 0.006 0.005 -0.003 0.004
12/12/14 -0.005 0 -0.017 -0.018
12/15/14 -0.006 -0.004 -0.014 -0.009
12/16/14 -0.032 -0.036 -0.014 -0.036
12/17/14 0.013 0.013 0.025 0.019
12/18/14 0.039 -0.004 0.03 0.012
12/19/14 0.003 0.007 -0.008 0.01
12/22/14 0.007 0.022 0.01 0.017
12/23/14 0.01 -0.001 -0.004 0.011
12/24/14 -0.006 -0.011 -0.005 -0.003
12/26/14 -0.005 0.02 0.018 0.01
12/29/14 -0.009 0.01 -0.001 -0.007
12/30/14 -0.009 -0.006 -0.012 0
12/31/14 -0.012 0 -0.019 -0.008
1/2/15 0.007 -0.006 -0.01 -0.003
1/5/15 -0.009 -0.021 -0.028 -0.021

 

Now load data from returns.csv into the app.

Click ‘Distribution’ at the top to open the tab. Under ‘Variables’, select the columns to be visualized. Select the plot type in ‘select plot’.

By default, ‘density’ is selected. This option calculates and plots kernel density estimates. Kernel density estimation is a non-parametric method for estimating the probability density function of a continuous random variable.
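To make the idea concrete, here is a rough sketch of a kernel density estimate with a Gaussian kernel, applied to the MSFT values from the preview above. This is not necessarily how MLearn computes its density plot. The function name is hypothetical and the bandwidth of 0.01 is a hand-picked assumption, whereas plotting tools usually choose the bandwidth automatically.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function x -> estimated density at x (Gaussian kernel)."""
    n = len(sample)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one Gaussian bump per observation, centered on that observation.
        return sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        ) / norm

    return density


# MSFT daily returns from the preview rows.
msft = [-0.014, 0.006, -0.005, -0.006, -0.032, 0.013, 0.039, 0.003, 0.007,
        0.010, -0.006, -0.005, -0.009, -0.009, -0.012, 0.007, -0.009]

density = gaussian_kde(msft, bandwidth=0.01)  # bandwidth is an assumption
for x in (-0.03, -0.01, 0.0, 0.01, 0.03):
    print(f"{x:+.2f}  {density(x):.2f}")
```

A smaller bandwidth gives a bumpier curve that follows individual observations, while a larger one gives a smoother curve that may hide detail.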

Select ‘MSFT’, ‘AMZN’, ‘AAPL’ and ‘GOOG’ and click ‘Submit Change’. This’ll plot density distributions of daily returns of the selected stocks.

Label and text size can be adjusted using the dropdowns under ‘Plot Design’.

Histograms are used to view the distribution of numerical data and were first introduced by Karl Pearson. For continuous variables, a histogram is built by dividing the x axis into bins and counting the number of observations in each bin. Select ‘histogram’ under ‘select plot’ and click ‘Submit Change’.
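The binning step can be sketched as follows, using the MSFT values from the preview. The function name, the range of -0.04 to 0.04 and the choice of four equal-width bins are assumptions for illustration, not the settings MLearn uses.

```python
def histogram_counts(values, bins, lo, hi):
    """Count observations in `bins` equal-width bins covering [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if lo <= v <= hi:
            # The last bin is closed on the right so that v == hi is counted.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
    return counts


# MSFT daily returns from the preview rows.
msft = [-0.014, 0.006, -0.005, -0.006, -0.032, 0.013, 0.039, 0.003, 0.007,
        0.010, -0.006, -0.005, -0.009, -0.009, -0.012, 0.007, -0.009]

counts = histogram_counts(msft, bins=4, lo=-0.04, hi=0.04)
print(counts)  # [1, 9, 6, 1]
```

Each count becomes the height of one bar. Most of the preview returns fall between -0.02 and 0.02.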

A box plot, or box-and-whisker plot, is a way to visualize the distribution of data based on five values: minimum, first quartile, median, third quartile, and maximum. The middle rectangle spans from the first quartile to the third quartile, a distance known as the interquartile range (IQR). The line inside the rectangle is the median, and the whiskers extend to the smallest and largest values that are not flagged as outliers. A value is treated as an outlier if it lies 3 × IQR or more above the third quartile, or 3 × IQR or more below the first quartile, and outliers are shown as individual points. Select ‘box’ in ‘select plot’ and click ‘Submit Change’.
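The numbers behind a box plot can be sketched as below. This is an illustration under stated assumptions, not MLearn's implementation. The function name is hypothetical, quartiles are computed with the ‘inclusive’ method of Python's statistics module (other tools may interpolate differently), and the multiplier k defaults to 3 to match the description above, although 1.5 is another widely used convention. For simplicity, values lying exactly on a fence are treated as inside it.

```python
import statistics


def box_plot_stats(values, k=3.0):
    """Quartiles, whisker ends and outliers for a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Values strictly beyond the fences are drawn as individual points.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# MSFT daily returns from the preview rows.
msft = [-0.014, 0.006, -0.005, -0.006, -0.032, 0.013, 0.039, 0.003, 0.007,
        0.010, -0.006, -0.005, -0.009, -0.009, -0.012, 0.007, -0.009]

for name, value in box_plot_stats(msft).items():
    print(name, value)
```

Calling `box_plot_stats(msft, k=1.5)` applies the narrower fences, which typically flags more points as outliers.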

A violin plot is a combination of a box plot and a density plot. It is a mirrored density plot laid out like a box plot. Select ‘violin’ under ‘select plot’ and click ‘Submit Change’.

Select ‘correlation’ under ‘select plot’ to visualize correlations between all selected variables. This will show the correlations in a matrix. If numeric discrete variables are present, they will be treated as continuous and included in the correlation matrix. In this example, the Pearson correlation coefficient between ‘GOOG’ and ‘MSFT’ is 0.6.
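For reference, the Pearson correlation coefficient and a small correlation matrix can be sketched as follows. The function names are hypothetical. Only the preview rows for MSFT and GOOG are used here, so the printed value will not necessarily match the 0.6 reported for the full dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map each pair of column names to its Pearson correlation."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a in columns
        for b in columns
    }


# Preview rows only, not the full dataset.
columns = {
    "MSFT": [-0.014, 0.006, -0.005, -0.006, -0.032, 0.013, 0.039, 0.003,
             0.007, 0.010, -0.006, -0.005, -0.009, -0.009, -0.012, 0.007,
             -0.009],
    "GOOG": [-0.014, 0.004, -0.018, -0.009, -0.036, 0.019, 0.012, 0.010,
             0.017, 0.011, -0.003, 0.010, -0.007, 0.0, -0.008, -0.003,
             -0.021],
}

matrix = correlation_matrix(columns)
print(round(matrix[("MSFT", "GOOG")], 2))
```

The matrix is symmetric and its diagonal is 1, since every variable is perfectly correlated with itself.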

It is always good practice to examine the distributions of, and correlations between, response and predictor variables before further analysis and predictive modeling of the dataset.