How I built basic reports 📈 of the US Stock Market Big Data using Python?

7 min readNov 11, 2017

Introduction

Hello, my friends 🙌

My name is Batyr and I am a graduate student, Computer Science major, at Pace University. And when I heard that our university offers a class (CS660) related to one of the hottest topic in IT world nowadays, Big Data Analytics, I immediately decided to register for this class and here I am writing my first post in Medium and sharing my thoughts about building some basic reports based on the big data from US Stock Market for 2016 year (taken from https://github.com/pydata/pandas-datareader). I will be using the most widely used language Python in Big Data Analytics and some of the very powerful open-source libraries, such as Pandas, Numpy, MatPlotLib and Seaborn. This is the first assignment, Project #1, that we have had from our professor. And if you have any improvements or thoughts about my work, please, feel free to share them in comments section, since we are all here for gaining some valuable knowledge 🖖

Prerequisites
#1. List all publicly traded corporations
#2. Find relative returns for them
#3. Best and worst performed corporations. Plot their total relative returns + using box and whiskers graphs
#4. Total revenue if $10k is invested
#5. Total return for Apple (40%), Amazon (30%) and Google (30%)

Prerequisites

There is a short list of tools and applications that should be installed on your machine for successful proceeding all the steps below:

Install Python (Anaconda) on your machine and make sure that your versions are compatible. You can use this link to install Conda and this one to install Python using conda in the terminal.
Anaconda Navigator, from where you will be able to launch Jupyter Notebook local server.
That’s it. Now you can check versions of installed tools in the terminal window. I have following versions installed on my machine: python v. = 2.7.13, anaconda v. = 4.2.0, conda v. = 4.3.25

#1. List all publicly traded corporations

So let’s start! Our first task is to print a list of all publicly traded corporations found by calling the get_nasdaq_symbols module that is the part of the pandas_datareader. This list has to include a corporation’s Symbol, Security Name and Listing exchange fields. If you would like to read more about Pandas DataReader then this link might be useful.

Before we proceed to printing a list of corporations, we have to import some libraries and check if they are set up correctly.

listing #1

Now we can start talking about how to get a list with all the fields mentioned before. I am going to use .loc[] function of Panda’s DataFrame that allow us to get labels of our table. Also we need to transform List Exchange symbols to more human-readable strings. You can get all symbols definitions here. And then we print first 5 rows (that is default value for .head() function) of our table. And this is how my code looks like:

listing #2

#2. Find relative returns for them

Of course, if we start calculating relative returns for each corporation listed in the task #1, we will spend a lot of time. Accordingly, I have decided to limit number of corporations and picked following corporations: Apple (APPL), Amazon (AMZN), Google (GOOGL) and IBM (IBM), besides we are going to use three of them in the task #5. Also we are going to use Yahoo as data source.

Before moving on to the coding part, let’s check what definition Wikipedia gives us “Relative return is a measure of the return of an investment portfolio relative to a theoretical passive reference portfolio or benchmark”. So basically, it shows a return achieved by an asset over a specific time period. We need to calculate relative returns for 2016 year based on the formula below:

listing #3

As we have all of the Close Prices for 2016 year for each corporations we are working with, we can start calculating relative returns. Before we start implementing this, I would like to share our professor’s note related to relative returns:

There are several ways one can go about when a trading strategy is to be developed. One approach is to use time-series which correspond not to actual values but changes in the monetary value of the asset. These time-series can and do assume negative values and also, their statistical properties are usually more stable than the ones of price time-series. The most frequently used forms used are relative returns defined as:
r-relative, r(t)=(p(t)–p(t−1)/p(t−1))
log-returns defined as r(t)=log(p(t)/p(t-1)), where p(t) is the price of the asset at time t.
For example, if p(t)=101 and p(t−1)=100 then r-relative (t)= (101–100)/100 = 1%

So now we are ready to code this:

listing #4

listing #5

#3. Best and worst performed corporations. Plot their total relative returns + using box and whiskers graphs

If we have all information about Close Prices for each corporations, we can calculate the best and worst performed of them, by finding difference between maximum and minimum prices of Close Price during 2016 year, and we just need to pick minimum and maximum number of them to know which one is best or worst performed. It seems like pretty easy task:

listing #6

So as we can see on the picture, the best performed corporation is Amazon and the worst one is Apple. .describe() function of Panda’s DataFrame generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values (more info). And we just added extra label min-max diff to the data frame.

Here is a code for plotting their relative returns. For more information on how to use MatPlotLib.PyPlot, please go through this link or just google it 🙃

listing #7

And now we can plot whisker graphs for both of the corporations:

listing #8

listing #9

#4. Total revenue if $10k is invested

The exact question for this task is “You have invested $10,000 in 2016 by splitting the capital into the 2 securities from step #3. What would have been your total (relative) return?”. So as we don’t have any limitations about which method of calculation to use, I have chosen to the way when you invest on 1.JAN.2016 and sell on 31.DEC.2016. So here is a code:

listing #10

And the output is:

1. Relative return for Apple = 12.11%
2. Relative return for Amazon = 4.98%

So as we can see, if use this method, we would earn more if we invest to Apple than to Amazon, but actually if we guess the dates of maximum and minimum values for Close Price, we could earn much more!

#5. Total return for Apple (40%), Amazon (30%) and Google (30%)

And the last task of our assignment is to calculate total (relative) return for 2016, if we invested 40% on Apple, 30% on Amazon and 30% on Google, so that’s why I have started with these corporations in step #2. I am going to use the same method of calculating relative return as in the previous step:

listing #11

Output:

1. Relative return for Apple = 11.54%
2. Relative return for Amazon = -0.02%
3. Relative return for Google = -20.75%
Our capital on the beginning of 2016 — $10.000. Our income for the end of 2016 = 9838.30

If we change our method of calculation, because I used the simplest one, we can achieve more total revenue in the end.

So here I am going to stop because we successfully achieved what we were asked for in the beginning of this post. And again, please, if you have any improvements or thoughts about my work, feel free to share them in comments section, since we are all here for gaining some valuable knowledge 🖖 And here is a link to Github repository.

Like this story? Please share and recommend to others. Together, we can achieve more 💪