Session 02

Pandas - python data analysis tool

Brief introduction about Pandas

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables.

In comparison with Microsoft Excel, Pandas has no real limit and handles millions of data points. It is easier to create and use complex equations and calculations on your data. As opposed to Excel, it is completely free to download and use.

Pandas and Matplotlib installation

At first, we should install Pandas and Matplotlib libraries directly to our Python IDE.

Open a PyCharm, go to Settings-->Project(name_of_your_project)-->Project Interpreter. There you should add required packages by pressing + symbol and install matplotlib and pandas from available packages respectively.

Sometimes there is a problem with installation those packages when you have Python version 3.8. This version is relatively new, there might be incompatibilities with installing particular packages.

We can avoid such an issue. During the installation required packages, install first matplotlib first and then pandas, in a consecutive way. Do not forget to update your pip as well. You can either do it in Project Interpreter menu or by typing python -m pip install --upgrade pip in the terminal.

If you find these methods unsuccessful, then you should install Python 3.7 version instead Python 3.8.

First example: IMU data

All .csv data under the scope of this course can be found in git repository: group Group Uno, project name Tutorial_IT2, folder name Praktikum 2. Open intro to pandas folder and download .csv files. Copy these files directly to that folder, where you created your python file.

First example is about data from the two different IMU (inertial measurement unit), ID IMU and MT IMU respectively. In the scope of tutorial, we will learn how to read a simple .csv file and plot the data.

We will start with data. The files are called ID_IMU.csv and MT_IMU.csv. Data are in .csv (comma-separated values) format. It is quite popular format and actually it is simple.

I will show here how to work with ID_IMU.csv file. You can work with MT_IMU.csv file in parallel. The steps will be absolutely similar.

This is what the data looks like in any editor:

_images/7.png

Open up your PyCharm. First, we have to import pandas, which is the main library:

#! /usr/bin/python

import pandas as pd
# import [module] as [another_name]

We import matplotlib for graphing:

import matplotlib.pyplot as plt

We should get our data loaded in. We do it using read_csv() function:

dataframe = pd.read_csv("ID_IMU.csv")
# in case of MT_IMU: dataframe = pd.read_csv("MT_IMU.csv", sep=";")
print(dataframe)

We can see all data in Run Window now. If you do not want to load in all of the data, use head() or tail() functions to print certain number of rows from bottom or top:

print (dataframe.head(5))

We can see now only first top three rows in Run Window.

Pandas figures out that the first line of the file is the header. We can see we have a header at the top, that gives us the ten columns we have: DIAG_STAT, X_GYRO (16 bit), etc.

   DIAG_STAT  X_GYRO (16 bit)  ...  TIME_STAMP  BM_CHECKSUM
0          0            0.100  ...           0          382
1          0           -0.275  ...           0          382
2          0           -0.025  ...           0          382
3          0           -0.050  ...           0          382
4          0           -0.175  ...           0          382

[5 rows x 10 columns]

What if your .csv file does not have a header? We can still read the file, however Pandas will manually provide the header.

There is another file called ID_IMU_no_headers.csv. It is the same file as before, without the headers.

We can see data in Run Window now. 0, 0.1, -0.1, etc. are headers of columns now:

   0    0.1   -0.1   0.15  0.02125  0.07875  -1.19875  38.2      0  382
0  0 -0.275  0.000  0.150  -0.0025  0.03000  -1.10125  38.2      0  382
1  0 -0.025 -0.125  0.075   0.0000  0.01375  -1.18750  38.2      0  382
2  0 -0.050 -0.300  0.025   0.0325  0.02875  -1.12500  38.2      0  382
3  0 -0.175 -0.275 -0.050   0.0200  0.00500  -1.01625  38.2      0  382
4  0  0.125 -0.425 -0.025   0.0425  0.01375  -1.08375  38.2      0  382

But..this looks not good. Let us declare our own headers. We would like to name them diagonal, x-axis, y-axis, etc:

headers = ["diagonal","x-gyros","y-gyros","z-gyros","x-accel","y-accel","z_accel","temp_out","time_stamp","checksum"]
dataframe_no = pd.read_csv("ID_IMU_no_headers.csv", names = headers)
print(dataframe_no.head(5))

We can see this result:

   diagonal  x-gyros  y-gyros  z-gyros  ...  z_accel  temp_out  time_stamp  checksum
0         0    0.100   -0.100    0.150  ... -1.19875      38.2           0       382
1         0   -0.275    0.000    0.150  ... -1.10125      38.2           0       382
2         0   -0.025   -0.125    0.075  ... -1.18750      38.2           0       382
3         0   -0.050   -0.300    0.025  ... -1.12500      38.2           0       382
4         0   -0.175   -0.275   -0.050  ... -1.01625      38.2           0       382

[5 rows x 10 columns]

As we have data now, it is time to plot it.

We start by removing the index from the file, as Pandas by default puts an index. If we see our data structure, we can observe the values going 0,1,2,3,4....

We remove these indexes column by replacing them with diagonal column.

So, how do we replace one with another:

dataframe = pd.read_csv("ID_IMU.csv")
dataframe.set_index("DIAG_STAT",inplace=True)
print(dataframe.head(5))

inplace makes the changes in the dataframe if true.

We see this result:

DIAG_STAT  X_GYRO (16 bit)  Y_GYRO (16 bit)  ...  TIME_STAMP  BM_CHECKSUM
                            ...
0                    0.100           -0.100  ...           0          382
0                   -0.275            0.000  ...           0          382
0                   -0.025           -0.125  ...           0          382
0                   -0.050           -0.300  ...           0          382
0                   -0.175           -0.275  ...           0          382

[5 rows x 9 columns]

Before plotting the graph, be sure that you commented (#dataframe.set_index...) the dataframe, where you have made changes with index. Plotting with changes in index might lead to wrong graph:

dataframe.plot()
plt.show()

We get this image:

_images/01.png

When you finish writing your python file, please push it to your personal git repository.

Second example: Wage hours

Let us look one more example. This time we will deal with wages data and learn how to add our own headers, deal with tab separated files and extract columns from the data.

We start as before:

#! /usr/bin/python

import pandas as pd
import matplotlib.pyplot as plt
dataframe = pd.read_csv("wages_hours.csv")
print(dataframe.head(5))

It seems to be a bit messy:

HRS\tRATE\tERSP\tERNO\tNEIN\tASSET\tAGE\tDEP\tRACE\tSCHOOL
0  2157\t2.905\t1121\t291\t380\t7250\t38.5\t2.340...
1  2174\t2.970\t1128\t301\t398\t7744\t39.3\t2.335...
2  2062\t2.350\t1214\t326\t185\t3068\t40.1\t2.851...
3  2111\t2.511\t1203\t49\t117\t1632\t22.4\t1.159\...
4  2134\t2.791\t1013\t594\t730\t12710\t57.7\t1.22...

If we open wages_hours.csv, we can notice that values in data are separated not with comma(,). Even though the name is comma separated values - csv, they can be separated by anything, including tabs.

The \t in the run window means tabs. Pandas was not able to parse the file, as it was expecting commas, not tabs. By default, it expects commas.

We will read now the file again. This time we tell to Pandas the separator is tabs, not commas using sep = \t function:

dataframe = pd.read_csv("wages_hours.csv", sep = "\t")
print (dataframe.head(5))

We can see organized data now:

    HRS   RATE  ERSP ERNO  NEIN  ASSET   AGE    DEP  RACE  SCHOOL
0  2157  2.905  1121  291   380   7250  38.5  2.340  32.1    10.5
1  2174  2.970  1128  301   398   7744  39.3  2.335  31.2    10.5
2  2062  2.350  1214  326   185   3068  40.1  2.851     *     8.9
3  2111  2.511  1203   49   117   1632  22.4  1.159  27.5    11.5
4  2134  2.791  1013  594   730  12710  57.7  1.229  32.5     8.8

What if there are data that more than sufficient to us. Let us extract only two columns from the dateframe, namely AGE and RATE:

dataframe2 = dataframe[["AGE","RATE"]]
print(dataframe2.head(5))

We can see data only for AGE and RATE:

    AGE   RATE
0  38.5  2.905
1  39.3  2.970
2  40.1  2.350
3  22.4  2.511
4  57.7  2.791

In order to plot a proper graph, we should sort the data for AGE in either ascending or descending order. Also, we should again replace AGE with index, as we did previously:

dataframe_sorted = dataframe2.sort_values(["AGE"], ascending=True)
dataframe_sorted.set_index("AGE", inplace=True)
print(dataframe_sorted.head(5))

We get:

 AGE   RATE
22.4  2.511
37.2  3.015
37.4  1.901
37.5  1.899
37.5  3.009

We are ready to plot the data:

dataframe_sorted.plot()
plt.show()

We get:

_images/03.png

When you finish writing your python file, please push it to your personal git repository.

Merging in Pandas

Now, we will learn merging in Pandas.

All .csv data under the scope of this course can be found in git repository: group Group Uno, project name Tutorial_IT2, folder name Praktikum 2. Open merging in pandas folder and download .csv files. Copy these files directly to that folder, where you created your python file.

We will start with two files: visitors.csv and visitors-new.csv. Both files are similar. The first contains all visitors to an unknown website, second contains the new visitors only, who are visiting the website for the first time.

If you open these files, you can see that the first five lines are a comment. The actual data starts below and in the format: date, visitors. So, on 9 February, 2015, we had 59 visitors. Our first task is to remove first five lines of comments. We start by importing Python libraries:

import pandas as pd
import matplotlib.pyplot as plt

The data does not contain headers, so we have to define by our own:

headers = ["date", "visitors"]
data = pd.read_csv("visitors.csv", skiprows=4, names = headers)
print(data.head(5))

We used the function skiprows. This means we skipped first four rows, as they contained comments only.

You may ask why 4 rows instead of 5? Because, Pandas assumes that the 5th row is for the header. So, we get:

         date  visitors
0  2015-02-09        59
1  2015-02-08        79
2  2015-02-07        73
3  2015-02-06        89
4  2015-02-05        80

Let us now open another file, visitors-new.csv:

headers = ["date", "visitors_new"]
data_new = pd.read_csv("visitors-new.csv", skiprows=4, names = headers)
print(data_new.head(5))

We get:

         date  visitors-new
0  2015-02-09            55
1  2015-02-08            64
2  2015-02-07            61
3  2015-02-06            79
4  2015-02-05            60

We opened both files and now we can see that they contain a common field: date. The dates are actually same.

Now we will merge these two dataframes, using merge function. Pandas is smart enough to figure out which date is common between them, so it merges on date. You can also manually specify which field you want to merge on.

data_combined = pd.merge(data, data_new)
print(data_combined.head(5))

We get now a combined data:

         date  visitors  visitors-new
0  2015-02-09        59            55
1  2015-02-08        79            64
2  2015-02-07        73            61
3  2015-02-06        89            79
4  2015-02-05        80            60

Now, let us sort the date in ascending order and remove the index:

data_combined.sort_values(["date"], inplace=True)
data_combined.set_index("date", inplace=True)
print(data_combined.head(5))

We get:

date        visitors  visitors-new
2014-07-14         5             4
2014-07-15        58            55
2014-07-16        18            15
2014-07-17        14            10
2014-07-18        11             9

We plot it:

data_combined.plot()
plt.show()

We get image:

_images/04.png

When you finish writing your python file, please push it to your personal git repository.


Data Analysis with Pandas

We will work with obesity data in England.

All .csv data under the scope of this course can be found in git repository: group Group Uno, project name Tutorial_IT2, folder name Praktikum 2. Open data analysis folder and download .csv files. Copy these files directly to that folder, where you created your python file.

We start with importing pandas and matplotlib. We should also import xlrd dependency. It is a utility that helps to extract data from Microsoft Excel spreadsheet files. Go to Settings-->Project(name_of_your_project)-->Project Interpreter. There you should add required packages by pressing + symbol and install xlrd from available packages respectively.

#! /usr/bin/python
import pandas as pd
import matplotlib.pyplot as plt
import xlrd

We will open now .xls file. The great thing about pandas is that, we can open Excel files as well. Normally, most libraries can only work with .csv files.

We print all sheet names first:

data = pd.ExcelFile("obesity-2014.xls")
print(data.sheet_names)

We print all the sheet names:

['Chapter 7', '7.1', '7.2', '7.3', '7.4', '7.5', '7.6', '7.7', '7.8', '7.9', '7.10']

Let us have a look at sheet 1. If you open obesity-2014.xls file, sheet 7.1, you can see four columns: year, total, males, females.

The year column does not have a header. The header for year is empty.

Let us define the header columns ourselves:

# Define the columns to be read
columns = ['year', 'total', 'males', 'females']

Now we read sheet 7.1. from excel file. If we look carefully, the top 4 and bottom 14 rows contain useless info, so we should skip them, using skiprows and skipfooter functions.

The former function deals with rows in the beginning, the latter function deals with last rows. Finally, we tell pandas to name the column headers using our list of names:

data = data.parse('7.1', skiprows=4, skipfooter=14, names=columns)
print(data)

By using data.parse we are analysing a string of symbols of data structures.

We get:

       year    total   males  females
0       NaN      NaN     NaN      NaN
1   2002/03   1275.0   427.0    848.0
2   2003/04   1711.0   498.0   1213.0
3   2004/05   2035.0   589.0   1442.0
4   2005/06   2564.0   746.0   1786.0
5   2006/07   3862.0  1047.0   2807.0
6   2007/08   5018.0  1405.0   3613.0
7   2008/09   7988.0  2077.0   5910.0
8   2009/10  10571.0  2495.0   8074.0
9   2010/11  11574.0  2919.0   8654.0
10  2011/12  11736.0  2993.0   8740.0
11  2012/13  10957.0  2950.0   8007.0

If we pay attention to the result we got, we see NaN (Not A Number) values. If we compare with original spreadsheet, we can see, there is an empty space on line 4.

Since it is empty, it is read as NaN by pandas.

We should get rid of it. We use inbuilt dropna function. It allows the user to analyze and drop Rows/Columns with Null values:

# Remove the N/A from the data
data.dropna(inplace = True)

We get:

       year    total   males  females
1   2002/03   1275.0   27.0    848.0
2   2003/04   1711.0   498.0   1213.0
3   2004/05   2035.0   589.0   1442.0
4   2005/06   2564.0   746.0   1786.0
5   2006/07   3862.0  1047.0   2807.0
6   2007/08   5018.0  1405.0   3613.0
7   2008/09   7988.0  2077.0   5910.0
8   2009/10  10571.0  2495.0   8074.0
9   2010/11  11574.0  2919.0   8654.0
10  2011/12  11736.0  2993.0   8740.0
11  2012/13  10957.0  2950.0   8007.0

We do not see NaN values any more.

Let us git rid of index numbers now, as we are doing since beginning:

data.set_index('year', inplace=True)
print(data)

We get:

   year    total   males  females
2002/03   1275.0   427.0    848.0
2003/04   1711.0   498.0   1213.0
2004/05   2035.0   589.0   1442.0
2005/06   2564.0   746.0   1786.0
2006/07   3862.0  1047.0   2807.0
2007/08   5018.0  1405.0   3613.0
2008/09   7988.0  2077.0   5910.0
2009/10  10571.0  2495.0   8074.0
2010/11  11574.0  2919.0   8654.0
2011/12  11736.0  2993.0   8740.0
2012/13  10957.0  2950.0   8007.0

Let us plot all:

_images/05.png

We can see that obesity for men and women has gone up more strongly.


Let us have a look now at sheet 2. Last time we defined the headers ourselves. This time, we will let pandas pick them up:

# Read 2nd section, by age
data2  = data.parse('7.2', skiprows=4, skipfooter=14)
print(data2)

We get:

    Unnamed: 0    Total  Under 16  16-24  ...   45-54   55-64  65-74  75 and over
0         NaN      NaN       NaN    NaN  ...     NaN     NaN    NaN          NaN
1     2002/03   1275.0     400.0   65.0  ...   216.0    94.0   52.0         23.0
2     2003/04   1711.0     579.0   67.0  ...   273.0   151.0   52.0         24.0
3     2004/05   2035.0     547.0  107.0  ...   364.0   174.0   36.0         32.0
4     2005/06   2564.0     583.0   96.0  ...   554.0   258.0   72.0         20.0
5     2006/07   3862.0     656.0  184.0  ...   872.0   459.0  118.0         43.0
6     2007/08   5018.0     747.0  228.0  ...  1198.0   598.0  157.0         53.0

If you remember, the year column did not have a header, that is why pandas named it Unnamed. Let us rename it:

# Rename unames to year
data2.rename(columns={'Unnamed: 0': 'Year'}, inplace=True)
print(data2)

We get:

        Year    Total  Under 16  16-24  ...   45-54   55-64  65-74  75 and over
0       NaN      NaN       NaN    NaN  ...     NaN     NaN    NaN          NaN
1   2002/03   1275.0     400.0   65.0  ...   216.0    94.0   52.0         23.0
2   2003/04   1711.0     579.0   67.0  ...   273.0   151.0   52.0         24.0
3   2004/05   2035.0     547.0  107.0  ...   364.0   174.0   36.0         32.0
4   2005/06   2564.0     583.0   96.0  ...   554.0   258.0   72.0         20.0
5   2006/07   3862.0     656.0  184.0  ...   872.0   459.0  118.0         43.0

Let us drop NaN again. Also we replace again index with Year:

# Drop empties and reset index
data2.dropna(inplace=True)
data2.set_index('Year', inplace=True)
print(data2)

We get:

 Year      Total  Under 16  16-24   25-34  ...   45-54   55-64  65-74  75 and over
                               ...
2002/03   1275.0     400.0   65.0   136.0  ...   216.0    94.0   52.0         23.0
2003/04   1711.0     579.0   67.0   174.0  ...   273.0   151.0   52.0         24.0
2004/05   2035.0     547.0  107.0   287.0  ...   364.0   174.0   36.0         32.0
2005/06   2564.0     583.0   96.0   341.0  ...   554.0   258.0   72.0         20.0

Let us plot:

data2.plot()
plt.show()

We get:

_images/06.png

We can observe from the graph, that Total column is huge and overrides everything. We should git rid of it in order to make the graph clearer:

# Drop the total column and plot
data2_minus_total = data2.drop('Total', axis = 1)

We used drop() function. It drops entries from a table.

Now we plot the graph, dropping the Total column:

data2_minus_total.plot()
plt.show()

We get:

_images/07.png

What if we want to compare age groups? How children compare to adults?

In pandas, we can review any column by using data[column_name]. We can plot any age group this way:

data2['Under 16'].plot(label = "Under 16")
data2['25-34'].plot(label = "25-34")
plt.show()

We get:

_images/08.png

We see that while children’s obesity has gone down, that for adults has jumped.

When you finish writing your python file, please push it to your personal git repository.


Congratulations, we have learnt how to use Pandas library in Python programming.