Python For Machine Learning: Pandas Dataframe

Python for Machine Learning: Pandas DataFrame is a series of posts covering various aspects of the pandas DataFrame, with a focus on the operations you will perform on it most often. Follow along with the examples to get the most out of each post.

In this post, we cover the basics of the pandas DataFrame and the operations most commonly performed on it.

What is DataFrame?

The DataFrame is one of the data structures most commonly used by data scientists. It is analogous to a table or spreadsheet. The Python package pandas offers several ways to create one: you can load a dataset from local storage, such as a CSV or Excel file, or from a SQL database, and you can build a DataFrame from a dict of Series, nested lists, JSON, dictionaries, and so on.

A DataFrame is a 2-dimensional tabular structure with labeled rows and columns. It is mutable, and its columns can hold different data types. You can think of it as a spreadsheet or SQL table. You can create a DataFrame from lists, Series, a 2-dimensional NumPy array, or another DataFrame.

Properties of DataFrame

  • Structurally like spreadsheet or SQL table
  • Columns can be heterogenous like int, float, string, etc.
  • It is mutable: columns can be inserted, deleted, and modified
  • A DataFrame is a 2-dimensional data structure
  • Similar to a 2-dimensional NumPy array, but with labeled axes and columns that may differ in type
  • A DataFrame column is a Series structure

By default, every axis of a DataFrame has an index, and you can replace these default indexes to suit your use case. Indexes provide fast lookups and are helpful when performing join operations.
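As a small illustration of label-based lookup, the sketch below builds a tiny frame, replaces the default row index with the values of one column, and looks a value up by label. The column names and values are made up for this example.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the index.
df = pd.DataFrame({'flower': ['setosa', 'versicolor', 'virginica'],
                   'petal_len': [1.4, 4.7, 6.0]})

# The default row index is a RangeIndex: 0, 1, 2
default_labels = list(df.index)

# Replace the row index with the values of an existing column.
by_flower = df.set_index('flower')

# Label-based lookup of a single cell.
petal = by_flower.loc['versicolor', 'petal_len']
print(default_labels, petal)
```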

DataFrame Structure

Consider the IRIS flower dataset example to understand the DataFrame structure.

[Figure: pandas DataFrame structure]

DataFrame Constructor

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

DataFrame([data, index, columns, dtype, copy])

Parameters

  • data – structured or homogeneous ndarray, iterable, dict, or DataFrame
  • index – Index to use for the resulting frame. Optional; defaults to RangeIndex if no index is provided.
  • columns – Column labels to use for the resulting frame. Optional; defaults to RangeIndex if no column labels are provided.
  • dtype – Data type to force on the data. Only a single dtype is allowed; if omitted, the types are inferred.
  • copy – Whether to copy the input data. Older pandas releases default to False; newer releases default to None, in which case the behavior depends on the type of the input.
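To make the parameters above concrete, here is a minimal sketch that passes data, index, columns, and dtype explicitly. The labels r1, r2, x, and y are arbitrary names chosen for this illustration.

```python
import pandas as pd

# A 2x2 nested list with explicit row labels, column labels, and a forced dtype.
df = pd.DataFrame([[1, 2], [3, 4]],
                  index=['r1', 'r2'],
                  columns=['x', 'y'],
                  dtype='float64')

print(df)
print(df.dtypes)
```

Because dtype is given, the integer inputs are stored as floats in both columns.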

How to Create a DataFrame?

You can create a DataFrame from the following data types:

  • Dictionary of 1-Dimensional NumPy arrays, lists, dictionaries, or Series
  • 2- Dimensional NumPy array
  • Series
  • Another DataFrame
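The last two items in the list are not demonstrated elsewhere in this post, so here is a short sketch of both. The Series name and values are invented for the example.

```python
import pandas as pd

# From a Series: the Series name becomes the column label.
s = pd.Series([10, 20, 30], name='score')
from_series = pd.DataFrame(s)

# From another DataFrame: copy=True gives the new frame its own data.
from_frame = pd.DataFrame(from_series, copy=True)
from_frame.loc[0, 'score'] = 99   # does not touch from_series

print(from_series)
print(from_frame)
```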

Create an Empty DataFrame

If you do not yet know the structure of your data, you can start with an empty DataFrame and load data into it later.

import pandas as pd
df = pd.DataFrame()
print(df)

Output:

Empty DataFrame
Columns: []
Index: []

Now, add a few columns to the empty DataFrame from lists.

df['Elements'] = ['a', 'b', 'c', 'd']
df['Numbers'] = ['1', '2', '3', '4']
print(df)

Output:

  Elements Numbers
0        a       1
1        b       2
2        c       3
3        d       4
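Note that the Numbers column above holds strings rather than integers, because the values in the list are quoted. A quick sketch of checking and converting the type, assuming every value is a valid integer:

```python
import pandas as pd

df = pd.DataFrame()
df['Elements'] = ['a', 'b', 'c', 'd']
df['Numbers'] = ['1', '2', '3', '4']

# Quoted digits are strings, so they would not behave numerically.
first_is_str = isinstance(df['Numbers'].iloc[0], str)

# Convert the column to integers.
df['Numbers'] = df['Numbers'].astype(int)
total = df['Numbers'].sum()
print(first_is_str, total)
```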

Create a DataFrame from Lists

There are multiple ways to create a DataFrame. Let's look at examples that create a DataFrame from a single list and from a list of lists.

import pandas as pd
elements = ['a', 'b', 'c', 'd']
df = pd.DataFrame(elements)
print(df)

Output:

   0
0  a
1  b
2  c
3  d

Notice that the column is named 0 by default. Now, give it a descriptive name.

df.columns = ['Elements']
print(df)

Output:

  Elements
0        a
1        b
2        c
3        d

Create a DataFrame from a list of lists.

import pandas as pd
lst = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
df = pd.DataFrame(lst, columns=['Elements','Numbers'])
print(df)

Output:

  Elements  Numbers
0        a        1
1        b        2
2        c        3
3        d        4

Create a DataFrame from NumPy ndarray

Use a NumPy ndarray to construct the DataFrame.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
print(df)

Output:

   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Create a DataFrame from Dictionaries of Series

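When you build a DataFrame from a dictionary of Series, the dictionary keys become the column labels and the Series index values become the row index. The Series do not need to be the same length: pandas aligns them on the union of their index labels and fills any missing entries with NaN, as the FB column below shows.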

import pandas as pd
# Each Series is one column; FB has no 'Beta' entry, so that cell becomes NaN
summaries = {
    'AMZN': pd.Series([346.15, 0.59, 459, 0.52, 589.8, 158.88],
                      index=['Closing price', 'EPS', 'Shares Outstanding(M)', 'Beta', 'P/E', 'Market Cap(B)']),
    'GOOG': pd.Series([1133.43, 36.05, 335.83, 0.87, 31.44, 380.64],
                      index=['Closing price', 'EPS', 'Shares Outstanding(M)', 'Beta', 'P/E', 'Market Cap(B)']),
    'FB': pd.Series([61.48, 0.59, 2450, 104.93, 150.92],
                    index=['Closing price', 'EPS', 'Shares Outstanding(M)', 'P/E', 'Market Cap(B)']),
    'YHOO': pd.Series([34.90, 1.27, 1010, 27.48, 0.66, 35.36],
                      index=['Closing price', 'EPS', 'Shares Outstanding(M)', 'P/E', 'Beta', 'Market Cap(B)']),
}

stock_df = pd.DataFrame(summaries)
print(stock_df)

Output:

                         AMZN     GOOG       FB     YHOO
Beta                     0.52     0.87      NaN     0.66
Closing price          346.15  1133.43    61.48    34.90
EPS                      0.59    36.05     0.59     1.27
Market Cap(B)          158.88   380.64   150.92    35.36
P/E                    589.80    31.44   104.93    27.48
Shares Outstanding(M)  459.00   335.83  2450.00  1010.00
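The FB Series above has only five entries and no 'Beta' label, which is why that cell is NaN. The same alignment behavior can be seen in a smaller sketch; the labels and numbers here are made up.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 30.0], index=['x', 'z'])   # no 'y' entry

# Rows are aligned on the index labels, not on position.
df = pd.DataFrame({'a': a, 'b': b})
print(df)

# Count the cells that were filled with NaN during alignment.
missing = int(df.isna().sum().sum())
print(missing)
```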

Create a DataFrame from the Dictionary of Lists

You can create a DataFrame from a dictionary of lists. The keys become the column labels, and all lists must have the same length. Because no index is provided, the row labels default to a RangeIndex, that is, 0 through n-1.

import pandas as pd
iris = {'id':[1, 2, 3, 4, 5, 6],
'SepalLengthCm':[5.1, 4.9, 7, 6.4, 6.3, 5.8],
'SepalWidthCm':[3.5, 3, 3.2, 3.2, 3.3, 2.7],
'PetalLengthCm':[1.4, 1.4, 4.7, 4.5, 6, 5.1],
'PetalWidthCm':[0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
'Species':['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica']}

iris_df = pd.DataFrame(iris)
print(iris_df)

Output:

   id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm     Species
0   1            5.1           3.5            1.4           0.2      setosa
1   2            4.9           3.0            1.4           0.2      setosa
2   3            7.0           3.2            4.7           1.4  versicolor
3   4            6.4           3.2            4.5           1.5  versicolor
4   5            6.3           3.3            6.0           2.5   virginica
5   6            5.8           2.7            5.1           1.9   virginica
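Plain lists carry no labels to align on, so lists of different lengths cannot be combined this way. The sketch below uses made-up data to show what happens; to my knowledge pandas raises a ValueError in this situation.

```python
import pandas as pd

# Hypothetical input with mismatched lengths (3 and 2).
bad = {'id': [1, 2, 3], 'Species': ['setosa', 'versicolor']}

try:
    pd.DataFrame(bad)
    raised = False
except ValueError as err:
    raised = True
    print('Could not build DataFrame:', err)
```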

DataFrame Operations: Selection, Assignment, Addition, and Deletion

Pandas DataFrames are mutable data structures: you can easily select columns and rows, update values, and add or delete columns and rows. Let us work through some examples.

Selection

The DataFrame has both a row and column index. Therefore, you can conveniently select the desired rows and columns from the DataFrame. 

You can select a column in a DataFrame as a Series either by dict-like notation or by attribute.

# Create IRIS DataFrame from Dicts
import pandas as pd
iris = {'id':[1, 2, 3, 4, 5, 6],
'SepalLengthCm':[5.1, 4.9, 7, 6.4, 6.3, 5.8],
'SepalWidthCm':[3.5, 3, 3.2, 3.2, 3.3, 2.7],
'PetalLengthCm':[1.4, 1.4, 4.7, 4.5, 6, 5.1],
'PetalWidthCm':[0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
'Species':['Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica', 'Iris-virginica']}

iris_df = pd.DataFrame(iris, index=['a','b','c','d', 'e', 'f'])

# Selecting Column
iris_df['Species']

Output:

a        Iris-setosa
b        Iris-setosa
c    Iris-versicolor
d    Iris-versicolor
e     Iris-virginica
f     Iris-virginica
Name: Species, dtype: object

Another way to select a column is attribute access.

iris_df.SepalLengthCm

Output:

a    5.1
b    4.9
c    7.0
d    6.4
e    6.3
f    5.8
Name: SepalLengthCm, dtype: float64
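Attribute access is convenient, but it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. A short sketch of the difference, using hypothetical column names picked to trigger both cases:

```python
import pandas as pd

# Hypothetical columns chosen to show where attribute access breaks down.
df = pd.DataFrame({'SepalLengthCm': [5.1, 4.9],
                   'petal length': [1.4, 1.4],   # contains a space
                   'count': [1, 2]})             # clashes with DataFrame.count()

# Both notations return the same Series for a simple column name.
same = df['SepalLengthCm'].equals(df.SepalLengthCm)

# Bracket notation is required for the other two columns.
petal = df['petal length']
counts = df['count']

# df.count is the method, not the column.
count_is_method = callable(df.count)
print(same, count_is_method)
```

For this reason, bracket notation is the safer default in scripts.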

Rows can be retrieved by position or by label. Two important pandas indexers to remember:

  • DataFrame.iloc[] – access a group of rows and columns by integer position.
  • DataFrame.loc[] – access a group of rows and columns by label(s).

# Row selection - 4th row (indexing starts at 0)
iris_df.iloc[3]

Output:

id                             4
SepalLengthCm                6.4
SepalWidthCm                 3.2
PetalLengthCm                4.5
PetalWidthCm                 1.5
Species          Iris-versicolor
Name: d, dtype: object

Note that the above command returns the row as a Series. To get the result as a DataFrame instead, pass a list of positions, which is what the double square brackets do.

row = iris_df.iloc[[3]]
print("row type: ", type(row))

# Print row
print(row)

Output:

row type:  <class 'pandas.core.frame.DataFrame'>
   id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm          Species 
d   4            6.4           3.2            4.5           1.5  Iris-versicolor  

To select a row by its index label, use loc. The example below uses the label 'b'.

# Row selection by using Index
iris_df.loc['b']

Output:

id                         2
SepalLengthCm            4.9
SepalWidthCm               3
PetalLengthCm            1.4
PetalWidthCm             0.2
Species          Iris-setosa
Name: b, dtype: object

Now, we will look into ways to select multiple rows and columns.

# Selecting Multiple Columns, and all rows
print(iris_df.iloc[:,1:5])

Output:

   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
a            5.1           3.5            1.4           0.2
b            4.9           3.0            1.4           0.2
c            7.0           3.2            4.7           1.4
d            6.4           3.2            4.5           1.5
e            6.3           3.3            6.0           2.5
f            5.8           2.7            5.1           1.9

Selecting multiple rows using the Index.

# Selecting Multiple rows, and all columns
print(iris_df.loc[['a', 'd']])

Output:

   id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm          Species 
a   1            5.1           3.5            1.4           0.2      Iris-setosa 
d   4            6.4           3.2            4.5           1.5  Iris-versicolor   

You can also select specific rows and columns at the same time.

# Selection of rows and columns by position
print(iris_df.iloc[2:5, 3:6])

Output:

   PetalLengthCm  PetalWidthCm          Species
c            4.7           1.4  Iris-versicolor
d            4.5           1.5  Iris-versicolor
e            6.0           2.5   Iris-virginica

The same kind of selection works with index labels. Unlike positional slices, label slices include both endpoints.

# Selection of rows and columns by Index
print(iris_df.loc['a':'d', 'id':'SepalWidthCm'])

Output:

   id  SepalLengthCm  SepalWidthCm
a   1            5.1           3.5
b   2            4.9           3.0
c   3            7.0           3.2
d   4            6.4           3.2
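The two slicing styles treat the end of the range differently, which is easy to miss. A minimal side-by-side sketch on a one-column frame built for this example:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6]},
                  index=['a', 'b', 'c', 'd', 'e', 'f'])

by_position = df.iloc[0:3]     # positions 0, 1, 2 (end excluded)
by_label = df.loc['a':'d']     # labels a, b, c, d (end included)

print(list(by_position.index))
print(list(by_label.index))
```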

Assignment

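You can modify rows or columns by assignment. Assigning a single value to a row sets every column in that row, including the Species column, as the output below shows.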

# Set value for entire row
iris_df.loc['d'] = 1.5
print(iris_df)

Output:

    id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm           Species
a  1.0            5.1           3.5            1.4           0.2       Iris-setosa
b  2.0            4.9           3.0            1.4           0.2       Iris-setosa
c  3.0            7.0           3.2            4.7           1.4   Iris-versicolor
d  1.5            1.5           1.5            1.5           1.5               1.5
e  5.0            6.3           3.3            6.0           2.5    Iris-virginica
f  6.0            5.8           2.7            5.1           1.9    Iris-virginica

Set a value for an entire column.

# Set value for entire Column
iris_df.loc[:, 'PetalWidthCm'] = 9.99
print(iris_df)

Output:

    id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm           Species
a  1.0            5.1           3.5            1.4          9.99       Iris-setosa
b  2.0            4.9           3.0            1.4          9.99       Iris-setosa
c  3.0            7.0           3.2            4.7          9.99   Iris-versicolor
d  1.5            1.5           1.5            1.5          9.99               1.5
e  5.0            6.3           3.3            6.0          9.99    Iris-virginica
f  6.0            5.8           2.7            5.1          9.99    Iris-virginica

Conditional assignment updates only the cells that match a condition.

# Conditional update
iris_df.loc[iris_df['SepalWidthCm'] < 3.1, ['SepalWidthCm']] = 3.99
print(iris_df)

Output:

    id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm           Species
a  1.0            5.1          3.50            1.4          9.99       Iris-setosa
b  2.0            4.9          3.99            1.4          9.99       Iris-setosa
c  3.0            7.0          3.20            4.7          9.99   Iris-versicolor
d  1.5            1.5          3.99            1.5          9.99               1.5
e  5.0            6.3          3.30            6.0          9.99    Iris-virginica
f  6.0            5.8          3.99            5.1          9.99    Iris-virginica

Observe that only the SepalWidthCm values that were less than 3.1 have been updated.
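It can help to look at the condition on its own. The comparison produces a boolean Series, and loc uses it to pick the rows to update. A minimal sketch on a small frame made up for this illustration:

```python
import pandas as pd

df = pd.DataFrame({'SepalWidthCm': [3.5, 3.0, 3.2, 2.7]},
                  index=['a', 'b', 'c', 'd'])

# The comparison yields a boolean Series aligned with the rows.
mask = df['SepalWidthCm'] < 3.1
print(mask)

# Rows where the mask is True are updated; the rest are left untouched.
df.loc[mask, 'SepalWidthCm'] = 3.99
print(df)
```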

Addition

You can add new columns and rows to an existing DataFrame.

# Addition of Column - All values set to New
iris_df['Species_New'] = "New"
print(iris_df)

Output:

    id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm           Species Species_New 
a  1.0            5.1          3.50            1.4          9.99       Iris-setosa         New 
b  2.0            4.9          3.99            1.4          9.99       Iris-setosa         New 
c  3.0            7.0          3.20            4.7          9.99   Iris-versicolor         New
d  1.5            1.5          3.99            1.5          9.99               1.5         New
e  5.0            6.3          3.30            6.0          9.99    Iris-virginica         New
f  6.0            5.8          3.99            5.1          9.99    Iris-virginica         New
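A new column does not have to be a constant. It can also be computed from existing columns, row by row. In the sketch below, SepalRatio is a made-up column name used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'SepalLengthCm': [5.1, 7.0],
                   'SepalWidthCm': [3.5, 3.2]})

# Element-wise division of two columns; the result is stored as a new column.
df['SepalRatio'] = df['SepalLengthCm'] / df['SepalWidthCm']
print(df)
```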

Next, create a new DataFrame and add its rows to the end of the existing DataFrame.

# Build a second DataFrame holding the rows to add
iris_new = {'id':[7, 8],
'SepalLengthCm':[5.1, 5.8],
'SepalWidthCm':[3.5, 2.7],
'PetalLengthCm':[1.4, 1.4],
'PetalWidthCm':[1.4, 1.5],
'Species':['Iris-versicolor', 'Iris-virginica']}

iris_df_new = pd.DataFrame(iris_new, index=['a','g'])
print(iris_df_new)

Output:

   id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm          Species 
a   7            5.1           3.5            1.4           1.4  Iris-versicolor 
g   8            5.8           2.7            1.4           1.5   Iris-virginica

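Use pd.concat to add these rows to the DataFrame. Older tutorials use DataFrame.append for this, but that method was deprecated in pandas 1.4 and removed in pandas 2.0.

# Concatenate the two frames; ignore_index=True renumbers the rows from 0
print(pd.concat([iris_df, iris_df_new], ignore_index=True))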

Output:

    id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm           Species Species_New
0  1.0            5.1          3.50            1.4          9.99       Iris-setosa         New
1  2.0            4.9          3.99            1.4          9.99       Iris-setosa         New
2  3.0            7.0          3.20            4.7          9.99   Iris-versicolor         New
3  1.5            1.5          3.99            1.5          9.99               1.5         New
4  5.0            6.3          3.30            6.0          9.99    Iris-virginica         New
5  6.0            5.8          3.99            5.1          9.99    Iris-virginica         New
6  7.0            5.1          3.50            1.4          1.40   Iris-versicolor         NaN
7  8.0            5.8          2.70            1.4          1.50    Iris-virginica         NaN
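For a single row, assigning to a label that does not exist yet also adds it. A minimal sketch with made-up values, where the label 'g' is new:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'Species': ['Iris-setosa', 'Iris-virginica']},
                  index=['a', 'b'])

# 'g' is not in the index yet, so this creates a new row labelled 'g'.
df.loc['g'] = [8, 'Iris-virginica']
print(df)
```

Unlike the example above, this modifies the DataFrame in place rather than returning a new one.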

Deletion

In our existing DataFrame, “Species_New” is the column we added earlier. Now, let us delete this column using the “del” statement.

# Delete entire column
del iris_df['Species_New']
print(iris_df)

Output:

    id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm           Species 
a  1.0            5.1          3.50            1.4          9.99       Iris-setosa
b  2.0            4.9          3.99            1.4          9.99       Iris-setosa
c  3.0            7.0          3.20            4.7          9.99   Iris-versicolor
d  1.5            1.5          3.99            1.5          9.99               1.5
e  5.0            6.3          3.30            6.0          9.99    Iris-virginica
f  6.0            5.8          3.99            5.1          9.99    Iris-virginica

You can also use “DataFrame.pop(item)” to remove a column; it returns the removed column as a Series.

A row can be deleted by its index label using drop, which returns a new DataFrame and leaves the original unchanged.
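A brief sketch of pop on a small frame built for this example, showing that the column is removed in place and handed back to the caller:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'Species_New': ['New', 'New']})

# pop removes the column in place and returns it as a Series.
removed = df.pop('Species_New')
print(list(df.columns))
print(removed)
```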

# Row Deletion - Drop row where index is equal to f
print(iris_df.drop('f'))

Output:

    id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm           Species 
a  1.0            5.1          3.50            1.4          9.99       Iris-setosa
b  2.0            4.9          3.99            1.4          9.99       Iris-setosa
c  3.0            7.0          3.20            4.7          9.99   Iris-versicolor
d  1.5            1.5          3.99            1.5          9.99               1.5
e  5.0            6.3          3.30            6.0          9.99    Iris-virginica
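Note that drop returns a new DataFrame, so the original keeps its rows unless you reassign the result. The sketch below, on a toy frame made up for this purpose, also shows dropping several labels at once and dropping a column with the columns keyword.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'Species': ['x', 'y', 'z']},
                  index=['a', 'b', 'c'])

# drop returns a new DataFrame; df itself still has all three rows.
fewer_rows = df.drop(['a', 'c'])
rows_before_reassign = len(df)

# The columns keyword drops columns instead of rows.
fewer_cols = df.drop(columns=['Species'])

# Reassign to keep the change.
df = df.drop('b')
print(df)
```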

Conclusion

By now you have seen why the DataFrame is so popular in the data science community, where it is used extensively for data analysis. There are many other important functions to learn before you master the DataFrame. Stay tuned; we will cover them in another post.

Hopefully, you have enjoyed the post so far.
