Pandas DataFrame – Selecting and Indexing

bookcase, books, bookshelves

Pandas DataFrame – Selecting and Indexing is a continuation of the pandas DataFrame series. If you have not checked that article click here to understand pandas DataFrame, its properties, and basic operation such as selection, assignment, addition, and deletion.

Indexing in pandas provides the capability to work on the subset of the DataFrame. The index holds axis label information that helps in data alignment, and visualization.

Note: Index objects are immutable. You cannot modify the index.

Pandas Options for Selecting and Indexing

Three important pandas function to remember to perform indexing and slicing:

  • DataFrame.[]  – basic indexing option.
  • DataFrame.iloc[] – access a group of rows and columns by integer-based position.
  • DataFrame.loc[] – access a group of rows and columns by label(s).

Basic Indexing

We can use DataFrame.[] to select a subset from the pandas DataFrame.

We will continue to use our IRIS dataset example. 

import pandas as pd
iris = {'id':[1, 2, 3, 4, 5, 6],
      'SepalLengthCm':[5.1, 4.9, 7, 6.4, 6.3, 5.8],
      'SepalWidthCm':[3.5, 3, 3.2, 3.2, 3.3, 2.7],
      'PetalLengthCm':[1.4, 1.4, 4.7, 4.5, 6, 5.1],
      'PetalWidthCm':[0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
      'Species':['Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica', 'Iris-virginica']}
iris_df = pd.DataFrame(iris, index=['a','b','c','d', 'e', 'f'])
print(iris_df)

Output:

   id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm     Species
a   1            5.1           3.5            1.4           0.2      setosa
b   2            4.9           3.0            1.4           0.2      setosa
c   3            7.0           3.2            4.7           1.4  versicolor
d   4            6.4           3.2            4.5           1.5  versicolor
e   5            6.3           3.3            6.0           2.5   virginica
f   6            5.8           2.7            5.1           1.9   virginica

You can select a column in a DataFrame as a Series either by dict-like notation or by attribute.

print(iris_df['Species'])

output:

a        setosa
b        setosa
c    versicolor
d    versicolor
e     virginica
f     virginica
Name: Species, dtype: object

You can also use a .dot operator to select single column.

print(iris_df.Species)

Output:

a        setosa
b        setosa
c    versicolor
d    versicolor
e     virginica
f     virginica
Name: Species, dtype: object

Both the options return series objects. So to return the DataFrame. Let us see.

print(type(iris_df[['Species']]))
print(iris_df[['Species']])

Output:

<class 'pandas.core.frame.DataFrame'>
      Species
a      setosa
b      setosa
c  versicolor
d  versicolor
e   virginica
f   virginica

Pass a list of columns and order to [].

print(iris_df[['Species', 'id', 'PetalLengthCm']])

Output:

      Species  id  PetalLengthCm
a      setosa   1            1.4
b      setosa   2            1.4
c  versicolor   3            4.7
d  versicolor   4            4.5
e   virginica   5            6.0
f   virginica   6            5.1

Range-Based Selection

Selecting the first two rows. Remember that the first element position starts with zero in python.

# Selecting first two rows 
print(iris_df[:2])

Output:

   id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm Species
a   1            5.1           3.5            1.4           0.2  setosa
b   2            4.9           3.0            1.4           0.2  setosa

Select rows at intervals of three, starting from row 0. In our example, there are 6 rows and we have selected every 3rd rows.

print(iris_df[::3])

Output:

   id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm     Species
a   1            5.1           3.5            1.4           0.2      setosa
d   4            6.4           3.2            4.5           1.5  versicolor

Selecting the last row from the DataFrame using a negative index. “-1” returns the last row.

print(iris_df[-1:])

Output:

   id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm    Species
f   6            5.8           2.7            5.1           1.9  virginica

Label-Based Selection of Rows and Columns

The DataFrame.loc[] operator allows label-oriented indexing. 

Selecting a single row with an index value equal to “b”.

# Selecting single row - index value = "b"
print(iris_df.loc['b'])

output:

id                    2
SepalLengthCm       4.9
SepalWidthCm          3
PetalLengthCm       1.4
PetalWidthCm        0.2
Species          setosa
Name: b, dtype: object

Selecting a single column “Species” from DataFrame.

# Selecting species column
print(iris_df.loc[:, 'Species'])

Output:

a        setosa
b        setosa
c    versicolor
d    versicolor
e     virginica
f     virginica
Name: Species, dtype: object

Select a subset of rows and columns using the .loc method.

# Selecting rows - c to e, and columns from PetalLengthCm to Species
print(iris_df.loc['c':'e', 'PetalLengthCm':'Species'])

Output:

   PetalLengthCm  PetalWidthCm     Species
c            4.7           1.4  versicolor
d            4.5           1.5  versicolor
e            6.0           2.5   virginica

Integer-Based Selection of Rows and Columns

The DataFrame.iloc[] operator allows integer-oriented indexing.

Selecting a single row at 3rd position.

print(iris_df.iloc[2]) 

Output:

id                        3
SepalLengthCm             7
SepalWidthCm            3.2
PetalLengthCm           4.7
PetalWidthCm            1.4
Species          versicolor
Name: c, dtype: object

Selecting a single column

# Selecting a single column at first position
print(iris_df.iloc[:, 0])

Output:

a    1
b    2
c    3
d    4
e    5
f    6
Name: id, dtype: int64

Selecting multiple rows and columns using integer-based slicing.

print(iris_df.iloc[2:5, 3:6])

Output:

   PetalLengthCm  PetalWidthCm     Species
c            4.7           1.4  versicolor
d            4.5           1.5  versicolor
e            6.0           2.5   virginica

Conditional Slicing

Can we select rows and columns based on some condition? Yes, we can let’s look into some examples.

# Selecting Columns 'SepalWidthCm', 'Species' where SepalWidthCm less than 3.1
print(iris_df.loc[iris_df['SepalWidthCm'] < 3.1, ['SepalWidthCm', 'Species']])

Output:

   SepalWidthCm    Species
b           3.0     setosa
f           2.7  virginica

Conclusion

In this post, we have covered, basic selecting and indexing operations, label-based, integer-based, and conditional slicing methods to select the rows and columns.

Leave a Comment

Your email address will not be published. Required fields are marked *