10 Minutes to Pandas (in 5 Minutes)

This tutorial gives you a quick and dirty introduction to the most important Pandas features. A popular quickstart to the Pandas library is the official “10 Minutes to Pandas” guide.

This tutorial aims to cover the most important 80% of the official guide in 50% of the time. Are you ready to invest 5 of your precious minutes to get started with Pandas and boost your data science and Python skills at the same time? Let’s dive right into it!

Visual Overview [Cheat Sheet]

I always find it useful to give a quick overview of the topics covered—in visual form. To help you grasp the big picture, I’ve visualized the topics described in this article in the following Pandas cheat sheet:

Let’s go over the different parts of this visual overview step-by-step.

How to Use Pandas?

You access the Pandas library with the import pandas as pd statement, which assigns the short name pd to the module for brevity. Instead of pandas.somefunction(), you can now call pd.somefunction().

import pandas as pd

You can install the Pandas library in your virtual environment or on your computer by using the following command:

pip install pandas

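If Pandas is not installed, importing it fails with an error like the following (on Python 3.6 and later, the exception is reported as ModuleNotFoundError, a subclass of ImportError):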

>>> import pandas as pd
Traceback (most recent call last):
  File "yourApp.py", line 1, in <module>
    import pandas as pd 
ImportError: No module named pandas

Pandas comes preinstalled in many environments, such as Anaconda. You can find a detailed installation guide here:

Installation guide: https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html
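
To confirm that the installation worked, one quick check is to print the installed version. This is only a minimal sketch; the variable name version is chosen here for illustration and does not come from the official guide.

```python
import pandas as pd

# pd.__version__ holds the installed Pandas version as a string
version = pd.__version__
print(version)
```

If the import succeeds and a version string is printed, Pandas is ready to use.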

How to Create Objects in Pandas?

The two most important data types in Pandas are Series and DataFrames.

A Pandas Series is a one-dimensional labeled array of data values. Think of it as a column in an Excel sheet.
A Pandas DataFrame is a two-dimensional labeled data structure—much like a spreadsheet (e.g., Excel) in your Python code.

Those two data structures are labeled—we call the labels indices of the data structures. The main difference is that the Series is one-dimensional while the DataFrame is two-dimensional.

Series: Here’s an example of how to create a Series object:

import pandas as pd


s = pd.Series([42, 21, 7, 3.5])
print(s)
'''
0    42.0
1    21.0
2     7.0
3     3.5
dtype: float64
'''

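You use the pd.Series() constructor and pass a flat list of values into it. You could also pass other data types such as strings. Pandas automatically determines one data type for the whole Series and stores it in the dtype attribute. Here, the integers are converted to floats because the list also contains 3.5, which is why 42 is printed as 42.0.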
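
The labels of a Series do not have to be the default integers 0, 1, 2, and so on. The following minimal sketch assumes you want your own labels; the labels 'a', 'b', and 'c' are made up for illustration.

```python
import pandas as pd

# Pass custom index labels instead of the default 0, 1, 2
s = pd.Series([42, 21, 7], index=['a', 'b', 'c'])

# Access a value by its label
print(s['b'])
# 21

# The labels themselves are stored in the index attribute
print(list(s.index))
# ['a', 'b', 'c']
```

Access by label works much like a dictionary lookup, which is what makes the Series a labeled array rather than a plain list.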

DataFrame: Here’s an example of how to create a DataFrame object:

import pandas as pd


s = pd.DataFrame({'age': 18,
                  'name': ['Alice', 'Bob', 'Carl'],
                  'cardio': [60, 70, 80]})

print(s)
'''
   age   name  cardio
0   18  Alice      60
1   18    Bob      70
2   18   Carl      80
'''

You use the pd.DataFrame() constructor with one argument: the dictionary that describes the DataFrame. The dictionary maps column names such as 'age', 'name', and 'cardio' to column values such as ['Alice', 'Bob', 'Carl'] for the column 'name'. You can also provide a single value such as 18 for a whole column such as 'age'. Pandas then automatically broadcasts that value to all rows of the DataFrame.
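
Broadcasting works for a single value, but lists of different lengths are a different story. The sketch below checks the shape of the DataFrame and then shows, under the assumption that one list is accidentally too short, that Pandas refuses to build the table. The flag mismatch_failed is a hypothetical helper name used only in this example.

```python
import pandas as pd

df = pd.DataFrame({'age': 18,
                   'name': ['Alice', 'Bob', 'Carl'],
                   'cardio': [60, 70, 80]})

# shape is a (rows, columns) tuple
print(df.shape)
# (3, 3)

# The single value 18 was copied into every row
print(df['age'].tolist())
# [18, 18, 18]

# Lists of different lengths cannot be aligned into rows
mismatch_failed = False
try:
    pd.DataFrame({'name': ['Alice', 'Bob', 'Carl'],
                  'cardio': [60, 70]})
except ValueError:
    mismatch_failed = True

print(mismatch_failed)
# True
```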

How to Select Elements in Series and DataFrames?

Let’s apply some first-principles thinking: both the Series and the DataFrame are data structures. The purpose of a data structure is to facilitate data storage, access, and analysis. Alternatively, you could store tabular data with rows and columns in a list of tuples, one per row, but data access would be inefficient and cumbersome. For example, accessing all elements of the i-th column would be painful because you’d have to traverse the whole list and collect the i-th value of each tuple.

Fortunately, Pandas makes data storage, access, and analysis of tabular data as simple as it can get. It is both efficient and readable.
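
To make the comparison concrete, here is a small sketch that stores the same table once as a list of tuples and once as a DataFrame. The variable names rows and cardio_list are chosen for illustration only.

```python
import pandas as pd

# Plain Python: one tuple per row in the order (age, name, cardio)
rows = [(18, 'Alice', 60),
        (18, 'Bob', 70),
        (18, 'Carl', 80)]

# Collecting the third column means visiting every row by hand
cardio_list = [row[2] for row in rows]
print(cardio_list)
# [60, 70, 80]

# Pandas: the same data, but with named columns
df = pd.DataFrame(rows, columns=['age', 'name', 'cardio'])
print(df['cardio'].tolist())
# [60, 70, 80]
```

With the list of tuples, you have to remember that the cardio value sits at position 2 of each tuple. With the DataFrame, you ask for the column by name.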

Column: Here’s how you can access a column with the indexing scheme you already know from Python dictionaries and NumPy arrays (square bracket notation):

import pandas as pd


s = pd.DataFrame({'age': 18,
                  'name': ['Alice', 'Bob', 'Carl'],
                  'cardio': [60, 70, 80]})
'''
   age   name  cardio
0   18  Alice      60
1   18    Bob      70
2   18   Carl      80
'''

# Select all elements in column 'age'
print(s['age'])
'''
0    18
1    18
2    18
Name: age, dtype: int64
'''

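After importing the Pandas module and creating a DataFrame with three columns and three rows, you select all values in the column labeled 'age' using the square bracket notation s['age']. A mostly equivalent alternative is the attribute syntax s.age, which works only if the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.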
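
The following sketch compares bracket and attribute access. It assumes a hypothetical column named 'count', picked because DataFrames also have a method called count, so the attribute syntax returns the method instead of the column.

```python
import pandas as pd

df = pd.DataFrame({'age': [18, 18, 18],
                   'count': [1, 2, 3]})

# Both notations return the same column for 'age'
print(df['age'].equals(df.age))
# True

# df.count is the DataFrame method, not the column
print(callable(df.count))
# True

# Bracket notation always returns the column
print(df['count'].tolist())
# [1, 2, 3]

# A list of labels selects several columns at once
both = df[['age', 'count']]
print(list(both.columns))
# ['age', 'count']
```

Because of such name clashes, bracket notation is the safer default in scripts.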

Rows: You can access specific rows in the DataFrame by using the slicing notation s[start:stop]. The row at position stop is excluded, so to access only one row, set stop to start + 1:

import pandas as pd


s = pd.DataFrame({'age': 18,
                  'name': ['Alice', 'Bob', 'Carl'],
                  'cardio': [60, 70, 80]})
'''
   age   name  cardio
0   18  Alice      60
1   18    Bob      70
2   18   Carl      80
'''


print(s[2:3])
'''
   age  name  cardio
2   18  Carl      80
'''
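
As a further sketch of row slicing, the example below selects the first two rows and shows what happens if you pass a single integer instead of a slice. The names first_two and single_int_failed are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'age': 18,
                   'name': ['Alice', 'Bob', 'Carl'],
                   'cardio': [60, 70, 80]})

# Rows at positions 0 and 1; the stop position 2 is excluded
first_two = df[0:2]
print(first_two['name'].tolist())
# ['Alice', 'Bob']

# A single integer in brackets is looked up as a column label,
# and this DataFrame has no column labeled 2
single_int_failed = False
try:
    df[2]
except KeyError:
    single_int_failed = True

print(single_int_failed)
# True
```

So square brackets with a slice select rows, while square brackets with a single label select a column.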

Boolean Indexing

A powerful way to access rows that match a certain condition is Boolean Indexing.

import pandas as pd


s = pd.DataFrame({'age': 18,
                  'name': ['Alice', 'Bob', 'Carl'],
                  'cardio': [60, 70, 80]})
'''
   age   name  cardio
0   18  Alice      60
1   18    Bob      70
2   18   Carl      80
'''


print(s[s['cardio']>60])
'''
   age  name  cardio
1   18   Bob      70
2   18  Carl      80
'''

The condition s['cardio']>60 results in a Series of Boolean values. The i-th Boolean value is True if the i-th element of the 'cardio' column is larger than 60. This holds for the last two rows of the DataFrame.

You then pass these Boolean values as an indexing scheme into the DataFrame s which results in a DataFrame with only two rows instead of three.
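
Here is a short sketch that makes the intermediate Boolean values visible and combines two conditions. The variable names mask, both, and either are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'age': 18,
                   'name': ['Alice', 'Bob', 'Carl'],
                   'cardio': [60, 70, 80]})

# The condition alone yields one Boolean value per row
mask = df['cardio'] > 60
print(mask.tolist())
# [False, True, True]

# Combine conditions with & (and) and | (or);
# each condition needs its own parentheses
both = df[(df['cardio'] > 60) & (df['name'] != 'Carl')]
print(both['name'].tolist())
# ['Bob']

either = df[(df['cardio'] < 70) | (df['name'] == 'Carl')]
print(either['name'].tolist())
# ['Alice', 'Carl']
```

Note that the Python keywords and and or do not work here, because the comparison has to be made element by element.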

Selection by Label

You can access a Pandas DataFrame by label using the indexing mechanism df.loc[rows, columns]. Here’s an example:

import pandas as pd


df = pd.DataFrame({'age': 18,
                   'name': ['Alice', 'Bob', 'Carl'],
                   'cardio': [60, 70, 80]})
'''
   age   name  cardio
0   18  Alice      60
1   18    Bob      70
2   18   Carl      80
'''


print(df.loc[:, 'name'])
'''
0    Alice
1      Bob
2     Carl
Name: name, dtype: object
'''

In the example, you access all rows from the column 'name'. To access all rows with the columns 'age' and 'cardio', pass a list of column labels:

print(df.loc[:, ['age', 'cardio']])
'''
   age  cardio
0   18      60
1   18      70
2   18      80
'''
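
You can also restrict the rows with loc. The sketch below assumes the default row labels 0, 1, 2 and selects only the first two rows. Unlike ordinary Python slices, a label slice in loc includes its end label.

```python
import pandas as pd

df = pd.DataFrame({'age': 18,
                   'name': ['Alice', 'Bob', 'Carl'],
                   'cardio': [60, 70, 80]})

# Row labels 0 through 1 (inclusive), columns 'age' and 'cardio'
subset = df.loc[0:1, ['age', 'cardio']]
print(subset)
'''
   age  cardio
0   18      60
1   18      70
'''

# A single row label and a single column label return one value
single = df.loc[1, 'name']
print(single)
'''
Bob
'''
```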

While the loc index provides you a way to access the DataFrame content by label, you can also access it by integer position using the iloc index.

Selection by Index

How to access the i-th row and the j-th column? The iloc index allows you to accomplish exactly that:

import pandas as pd


df = pd.DataFrame({'age': 18,
                   'name': ['Alice', 'Bob', 'Carl'],
                   'cardio': [60, 70, 80]})
'''
   age   name  cardio
0   18  Alice      60
1   18    Bob      70
2   18   Carl      80
'''

i, j = 2, 1
print(df.iloc[i, j])
'''
Carl
'''

The first argument i selects the row at position i and the second argument j selects the column at position j. Positions are zero-based, so the data value in the third row (index 2) and the second column (index 1) is 'Carl'.
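
The difference between positions and labels is easiest to see when the row labels are not integers. The following sketch assumes the hypothetical row labels 'a', 'b', and 'c', which do not appear in the examples above.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Carl'],
                   'cardio': [60, 70, 80]},
                  index=['a', 'b', 'c'])

# iloc uses positions; the stop position 2 is excluded
by_position = df.iloc[0:2, 0].tolist()
print(by_position)
# ['Alice', 'Bob']

# Negative positions count from the end: last row, second column
last_cardio = df.iloc[-1, 1]
print(last_cardio)
# 80

# loc uses labels; the end label 'b' is included
by_label = df.loc['a':'b', 'name'].tolist()
print(by_label)
# ['Alice', 'Bob']
```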

How to Modify an Existing DataFrame

You can use the selection techniques discussed above to modify and possibly overwrite a part of your DataFrame. To accomplish this, select the parts to be replaced or newly created on the left-hand side and put the new data on the right-hand side of the assignment statement. Here’s a minimal example that overwrites the integer values in the 'age' column:

import pandas as pd


df = pd.DataFrame({'age': 18,
                   'name': ['Alice', 'Bob', 'Carl'],
                   'cardio': [60, 70, 80]})
'''
   age   name  cardio
0   18  Alice      60
1   18    Bob      70
2   18   Carl      80
'''

df['age'] = 17

print(df)
'''
   age   name  cardio
0   17  Alice      60
1   17    Bob      70
2   17   Carl      80
'''

First, you select the age column with df['age']. Second, you overwrite it with the integer value 17. Pandas uses broadcasting to copy the single integer to all rows in the column.

Here’s a more advanced example that uses slicing and the loc index to overwrite all but the first row of the age column:

import pandas as pd


df = pd.DataFrame({'age': 18,
                   'name': ['Alice', 'Bob', 'Carl'],
                   'cardio': [60, 70, 80]})
'''
   age   name  cardio
0   18  Alice      60
1   18    Bob      70
2   18   Carl      80
'''

df.loc[1:,'age'] = 17

print(df)
'''
   age   name  cardio
0   18  Alice      60
1   17    Bob      70
2   17   Carl      80
'''

Can you spot the difference between the DataFrames?

Pandas is consistent here: once you understand the different indexing schemes (bracket notation, slicing, loc, and iloc), you’ll also understand how to overwrite existing data or add new data.
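
As one more sketch of this idea, you can combine a Boolean condition with loc to overwrite only the matching rows, or use iloc to overwrite a single cell by position. The values 17 and 65 are arbitrary example data.

```python
import pandas as pd

df = pd.DataFrame({'age': 18,
                   'name': ['Alice', 'Bob', 'Carl'],
                   'cardio': [60, 70, 80]})

# Overwrite 'age' only in rows where 'cardio' is larger than 60
df.loc[df['cardio'] > 60, 'age'] = 17
print(df['age'].tolist())
# [18, 17, 17]

# Overwrite one cell by position: first row, third column ('cardio')
df.iloc[0, 2] = 65
print(df['cardio'].tolist())
# [65, 70, 80]
```

Doing the row and column selection in a single loc or iloc call, rather than in two chained steps, keeps the assignment working on the original DataFrame.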

For example, here’s how you can add a new column with the loc index, slicing, and broadcasting:

df.loc[:,'love'] = 'Alice'
print(df)
'''
   age   name  cardio   love
0   18  Alice      60  Alice
1   17    Bob      70  Alice
2   17   Carl      80  Alice
'''
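
Plain bracket notation can add columns too, and the new values can be computed from existing columns. This is a minimal sketch; the column names 'cardio_plus_10' and 'above_65' are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'age': 18,
                   'name': ['Alice', 'Bob', 'Carl'],
                   'cardio': [60, 70, 80]})

# Broadcast a single value into a new column
df['love'] = 'Alice'

# Compute a new column element by element from an existing one
df['cardio_plus_10'] = df['cardio'] + 10
print(df['cardio_plus_10'].tolist())
# [70, 80, 90]

# A comparison produces a Boolean column
df['above_65'] = df['cardio'] > 65
print(df['above_65'].tolist())
# [False, True, True]

print(list(df.columns))
# ['age', 'name', 'cardio', 'love', 'cardio_plus_10', 'above_65']
```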

While Pandas has many more features, such as calculating statistics, plotting, grouping, and reshaping, to name just a few, the 5-minutes to Pandas tutorial ends here. If you understood the concepts discussed in this tutorial, you’ll be able to read and understand existing Pandas code with a little help from the official docs and Google to figure out the different functions.