Introduction
Something usual in data science is to work with data stored in a pandas dataframe.
Often there is a need to extract specific parts of the dataframe, like columns, rows, or both. The reasons for this could be many, the dataframe is large, and only certain parts are of interest.
This post will present how to access and slice out specific parts of a pandas dataframe content using the two pandas dataframe attributes .loc
and .iloc
.
Both .loc
and .iloc
are effective ways of slicing in a dataframe where the main difference between them is that .loc
is “label-based” while .iloc
is integer position-based.
What this means in practice will be explained using examples later.
pandas.DataFrame.loc
Let’s first see the .loc
attribute. The pandas documentation[1] says that .loc
is primarily label-based, but could also be used with a boolean array.
The allowed inputs are:
- A single label, e.g.
'name'
- A list or array of labels, e.g.
[
'
name1'
,'
name2'
,'
name3'
] - A slice object with labels e.g.
. To be noted here is that contrary to usual python slices both the start index and stop index are included as will be shown later on in this post.'
name1'
:'
name2'
- A boolean array of the same length as the axis being sliced, e.g.
[True, True, False]
- An alignable boolean series. The index of the key will be aligned before masking.
- An alignable index. The index of the returned selection will be the input.
- A callable function with one argument (the calling series or DataFrame) that returns valid output for indexing (one of the above)
You can learn more about the Pandas loc()
and iloc()
indexing schemes in this guide on the Finxter blog:
pandas.DataFrame.iloc
Next, let’s see the .iloc
attribute, which slices in the data frame similarly to .loc
.
Still, instead of providing labels as parameters which is the case with .loc
, the .iloc
attribute needs to be supplied with integer numbers.
See the full pandas documentation about the attribute for further information[2].
The allowed inputs are:
- A single integer number, e.g.
5
- A list or array of integers, e.g.
[1, 2, 3, 4]
- A slice objects with integers, e.g.
1:3
- A boolean array, e.g.
[True, True, False]
- A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value
Slicing by index using .loc
This section will present how to slice by index using both .loc
and .iloc
.
First off, let’s look at the dataframe which was used for the examples in this post. The dataframe contains data about some used cars. The dataframe has five columns and they are the following:
maker
: The maker of the carcolor
: The color of the carkilometers_run
: The number of kilometers the car has run.date_first_registered
: The date when the car was first registered for use in trafficfuel_type
: Which type of fuel the car uses
The data is imported from a CSV file and the resulting dataframe is displayed below. All the coding for this post was done using a jupyter notebook.
# Import the necessary python packages import pandas as pd # Load the data from a CSV-file to a dataframe and display it df = pd.read_csv('Used_Car_Data.csv') display(df)
As shown in image 1, the data is sorted based on how many kilometers the cars have run. Instead, let’s set the index of the dataframe to the “maker
” column and sort it by that.
A new dataframe was created for this purpose.
# Set the index by the 'maker' and sort it in ascending order df_sort_maker = df.set_index('maker').sort_index() display(df_sort_maker)
Now it’s time to slice some data from the dataframe using the .loc
attribute.
Below, two examples are presented, one where just a single maker is sliced out, and the other example will show how to pass a slice object with .loc
to slice out multiple makers.
# Slicing out all the cars of maker 'bmw' df_bmw = df_sort_maker.loc['bmw'] display(df_bmw)
# Slicing all the cars from 'citroen' up to 'volvo' df_citroen_to_volvo = df_sort_maker.loc['citroen':'volvo'] display(df_citroen_to_volvo)
Something important to note when using .loc
is that the ending index of the slice object, in this case, “volvo
”, is also included in the sliced data unlike how slice usually behaves in Python.
Instead of just using the carmaker as the index, we could also use the color of the cars as a second index to sort the data even better.
A new dataframe was created for this purpose.
""" Setting an outer index to 'maker' and an inner index to 'color' and sorting it in ascending order """ df_sort_maker_color = df.set_index(['maker', 'color']).sort_index() display(df_sort_maker_color)
Let’s slice the data using .loc
based on both the outer index(“maker
“) and the inner index(“color
“).
Tuples containing the start and end index for both inner and outer index are passed as input to .loc
. Maybe all the cars in the range from the red Alpha Romeos to the black Fords are of interest.
""" Slicing out the cars starting with red alpha romeos up to and inclunding black fords """ df_red_alpha_to_black_ford = df_sort_maker_color.loc[('alpha romeo', 'red'):('ford', 'black')] display(df_red_alpha_romeo_to_black_ford)
There is also the option to slice from the beginning up to and including a certain index or from a certain index until the end.
# Slicing all the cars up until and including the white citroens df_start_to_citroen_white = df_sort_maker_color.loc[:('citroen', 'white')] display(df_start_to_citroen_white)
Slicing by index using .iloc
Now, let’s slice the dataframe using the .iloc
attribute instead. As said in the introduction of this post .iloc
is integer position-based in contrast to .loc
.
The used car’s data from the previous examples will be used again, but this time the dataframes index will not be set to the “maker
“ column.
It will just have its original index as it was when first created. See image 1.
The default index column ranges from 0-18. Let’s slice based on the index position using .iloc
and passing integer numbers as input.
# Slicing from index 2 up until 8 using .iloc df_sliced = df.iloc[2:8] display(df_sliced)
Slicing from start to end index or start index to end using .iloc
.
# Slicing from the start up until index 5 df_sliced = df.iloc[:5] display(df_sliced)
# Slicing from start index 14 to the end df_sliced = df.iloc[14:] display(df_sliced)
Slicing columns
.loc
and .iloc
also provide the possibility of slicing out specific columns.
This could often be useful if the dataframe contains a lot of columns and there is a need for narrowing down the dataframe.
Say the “kilometers_run
“ column til the “fuel_type
“ column is of particular interest.
Once again, the unaltered dataframe with the default index will be used as an example. This might not provide the user with a lot of helpful information. It is simply a way of showing how to use .loc
for slicing columns.
""" Slicing columns from the color up until and including 'kilometers_run' column """ df_sliced_col = df.loc[:,'color':'kilometers_run'] display(df_sliced_col.head())
Just as when slicing rows, there is the possibility of slicing either from the start to a given end index or from a start index to the end. An example is when data is sliced from the first column until the “kilometers_run
“ column.
# Slicing columns up until and including 'kilometers_run' column df_sliced_col = df.loc[:,:'kilometers_run'] display(df_sliced_col.head())
Slicing columns are also possible using .iloc
and is done like this.
# Slicing columns using .iloc df_sliced_col = df.iloc[:, 1:4] display(df_sliced_col.head())
.iloc
can slice columns from the start to a given end index or from a given start index to the end.
# Slicing columns using .iloc df_sliced_col = df.iloc[:, 2:] display(df_sliced_col.head())
Slicing both index and columns
The .loc
and .iloc
attributes offer the possibility to slice on both index and columns simultaneously. This could be very useful.
As in the previous example, the red Alpha Romeos up to the black Fords are of interest.
But now, only the “kilometers_run
“ and “date_first_registered
“ columns should be included in the output, excluding the “fuel_type
“ column.
The dataframe that was sorted earlier based on the “maker
“ and the “color
“ will be used as an example. It could be reviewed in image 5.
""" Slicing the red alpha romeos up until and including the black fords. Only including the columns, 'kilometers_run' and 'date_first_registered' """ df_slice = df_sort_maker_color.loc[('alpha romeo', 'red'):('ford', 'black'), 'kilometers_run':'date_first_registered'] display(df_slice)
The same type of slicing is possible to do using .iloc
as well.
#Slicing both rows and columns using .iloc df.slice = df.iloc[4:12, 0:3] display(df.slice)
Slicing by date using .loc
Often the dataframe might have a column containing dates. In those cases, it could be helpful to do the slicing based on dates, e.g., all the data from one given date to one other given date.
.loc
offers an effective way of doing this. The dataframe containing the used cars has a column “date_first_registered
”.
It is easy to imagine that maybe someone wishes to see all the cars registered in between two dates. The “date_first_registered
“ column contains datetime objects ordered by YYYY-MM-DD
. This will now be used to slice out data.
First off, the index will be set and sorted using the “date_first_registered
“ column.
# Setting the index to 'date_first_registered' and sorting it in ascending order df_sort_date = df.set_index('date_first_registered').sort_index() display(df_sort_date)
Next, let’s slice data based on the date the car was first registered for traffic. The sliced data will be from 2012-01-01
to 2020-01-01
.
# Slicing data based on date df_sliced_date = df_sort_date.loc['2012-01-01':'2020-01-01'] display(df_sliced_date)
Here, a note is that this slicing technique will also work with less specific dates.
As for the example above, if the input to .loc
had only been in the YYYY
format, say '2012’:’2020’
, it would have sliced the data from and in between those years.
Conclusion
That’s it for this very basic introduction to slicing data using .loc
and .iloc
.
It’s practical and can be used in so many ways. This short guide only presents a few of them.
Also, the power and usability might be more noticeable when working with larger datasets than those used as examples in this guide.
All the best and happy slicing.
References
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html