Understanding the Differences Between iloc and loc in Python Pandas

💡 Problem Formulation: When working with data in Python’s Pandas library, you often need to select subsets of a DataFrame. The two main indexers for this task are loc and iloc. They look similar at first glance but serve different needs: loc selects by row and column labels, while iloc selects by integer position. Whenever you need to extract specific rows or columns from a DataFrame, knowing which of the two to reach for is essential.

Method 1: Indexing with Labels using loc

loc selects data by the labels of rows and columns. You pass the row and column labels you want, either as single labels, lists of labels, or label slices. Unlike ordinary Python slicing, a loc slice includes its end label, so both endpoints of the range are returned.

Here’s an example:

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}, index=['row1', 'row2', 'row3'])

result = df.loc['row1':'row2', 'A':'B']
print(result)

Output:

      A  B
row1  1  4
row2  2  5

This code snippet creates a DataFrame and selects rows ‘row1’ through ‘row2’ and columns ‘A’ through ‘B’ using loc. Since the selection is label-based, both the end row and the end column are included in the result set.
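Besides label slices, loc also accepts a single label or a list of labels. The following is a minimal sketch of those forms; it rebuilds the same DataFrame so it runs on its own, and the variable names (single_value, row_series, picked) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}, index=['row1', 'row2', 'row3'])

# One row label plus one column label returns a scalar.
single_value = df.loc['row2', 'B']

# A row label on its own returns that row as a Series.
row_series = df.loc['row1']

# Lists of labels return exactly those rows and columns, in the order given.
picked = df.loc[['row3', 'row1'], ['C', 'A']]

print(single_value)
print(picked)
```

A list does not need to be contiguous or sorted, which makes it handy when the labels you want do not form a range.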

Method 2: Indexing with Positions using iloc

iloc selects rows and columns by their integer position. It follows the same rules as Python’s list slicing: positions start at zero and the end of a slice is excluded. Because iloc looks only at position, it behaves the same way whether the index holds strings, dates, or integers.

Here’s an example:

result = df.iloc[0:2, 0:2]
print(result)

Output:

      A  B
row1  1  4
row2  2  5

This snippet returns the same result as the first example but selects by integer position. The slice 0:2 covers positions 0 and 1 for both rows and columns (position 2 is excluded), and the row and column labels play no part in the selection.
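The difference between the two indexers is easiest to see when the index holds integers that do not match the positions. The sketch below uses a hypothetical DataFrame, df_int, with row labels 10, 20 and 30, chosen purely for illustration.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2.
df_int = pd.DataFrame({'A': [1, 2, 3]}, index=[10, 20, 30])

by_label = df_int.loc[10, 'A']      # label 10 is the first row
by_position = df_int.iloc[0, 0]     # position 0 is also the first row

label_slice = df_int.loc[10:20]     # end label included: labels 10 and 20
position_slice = df_int.iloc[0:1]   # end position excluded: first row only

# Label 0 does not exist in this index, so loc raises KeyError.
try:
    df_int.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(len(label_slice), len(position_slice), label_zero_found)
```

With an integer index, loc still treats every number as a label, so it is worth checking what the index actually contains before choosing between the two.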

Method 3: Boolean Indexing with loc

loc also accepts a boolean vector and returns the rows or columns where it is True. This is particularly useful for filtering a DataFrame on a condition applied to column values, because you can select data without knowing the row labels in advance.

Here’s an example:

result = df.loc[df['A'] > 1]
print(result)

Output:

      A  B  C
row2  2  5  8
row3  3  6  9

The example selects all rows from the DataFrame where the value in column ‘A’ is greater than 1. Using boolean indexing with loc allows for the flexibility of condition-based selection within a Pandas DataFrame.
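Conditions can be combined, and loc can pick columns in the same call. The sketch below illustrates this under the assumption that the DataFrame from Method 1 is rebuilt first; the names mask, filtered and filtered_by_position are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}, index=['row1', 'row2', 'row3'])

# Combine conditions with & (and) or | (or); each condition needs parentheses.
mask = (df['A'] > 1) & (df['C'] < 9)

# Filter rows with the mask and pick columns by label in the same call.
filtered = df.loc[mask, ['A', 'C']]

# iloc does not accept a boolean Series, so pass the underlying NumPy array.
filtered_by_position = df.iloc[mask.to_numpy(), [0, 2]]

print(filtered)
```

Only ‘row2’ satisfies both conditions here, so both selections return that single row with columns ‘A’ and ‘C’.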

Method 4: Slicing with Data Types using loc

Because loc works on labels, it is indifferent to the data types stored in the columns: a label slice returns every column between the two labels, whether those columns hold integers, floats, or strings. The labels themselves can be strings, integers, or a mix. Note that iloc does not fail on a non-integer index either, since it ignores labels entirely. What does raise a KeyError is passing loc a label that is not in the index, for example an integer position on a string-labeled index.

Here’s an example:

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4.5, 5.5, 6.5],
    'C': ['one', 'two', 'three']
})

result = df.loc[:, 'B':'C']
print(result)

Output:

     B      C
0  4.5    one
1  5.5    two
2  6.5  three

In this snippet, we select all rows and the columns from ‘B’ through ‘C’, regardless of the data type each column holds. The rows keep the default integer labels 0 to 2 because no index was passed. The label slice works the same way whatever the columns contain.
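For comparison, the same columns can be selected by position. The sketch below recreates the mixed-type DataFrame under the hypothetical name df_mixed and also shows that, with default integer row labels, loc and iloc slices still differ at the endpoint.

```python
import pandas as pd

df_mixed = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4.5, 5.5, 6.5],
    'C': ['one', 'two', 'three']
})

by_label = df_mixed.loc[:, 'B':'C']     # columns 'B' through 'C', inclusive
by_position = df_mixed.iloc[:, 1:3]     # positions 1 and 2, end excluded

# With the default RangeIndex, row labels happen to equal row positions,
# but a loc slice still includes its end label while iloc excludes it.
rows_by_label = df_mixed.loc[0:1]       # labels 0 and 1 -> two rows
rows_by_position = df_mixed.iloc[0:1]   # position 0 only -> one row

print(by_label.equals(by_position))
print(len(rows_by_label), len(rows_by_position))
```

The column selections match, while the two row slices return two rows and one row respectively, which is a common source of off-by-one surprises.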

Bonus One-Liner Method 5: Chained Indexing Using loc and iloc

Sometimes you need both labels and positions in one selection, for example a row known by its label and a column known by its position. One way to do this is to chain the indexers, applying one to the result of the other. Note, however, that chained indexing can lead to unexpected results, particularly when assigning values, because of how Pandas handles views and copies of DataFrames.

Here’s an example:

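# Rebuild the row-labeled DataFrame from Method 1 (df was redefined in Method 4)
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, index=['row1', 'row2', 'row3'])
result = df.loc['row1'].iloc[0]
print(result)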

Output:

1

This one-liner first selects the row ‘row1’ with loc, returning a Series, then immediately selects the first item of that Series with iloc, yielding the scalar value at position (0, 0) in the original DataFrame.
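When the chain is only needed to mix a label with a position, a single indexing step is usually safer. The sketch below shows one possible approach, assuming the row-labeled DataFrame from Method 1: it converts the label to a position with Index.get_loc and, for scalars, uses the at and iat accessors. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}, index=['row1', 'row2', 'row3'])

chained = df.loc['row1'].iloc[0]          # two indexing steps

# Translate the row label into a position, then index once with iloc.
row_pos = df.index.get_loc('row1')
single_step = df.iloc[row_pos, 0]

# For scalar access, at (labels) and iat (positions) are the dedicated accessors.
scalar_by_label = df.at['row1', 'A']
scalar_by_position = df.iat[0, 0]

# Assign through one indexer call so the write reaches the original DataFrame.
df.loc['row1', 'A'] = 100

print(chained, single_step, scalar_by_label, scalar_by_position)
print(df.loc['row1', 'A'])
```

All four reads return the same value; the single-step forms matter most for assignment, where a chained expression may end up writing to a temporary copy.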

Summary/Discussion

  • Method 1: Indexing with Labels using loc. Strengths: Works well with label-based indices, including inclusive label ranges. Weaknesses: Requires knowing the index and column labels; an integer passed to loc is treated as a label, not a position.
  • Method 2: Indexing with Positions using iloc. Strengths: Ideal for selection by position and follows Python’s standard zero-based, end-exclusive slicing. Weaknesses: Cannot select by label, so positions must be looked up when only the labels are known.
  • Method 3: Boolean Indexing with loc. Strengths: Highly flexible for conditional selections. Weaknesses: The boolean mask has to be computed first, which is an extra step.
  • Method 4: Slicing with Data Types using loc. Strengths: Label slices work regardless of the data types stored in the columns. Weaknesses: Less intuitive for those more familiar with position-based indexing, since the end label is included.
  • Bonus One-Liner Method 5: Chained Indexing. Strengths: Combines label-based and position-based selection in one expression. Weaknesses: Can trigger a SettingWithCopyWarning or produce unexpected results, especially on assignment. Use it with care.