Pandas DataFrame Attributes and Underlying Data

The Pandas DataFrame is a data structure that organizes data into a two-dimensional format. If you are familiar with Excel or Databases, the setup is similar. Each DataFrame contains a schema that defines a Column (Field) Name and a Data Type.

Below is an example of a Database Schema for Employees of the Finxter Academy.

This article delves into each method for the DataFrame Attributes and Underlying Data.

Preparation

Before any data manipulation can occur, one (1) new library will require installation.

The Pandas library enables access to/from a DataFrame.

To install this library, navigate to an IDE terminal and execute the code below at the command prompt. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

If the installation was successful, a message displays in the terminal indicating the same.

Feel free to view the PyCharm installation guide for the required library.

How to Install Pandas on PyCharm?

Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd

Create a DataFrame

The code below creates a DataFrame and outputs the same to the terminal.

finxters =  [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
             (1043, 'Micah', 'Howes', 'Manager', 95275),
             (1044, 'Hanna', 'Groves', 'Assistant', 65654),
             (1045, 'Steve', 'Brown', 'Coder', 88300),
             (1046, 'Harry', 'Green', 'Writer', 98314)]

df = pd.DataFrame(finxters)
print(df)

Line [1] creates a List of Tuples and saves it to finxters.
Line [2] converts the List of Tuples (finxters) into a DataFrame object.
Line [3] outputs the DataFrame to the terminal.

Output

	0	1	2	3	4
0	1042	Jayce	White	Data Scientist	155400
1	1043	Micah	Howes	Manager	95275
2	1044	Hanna	Groves	Assistant	65654
3	1045	Steve	Brown	Coder	88300
4	1046	Harry	Green	Writer	98314

DataFrame Columns

As shown in the output above, the columns have numbers rather than names. The code below resolves this issue by assigning names to the columns using the columns parameter.

finxters =  [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
             (1043, 'Micah', 'Howes', 'Manager', 95275),
             (1044, 'Hanna', 'Groves', 'Assistant', 65654),
             (1045, 'Steve', 'Brown', 'Coder', 88300),
             (1046, 'Harry', 'Green', 'Writer', 98314)]

cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)
print(df)

Line [1] creates a List of Tuples and saves it to finxters.
Line [2] assigns Column Names to a list (cols).
Line [3] creates a DataFrame and passes finxters and columns=cols.
Line [4] outputs the DataFrame to the terminal.

Output

	ID	First	Last	Job	Salary
0	1042	Jayce	White	Data Scientist	155400
1	1043	Micah	Howes	Manager	95275
2	1044	Hanna	Groves	Assistant	65654
3	1045	Steve	Brown	Coder	88300
4	1046	Harry	Green	Writer	98314
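The column labels can also be read back, or assigned after the DataFrame exists, through the df.columns attribute. The sketch below is a minimal illustration of that idea; the names column_names and df_unnamed are introduced here and are not part of the original example.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# df.columns is an Index object that holds the column labels
column_names = df.columns.tolist()
print(column_names)

# Start from a DataFrame with default integer labels, then assign names
df_unnamed = pd.DataFrame(finxters)
df_unnamed.columns = cols
print(df_unnamed.columns.tolist())
```

Both print statements should show the same list of five names, since assigning to df.columns relabels the existing columns without touching the data.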

💡 Note: Place the following lines of code at the top of each script.

finxters =  [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
             (1043, 'Micah', 'Howes', 'Manager', 95275),
             (1044, 'Hanna', 'Groves', 'Assistant', 65654),
             (1045, 'Steve', 'Brown', 'Coder', 88300),
             (1046, 'Harry', 'Green', 'Writer', 98314)]

cols = ['ID', 'First', 'Last', 'Job', 'Salary']

DataFrame Data Types

The dtypes property returns the Data Type for each column (field) in the DataFrame.

df = pd.DataFrame(finxters, columns=cols)
print(df.dtypes)

Line [1] assigns the Column Name from the list created earlier to columns=cols.
Line [2] outputs the Data Types to the terminal.

Output

ID	int64
First	object
Last	object
Job	object
Salary	int64
dtype: object

Use Square Brackets

Another way to determine the Data Type of a Column is to specify the Column Name inside square brackets. In this case, the Data Type for the ID column displays.

df = pd.DataFrame(finxters, columns=cols)
print(df['ID'].dtypes)

Line [1] assigns the Column Name from the list created earlier to columns=cols.
Line [2] outputs the Data Type to the terminal.

Output

int64
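If a script needs to branch on a column's type rather than print it, one possible approach is the helper functions in pd.api.types. This is a short sketch on the same sample data; salary_dtype, salary_is_int and first_is_int are illustrative names.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# A single column is a Series, which exposes its type as .dtype
salary_dtype = df['Salary'].dtype
print(salary_dtype)  # int64

# is_integer_dtype() answers the question as a plain True/False
salary_is_int = pd.api.types.is_integer_dtype(df['Salary'])
first_is_int = pd.api.types.is_integer_dtype(df['First'])
print(salary_is_int, first_is_int)  # True False
```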

DataFrame Info

The df.info() method outputs the DataFrame details, including the index data type, columns, non-null values, and memory usage.

The syntax for this method is as follows:

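DataFrame.info(verbose=None, buf=None, max_cols=None, 
               memory_usage=None, show_counts=None)

For additional details on the available parameters, see the pandas documentation for DataFrame.info(). Note that the older null_counts parameter was deprecated in favor of show_counts and removed in pandas 2.0.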

For this example, the verbose parameter is used. Setting this to True provides detailed information concerning the DataFrame.

df = pd.DataFrame(finxters, columns=cols)
print(df.info(verbose=True))

Line [1] assigns the Column Name from the list created earlier to columns=cols.
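Line [2] outputs the DataFrame information to the terminal. Because info() prints its report directly and returns None, wrapping the call in print() also prints None as the last line.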

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4        
Data columns (total 5 columns):      

#	Column	Non-Null Count	Dtype
0	ID	5 non-null   	int64
1	First  	5 non-null   	object
2	Last  	5 non-null   	object
3	Job  	5 non-null   	object
4	Salary  	5 non-null   	int64

dtypes: int64(2), object(3)
memory usage: 328.0+ bytes
None

Setting verbose=False summarizes the DataFrame information.

df = pd.DataFrame(finxters, columns=cols)
print(df.info(verbose=False))

Line [1] assigns the Column Name from the list created earlier to columns=cols.
Line [2] outputs the summarized DataFrame information to the terminal.

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4        
Columns: 5 entries, ID to Salary    
dtypes: int64(2), object(3)
memory usage: 328.0+ bytes
None
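Because info() writes its report to the terminal and returns None, the text cannot be captured by assigning the return value. One way to keep the report as a string is the buf parameter from the syntax above. The sketch below assumes an in-memory io.StringIO buffer; buffer, result and summary are illustrative names.

```python
import io
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# Send the report to a text buffer instead of the terminal
buffer = io.StringIO()
result = df.info(buf=buffer)

# info() itself returns None; the report lives in the buffer
summary = buffer.getvalue()
print(result)  # None
print(summary)
```

Calling df.info() on its own line, without print(), also avoids the trailing None shown in the outputs above.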

DataFrame Select Dtypes

The df.select_dtypes() method allows you to specify a column Data Type you wish to view (including all associated values).

Using the DataFrame created earlier, this code outputs the ID and Salary columns to the terminal. Both of these columns in our DataFrame have a Data Type of int64.

df = pd.DataFrame(finxters, columns=cols)
print(df.select_dtypes(include='int64'))

Line [1] assigns the Column Name from the list created earlier to columns=cols.
Line [2] outputs the values of these two columns to the terminal.

Output

	ID	Salary
0	1042	155400
1	1043	95275
2	1044	65654
3	1045	88300
4	1046	98314
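select_dtypes() also accepts the generic selector 'number' and an exclude parameter, which can be convenient when the exact integer or float width is not known in advance. A minimal sketch on the same sample data, with numeric and non_numeric as illustrative names:

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# 'number' matches every numeric dtype, not only int64
numeric = df.select_dtypes(include='number')
print(numeric.columns.tolist())      # ['ID', 'Salary']

# exclude returns the complement: every column that is not numeric
non_numeric = df.select_dtypes(exclude='number')
print(non_numeric.columns.tolist())  # ['First', 'Last', 'Job']
```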

DataFrame Axes

The df.axes property returns a list representing the axes of the DataFrame. The list holds the row axis labels first, followed by the column axis labels (see output below).

df = pd.DataFrame(finxters, columns=cols)
print(df.axes)

Line [1] assigns the Column Name from the list created earlier to columns=cols.
Line [2] outputs the DataFrame Axes information to the terminal.

Output

[RangeIndex(start=0, stop=5, step=1), 
Index(['ID', 'First', 'Last', 'Job', 'Salary'], 
dtype='object')]
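Since df.axes is an ordinary two-item list, it can be unpacked into separate variables. The sketch below assumes the same sample data; row_axis and column_axis are names chosen here for illustration.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# The first item is the row axis, the second is the column axis
row_axis, column_axis = df.axes
print(list(row_axis))     # [0, 1, 2, 3, 4]
print(list(column_axis))  # ['ID', 'First', 'Last', 'Job', 'Salary']

# The two items match df.index and df.columns
print(row_axis.equals(df.index), column_axis.equals(df.columns))  # True True
```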

DataFrame ndim

The df.ndim property returns an integer representing the total number of axes/array dimensions. If a Series, the value of 1 is returned. If a DataFrame, the value of 2 is returned.

df = pd.DataFrame(finxters, columns=cols)
print(df.ndim)
# 2

Line [1] assigns the Column Name from the list created earlier to columns=cols.
Line [2] outputs the value of ndim to the terminal. In this case, 2 because it is a DataFrame.
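To see the Series case next to the DataFrame case, a single column can be selected from the same data. This is a small sketch; frame_dims and series_dims are illustrative names.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# The whole DataFrame has two axes: rows and columns
frame_dims = df.ndim
print(frame_dims)   # 2

# Selecting one column yields a Series, which has a single axis
series_dims = df['Salary'].ndim
print(series_dims)  # 1
```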

DataFrame Size

The df.size property returns an integer representing the total number of elements in the DataFrame object. If a Series, the number of rows returns. If a DataFrame, the number of rows * the number of columns returns.

df = pd.DataFrame(finxters, columns=cols)
print(df.size)

Line [1] assigns the Column Name from the list created earlier to columns=cols.
Line [2] outputs the DataFrame Size to the terminal.

Output

25 (5 columns * 5 rows = 25)

DataFrame Shape

The df.shape property returns a tuple that represents the dimensionality of the DataFrame as (number of rows, number of columns).

df = pd.DataFrame(finxters, columns=cols)
print(df.shape)

Line [1] assigns the Column Name from the list created earlier to columns=cols.
Line [2] outputs the DataFrame Shape to the terminal.

Output

(5, 5)  (5 rows, 5 columns)
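With five rows and five columns, the order of the values in the tuple is hard to see. The sketch below takes only the first three rows so that the two numbers differ, and checks how shape relates to size; first_three, n_rows and n_cols are illustrative names.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# Keep three rows so that the row and column counts are different
first_three = df.head(3)
n_rows, n_cols = first_three.shape
print(n_rows, n_cols)    # 3 5

# size is the product of the two shape values
print(first_three.size)  # 15

# A Series has a one-item shape tuple
print(df['Salary'].shape)  # (5,)
```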

DataFrame Memory Usage

The df.memory_usage() method returns the memory usage of each column in bytes. The total memory usage is displayed by default in DataFrame.info(), and this method breaks that figure down by column.

Parameters

index	bool, default True	This parameter specifies whether to include the memory usage of the DataFrame index in the returned Series. If index=True, the memory usage of the index is the first item in the output.
deep	bool, default False	If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.

df = pd.DataFrame(finxters, columns=cols)
print(df.memory_usage(index=True))

Line [1] assigns the Column Name from the list created earlier to columns=cols.
Line [2] outputs the Memory Usage to the terminal.

Output

Index	128
ID	40
First	40
Last	40
Job	40
Salary	40
dtype: int64

df = pd.DataFrame(finxters, columns=cols)
print(df.memory_usage(index=True, deep=True))

Line [1] assigns the Column Name from the list created earlier to columns=cols.
Line [2] outputs the Memory Usage to the terminal.

Output

Index	128
ID	40
First	310
Last	311
Job	326
Salary	40
dtype: int64
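Because memory_usage() returns a Series, the per-column figures can be summed to compare the shallow and deep totals. The sketch below leaves the index out with index=False. The byte counts for the text columns depend on the pandas and Python versions in use, so only the integer columns are annotated. The variable names are illustrative.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# Per-column usage in bytes, without the index entry
shallow = df.memory_usage(index=False)
deep = df.memory_usage(index=False, deep=True)

# An int64 column stores 8 bytes per value: 5 rows * 8 bytes = 40
print(shallow['ID'], deep['ID'])  # 40 40

# Summing the Series gives the total for the data columns
total_shallow = shallow.sum()
total_deep = deep.sum()
print(total_shallow, total_deep)
```

The deep total is expected to be at least as large as the shallow one, since deep=True also counts the memory held by the values in the text columns.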

DataFrame Empty

The df.empty property checks to see if a DataFrame is empty. If empty, True returns. Otherwise, False returns.

df = pd.DataFrame(finxters, columns=cols)
print(df.empty)

Line [1] assigns the Column Name from the list created earlier to columns=cols.
Line [2] outputs True/False to the terminal. In this case, False, because the DataFrame contains data.
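For contrast, the sketch below builds a few DataFrames that do and do not count as empty. The filter threshold and the names no_match and only_nan are assumptions made for illustration.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# The sample DataFrame holds five rows, so it is not empty
print(df.empty)        # False

# A filter that matches no rows leaves the columns but no data
no_match = df[df['Salary'] > 1_000_000]
print(no_match.empty)  # True

# A DataFrame that contains only NaN still has a row, so it is not empty
only_nan = pd.DataFrame({'A': [float('nan')]})
print(only_nan.empty)  # False
```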

DataFrame Index

The df.set_index() method allows you to set a column as the index. If no index exists, an index (auto-increment) is automatically generated by default.

	ID	First	Last	Job	Salary
0	1042	Jayce	White	Data Scientist	155400
1	1043	Micah	Howes	Manager	95275
2	1044	Hanna	Groves	Assistant	65654
3	1045	Steve	Brown	Coder	88300
4	1046	Harry	Green	Writer	98314

For this example, the column Last will be the index.

df = pd.DataFrame(finxters, columns=cols)
df.set_index('Last', inplace=True)
print(df)

Line [1] assigns the Column Name from the list created earlier to columns=cols.
Line [2] sets Last as the index column, with inplace=True so that df itself is modified.
Line [3] outputs the DataFrame to the terminal.

💡 Note: When inplace=True, the DataFrame is modified in place and the method returns None. When inplace=False (the default), the original DataFrame is left unchanged and a new DataFrame with the updated index is returned.

Output

	ID	First	Job	Salary
Last
White	1042	Jayce	Data Scientist	155400
Howes	1043	Micah	Manager	95275
Groves	1044	Hanna	Assistant	65654
Brown	1045	Steve	Coder	88300
Green	1046	Harry	Writer	98314
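The default behavior (inplace=False) can be sketched as follows: set_index() returns a new DataFrame, the original keeps its Last column, and reset_index() turns the index back into a regular column. The names indexed and restored are illustrative.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# Without inplace=True, the result must be assigned to a variable
indexed = df.set_index('Last')
print('Last' in df.columns)       # True: the original is unchanged
print('Last' in indexed.columns)  # False: Last is now the index

# Rows can be looked up by the new index label
print(indexed.loc['Brown', 'Job'])  # Coder

# reset_index() moves the index back into the first column
restored = indexed.reset_index()
print(restored.columns.tolist())  # ['Last', 'ID', 'First', 'Job', 'Salary']
```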

DataFrame Set Flags

The df.set_flags() method allows you to set various flags. For this example, a flag is set to not allow duplicate labels in the DataFrame.

df = pd.DataFrame(finxters, columns=cols)
df1 = df.set_flags(allows_duplicate_labels=False)
print(df1)

Line [1] assigns the Column Name from the list created earlier to columns=cols.
Line [2] sets allows_duplicate_labels to False and assigns the result to a new DataFrame (df1).
Line [3] outputs df1 to the terminal. There is no change as the original DataFrame did not contain duplicate labels.

Output

	ID	First	Last	Job	Salary
0	1042	Jayce	White	Data Scientist	155400
1	1043	Micah	Howes	Manager	95275
2	1044	Hanna	Groves	Assistant	65654
3	1045	Steve	Brown	Coder	88300
4	1046	Harry	Green	Writer	98314
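To see the flag take effect, the sketch below tries to set it on a DataFrame whose column labels do repeat. The duplicated Name label (dup_cols, df_dup) is a hypothetical variation of the sample data, and the exception name assumes a pandas version that provides pd.errors.DuplicateLabelError.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# The flag can be read back from the flags attribute
df1 = df.set_flags(allows_duplicate_labels=False)
print(df1.flags.allows_duplicate_labels)  # False

# Hypothetical variation: two columns share the label 'Name'
dup_cols = ['ID', 'Name', 'Name', 'Job', 'Salary']
df_dup = pd.DataFrame(finxters, columns=dup_cols)

# Disallowing duplicates on a DataFrame that already has them raises an error
try:
    df_dup.set_flags(allows_duplicate_labels=False)
    raised = False
except pd.errors.DuplicateLabelError:
    raised = True
print(raised)  # True
```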