Pandas DataFrame Attributes and Underlying Data

The Pandas DataFrame is a data structure that organizes data into a two-dimensional format. If you are familiar with Excel or Databases, the setup is similar. Each DataFrame contains a schema that defines a Column (Field) Name and a Data Type.

Below is an example of a Database Schema for Employees of the Finxter Academy.

This article delves into each method for the DataFrame Attributes and Underlying Data.


Preparation

Before any data manipulation can occur, one new library must be installed.

  • The Pandas library enables access to/from a DataFrame.

To install this library, navigate to an IDE terminal and execute the command below at the command prompt. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

If the installation was successful, a message displays in the terminal indicating the same.


Feel free to view the PyCharm installation guide for the required library.


Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd

Create a DataFrame

The code below creates a DataFrame and outputs the same to the terminal.

finxters =  [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
             (1043, 'Micah', 'Howes', 'Manager', 95275),
             (1044, 'Hanna', 'Groves', 'Assistant', 65654),
             (1045, 'Steve', 'Brown', 'Coder', 88300),
             (1046, 'Harry', 'Green', 'Writer', 98314)]

df = pd.DataFrame(finxters)
print(df)
  • Line [1] creates a List of Tuples and saves it to finxters.
  • Line [2] converts the List of Tuples (finxters) into a DataFrame object.
  • Line [3] outputs the DataFrame to the terminal.

Output

      0      1       2               3       4
0  1042  Jayce   White  Data Scientist  155400
1  1043  Micah   Howes         Manager   95275
2  1044  Hanna  Groves       Assistant   65654
3  1045  Steve   Brown           Coder   88300
4  1046  Harry   Green          Writer   98314

DataFrame Columns

As shown in the output above, the columns are labeled with numbers rather than names. The code below resolves this issue by passing a list of names to the columns parameter.

finxters =  [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
             (1043, 'Micah', 'Howes', 'Manager', 95275),
             (1044, 'Hanna', 'Groves', 'Assistant', 65654),
             (1045, 'Steve', 'Brown', 'Coder', 88300),
             (1046, 'Harry', 'Green', 'Writer', 98314)]

cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)
print(df)
  • Line [1] creates a List of Tuples and saves it to finxters.
  • Line [2] assigns Column Names to a list (cols).
  • Line [3] creates a DataFrame and passes finxters and columns=cols.
  • Line [4] outputs the DataFrame to the terminal.

Output

     ID  First    Last             Job  Salary
0  1042  Jayce   White  Data Scientist  155400
1  1043  Micah   Howes         Manager   95275
2  1044  Hanna  Groves       Assistant   65654
3  1045  Steve   Brown           Coder   88300
4  1046  Harry   Green          Writer   98314
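
The column names can also be read or assigned after the DataFrame exists, through its columns attribute. The following is a minimal sketch using the same data; the variable name default_labels is only illustrative.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']

df = pd.DataFrame(finxters)          # columns are labeled 0..4 by default
default_labels = list(df.columns)
print(default_labels)                # [0, 1, 2, 3, 4]

df.columns = cols                    # assign names to the existing DataFrame
print(list(df.columns))              # ['ID', 'First', 'Last', 'Job', 'Salary']
```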

💡 Note: Place the following lines of code at the top of each script.

finxters =  [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
             (1043, 'Micah', 'Howes', 'Manager', 95275),
             (1044, 'Hanna', 'Groves', 'Assistant', 65654),
             (1045, 'Steve', 'Brown', 'Coder', 88300),
             (1046, 'Harry', 'Green', 'Writer', 98314)]

cols = ['ID', 'First', 'Last', 'Job', 'Salary']

DataFrame Data Types

The dtypes property returns the Data Type of each column (field) in the DataFrame.

df = pd.DataFrame(finxters, columns=cols)
print(df.dtypes)
  • Line [1] creates a DataFrame from finxters and uses cols for the column names.
  • Line [2] outputs the Data Types to the terminal.

Output

ID         int64
First     object
Last      object
Job       object
Salary     int64
dtype: object

Use Square Brackets

Another way to determine the Data Type of a Column is to specify the Column Name inside square brackets. In this case, the Data Type for the ID column displays.

df = pd.DataFrame(finxters, columns=cols)
print(df['ID'].dtypes)
  • Line [1] creates a DataFrame from finxters and uses cols for the column names.
  • Line [2] outputs the Data Type of the ID column to the terminal.

Output

int64

DataFrame Info

The df.info() method outputs the DataFrame details, including the index data type, columns, non-null values, and memory usage.

The syntax for this method is as follows:

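DataFrame.info(verbose=None, buf=None, max_cols=None, 
               memory_usage=None, show_counts=None)

For additional details on the available parameters, see the pandas documentation for DataFrame.info(). The older null_counts parameter was deprecated in favor of show_counts and is no longer accepted as of pandas 2.0.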

For this example, the verbose parameter is used. Setting this to True provides detailed information concerning the DataFrame.

df = pd.DataFrame(finxters, columns=cols)
df.info(verbose=True)
  • Line [1] creates a DataFrame from finxters and uses cols for the column names.
  • Line [2] outputs the DataFrame information to the terminal. The info() method prints its report directly and returns None, so wrapping it in print() would add an extra None line to the output.

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   ID      5 non-null      int64
 1   First   5 non-null      object
 2   Last    5 non-null      object
 3   Job     5 non-null      object
 4   Salary  5 non-null      int64
dtypes: int64(2), object(3)
memory usage: 328.0+ bytes

Setting verbose=False summarizes the DataFrame information.

df = pd.DataFrame(finxters, columns=cols)
df.info(verbose=False)
  • Line [1] creates a DataFrame from finxters and uses cols for the column names.
  • Line [2] outputs the summarized DataFrame information to the terminal. The info() method prints its report directly and returns None, so no print() call is needed.

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Columns: 5 entries, ID to Salary
dtypes: int64(2), object(3)
memory usage: 328.0+ bytes
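
The buf parameter listed in the syntax above redirects the report away from the terminal. The following is a minimal sketch, assuming an in-memory io.StringIO buffer as the target; the names buffer, result, and report are only illustrative.

```python
import io
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']

df = pd.DataFrame(finxters, columns=cols)

buffer = io.StringIO()
result = df.info(buf=buffer)   # the report is written to buffer, not the terminal
report = buffer.getvalue()     # the same report as a regular string

print(result)                  # None
print(report.splitlines()[1])  # RangeIndex: 5 entries, 0 to 4
```

Capturing the report this way can be handy when the text should go to a log file rather than the screen.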

DataFrame Select Dtypes

The df.select_dtypes() method allows you to specify a column Data Type you wish to view (including all associated values).

Using the DataFrame created earlier, this code outputs the ID and Salary values to the terminal. Both of these columns in our DataFrame have a Data Type of int64.

df = pd.DataFrame(finxters, columns=cols)
print(df.select_dtypes(include='int64'))
  • Line [1] creates a DataFrame from finxters and uses cols for the column names.
  • Line [2] outputs the values of these two columns to the terminal.

Output

     ID  Salary
0  1042  155400
1  1043   95275
2  1044   65654
3  1045   88300
4  1046   98314
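
The select_dtypes() method also accepts an exclude parameter, and the generic selector 'number' matches every numeric column regardless of its exact width. The following is a minimal sketch using the same data; the names numeric and non_numeric are only illustrative.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']

df = pd.DataFrame(finxters, columns=cols)

numeric = df.select_dtypes(include='number')      # keep numeric columns only
non_numeric = df.select_dtypes(exclude='number')  # keep everything else

print(list(numeric.columns))      # ['ID', 'Salary']
print(list(non_numeric.columns))  # ['First', 'Last', 'Job']
```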

DataFrame Axes

The df.axes property returns a list representing the axes of the DataFrame. The row axis labels come first, followed by the column axis labels (see output below).

df = pd.DataFrame(finxters, columns=cols)
print(df.axes)
  • Line [1] creates a DataFrame from finxters and uses cols for the column names.
  • Line [2] outputs the DataFrame Axes information to the terminal.

Output

[RangeIndex(start=0, stop=5, step=1), 
Index(['ID', 'First', 'Last', 'Job', 'Salary'], 
dtype='object')]
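
Since the list always holds two entries for a DataFrame, it can be unpacked directly. The following is a minimal sketch showing that the two entries match df.index and df.columns; the names row_axis and column_axis are only illustrative.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']

df = pd.DataFrame(finxters, columns=cols)

row_axis, column_axis = df.axes            # rows first, then columns

print(row_axis.equals(df.index))           # True
print(column_axis.equals(df.columns))      # True
```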

DataFrame ndim

The df.ndim property returns an integer representing the total number of axes/array dimensions. If a Series, the value of 1 is returned. If a DataFrame, the value of 2 is returned.

df = pd.DataFrame(finxters, columns=cols)
print(df.ndim)
# 2
  • Line [1] creates a DataFrame from finxters and uses cols for the column names.
  • Line [2] outputs the value of ndim to the terminal. In this case, 2 because it is a DataFrame.
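
To see both return values side by side, a single column can be selected as a Series or as a one-column DataFrame. The following is a minimal sketch using the same data.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']

df = pd.DataFrame(finxters, columns=cols)

salary_series = df['Salary']     # single brackets return a Series
salary_frame = df[['Salary']]    # double brackets return a DataFrame

print(salary_series.ndim)        # 1
print(salary_frame.ndim)         # 2
print(df.ndim)                   # 2
```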

DataFrame Size

The df.size property returns an integer representing the total number of elements in the object. For a Series, this is the number of rows. For a DataFrame, it is the number of rows multiplied by the number of columns.

df = pd.DataFrame(finxters, columns=cols)
print(df.size)
  • Line [1] creates a DataFrame from finxters and uses cols for the column names.
  • Line [2] outputs the DataFrame Size to the terminal.

Output

25 (5 columns * 5 rows = 25)
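
The difference between the Series and DataFrame cases can be checked with the same data. The following is a minimal sketch; the names frame_size and series_size are only illustrative.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']

df = pd.DataFrame(finxters, columns=cols)

frame_size = df.size             # rows * columns
series_size = df['Job'].size     # rows only, because a Series has one axis

print(frame_size)                # 25
print(series_size)               # 5
```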

DataFrame Shape

The df.shape property returns a tuple that represents the dimensionality of the DataFrame as (number of rows, number of columns).

df = pd.DataFrame(finxters, columns=cols)
print(df.shape)
  • Line [1] creates a DataFrame from finxters and uses cols for the column names.
  • Line [2] outputs the DataFrame Shape to the terminal.

Output

(5, 5)  (5 rows, 5 columns)
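
Because this DataFrame happens to be square, the order of the tuple is easier to see on a subset with fewer columns. The following is a minimal sketch; the names subset, n_rows, and n_cols are only illustrative.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']

df = pd.DataFrame(finxters, columns=cols)

subset = df[['First', 'Last']]   # keep two of the five columns
n_rows, n_cols = subset.shape    # rows come first, columns second

print(subset.shape)              # (5, 2)
print(n_rows == len(subset))     # True
```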

DataFrame Memory Usage

The df.memory_usage() method returns the memory usage of each column in bytes. The total memory usage is displayed by default in DataFrame.info(), while this method provides the per-column breakdown.

Parameters

  • index (bool, default True): Specifies whether to include the memory usage of the DataFrame index in the returned Series. If index=True, the memory usage of the index is the first item in the output.
  • deep (bool, default False): If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.

df = pd.DataFrame(finxters, columns=cols)
print(df.memory_usage(index=True))
  • Line [1] creates a DataFrame from finxters and uses cols for the column names.
  • Line [2] outputs the Memory Usage to the terminal.

Output

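Index     128
ID         40
First      40
Last       40
Job        40
Salary     40
dtype: int64

Setting deep=True also measures the Python string objects stored in the object columns, so First, Last, and Job report larger values.

df = pd.DataFrame(finxters, columns=cols)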
print(df.memory_usage(index=True, deep=True))
  • Line [1] creates a DataFrame from finxters and uses cols for the column names.
  • Line [2] outputs the deep Memory Usage to the terminal.

Output

Index     128
ID         40
First     310
Last      311
Job       326
Salary     40
dtype: int64
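
The returned Series can be summed to get a single total. The following is a minimal sketch that compares the shallow and deep totals; exact byte counts depend on the pandas version and platform, so none are shown here. The names shallow and deep are only illustrative.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']

df = pd.DataFrame(finxters, columns=cols)

# index=False leaves the index out, so only the five columns are reported
shallow = df.memory_usage(index=False)
deep = df.memory_usage(index=False, deep=True)

print(shallow.sum())   # total bytes without inspecting string contents
print(deep.sum())      # total bytes including string contents
```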

DataFrame Empty

The df.empty property checks to see if a DataFrame is empty. If empty, True returns. Otherwise, False returns.

df = pd.DataFrame(finxters, columns=cols)
print(df.empty)
  • Line [1] creates a DataFrame from finxters and uses cols for the column names.
  • Line [2] outputs False to the terminal because this DataFrame contains data.
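
A common place to meet an empty DataFrame is after a filter that matches no rows. The following is a minimal sketch; the salary threshold and the name no_rows are only illustrative.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']

df = pd.DataFrame(finxters, columns=cols)

no_rows = df[df['Salary'] > 1_000_000]   # no employee earns this much

print(df.empty)               # False
print(no_rows.empty)          # True
print(pd.DataFrame().empty)   # True
```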

DataFrame Index

The df.set_index() method allows you to set a column as the index. If no index exists, an index (auto-increment) is automatically generated by default.

     ID  First    Last             Job  Salary
0  1042  Jayce   White  Data Scientist  155400
1  1043  Micah   Howes         Manager   95275
2  1044  Hanna  Groves       Assistant   65654
3  1045  Steve   Brown           Coder   88300
4  1046  Harry   Green          Writer   98314

For this example, the column Last will be the index.

df = pd.DataFrame(finxters, columns=cols)
df.set_index('Last', inplace=True)
print(df)
  • Line [1] creates a DataFrame from finxters and uses cols for the column names.
  • Line [2] sets Last as the index column. Passing inplace=True modifies df directly.
  • Line [3] outputs the DataFrame to the terminal.

💡 Note: When inplace=True the DataFrame is updated and has no return value. When inplace=False (default) a copy of the updated DataFrame is returned.

Output

          ID  First             Job  Salary
Last
White   1042  Jayce  Data Scientist  155400
Howes   1043  Micah         Manager   95275
Groves  1044  Hanna       Assistant   65654
Brown   1045  Steve           Coder   88300
Green   1046  Harry          Writer   98314
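
With the default inplace=False, the original DataFrame is left untouched and the result must be assigned to a name. The following is a minimal sketch that also uses reset_index() to turn the index back into a column; the names indexed and restored are only illustrative.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']

df = pd.DataFrame(finxters, columns=cols)

indexed = df.set_index('Last')        # returns a new DataFrame; df is unchanged
print('Last' in df.columns)           # True
print(indexed.loc['Brown', 'Job'])    # Coder

restored = indexed.reset_index()      # moves Last back into a regular column
print('Last' in restored.columns)     # True
```

Setting a meaningful index makes label-based lookups with loc read naturally, as in the Brown example above.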

DataFrame Set Flags

The df.set_flags() method allows you to set various flags. For this example, a flag is set to not allow duplicate labels in the DataFrame.

df = pd.DataFrame(finxters, columns=cols)
df1 = df.set_flags(allows_duplicate_labels=False)
print(df1)
  • Line [1] creates a DataFrame from finxters and uses cols for the column names.
  • Line [2] sets allows_duplicate_labels to False and assigns the result to a new DataFrame (df1).
  • Line [3] outputs df1 to the terminal. There is no change as the original DataFrame did not contain duplicate labels.

Output

     ID  First    Last             Job  Salary
0  1042  Jayce   White  Data Scientist  155400
1  1043  Micah   Howes         Manager   95275
2  1044  Hanna  Groves       Assistant   65654
3  1045  Steve   Brown           Coder   88300
4  1046  Harry   Green          Writer   98314
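
The effect of the flag becomes visible once duplicate labels are involved. The following is a minimal sketch, assuming a pandas version that provides pd.errors.DuplicateLabelError (1.2 or later); the small dupes DataFrame with a repeated index label is made up for illustration.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']

df = pd.DataFrame(finxters, columns=cols)
df1 = df.set_flags(allows_duplicate_labels=False)

print(df.flags.allows_duplicate_labels)    # True (the original is unchanged)
print(df1.flags.allows_duplicate_labels)   # False

# Hypothetical data: the index label 'White' appears twice
dupes = pd.DataFrame({'Salary': [155400, 95275]}, index=['White', 'White'])
try:
    dupes.set_flags(allows_duplicate_labels=False)
    raised = False
except pd.errors.DuplicateLabelError:
    raised = True

print(raised)                              # True
```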