The Pandas DataFrame is a data structure that organizes data into a two-dimensional format. If you are familiar with Excel or Databases, the setup is similar. Each DataFrame contains a schema that defines a Column (Field) Name and a Data Type.
Below is an example of a Database Schema for Employees of the Finxter Academy.
This article delves into each method for the DataFrame Attributes and Underlying Data.
Preparation
Before any data manipulation can occur, one new library requires installation.
- The Pandas library enables access to/from a DataFrame.
To install this library, navigate to an IDE terminal. At the command prompt ($), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.
$ pip install pandas
Hit the <Enter> key on the keyboard to start the installation process.
If the installation was successful, a message displays in the terminal indicating the same.
Feel free to view the PyCharm installation guide for the required library.
Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
import pandas as pd
Create a DataFrame
The code below creates a DataFrame and outputs the same to the terminal.
finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400), (1043, 'Micah', 'Howes', 'Manager', 95275), (1044, 'Hanna', 'Groves', 'Assistant', 65654), (1045, 'Steve', 'Brown', 'Coder', 88300), (1046, 'Harry', 'Green', 'Writer', 98314)]
df = pd.DataFrame(finxters)
print(df)
- Line [1] creates a List of Tuples and saves it to finxters.
- Line [2] converts the List of Tuples (finxters) into a DataFrame object.
- Line [3] outputs the DataFrame to the terminal.
Output
  | 0 | 1 | 2 | 3 | 4 |
0 | 1042 | Jayce | White | Data Scientist | 155400 |
1 | 1043 | Micah | Howes | Manager | 95275 |
2 | 1044 | Hanna | Groves | Assistant | 65654 |
3 | 1045 | Steve | Brown | Coder | 88300 |
4 | 1046 | Harry | Green | Writer | 98314 |
DataFrame Columns
As shown in the output above, the columns are labeled with numbers instead of names. The code below resolves this issue by passing a list of names to the columns parameter when the DataFrame is created.
finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400), (1043, 'Micah', 'Howes', 'Manager', 95275), (1044, 'Hanna', 'Groves', 'Assistant', 65654), (1045, 'Steve', 'Brown', 'Coder', 88300), (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)
print(df)
- Line [1] creates a List of Tuples and saves it to finxters.
- Line [2] assigns the Column Names to a list (cols).
- Line [3] creates a DataFrame and passes finxters and columns=cols.
- Line [4] outputs the DataFrame to the terminal.
Output
  | ID | First | Last | Job | Salary |
0 | 1042 | Jayce | White | Data Scientist | 155400 |
1 | 1043 | Micah | Howes | Manager | 95275 |
2 | 1044 | Hanna | Groves | Assistant | 65654 |
3 | 1045 | Steve | Brown | Coder | 88300 |
4 | 1046 | Harry | Green | Writer | 98314 |
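Names can also be assigned after the DataFrame exists. The sketch below assumes the DataFrame was first created without names and then sets them through the columns property; the variable names are illustrative only.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]

# Without the columns parameter, pandas labels the columns 0 through 4.
df = pd.DataFrame(finxters)
default_labels = list(df.columns)

# Assigning a list to the columns property renames every column at once.
# The list length must match the number of columns.
df.columns = ['ID', 'First', 'Last', 'Job', 'Salary']

print(default_labels)
print(list(df.columns))
```

Both approaches should produce the same DataFrame. Passing columns=cols to the constructor is shorter when the names are known up front.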
💡 Note: Place the following lines of code at the top of each script.
finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400), (1043, 'Micah', 'Howes', 'Manager', 95275), (1044, 'Hanna', 'Groves', 'Assistant', 65654), (1045, 'Steve', 'Brown', 'Coder', 88300), (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
DataFrame Data Types
The dtypes property returns the Data Type of each column (field) in the DataFrame.
df = pd.DataFrame(finxters, columns=cols)
print(df.dtypes)
- Line [1] creates the DataFrame and passes the Column Names from the list created earlier (columns=cols).
- Line [2] outputs the Data Types to the terminal.
Output
ID | int64 |
First | object |
Last | object |
Job | object |
Salary | int64 |
dtype: object
Use Square Brackets
Another way to determine the Data Type of a Column is to specify the Column Name inside square brackets. In this case, the Data Type for the ID column displays.
df = pd.DataFrame(finxters, columns=cols)
print(df['ID'].dtypes)
- Line [1] creates the DataFrame and passes the Column Names from the list created earlier (columns=cols).
- Line [2] outputs the Data Type of the ID column to the terminal.
Output
int64
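Because dtypes returns a Series indexed by column name, a single entry can be looked up and tested in code. The sketch below uses the helper pd.api.types.is_integer_dtype for the check, which is one option among several and not something the examples above rely on.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# dtypes is a Series, so one column's type can be looked up by name.
salary_dtype = df.dtypes['Salary']

# A single column is a Series; its dtype attribute gives the same answer.
same_answer = salary_dtype == df['Salary'].dtype

# Test the kind of type instead of comparing against a literal string.
salary_is_integer = pd.api.types.is_integer_dtype(df['Salary'])
first_is_integer = pd.api.types.is_integer_dtype(df['First'])

print(salary_dtype, same_answer, salary_is_integer, first_is_integer)
```

Checking the kind of type this way avoids depending on the exact label pandas reports, which can differ between versions for text columns.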
DataFrame Info
The df.info() method outputs the DataFrame details, including the index data type, columns, non-null values, and memory usage.
The syntax for this method is as follows:
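DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None)
Older pandas releases also accepted a null_counts parameter, which was deprecated in favor of show_counts. For additional details on the available parameters, see the pandas documentation for DataFrame.info().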
For this example, the verbose parameter is used. Setting this to True provides detailed information concerning the DataFrame.
df = pd.DataFrame(finxters, columns=cols)
df.info(verbose=True)
- Line [1] creates the DataFrame and passes the Column Names from the list created earlier (columns=cols).
- Line [2] outputs the DataFrame information to the terminal. The info() method prints its report directly and returns None, so no print() call is needed.
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
 0   ID      5 non-null      int64
 1   First   5 non-null      object
 2   Last    5 non-null      object
 3   Job     5 non-null      object
 4   Salary  5 non-null      int64
dtypes: int64(2), object(3)
memory usage: 328.0+ bytes
Setting verbose=False summarizes the DataFrame information.
df = pd.DataFrame(finxters, columns=cols)
df.info(verbose=False)
- Line [1] creates the DataFrame and passes the Column Names from the list created earlier (columns=cols).
- Line [2] outputs the summarized DataFrame information to the terminal.
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Columns: 5 entries, ID to Salary
dtypes: int64(2), object(3)
memory usage: 328.0+ bytes
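Since info() writes its report and returns None, the text cannot be captured from the return value. One way to keep the report as a string is the buf parameter from the syntax above. The sketch below assumes an in-memory io.StringIO buffer; the variable names are illustrative.

```python
import io

import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# Send the report to an in-memory buffer instead of the terminal.
buffer = io.StringIO()
result = df.info(buf=buffer)

# The return value is still None; the text lives in the buffer.
summary = buffer.getvalue()

print(result)
print(summary)
```

The captured string can then be logged or written to a file like any other text.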
DataFrame Select Dtypes
The df.select_dtypes() method returns the subset of columns that match the Data Types you specify, including all associated values.
Using the DataFrame created earlier, this code outputs the ID and Salary values to the terminal. Both of these columns in our DataFrame have a Data Type of int64.
df = pd.DataFrame(finxters, columns=cols)
print(df.select_dtypes(include='int64'))
- Line [1] creates the DataFrame and passes the Column Names from the list created earlier (columns=cols).
- Line [2] outputs the values of these two columns to the terminal.
Output
  | ID | Salary |
0 | 1042 | 155400 |
1 | 1043 | 95275 |
2 | 1044 | 65654 |
3 | 1045 | 88300 |
4 | 1046 | 98314 |
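The method also accepts an exclude parameter and the generic 'number' selector, which matches every numeric Data Type. The following sketch shows both; the variable names are illustrative.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# 'number' selects every numeric column, whatever its exact width.
numeric = df.select_dtypes(include='number')

# exclude returns the complement: every column that is not numeric.
non_numeric = df.select_dtypes(exclude='number')

print(list(numeric.columns))
print(list(non_numeric.columns))
```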
DataFrame Axes
The df.axes property returns a list representing the axes of the DataFrame. The list contains the Row Axis labels first, followed by the Column Axis labels (see output below).
df = pd.DataFrame(finxters, columns=cols)
print(df.axes)
- Line [1] creates the DataFrame and passes the Column Names from the list created earlier (columns=cols).
- Line [2] outputs the DataFrame Axes information to the terminal.
Output
[RangeIndex(start=0, stop=5, step=1), Index(['ID', 'First', 'Last', 'Job', 'Salary'], dtype='object')]
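Because the list always has two entries for a DataFrame, it can be unpacked directly. The sketch below, with illustrative variable names, compares the two entries against the index and columns properties.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# Unpack the two axes: row labels first, then column labels.
row_axis, column_axis = df.axes

# Each entry matches the corresponding DataFrame property.
rows_match = row_axis.equals(df.index)
columns_match = column_axis.equals(df.columns)

print(rows_match, columns_match)
```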
DataFrame ndim
The df.ndim property returns an integer representing the number of axes (array dimensions). For a Series, the value is 1. For a DataFrame, the value is 2.
df = pd.DataFrame(finxters, columns=cols)
print(df.ndim)  # 2
- Line [1] creates the DataFrame and passes the Column Names from the list created earlier (columns=cols).
- Line [2] outputs the value of ndim to the terminal. In this case, 2 because it is a DataFrame.
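To see the Series case as well, the short sketch below selects a single column, which pandas returns as a Series, and compares its ndim with that of the full DataFrame.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# Selecting one column with single brackets yields a Series (one axis).
salary = df['Salary']

# Selecting with a list of names yields a DataFrame (two axes),
# even when the list holds only one name.
salary_frame = df[['Salary']]

print(salary.ndim, salary_frame.ndim, df.ndim)
```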
DataFrame Size
The df.size property returns an integer representing the total number of elements in the object. For a Series, this is the number of rows. For a DataFrame, it is the number of rows multiplied by the number of columns.
df = pd.DataFrame(finxters, columns=cols)
print(df.size)
- Line [1] creates the DataFrame and passes the Column Names from the list created earlier (columns=cols).
- Line [2] outputs the DataFrame Size to the terminal.
Output
25 (5 columns * 5 rows = 25)
DataFrame Shape
The df.shape property returns a tuple that represents the DataFrame dimensionality in the form (number of rows, number of columns).
df = pd.DataFrame(finxters, columns=cols)
print(df.shape)
- Line [1] creates the DataFrame and passes the Column Names from the list created earlier (columns=cols).
- Line [2] outputs the DataFrame Shape to the terminal.
Output
(5, 5) (5 rows, 5 columns)
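With five rows and five columns, the order of the tuple is hard to see. The sketch below uses a two-column subset (an illustrative choice) so the row count and column count differ, and relates shape to size.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# Keep two of the five columns so rows and columns are easy to tell apart.
subset = df[['ID', 'Salary']]

# shape unpacks as (rows, columns); size is their product.
rows, columns = subset.shape
print(rows, columns, subset.size)

# A Series has a one-element shape tuple, and its size equals its row count.
salary = df['Salary']
print(salary.shape, salary.size)
```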
DataFrame Memory Usage
The df.memory_usage() method returns the memory usage of each column in bytes. The total memory usage is displayed by default in DataFrame.info(), but this method provides a breakdown per column.
Parameters
index: bool, default True | This parameter specifies whether to include the memory usage of the DataFrame index in the returned Series. If index=True, the memory usage of the index is the first item in the output. |
deep: bool, default False | If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values. |
df = pd.DataFrame(finxters, columns=cols)
print(df.memory_usage(index=True))
- Line [1] creates the DataFrame and passes the Column Names from the list created earlier (columns=cols).
- Line [2] outputs the Memory Usage to the terminal.
Output
Index | 128 |
ID | 40 |
First | 40 |
Last | 40 |
Job | 40 |
Salary | 40 |
dtype: int64
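Setting deep=True also counts the memory consumed by the values stored in the object columns.
df = pd.DataFrame(finxters, columns=cols)
print(df.memory_usage(index=True, deep=True))
- Line [1] creates the DataFrame and passes the Column Names from the list created earlier (columns=cols).
- Line [2] outputs the deep Memory Usage to the terminal.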
Output
Index | 128 |
ID | 40 |
First | 310 |
Last | 311 |
Job | 326 |
Salary | 40 |
dtype: int64
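The returned Series can be summed to get a single total. The sketch below compares the shallow and deep totals; the exact byte counts depend on the pandas version and platform, so no specific numbers are assumed here.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# Per-column usage in bytes; the index is reported under the label 'Index'.
usage = df.memory_usage(index=True)

# Sum the Series to get one total for the whole DataFrame.
shallow_total = int(usage.sum())
deep_total = int(df.memory_usage(index=True, deep=True).sum())

print(shallow_total, deep_total)

# Dropping the index leaves one entry per column.
columns_only = df.memory_usage(index=False)
print(len(columns_only))
```

The deep total is never smaller than the shallow one, because deep=True adds the memory held by the text values on top of the per-column storage.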
DataFrame Empty
The df.empty property checks whether a DataFrame is empty, meaning that at least one of its axes has a length of 0. If empty, True returns. Otherwise, False returns.
df = pd.DataFrame(finxters, columns=cols)
print(df.empty)
- Line [1] creates the DataFrame and passes the Column Names from the list created earlier (columns=cols).
- Line [2] outputs True/False to the terminal. In this case, False because the DataFrame contains data.
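The sketch below contrasts two small, made-up DataFrames: one with column names but no rows, and one whose only value is missing. A DataFrame that holds only missing values still has a row, so it does not count as empty.

```python
import pandas as pd

# Column names but no rows: one axis has length 0, so the frame is empty.
no_rows = pd.DataFrame(columns=['ID', 'First'])

# One row holding a missing value: the row exists, so the frame is not empty.
only_missing = pd.DataFrame({'ID': [None]})

print(no_rows.empty)       # True
print(only_missing.empty)  # False

# Dropping the missing values first makes the second frame empty as well.
print(only_missing.dropna().empty)  # True
```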
DataFrame Index
The df.set_index() method allows you to set a column as the index. If no index is specified when the DataFrame is created, pandas generates a default integer index that starts at 0, as shown below.
  | ID | First | Last | Job | Salary |
0 | 1042 | Jayce | White | Data Scientist | 155400 |
1 | 1043 | Micah | Howes | Manager | 95275 |
2 | 1044 | Hanna | Groves | Assistant | 65654 |
3 | 1045 | Steve | Brown | Coder | 88300 |
4 | 1046 | Harry | Green | Writer | 98314 |
For this example, the column Last will be the index.
df = pd.DataFrame(finxters, columns=cols)
df.set_index('Last', inplace=True)
print(df)
- Line [1] creates the DataFrame and passes the Column Names from the list created earlier (columns=cols).
- Line [2] sets Last as the index column and passes inplace=True.
- Line [3] outputs the DataFrame to the terminal.
💡 Note: When inplace=True, the DataFrame is modified in place and the method returns None. When inplace=False (default), a new DataFrame with the updated index is returned and the original is left unchanged.
Output
  | ID | First | Job | Salary |
Last | | | | |
White | 1042 | Jayce | Data Scientist | 155400 |
Howes | 1043 | Micah | Manager | 95275 |
Groves | 1044 | Hanna | Assistant | 65654 |
Brown | 1045 | Steve | Coder | 88300 |
Green | 1046 | Harry | Writer | 98314 |
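For comparison, the sketch below uses the default inplace=False, looks up a row by its new label, and then restores the original layout with reset_index(). The variable names are illustrative.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']
df = pd.DataFrame(finxters, columns=cols)

# With the default inplace=False, a new DataFrame is returned
# and df keeps its Last column.
indexed = df.set_index('Last')

# Rows can now be looked up by last name.
job = indexed.loc['Brown', 'Job']
print(job)  # Coder

# reset_index() moves the index back into a regular column
# and restores the default integer index.
restored = indexed.reset_index()
print(list(restored.columns))
```

Note that reset_index() places the former index as the first column, so the column order differs from the original DataFrame.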
DataFrame Set Flags
The df.set_flags() method returns a new DataFrame with updated flags. For this example, a flag is set to not allow duplicate labels in the DataFrame.
df = pd.DataFrame(finxters, columns=cols)
df1 = df.set_flags(allows_duplicate_labels=False)
print(df1)
- Line [1] creates the DataFrame and passes the Column Names from the list created earlier (columns=cols).
- Line [2] sets allows_duplicate_labels to False and assigns the result to a new DataFrame (df1).
- Line [3] outputs df1 to the terminal. There is no change as the original DataFrame did not contain duplicate labels.
Output
  | ID | First | Last | Job | Salary |
0 | 1042 | Jayce | White | Data Scientist | 155400 |
1 | 1043 | Micah | Howes | Manager | 95275 |
2 | 1044 | Hanna | Groves | Assistant | 65654 |
3 | 1045 | Steve | Brown | Coder | 88300 |
4 | 1046 | Harry | Green | Writer | 98314 |
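To see the flag take effect, the sketch below builds a second DataFrame whose row labels repeat. The duplicated index is made up for this demonstration and is not part of the examples above. Setting the flag on such a DataFrame is expected to raise pd.errors.DuplicateLabelError.

```python
import pandas as pd

finxters = [(1042, 'Jayce', 'White', 'Data Scientist', 155400),
            (1043, 'Micah', 'Howes', 'Manager', 95275),
            (1044, 'Hanna', 'Groves', 'Assistant', 65654),
            (1045, 'Steve', 'Brown', 'Coder', 88300),
            (1046, 'Harry', 'Green', 'Writer', 98314)]
cols = ['ID', 'First', 'Last', 'Job', 'Salary']

# Hypothetical index for the demo: the row label 0 appears twice.
duplicated = pd.DataFrame(finxters, columns=cols, index=[0, 0, 1, 2, 3])

try:
    duplicated.set_flags(allows_duplicate_labels=False)
    raised = False
except pd.errors.DuplicateLabelError:
    # pandas refuses to set the flag while duplicate labels exist.
    raised = True

print(raised)

# With unique labels the call succeeds, and the flag can be read back.
unique = pd.DataFrame(finxters, columns=cols).set_flags(allows_duplicate_labels=False)
print(unique.flags.allows_duplicate_labels)
```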