Pandas DataFrame Conversion

The Pandas DataFrame is a data structure that organizes data into a two-dimensional table of rows and columns. If you are familiar with Excel or Databases, the setup is similar. Each DataFrame has a schema in which every Column (Field) has a Name and a Data Type.

Below is the Database Schema for our Hockey Teams example.

This article delves into each method for DataFrame Conversions.


Preparation

Before any data manipulation can occur, one library must be installed.

  • The Pandas library enables access to/from a DataFrame.

To install this library, navigate to an IDE terminal. At the command prompt, execute the code below. The terminal used in this example shows a dollar sign ($) as its prompt; your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

If the installation was successful, a message displays in the terminal indicating the same.


Feel free to view the PyCharm installation guide for the required library.


Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd
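
Some Data Type names and method behaviors vary between pandas releases, so it can help to confirm which version was installed. A minimal check, assuming the installation above succeeded (the variable name installed_version is only illustrative):

```python
import pandas as pd

# Report the installed pandas version; the value differs from machine to machine.
installed_version = pd.__version__
print(installed_version)
```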

Create a DataFrame

For this article, we have three Hockey Teams. Each Team lists its Wins, Losses, and Ties for the Season.  

teams = {'Team-A':   [20, 2,  8], 
         'Team-B':   [18, 6,  6],
         'Team-C':   [14, 3,  13]}

df = pd.DataFrame(teams)
print(df)
  • Line [1] creates a dictionary of lists and saves it to teams.
  • Line [2] creates a DataFrame from teams and saves it to df.
  • Line [3] outputs the DataFrame to the terminal.

Output

   Team-A  Team-B  Team-C
0      20      18      14
1       2       6       3
2       8       6      13

💡 Note: Copy this DataFrame to the top of each script, directly below the import pandas statement.
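
The DataFrame above uses the default row index (0, 1, 2). As an optional variation, the rows can be labeled so that each value is easier to look up. This is only a sketch: the labels Wins, Losses, and Ties are assumed from the order given in the text, and the name labeled is illustrative. The rest of the article keeps the default index.

```python
import pandas as pd

teams = {'Team-A': [20, 2, 8],
         'Team-B': [18, 6, 6],
         'Team-C': [14, 3, 13]}

# Assumed row labels: the article lists Wins, Losses, and Ties in that order.
labeled = pd.DataFrame(teams, index=['Wins', 'Losses', 'Ties'])
print(labeled)

# Look up a single value by row label and column name.
team_a_wins = labeled.loc['Wins', 'Team-A']
print(team_a_wins)
```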


DataFrame astype()

The astype() method changes the Data Type of DataFrame columns. The change can be applied to every column at once or to a chosen subset of columns.

The syntax for this method is as follows:

DataFrame.astype(dtype, copy=True, errors='raise')
Parameter  Description
dtype      The Data Type to apply to every column, or a dictionary that maps column names to Data Types.
copy       If True, a copy of the DataFrame (including changes) is returned. True by default.
errors     If set to 'raise', an exception is raised when a conversion fails. If set to 'ignore', the exception is suppressed and the original object is returned. Default is 'raise'.

The current Data Types of the teams DataFrame are as follows:

Team-A    int64
Team-B    int64
Team-C    int64

Using the same DataFrame as above, this code changes the Data Types.

df = pd.DataFrame(teams)
df = df.astype({'Team-A': 'float64', 'Team-B': 'int32', 'Team-C': 'string'},  errors='raise') 
print(df.dtypes)
  • Line [1] uses the DataFrame created earlier.
  • Line [2] converts Team-A to float64, Team-B to int32, and Team-C to string.
  • Line [3] outputs the Data Types to the terminal.

Output

Team-A    float64
Team-B      int32
Team-C     string

💡 Note: Depending on the pandas version, the last Data Type may display as string[python].
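
To illustrate the errors parameter, here is a minimal sketch of what happens when a value cannot be converted. The DataFrame raw and its 'unknown' entry are hypothetical and not part of the Hockey Teams data.

```python
import pandas as pd

# Hypothetical data: the last entry cannot be read as a number.
raw = pd.DataFrame({'Team-A': ['20', '2', 'unknown']})

try:
    raw.astype({'Team-A': 'int64'})  # errors='raise' is the default
    conversion_failed = False
except ValueError as exc:
    conversion_failed = True
    print('Conversion failed:', exc)

# The failed call does not modify the original DataFrame.
print(raw)
```

Wrapping the call in try/except keeps the script running and leaves the decision about the bad value (drop it, replace it, or stop) to the caller.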

DataFrame Convert Data Types

The convert_dtypes() method converts each column to the best possible Data Type that supports pd.NA (missing values) and returns the result in a new DataFrame.

The syntax for this method is as follows:

DataFrame.convert_dtypes(infer_objects=True, convert_string=True, 
                         convert_integer=True, convert_boolean=True, 
                         convert_floating=True)
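Parameter         Description
infer_objects     Determines if object dtypes (Data Types) should be converted to the best possible types. True by default.
convert_string    Determines if object dtypes should be converted to StringDtype(). True by default.
convert_integer   Determines if columns should be converted, where possible, to integer extension types. True by default.
convert_boolean   Determines if object dtypes should be converted to BooleanDtype(). True by default.
convert_floating  Determines if columns should be converted, where possible, to floating extension types. True by default.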

When the following code runs, each column changes from int64 to Int64, the nullable integer extension type. The convert_dtypes() method prefers Data Types that support pd.NA, even when the data contains no missing values.

df = pd.DataFrame(teams)
df = df.convert_dtypes()
print(df.dtypes)
  • Line [1] uses the DataFrame created earlier.
  • Line [2] converts the Data Types to the best possible Data Types.
  • Line [3] outputs the converted Data Types to the terminal.

Output

Team-A    Int64
Team-B    Int64
Team-C    Int64
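
The effect of convert_dtypes() is easier to see when the data contains missing values. The sketch below uses a hypothetical season DataFrame (the column names wins, win_rate, and playoffs are invented for illustration) in which missing entries have forced float64 and object Data Types.

```python
import numpy as np
import pandas as pd

# Hypothetical season data with one missing entry per column.
season = pd.DataFrame({
    # Whole numbers stored as float64 because of the NaN.
    'wins': [20.0, np.nan, 8.0],
    # Genuine fractions.
    'win_rate': [0.67, np.nan, 0.27],
    # Booleans stored as object because of the NaN.
    'playoffs': pd.Series([True, False, np.nan], dtype='object'),
})

converted = season.convert_dtypes()
print(season.dtypes)
print(converted.dtypes)
```

After conversion, the columns use the nullable Int64, Float64, and boolean Data Types, and the missing entries are kept as missing values rather than being dropped.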

DataFrame Infer Objects

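The infer_objects() method attempts to infer a better Data Type for columns stored with the object Data Type. Columns that already have a specific Data Type, such as int64 or float64, are left unchanged.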

For this example, the original DataFrame is modified as follows:

teams = {'Team-A':    [20.0, 2,  8], 
         'Team-B':   [18, 6.2,  6],
         'Team-C':   [14, 3,  13]}

df = pd.DataFrame(teams)
df = df.iloc[1:]
print(df.infer_objects().dtypes)
  • Line [1] creates an updated dictionary of lists and saves it to teams.
  • Line [2] creates a DataFrame and saves it to df.
  • Line [3] uses iloc to keep the rows from index 1 onward.
  • Line [4] calls infer_objects() and outputs the resulting Data Types to the terminal.

💡 Note: The first and second columns contain floating-point numbers, and the third column contains only integers. This method acts as expected.


Output

Team-A    float64
Team-B    float64
Team-C      int64
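
Because infer_objects() only acts on object columns, its effect shows more clearly with mixed data. In this sketch, a hypothetical 'n/a' placeholder (not part of the original example) forces the object Data Type, and the method recovers an integer type once that row is removed.

```python
import pandas as pd

# Hypothetical column: a text placeholder forces the object Data Type.
df_obj = pd.DataFrame({'Team-A': ['n/a', 2, 8]})
before = df_obj.dtypes['Team-A']

# Drop the placeholder row; the column is still stored as object.
trimmed = df_obj.iloc[1:]
still_object = trimmed.dtypes['Team-A']

# infer_objects() detects that only integers remain.
inferred = trimmed.infer_objects().dtypes['Team-A']
print(before, still_object, inferred)
```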

Change Data Type – Alternative

Let’s say we decided to change all the Data Types to float64. One alternative is to convert the values in the teams dictionary before the DataFrame is created, using a dictionary comprehension:

teams = {'Team-A':   [20.0, 2,  8], 
         'Team-B':   [18, 6.2,  6],
         'Team-C':   [14, 3,  13]}

teams = {k:[float(i) for i in v] for k, v in teams.items()}
print(teams)

Output

{'Team-A': [20.0, 2.0, 8.0], 
 'Team-B': [18.0, 6.2, 6.0], 
 'Team-C': [14.0, 3.0, 13.0]}

If you had trouble understanding this code snippet, feel free to check out our full guide on dictionary comprehension.
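
For comparison, the same result can be reached after the DataFrame exists by passing a single Data Type to astype(). This is a sketch of that route, not a replacement for the approach above; the names df_float, converted, and same_values are illustrative.

```python
import pandas as pd

teams = {'Team-A': [20.0, 2, 8],
         'Team-B': [18, 6.2, 6],
         'Team-C': [14, 3, 13]}

# Convert after building the DataFrame instead of editing the dictionary.
df_float = pd.DataFrame(teams).astype('float64')
print(df_float.dtypes)

# The dictionary-comprehension route, for comparison.
converted = {k: [float(i) for i in v] for k, v in teams.items()}
same_values = df_float.equals(pd.DataFrame(converted))
print(same_values)
```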


DataFrame copy()

The copy() method makes a copy of a DataFrame.

The syntax for this method is as follows:

DataFrame.copy(deep=True)
Parameter   Description
deep=True   Creates a deep copy (the default). This copy contains its own set of data and indices. Any modifications to the new DataFrame do not affect the original DataFrame.
deep=False  Creates a shallow copy. This copy contains a reference to the original DataFrame data and indices. In pandas versions without Copy-on-Write enabled, modifications to the data of one DataFrame are reflected in the other.

teams = {'Team-A':   [20.0, 2,  8], 
         'Team-B':   [18, 6.2,  6],
         'Team-C':   [14, 3,  13]}

df = pd.DataFrame(teams)
deep_copy = df.copy(deep=True)
deep_copy['Team-A'] = [4, 5, 6]
print(deep_copy)
print(df)
  • Line [1] assigns a dictionary of lists to teams.
  • Line [2] creates a DataFrame from teams and assigns it to df.
  • Line [3] makes a deep copy of the DataFrame and assigns it to deep_copy.
  • Line [4] makes a change to the deep_copy variable.
  • Line [5] outputs this change to the terminal.
  • Line [6] outputs the original DataFrame, which is unchanged, to the terminal.

Output

Deep copy:

   Team-A  Team-B  Team-C
0       4    18.0      14
1       5     6.2       3
2       6     6.0      13

Original:

   Team-A  Team-B  Team-C
0    20.0    18.0      14
1     2.0     6.2       3
2     8.0     6.0      13
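
A related point is that plain assignment does not copy anything. The sketch below contrasts an alias with a copy; the names alias and backup are illustrative, and the DataFrame is a trimmed version of the one above.

```python
import pandas as pd

df = pd.DataFrame({'Team-A': [20.0, 2, 8],
                   'Team-B': [18, 6.2, 6]})

alias = df           # no copy: both names refer to the same object
backup = df.copy()   # deep=True is the default

# Editing through the alias edits df, because they are the same object.
alias.loc[0, 'Team-A'] = 99.0

print(df.loc[0, 'Team-A'])      # changed along with alias
print(backup.loc[0, 'Team-A'])  # unaffected by the edit
```

Taking a copy before experimenting is a cheap way to keep the original data recoverable.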

DataFrame Bool

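The bool() method returns the boolean value of a Series/DataFrame that contains exactly one element (value). This element must be a boolean: True or False. If there is more than one element, or the element is not a boolean, a ValueError occurs; the integers 0 and 1 also raise this error. Note that bool() is deprecated as of pandas 2.1.

The syntax for this method is as follows:

DataFrame.bool()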

Here’s the code example:

print(pd.Series([True]).bool())
print(pd.DataFrame({'col': [False]}).bool())

Output

True
False
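
Since newer pandas releases may warn about or no longer provide bool(), the same single-element check can be written by hand. The function single_bool below is a hypothetical helper, not part of pandas, and only approximates the behavior described above.

```python
import numpy as np
import pandas as pd

def single_bool(obj):
    """Hypothetical helper: return the only element of obj if it is a boolean."""
    values = obj.to_numpy().ravel()
    if values.size != 1:
        raise ValueError('expected exactly one element')
    value = values[0]
    if not isinstance(value, (bool, np.bool_)):
        raise ValueError('the single element must be a boolean')
    return bool(value)

series_result = single_bool(pd.Series([True]))
frame_result = single_bool(pd.DataFrame({'col': [False]}))
print(series_result)
print(frame_result)

# An integer is rejected, even though 1 is truthy in Python.
try:
    single_bool(pd.Series([1]))
    integer_rejected = False
except ValueError:
    integer_rejected = True
print(integer_rejected)
```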