5 Best Ways to Get Datatype and Dataframe Columns Information in Python Pandas

💡 Problem Formulation: When working with data in Python, it's crucial to understand the structure and datatypes of your Pandas DataFrame. This knowledge allows one to perform the correct data manipulation tasks accurately. Users often need to identify the datatypes of columns, examine DataFrame contents, and gather metadata to inform further data processing steps. For instance, one might start with a DataFrame containing mixed types and wish to identify columns of float or categorical data specifically. The following methods provide solutions for these tasks.

Method 1: Using dtypes Attribute

The simplest way to get the datatype of each column in a DataFrame is by using the dtypes attribute. This attribute returns a Series with the data type of each column. The index of this Series is the original DataFrame columns.

Here's an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4.5, 5.5, 6.5],
    'C': ['foo', 'bar', 'baz']
})

# Getting the datatype of each column
dtype_info = df.dtypes

dtype_info would output:

A      int64
B    float64
C     object
dtype: object

This method is extremely straightforward: accessing the dtypes attribute of the DataFrame df provides a summary of data types for all columns, a very quick way to get an overview of your DataFrame's types.
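Because dtypes is a Series indexed by column name, a single column's dtype can be looked up directly, and pandas.api.types offers boolean helpers for programmatic checks. A minimal sketch using the same sample DataFrame:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Same sample DataFrame as above
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4.5, 5.5, 6.5],
    'C': ['foo', 'bar', 'baz']
})

# dtypes is a Series indexed by column name, so single lookups are direct
b_dtype = df.dtypes['B']
print(b_dtype)  # float64

# pandas.api.types helpers are handy when branching on a column's type
print(is_numeric_dtype(df['A']))  # True
print(is_numeric_dtype(df['C']))  # False
```

The helper functions are convenient in pipelines, since they avoid string comparisons against dtype names.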

Method 2: Using info() Function

The info() method prints a concise summary of the DataFrame, including the index type, the number of non-null entries and the datatype for each column, and the memory usage. As a built-in DataFrame method, it offers insight into the DataFrame's size and health in terms of data completeness.

Here's an example:

# Print a detailed summary of the DataFrame
df.info()

Which would typically print something like this to the console:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       3 non-null      int64  
 1   B       3 non-null      float64
 2   C       3 non-null      object 
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes

The info() method displays a summary that includes the number of non-null values and the datatype for each column in the DataFrame. Unlike dtypes, info() provides additional information such as the DataFrame's index type, column count, and memory usage, making it a richer source of metadata.
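One caveat: info() writes its report to stdout and returns None, so it cannot be captured by plain assignment. It does, however, accept a buf parameter for redirecting the output into any writable object. A small sketch (the buffer and summary names are illustrative):

```python
import io
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4.5, 5.5, 6.5],
    'C': ['foo', 'bar', 'baz']
})

# Redirect the report into a StringIO buffer instead of stdout
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()

print(summary)
```

This is useful when the summary needs to be logged or attached to a data-quality report rather than read interactively.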

Method 3: Using select_dtypes() Function

While dtypes and info() give you a broad overview, sometimes you may need to filter the DataFrame's columns based on their datatypes. The select_dtypes() function allows you to select columns of particular datatypes, which can be useful for separating features by type for further analysis or data processing.

Here's an example:

# Select only float columns
float_columns = df.select_dtypes(include=['float64'])

# Select only numeric columns
numeric_columns = df.select_dtypes(include=['number'])

The output will then be the respective filtered DataFrames:

   B
0  4.5
1  5.5
2  6.5

   A    B
0  1  4.5
1  2  5.5
2  3  6.5

The select_dtypes() method allows you to filter columns based on their data type specification. You can use the include and exclude parameters to define the data types you're interested in. This gives you the flexibility to manipulate only the columns that meet your datatype criteria.
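For completeness, exclude is the mirror image of include. A quick sketch that keeps everything that is not numeric (the non_numeric name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4.5, 5.5, 6.5],
    'C': ['foo', 'bar', 'baz']
})

# Keep only the columns that are NOT numeric
non_numeric = df.select_dtypes(exclude=['number'])
print(non_numeric.columns.tolist())  # ['C']
```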

Method 4: Using dtypes with value_counts()

If you're looking for a summarized view of how many columns belong to each datatype, you can combine the dtypes attribute with the value_counts() method. This tells you how many columns of each data type you're dealing with, which can be particularly beneficial when working with datasets that contain a high number of features.

Here's an example:

# Get a count of each datatype
datatype_counts = df.dtypes.value_counts()

This will produce:

int64      1
float64    1
object     1
dtype: int64

Combining dtypes with value_counts() provides a quick summary of the distribution of datatypes within the DataFrame. It's an easy way to assess the variety of data you'll need to manage when conducting data preprocessing or analysis.
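If you also need to know which columns carry each dtype, not just how many, the same dtypes Series can be folded into a dtype-to-columns mapping. A small sketch, with columns_by_dtype as an illustrative name:

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4.5, 5.5, 6.5],
    'C': ['foo', 'bar', 'baz']
})

# Map each dtype to the list of columns that carry it
columns_by_dtype = defaultdict(list)
for column, dtype in df.dtypes.items():
    columns_by_dtype[dtype].append(column)

for dtype, cols in columns_by_dtype.items():
    print(dtype, cols)
```

The resulting mapping pairs naturally with select_dtypes() when you later want to process each group of columns differently.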

Bonus One-Liner Method 5: Using applymap() Function

If you're looking for an inline method to retrieve the type of each individual value, you can use the applymap() method with the built-in type function on the DataFrame. This method is particularly useful if the DataFrame is small or if the consistency of data types within each column is in question.

Here's an example:

# Apply type function to each value in DataFrame
types_one_liner = df.applymap(type)

Example output could look something like:

               A              B                C
0  <class 'int'>  <class 'float'>  <class 'str'>
1  <class 'int'>  <class 'float'>  <class 'str'>
2  <class 'int'>  <class 'float'>  <class 'str'>

Using applymap() with the type function on the DataFrame inspects the type of each individual element. Note that this method returns the exact type of each entry rather than the general dtype of the column and is more granular than the previous methods.
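One practical use of this element-wise view is flagging columns whose cells mix types, something a column-level dtype of object can hide. Note that applymap() was deprecated in pandas 2.1 in favor of the equivalent DataFrame.map(); the sketch below picks whichever is available (the elementwise and mixed_columns names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'C': ['foo', 2, 'baz'],  # deliberately mixed object column
})

# Prefer DataFrame.map (pandas >= 2.1), fall back to applymap on older versions
elementwise = getattr(df, 'map', None) or df.applymap

# One Python type per cell; nunique() per column reveals mixed-type columns
type_counts = elementwise(type).nunique()
mixed_columns = type_counts[type_counts > 1].index.tolist()

print(mixed_columns)  # ['C']
```

Catching such mixed columns early is worthwhile, since they often break downstream numeric operations that a dtype-level check would not predict.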

Summary/Discussion

  • Method 1: dtypes Attribute. Strengths: Quick, easy to use, cleanly formatted output. Weaknesses: Does not provide additional context such as non-null counts or memory usage.
  • Method 2: info() Function. Strengths: Comprehensive overview, including memory usage and non-null count. Weaknesses: Outputs to console, not stored as a DataFrame or Series for further manipulation.
  • Method 3: select_dtypes() Function. Strengths: Allows filtering by type, useful for separating data types. Weaknesses: Requires additional steps to get a full overview of all data types present.
  • Method 4: dtypes with value_counts(). Strengths: Summarized distribution of data types. Weaknesses: Not as detailed as other methods; no memory usage or non-null counts provided.
  • Bonus Method 5: applymap() Function. Strengths: Provides element-wise data type information. Weaknesses: Potentially verbose for large DataFrames, more CPU intensive, provides no summarization.