How to Create a DataFrame From Lists?

Pandas is a great library for data analysis in Python. With Pandas, you can create visualizations, filter rows or columns, add new columns, and save the data in a wide range of formats. The workhorse of Pandas is the DataFrame.

So the first step in working with Pandas is often getting our data into a DataFrame. If we have data stored in lists, how can we create this all-powerful DataFrame?

There are 4 basic strategies:

1. Create a dictionary with column names as keys and your lists as values. Pass this dictionary as an argument when creating the DataFrame.
2. Pass your lists into the zip() function. As with strategy 1, your lists will become columns in the DataFrame.
3. Put your lists into a list instead of a dictionary. In this case, your lists become rows instead of columns.
4. Create an empty DataFrame and add columns one by one.

Create a DataFrame using a dictionary

The first step is to import pandas. If you haven’t already, install pandas first.

import pandas as pd

Let’s say you have employee data stored as lists.

# if your data is stored like this
employee = ['Betty', 'Veronica', 'Archie', 'Jughead']
salary = [110_000, 20_000, 80_000, 70_000]
bonus = [1000, 500, 2500, 400]
tax_rate = [.1, .25, .17, .4]
absences = [0, 1, 0, 52]

Build a dictionary using column names as keys and your lists as values.

# you can easily create a dictionary that will define your dataframe
emp_data = {
    'name': employee,
    'salary': salary,
    'bonus': bonus,
    'tax_rate': tax_rate,
    'absences': absences
}

# pass the dictionary to the DataFrame constructor
emp_df = pd.DataFrame(emp_data)

Your lists will become columns in the resulting DataFrame.
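To confirm that each list really ends up as a column, a minimal sketch can build the DataFrame and inspect its shape and column names. It reuses the sample lists from this section; the name check_df is chosen here only for illustration. It also shows that the lists must all have the same length, since pandas raises a ValueError otherwise.

```python
import pandas as pd

employee = ['Betty', 'Veronica', 'Archie', 'Jughead']
salary = [110_000, 20_000, 80_000, 70_000]
bonus = [1000, 500, 2500, 400]
tax_rate = [.1, .25, .17, .4]
absences = [0, 1, 0, 52]

emp_data = {
    'name': employee,
    'salary': salary,
    'bonus': bonus,
    'tax_rate': tax_rate,
    'absences': absences,
}

# check_df is an illustrative name, not part of the original example
check_df = pd.DataFrame(emp_data)
print(check_df.shape)          # (4, 5): four rows, one column per list
print(list(check_df.columns))  # the dictionary keys, in insertion order

# lists of unequal length are rejected by the constructor
length_error = None
try:
    pd.DataFrame({'name': employee, 'salary': salary[:3]})
except ValueError as err:
    length_error = err
    print(f'ValueError: {err}')
```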

Create a DataFrame using the zip function

Pass each list as a separate argument to the zip() function. You can specify the column names with the columns parameter of the constructor or, as shown here, by setting the columns attribute on a separate line.

emp_df = pd.DataFrame(zip(employee, salary, bonus, tax_rate, absences))
emp_df.columns = ['name', 'salary', 'bonus', 'tax_rate', 'absences']

The zip() function creates an iterator. For the first iteration, it grabs every value at index 0 from each list. This becomes the first row in the DataFrame. Next, it grabs every value at index 1 and this becomes the second row. This continues until it exhausts the shortest list.

We can loop through the iterator to see how this works.

# enumerate() supplies the counter, so no manual index is needed
for i, value in enumerate(zip(employee, salary, bonus, tax_rate, absences)):
    print(f'zipped value at index {i}: {value}')

Each of these values becomes a row in the DataFrame:

zipped value at index 0: ('Betty', 110000, 1000, 0.1, 0)
zipped value at index 1: ('Veronica', 20000, 500, 0.25, 1)
zipped value at index 2: ('Archie', 80000, 2500, 0.17, 0)
zipped value at index 3: ('Jughead', 70000, 400, 0.4, 52)
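Two details from this section are easy to try out: naming the columns through the columns parameter, and the fact that zip() stops at the shortest list. The sketch below shortens the bonus list on purpose; rows, short_df, and lengths are illustrative names, and the length check at the end is just one possible safeguard.

```python
import pandas as pd

employee = ['Betty', 'Veronica', 'Archie', 'Jughead']
salary = [110_000, 20_000, 80_000, 70_000]
bonus = [1000, 500, 2500]  # one value short on purpose

# zip() stops at the shortest list, so Jughead is silently left out
rows = list(zip(employee, salary, bonus))
print(len(rows))  # 3

# name the columns directly with the columns parameter
short_df = pd.DataFrame(rows, columns=['name', 'salary', 'bonus'])
print(short_df)

# comparing lengths before zipping makes the mismatch visible
lengths = {len(employee), len(salary), len(bonus)}
print(len(lengths) == 1)  # False: the lists differ in length
```

Because the truncation raises no error, checking the lengths first is worthwhile whenever the lists come from different sources.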

Create a DataFrame using a list of lists

What if you have a separate list for each employee? In this case, we can just create a list of lists. Each of the inner lists becomes a row in the DataFrame.

# lists for employees instead of features
betty = ['Betty', 110000, 1000, 0.1, 0]
veronica = ['Veronica', 20000, 500, 0.25, 1]
archie = ['Archie', 80000, 2500, 0.17, 0]
jughead = ['Jughead', 70000, 400, 0.4, 52]

emp_df = pd.DataFrame([betty, veronica, archie, jughead])
emp_df.columns = ['name', 'salary', 'bonus', 'tax_rate', 'absences']
emp_df
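The row-wise and column-wise layouts describe the same table. As a rough sketch, the code below builds one DataFrame from a list of rows, regroups the same values by column, and compares the results. The names rows, by_column, rows_df, and cols_df are illustrative.

```python
import pandas as pd

columns = ['name', 'salary', 'bonus', 'tax_rate', 'absences']
rows = [
    ['Betty', 110000, 1000, 0.1, 0],
    ['Veronica', 20000, 500, 0.25, 1],
    ['Archie', 80000, 2500, 0.17, 0],
    ['Jughead', 70000, 400, 0.4, 52],
]

# each inner list becomes a row
rows_df = pd.DataFrame(rows, columns=columns)

# zip(*rows) regroups the values by position, giving one tuple per column
by_column = {name: list(values) for name, values in zip(columns, zip(*rows))}
cols_df = pd.DataFrame(by_column)

# both constructions hold the same values
print(rows_df.to_dict('list') == cols_df.to_dict('list'))  # True
```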

Create a DataFrame using a list of dictionaries

If the employee data is stored in dictionaries instead of lists, we use a list of dictionaries.

betty = {'name': 'Betty', 'salary': 110000, 'bonus': 1000, 
         'tax_rate': 0.1, 'absences': 0}

veronica = {'name': 'Veronica', 'salary': 20000, 'bonus': 500, 
            'tax_rate': 0.25, 'absences': 1}

archie = {'name': 'Archie', 'salary': 80000, 'bonus': 2500, 
          'tax_rate': 0.17, 'absences': 0}
          
jughead = {'name': 'Jughead', 'salary': 70000, 'bonus': 400, 
           'tax_rate': 0.4, 'absences': 52}

pd.DataFrame([betty, veronica, archie, jughead])

The columns are determined by the keys in the dictionaries. What if the dictionaries don’t all have the same keys?

betty = {'name': 'Betty', 'salary': 110000, 'bonus': 1000, 
         'tax_rate': 0.1, 'absences': 0, 'hire_date': '2001-01-01'}

veronica = {'name': 'Veronica', 'salary': 20000, 'bonus': 500, 
            'tax_rate': 0.25, 'absences': 1}

archie = {'name': 'Archie', 'salary': 80000, 'bonus': 2500, 
          'tax_rate': 0.17, 'absences': 0, 'title': 'Vice Chief Leader'}
          
jughead = {'name': 'Jughead', 'salary': 70000, 'bonus': 400,      
           'tax_rate': 0.4, 'absences': 52, 'rank': 'yes'}

pd.DataFrame([betty, veronica, archie, jughead])

All of the keys will be used. Whenever pandas encounters a dictionary with a missing key, it fills the missing value with NaN, which stands for ‘not a number’.
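To see where the gaps ended up, one option is to count the missing values per column with isna(). This is a small sketch with a trimmed-down version of the records above; records and gaps_df are illustrative names.

```python
import pandas as pd

records = [
    {'name': 'Betty', 'salary': 110000, 'hire_date': '2001-01-01'},
    {'name': 'Veronica', 'salary': 20000},
    {'name': 'Archie', 'salary': 80000, 'title': 'Vice Chief Leader'},
]

gaps_df = pd.DataFrame(records)

# every key that appears in any dictionary becomes a column
print(sorted(gaps_df.columns))

# isna() marks the cells that were filled in for missing keys
print(gaps_df.isna().sum())
print(gaps_df['title'].isna().tolist())  # [True, True, False]
```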

Create an empty DataFrame and add columns one by one

This method might be preferable if you need to create a lot of new calculated columns. Here we create a new column for after-tax income.

emp_df = pd.DataFrame()
emp_df['name'] = employee
emp_df['salary'] = salary
emp_df['bonus'] = bonus
emp_df['tax_rate'] = tax_rate
emp_df['absences'] = absences

income = emp_df['salary'] + emp_df['bonus']
emp_df['after_tax'] = income * (1 - emp_df['tax_rate'])
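As a quick sanity check on the calculated column, the following sketch repeats the calculation for two employees and compares it with the arithmetic done by hand. It assumes the same formula as above, (salary + bonus) * (1 - tax_rate); small_df is an illustrative name.

```python
import pandas as pd

# start empty and add columns one at a time
small_df = pd.DataFrame()
small_df['name'] = ['Betty', 'Veronica']
small_df['salary'] = [110_000, 20_000]
small_df['bonus'] = [1000, 500]
small_df['tax_rate'] = [.1, .25]

income = small_df['salary'] + small_df['bonus']
small_df['after_tax'] = income * (1 - small_df['tax_rate'])

# by hand: Betty    (110000 + 1000) * 0.90 = 99900
#          Veronica (20000 + 500)   * 0.75 = 15375
print(small_df['after_tax'].round(2).tolist())
```

The rounding is only there to hide floating-point noise in the printed result.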

How to add a list to an existing DataFrame

Here is a neat trick. If you want to edit a row in a DataFrame, you can use the handy loc indexer. loc lets you access rows and columns by their index labels. The examples below assume emp_df holds only the five original columns, without after_tax.

To access a row:

emp_df.loc[3]

The output is the row with index label 3, returned as a Series:

name        Jughead
salary        70000
bonus           400
tax_rate        0.4
absences         52
Name: 3, dtype: object

To access a column, pass in the column name as the column index. Note that we have to specify both the row and the column indexes. The format is [rows, columns]. If you want all rows, you can use “:” as we do here. The “:” also works if you want all columns.

emp_df.loc[:, 'salary']

The output is also a Series:

0    110000
1     20000
2     80000
3     70000
Name: salary, dtype: int64
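The [rows, columns] format covers a few more cases than the two shown. Here is a hedged sketch on a two-column version of the employee data; the variable names are illustrative. Note that label slices with loc include both endpoints.

```python
import pandas as pd

emp_df = pd.DataFrame({
    'name': ['Betty', 'Veronica', 'Archie', 'Jughead'],
    'salary': [110_000, 20_000, 80_000, 70_000],
})

row = emp_df.loc[3]              # one row, returned as a Series
col = emp_df.loc[:, 'salary']    # one column, returned as a Series
cell = emp_df.loc[3, 'salary']   # a single value

# a label slice includes both ends, so this selects rows 1 and 2
block = emp_df.loc[1:2, ['name', 'salary']]

print(type(row).__name__, type(col).__name__)  # Series Series
print(cell)                                    # 70000
print(block.shape)                             # (2, 2)
```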

So how do we use loc to add a new row? If we use a row index that doesn’t exist in the DataFrame, it will create a new row for us.

new_emp = ['Fonzie', 200000, 30000, .05, 112]
emp_df.loc[4] = new_emp
emp_df

You can also update existing data with loc. Let’s lower Fonzie’s salary. It looks a bit excessive.

emp_df.loc[4, 'salary'] = 105000
emp_df

That’s more like it.
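Putting the two steps together, a self-contained sketch might look like this. It assumes the default integer index (0, 1, 2, ...), so that len(emp_df) is the first unused label; next_label is an illustrative name.

```python
import pandas as pd

emp_df = pd.DataFrame({
    'name': ['Betty', 'Veronica', 'Archie', 'Jughead'],
    'salary': [110_000, 20_000, 80_000, 70_000],
    'bonus': [1000, 500, 2500, 400],
    'tax_rate': [.1, .25, .17, .4],
    'absences': [0, 1, 0, 52],
})

# the list needs one value per column, in column order
new_emp = ['Fonzie', 200000, 30000, .05, 112]

# with the default index, len(emp_df) is the first label not yet in use
next_label = len(emp_df)
emp_df.loc[next_label] = new_emp

# update a single cell in the new row
emp_df.loc[next_label, 'salary'] = 105000

print(emp_df)
```

If the DataFrame uses a custom index, len(emp_df) may collide with an existing label, in which case loc would overwrite that row instead of adding one.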

Conclusion

There are many different ways of creating a DataFrame. We looked at several methods using data stored in lists. Each will get the job done.

The most convenient method will depend on what your lists represent.

If each of your lists would best be represented as a column, then a dictionary of lists might be the easiest way to go.

If each of your lists would best be represented as a row, then a list of lists would be a good choice.

To add data in a list as a new row in an existing DataFrame, the loc indexer comes in handy. loc is also useful for updating existing data.