5 Best Ways to Write a Python Program to Read Excel Data and Access First & Last Columns

Rate this post

πŸ’‘ Problem Formulation: Extracting specific columns from an excel file is a common task faced by data analysts and Python developers. The input is an Excel file, and the desired output is a list or array containing the values of the first and last columns for all rows. This article walks through different Python methods to achieve this, each with its own benefits and use cases.

Method 1: Using pandas with iloc

The pandas library is a robust tool for data manipulation in Python. With pandas, you can use the iloc indexer to select the first and last columns by their integer location. This method is efficient and highly readable, making it an excellent choice for working with large datasets.

Here’s an example:

import pandas as pd

# Read an Excel file
df = pd.read_excel('example.xlsx')

# Select the first and last columns
first_last_columns = df.iloc[:, [0, -1]]

print(first_last_columns)

Output:

   First_Column  Last_Column
0           123          789
1           456          012
2           678          345

This snippet extracts the first and last columns of the Excel file ‘example.xlsx’ into a new DataFrame with df.iloc[:, [0, -1]]. The 0 and -1 index positions correspond to the first and last columns, respectively.

Method 2: Using pandas with loc and column names

Using pandasloc method with explicit column names allows for more readable code, particularly when the column names are descriptive. This method is recommended when you know the header names of the columns you want to select ahead of time.

Here’s an example:

import pandas as pd

# Read an Excel file
df = pd.read_excel('example.xlsx')

# Select the first and last columns by name
first_column_name = df.columns[0]
last_column_name = df.columns[-1]
selected_columns = df.loc[:, [first_column_name, last_column_name]]

print(selected_columns)

Output:

   First_Column  Last_Column
0           123          789
1           456          012
2           678          345

Here, first we obtain the names of the first and last columns using df.columns[0] and df.columns[-1], and then use df.loc to select the columns.

Method 3: Using openpyxl

openpyxl is a Python library to read and write Excel 2010 xlsx/xlsm/xltx/xltm files. It allows you to work with Excel files without needing the pandas library. This method is useful if your project strictly handles Excel files and you want to avoid the larger footprint of pandas.

Here’s an example:

from openpyxl import load_workbook

# Load the workbook and select the active worksheet
wb = load_workbook('example.xlsx')
ws = wb.active

# Extract the first and last column values
first_column = [cell.value for cell in ws['A']]
last_column = [cell.value for cell in ws[ws.max_column]]

print("First Column:", first_column)
print("Last Column:", last_column)

Output:

First Column: [123, 456, 678]
Last Column: [789, 012, 345]

In this code, load_workbook loads the Excel file and ws['A'] and ws[ws.max_column] are used to reference the first and last columns, from which the values are extracted into lists.

Method 4: Using xlrd

The xlrd library provides a way to read data from Excel files. This method is straightforward and best suited for Excel files prior to Excel 2010, focusing on older xls files. However, it’s not the best choice for newer Excel file formats or if you need more complex data manipulations.

Here’s an example:

import xlrd

# Open the workbook and select the first sheet
wb = xlrd.open_workbook('example.xls')
sheet = wb.sheet_by_index(0)

# Read the first and last column values
first_column = sheet.col_values(0)
last_column = sheet.col_values(sheet.ncols - 1)

print("First Column:", first_column)
print("Last Column:", last_column)

Output:

First Column: [123, 456, 678]
Last Column: [789, 012, 345]

We open an xls file with xlrd.open_workbook() and use sheet.col_values() with appropriate indexes to extract the values from the first and last columns.

Bonus One-Liner Method 5: Using pandas with compression

For a compact and efficient single-line solution, you can use a one-liner in pandas. This method combines file reading and data extraction in a single step, ideal for quick scripts or minimalistic code.

Here’s an example:

import pandas as pd

# Read the first and last columns in a one-liner
first_last_columns = pd.read_excel('example.xlsx', usecols=lambda x: x == 0 or x == pd.read_excel('example.xlsx').shape[1] - 1)

print(first_last_columns)

Output:

   First_Column  Last_Column
0           123          789
1           456          012
2           678          345

This one-liner reads the Excel file and directly extracts the first and last columns using a lambda function within the usecols parameter, which checks for the first (0) or the last column index dynamically.

Summary/Discussion

  • Method 1: pandas with iloc. Strengths: Efficient and versatile; works well with large datasets. Weaknesses: Requires the pandas library; might be overwhelming for beginners.
  • Method 2: pandas with loc and column names. Strengths: Clear code with explicit names; easy to maintain. Weaknesses: Requires foreknowledge of column names; requires pandas.
  • Method 3: openpyxl. Strengths: Excel-specific library; lighter than pandas. Weaknesses: Less flexible compared to pandas; not suitable for datasets that aren’t in Excel format.
  • Method 4: xlrd. Strengths: Direct and simple; ideal for older Excel formats. Weaknesses: Limited to xls files; lacks advanced features of pandas.
  • Method 5: One-liner pandas. Strengths: Concise and quick; suitable for small scripts. Weaknesses: Can be less readable; still requires pandas.