5 Best Ways to Use Boto3 Library in Python to Get a List of Files from S3 Based on Last Modified Date Using AWS Resource

💡 Problem Formulation: When working with AWS S3, users often need to retrieve a list of files filtered by their last modified date. Such a list can be used for data synchronization, backups, or other maintenance tasks. The desired outcome is a Python script that uses the boto3 library to connect to an S3 bucket, fetch the object metadata, and filter the objects so that only those modified after a specified date are returned.

Method 1: Basic Boto3 Resource Iteration

The first method involves using boto3’s resource object to iterate over all objects in an S3 bucket, collecting those with a last modified date greater than a specified threshold. The resource interface allows for an object-oriented approach to interacting with AWS resources.

Here’s an example:

import boto3
from datetime import datetime, timezone

s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket('my-bucket')
filtered_files = []

for obj in bucket.objects.all():
    if obj.last_modified > datetime(2023, 1, 1, tzinfo=timezone.utc):
        filtered_files.append(obj.key)

print(filtered_files)

Output:

['file1.txt', 'file2.txt', ...]

This code snippet initializes a connection to the S3 bucket using a resource object and loops through all objects, checking each one’s last modified date. Files modified after January 1, 2023, are added to the list filtered_files.
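
If you also need the timestamps themselves, a small variation of the same loop (a sketch, assuming the same bucket name and cutoff date) collects (key, last_modified) pairs and sorts them newest first:

import boto3
from datetime import datetime, timezone

s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket('my-bucket')
cutoff = datetime(2023, 1, 1, tzinfo=timezone.utc)

# Collect (key, last_modified) pairs for objects newer than the cutoff.
recent = [
    (obj.key, obj.last_modified)
    for obj in bucket.objects.all()
    if obj.last_modified > cutoff
]

# Sort by last modified date, newest first.
recent.sort(key=lambda pair: pair[1], reverse=True)

for key, modified in recent:
    print(modified.isoformat(), key)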

Method 2: Using Boto3 Client and Pagination

This method employs the boto3 client interface together with a paginator to work through large sets of S3 objects. It’s well suited to buckets with thousands of files, because results arrive page by page and can be processed incrementally rather than all at once.

Here’s an example:

import boto3
from datetime import datetime, timezone

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket='my-bucket')
filtered_files = []

for page in page_iterator:
    for obj in page.get('Contents', []):  # pages from an empty bucket have no 'Contents' key
        if obj['LastModified'] > datetime(2023, 1, 1, tzinfo=timezone.utc):
            filtered_files.append(obj['Key'])

print(filtered_files)

Output:

['file1.txt', 'file2.txt', ...]

This code uses the boto3 client to paginate through all objects within ‘my-bucket’, comparing the ‘LastModified’ date of each file against a specified date. Only files modified after this date are collected.
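
If memory is a concern, the same pagination logic can be wrapped in a generator so that keys are yielded as each page arrives instead of being accumulated up front. This is a sketch, with the bucket name and cutoff date as placeholders:

import boto3
from datetime import datetime, timezone

def iter_recent_keys(bucket_name, cutoff):
    """Yield keys of objects modified after cutoff, one page at a time."""
    s3_client = boto3.client('s3')
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name):
        for obj in page.get('Contents', []):
            if obj['LastModified'] > cutoff:
                yield obj['Key']

for key in iter_recent_keys('my-bucket', datetime(2023, 1, 1, tzinfo=timezone.utc)):
    print(key)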

Method 3: Client with Filtering Parameters

This method is similar to the previous one but leverages S3’s ability to filter keys by prefix; combined with pagination, this narrows the listing considerably when your files share a known common prefix.

Here’s an example:

import boto3
from datetime import datetime, timezone

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(
    Bucket='my-bucket',
    Prefix='2023/'
)

filtered_files = []

for page in page_iterator:
    for obj in page.get('Contents', []):
        if obj['LastModified'] > datetime(2023, 1, 1, tzinfo=timezone.utc):
            filtered_files.append(obj['Key'])

print(filtered_files)

Output:

['2023/file1.txt', '2023/file2.txt', ...]

In this code snippet, we’ve introduced a ‘Prefix’ parameter to the paginator to filter only files starting with ‘2023/’. This is very efficient when your objects are well-organized with a predictable naming convention.
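
Taking this further, if your keys encode dates (for example '2023/01/', '2023/02/', and so on), you can list only the prefixes that could contain recent objects. The key layout below is a hypothetical illustration:

import boto3
from datetime import datetime, timezone

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')
cutoff = datetime(2023, 6, 1, tzinfo=timezone.utc)

# Hypothetical key layout: '2023/<month>/<file>'. Only list months from June onward.
prefixes = [f'2023/{month:02d}/' for month in range(6, 13)]

filtered_files = []
for prefix in prefixes:
    for page in paginator.paginate(Bucket='my-bucket', Prefix=prefix):
        for obj in page.get('Contents', []):
            if obj['LastModified'] > cutoff:
                filtered_files.append(obj['Key'])

print(filtered_files)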

Method 4: Using Boto3 Collections Filter

Another technique uses boto3 collections and their built-in filter() method. The filter() call accepts the same parameters as the underlying ListObjectsV2 request (such as Prefix), which S3 applies server-side; the last modified comparison still has to happen client-side, because S3 does not filter listings by date.

Here’s an example:

import boto3
from datetime import datetime, timezone

s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket('my-bucket')

# filter() narrows the listing server-side by prefix; the date check runs client-side.
filtered_files = [
    obj.key
    for obj in bucket.objects.filter(Prefix='2023/')
    if obj.last_modified > datetime(2023, 1, 1, tzinfo=timezone.utc)
]

print(filtered_files)

Output:

['2023/file1.txt', '2023/file2.txt', ...]

Here the filter() call passes the Prefix parameter through to the underlying ListObjectsV2 requests, so S3 returns only keys under ‘2023/’. The collection paginates lazily through those results, and the last modified check is then applied on the client.
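
Collections can also be chained with page_size() to control how many keys each underlying request fetches; the value of 500 below is an arbitrary illustration:

import boto3

s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket('my-bucket')

# page_size() caps how many keys each underlying list request returns.
for obj in bucket.objects.filter(Prefix='2023/').page_size(500):
    print(obj.key, obj.last_modified)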

Bonus One-Liner Method 5: Comprehension with Resource

You can condense the file filtering logic into a single expression by combining Python’s list comprehension with the boto3 resource interface.

Here’s an example:

import boto3
from datetime import datetime, timezone

s3_resource = boto3.resource('s3')
filtered_files = [
    obj.key for obj in s3_resource.Bucket('my-bucket').objects.all()
    if obj.last_modified > datetime(2023, 1, 1, tzinfo=timezone.utc)
]

print(filtered_files)

Output:

['file1.txt', 'file2.txt', ...]

This one-liner uses a list comprehension to iterate over the objects in an S3 bucket and build a list of keys that meet the condition in a clean, readable way. Note that it still lists every object in the bucket, so the same scaling caveats as Method 1 apply.
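
A related one-liner (a sketch, assuming a non-empty bucket with the same name) picks out the single most recently modified object by using last_modified as the key for max():

import boto3

s3_resource = boto3.resource('s3')

# max() over the collection; raises ValueError if the bucket is empty.
latest = max(
    s3_resource.Bucket('my-bucket').objects.all(),
    key=lambda obj: obj.last_modified,
)
print(latest.key, latest.last_modified)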

Summary/Discussion

  • Method 1: Basic Resource Iteration. Simple and readable. Not suitable for buckets with a large number of files due to potential performance issues.
  • Method 2: Client with Paginator. Great for handling large datasets. More complex and verbose compared to using resource objects.
  • Method 3: Client with Filtering Parameters. Efficient for known key prefixes. Requires upfront knowledge of the bucket’s object naming structure.
  • Method 4: Using Collections Filter. Combines server-side prefix filtering with the concise resource interface; the date comparison still runs client-side. Less widely known, thus can be confusing for beginners.
  • Method 5: Comprehension with Resource. Extremely concise. May suffer from readability issues, especially for those new to Python comprehensions.