5 Effective Ways to Run AWS Glue Jobs Using the Boto3 Library in Python


💡 Problem Formulation: Developers often need to integrate AWS services into their Python applications. One such scenario is kicking off, from a Python script, a job in AWS Glue, a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. This article shows several ways to run a Glue job with the Boto3 library, assuming you already have an AWS account, configured AWS credentials, and an existing Glue job defined. The end goal is to start the Glue job programmatically from Python and optionally track its outcome.

Method 1: Starting a Glue Job with start_job_run

This method uses the start_job_run function from Boto3’s Glue client. This function initiates the execution of a Glue job by passing the job name and additional runtime arguments if necessary. It returns a unique run ID for the job started, which can be used to track the job execution status.

Here’s an example:

import boto3

# Initialize a Glue client
glue = boto3.client('glue')

# Start a Glue job
response = glue.start_job_run(JobName='my-glue-job')

print(response['JobRunId'])

Output:

jr_8d9f07f9e4bcd8e5726fff8a8f8xxxxx

In the provided code snippet, we import the Boto3 library and create a Glue client which is then used to call start_job_run with the name of the Glue job. The function initiates the job and returns a response containing the ‘JobRunId’, which you can use to track the job status.

Method 2: Monitoring Glue Job Status

After starting a Glue job, you may want to monitor its status to know when it finishes or if an error occurs. You can achieve this by calling the get_job_run function periodically with the job name and run ID obtained from the initial job start response.

Here’s an example:

import boto3
import time

# Initialize a Glue client
glue = boto3.client('glue')

# Start Glue job
job_name = 'my-glue-job'
start_response = glue.start_job_run(JobName=job_name)

# Get job run ID
job_run_id = start_response['JobRunId']

# Monitor job status
while True:
    status = glue.get_job_run(JobName=job_name, RunId=job_run_id)['JobRun']['JobRunState']
    print(f'The job is: {status}')
    if status in ('SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT', 'ERROR'):
        break  # The run has reached a terminal state
    time.sleep(60)  # Wait for 60 seconds before checking the job status again

Output:

The job is: STARTING
The job is: RUNNING
The job is: SUCCEEDED

This snippet starts a Glue job and then enters a loop that repeatedly checks the job’s status by calling get_job_run with the job’s name and run ID. The loop runs until the job reaches a terminal state. Besides ‘SUCCEEDED’ and ‘FAILED’, a run can also end as ‘STOPPED’, ‘TIMEOUT’, or ‘ERROR’, and a loop that checks only the first two would never exit in those cases. The 60-second sleep between checks avoids overwhelming the API with requests.
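For longer-running jobs, the polling loop can be wrapped in a reusable function with an upper bound on the total wait. The following is a minimal sketch, not a definitive implementation: wait_for_job_run, TERMINAL_STATES, and the poll_seconds, max_wait_seconds, and sleep parameters are hypothetical names introduced for illustration, and the set of terminal states is an assumption worth checking against the current Glue documentation. The function takes the Glue client as an argument, so in real use you would pass boto3.client('glue').

```python
import time

# Assumed set of states in which a Glue job run has stopped for good.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def wait_for_job_run(glue, job_name, run_id, poll_seconds=60,
                     max_wait_seconds=3600, sleep=time.sleep):
    """Poll get_job_run until the run reaches a terminal state.

    Returns the final JobRun dictionary, or raises TimeoutError if the run
    is still active after roughly max_wait_seconds.
    """
    waited = 0
    while True:
        job_run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        if job_run["JobRunState"] in TERMINAL_STATES:
            return job_run
        if waited >= max_wait_seconds:
            raise TimeoutError(
                f"Job run {run_id} still {job_run['JobRunState']} "
                f"after {waited} seconds"
            )
        sleep(poll_seconds)
        waited += poll_seconds


# Example usage (requires AWS credentials and an existing job):
#   import boto3
#   glue = boto3.client('glue')
#   run_id = glue.start_job_run(JobName='my-glue-job')['JobRunId']
#   final = wait_for_job_run(glue, 'my-glue-job', run_id)
#   print(final['JobRunState'], final.get('ErrorMessage'))
```

Returning the whole JobRun dictionary lets the caller read fields such as ErrorMessage when a run fails. The injectable sleep parameter is there only to make the function easy to test without real delays.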

Method 3: Passing Arguments to a Glue Job

When starting a Glue job using the Boto3 library, you can pass arguments that your job can use at runtime. This is done by including a dictionary of key-value pairs in the Arguments parameter of the start_job_run function. These arguments are useful for customizing job runs, like setting various paths, job configurations, or any parameters your job expects.

Here’s an example:

import boto3

# Initialize a Glue client
glue = boto3.client('glue')

# Define job arguments
job_args = {
    '--s3_input_path': 's3://my-input-bucket/',
    '--s3_output_path': 's3://my-output-bucket/'
}

# Start the Glue job with the specified arguments
response = glue.start_job_run(JobName='my-glue-job', Arguments=job_args)

print(response['JobRunId'])

Output:

jr_2f9f07f5d0abf8e5234fff2b2fgxxxxx

In this code, additional arguments are passed to the Glue job by creating a dictionary named job_args, which contains custom job parameters like input and output S3 paths. These are then passed to the start_job_run function, which starts the job with the given arguments, allowing us to customize the job run.
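The Arguments map accepts only string keys and string values, and custom parameter names are conventionally prefixed with '--'. As a small sketch under those assumptions, a hypothetical helper named build_glue_arguments (not part of Boto3) can normalize a plain dictionary before it is passed to start_job_run:

```python
def build_glue_arguments(params):
    """Turn a plain dict into a Glue-style Arguments map.

    Adds the '--' prefix where it is missing and converts every value
    to a string, since the Arguments map is string-to-string.
    """
    arguments = {}
    for name, value in params.items():
        key = name if name.startswith("--") else f"--{name}"
        arguments[key] = str(value)
    return arguments


job_args = build_glue_arguments({
    "s3_input_path": "s3://my-input-bucket/",
    "s3_output_path": "s3://my-output-bucket/",
    "batch_size": 500,  # hypothetical numeric parameter, sent as "500"
})

# Example usage (requires AWS credentials and an existing job):
#   import boto3
#   boto3.client('glue').start_job_run(JobName='my-glue-job', Arguments=job_args)
```

Inside the job script, such parameters are typically read with getResolvedOptions from awsglue.utils, which is available in the Glue runtime and returns the values as strings.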

Method 4: Handling Exceptions

It is important to handle exceptions that may occur while interacting with AWS services. When starting a Glue job, the Boto3 client may raise exceptions for reasons such as network issues, permission errors, or AWS service limits. Wrapping the call in Python’s try/except blocks allows your script to handle these exceptions gracefully.

Here’s an example:

import boto3
from botocore.exceptions import BotoCoreError, ClientError

# Initialize a Glue client
glue = boto3.client('glue')

try:
    # Start the Glue job and handle potential exceptions
    response = glue.start_job_run(JobName='my-glue-job')
    print(f'Job started with run ID: {response["JobRunId"]}')
except (BotoCoreError, ClientError) as error:
    print(f'An error occurred: {error}')

Output:

Job started with run ID: jr_5d8f07f7d1cdbe5731fff3a3a9gxxxxx

In this snippet, the call that starts the Glue job is wrapped in a try block, and the except clause catches exceptions raised by the Boto3 client. Two exception classes are caught explicitly: ClientError, raised when the AWS service returns an error response (for example, a permission error or a missing job), and BotoCoreError, the base class for client-side problems such as connection failures or missing credentials. Together they cover most issues related to service clients. By handling these exceptions, the script can display a helpful message and, if necessary, perform cleanup or retry logic.
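Beyond printing the error, a script can react to specific failures. One case worth handling is a job that is already at its concurrency limit, which the Glue client reports as ConcurrentRunsExceededException. The sketch below retries a few times in that case. The function name start_job_with_retry and its max_attempts, delay_seconds, and sleep parameters are hypothetical, and the retry policy is an assumption for illustration, not an AWS recommendation.

```python
import time


def start_job_with_retry(glue, job_name, max_attempts=3, delay_seconds=30,
                         sleep=time.sleep):
    """Start a Glue job, retrying while the concurrency limit is reached.

    Returns the JobRunId. Any other exception, or the concurrency error on
    the final attempt, is re-raised to the caller.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return glue.start_job_run(JobName=job_name)["JobRunId"]
        except glue.exceptions.ConcurrentRunsExceededException:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller decide what to do
            sleep(delay_seconds)


# Example usage (requires AWS credentials and an existing job):
#   import boto3
#   glue = boto3.client('glue')
#   print(start_job_with_retry(glue, 'my-glue-job'))
```

Catching the modeled exception through glue.exceptions keeps the retry narrow, so permission errors or a misspelled job name still fail immediately instead of being retried.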

Bonus One-Liner Method 5: Triggering a Glue Job in a Single Command

For quick and simple Glue job execution without much error handling or status monitoring, you can use a one-liner command in Python that combines job initiation and getting the JobRunId in one go. This is suitable for scripting and ad-hoc execution when detailed control is not required.

Here’s an example:

import boto3

# Create the client, start the job, and print the run ID in one statement
print(boto3.client('glue').start_job_run(JobName='my-glue-job')['JobRunId'])

Output:

jr_1c4g07f8c2edaf83726fff8b8ghxxxxx

The one-liner example here instantiates a Glue client and starts a job immediately, printing the JobRunId directly to the console. It is the most concise way to trigger a Glue job via Boto3 in Python; however, it omits error handling and status checking.

Summary/Discussion

  • Method 1: Starting a Glue Job with start_job_run. Straightforward and easy to implement. Does not provide status tracking or error handling.
  • Method 2: Monitoring Glue Job Status. Enables status tracking throughout execution. May require pacing to avoid hitting API rate limits.
  • Method 3: Passing Arguments to a Glue Job. Provides customization for job runs. Requires knowledge of the arguments expected by the job script.
  • Method 4: Handling Exceptions. Essential for robust scripts, especially when automating or scheduling Glue job runs. Adds complexity to the script.
  • Method 5: Triggering a Glue Job in a Single Command. Quick execution with minimal code. Not suitable for production use due to the lack of error handling and job monitoring.