Practical Way to Use AWS Glue with PostgreSQL

AWS Glue is a serverless data integration service provided by Amazon as part of Amazon Web Services. It lets you discover, prepare, and move data between data stores, and it automatically provisions and manages the computing resources required to run your jobs.

As a popular ETL service, Glue offers numerous options for connecting to various databases, including PostgreSQL, a widely used RDBMS.

Glue provides several ways to set up ETL (Extract, Transform, Load) processes, as shown below:

With its visual setup, performing ETL tasks becomes much easier.

You only need a few clicks to create an ETL job that transforms data from an S3 input into a PostgreSQL output.

However, this setup is somewhat restrictive: you are limited to the options the visual editor exposes, and each required option must be configured before the ETL job will function properly.

If you are looking for more flexibility in configuration, you can consider using a script setup.

With a script setup, you can connect to your data source or output directly from the script. To do this, switch from the visual setup to the script page as shown below:

For the code, you can use simple scripts like the following:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize the Glue context and job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read CSV data from S3 into a DynamicFrame
s3_path = 's3://your-S3-REPO/'
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [s3_path]},
    format="csv",  # Adjust the format as necessary
    format_options={"withHeader": True, "separator": ","}
)

datasource.printSchema()

# Transform the data as needed (here the mapping casts id and age
# from string to int and passes name through unchanged)
transformed = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("id", "string", "id", "int"),
        ("name", "string", "name", "string"),
        ("age", "string", "age", "int")
    ]
)

transformed.printSchema()

# Write the transformed data to PostgreSQL over JDBC
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://your-PostgresqlDB-Endpoint:5432/your_database",
        "dbtable": "your_table",
        "user": "your-Postgresql-User",
        "password": "your-Postgresql-Password"
    }
)

# Commit the job
job.commit()
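
One caveat about the script above: it hardcodes the database credentials. A safer pattern is to keep them in AWS Secrets Manager and fetch them at runtime with boto3 (the Glue job's IAM role then needs the secretsmanager:GetSecretValue permission). Here is a minimal sketch that would replace the write step; the secret name glue/postgresql-credentials and the JSON keys are placeholders of my own choosing:

import json
import boto3

# Fetch the secret; it is assumed to be stored as a JSON string like
# {"user": "...", "password": "..."} (placeholder secret name below)
secrets = boto3.client("secretsmanager")
response = secrets.get_secret_value(SecretId="glue/postgresql-credentials")
creds = json.loads(response["SecretString"])

# Same write step as before, but without hardcoded credentials
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://your-PostgresqlDB-Endpoint:5432/your_database",
        "dbtable": "your_table",
        "user": creds["user"],
        "password": creds["password"]
    }
)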

And for the input, you can use a CSV format file like this:

id,name,age
1,John Doe,30
2,Jane Smith,15
3,Bob Yellow,20
4,Roshan Brown,18
5,Bam Black,55
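
Note that the ApplyMapping step casts id and age to integers, so the target table should have matching column types. If the table does not exist yet, you can create it ahead of time from any machine that can reach the database. Here is a minimal sketch using psycopg2 (which is not bundled with Glue Spark jobs by default, so run it separately or add it as a job dependency; all connection values are placeholders):

import psycopg2

conn = psycopg2.connect(
    host="your-PostgresqlDB-Endpoint",
    dbname="your_database",
    user="your-Postgresql-User",
    password="your-Postgresql-Password"
)

# The connection context manager commits the transaction on success
with conn, conn.cursor() as cur:
    # Column types match the ApplyMapping output: int, string, int
    cur.execute("""
        CREATE TABLE IF NOT EXISTS your_table (
            id   integer,
            name text,
            age  integer
        )
    """)

conn.close()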

After that, you can start the job and wait until it finishes. If it succeeds, as shown below:

you can check the latest result in your PostgreSQL database.
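
If you prefer not to drive this from the console, you can also start the job and poll its status from Python with boto3. A minimal sketch, assuming a job named your-glue-job (a placeholder):

import time
import boto3

glue = boto3.client("glue")

# Start a run of the Glue job
run = glue.start_job_run(JobName="your-glue-job")
run_id = run["JobRunId"]

# Poll until the run reaches a terminal state
while True:
    status = glue.get_job_run(JobName="your-glue-job", RunId=run_id)
    state = status["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"):
        print("Job finished with state:", state)
        break
    time.sleep(30)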

I think that’s it for now. Leave a comment below with your thoughts! Thanks.