AWS Glue Script for Data Migration Explained - SimPPL/arbiter-documentation GitHub Wiki
This script automates the process of extracting data from Google BigQuery tables and storing it in Amazon S3 for further analysis or storage.
Importing Libraries
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame
Initializing AWS Glue Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
This section sets up the necessary environment for running the AWS Glue job, including initializing the Spark context, Glue context, and Spark session.
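`getResolvedOptions` pulls named arguments out of `sys.argv`, where Glue passes them as `--KEY value` pairs. A minimal stand-alone sketch of that behavior (the `parse_glue_args` helper below is hypothetical, not part of `awsglue`):

```python
def parse_glue_args(argv, option_names):
    """Minimal stand-in for awsglue.utils.getResolvedOptions:
    extract --KEY value pairs for the requested option names."""
    resolved = {}
    for name in option_names:
        flag = f"--{name}"
        if flag not in argv:
            raise KeyError(f"Required argument {flag} not supplied")
        resolved[name] = argv[argv.index(flag) + 1]
    return resolved


# Glue invokes the script roughly like: script.py --JOB_NAME bigquery-to-s3 ...
args = parse_glue_args(["script.py", "--JOB_NAME", "bigquery-to-s3"], ["JOB_NAME"])
print(args["JOB_NAME"])  # bigquery-to-s3
```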
Defining Table Names
table_names = ["linksfinal", "postsfinal", "blocksfinal", "tagsfinal", "mentionsfinal", "followsfinal"]
A list of table names (table_names) is defined. These are the names of the tables from Google BigQuery that will be processed and migrated to Amazon S3.
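Each name in `table_names` is later expanded into a fully qualified BigQuery table id and an S3 destination prefix. The mapping the loop performs can be previewed with plain string formatting (identifiers and paths taken from the script itself):

```python
table_names = ["linksfinal", "postsfinal", "blocksfinal",
               "tagsfinal", "mentionsfinal", "followsfinal"]

for table_name in table_names:
    source = f"bluesky_social.{table_name}"
    destination = f"s3://arbiter.datasets/data/bluesky_social/{table_name}/"
    print(source, "->", destination)
```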
Looping Through Tables
for table_name in table_names:
The script then iterates over each table name in the table_names list; the extraction and write steps below run once per table.
Creating DynamicFrame from Google BigQuery
    GoogleBigQuery_node1698466006405 = glueContext.create_dynamic_frame.from_options(
        connection_type="bigquery",
        connection_options={
            "connectionName": "Big Query Connection",
            "parentProject": "infinite-rope-363317",
            "sourceType": "table",
            "table": f"bluesky_social.{table_name}",
        },
        transformation_ctx="GoogleBigQuery_node1698466006405",
    )
For each table, a DynamicFrame is created from Google BigQuery using the create_dynamic_frame.from_options function. It specifies the connection type as BigQuery and provides connection options such as the connection name, parent project, source type, and table name.
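Because these connection options vary only by table name, they could be factored into a small helper. The `build_bigquery_options` function below is a hypothetical refactor, not part of the original script:

```python
def build_bigquery_options(table_name, dataset="bluesky_social",
                           parent_project="infinite-rope-363317",
                           connection_name="Big Query Connection"):
    """Return the connection_options dict passed to
    create_dynamic_frame.from_options for a BigQuery source table."""
    return {
        "connectionName": connection_name,
        "parentProject": parent_project,
        "sourceType": "table",
        "table": f"{dataset}.{table_name}",
    }


print(build_bigquery_options("postsfinal")["table"])  # bluesky_social.postsfinal
```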
Writing DynamicFrame to Amazon S3
    AmazonS3_node1698466010194 = glueContext.write_dynamic_frame.from_options(
        frame=GoogleBigQuery_node1698466006405,
        connection_type="s3",
        format="glueparquet",
        connection_options={
            "path": f"s3://arbiter.datasets/data/bluesky_social/{table_name}/",
            "partitionKeys": [],
        },
        format_options={"compression": "snappy"},
        transformation_ctx="AmazonS3_node1698466010194",
    )
The DynamicFrame obtained from Google BigQuery is then written to Amazon S3 using the write_dynamic_frame.from_options function. It specifies the connection type as S3 and, in connection_options, the destination path for the table along with an empty partitionKeys list, so the output is not partitioned. The format argument selects Glue's optimized Parquet writer (glueparquet), and format_options applies Snappy compression.
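Because partitionKeys is empty, each table lands as a flat set of Parquet files under its prefix. If a column were listed there, the writer would instead produce Hive-style key=value subdirectories. A small sketch of the resulting key layout (the date column is hypothetical, used only for illustration):

```python
def s3_prefix_for(table_name, partition_values=None,
                  prefix="s3://arbiter.datasets/data/bluesky_social"):
    """Sketch of the directory layout produced by the S3 sink: flat when
    there are no partition keys, Hive-style subdirectories otherwise."""
    parts = [prefix, table_name]
    for column, value in (partition_values or {}).items():
        parts.append(f"{column}={value}")
    return "/".join(parts) + "/"


print(s3_prefix_for("postsfinal"))
# s3://arbiter.datasets/data/bluesky_social/postsfinal/
print(s3_prefix_for("postsfinal", {"date": "2023-10-28"}))
# s3://arbiter.datasets/data/bluesky_social/postsfinal/date=2023-10-28/
```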
Committing Job
job.commit()
Finally, the job is committed once, after the loop has processed every table. Committing signals that the job completed successfully; if job bookmarks are enabled, it also persists the progress tracked by each transformation_ctx so already-processed data is not re-read on the next run.