BigQuery to S3 Data Migration Documentation - SimPPL/arbiter-documentation GitHub Wiki
Introduction
This guide provides step-by-step instructions for migrating data from BigQuery to S3. It covers setting up the necessary credentials, storing them securely in AWS Secrets Manager, creating an AWS Glue connection, and using AWS Glue to create a transfer job. The guide also includes details on customizing the transfer script, and the Resources section links to further documentation of these steps.
⚠️ Steps 3 to 29 are a one-time setup process.
Table of Contents
- Navigation to Project on GCP
- Service Account Creation on GCP
- Adding Secret to AWS
- Connection Creation on AWS Glue
- Data Migration using AWS Glue
- Resources
Step-by-Step Guide
Navigation to Project on GCP
Step 1: Navigate to Google Cloud Console & login
Step 2: Switch to 'JournalismAI Parrot' project
Service Account Creation on GCP
⚠️ The Service Account on GCP & the Secret on AWS have already been created for Arbiter Data Migration. You can skip to Step 30.
Step 3: Click "IAM & Admin"
Step 4: Create a service account with BigQuery access & store the JSON credentials file
Step 5: Copy the contents of this file
This is an example of the JSON credentials file.
Step 6: Navigate to https://toolbox.googleapps.com/apps/encode_decode & paste the contents of the file. Convert it to Base64 format.
Step 7: Save the Base64 output for later use. This is an example of such an output.
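Steps 5 to 7 can also be done locally instead of using the web tool. The sketch below, using Python's standard `base64` module, produces the same Base64 string; the `credentials.json` filename is a hypothetical example of the downloaded key file.

```python
import base64

def encode_credentials(path: str) -> str:
    """Base64-encode a service-account JSON key file,
    matching the output of the encode/decode web tool."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# Example usage (the filename is hypothetical):
# encoded = encode_credentials("credentials.json")
```

Save the returned string; it is what gets pasted into the AWS secret in the next section.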
Adding Secret to AWS
⚠️ We have already created a "Big_Query_Service_Account_Key" for data migration.
Step 8: Navigate to AWS Console & login
Step 9: Search for and open "Secrets Manager"
Step 10: Click "Store a new secret"
Step 11: Click "Other type of secret"
Step 12: Under the 'Key/value' tab, click the key text field under 'Key/value pairs' & type "credentials"
Step 13: Paste the Base64 output from Step 7 into the value field
Step 14: Click "Next"
Step 15: Click the "Secret name" field & name the secret appropriately
Step 16: Click "Next"
Step 17: Click "Next" again on the next page
Step 18: Click "Store"
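The secret stored in Steps 10 to 18 is a JSON document with a single `credentials` key holding the Base64 string. A minimal sketch of that payload, with a comment showing how the same secret could be created programmatically via the AWS SDK (an alternative to the console steps, not part of the original guide):

```python
import json

def build_secret_payload(encoded_key: str) -> str:
    """Build the JSON value Secrets Manager stores for the
    key/value pair entered in Steps 12-13."""
    return json.dumps({"credentials": encoded_key})

# With boto3, the equivalent of Steps 10-18 would be roughly:
#   client = boto3.client("secretsmanager")
#   client.create_secret(Name="Big_Query_Service_Account_Key",
#                        SecretString=build_secret_payload(encoded))
```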
Connection Creation on AWS Glue
Step 19: On AWS Console, search for & open "AWS Glue"
Step 20: Click "Data connections"
Step 21: Click "Create connection"
Step 22: Find "Google BigQuery" in the Data Sources
Step 23: Click "Next"
Step 24: Click "Choose one secret" to choose an AWS Secret
Step 25: Choose the secret you created previously
Here, we use the "Big_Query_Service_Account_Key" secret created for data migration.
Step 26: Click "Next"
Step 27: Add a Name and Description for your new connection
Step 28: Click "Next"
Step 29: Click "Create connection"
Data Migration using AWS Glue
Step 30: Search for & click "AWS Glue"
Step 31: Go to "ETL jobs"
⚠️ You can create a custom transfer job using "Visual ETL" or "Script editor"
Step 32: Under 'Your Jobs', click on "truthsocial_data-transfer"
This is a script already created for migration. A few parameters need to be changed in it based on the dataset or table to be transferred. The following steps demonstrate this in detail.
Step 33: Edit 'connectionName' for the GoogleBigQuery node if you created a new connection; otherwise, leave it as 'Big Query Connection' as shown.
Step 34: Navigate to the Google Cloud Console
Step 35: Search and open "BigQuery"
Step 36: Click on the project name shown at the top
Step 37: Copy the parent project ID for 'JournalismAI Parrot' project
The parent project ID is "infinite-rope-363317" in our case.
Step 38: Paste the copied ID in the "parentProject" field in the script on AWS
Step 39: Go back to GCP & copy the name of the dataset to be transferred
Here "bluesky_social" is the dataset name.
Step 40: Paste the dataset's name under 'connection_options' for the BigQuery node in the AWS Glue Script
Step 41: Paste the dataset's name under 'connection_options' for the S3 Node in the AWS Glue Script
The folder structure depicted is that of the S3 bucket as it is currently organized. The path can be customized as required.
Step 42: On GCP, copy the table names under the dataset to be transferred
Here "blocks", "follows", etc. are the table names.
Step 43: Paste copied table names into the "table_names" array
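The parameters edited in Steps 33 to 43 can be collected in one sketch. This is not the actual job script (which builds Glue DynamicFrames); the option keys for the BigQuery node and the S3 bucket name are assumptions for illustration, while the project ID, dataset, and table names are the examples used in this guide.

```python
# Values from this guide; replace them for other transfers.
PARENT_PROJECT = "infinite-rope-363317"    # Step 38: parent project ID
CONNECTION_NAME = "Big Query Connection"   # Step 33: Glue connection name
DATASET = "bluesky_social"                 # Steps 39-41: dataset to transfer
TABLE_NAMES = ["blocks", "follows"]        # Steps 42-43: tables to transfer

def bigquery_options(table: str) -> dict:
    """connection_options for the GoogleBigQuery source node,
    one table at a time (option keys are assumptions)."""
    return {
        "connectionName": CONNECTION_NAME,
        "parentProject": PARENT_PROJECT,
        "table": f"{DATASET}.{table}",
    }

def s3_target_path(bucket: str, table: str) -> str:
    """Target path for the S3 node; the bucket name and
    folder layout here are hypothetical."""
    return f"s3://{bucket}/{DATASET}/{table}/"
```

The job then iterates over `TABLE_NAMES`, reading each table via its BigQuery options and writing it to the corresponding S3 path.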
Step 44: Click "Save"
Step 45: Click "Run"
Step 46: Click "Run details" to view the logs of the running job
Resources
PDF of Scribe Documentation of the above steps: BigQuerytoS3DataMigrationDocumentation.pdf