Case: Teradata migrate to BigQuery
https://resources.pythian.com/hubfs/White-Papers/Migrate-Teradata-to-Google-BigQuery.pdf
Pre-migration considerations
A very thorough understanding of the current Teradata environment:
- Business Process Logic
- Teradata Data Model(s)
- BTEQ, stored procedures, and other code
- Pipeline dependencies and scheduling requirements
- Monitoring, auditing, and logging requirements
- Security requirements for the entire stack
- Reporting requirements, including KPIs and SLAs
- CI/CD and development processes

In BigQuery there is no direct equivalent for a Teradata Set Table. We recommend data from all tables be deduplicated on fetch.
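As a minimal sketch of that dedupe-on-fetch recommendation, the query below collapses exact duplicate rows after a staging load, using the google-cloud-bigquery Python client. The project, dataset, and table names are hypothetical, and `SELECT DISTINCT *` assumes the table has only scalar (groupable) columns.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical project/dataset/table names.
# SELECT DISTINCT collapses exact duplicate rows, approximating the row
# uniqueness a Teradata Set Table would have enforced; it assumes the
# table contains only scalar (groupable) columns.
dedup_sql = """
CREATE OR REPLACE TABLE `my_project.my_dataset.orders` AS
SELECT DISTINCT *
FROM `my_project.my_dataset.orders_staging`
"""

client.query(dedup_sql).result()  # wait for the query job to finish
```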
It’s critical to review and plan how data will be transformed as it is migrated into the BigQuery environment.
- Data type mapping is challenging: there are data type differences between Teradata and BigQuery.
- Generally, the most defensive (widest compatible) data type is used by default when loading into BigQuery (see the mapping sketch below).
- If a tighter compatible type needs to be preserved, do some additional massaging of the data once it is in BigQuery (the T in ELT).
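For illustration only, here is the kind of "defensive" default mapping meant above, expressed as a Python dict; verify the exact mapping against the official Teradata-to-BigQuery migration documentation for your schema.

```python
# Illustrative "defensive" Teradata -> BigQuery type defaults: the widest
# compatible BigQuery type is chosen so nothing is lost on load; narrowing
# to a tighter type can be done later inside BigQuery (the T in ELT).
TERADATA_TO_BIGQUERY = {
    "BYTEINT": "INT64",
    "SMALLINT": "INT64",
    "INTEGER": "INT64",
    "BIGINT": "INT64",
    "FLOAT": "FLOAT64",
    "DECIMAL": "NUMERIC",      # or BIGNUMERIC/STRING for very high precision
    "CHAR": "STRING",
    "VARCHAR": "STRING",
    "CLOB": "STRING",
    "BYTE": "BYTES",
    "VARBYTE": "BYTES",
    "BLOB": "BYTES",
    "DATE": "DATE",
    "TIME": "TIME",
    "TIMESTAMP": "TIMESTAMP",
}
```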
Timeline
A realistic timeline for most migrations is anywhere from 9 to 24 months.
Details of the migration phase
Typical migration steps
- Identify the base data set and point-in-time data (Teradata Queries) for the initial load.
- Use the BigQuery Data Transfer Service or an equivalent tool/process to extract the base data to GCS and then import to BigQuery.
- Identify the process/queries for capturing delta data (while the rest of the migration process is ongoing) and how the delta data will be loaded into BigQuery.
- Identify data concurrency and consistency test cases.
- Create the process for loading delta data from Teradata to BigQuery (via the extract-to-GCS process), as sketched below.
- Execute the data concurrency & consistency test cases.
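A minimal sketch of the delta-load step referenced above, using the google-cloud-bigquery client to append extract files from GCS into an existing table. The project, bucket path, dataset, and table names are placeholders, and the sketch assumes the deltas were exported as Avro.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project id

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    # Append delta rows to the table that already holds the base load.
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/teradata/delta/orders/*.avro",  # placeholder extract path
    "my_dataset.orders",                            # placeholder target table
    job_config=job_config,
)
load_job.result()  # block until the load job completes
print(f"Loaded {load_job.output_rows} delta rows")
```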
DATA TRANSFORMATIONS & INTEGRATION
Best practices
Pythian chose to use the Teradata Hadoop Connector to migrate the data from Teradata into BigQuery.
- We set up a Dataproc instance with the Hadoop connector installed and wrote scripts to pull the data from the Teradata instance.
- The data is pulled over and stored in GCS buckets, and then imported into BigQuery (see the sketch below).
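The notes do not show the actual Teradata Hadoop Connector job, which is submitted as a Hadoop job with its own command-line syntax. As a rough PySpark illustration of the same pull-to-GCS step on Dataproc (not the connector invocation itself), assuming the Teradata JDBC driver jar is on the cluster and using placeholder connection details:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("teradata-to-gcs").getOrCreate()

# Placeholder host, database, table, and credentials; the Teradata JDBC
# driver jar must be on the Dataproc cluster for this read to work.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:teradata://teradata-host/DATABASE=sales")
    .option("driver", "com.teradata.jdbc.TeraDriver")
    .option("dbtable", "sales.orders")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Dataproc ships with the GCS connector, so gs:// paths work directly;
# the resulting files can then be loaded into BigQuery.
df.write.mode("overwrite").parquet("gs://my-bucket/teradata/orders/")
```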
Why not the BigQuery Data Transfer Service
- The customer did not allow us to write custom queries to pull the data, while the BigQuery Data Transfer Service required a datetime column to be defined for incremental extraction.
- Several tables within the Teradata instance did not have a datetime column we could use, so the entire table would have been pulled every time.
When to use MS SQL Server
The data source was archived data from Teradata that was not being changed in any way.
- The customer copied the data files onto the Google Transfer Appliance.
- The data was shipped to Google and copied to a GCS bucket.
- We created a SQL Server instance in GCP using the data files.
- We wrote Sqoop jobs to export the data and load it into BigQuery (see the sketch below).
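The white paper notes do not reproduce the actual Sqoop commands; as a hedged sketch, a Sqoop import against the SQL Server instance might look roughly like the call below (wrapped in Python to match the other sketches). Host, credentials, table, and bucket are placeholders, and it assumes the SQL Server JDBC driver is available to Sqoop and gs:// paths are usable via the GCS connector on the cluster. The Avro files it writes to GCS can then be loaded into BigQuery with a load job like the one sketched earlier.

```python
import subprocess

# Placeholder connection details and paths; assumes the SQL Server JDBC
# driver jar is available to Sqoop and the GCS connector is installed
# on the cluster so --target-dir can point at a gs:// path.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:sqlserver://10.0.0.5:1433;databaseName=archive",
    "--username", "etl_user",
    "--password-file", "hdfs:///user/etl/sqlserver.password",
    "--table", "archived_orders",
    "--as-avrodatafile",
    "--target-dir", "gs://my-bucket/sqlserver/archived_orders/",
    "--num-mappers", "4",
]
subprocess.run(sqoop_import, check=True)
```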