GCP BigQuery
select * from unnest(['a', 'b', 'c'])
Row | f0_ |
---|---|
1 | a |
2 | b |
3 | c |
select ['a', 'b', 'c'];
Row | f0_ |
---|---|
1 | a |
 | b |
 | c |
SELECT *
FROM `bigquery-samples.airline_ontime_data.__TABLES__`
- COUNT(DISTINCT field) - in legacy SQL, a statistical approximation that is not guaranteed to be exact
- COUNT(DISTINCT field, n) - forces BigQuery to return an exact count for any number of values at or below the threshold n
- EXACT_COUNT_DISTINCT(field) - returns the exact quantities
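For comparison, in standard SQL COUNT(DISTINCT) is exact and APPROX_COUNT_DISTINCT provides the fast approximation. A minimal sketch (the flights table and airline column names are assumptions, not verified):
-- Sketch only: table and column names are assumed
SELECT
  COUNT(DISTINCT airline)        AS exact_count,   -- exact in standard SQL
  APPROX_COUNT_DISTINCT(airline) AS approx_count   -- HyperLogLog++-based approximation
FROM `bigquery-samples.airline_ontime_data.flights`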
https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts
analytic_function_name ( [ argument_list ] )
OVER (
[ PARTITION BY partition_expression_list ]
[ ORDER BY expression [{ ASC | DESC }] [, ...] ]
[ window_frame_clause ]
)
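A small sketch of the syntax above, using a hypothetical employees table:
-- Rank employees by salary within each department (table and columns are hypothetical)
SELECT
  department,
  employee,
  salary,
  RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
FROM `myproject.mydataset.employees`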
https://cloud.google.com/bigquery/docs/best-practices-performance-patterns
Three types of resources
- organisations
- projects
- datasets
- Create dataset in local project
bq mk pub_weather_data_samples
- Construct the new SQL and save the result to a destination table
Console
# User-defined temp function used by the query to convert the Fahrenheit source column to Celsius
create temp function fahrenheit2celsius(x FLOAT64) as ((x-32)*5/9);
select postal_code, fahrenheit2celsius(avg_of__daily_min_temperature_air_f) as min_temp
from `weathersource-com.pub_weather_data_samples.sample_climatology_great_britain_postalcode_daily_january`;
CLI
bq --location=us query \
--destination_table playground-s-11-ed0924:pub_weather_data_samples_us.sample_climatology_great_britain_postalcode_daily_january \
--use_legacy_sql=false 'create temp function fahrenheit2celsius(x FLOAT64) as ((x-32)*5/9);
select postal_code, fahrenheit2celsius(avg_of__daily_min_temperature_air_f) as min_temp
from `weathersource-com.pub_weather_data_samples.sample_climatology_great_britain_postalcode_daily_january`;'
- Create a second dataset
https://cloud.google.com/bigquery/docs/authorized-views
When you create the view, it must be created in a dataset separate from the source data queried by the view. Because you can assign access controls only at the dataset level, if the view is created in the same dataset as the source data, your users would have access to both the view and the data.
bq mk pri_weather_data_samples
- Create a view in the new dataset
select postal_code, min_temp
from `playground-s-11-60d131.pub_weather_data_samples.sample_climatology_great_britain_postalcode_daily_january`
where postal_code like 'A%';
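A sketch of creating that view via the CLI (the view name a_postcodes_view is an assumption; the dataset is the pri_weather_data_samples dataset created above):
bq mk --use_legacy_sql=false \
  --view='select postal_code, min_temp from `playground-s-11-60d131.pub_weather_data_samples.sample_climatology_great_britain_postalcode_daily_january` where postal_code like "A%"' \
  pri_weather_data_samples.a_postcodes_view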
- Update the IAM settings for the dataset, thereby adding access control to the view contained within it.
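One possible CLI flow, per the authorized-views doc above: dump the source dataset's metadata, add the view under "access", and push the update back (project and view names are assumptions):
# Export the source dataset's current access entries
bq show --format=prettyjson {PROJECT}:pub_weather_data_samples > dataset.json
# Edit dataset.json and append to the "access" array an entry such as:
#   {"view": {"projectId": "{PROJECT}", "datasetId": "pri_weather_data_samples", "tableId": "a_postcodes_view"}}
# Apply the updated access list to the source dataset
bq update --source dataset.json {PROJECT}:pub_weather_data_samples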
bq mk {dataset}
bq mk {dataset}.{table}
bq mk --location asia-east2 --dataset --default_table_expiration 7200 --default_partition_expiration 3600 pub_weather_data_samples
bq rm {dataset}
bq rm {dataset}.{table}
bq update --expiration {seconds} {dataset}.{table}
bq update --schema name:STRING {dataset}.{table}
bq show --schema {dataset}.{table}
bq show {dataset}.{table}
bq show --format prettyjson weathersource-com:pub_weather_data_samples
# Useful for finding the dataset's location, e.g. when copying data to a different place.
NOTE: Currently, you can load data into BigQuery only from Cloud Storage or a readable data source (such as your local machine).
bq load --autodetect {dataset}.{table} gs://{bucket_name}/{data_file}
bq load --autodetect --time_partitioning_type DAY {dataset}.{table} gs://{bucket_name}/{data_file}
bq load --source_format={CSV|NEWLINE_DELIMITED_JSON|DATASTORE_BACKUP|AVRO|PARQUET|ORC} {dataset}.{table} gs://{bucket_name}/{data_file} {schema_file}
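Per the note above, a readable local file can be loaded the same way; a sketch with a hypothetical CSV and an inline schema:
bq load --source_format=CSV --skip_leading_rows=1 {dataset}.{table} ./data.csv name:STRING,age:INTEGER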
- Create data file containing JSON records (one record per line)
# cat data.json
{"name": "John","children":[{"name": "adamn"},{"name": "brian"}]}
- Upload file to bucket
gsutil mb gs://${DEVSHELL_PROJECT_ID}
gsutil cp data.json gs://${DEVSHELL_PROJECT_ID}
- Upload to BigQuery
bq mk testset
bq mk testset.testtable
bq load --autodetect --source_format NEWLINE_DELIMITED_JSON testset.testtable gs://${DEVSHELL_PROJECT_ID}/data.json
- Verify schema
bq show --schema --format=prettyjson testset.testtable
Sample output:
[
{
"fields": [
{
"mode": "NULLABLE",
"name": "name",
"type": "STRING"
}
],
"mode": "REPEATED",
"name": "children",
"type": "RECORD"
},
{
"mode": "NULLABLE",
"name": "name",
"type": "STRING"
}
]
bq query 'select _PARTITIONTIME as pt, string_field_0 col1, string_field_1 col2, string_field_2 col3 from mydataset.sample'
# Define file schema
cat fileSchema.json
[
{
"mode": "REQUIRED",
"name": "AUTHOR",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "TITLE",
"type": "STRING"
}
]
# Link cloud storage file to BigQuery - CSV Data format
bq mk --external_table_definition fileSchema.json@CSV=gs://{BUCKET}/{FILE} {DATASET}.{TABLE}
bq mk --expiration {seconds} {dataset}.{table}
bq mk --time_partitioning_type DAY --time_partitioning_expiration {seconds} {dataset}.{table}
# Create sample file
for l in {a..z}; do echo "2019-01-03 00:00,${l}1,${l}2,${l}3"; done > sample.dat.2019-01-03
## Load sample file to cloud storage
gsutil mb gs://{bucket_name}
gsutil cp sample.dat.2019-01-03 gs://{bucket_name}
## Create table and specify the partition column
bq mk mydataset
bq mk --time_partitioning_type DAY --time_partitioning_field ts --schema 'ts:TIMESTAMP,col1:STRING,col2:STRING,col3:STRING' mydataset.psample
## Load table
bq load mydataset.psample gs://{bucket_name}/sample.dat.2019-01-03
## Query partition
bq query "SELECT * FROM `{PROJECT}.mydataset.psample` WHERE ts = TIMESTAMP("2019-01-04") LIMIT 10"
https://cloud.google.com/bigquery/docs/working-with-transfers
# Transfer config for adwords
bq mk --transfer_config --target_dataset=company_adwords --display_name="Adwords" --params='{"customer_id":"111-111-1111","exclude_removed_items":"false"}' --data_source=adwords
# Transfer config for cloud storage
bq mk --transfer_config \
--target_dataset=mydataset \
--display_name='My Transfer' \
--data_source=google_cloud_storage --params='{"data_path_template":"gs://raw-data/311data_{run_time-24h|\"%Y%m%d\"}.json","destination_table_name_template":"DATA","file_format":"JSON"}'
bq ls --transfer_config --transfer_location {location, e.g. us}
https://cloud.google.com/bigquery/docs/cloud-storage-transfer#setting_up_a_cloud_storage_transfer
# Create cloud storage bucket
gsutil mb gs://${DEVSHELL_PROJECT_ID}
# Create dataset 'bq_transfer'
bq mk bq_transfer
# Create table 'book' with schema
bq mk bq_transfer.book Author:string,Title:string
# Setup transfer configuration
bq mk --transfer_config \
--target_dataset=bq_transfer \
--display_name='book transfer' \
--data_source=google_cloud_storage --params='{"data_path_template":"gs://{BUCKET_NAME}/*.csv","destination_table_name_template":"book","file_format":"CSV"}'
# Follow the prompt to register an authentication code
# Verify the configuration is set up and note the transfer configuration name
bq ls --transfer_config --transfer_location us
# Remove transfer configuration (name obtained from configuration listing)
bq rm --transfer_config {TRANSFER_CONFIG_NAME}
https://cloud.google.com/bigquery/docs/share-access-views
Create a "Author" view from table "books" that has schema "Author:string,Title:string"
- Create view
bq mk --use_legacy_sql=false --view='SELECT Author FROM `{PROJECT}.bq_transfer.book`' bq_transfer.author_view
- Execute query to confirm view created successfully
bq query --nouse_legacy_sql 'SELECT Author FROM `{PROJECT}.bq_transfer.author_view` LIMIT 1000'
https://cloud.google.com/bigquery/streaming-data-into-bigquery
- Data can take up to 90 minutes to become available for copy and export operations.
- Use the insertId property to de-duplicate data on a best-effort basis
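For a quick test from the CLI, bq insert streams newline-delimited JSON rows through the streaming API (a sketch reusing testset.testtable and data.json from earlier; per-row insertId de-duplication is a property of the underlying tabledata.insertAll call rather than a bq flag):
# Stream the newline-delimited JSON rows into the table (intended for testing only)
bq insert testset.testtable data.json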
Src: https://panoply.io/data-warehouse-guide/bigquery-architecture/
- BigQuery clients (typically the BigQuery web UI, the bq command-line tool, or the REST API) interact with the Dremel engine via a client interface.
- Borg allocates the compute capacity for the Dremel jobs.
- Dremel jobs read data from Google’s Colossus file systems using Jupiter network, perform various SQL operations and return results to the client.
- BigQuery stores data in a columnar format known as Capacitor.
- Each field of a BigQuery table, i.e. each column, is stored in a separate Capacitor file, which enables BigQuery to achieve a very high compression ratio and scan throughput.
- BigQuery encodes every column separately into Capacitor format. Once all column data is encoded, it’s written back to Colossus
- Colossus handles cluster-wide replication, recovery and distributed management.
- When writing data to Colossus, BigQuery makes some decisions about the initial sharding strategy, which evolves based on the query and access patterns.
Colossus allows splitting of the data into multiple partitions to enable blazing fast parallel reads, whereas Capacitor reduces the required scan throughput.
Queries against the following external data sources:
- Google Cloud Bigtable
- Google Cloud Storage
- Google Drive
For these sources, BigQuery performs on-the-fly loading of data into the Dremel engine.
- Dremel engine uses a multi-level serving tree for scaling out SQL queries.
- The root server receives incoming queries from clients and routes the queries to the next level
- Leaf nodes read the data from Colossus and perform filtering and partial aggregation
- To parallelize the query, each serving level (root server and mixers) performs query rewriting:
- Query is modified to include horizontal partitions of the table, i.e. shards
- Certain SQL clause can be stripped out before sending to leaf nodes
- Each leaf node provides execution threads, or processing units, often called slots
- All key-value pairs related to a specific key are co-located on a particular worker node, which then performs the grouping.
- Key-value pairs need to be physically moved between processing nodes in a process called a "shuffle."
- The shuffle takes advantage of Google's Jupiter network to move data extremely rapidly from one place to another
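As an illustration, any aggregation keyed on a column forces such a shuffle; a sketch against the hypothetical person table used in the partitioning examples below:
-- Grouping by birth_place co-locates all rows sharing a key on one worker before aggregating
SELECT birth_place, COUNT(*) AS people
FROM person
GROUP BY birth_place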
Src: https://medium.com/@doctusoft/how-to-boost-bigquery-performance-33e952971760
- Divide datasets into smaller parts using frequently filtered attributes.
- BQ is a columnar database without indexes so it always does a full table scan on the required columns.
- If the dataset is broken down into smaller parts, BQ has less data to read, which leads to faster and cheaper queries.
Available options:
- Sharded tables
- Ingestion-time partitioned tables
- Partitioned tables
Two types of table partitioning in BigQuery:
- Tables partitioned by ingestion time: Tables partitioned based on the data's ingestion (load) date or arrival date.
SELECT name, birth_place
FROM person
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2001-09-24') AND TIMESTAMP('2001-09-25')
- Partitioned tables: Tables that are partitioned based on a TIMESTAMP or DATE column.
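For a column-partitioned table the filter goes on the partitioning column itself rather than _PARTITIONTIME; a sketch assuming person is partitioned on a DATE column named birth_date:
SELECT name, birth_place
FROM person
WHERE birth_date BETWEEN '2001-09-24' AND '2001-09-25'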
- Shard tables using a time-based naming approach such as [PREFIX]_YYYYMMDD (i.e. date-sharded tables)
SELECT name, birth_place
FROM person_20010924, person_20010925
- Partitioned tables perform better than tables sharded by date.
- Trim the data as early in the query as possible so that shuffling is applied to a limited data set
- Avoid scanning too much or too frequently i.e. partitioning tables by date
- Query only the columns that are needed
- Denormalize data to take full advantage of nested and repeated fields offered by BigQuery data structure
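As an example of the nested and repeated point, a sketch querying the children RECORD from the testset.testtable schema shown earlier:
-- Flatten the repeated children RECORD in place instead of joining a second table
SELECT name AS parent, child.name AS child
FROM testset.testtable, UNNEST(children) AS child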
https://cloud.google.com/solutions/migration/dw2bq/dw-bq-migration-overview
- Offload migration: a migration strategy that aims either to get the use case working for the end user in the new environment as quickly as possible, or to take advantage of extra capacity available in the new environment. Use cases are offloaded by copying and then synchronizing the schema and data from the legacy data warehouse, and then migrating the downstream scripts, procedures, and business applications. Migration offloading can increase the complexity and work involved in migrating data pipelines.
- Full migration: an approach similar to an offload migration, but instead of copying and then synchronizing the schema and data, you configure the migration to ingest data directly into the new cloud data warehouse from the upstream source systems. In other words, the data pipelines required for the use case are also migrated.