GCP BigQuery

Queries

Google Cloud Public Datasets

Arrays

select * from unnest(['a', 'b', 'c'])
Row f0_
1 a
2 b
3 c
select ['a', 'b', 'c'];
Row f0_
1 a
b
c
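
To flatten an array into one row per element while keeping its position, UNNEST can be combined with WITH OFFSET; a minimal sketch:

select x, off
from unnest(['a', 'b', 'c']) as x with offset as off
order by off;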

Metadata queries

List tables in a given dataset

SELECT * 
FROM `bigquery-samples.airline_ontime_data.__TABLES__`
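
Standard SQL also exposes table metadata through INFORMATION_SCHEMA; a minimal sketch listing tables in the same dataset (assuming the dataset permits INFORMATION_SCHEMA queries):

SELECT table_name, table_type, creation_time
FROM `bigquery-samples.airline_ontime_data.INFORMATION_SCHEMA.TABLES`;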

Count distinct options

  • COUNT(DISTINCT field) - a statistical approximation; not guaranteed to be exact (see the standard SQL sketch below)
  • COUNT(DISTINCT field, n) - forces BigQuery to return an exact count for any value at or below the threshold n
  • EXACT_COUNT_DISTINCT(field) - returns the exact quantity
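
The behaviour above applies to legacy SQL; in standard SQL, COUNT(DISTINCT field) is always exact and APPROX_COUNT_DISTINCT(field) provides the cheaper statistical approximation. A minimal sketch against a hypothetical events table:

-- Table and column names are hypothetical, for illustration only.
SELECT
  COUNT(DISTINCT user_id)        AS exact_users,   -- always exact in standard SQL
  APPROX_COUNT_DISTINCT(user_id) AS approx_users   -- statistical approximation
FROM `myproject.mydataset.events`;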

Analytical Functions

https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts

analytic_function_name ( [ argument_list ] )
  OVER (
    [ PARTITION BY partition_expression_list ]
    [ ORDER BY expression [{ ASC | DESC }] [, ...] ]
    [ window_frame_clause ]
  )
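
As an illustration of the syntax above, a sketch that ranks rows by departure delay within each airline (table and column names are assumptions):

-- Hypothetical table, for illustration only.
SELECT
  airline,
  flight_number,
  departure_delay,
  RANK() OVER (
    PARTITION BY airline
    ORDER BY departure_delay DESC
  ) AS delay_rank
FROM `myproject.mydataset.flights`;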

Performance Optimization

https://cloud.google.com/bigquery/docs/best-practices-performance-patterns

Miscellaneous SQL Examples

Access Control

Three types of resources

  1. organisations
  2. projects
  3. datasets

Authorized views

  1. Create dataset in local project
bq mk pub_weather_data_samples
  2. Construct the new SQL and save the result to a destination table

Console
# User defined function used in the new query (converts Fahrenheit to Celsius)
create temp function fahrenheit2celsius(x FLOAT64) as ((x-32)*5/9);
select postal_code, fahrenheit2celsius(avg_of__daily_min_temperature_air_f) as min_temp
from `weathersource-com.pub_weather_data_samples.sample_climatology_great_britain_postalcode_daily_january`;

CLI

bq --location=us query \
--destination_table playground-s-11-ed0924:pub_weather_data_samples_us.sample_climatology_great_britain_postalcode_daily_january \
--use_legacy_sql=false 'create temp function fahrenheit2celsius(x FLOAT64) as ((x-32)*5/9);
select postal_code, fahrenheit2celsius(avg_of__daily_min_temperature_air_f) as min_temp
from `weathersource-com.pub_weather_data_samples.sample_climatology_great_britain_postalcode_daily_january`;'
  3. Create a second dataset

https://cloud.google.com/bigquery/docs/authorized-views

When you create the view, it must be created in a dataset separate from the source data queried by the view. Because you can assign access controls only at the dataset level, if the view is created in the same dataset as the source data, your users would have access to both the view and the data.

bq mk pri_weather_data_samples
  4. Create a view in the new dataset
select postal_code, min_temp
from `playground-s-11-60d131.pub_weather_data_samples.sample_climatology_great_britain_postalcode_daily_january`
where postal_code like 'A%';
  5. Update the IAM settings for the dataset, thereby adding access control to the view contained within.

Command Lines

Create data set and table

bq mk {dataset}
bq mk {dataset}.{table}

bq mk --location asia-east2 --dataset --default_table_expiration 7200 --default_partition_expiration 3600 pub_weather_data_samples

Delete data set and table

bq rm {dataset}
bq rm {dataset}.{table}

Update table definition

bq update --expiration {seconds} {dataset}.{table}
bq update --schema name:STRING {dataset}.{table}

Describe table

bq show --schema {dataset}.{table}
bq show {dataset}.{table}

bq show --format prettyjson weathersource-com:pub_weather_data_samples
# Useful for finding location for copying data to a different place.

Loading data

NOTE: Currently, you can load data into BigQuery only from Cloud Storage or a readable data source (such as your local machine).

bq load --autodetect {dataset}.{table} gs://{bucket_name}/{data_file}
bq load --autodetect --time_partitioning_type DAY {dataset}.{table} gs://{bucket_name}/{data_file}

bq load --source_format={CSV|NEWLINE_DELIMITED_JSON|DATASTORE_BACKUP|AVRO|PARQUET|ORC} {dataset}.{table} gs://{bucket_name}/{data_file} {schema_file}

Repeated Values

  1. Create data file containing JSON records (one record per line)
# cat data.json
{"name": "John","children":[{"name": "adamn"},{"name": "brian"}]}
  2. Upload file to bucket
gsutil mb gs://${DEVSHELL_PROJECT_ID}
gsutil cp data.json gs://${DEVSHELL_PROJECT_ID}
  3. Upload to BigQuery
bq mk testset
bq mk testset.testtable
bq load --autodetect --source_format NEWLINE_DELIMITED_JSON testset.testtable gs://${DEVSHELL_PROJECT_ID}/data.json
  4. Verify schema
bq show --schema --format=prettyjson testset.testtable

Sample output:

[
  {
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "name",
        "type": "STRING"
      }
    ],
    "mode": "REPEATED",
    "name": "children",
    "type": "RECORD"
  },
  {
    "mode": "NULLABLE",
    "name": "name",
    "type": "STRING"
  }
]
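
The REPEATED children field can be flattened back into one row per child with UNNEST (run with --nouse_legacy_sql when using the bq CLI); a minimal sketch against the table loaded above:

SELECT t.name AS parent, child.name AS child
FROM testset.testtable AS t, UNNEST(t.children) AS child;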

Query table

bq query 'select _PARTITIONTIME as pt, string_field_0 as ol1, string_field_1 as ol2, string_field_2 as ol3 from mydataset.sample'

Federated Queries - enable queries on files located in Cloud Storage

# Define file schema
cat fileSchema.json
[
      {
        "mode": "REQUIRED",
        "name": "AUTHOR",
        "type": "STRING"
      },
      {
        "mode": "REQUIRED",
        "name": "TITLE",
        "type": "STRING"
      }
]

# Link cloud storage file to BigQuery - CSV Data format
bq mk --external_table_definition fileSchema.json@CSV=gs://{BUCKET}/{FILE} {DATASET}.{TABLE}

Expiration

Table expiration

bq mk --expiration {seconds} {dataset}.{table}

Partition expiration

bq mk --time_partitioning_type DAY --time_partitioning_expiration {seconds} {dataset}.{table}

Query partitioned table

# Create sample file
for l in {a..z}; do echo "2019-01-03 00:00,${l}1,${l}2,${l}3"; done > sample.dat.2019-01-03

## Load sample file to cloud storage
gsutil mb gs://{bucket_name}
gsutil cp sample.dat.2019-01-03 gs://{bucket_name}

## Create table and specify the partition column
bq mk mydataset
bq mk --time_partitioning_type DAY --time_partitioning_field ts --schema 'ts:TIMESTAMP,col1:STRING,col2:STRING,col3:STRING' mydataset.psample

## Load table
bq load mydataset.psample gs://{bucket_name}/sample.dat.2019-01-03

## Query partition
bq query --nouse_legacy_sql 'SELECT * FROM `{PROJECT}.mydataset.psample` WHERE ts = TIMESTAMP("2019-01-03") LIMIT 10'

BigQuery Data Transfer Service

https://cloud.google.com/bigquery/docs/working-with-transfers

Create data transfer configuration

# Transfer config for adwords
bq mk --transfer_config --target_dataset=company_adwords --display_name="Adwords" --params='{"customer_id":"111-111-1111","exclude_removed_items":"false"}' --data_source=adwords

# Transfer config for cloud storage
bq mk --transfer_config \
--target_dataset=mydataset \
--display_name='My Transfer' \
--data_source=google_cloud_storage --params='{"data_path_template":"gs://raw-data/311data_{run_time-24h|\"%Y%m%d\"}.json","destination_table_name_template":"DATA","file_format":"JSON"}'

List transfer configurations

bq ls --transfer_config --transfer_location {location, e.g. us}

Sample setup for transferring data from cloud storage to BigQuery via data transfer service

https://cloud.google.com/bigquery/docs/cloud-storage-transfer#setting_up_a_cloud_storage_transfer

# Create cloud storage bucket
gsutil mb gs://${DEVSHELL_PROJECT_ID}

# Create dataset 'bq_transfer'
bq mk bq_transfer

# Create table 'book' with schema
bq mk bq_transfer.book Author:string,Title:string

# Setup transfer configuration
bq mk --transfer_config \
--target_dataset=bq_transfer \
--display_name='book transfer' \
--data_source=google_cloud_storage --params='{"data_path_template":"gs://{BUCKET_NAME}/*.csv","destination_table_name_template":"book","file_format":"CSV"}'

# Follow the prompt to register an authentication code

# Verify the configuration is set up and note the transfer configuration name
bq ls --transfer_config --transfer_location us

# Remove transfer configuration (name obtained from configuration listing)
bq rm --transfer_config {TRANSFER_CONFIG_NAME}

Authorized View

Create authorized view

https://cloud.google.com/bigquery/docs/share-access-views

Create an "Author" view from the table "book", which has the schema "Author:string,Title:string"

  1. Create view
bq mk --use_legacy_sql=false --view='SELECT Author FROM `{PROJECT}.bq_transfer.book`' bq_transfer.author_view
  2. Execute a query to confirm the view was created successfully
bq query --nouse_legacy_sql 'SELECT Author FROM `{PROJECT}.bq_transfer.author_view`  LIMIT 1000'

Streaming Data

https://cloud.google.com/bigquery/streaming-data-into-bigquery

  • Data can take up to 90 minutes to become available for copy and export operations.
  • Use the insertId property to de-duplicate data on a best-effort basis

Architecture

Src: https://panoply.io/data-warehouse-guide/bigquery-architecture/

  1. A BigQuery client (typically the BigQuery web UI, the bq command-line tool, or the REST APIs) interacts with the Dremel engine via a client interface.
  2. Borg allocates the compute capacity for the Dremel jobs.
  3. Dremel jobs read data from Google's Colossus file systems over the Jupiter network, perform various SQL operations, and return results to the client.

Capacitor and Colossus

  • BigQuery stores data in a columnar format known as Capacitor.
  • Each field of a BigQuery table, i.e. each column, is stored in a separate Capacitor file, which enables BigQuery to achieve a very high compression ratio and scan throughput.
  • BigQuery encodes every column separately into Capacitor format. Once all column data is encoded, it is written back to Colossus.
  • Colossus handles cluster-wide replication, recovery and distributed management.
  • When writing data to Colossus, BigQuery makes some decisions about the initial sharding strategy, which evolves based on query and access patterns.

Colossus allows the data to be split into multiple partitions for blazingly fast parallel reads, whereas Capacitor reduces the required scan throughput.

Federated data source

Queries against the following:

  • Google Cloud Bigtable
  • Google Cloud Storage
  • Google Drive

BigQuery performs on-the-fly loading of data into the Dremel engine

Query Execution

  • The Dremel engine uses a multi-level serving tree for scaling out SQL queries.
  • The root server receives incoming queries from clients and routes them to the next level.
  • Leaf nodes read the data from Colossus and perform filters and partial aggregation.
    • To parallelize the query, each serving level (root and mixers) performs query rewriting:
      1. The query is modified to include horizontal partitions of the table, i.e. shards.
      2. Certain SQL clauses can be stripped out before the query is sent to the leaf nodes.
  • Each leaf node provides execution threads, or processing units, often called slots.

Shuffling

https://cloud.google.com/blog/products/data-analytics/how-distributed-shuffle-improves-scalability-and-performance-cloud-dataflow-pipelines

  • Grouping requires co-locating all key-value pairs related to a specific key on a particular worker node, and then performing the grouping there.
  • Key-value pairs need to be physically moved between processing nodes in a process called a "shuffle".
  • Shuffling takes advantage of Google's Jupiter network to move data extremely rapidly from one place to another.

Partitioning

Src: https://medium.com/@doctusoft/how-to-boost-bigquery-performance-33e952971760

  • Divide datasets into smaller parts based on frequently filtered attributes.
  • BQ is a columnar database without indexes, so it always does a full scan of the required columns.
  • If the dataset is broken down into smaller parts, BQ has less data to read, which leads to faster and cheaper queries.

Available options:

  1. Sharded tables
  2. Ingestion-time partitioned tables
  3. Partitioned tables

Table Partitioning

Two types of table partitioning in BigQuery:

  1. Tables partitioned by ingestion time: Tables partitioned based on the data's ingestion (load) date or arrival date.
SELECT name, birth_place
FROM person
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2001-09-24') AND TIMESTAMP('2001-09-25')
  2. Partitioned tables: Tables that are partitioned based on a TIMESTAMP or DATE column (see the sketch below).
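
A minimal sketch for the second case, assuming person is partitioned on a hypothetical birth_date DATE column:

SELECT name, birth_place
FROM person
WHERE birth_date BETWEEN DATE '2001-09-24' AND DATE '2001-09-25'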

Sharding

  • Shard tables using a time-based naming approach such as [PREFIX]_YYYYMMDD (i.e. date-sharded tables)
SELECT name, birth_place
FROM person_20010924, person_20010925

Partitioning versus Sharding

  • Partitioned tables perform better than tables sharded by date.

Optimization

  • Trim the data as early in the query as possible so that shuffling is applied to a limited data set
  • Avoid scanning too much data or scanning too frequently, e.g. by partitioning tables by date
  • Query only the columns that are needed (see the sketch below)
  • Denormalize data to take full advantage of the nested and repeated fields offered by the BigQuery data structure
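
Putting these together, a sketch that reads only the needed columns from a single day's partition before aggregating (table and column names are assumptions):

-- Hypothetical date-partitioned table, for illustration only.
SELECT airline, AVG(departure_delay) AS avg_delay
FROM `myproject.mydataset.flights`
WHERE flight_date = DATE '2019-01-03'  -- partition filter trims the scan early
GROUP BY airline;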

Data Warehouse migration

https://cloud.google.com/solutions/migration/dw2bq/dw-bq-migration-overview

Offload migration

A migration strategy that aims either to get the use case working for the end user in the new environment as quickly as possible, or to take advantage of extra capacity available in the new environment. Use cases are offloaded by doing the following:

  • Copying and then synchronizing the schema and data from the legacy data warehouse.
  • Migrating the downstream scripts, procedures, and business applications.

Migration offloading can increase the complexity and work involved in migrating data pipelines.

Full migration

A migration approach like an offload migration, but instead of copying and then synchronizing the schema and data, you configure the migration to ingest data directly into the new cloud data warehouse from the upstream source systems. In other words, the data pipelines required for the use case are also migrated.

