GCS BigQuery
Data in a BigQuery table is encrypted with a data encryption key, and those data encryption keys are themselves encrypted with key encryption keys. This is known as envelope encryption and it provides additional security. BigQuery also supports customer-managed encryption keys (CMEK), which let you encrypt tables with your own keys instead of Google-managed ones. Inside BigQuery, you can monitor your team's BigQuery usage and running queries, and proactively limit access to data at the row and column level.
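As a hedged sketch (the table name and key resource path below are placeholders), a new table can be placed under a customer-managed key via the kms_key_name table option:
-- Sketch: create a table protected by a customer-managed Cloud KMS key.
-- The BigQuery service account must have Encrypter/Decrypter access to the key.
CREATE TABLE mydataset.cmek_table (
  id INT64,
  payload STRING
)
OPTIONS (
  kms_key_name = 'projects/my-project/locations/us/keyRings/my-ring/cryptoKeys/my-key'
);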
BigQuery is both a fast SQL query engine and a fully managed storage layer for loading and storing your datasets. The two services are connected by Google's high-speed internal network; this network is what lets BigQuery separate compute from storage. The storage service automatically manages the data that you ingest into the platform. Data is contained within a project, in datasets, which can hold zero or more tables or views. Tables are stored as highly compressed columns in Google's internal Colossus file system, which provides durability and global availability. The storage service handles both bulk ingestion and streaming ingestion, so it works with huge batches of data as well as real-time data streams. The storage service and the query service work together to organize the data internally so that your queries run efficiently on terabytes and petabytes; they even optimize your SQL syntax where possible.
In the BigQuery UI, if you hold down the Command key (on a Mac) or the Windows/Ctrl key, all the dataset and table names in your query are highlighted. This is useful because you can then click on any of them: if your query references several datasets, or you inherited a query from somebody else, you can quickly jump to the schema, details, and preview of that particular table.
You can use BigQuery to directly query data in Cloud SQL, Google's managed relational database service for PostgreSQL, MySQL, and SQL Server. You can also use BigQuery to directly query files in Cloud Storage, as long as those files are in supported formats such as CSV or Apache Avro.
You can set up scheduled queries in SQL by using the @run_time parameter, or use the query scheduler in the BigQuery UI.
Set up very granular controls over the columns and rows of your data in BigQuery using the Data Catalog service and advanced BigQuery features such as authorized views.
You can stream records into BigQuery via the API. Note that there are a few quota restrictions you should be aware of: the maximum row size for a streaming insert is 1 MB, and the maximum throughput is 100,000 records per second per project. If you need higher throughput, on the order of millions of records per second, for use cases such as application logging or real-time event tracking, consider using Cloud Bigtable as the data sink instead. Also, before you start streaming thousands of records into BigQuery through the API, consider the other options for your streaming solution. If your data needs to be transformed or aggregated mid-flight into table-and-row format, joined against other data sources as side inputs midstream, or windowed into segments, you should consider using Cloud Dataflow for your streaming data pipeline.
Access control is through Cloud IAM (Organization -> Project -> Dataset). Access control applies to datasets, not to individual tables or columns. To query a dataset, a user must be able to run a query job (a GRANT sketch follows the list below).
- Admin activity
- System events
- Data access
- User account
- Service account
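Dataset-level access can also be granted with BigQuery's SQL GRANT statement. A minimal sketch, assuming a dataset named mydataset and a hypothetical user:
-- Sketch: give one user read access to the tables in the dataset.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA mydataset
TO "user:alice@example.com";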
BigQuery Datasets can be regional or multi-regional
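For example, a minimal sketch (the dataset name is a placeholder) of creating a dataset pinned to a chosen location:
-- Sketch: create a dataset in the EU multi-region.
CREATE SCHEMA IF NOT EXISTS mydataset
OPTIONS (
  location = 'EU'  -- a single region such as 'europe-west2' also works
);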
There is a limit to the size of data that can be loaded directly into BigQuery, and your network bandwidth can become a bottleneck. Rather than loading data directly into BigQuery, it is often more convenient to first stage it in Google Cloud Storage as your data lake, and then load it from Cloud Storage through a pipeline into BigQuery as your data warehouse. Cloud Storage supports multithreaded, resumable uploads; if you are using gsutil, you can simply provide the -m option. Loading data from Cloud Storage is also much faster because of the high throughput it offers. A more recently released feature lets BigQuery query data files that live in Google Cloud Storage directly, without first loading them into BigQuery's native storage. This is called a federated query, or creating an external data source connection. A popular use case is having data files in formats such as CSV, Avro, or the newly supported Parquet and Apache ORC, and querying them directly from BigQuery.
An external data source is a data source that you can query directly from BigQuery even though the data is not stored in BigQuery storage. BigQuery supports several kinds of external data sources.
An external table is a table that acts like a standard BigQuery table. The table metadata, including the table schema, is stored in BigQuery storage, but the data itself resides in the external source.
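A hedged sketch of defining an external table over CSV files in Cloud Storage (bucket path, table name, and columns are placeholders):
-- Sketch: the table metadata lives in BigQuery, the data stays in Cloud Storage.
CREATE EXTERNAL TABLE mydataset.ext_sales (
  sale_id INT64,
  sale_date DATE,
  amount NUMERIC
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/sales/*.csv'],
  skip_leading_rows = 1  -- skip the header row
);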
A federated query is a way to send a query statement to an external database and get the result back as a temporary table. Federated queries use the BigQuery Connection API to establish a connection with the external database. In your standard SQL query, you use the EXTERNAL_QUERY function to send a query statement to the external database, using that database's SQL dialect. The results are converted to BigQuery standard SQL data types.
You can use federated queries with external databases such as Cloud SQL.
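A minimal federated-query sketch, assuming a Cloud SQL connection resource (its name here is made up) has already been created through the BigQuery Connection API:
-- Sketch: EXTERNAL_QUERY runs the inner statement in the external database's
-- own SQL dialect and returns the result as a BigQuery table.
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my_cloudsql_connection',
  'SELECT customer_id, created_at FROM customers;'
);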
Best practice: Configure the default table expiration for your datasets, configure the expiration time for your tables, and configure the partition expiration for partitioned tables.
You can set a default table expiration time at the dataset level, or you can set a table's expiration time when the table is created. If you set the expiration when the table is created, the dataset's default table expiration is ignored.
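A hedged sketch of setting an expiration on an existing table (the table name is a placeholder), which overrides the dataset's default table expiration:
-- Sketch: make this table expire 7 days from now.
ALTER TABLE mydataset.mytable
SET OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
);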
bq update --default_table_expiration integer project_id:dataset
bq update --default_table_expiration 7200 mydataset
bq update --default_table_expiration 7200 myotherproject:mydataset
ALTER SCHEMA mydataset
SET OPTIONS(
default_table_expiration_days=3.75
)
bq update --default_partition_expiration integer project_id:dataset
bq update --default_partition_expiration 93600 mydataset
bq update --default_partition_expiration 93600 myotherproject:mydataset
ALTER SCHEMA mydataset
SET OPTIONS(
default_partition_expiration_days=3.75
)
If a table is not edited for 90 consecutive days, the price of storage for that table automatically drops by 50 percent. Each partition of a partitioned table is considered separately for long-term storage pricing: if a partition hasn't been modified in the last 90 days, the data in that partition is considered long-term storage and is charged at the discounted price.
BigQuery organizes data into containers called datasets. Datasets function like top-level folders that organize and control access to the underlying tables. IAM can restrict access to datasets and tables; to restrict access to only part of a table, create an authorized view. Steps (a SQL sketch of the first steps follows the list):
- Create a dataset to hold the view: authorized views require the source data (table) to sit in a separate dataset from the view
- Create a view that exposes only the needed data
- Assign access control at the project level: to query the view, users need permission to run query jobs, so assign the BigQuery User role
- Assign access control on the dataset containing the view: users need the BigQuery Data Viewer role on that dataset
- Authorize the view to access the source dataset
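A minimal sketch of the first two steps, assuming a hypothetical source table myproject.source_data.transactions (all names are placeholders). The final authorization of the view on the source dataset is done in the console, through the API, or with bq update on the source dataset's access list:
-- Sketch: a separate dataset for shared views, plus a view that exposes
-- only the needed columns from the source table.
CREATE SCHEMA IF NOT EXISTS shared_views;
CREATE OR REPLACE VIEW shared_views.vw_transactions_slim AS
SELECT transaction_id, transaction_date
FROM `myproject.source_data.transactions`;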
Set the query dialect when you query BigQuery data:
- standard SQL (In the Cloud Console and the client libraries, standard SQL is the default.)
- legacy SQL dialect (In the classic BigQuery web UI, the bq command-line tool, and the REST API, legacy SQL is the default.)
The following prefixes let you set the SQL dialect:
Prefix | Description
---|---
#legacySQL | Runs the query using legacy SQL
#standardSQL | Runs the query using standard SQL
Example:
#legacySQL
SELECT
weight_pounds, state, year, gestation_weeks
FROM
[bigquery-public-data:samples.natality]
ORDER BY weight_pounds DESC
LIMIT 10;
bq query \
--use_legacy_sql=false \
'SELECT
word
FROM
`bigquery-public-data.samples.shakespeare`'
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
query = """
SELECT name, SUM(number) as total_people
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = 'TX'
GROUP BY name, state
ORDER BY total_people DESC
LIMIT 20
"""
query_job = client.query(query) # Make an API request.
print("The query data:")
for row in query_job:
    # Row values can be accessed by field name or index.
    print("name={}, count={}".format(row[0], row["total_people"]))
Previously, BigQuery executed queries using a non-standard SQL dialect known as BigQuery SQL. With the launch of BigQuery 2.0, BigQuery released support for standard SQL, and renamed BigQuery SQL to legacy SQL. Standard SQL is the preferred SQL dialect for querying data stored in BigQuery.
Setting standard SQL as the default for the command-line tool: set --use_legacy_sql=false in .bigqueryrc by adding the following text to the file. This example sets standard SQL as the default syntax for queries and for the mk command (used when you create a view):
[query]
--use_legacy_sql=false
[mk]
--use_legacy_sql=false
https://cloud.google.com/bigquery/docs/partitioned-tables
You can partition BigQuery tables by:
- Time-unit column: Tables are partitioned based on a TIMESTAMP, DATE, or DATETIME column in the table.
CREATE TABLE
mydataset.newtable (transaction_id INT64, transaction_date DATE)
PARTITION BY
transaction_date
OPTIONS(
partition_expiration_days=3,
require_partition_filter=true
)
CREATE TABLE
mydataset.newtable (transaction_id INT64, transaction_date DATE)
PARTITION BY
DATE_TRUNC(transaction_date, MONTH)
OPTIONS(
partition_expiration_days=3,
require_partition_filter=true
)
SELECT * FROM dataset.table
WHERE transaction_date >= '2016-01-01'
- Ingestion time: Tables are partitioned based on the timestamp when BigQuery ingests the data.
CREATE TABLE
mydataset.newtable (transaction_id INT64)
PARTITION BY
_PARTITIONDATE
OPTIONS(
partition_expiration_days=3,
require_partition_filter=true
)
-- The default partitioning type for ingestion-time partitioning is daily partitioning. To specify a different partitioning type, include the DATE_TRUNC function in the PARTITION BY clause.
CREATE TABLE
mydataset.newtable (transaction_id INT64)
PARTITION BY
DATE_TRUNC(_PARTITIONTIME, MONTH)
OPTIONS(
partition_expiration_days=3,
require_partition_filter=true
)
SELECT
column
FROM
dataset.table
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')
- Integer range: Tables are partitioned based on an integer column.
CREATE TABLE mydataset.newtable (customer_id INT64, date1 DATE)
PARTITION BY
RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100, 10))
OPTIONS(
require_partition_filter=true
)
SELECT * FROM dataset.table
WHERE customer_id BETWEEN 30 AND 50
You cannot use legacy SQL to query partitioned tables or to write query results to partitioned tables.
- Create a new dataset ecommerce
#standardSQL
CREATE OR REPLACE TABLE ecommerce.partition_by_day
PARTITION BY date_formatted
OPTIONS(
description="a table partitioned by date"
) AS
SELECT DISTINCT
PARSE_DATE("%Y%m%d", date) AS date_formatted,
fullvisitorId
FROM `data-to-insights.ecommerce.all_sessions_raw`
PARTITION BY option: the column used for partitioning must be a DATE or TIMESTAMP (DATETIME is also supported). The PARSE_DATE function is used on the date field (stored as a string) to convert it into the proper DATE type for partitioning.
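For example:
-- PARSE_DATE converts the STRING '20170801' into the DATE 2017-08-01.
SELECT PARSE_DATE("%Y%m%d", "20170801") AS date_formatted;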
- In the left menu, in Resources, click on Add Data and select Explore public datasets. Search for "GSOD NOAA" then select the dataset. Click on View Dataset.
#standardSQL
SELECT
DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)) AS date,
(SELECT ANY_VALUE(name) FROM `bigquery-public-data.noaa_gsod.stations` AS stations
WHERE stations.usaf = stn) AS station_name, -- Stations may have multiple names
prcp
FROM `bigquery-public-data.noaa_gsod.gsod*` AS weather
WHERE prcp < 99.9 -- Filter unknown values
AND prcp > 0 -- Filter stations/days with no precipitation
AND CAST(_TABLE_SUFFIX AS int64) >= 2018
ORDER BY date DESC -- Where has it rained/snowed recently
LIMIT 10
Note the table wildcard * in the FROM clause, used together with the _TABLE_SUFFIX filter to limit the number of tables referenced.
Note that although a LIMIT 10 was added, this still does not reduce the total amount of data scanned (about 1.83 GB) since there are no partitions yet.
#standardSQL
CREATE OR REPLACE TABLE ecommerce.days_with_rain
PARTITION BY date
OPTIONS (
partition_expiration_days=60,
description="weather stations with precipitation, partitioned by day"
) AS
SELECT
DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)) AS date,
(SELECT ANY_VALUE(name) FROM `bigquery-public-data.noaa_gsod.stations` AS stations
WHERE stations.usaf = stn) AS station_name, -- Stations may have multiple names
prcp
FROM `bigquery-public-data.noaa_gsod.gsod*` AS weather
WHERE prcp < 99.9 -- Filter unknown values
AND prcp > 0 -- Filter
AND CAST(_TABLE_SUFFIX AS int64) >= 2018
SELECT end_station_name FROM `bigquery-public-data.london_bicycles.cycle_hire`;
GCP Project → bigquery-public-data Dataset → london_bicycles Table → cycle_hire
bq show bigquery-public-data:samples.shakespeare
- bq invokes the BigQuery command-line tool
- show is the action
- project:dataset.table names the public dataset and table in BigQuery that you want to see
bq query --use_legacy_sql=false \
'SELECT
word,
SUM(word_count) AS count
FROM
`bigquery-public-data`.samples.shakespeare
WHERE
word LIKE "%raisin%"
GROUP BY
word'
When you create a partitioned table, you can require the use of predicate filters by enabling the Require partition filter option. When this option is applied, attempts to query the partitioned table without specifying a WHERE clause produce the following error:
Cannot query over table 'project_id.dataset.table' without a filter that can be used for partition elimination.
For ingestion-time partitioned tables, use either the _PARTITIONTIME or _PARTITIONDATE pseudo-column.
CREATE TABLE
mydataset.newtable (transaction_id INT64)
PARTITION BY
_PARTITIONDATE
OPTIONS (
partition_expiration_days = 3,
require_partition_filter = TRUE);
-- update partition table to add require_partition_filter
ALTER TABLE mydataset.mypartitionedtable
SET OPTIONS (
require_partition_filter = true);
bq ls # list any existing datasets in your project:
bq ls bigquery-public-data: # list the datasets in that specific project, followed by a colon (:)
bq mk babynames # create a new dataset named babynames in your current project (check it with: gcloud config list project)
wget http://www.ssa.gov/OACT/babynames/names.zip
unzip names.zip
bq load babynames.names2010 yob2010.txt name:string,gender:string,count:integer
bq ls babynames
bq show babynames.names2010
bq query "SELECT name,count FROM babynames.names2010 WHERE gender = 'F' ORDER BY count DESC LIMIT 5"
BigQuery public datasets are not displayed by default in the BigQuery web UI. To open the public datasets project, open https://console.cloud.google.com/bigquery?p=data-to-insights&page=ecommerce in a new browser window.
#standardSQL
SELECT COUNT(*) as num_duplicate_rows, * FROM
`data-to-insights.ecommerce.all_sessions_raw`
GROUP BY
fullVisitorId, channelGrouping, time, country, city, totalTransactionRevenue, transactions, timeOnSite, pageviews, sessionQualityDim, date, visitId, type, productRefundAmount, productQuantity, productPrice, productRevenue, productSKU, v2ProductName, v2ProductCategory, productVariant, currencyCode, itemQuantity, itemRevenue, transactionRevenue, transactionId, pageTitle, searchKeyword, pagePathLevel1, eCommerceAction_type, eCommerceAction_step, eCommerceAction_option
HAVING num_duplicate_rows > 1;
WITH unique_product_views_by_person AS (
-- find each unique product viewed by each visitor
SELECT
fullVisitorId,
(v2ProductName) AS ProductName
FROM `data-to-insights.ecommerce.all_sessions`
WHERE type = 'PAGE'
GROUP BY fullVisitorId, v2ProductName )
-- aggregate the top viewed products and sort them
SELECT
COUNT(*) AS unique_view_count,
ProductName
FROM unique_product_views_by_person
GROUP BY ProductName
ORDER BY unique_view_count DESC
LIMIT 5
#standardSQL
# copy one day of ecommerce data to explore
CREATE OR REPLACE TABLE ecommerce.all_sessions_raw_20170801
#schema
(
fullVisitorId STRING OPTIONS(description="Unique visitor ID"),
channelGrouping STRING OPTIONS(description="Channel e.g. Direct, Organic, Referral...")
)
OPTIONS(
description="Raw data from analyst team into our dataset for 08/01/2017"
) AS
SELECT fullVisitorId, city FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170801' #56,989 records
;
#standardSQL
# copy one day of ecommerce data to explore
CREATE OR REPLACE TABLE ecommerce.all_sessions_raw_20170801
#schema
(
fullVisitorId STRING NOT NULL OPTIONS(description="Unique visitor ID"),
channelGrouping STRING NOT NULL OPTIONS(description="Channel e.g. Direct, Organic, Referral..."),
totalTransactionRevenue INT64 OPTIONS(description="Revenue * 10^6 for the transaction")
)
OPTIONS(
description="Raw data from analyst team into our dataset for 08/01/2017"
) AS
SELECT fullVisitorId, channelGrouping, totalTransactionRevenue FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170801' #56,989 records
;
#standardSQL
# copy one day of ecommerce data to explore
CREATE OR REPLACE TABLE ecommerce.revenue_transactions_20170801
#schema
(
fullVisitorId STRING NOT NULL OPTIONS(description="Unique visitor ID"),
visitId STRING NOT NULL OPTIONS(description="ID of the session, not unique across all users"),
channelGrouping STRING NOT NULL OPTIONS(description="Channel e.g. Direct, Organic, Referral..."),
totalTransactionRevenue FLOAT64 NOT NULL OPTIONS(description="Revenue for the transaction")
)
OPTIONS(
description="Revenue transactions for 08/01/2017"
) AS
SELECT DISTINCT
fullVisitorId,
CAST(visitId AS STRING) AS visitId,
channelGrouping,
totalTransactionRevenue / 1000000 AS totalTransactionRevenue
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170801'
AND totalTransactionRevenue IS NOT NULL #XX transactions
;
Tip: It is often difficult to know whether or not you are SELECTing from a Table or a View by just looking at the name. A simple convention is to prefix the view name with vw_ or add a suffix like _vw or _view.
#standardSQL
CREATE OR REPLACE VIEW ecommerce.vw_latest_transactions
AS
SELECT DISTINCT
date,
fullVisitorId,
CAST(visitId AS STRING) AS visitId,
channelGrouping,
totalTransactionRevenue / 1000000 AS totalTransactionRevenue
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE totalTransactionRevenue IS NOT NULL
ORDER BY date DESC # latest transactions
LIMIT 100
;
#standardSQL
CREATE OR REPLACE VIEW ecommerce.vw_large_transactions
OPTIONS(
description="large transactions for review",
labels=[('org_unit','loss_prevention')]
)
AS
SELECT DISTINCT
date,
fullVisitorId,
visitId,
channelGrouping,
totalTransactionRevenue / 1000000 AS totalTransactionRevenue,
currencyCode,
STRING_AGG(DISTINCT v2ProductName ORDER BY v2ProductName LIMIT 10) AS products_ordered
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE
(totalTransactionRevenue / 1000000) > 1000
AND currencyCode = 'USD'
GROUP BY 1,2,3,4,5,6
ORDER BY date DESC # latest transactions
LIMIT 10
#standardSQL
SELECT
SESSION_USER() AS viewer_ldap; # To view your own session login information
#standardSQL
CREATE OR REPLACE VIEW ecommerce.vw_large_transactions
OPTIONS(
description="large transactions for review",
labels=[('org_unit','loss_prevention')],
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
)
AS
#standardSQL
SELECT DISTINCT
SESSION_USER() AS viewer_ldap,
REGEXP_EXTRACT(SESSION_USER(), r'@(.+)') AS domain,
date,
fullVisitorId,
visitId,
channelGrouping,
totalTransactionRevenue / 1000000 AS totalTransactionRevenue,
currencyCode,
STRING_AGG(DISTINCT v2ProductName ORDER BY v2ProductName LIMIT 10) AS products_ordered
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE
(totalTransactionRevenue / 1000000) > 1000
AND currencyCode = 'USD'
AND REGEXP_EXTRACT(SESSION_USER(), r'@(.+)') IN ('qwiklabs.net') # only show results to users whose login domain is qwiklabs.net
GROUP BY 1,2,3,4,5,6,7,8
ORDER BY date DESC # latest transactions
LIMIT 10;
STRING_AGG - Returns a value (either STRING or BYTES) obtained by concatenating non-null values. If a delimiter is specified, concatenated values are separated by that delimiter; otherwise, a comma is used as a delimiter.
STRING_AGG([DISTINCT] expression [, delimiter] [ORDER BY key [{ASC|DESC}] [, ... ]] [LIMIT n])
[OVER (...)]
Example
SELECT STRING_AGG(fruit) AS string_agg
FROM UNNEST(["apple", NULL, "pear", "banana", "pear"]) AS fruit;
# apple,pear,banana,pear
SELECT STRING_AGG(fruit, " & ") AS string_agg
FROM UNNEST(["apple", "pear", "banana", "pear"]) AS fruit;
# apple & pear & banana & pear
SELECT STRING_AGG(DISTINCT fruit, " & ") AS string_agg
FROM UNNEST(["apple", "pear", "banana", "pear"]) AS fruit;
SELECT STRING_AGG(fruit, " & " ORDER BY LENGTH(fruit)) AS string_agg
FROM UNNEST(["apple", "pear", "banana", "pear"]) AS fruit;
SELECT STRING_AGG(fruit, " & " LIMIT 2) AS string_agg
FROM UNNEST(["apple", "pear", "banana", "pear"]) AS fruit;
SELECT STRING_AGG(DISTINCT fruit, " & " ORDER BY fruit DESC LIMIT 2) AS string_agg
FROM UNNEST(["apple", "pear", "banana", "pear"]) AS fruit;
SELECT
fruit,
STRING_AGG(fruit, " & ") OVER (ORDER BY LENGTH(fruit)) AS string_agg
FROM UNNEST(["apple", NULL, "pear", "banana", "pear"]) AS fruit;
CREATE TABLE IF NOT EXISTS mydataset.newtable (x INT64, y STRUCT<a ARRAY<STRING>, b BOOL>)
OPTIONS(
expiration_timestamp=TIMESTAMP "2025-01-01 00:00:00 UTC",
description="a table that expires in 2025",
labels=[("org_unit", "development")]
)
CREATE OR REPLACE TABLE mydataset.newtable (x INT64, y STRUCT<a ARRAY<STRING>, b BOOL>)
OPTIONS(
expiration_timestamp=TIMESTAMP "2025-01-01 00:00:00 UTC",
description="a table that expires in 2025",
labels=[("org_unit", "development")]
)
CREATE TABLE mydataset.newtable
(
x INT64 OPTIONS(description="An optional INTEGER field"),
y STRUCT<
a ARRAY<STRING> OPTIONS(description="A repeated STRING field"),
b BOOL
>
)
PARTITION BY _PARTITIONDATE
OPTIONS(
expiration_timestamp=TIMESTAMP "2025-01-01 00:00:00 UTC",
partition_expiration_days=1,
description="a table that expires in 2025, with each partition living for 24 hours",
labels=[("org_unit", "development")]
)
CREATE TABLE my_dataset.new_table (
x INT64 NOT NULL,
y STRUCT<
a ARRAY<STRING>,
b BOOL NOT NULL,
c FLOAT64
> NOT NULL,
z STRING
)
#### Create clustered table
CREATE TABLE mydataset.myclusteredtable
(
timestamp TIMESTAMP,
customer_id STRING,
transaction_amount NUMERIC
)
PARTITION BY DATE(timestamp)
CLUSTER BY customer_id
OPTIONS (
partition_expiration_days=3,
description="a table clustered by customer_id"
)
#standardSQL
SELECT
*,
SAFE_DIVIDE(orderedQuantity,stockLevel) AS ratio
FROM
ecommerce.products
WHERE
# include products that have been ordered and
# are 80% through their inventory
orderedQuantity > 0
AND SAFE_DIVIDE(orderedQuantity,stockLevel) >= .8
ORDER BY
restockingLeadTime DESC
Arrays - an array is an ordered list consisting of zero or more values of the same data type. https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays
- BigQuery natively supports arrays
- Array values must share a data type
- Arrays are called REPEATED fields in BigQuery
- finding the number of elements with ARRAY_LENGTH()
- deduplicating elements with ARRAY_AGG(DISTINCT )
- ordering elements with ARRAY_AGG( ORDER BY )
- limiting ARRAY_AGG( LIMIT 5)
- Generating
SELECT ["apple", "pear", "orange"] as fruit;
SELECT ARRAY<FLOAT64>[1, 2, 3] as floats;
SELECT GENERATE_ARRAY(11, 33, 2) AS odds; -- [11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33]
SELECT GENERATE_ARRAY(21, 14, -1) AS countdown; -- [21, 20, 19, 18, 17, 16, 15, 14]
SELECT
GENERATE_DATE_ARRAY('2017-11-21', '2017-12-31', INTERVAL 1 WEEK)
AS date_array;
SELECT ['raspberry', 'blackberry', 'strawberry', 'cherry', 1234567] AS fruit_array -- Error Array elements of types {INT64, STRING} do not have a common supertype
- Accessing
WITH sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT [5, 10] AS some_numbers)
SELECT some_numbers,
some_numbers[OFFSET(1)] AS offset_1,
some_numbers[ORDINAL(1)] AS ordinal_1
FROM sequences;
- Finding Lengths
WITH sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT [5, 10] AS some_numbers)
SELECT some_numbers,
ARRAY_LENGTH(some_numbers) AS len
FROM sequences;
- Flattening arrays - converting array elements into rows
- You need to UNNEST() arrays to bring the array elements back into rows
- UNNEST() always follows the table name in your FROM clause (think of it conceptually like a pre-joined table)
SELECT DISTINCT
visitId,
h.page.pageTitle
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
UNNEST(hits) AS h
WHERE visitId = 1501570398
LIMIT 10
                                             'homepage'
['homepage', 'product page', 'checkout'] ->  'product page'
                                             'checkout'
SELECT *
FROM UNNEST(['foo', 'bar', 'baz', 'qux', 'corge', 'garply', 'waldo', 'fred'])
AS element
WITH OFFSET AS offset
ORDER BY offset;
WITH sequences AS
(SELECT 1 AS id, [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT 2 AS id, [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT 3 AS id, [5, 10] AS some_numbers)
SELECT id, flattened_numbers
FROM sequences
CROSS JOIN UNNEST(sequences.some_numbers) AS flattened_numbers;
- Querying Nested Arrays
WITH races AS (
SELECT "800M" AS race,
[STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits),
STRUCT("Makhloufi" as name, [24.5, 25.4, 26.6, 26.1] as splits),
STRUCT("Murphy" as name, [23.9, 26.0, 27.0, 26.0] as splits),
STRUCT("Bosse" as name, [23.6, 26.2, 26.5, 27.1] as splits),
STRUCT("Rotich" as name, [24.7, 25.6, 26.9, 26.4] as splits),
STRUCT("Lewandowski" as name, [25.0, 25.7, 26.3, 27.2] as splits),
STRUCT("Kipketer" as name, [23.2, 26.1, 27.3, 29.4] as splits),
STRUCT("Berian" as name, [23.7, 26.1, 27.0, 29.3] as splits)]
AS participants)
SELECT
race,
participant
FROM races r
CROSS JOIN UNNEST(r.participants) as participant;
- Creating Arrays From Subqueries
WITH sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT [5, 10] AS some_numbers)
SELECT some_numbers,
ARRAY(SELECT x * 2
FROM UNNEST(some_numbers) AS x) AS doubled
FROM sequences;
- Filtering Arrays
WITH sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT [5, 10] AS some_numbers)
SELECT
ARRAY(SELECT x * 2
FROM UNNEST(some_numbers) AS x
WHERE x < 5) AS doubled_less_than_five
FROM sequences;
WITH sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers)
SELECT ARRAY(SELECT DISTINCT x
FROM UNNEST(some_numbers) AS x) AS unique_numbers
FROM sequences;
- Scanning Arrays
SELECT 2 IN UNNEST([0, 1, 1, 2, 3, 5]) AS contains_value; # returns true if the array contains the number 2
# returns the id value for the rows where the array column contains the value 2.
WITH sequences AS
(SELECT 1 AS id, [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT 2 AS id, [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT 3 AS id, [5, 10] AS some_numbers)
SELECT id AS matching_rows
FROM sequences
WHERE 2 IN UNNEST(sequences.some_numbers)
ORDER BY matching_rows;
- Arrays and Aggregation
WITH fruits AS
(SELECT "apple" AS fruit
UNION ALL SELECT "pear" AS fruit
UNION ALL SELECT "banana" AS fruit)
SELECT ARRAY_AGG(fruit) AS fruit_basket
FROM fruits;
WITH fruits AS
(SELECT "apple" AS fruit
UNION ALL SELECT "pear" AS fruit
UNION ALL SELECT "banana" AS fruit)
SELECT ARRAY_AGG(fruit ORDER BY fruit) AS fruit_basket
FROM fruits;
- Converting Arrays to Strings
WITH greetings AS
(SELECT ["Hello", "World"] AS greeting)
SELECT ARRAY_TO_STRING(greeting, " ") AS greetings
FROM greetings;
- Combining Arrays
SELECT ARRAY_CONCAT([1, 2], [3, 4], [5, 6]) as count_to_six; # return [1, 2, 3, 4, 5, 6]
- Zipping Arrays
You can zip arrays with UNNEST and WITH OFFSET. In this example, each value pair is stored as a STRUCT in an array.
WITH combinations AS (
SELECT
['a', 'b'] AS letters,
[1, 2, 3] AS numbers
)
SELECT ARRAY_AGG(
STRUCT(letter, numbers[OFFSET(letters_offset)] AS number)
) AS pairs
FROM combinations, UNNEST(letters) AS letter WITH OFFSET AS letters_offset;
- Building Arrays of Arrays
BigQuery does not support building arrays of arrays directly. Instead, you must create an array of structs, with each struct containing a field of type ARRAY.
WITH points AS
(SELECT [1, 5] as point
UNION ALL SELECT [2, 8] as point
UNION ALL SELECT [3, 7] as point
UNION ALL SELECT [4, 1] as point
UNION ALL SELECT [5, 7] as point)
SELECT ARRAY(
SELECT STRUCT(point)
FROM points)
AS coordinates;
+-------------------+
| coordinates |
+-------------------+
| [{point: [1,5]}, |
| {point: [2,8]}, |
| {point: [5,7]}, |
| {point: [3,7]}, |
| {point: [4,1]}] |
+-------------------+
- STRUCT - think of a STRUCT conceptually as a separate table that is already pre-joined into your main table. A STRUCT can have:
- one or many fields in it
- the same or different data types for each field
- its own alias (like runner below), so it can conceptually be thought of as a table inside of your main table
- STRUCTs (and ARRAYs) must be unpacked before you can operate over their elements. Wrap an UNNEST() around the name of the struct itself, or around the struct field that is an array, in order to unpack and flatten it.
A STRUCT is essentially a collection of other fields, kind of like a table. From a reporting standpoint, you can have many nested STRUCTs within a single table, which is conceptually like having many other tables pre-joined.
SELECT STRUCT("Rudisha" as name, 23.4 as split) as runner
SELECT STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits) AS runner
A table column has the RECORD type when it contains a STRUCT.
gs://data-insights-course/labs/optimizing-for-performance/race_results.json
#standardSQL
SELECT race, participants.name -- accessing an element inside a STRUCT requires a CROSS JOIN with that struct
FROM racing.race_results
CROSS JOIN
participants # this is the STRUCT (it is like a table within a table)
#standardSQL
SELECT race, participants.name
FROM racing.race_results
CROSS JOIN
race_results.participants # full STRUCT name
#standardSQL -- simplest form
SELECT race, participants.name
FROM racing.race_results AS r, r.participants
#standardSQL
SELECT COUNT(p.name) AS racer_count
FROM racing.race_results AS r, UNNEST(r.participants) AS p
#standardSQL
SELECT
p.name,
SUM(split_times) as total_race_time
FROM racing.race_results AS r
, UNNEST(r.participants) AS p
, UNNEST(p.splits) AS split_times
WHERE p.name LIKE 'R%'
GROUP BY p.name
ORDER BY total_race_time ASC;
#standardSQL
SELECT
p.name,
split_time
FROM racing.race_results AS r
, UNNEST(r.participants) AS p
, UNNEST(p.splits) AS split_time
WHERE split_time = 23.2;
- SELECT DISTINCT - does not return columns of type STRUCT or ARRAY
- SELECT * EXCEPT - specifies the names of one or more columns to exclude from the result. All matching column names are omitted from the output.
WITH orders AS
(SELECT 5 as order_id,
"sprocket" as item_name,
200 as quantity)
SELECT * EXCEPT (order_id)
FROM orders;
Note: SELECT * EXCEPT does not exclude columns that do not have names.
- SELECT * REPLACE
WITH orders AS
(SELECT 5 as order_id,
"sprocket" as item_name,
200 as quantity)
SELECT * REPLACE (quantity/2 AS quantity)
FROM orders;
- SELECT ALL
- SELECT AS STRUCT
SELECT
ARRAY(SELECT AS STRUCT t.f1, t.f2 WHERE t.f3=true)
FROM
Table t
- SELECT AS VALUE
SELECT AS VALUE STRUCT(1 a, 2 b) xyz FROM Table;
- FOR SYSTEM_TIME AS OF - references the historical versions of the table definition and rows that were current at timestamp_expression.
SELECT *
FROM t
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR); # returns a historical version of the table from one hour ago.
SELECT *
FROM t
FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00';
Example:
DECLARE before_replace_timestamp TIMESTAMP;
-- Create table books.
CREATE TABLE books AS
SELECT 'Hamlet' title, 'William Shakespeare' author;
-- Get current timestamp before table replacement.
SET before_replace_timestamp = CURRENT_TIMESTAMP();
-- Replace table with different schema(title and release_date).
CREATE OR REPLACE TABLE books AS
SELECT 'Hamlet' title, DATE '1603-01-01' release_date;
SELECT * FROM books;
-- This query returns Hamlet, William Shakespeare as result.
SELECT * FROM books FOR SYSTEM_TIME AS OF before_replace_timestamp;
ON and USING comparison
FROM A JOIN B ON A.x = B.x
FROM A JOIN B USING (x)
Table A Table B Result ON Result USING
+---+ +---+ +-------+ +---+
| x | * | x | = | x | x | | x |
+---+ +---+ +-------+ +---+
| 1 | | 2 | | 2 | 2 | | 2 |
| 2 | | 3 | | 3 | 3 | | 3 |
| 3 | | 4 | +-------+ +---+
+---+ +---+
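A small self-contained illustration of the two result shapes above; tables A and B are defined inline with made-up values:
-- ON keeps a separate x column from each table.
WITH A AS (SELECT x FROM UNNEST([1, 2, 3]) AS x),
     B AS (SELECT x FROM UNNEST([2, 3, 4]) AS x)
SELECT A.x AS a_x, B.x AS b_x
FROM A JOIN B ON A.x = B.x;
-- USING merges x into a single output column.
WITH A AS (SELECT x FROM UNNEST([1, 2, 3]) AS x),
     B AS (SELECT x FROM UNNEST([2, 3, 4]) AS x)
SELECT x
FROM A JOIN B USING (x);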
CREATE TEMP FUNCTION addFourAndDivide(x INT64, y INT64) AS ((x + 4) / y);
WITH numbers AS
(SELECT 1 as val
UNION ALL
SELECT 3 as val
UNION ALL
SELECT 4 as val
UNION ALL
SELECT 5 as val)
SELECT val, addFourAndDivide(val, 2) AS result
FROM numbers;
CREATE TEMP FUNCTION lastArrayElement(arr ANY TYPE) AS (
arr[ORDINAL(ARRAY_LENGTH(arr))]
);
SELECT
names[OFFSET(0)] AS first_name,
lastArrayElement(names) AS last_name
FROM (
SELECT ['Fred', 'McFeely', 'Rogers'] AS names UNION ALL
SELECT ['Marie', 'Skłodowska', 'Curie']
);
CREATE TEMP FUNCTION multiplyInputs(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
return x*y;
""";
WITH numbers AS
(SELECT 1 AS x, 5 as y
UNION ALL
SELECT 2 AS x, 10 as y
UNION ALL
SELECT 3 as x, 15 as y)
SELECT x, y, multiplyInputs(x, y) as product
FROM numbers;
Wildcard tables - must be quoted with backtick (`) characters. Quotation marks do not work!
SELECT
FROM
`<project-id>.<dataset-id>.<table-prefix>*`
WHERE
bool_expression
SELECT
max
FROM
`bigquery-public-data.noaa_gsod.gsod*`
WHERE
max != 9999.9 # code for missing data
AND _TABLE_SUFFIX = '1929'
ORDER BY
max DESC
SELECT
ROUND(AVG(TemperatureF),1) AS AVG_TEMP_F
FROM
`mydataset.sea_weather_*`
WHERE
_TABLE_SUFFIX BETWEEN '20160501' AND '20160509'
SELECT
max,
ROUND((max-32)*5/9,1) celsius,
year
FROM
`bigquery-public-data.noaa_gsod.gsod*`
WHERE
max != 9999.9 # code for missing data
AND max > 100 # to improve ORDER BY performance
AND REGEXP_CONTAINS(_TABLE_SUFFIX, r"0$")
ORDER BY
max DESC
BQML supports the model types listed below. There are three to choose from:
Model | Model Type | Label Data type | Example |
---|---|---|---|
Forecasting | linear_reg | Numeric value (typically an integer or floating point) | Forecast sales figures for next year given historical sales data. |
Classification | logistic_reg | 0 or 1 for binary classification | Classify an email as spam or not spam given the context. |
k-Means Clustering | kmeans | No label needed (unsupervised learning) | Group customers into segments for exploration. |
- Forecasting numeric values like next month's sales with Linear Regression (linear_reg).
- Binary or Multiclass Classification like spam or not spam email by using Logistic Regression (logistic_reg).
- k-Means Clustering for when you want unsupervised learning for exploration (kmeans).
CREATE OR REPLACE MODEL `ecommerce.classification_model`
OPTIONS
(
model_type='logistic_reg',
labels = ['will_buy_on_return_visit']
)
AS
#standardSQL
SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430') # train on first 9 months
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
;
In BQML, roc_auc is simply a queryable field when evaluating your trained ML model.
Now that training is complete, run this query to evaluate how well the model performs using ML.EVALUATE:
SELECT
roc_auc,
CASE
WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'decent'
WHEN roc_auc > .6 THEN 'not great'
ELSE 'poor' END AS model_quality
FROM
ML.EVALUATE(MODEL ecommerce.classification_model, (
SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1
AND date BETWEEN '20170501' AND '20170630') # eval on 2 months
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
));
You should see the following result:
Row roc_auc model_quality
1 0.724588 decent
After evaluating your model you get a roc_auc of 0.72, which shows the model has decent, but not great, predictive power. Since the goal is to get the area under the curve as close to 1.0 as possible, there is room for improvement.
CREATE OR REPLACE MODEL `ecommerce.classification_model_2`
OPTIONS
(model_type='logistic_reg', labels = ['will_buy_on_return_visit']) AS
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
# add in new features
SELECT * EXCEPT(unique_session_id) FROM (
SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
IFNULL(totals.pageviews, 0) AS pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430' # train 9 months
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
);
Evaluate this new model to see if there is better predictive power:
#standardSQL
SELECT
roc_auc,
CASE
WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'decent'
WHEN roc_auc > .6 THEN 'not great'
ELSE 'poor' END AS model_quality
FROM
ML.EVALUATE(MODEL ecommerce.classification_model_2, (
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
# add in new features
SELECT * EXCEPT(unique_session_id) FROM (
SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20170501' AND '20170630' # eval 2 months
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
));
Predict which new visitors will come back and purchase
SELECT
*
FROM
ml.PREDICT(MODEL `ecommerce.classification_model_2`,
(
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
SELECT
CONCAT(fullvisitorid, '-',CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE
# only predict for new visits
totals.newVisits = 1
AND date BETWEEN '20170701' AND '20170801' # test 1 month
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
)
ORDER BY
predicted_will_buy_on_return_visit DESC;
Scheduled queries must be written in standard SQL which can include Data Definition Language (DDL) and Data Manipulation Language (DML) statements.
Required permissions:
- Either bigquery.jobs.create or bigquery.transfers.update permissions to create the transfer.
- bigquery.datasets.update permissions on the target dataset.
The bigquery.jobUser predefined IAM role includes bigquery.jobs.create permissions.
- run_time - represents the intended time of execution. For example, if the scheduled query is set to "every 24 hours", the run_time difference between consecutive queries is exactly 24 hours, even though the actual execution time may vary slightly.
- run_date - represents a logical calendar date in the following format: %Y%m%d; for example, 20180101. This format is compatible with ingestion-time partitioned tables.
{run_date}
{run_time[+\-offset]|"time_format"}
Example
mytable_{run_time+25h|"%Y%m%d"}
mytable_{run_time+90m|"%Y%m%d;%H"}
{run_time+97s|"%Y%m%d"}_mytable_{run_time|"%H%M%s"}
SELECT @run_time AS time,
title,
author,
text
FROM `bigquery-public-data.hacker_news.stories`
LIMIT
1000
- Option 1: Use the bq query command.
bq query \
--display_name=name \
--destination_table=table \
--schedule=interval
# Example interval
--schedule='every 24 hours'
--schedule='every 3 hours'
# Example:
bq query \
--use_legacy_sql=false \
--destination_table=mydataset.mytable \
--display_name='My Scheduled Query' \
--schedule='every 24 hours' \
--replace=true \
'SELECT
1
FROM
mydataset.test'
- Option 2: Use the bq mk command. Scheduled queries are a kind of transfer. To schedule a query, you can use the BigQuery Data Transfer Service command-line interface to make a transfer configuration.
bq mk \
--transfer_config \
--project_id=project_id \
--target_dataset=dataset \
--display_name=name \
--params='parameters' \
--data_source=data_source
For example, the following command creates a scheduled query transfer configuration named My Scheduled Query using the simple query SELECT 1 from mydataset.test. The destination table mytable is truncated for every write, and the target dataset is mydataset. The scheduled query is created in the default project, and authenticates with a service account:
bq mk \
--transfer_config \
--target_dataset=mydataset \
--display_name='My Scheduled Query' \
--params='{"query":"SELECT 1 from mydataset.test","destination_table_name_template":"mytable","write_disposition":"WRITE_TRUNCATE"}' \
--data_source=scheduled_query \
--service_account_name=abcdef-test-sa@abcdef-test.iam.gserviceaccount.com
bq ls \
--transfer_config \
--transfer_location=us
bq show \
--transfer_config \
projects/862514376110/locations/us/transferConfigs/5dd12f26-0000-262f-bc38-089e0820fe38
For example, to backfill runs when a scheduled query did not execute properly on previous days:
bq mk \
--transfer_run \
--start_time='start_time' \
--end_time='end_time' \
resource_name
bq mk \
--transfer_run \
--start_time 2017-05-25T00:00:00Z \
--end_time 2017-05-25T00:00:00Z \
projects/mypro
Approximate aggregate functions are scalable in terms of memory usage and time, but produce approximate results instead of exact results. These functions typically require less memory than exact aggregation functions. This makes approximate aggregation appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
- APPROX_COUNT_DISTINCT
SELECT APPROX_COUNT_DISTINCT(x) as approx_distinct
FROM UNNEST([0, 1, 1, 2, 3, 5]) as x;
- APPROX_QUANTILES Returns the approximate boundaries for a group of expression values, where number represents the number of quantiles to create. This function returns an array of number + 1 elements, where the first element is the approximate minimum and the last element is the approximate maximum.
APPROX_QUANTILES([DISTINCT] expression, number [{IGNORE|RESPECT} NULLS])
SELECT APPROX_QUANTILES(x, 2) AS approx_quantiles
FROM UNNEST([1, 1, 1, 4, 5, 6, 7, 8, 9, 10]) AS x;
SELECT APPROX_QUANTILES(x, 100)[OFFSET(90)] AS percentile_90
FROM UNNEST([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) AS x;
SELECT APPROX_QUANTILES(DISTINCT x, 2) AS approx_quantiles
FROM UNNEST([1, 1, 1, 4, 5, 6, 7, 8, 9, 10]) AS x;
SELECT
MIN(start_station_name) AS start_station_name,
MIN(end_station_name) AS end_station_name,
APPROX_QUANTILES(tripduration, 10)[OFFSET (5)] AS typical_duration,
COUNT(tripduration) AS num_trips
FROM
`bigquery-public-data.new_york_citibike.citibike_trips`
WHERE
start_station_id != end_station_id
GROUP BY
start_station_id,
end_station_id
ORDER BY
num_trips DESC
LIMIT
10
- APPROX_TOP_COUNT
APPROX_TOP_COUNT(expression, number)
Returns the approximate top elements of expression.
SELECT APPROX_TOP_COUNT(x, 2) as approx_top_count
FROM UNNEST(["apple", "apple", "pear", "pear", "pear", "banana"]) as x;
+-------------------------+
| approx_top_count |
+-------------------------+
| [{pear, 3}, {apple, 2}] |
+-------------------------+
NULL handling
APPROX_TOP_COUNT does not ignore NULLs in the input. For example:
SELECT APPROX_TOP_COUNT(x, 2) as approx_top_count
FROM UNNEST([NULL, "pear", "pear", "pear", "apple", NULL]) as x;
+------------------------+
| approx_top_count |
+------------------------+
| [{pear, 3}, {NULL, 2}] |
+------------------------+
- APPROX_TOP_SUM
APPROX_TOP_SUM(expression, weight, number)
Description
Returns the approximate top elements of expression, based on the sum of an assigned weight. The number parameter specifies the number of elements returned.
If the weight input is negative or NaN, this function returns an error.
SELECT APPROX_TOP_SUM(x, weight, 2) AS approx_top_sum FROM
UNNEST([
STRUCT("apple" AS x, 3 AS weight),
("pear", 2),
("apple", 0),
("banana", 5),
("pear", 4)
]);
+--------------------------+
| approx_top_sum |
+--------------------------+
| [{pear, 6}, {banana, 5}] |
+--------------------------+
SELECT APPROX_TOP_SUM(x, weight, 2) AS approx_top_sum FROM
UNNEST([STRUCT("apple" AS x, NULL AS weight), ("pear", 0), ("pear", NULL)]);
+----------------------------+
| approx_top_sum |
+----------------------------+
| [{pear, 0}, {apple, NULL}] |
+----------------------------+
SELECT APPROX_TOP_SUM(x, weight, 2) AS approx_top_sum FROM
UNNEST([STRUCT("apple" AS x, 0 AS weight), (NULL, 2)]);
+-------------------------+
| approx_top_sum |
+-------------------------+
| [{NULL, 2}, {apple, 0}] |
+-------------------------+
SELECT APPROX_TOP_SUM(x, weight, 2) AS approx_top_sum FROM
UNNEST([STRUCT("apple" AS x, 0 AS weight), (NULL, NULL)]);
+----------------------------+
| approx_top_sum |
+----------------------------+
| [{apple, 0}, {NULL, NULL}] |
+----------------------------+