GCS BigQuery
Data in a BigQuery table is encrypted with a data encryption key, and those data encryption keys are themselves encrypted with key encryption keys. This is known as envelope encryption and it provides additional security. BigQuery also supports customer-managed encryption keys (CMEK), which let you encrypt tables with your own keys instead of Google-managed ones. Inside BigQuery, you can monitor your team's BigQuery usage and running queries, and proactively limit access to data at the row and column level.
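As a hedged sketch (the table name and key resource path below are placeholders), a new table can be placed under a customer-managed key via the kms_key_name table option:
-- Sketch: create a table protected by a customer-managed Cloud KMS key.
-- The BigQuery service account must have Encrypter/Decrypter access to the key.
CREATE TABLE mydataset.cmek_table (
  id INT64,
  payload STRING
)
OPTIONS (
  kms_key_name = 'projects/my-project/locations/us/keyRings/my-ring/cryptoKeys/my-key'
);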
BigQuery is both a fast SQL query engine and a fully managed storage layer for loading and storing your datasets. The two services are connected by Google's high-speed internal network; this network is what lets BigQuery separate compute from storage. The storage service automatically manages the data that you ingest into the platform. Data is contained within a project, in datasets, which can hold zero or more tables or views. Tables are stored as highly compressed columns in Google's internal Colossus file system, which provides durability and global availability. The storage service handles both bulk ingestion and streaming ingestion, so it works with huge batches of data as well as real-time data streams. The storage service and the query service work together to organize the data internally so that your queries run efficiently on terabytes and petabytes; they even optimize your SQL syntax where possible.
In the BigQuery UI, if you hold down the Command key (on a Mac) or the Windows/Ctrl key, all the dataset and table names in your query are highlighted. This is useful because you can then click on any of them: if your query references several datasets, or you inherited a query from somebody else, you can quickly jump to the schema, details, and preview of that particular table.
You can use BigQuery to directly query data in Cloud SQL, Google's managed relational database service for PostgreSQL, MySQL, and SQL Server. You can also use BigQuery to directly query files in Cloud Storage, as long as those files are in supported formats such as CSV or Apache Avro.
You can set up scheduled queries in SQL by using the @run_time parameter, or use the query scheduler in the BigQuery UI.
Set up very granular controls over the columns and rows of your data in BigQuery using the Data Catalog service and advanced BigQuery features such as authorized views.
You can stream records into BigQuery via the API. Note that there are a few quota restrictions you should be aware of: the maximum row size for a streaming insert is 1 MB, and the maximum throughput is 100,000 records per second per project. If you need higher throughput, on the order of millions of records per second, for use cases such as application logging or real-time event tracking, consider using Cloud Bigtable as the data sink instead. Also, before you start streaming thousands of records into BigQuery through the API, consider the other options for your streaming solution. If your data needs to be transformed or aggregated mid-flight into table-and-row format, joined against other data sources as side inputs midstream, or windowed into segments, you should consider using Cloud Dataflow for your streaming data pipeline.
Access control is through Cloud IAM (Organization -> Project -> Dataset). Access control applies to datasets, not to individual tables or columns. To query a dataset, a user must be able to run a query job (a GRANT sketch follows the list below).
- Admin activity
- System events
- Data access
- User account
- Service account
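Dataset-level access can also be granted with BigQuery's SQL GRANT statement. A minimal sketch, assuming a dataset named mydataset and a hypothetical user:
-- Sketch: give one user read access to the tables in the dataset.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA mydataset
TO "user:alice@example.com";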
BigQuery Datasets can be regional or multi-regional
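For example, a minimal sketch (the dataset name is a placeholder) of creating a dataset pinned to a chosen location:
-- Sketch: create a dataset in the EU multi-region.
CREATE SCHEMA IF NOT EXISTS mydataset
OPTIONS (
  location = 'EU'  -- a single region such as 'europe-west2' also works
);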
There is a limit to the size of data that can be loaded directly into BigQuery, and your network bandwidth can become a bottleneck. Rather than loading data directly into BigQuery, it is often more convenient to first stage it in Google Cloud Storage as your data lake, and then load it from Cloud Storage through a pipeline into BigQuery as your data warehouse. Cloud Storage supports multithreaded, resumable uploads; if you are using gsutil, you can simply provide the -m option. Loading data from Cloud Storage is also much faster because of the high throughput it offers. A more recently released feature lets BigQuery query data files that live in Google Cloud Storage directly, without first loading them into BigQuery's native storage. This is called a federated query, or creating an external data source connection. A popular use case is having data files in formats such as CSV, Avro, or the newly supported Parquet and Apache ORC, and querying them directly from BigQuery.
An external data source is a data source that you can query directly from BigQuery even though the data is not stored in BigQuery storage. BigQuery supports several kinds of external data sources.
An external table is a table that acts like a standard BigQuery table. The table metadata, including the table schema, is stored in BigQuery storage, but the data itself resides in the external source.
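A hedged sketch of defining an external table over CSV files in Cloud Storage (bucket path, table name, and columns are placeholders):
-- Sketch: the table metadata lives in BigQuery, the data stays in Cloud Storage.
CREATE EXTERNAL TABLE mydataset.ext_sales (
  sale_id INT64,
  sale_date DATE,
  amount NUMERIC
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/sales/*.csv'],
  skip_leading_rows = 1  -- skip the header row
);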
A federated query is a way to send a query statement to an external database and get the result back as a temporary table. Federated queries use the BigQuery Connection API to establish a connection with the external database. In your standard SQL query, you use the EXTERNAL_QUERY function to send a query statement to the external database, using that database's SQL dialect. The results are converted to BigQuery standard SQL data types.
You can use federated queries with external databases such as Cloud SQL.
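A minimal federated-query sketch, assuming a Cloud SQL connection resource (its name here is made up) has already been created through the BigQuery Connection API:
-- Sketch: EXTERNAL_QUERY runs the inner statement in the external database's
-- own SQL dialect and returns the result as a BigQuery table.
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my_cloudsql_connection',
  'SELECT customer_id, created_at FROM customers;'
);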
Best practice: Configure the default table expiration for your datasets, configure the expiration time for your tables, and configure the partition expiration for partitioned tables.
You can set a default table expiration time at the dataset level, or you can set a table's expiration time when the table is created. If you set the expiration when the table is created, the dataset's default table expiration is ignored.
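A hedged sketch of setting an expiration on an existing table (the table name is a placeholder), which overrides the dataset's default table expiration:
-- Sketch: make this table expire 7 days from now.
ALTER TABLE mydataset.mytable
SET OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
);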
bq update --default_table_expiration integer project_id:dataset
bq update --default_table_expiration 7200 mydataset
bq update --default_table_expiration 7200 myotherproject:mydataset
ALTER SCHEMA mydataset
SET OPTIONS(
default_table_expiration_days=3.75
)
bq update --default_partition_expiration integer project_id:dataset
bq update --default_partition_expiration 93600 mydataset
bq update --default_partition_expiration 93600 myotherproject:mydataset
ALTER SCHEMA mydataset
SET OPTIONS(
default_partition_expiration_days=3.75
)
If a table is not edited for 90 consecutive days, the price of storage for that table automatically drops by 50 percent. Each partition of a partitioned table is considered separately for long-term storage pricing: if a partition hasn't been modified in the last 90 days, the data in that partition is considered long-term storage and is charged at the discounted price.
BigQuery organizes data into containers called datasets. Datasets function like top-level folders that organize and control access to the underlying tables. IAM can restrict access to datasets and tables; to restrict access to only part of a table, create an authorized view. Steps (a SQL sketch of the first steps follows the list):
- Create a dataset to hold the view: authorized views require the source data (table) to sit in a separate dataset from the view
- Create a view that exposes only the needed data
- Assign access control at the project level: to query the view, users need permission to run query jobs, so assign the BigQuery User role
- Assign access control on the dataset containing the view: users need the BigQuery Data Viewer role on that dataset
- Authorize the view to access the source dataset
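A minimal sketch of the first two steps, assuming a hypothetical source table myproject.source_data.transactions (all names are placeholders). The final authorization of the view on the source dataset is done in the console, through the API, or with bq update on the source dataset's access list:
-- Sketch: a separate dataset for shared views, plus a view that exposes
-- only the needed columns from the source table.
CREATE SCHEMA IF NOT EXISTS shared_views;
CREATE OR REPLACE VIEW shared_views.vw_transactions_slim AS
SELECT transaction_id, transaction_date
FROM `myproject.source_data.transactions`;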
Set the query dialect when you query BigQuery data:
- standard SQL (In the Cloud Console and the client libraries, standard SQL is the default.)
- legacy SQL dialect (In the classic BigQuery web UI, the bq command-line tool, and the REST API, legacy SQL is the default.)
The following prefixes let you set the SQL dialect:
Prefix | Description
---|---
#legacySQL | Runs the query using legacy SQL
#standardSQL | Runs the query using standard SQL
Example:
#legacySQL
SELECT
weight_pounds, state, year, gestation_weeks
FROM
[bigquery-public-data:samples.natality]
ORDER BY weight_pounds DESC
LIMIT 10;
bq query \
--use_legacy_sql=false \
'SELECT
word
FROM
`bigquery-public-data.samples.shakespeare`'
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
query = """
SELECT name, SUM(number) as total_people
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = 'TX'
GROUP BY name, state
ORDER BY total_people DESC
LIMIT 20
"""
query_job = client.query(query) # Make an API request.
print("The query data:")
for row in query_job:
    # Row values can be accessed by field name or index.
    print("name={}, count={}".format(row[0], row["total_people"]))
Previously, BigQuery executed queries using a non-standard SQL dialect known as BigQuery SQL. With the launch of BigQuery 2.0, BigQuery released support for standard SQL, and renamed BigQuery SQL to legacy SQL. Standard SQL is the preferred SQL dialect for querying data stored in BigQuery.
Setting standard SQL as the default for the command-line tool: set --use_legacy_sql=false in .bigqueryrc by adding the following text to the file. This example sets standard SQL as the default syntax for queries and for the mk command (used when you create a view):
[query]
--use_legacy_sql=false
[mk]
--use_legacy_sql=false
https://cloud.google.com/bigquery/docs/partitioned-tables
You can partition BigQuery tables by:
- Time-unit column: Tables are partitioned based on a TIMESTAMP, DATE, or DATETIME column in the table.
CREATE TABLE
mydataset.newtable (transaction_id INT64, transaction_date DATE)
PARTITION BY
transaction_date
OPTIONS(
partition_expiration_days=3,
require_partition_filter=true
)
CREATE TABLE
mydataset.newtable (transaction_id INT64, transaction_date DATE)
PARTITION BY
DATE_TRUNC(transaction_date, MONTH)
OPTIONS(
partition_expiration_days=3,
require_partition_filter=true
)
SELECT * FROM dataset.table
WHERE transaction_date >= '2016-01-01'
- Ingestion time: Tables are partitioned based on the timestamp when BigQuery ingests the data.
CREATE TABLE
mydataset.newtable (transaction_id INT64)
PARTITION BY
_PARTITIONDATE
OPTIONS(
partition_expiration_days=3,
require_partition_filter=true
)
-- The default partitioning type for ingestion-time partitioning is daily partitioning. To specify a different partitioning type, include the DATE_TRUNC function in the PARTITION BY clause.
CREATE TABLE
mydataset.newtable (transaction_id INT64)
PARTITION BY
DATE_TRUNC(_PARTITIONTIME, MONTH)
OPTIONS(
partition_expiration_days=3,
require_partition_filter=true
)
SELECT
column
FROM
dataset.table
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')
- Integer range: Tables are partitioned based on an integer column.
CREATE TABLE mydataset.newtable (customer_id INT64, date1 DATE)
PARTITION BY
RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100, 10))
OPTIONS(
require_partition_filter=true
)
SELECT * FROM dataset.table
WHERE customer_id BETWEEN 30 AND 50
You cannot use legacy SQL to query partitioned tables or to write query results to partitioned tables.
- Create a new dataset ecommerce
#standardSQL
CREATE OR REPLACE TABLE ecommerce.partition_by_day
PARTITION BY date_formatted
OPTIONS(
description="a table partitioned by date"
) AS
SELECT DISTINCT
PARSE_DATE("%Y%m%d", date) AS date_formatted,
fullvisitorId
FROM `data-to-insights.ecommerce.all_sessions_raw`
PARTITION BY option: the column used for partitioning must be a DATE or TIMESTAMP (DATETIME is also supported). The PARSE_DATE function is used on the date field (stored as a string) to convert it into the proper DATE type for partitioning.
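For example:
-- PARSE_DATE converts the STRING '20170801' into the DATE 2017-08-01.
SELECT PARSE_DATE("%Y%m%d", "20170801") AS date_formatted;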
- In the left menu, in Resources, click on Add Data and select Explore public datasets. Search for "GSOD NOAA" then select the dataset. Click on View Dataset.
#standardSQL
SELECT
DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)) AS date,
(SELECT ANY_VALUE(name) FROM `bigquery-public-data.noaa_gsod.stations` AS stations
WHERE stations.usaf = stn) AS station_name, -- Stations may have multiple names
prcp
FROM `bigquery-public-data.noaa_gsod.gsod*` AS weather
WHERE prcp < 99.9 -- Filter unknown values
AND prcp > 0 -- Filter stations/days with no precipitation
AND CAST(_TABLE_SUFFIX AS int64) >= 2018
ORDER BY date DESC -- Where has it rained/snowed recently
LIMIT 10
Note the table wildcard * in the FROM clause, used together with the _TABLE_SUFFIX filter to limit the number of tables referenced.
Note that although a LIMIT 10 was added, this still does not reduce the total amount of data scanned (about 1.83 GB) since there are no partitions yet.
#standardSQL
CREATE OR REPLACE TABLE ecommerce.days_with_rain
PARTITION BY date
OPTIONS (
partition_expiration_days=60,
description="weather stations with precipitation, partitioned by day"
) AS
SELECT
DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)) AS date,
(SELECT ANY_VALUE(name) FROM `bigquery-public-data.noaa_gsod.stations` AS stations
WHERE stations.usaf = stn) AS station_name, -- Stations may have multiple names
prcp
FROM `bigquery-public-data.noaa_gsod.gsod*` AS weather
WHERE prcp < 99.9 -- Filter unknown values
AND prcp > 0 -- Filter
AND CAST(_TABLE_SUFFIX AS int64) >= 2018
SELECT end_station_name FROM `bigquery-public-data.london_bicycles.cycle_hire`;
GCP Project → bigquery-public-data Dataset → london_bicycles Table → cycle_hire
bq show bigquery-public-data:samples.shakespeare
- bq invokes the BigQuery command-line tool
- show is the action
- project:dataset.table names the public dataset and table in BigQuery that you want to see
bq query --use_legacy_sql=false \
'SELECT
word,
SUM(word_count) AS count
FROM
`bigquery-public-data`.samples.shakespeare
WHERE
word LIKE "%raisin%"
GROUP BY
word'
When you create a partitioned table, you can require the use of predicate filters by enabling the Require partition filter option. When this option is applied, attempts to query the partitioned table without specifying a WHERE clause produce the following error:
Cannot query over table 'project_id.dataset.table' without a filter that can be used for partition elimination.
For ingestion-time partitioned tables, use either the _PARTITIONTIME or _PARTITIONDATE pseudo-column.
CREATE TABLE
mydataset.newtable (transaction_id INT64)
PARTITION BY
_PARTITIONDATE
OPTIONS (
partition_expiration_days = 3,
require_partition_filter = TRUE);
-- update partition table to add require_partition_filter
ALTER TABLE mydataset.mypartitionedtable
SET OPTIONS (
require_partition_filter = true);
bq ls # list any existing datasets in your project:
bq ls bigquery-public-data: # list the datasets in that specific project, followed by a colon (:)
bq mk babynames # create a new dataset named babynames in your current project (check it with: gcloud config list project)
wget http://www.ssa.gov/OACT/babynames/names.zip
unzip names.zip
bq load babynames.names2010 yob2010.txt name:string,gender:string,count:integer
bq ls babynames
bq show babynames.names2010
bq query "SELECT name,count FROM babynames.names2010 WHERE gender = 'F' ORDER BY count DESC LIMIT 5"
BigQuery public datasets are not displayed by default in the BigQuery web UI. To open the public datasets project, open https://console.cloud.google.com/bigquery?p=data-to-insights&page=ecommerce in a new browser window.
#standardSQL
SELECT COUNT(*) as num_duplicate_rows, * FROM
`data-to-insights.ecommerce.all_sessions_raw`
GROUP BY
fullVisitorId, channelGrouping, time, country, city, totalTransactionRevenue, transactions, timeOnSite, pageviews, sessionQualityDim, date, visitId, type, productRefundAmount, productQuantity, productPrice, productRevenue, productSKU, v2ProductName, v2ProductCategory, productVariant, currencyCode, itemQuantity, itemRevenue, transactionRevenue, transactionId, pageTitle, searchKeyword, pagePathLevel1, eCommerceAction_type, eCommerceAction_step, eCommerceAction_option
HAVING num_duplicate_rows > 1;
WITH unique_product_views_by_person AS (
-- find each unique product viewed by each visitor
SELECT
fullVisitorId,
(v2ProductName) AS ProductName
FROM `data-to-insights.ecommerce.all_sessions`
WHERE type = 'PAGE'
GROUP BY fullVisitorId, v2ProductName )
-- aggregate the top viewed products and sort them
SELECT
COUNT(*) AS unique_view_count,
ProductName
FROM unique_product_views_by_person
GROUP BY ProductName
ORDER BY unique_view_count DESC
LIMIT 5
#standardSQL
# copy one day of ecommerce data to explore
CREATE OR REPLACE TABLE ecommerce.all_sessions_raw_20170801
#schema
(
fullVisitorId STRING OPTIONS(description="Unique visitor ID"),
channelGrouping STRING OPTIONS(description="Channel e.g. Direct, Organic, Referral...")
)
OPTIONS(
description="Raw data from analyst team into our dataset for 08/01/2017"
) AS
SELECT fullVisitorId, city FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170801' #56,989 records
;
#standardSQL
# copy one day of ecommerce data to explore
CREATE OR REPLACE TABLE ecommerce.all_sessions_raw_20170801
#schema
(
fullVisitorId STRING NOT NULL OPTIONS(description="Unique visitor ID"),
channelGrouping STRING NOT NULL OPTIONS(description="Channel e.g. Direct, Organic, Referral..."),
totalTransactionRevenue INT64 OPTIONS(description="Revenue * 10^6 for the transaction")
)
OPTIONS(
description="Raw data from analyst team into our dataset for 08/01/2017"
) AS
SELECT fullVisitorId, channelGrouping, totalTransactionRevenue FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170801' #56,989 records
;
#standardSQL
# copy one day of ecommerce data to explore
CREATE OR REPLACE TABLE ecommerce.revenue_transactions_20170801
#schema
(
fullVisitorId STRING NOT NULL OPTIONS(description="Unique visitor ID"),
visitId STRING NOT NULL OPTIONS(description="ID of the session, not unique across all users"),
channelGrouping STRING NOT NULL OPTIONS(description="Channel e.g. Direct, Organic, Referral..."),
totalTransactionRevenue FLOAT64 NOT NULL OPTIONS(description="Revenue for the transaction")
)
OPTIONS(
description="Revenue transactions for 08/01/2017"
) AS
SELECT DISTINCT
fullVisitorId,
CAST(visitId AS STRING) AS visitId,
channelGrouping,
totalTransactionRevenue / 1000000 AS totalTransactionRevenue
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170801'
AND totalTransactionRevenue IS NOT NULL #XX transactions
;
Tip: It is often difficult to know whether or not you are SELECTing from a Table or a View by just looking at the name. A simple convention is to prefix the view name with vw_ or add a suffix like _vw or _view.
#standardSQL
CREATE OR REPLACE VIEW ecommerce.vw_latest_transactions
AS
SELECT DISTINCT
date,
fullVisitorId,
CAST(visitId AS STRING) AS visitId,
channelGrouping,
totalTransactionRevenue / 1000000 AS totalTransactionRevenue
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE totalTransactionRevenue IS NOT NULL
ORDER BY date DESC # latest transactions
LIMIT 100
;
#standardSQL
CREATE OR REPLACE VIEW ecommerce.vw_large_transactions
OPTIONS(
description="large transactions for review",
labels=[('org_unit','loss_prevention')]
)
AS
SELECT DISTINCT
date,
fullVisitorId,
visitId,
channelGrouping,
totalTransactionRevenue / 1000000 AS totalTransactionRevenue,
currencyCode,
STRING_AGG(DISTINCT v2ProductName ORDER BY v2ProductName LIMIT 10) AS products_ordered
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE
(totalTransactionRevenue / 1000000) > 1000
AND currencyCode = 'USD'
GROUP BY 1,2,3,4,5,6
ORDER BY date DESC # latest transactions
LIMIT 10
#standardSQL
SELECT
SESSION_USER() AS viewer_ldap; # To view your own session login information
#standardSQL
CREATE OR REPLACE VIEW ecommerce.vw_large_transactions
OPTIONS(
description="large transactions for review",
labels=[('org_unit','loss_prevention')],
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
)
AS
#standardSQL
SELECT DISTINCT
SESSION_USER() AS viewer_ldap,
REGEXP_EXTRACT(SESSION_USER(), r'@(.+)') AS domain,
date,
fullVisitorId,
visitId,
channelGrouping,
totalTransactionRevenue / 1000000 AS totalTransactionRevenue,
currencyCode,
STRING_AGG(DISTINCT v2ProductName ORDER BY v2ProductName LIMIT 10) AS products_ordered
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE
(totalTransactionRevenue / 1000000) > 1000
AND currencyCode = 'USD'
AND REGEXP_EXTRACT(SESSION_USER(), r'@(.+)') IN ('qwiklabs.net') # only show results to users whose login domain is qwiklabs.net
GROUP BY 1,2,3,4,5,6,7,8
ORDER BY date DESC # latest transactions
LIMIT 10;
STRING_AGG - Returns a value (either STRING or BYTES) obtained by concatenating non-null values. If a delimiter is specified, concatenated values are separated by that delimiter; otherwise, a comma is used as a delimiter.
STRING_AGG([DISTINCT] expression [, delimiter] [ORDER BY key [{ASC|DESC}] [, ... ]] [LIMIT n])
[OVER (...)]
Example
SELECT STRING_AGG(fruit) AS string_agg
FROM UNNEST(["apple", NULL, "pear", "banana", "pear"]) AS fruit;
# apple,pear,banana,pear
SELECT STRING_AGG(fruit, " & ") AS string_agg
FROM UNNEST(["apple", "pear", "banana", "pear"]) AS fruit;
# apple & pear & banana & pear
SELECT STRING_AGG(DISTINCT fruit, " & ") AS string_agg
FROM UNNEST(["apple", "pear", "banana", "pear"]) AS fruit;
SELECT STRING_AGG(fruit, " & " ORDER BY LENGTH(fruit)) AS string_agg
FROM UNNEST(["apple", "pear", "banana", "pear"]) AS fruit;
SELECT STRING_AGG(fruit, " & " LIMIT 2) AS string_agg
FROM UNNEST(["apple", "pear", "banana", "pear"]) AS fruit;
SELECT STRING_AGG(DISTINCT fruit, " & " ORDER BY fruit DESC LIMIT 2) AS string_agg
FROM UNNEST(["apple", "pear", "banana", "pear"]) AS fruit;
SELECT
fruit,
STRING_AGG(fruit, " & ") OVER (ORDER BY LENGTH(fruit)) AS string_agg
FROM UNNEST(["apple", NULL, "pear", "banana", "pear"]) AS fruit;
CREATE TABLE IF NOT EXISTS mydataset.newtable (x INT64, y STRUCT<a ARRAY<STRING>, b BOOL>)
OPTIONS(
expiration_timestamp=TIMESTAMP "2025-01-01 00:00:00 UTC",
description="a table that expires in 2025",
labels=[("org_unit", "development")]
)
CREATE OR REPLACE TABLE mydataset.newtable (x INT64, y STRUCT<a ARRAY<STRING>, b BOOL>)
OPTIONS(
expiration_timestamp=TIMESTAMP "2025-01-01 00:00:00 UTC",
description="a table that expires in 2025",
labels=[("org_unit", "development")]
)
CREATE TABLE mydataset.newtable
(
x INT64 OPTIONS(description="An optional INTEGER field"),
y STRUCT<
a ARRAY<STRING> OPTIONS(description="A repeated STRING field"),
b BOOL
>
)
PARTITION BY _PARTITIONDATE
OPTIONS(
expiration_timestamp=TIMESTAMP "2025-01-01 00:00:00 UTC",
partition_expiration_days=1,
description="a table that expires in 2025, with each partition living for 24 hours",
labels=[("org_unit", "development")]
)
CREATE TABLE my_dataset.new_table (
x INT64 NOT NULL,
y STRUCT<
a ARRAY<STRING>,
b BOOL NOT NULL,
c FLOAT64
> NOT NULL,
z STRING
)
#### Create clustered table
CREATE TABLE mydataset.myclusteredtable
(
timestamp TIMESTAMP,
customer_id STRING,
transaction_amount NUMERIC
)
PARTITION BY DATE(timestamp)
CLUSTER BY customer_id
OPTIONS (
partition_expiration_days=3,
description="a table clustered by customer_id"
)
#standardSQL
SELECT
*,
SAFE_DIVIDE(orderedQuantity,stockLevel) AS ratio
FROM
ecommerce.products
WHERE
# include products that have been ordered and
# are 80% through their inventory
orderedQuantity > 0
AND SAFE_DIVIDE(orderedQuantity,stockLevel) >= .8
ORDER BY
restockingLeadTime DESC
Arrays - an array is an ordered list consisting of zero or more values of the same data type. https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays
- BigQuery natively supports arrays
- Array values must share a data type
- Arrays are called REPEATED fields in BigQuery
- finding the number of elements with ARRAY_LENGTH()
- deduplicating elements with ARRAY_AGG(DISTINCT )
- ordering elements with ARRAY_AGG( ORDER BY )
- limiting ARRAY_AGG( LIMIT 5)
- Generating
SELECT ["apple", "pear", "orange"] as fruit;
SELECT ARRAY<FLOAT64>[1, 2, 3] as floats;
SELECT GENERATE_ARRAY(11, 33, 2) AS odds; -- [11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33]
SELECT GENERATE_ARRAY(21, 14, -1) AS countdown; -- [21, 20, 19, 18, 17, 16, 15, 14]
SELECT
GENERATE_DATE_ARRAY('2017-11-21', '2017-12-31', INTERVAL 1 WEEK)
AS date_array;
SELECT ['raspberry', 'blackberry', 'strawberry', 'cherry', 1234567] AS fruit_array -- Error Array elements of types {INT64, STRING} do not have a common supertype
- Accessing
WITH sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT [5, 10] AS some_numbers)
SELECT some_numbers,
some_numbers[OFFSET(1)] AS offset_1,
some_numbers[ORDINAL(1)] AS ordinal_1
FROM sequences;
- Finding Lengths
WITH sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT [5, 10] AS some_numbers)
SELECT some_numbers,
ARRAY_LENGTH(some_numbers) AS len
FROM sequences;
- Flattening arrays - converting array elements into rows
- You need to UNNEST() arrays to bring the array elements back into rows
- UNNEST() always follows the table name in your FROM clause (think of it conceptually like a pre-joined table)
SELECT DISTINCT
visitId,
h.page.pageTitle
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
UNNEST(hits) AS h
WHERE visitId = 1501570398
LIMIT 10
                                             'homepage'
['homepage', 'product page', 'checkout'] ->  'product page'
                                             'checkout'
SELECT *
FROM UNNEST(['foo', 'bar', 'baz', 'qux', 'corge', 'garply', 'waldo', 'fred'])
AS element
WITH OFFSET AS offset
ORDER BY offset;
WITH sequences AS
(SELECT 1 AS id, [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT 2 AS id, [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT 3 AS id, [5, 10] AS some_numbers)
SELECT id, flattened_numbers
FROM sequences
CROSS JOIN UNNEST(sequences.some_numbers) AS flattened_numbers;
- Querying Nested Arrays
WITH races AS (
SELECT "800M" AS race,
[STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits),
STRUCT("Makhloufi" as name, [24.5, 25.4, 26.6, 26.1] as splits),
STRUCT("Murphy" as name, [23.9, 26.0, 27.0, 26.0] as splits),
STRUCT("Bosse" as name, [23.6, 26.2, 26.5, 27.1] as splits),
STRUCT("Rotich" as name, [24.7, 25.6, 26.9, 26.4] as splits),
STRUCT("Lewandowski" as name, [25.0, 25.7, 26.3, 27.2] as splits),
STRUCT("Kipketer" as name, [23.2, 26.1, 27.3, 29.4] as splits),
STRUCT("Berian" as name, [23.7, 26.1, 27.0, 29.3] as splits)]
AS participants)
SELECT
race,
participant
FROM races r
CROSS JOIN UNNEST(r.participants) as participant;
- Creating Arrays From Subqueries
WITH sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT [5, 10] AS some_numbers)
SELECT some_numbers,
ARRAY(SELECT x * 2
FROM UNNEST(some_numbers) AS x) AS doubled
FROM sequences;
- Filtering Arrays
WITH sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT [5, 10] AS some_numbers)
SELECT
ARRAY(SELECT x * 2
FROM UNNEST(some_numbers) AS x
WHERE x < 5) AS doubled_less_than_five
FROM sequences;
WITH sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers)
SELECT ARRAY(SELECT DISTINCT x
FROM UNNEST(some_numbers) AS x) AS unique_numbers
FROM sequences;
- Scanning Arrays
SELECT 2 IN UNNEST([0, 1, 1, 2, 3, 5]) AS contains_value; # returns true if the array contains the number 2
# returns the id value for the rows where the array column contains the value 2.
WITH sequences AS
(SELECT 1 AS id, [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT 2 AS id, [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT 3 AS id, [5, 10] AS some_numbers)
SELECT id AS matching_rows
FROM sequences
WHERE 2 IN UNNEST(sequences.some_numbers)
ORDER BY matching_rows;
- Arrays and Aggregation
WITH fruits AS
(SELECT "apple" AS fruit
UNION ALL SELECT "pear" AS fruit
UNION ALL SELECT "banana" AS fruit)
SELECT ARRAY_AGG(fruit) AS fruit_basket
FROM fruits;
WITH fruits AS
(SELECT "apple" AS fruit
UNION ALL SELECT "pear" AS fruit
UNION ALL SELECT "banana" AS fruit)
SELECT ARRAY_AGG(fruit ORDER BY fruit) AS fruit_basket
FROM fruits;
- Converting Arrays to Strings
WITH greetings AS
(SELECT ["Hello", "World"] AS greeting)
SELECT ARRAY_TO_STRING(greeting, " ") AS greetings
FROM greetings;
- Combining Arrays
SELECT ARRAY_CONCAT([1, 2], [3, 4], [5, 6]) as count_to_six; # return [1, 2, 3, 4, 5, 6]
- Zipping Arrays
You can zip arrays with UNNEST and WITH OFFSET. In this example, each value pair is stored as a STRUCT in an array.
WITH combinations AS (
SELECT
['a', 'b'] AS letters,
[1, 2, 3] AS numbers
)
SELECT ARRAY_AGG(
STRUCT(letter, numbers[OFFSET(letters_offset)] AS number)
) AS pairs
FROM combinations, UNNEST(letters) AS letter WITH OFFSET AS letters_offset;
- Building Arrays of Arrays
BigQuery does not support building arrays of arrays directly. Instead, you must create an array of structs, with each struct containing a field of type ARRAY.
WITH points AS
(SELECT [1, 5] as point
UNION ALL SELECT [2, 8] as point
UNION ALL SELECT [3, 7] as point
UNION ALL SELECT [4, 1] as point
UNION ALL SELECT [5, 7] as point)
SELECT ARRAY(
SELECT STRUCT(point)
FROM points)
AS coordinates;
+-------------------+
| coordinates |
+-------------------+
| [{point: [1,5]}, |
| {point: [2,8]}, |
| {point: [5,7]}, |
| {point: [3,7]}, |
| {point: [4,1]}] |
+-------------------+
- STRUCT - think of a STRUCT conceptually as a separate table that is already pre-joined into your main table. A STRUCT can have:
- one or many fields in it
- the same or different data types for each field
- its own alias (like runner below), so it can conceptually be thought of as a table inside of your main table
- STRUCTs (and ARRAYs) must be unpacked before you can operate over their elements. Wrap an UNNEST() around the name of the struct itself, or around the struct field that is an array, in order to unpack and flatten it.
A STRUCT is essentially a collection of other fields, kind of like a table. From a reporting standpoint, you can have many nested STRUCTs within a single table, which is conceptually like having many other tables pre-joined.
SELECT STRUCT("Rudisha" as name, 23.4 as split) as runner
SELECT STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits) AS runner
A table column has the RECORD type when it contains a STRUCT.
gs://data-insights-course/labs/optimizing-for-performance/race_results.json
#standardSQL
SELECT race, participants.name -- accessing an element inside a STRUCT requires a CROSS JOIN with that struct
FROM racing.race_results
CROSS JOIN
participants # this is the STRUCT (it is like a table within a table)
#standardSQL
SELECT race, participants.name
FROM racing.race_results
CROSS JOIN
race_results.participants # full STRUCT name
#standardSQL -- simplest form
SELECT race, participants.name
FROM racing.race_results AS r, r.participants
#standardSQL
SELECT COUNT(p.name) AS racer_count
FROM racing.race_results AS r, UNNEST(r.participants) AS p
#standardSQL
SELECT
p.name,
SUM(split_times) as total_race_time
FROM racing.race_results AS r
, UNNEST(r.participants) AS p
, UNNEST(p.splits) AS split_times
WHERE p.name LIKE 'R%'
GROUP BY p.name
ORDER BY total_race_time ASC;
#standardSQL
SELECT
p.name,
split_time
FROM racing.race_results AS r
, UNNEST(r.participants) AS p
, UNNEST(p.splits) AS split_time
WHERE split_time = 23.2;
- SELECT DISTINCT - does not return columns of type STRUCT or ARRAY
- SELECT * EXCEPT - specifies the names of one or more columns to exclude from the result. All matching column names are omitted from the output.
WITH orders AS
(SELECT 5 as order_id,
"sprocket" as item_name,
200 as quantity)
SELECT * EXCEPT (order_id)
FROM orders;
Note: SELECT * EXCEPT does not exclude columns that do not have names.
- SELECT * REPLACE
WITH orders AS
(SELECT 5 as order_id,
"sprocket" as item_name,
200 as quantity)
SELECT * REPLACE (quantity/2 AS quantity)
FROM orders;
- SELECT ALL
- SELECT AS STRUCT
SELECT
ARRAY(SELECT AS STRUCT t.f1, t.f2 WHERE t.f3=true)
FROM
Table t
- SELECT AS VALUE
SELECT AS VALUE STRUCT(1 a, 2 b) xyz FROM Table;
- FOR SYSTEM_TIME AS OF - references the historical versions of the table definition and rows that were current at timestamp_expression.
SELECT *
FROM t
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR); # returns a historical version of the table from one hour ago.
SELECT *
FROM t
FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00';
Example:
DECLARE before_replace_timestamp TIMESTAMP;
-- Create table books.
CREATE TABLE books AS
SELECT 'Hamlet' title, 'William Shakespeare' author;
-- Get current timestamp before table replacement.
SET before_replace_timestamp = CURRENT_TIMESTAMP();
-- Replace table with different schema(title and release_date).
CREATE OR REPLACE TABLE books AS
SELECT 'Hamlet' title, DATE '1603-01-01' release_date;
SELECT * FROM books;
-- This query returns Hamlet, William Shakespeare as result.
SELECT * FROM books FOR SYSTEM_TIME AS OF before_replace_timestamp;
ON and USING comparison
FROM A JOIN B ON A.x = B.x
FROM A JOIN B USING (x)
Table A Table B Result ON Result USING
+---+ +---+ +-------+ +---+
| x | * | x | = | x | x | | x |
+---+ +---+ +-------+ +---+
| 1 | | 2 | | 2 | 2 | | 2 |
| 2 | | 3 | | 3 | 3 | | 3 |
| 3 | | 4 | +-------+ +---+
+---+ +---+
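A small self-contained illustration of the two result shapes above; tables A and B are defined inline with made-up values:
-- ON keeps a separate x column from each table.
WITH A AS (SELECT x FROM UNNEST([1, 2, 3]) AS x),
     B AS (SELECT x FROM UNNEST([2, 3, 4]) AS x)
SELECT A.x AS a_x, B.x AS b_x
FROM A JOIN B ON A.x = B.x;
-- USING merges x into a single output column.
WITH A AS (SELECT x FROM UNNEST([1, 2, 3]) AS x),
     B AS (SELECT x FROM UNNEST([2, 3, 4]) AS x)
SELECT x
FROM A JOIN B USING (x);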
CREATE TEMP FUNCTION addFourAndDivide(x INT64, y INT64) AS ((x + 4) / y);
WITH numbers AS
(SELECT 1 as val
UNION ALL
SELECT 3 as val
UNION ALL
SELECT 4 as val
UNION ALL
SELECT 5 as val)
SELECT val, addFourAndDivide(val, 2) AS result
FROM numbers;
CREATE TEMP FUNCTION lastArrayElement(arr ANY TYPE) AS (
arr[ORDINAL(ARRAY_LENGTH(arr))]
);
SELECT
names[OFFSET(0)] AS first_name,
lastArrayElement(names) AS last_name
FROM (
SELECT ['Fred', 'McFeely', 'Rogers'] AS names UNION ALL
SELECT ['Marie', 'Skłodowska', 'Curie']
);
CREATE TEMP FUNCTION multiplyInputs(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
return x*y;
""";
WITH numbers AS
(SELECT 1 AS x, 5 as y
UNION ALL
SELECT 2 AS x, 10 as y
UNION ALL
SELECT 3 as x, 15 as y)
SELECT x, y, multiplyInputs(x, y) as product
FROM numbers;
Wildcard tables - must be quoted with backtick (`) characters. Quotation marks do not work!
SELECT
FROM
`<project-id>.<dataset-id>.<table-prefix>*`
WHERE
bool_expression
SELECT
max
FROM
`bigquery-public-data.noaa_gsod.gsod*`
WHERE
max != 9999.9 # code for missing data
AND _TABLE_SUFFIX = '1929'
ORDER BY
max DESC
SELECT
ROUND(AVG(TemperatureF),1) AS AVG_TEMP_F
FROM
`mydataset.sea_weather_*`
WHERE
_TABLE_SUFFIX BETWEEN '20160501' AND '20160509'
SELECT
max,
ROUND((max-32)*5/9,1) celsius,
year
FROM
`bigquery-public-data.noaa_gsod.gsod*`
WHERE
max != 9999.9 # code for missing data
AND max > 100 # to improve ORDER BY performance
AND REGEXP_CONTAINS(_TABLE_SUFFIX, r"0$")
ORDER BY
max DESC
BQML supports the model types listed below. There are three to choose from:
Model | Model Type | Label Data type | Example |
---|---|---|---|
Forecasting | linear_reg | Numeric value (typically an integer or floating point) | Forecast sales figures for next year given historical sales data. |
Classification | logistic_reg | 0 or 1 for binary classification | Classify an email as spam or not spam given the context. |
k-Means Clustering | kmeans | No label needed (unsupervised learning) | Group customers into segments for exploration. |
- Forecasting numeric values like next month's sales with Linear Regression (linear_reg).
- Binary or Multiclass Classification like spam or not spam email by using Logistic Regression (logistic_reg).
- k-Means Clustering for when you want unsupervised learning for exploration (kmeans).
CREATE OR REPLACE MODEL `ecommerce.classification_model`
OPTIONS
(
model_type='logistic_reg',
labels = ['will_buy_on_return_visit']
)
AS
#standardSQL
SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430') # train on first 9 months
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
;
In BQML, roc_auc is simply a queryable field when evaluating your trained ML model.
Now that training is complete, run this query to evaluate how well the model performs using ML.EVALUATE:
SELECT
roc_auc,
CASE
WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'decent'
WHEN roc_auc > .6 THEN 'not great'
ELSE 'poor' END AS model_quality
FROM
ML.EVALUATE(MODEL ecommerce.classification_model, (
SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1
AND date BETWEEN '20170501' AND '20170630') # eval on 2 months
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
));
You should see the following result:
Row roc_auc model_quality
1 0.724588 decent
After evaluating your model you get a roc_auc of 0.72, which shows the model has decent, but not great, predictive power. Since the goal is to get the area under the curve as close to 1.0 as possible, there is room for improvement.
CREATE OR REPLACE MODEL `ecommerce.classification_model_2`
OPTIONS
(model_type='logistic_reg', labels = ['will_buy_on_return_visit']) AS
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
# add in new features
SELECT * EXCEPT(unique_session_id) FROM (
SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
IFNULL(totals.pageviews, 0) AS pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430' # train 9 months
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
);
Evaluate this new model to see if there is better predictive power:
#standardSQL
SELECT
roc_auc,
CASE
WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'decent'
WHEN roc_auc > .6 THEN 'not great'
ELSE 'poor' END AS model_quality
FROM
ML.EVALUATE(MODEL ecommerce.classification_model_2, (
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
# add in new features
SELECT * EXCEPT(unique_session_id) FROM (
SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20170501' AND '20170630' # eval 2 months
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
));
Predict which new visitors will come back and purchase
SELECT
*
FROM
ml.PREDICT(MODEL `ecommerce.classification_model_2`,
(
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
SELECT
CONCAT(fullvisitorid, '-',CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE
# only predict for new visits
totals.newVisits = 1
AND date BETWEEN '20170701' AND '20170801' # test 1 month
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
)
ORDER BY
predicted_will_buy_on_return_visit DESC;
Scheduled queries must be written in standard SQL which can include Data Definition Language (DDL) and Data Manipulation Language (DML) statements.
Required permissions:
- Either bigquery.jobs.create or bigquery.transfers.update permissions to create the transfer.
- bigquery.datasets.update permissions on the target dataset.
The bigquery.jobUser predefined IAM role includes bigquery.jobs.create permissions.
- run_time - represents the intended time of execution. For example, if the scheduled query is set to "every 24 hours", the run_time difference between consecutive queries is exactly 24 hours, even though the actual execution time may vary slightly.
- run_date - represents a logical calendar date in the following format: %Y%m%d; for example, 20180101. This format is compatible with ingestion-time partitioned tables.
{run_date}
{run_time[+\-offset]|"time_format"}
Example
mytable_{run_time+25h|"%Y%m%d"}
mytable_{run_time+90m|"%Y%m%d;%H"}
{run_time+97s|"%Y%m%d"}_mytable_{run_time|"%H%M%s"}
SELECT @run_time AS time,
title,
author,
text
FROM `bigquery-public-data.hacker_news.stories`
LIMIT
1000
- Option 1: Use the bq query command.
bq query \
--display_name=name \
--destination_table=table \
--schedule=interval
# Example interval
--schedule='every 24 hours'
--schedule='every 3 hours'
# Example:
bq query \
--use_legacy_sql=false \
--destination_table=mydataset.mytable \
--display_name='My Scheduled Query' \
--schedule='every 24 hours' \
--replace=true \
'SELECT
1
FROM
mydataset.test'
- Option 2: Use the bq mk command. Scheduled queries are a kind of transfer. To schedule a query, you can use the BigQuery Data Transfer Service command-line interface to make a transfer configuration.
bq mk \
--transfer_config \
--project_id=project_id \
--target_dataset=dataset \
--display_name=name \
--params='parameters' \
--data_source=data_source
For example, the following command creates a scheduled query transfer configuration named My Scheduled Query using the simple query SELECT 1 from mydataset.test. The destination table mytable is truncated for every write, and the target dataset is mydataset. The scheduled query is created in the default project, and authenticates with a service account:
bq mk \
--transfer_config \
--target_dataset=mydataset \
--display_name='My Scheduled Query' \
--params='{"query":"SELECT 1 from mydataset.test","destination_table_name_template":"mytable","write_disposition":"WRITE_TRUNCATE"}' \
--data_source=scheduled_query \
--service_account_name=abcdef-test-sa@abcdef-test.iam.gserviceaccount.com
bq ls \
--transfer_config \
--transfer_location=us
bq show \
--transfer_config \
projects/862514376110/locations/us/transferConfigs/5dd12f26-0000-262f-bc38-089e0820fe38
For example, to backfill runs when a scheduled query did not execute properly on previous days:
bq mk \
--transfer_run \
--start_time='start_time' \
--end_time='end_time' \
resource_name
bq mk \
--transfer_run \
--start_time 2017-05-25T00:00:00Z \
--end_time 2017-05-25T00:00:00Z \
projects/mypro
Approximate aggregate functions are scalable in terms of memory usage and time, but produce approximate results instead of exact results. These functions typically require less memory than exact aggregation functions. This makes approximate aggregation appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
- APPROX_COUNT_DISTINCT
SELECT APPROX_COUNT_DISTINCT(x) as approx_distinct
FROM UNNEST([0, 1, 1, 2, 3, 5]) as x;
- APPROX_QUANTILES Returns the approximate boundaries for a group of expression values, where number represents the number of quantiles to create. This function returns an array of number + 1 elements, where the first element is the approximate minimum and the last element is the approximate maximum.
APPROX_QUANTILES([DISTINCT] expression, number [{IGNORE|RESPECT} NULLS])
SELECT APPROX_QUANTILES(x, 2) AS approx_quantiles
FROM UNNEST([1, 1, 1, 4, 5, 6, 7, 8, 9, 10]) AS x;
SELECT APPROX_QUANTILES(x, 100)[OFFSET(90)] AS percentile_90
FROM UNNEST([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) AS x;
SELECT APPROX_QUANTILES(DISTINCT x, 2) AS approx_quantiles
FROM UNNEST([1, 1, 1, 4, 5, 6, 7, 8, 9, 10]) AS x;
SELECT
MIN(start_station_name) AS start_station_name,
MIN(end_station_name) AS end_station_name,
APPROX_QUANTILES(tripduration, 10)[OFFSET (5)] AS typical_duration,
COUNT(tripduration) AS num_trips
FROM
`bigquery-public-data.new_york_citibike.citibike_trips`
WHERE
start_station_id != end_station_id
GROUP BY
start_station_id,
end_station_id
ORDER BY
num_trips DESC
LIMIT
10
- APPROX_TOP_COUNT
APPROX_TOP_COUNT(expression, number)
Returns the approximate top elements of expression.
SELECT APPROX_TOP_COUNT(x, 2) as approx_top_count
FROM UNNEST(["apple", "apple", "pear", "pear", "pear", "banana"]) as x;
+-------------------------+
| approx_top_count |
+-------------------------+
| [{pear, 3}, {apple, 2}] |
+-------------------------+
NULL handling
APPROX_TOP_COUNT does not ignore NULLs in the input. For example:
SELECT APPROX_TOP_COUNT(x, 2) as approx_top_count
FROM UNNEST([NULL, "pear", "pear", "pear", "apple", NULL]) as x;
+------------------------+
| approx_top_count |
+------------------------+
| [{pear, 3}, {NULL, 2}] |
+------------------------+
- APPROX_TOP_SUM
APPROX_TOP_SUM(expression, weight, number)
Description
Returns the approximate top elements of expression, based on the sum of an assigned weight. The number parameter specifies the number of elements returned.
If the weight input is negative or NaN, this function returns an error.
SELECT APPROX_TOP_SUM(x, weight, 2) AS approx_top_sum FROM
UNNEST([
STRUCT("apple" AS x, 3 AS weight),
("pear", 2),
("apple", 0),
("banana", 5),
("pear", 4)
]);
+--------------------------+
| approx_top_sum |
+--------------------------+
| [{pear, 6}, {banana, 5}] |
+--------------------------+
SELECT APPROX_TOP_SUM(x, weight, 2) AS approx_top_sum FROM
UNNEST([STRUCT("apple" AS x, NULL AS weight), ("pear", 0), ("pear", NULL)]);
+----------------------------+
| approx_top_sum |
+----------------------------+
| [{pear, 0}, {apple, NULL}] |
+----------------------------+
SELECT APPROX_TOP_SUM(x, weight, 2) AS approx_top_sum FROM
UNNEST([STRUCT("apple" AS x, 0 AS weight), (NULL, 2)]);
+-------------------------+
| approx_top_sum |
+-------------------------+
| [{NULL, 2}, {apple, 0}] |
+-------------------------+
SELECT APPROX_TOP_SUM(x, weight, 2) AS approx_top_sum FROM
UNNEST([STRUCT("apple" AS x, 0 AS weight), (NULL, NULL)]);
+----------------------------+
| approx_top_sum |
+----------------------------+
| [{apple, 0}, {NULL, NULL}] |
+----------------------------+