GCS BigQuery CSV - ghdrako/doc_snipets GitHub Wiki
- https://cloud.google.com/bigquery/docs/external-data-cloud-storage
- https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_external_table_statement
If your CSV files in Cloud Storage are also organized in a Hive-compatible partitioning layout, you can use that layout to reduce the amount of data scanned, or expose the partition keys as additional pseudocolumns of the table. If not, note the guidance on the `_FILE_NAME` pseudocolumn, which is useful when the filename of the CSV is relevant to your queries.
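As a minimal sketch, `_FILE_NAME` can be selected and filtered like an ordinary column of the external table `mydataset.sales` defined below (the `LIKE` pattern is illustrative):

```sql
-- _FILE_NAME is a pseudocolumn on Cloud Storage external tables;
-- it holds the full gs:// URI of the file each row was read from.
SELECT
  _FILE_NAME AS source_file,
  Region,
  Total_Sales
FROM mydataset.sales
WHERE _FILE_NAME LIKE '%sales%';
```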
```sql
CREATE OR REPLACE EXTERNAL TABLE mydataset.sales
OPTIONS (
  format = 'CSV',
  uris = ['gs://mybucket/sales.csv']);
```
```sql
CREATE OR REPLACE EXTERNAL TABLE mydataset.sales (
  Region STRING,
  Quarter STRING,
  Total_Sales INT64
) OPTIONS (
  format = 'CSV',
  uris = ['gs://mybucket/sales.csv'],
  skip_leading_rows = 1);
```
Examples of valid `uris` values:
- `['gs://bucket/path1/myfile.csv']`
- `['gs://bucket/path1/*.csv']`
- `['gs://bucket/path1/*', 'gs://bucket/path2/file00*']`
When you specify uris values that target multiple files, all of those files must share a compatible schema.
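A hedged sketch combining wildcard URIs into a single external table (the bucket, paths, and table name are illustrative; all matched files must share a compatible schema):

```sql
-- Multiple URI patterns in one table definition; every matched
-- file must be parseable with the same CSV schema.
CREATE OR REPLACE EXTERNAL TABLE mydataset.sales_all
OPTIONS (
  format = 'CSV',
  uris = ['gs://mybucket/path1/*', 'gs://mybucket/path2/file00*'],
  skip_leading_rows = 1);
```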
- https://cloud.google.com/bigquery/docs/hive-partitioned-queries
- https://athena.guide/articles/hive-style-partitioning/
```sql
CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
WITH PARTITION COLUMNS (<PARTITION_COLUMN_LIST>)
OPTIONS (
  format = 'SOURCE_FORMAT',
  uris = ['GCS_URIS'],
  hive_partition_uri_prefix = 'GCS_URI_SHARED_PREFIX',
  require_hive_partition_filter = BOOLEAN
);
```
| Option | Type | Description |
|---|---|---|
| `hive_partition_uri_prefix` | STRING | A common prefix for all source URIs before the partition key encoding begins. Applies only to hive-partitioned external tables. Applies to Avro, CSV, JSON, Parquet, and ORC data. Example: `"gs://bucket/path"`. |
| `require_hive_partition_filter` | BOOL | If true, all queries over this table require a partition filter that can be used to eliminate partitions when reading data. Applies only to hive-partitioned external tables. Applies to Avro, CSV, JSON, Parquet, and ORC data. |
An example layout with a `day=` partition key:

```
gs://<bucket>/sprawozdawczosc/data/day=2020-09-08/dev_AcpBatchDataReplicationProcess_DataExtract_RL-BAZA-TRANSAKCJI_2023-09-09.07.02.40_28237260_pc68v.txt
gs://<bucket>/sprawozdawczosc/data/day=2020-09-08/dev_AcpBatchDataReplicationProcess_DataExtract_RL-BAZA-TRANSAKCJI-POS_2023-09-09.07.02.40_28237260_cj22v.txt
gs://<bucket>/sprawozdawczosc/data/day=2020-09-09/dev_AcpBatchDataReplicationProcess_DataExtract_RL-BAZA-TRANSAKCJI_2023-09-10.07.02.40_28237260_xy68v.txt
gs://<bucket>/sprawozdawczosc/data/day=2020-09-09/dev_AcpBatchDataReplicationProcess_DataExtract_RL-BAZA-TRANSAKCJI-POS_2023-09-10.07.02.40_28237260_xy65v.txt
```
For example, with files laid out as `gs://bucket/path/field_1=first/field_2=1/data.parquet`:

```sql
CREATE EXTERNAL TABLE dataset.CustomHivePartitionedTable
WITH PARTITION COLUMNS (
  field_1 STRING, -- column order must match the external path
  field_2 INT64)
OPTIONS (
  uris = ['gs://bucket/path/*'],
  format = 'PARQUET',
  hive_partition_uri_prefix = 'gs://bucket/path',
  require_hive_partition_filter = false);
```
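A hedged example query against the table above; filtering on the partition keys lets BigQuery skip directories that do not match (the values are taken from the example path):

```sql
-- Predicates on partition keys prune non-matching
-- field_1=/field_2= directories before any file is read.
SELECT *
FROM dataset.CustomHivePartitionedTable
WHERE field_1 = 'first'
  AND field_2 = 1;
```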
Hive partition keys appear as normal columns when you query data from Cloud Storage. The data must follow a default Hive partitioned layout. For example, the following files follow the default layout—the key-value pairs are configured as directories with an equal sign (=) as a separator, and the partition keys are always in the same order:
```
gs://myBucket/myTable/dt=2019-10-31/lang=en/partitionKey
gs://myBucket/myTable/dt=2018-10-31/lang=fr/partitionKey
```
The common source URI prefix in this example is `gs://myBucket/myTable`.
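As a sketch, an external table over that layout can rely on automatic key detection by omitting the column list after `WITH PARTITION COLUMNS` (the table name is illustrative, and `format = 'CSV'` is an assumption; adjust to the actual source format):

```sql
-- AUTO detection: dt and lang are inferred from the
-- key=value directory names under the shared prefix.
CREATE EXTERNAL TABLE mydataset.myTable
WITH PARTITION COLUMNS
OPTIONS (
  format = 'CSV',
  uris = ['gs://myBucket/myTable/*'],
  hive_partition_uri_prefix = 'gs://myBucket/myTable');
```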
Detection modes
BigQuery supports three modes of Hive partition schema detection:
- `AUTO`: key names and types are automatically detected. The following types can be detected: STRING, INTEGER, DATE, and TIMESTAMP.
- `STRINGS`: key names are automatically converted to STRING type.
- `CUSTOM`: the partition key schema is encoded as specified in the source URI prefix.
Limitations
- The Hive partitioning keys and the columns in the underlying files cannot overlap.
- Support is for GoogleSQL only.
- All limitations for querying external data sources stored on Cloud Storage apply.
```sql
SELECT
  table_name,
  ddl
FROM
  `dev-biz-acp-gcp-pr.sprawozdawczosc.INFORMATION_SCHEMA.TABLES`
WHERE
  table_name = 'RL-BAZA-TRANSAKCJI-POS-TEST'
```
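Relatedly, the table's individual OPTIONS (format, uris, and so on) can be inspected through the `INFORMATION_SCHEMA.TABLE_OPTIONS` view; a sketch using the same project and dataset as above:

```sql
-- Each OPTIONS entry of the table appears as one row
-- (option_name, option_type, option_value).
SELECT
  option_name,
  option_value
FROM
  `dev-biz-acp-gcp-pr.sprawozdawczosc.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE
  table_name = 'RL-BAZA-TRANSAKCJI-POS-TEST'
```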