GCS BigQuery CSV - ghdrako/doc_snipets GitHub Wiki

If your CSV files in Cloud Storage are organized in a format compatible with a Hive partitioning layout, you can use that layout to reduce the amount of data scanned, or expose the partition keys as additional pseudocolumns of the table. If not, note the guidance around the _FILE_NAME pseudocolumn, which can be useful when the filename of the CSV is relevant to your queries.
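As a sketch of the _FILE_NAME pseudocolumn in use: assuming the mydataset.sales external table created below, this query groups rows by the Cloud Storage object they came from (alias name is illustrative):

```sql
-- _FILE_NAME is a pseudocolumn available on Cloud Storage external tables;
-- it must be selected explicitly (it is not returned by SELECT *).
SELECT
  _FILE_NAME AS source_file,
  COUNT(*) AS row_count
FROM mydataset.sales
GROUP BY source_file;
```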

-- No column list: BigQuery auto-detects the schema from the CSV
CREATE OR REPLACE EXTERNAL TABLE mydataset.sales
  OPTIONS (
    format = 'CSV',
    uris = ['gs://mybucket/sales.csv']);

-- Explicit schema; skip_leading_rows skips the CSV header row
CREATE OR REPLACE EXTERNAL TABLE mydataset.sales (
  Region STRING,
  Quarter STRING,
  Total_Sales INT64
) OPTIONS (
    format = 'CSV',
    uris = ['gs://mybucket/sales.csv'],
    skip_leading_rows = 1);

Examples of valid uris values:

['gs://bucket/path1/myfile.csv']
['gs://bucket/path1/*.csv']
['gs://bucket/path1/*', 'gs://bucket/path2/file00*']

When you specify uris values that target multiple files, all of those files must share a compatible schema.

External partitioning layout

CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
WITH PARTITION COLUMNS (<PARTITION_COLUMN_LIST>)
OPTIONS (
  format = 'SOURCE_FORMAT',
  uris = ['GCS_URIS'],
  hive_partition_uri_prefix = 'GCS_URI_SHARED_PREFIX',
  require_hive_partition_filter = BOOLEAN
);
| Option | Type | Description |
| --- | --- | --- |
| hive_partition_uri_prefix | STRING | A common prefix for all source URIs before the partition key encoding begins. Applies only to hive-partitioned external tables. Applies to Avro, CSV, JSON, Parquet, and ORC data. Example: "gs://bucket/path". |
| require_hive_partition_filter | BOOL | If true, all queries over this table require a partition filter that can be used to eliminate partitions when reading data. Applies only to hive-partitioned external tables. Applies to Avro, CSV, JSON, Parquet, and ORC data. |
gs://<bucket>/sprawozdawczosc/data/day=2020-09-08/dev_AcpBatchDataReplicationProcess_DataExtract_RL-BAZA-TRANSAKCJI_2023-09-09.07.02.40_28237260_pc68v.txt
gs://<bucket>/sprawozdawczosc/data/day=2020-09-08/dev_AcpBatchDataReplicationProcess_DataExtract_RL-BAZA-TRANSAKCJI-POS_2023-09-09.07.02.40_28237260_cj22v.txt
gs://<bucket>/sprawozdawczosc/data/day=2020-09-09/dev_AcpBatchDataReplicationProcess_DataExtract_RL-BAZA-TRANSAKCJI_2023-09-10.07.02.40_28237260_xy68v.txt
gs://<bucket>/sprawozdawczosc/data/day=2020-09-09/dev_AcpBatchDataReplicationProcess_DataExtract_RL-BAZA-TRANSAKCJI-POS_2023-09-10.07.02.40_28237260_xy65v.txt
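The day= directories above form a Hive layout keyed on a date, so a partitioned external table over them might look like the following sketch (the dataset and table names are illustrative, and the files are assumed to be CSV-formatted despite the .txt extension):

```sql
CREATE EXTERNAL TABLE mydataset.transakcje
WITH PARTITION COLUMNS (day DATE)  -- taken from the day=YYYY-MM-DD directories
OPTIONS (
  format = 'CSV',
  uris = ['gs://<bucket>/sprawozdawczosc/data/*'],
  hive_partition_uri_prefix = 'gs://<bucket>/sprawozdawczosc/data',
  require_hive_partition_filter = true  -- force queries to prune by day
);
```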
gs://bucket/path/field_1=first/field_2=1/data.parquet

CREATE EXTERNAL TABLE dataset.CustomHivePartitionedTable
WITH PARTITION COLUMNS (
  field_1 STRING, -- column order must match the external path
  field_2 INT64)
OPTIONS (
  uris = ['gs://bucket/path/*'],
  format = 'PARQUET',
  hive_partition_uri_prefix = 'gs://bucket/path',
  require_hive_partition_filter = false);

Hive partition keys appear as normal columns when you query data from Cloud Storage. The data must follow a default Hive partitioned layout. For example, the following files follow the default layout—the key-value pairs are configured as directories with an equal sign (=) as a separator, and the partition keys are always in the same order:

gs://myBucket/myTable/dt=2019-10-31/lang=en/partitionKey
gs://myBucket/myTable/dt=2018-10-31/lang=fr/partitionKey

The common source URI prefix in this example is gs://myBucket/myTable.
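Because the partition keys surface as ordinary columns, queries can filter on them directly, which lets BigQuery prune unread partitions. A sketch against the myTable layout above (the table name is illustrative):

```sql
SELECT *
FROM mydataset.myTable        -- hypothetical external table over gs://myBucket/myTable
WHERE dt = '2019-10-31'       -- prunes the dt=2018-10-31 directory entirely
  AND lang = 'en';
```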

Detection modes

BigQuery supports three modes of Hive partition schema detection:

  • AUTO: Key names and types are automatically detected. The following types can be detected: STRING, INTEGER, DATE, and TIMESTAMP.
  • STRINGS: Key names are automatically converted to STRING type.
  • CUSTOM: Partition key schema is encoded as specified in the source URI prefix.
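In DDL, AUTO detection corresponds to writing WITH PARTITION COLUMNS without a column list, while supplying an explicit list (as in the CustomHivePartitionedTable example above) corresponds to CUSTOM. A minimal AUTO sketch (table name is illustrative):

```sql
CREATE EXTERNAL TABLE mydataset.auto_partitioned
WITH PARTITION COLUMNS  -- no list: key names and types are inferred from the paths
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://bucket/path/*'],
  hive_partition_uri_prefix = 'gs://bucket/path'
);
```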

Limitations

  • The Hive partitioning keys and the columns in the underlying files cannot overlap.
  • Hive partitioning support is available for GoogleSQL only.
  • All limitations for querying external data sources stored on Cloud Storage apply.

Generate table schema

SELECT
  table_name,
  ddl
FROM
  `dev-biz-acp-gcp-pr.sprawozdawczosc.INFORMATION_SCHEMA.TABLES`
WHERE
  table_name = 'RL-BAZA-TRANSAKCJI-POS-TEST';