GCS BigQuery CSV - ghdrako/doc_snipets GitHub Wiki
- https://cloud.google.com/bigquery/docs/external-data-cloud-storage
- https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_external_table_statement
If your CSV files in Cloud Storage are also organized in a Hive-compatible partitioning layout, you can use that layout to reduce the amount of data scanned, or expose the partition keys as additional pseudocolumns of the table. If not, note the guidance on the `_FILE_NAME` pseudocolumn, which is useful when the filename of the CSV is relevant to your queries.
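As a minimal sketch, `_FILE_NAME` can be selected and filtered like an ordinary column of the external table `mydataset.sales` defined below (the `LIKE` pattern is illustrative):

```sql
-- _FILE_NAME is a pseudocolumn on Cloud Storage external tables;
-- it holds the full gs:// URI of the file each row was read from.
SELECT
  _FILE_NAME AS source_file,
  Region,
  Total_Sales
FROM mydataset.sales
WHERE _FILE_NAME LIKE '%sales%';
```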
```sql
CREATE OR REPLACE EXTERNAL TABLE mydataset.sales
OPTIONS (
  format = 'CSV',
  uris = ['gs://mybucket/sales.csv']);
```
```sql
CREATE OR REPLACE EXTERNAL TABLE mydataset.sales (
  Region STRING,
  Quarter STRING,
  Total_Sales INT64
) OPTIONS (
  format = 'CSV',
  uris = ['gs://mybucket/sales.csv'],
  skip_leading_rows = 1);
```
Examples of valid `uris` values:
- `['gs://bucket/path1/myfile.csv']`
- `['gs://bucket/path1/*.csv']`
- `['gs://bucket/path1/*', 'gs://bucket/path2/file00*']`
When you specify uris values that target multiple files, all of those files must share a compatible schema.
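A hedged sketch combining wildcard URIs into a single external table (the bucket, paths, and table name are illustrative; all matched files must share a compatible schema):

```sql
-- Multiple URI patterns in one table definition; every matched
-- file must be parseable with the same CSV schema.
CREATE OR REPLACE EXTERNAL TABLE mydataset.sales_all
OPTIONS (
  format = 'CSV',
  uris = ['gs://mybucket/path1/*', 'gs://mybucket/path2/file00*'],
  skip_leading_rows = 1);
```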
- https://cloud.google.com/bigquery/docs/hive-partitioned-queries
- https://athena.guide/articles/hive-style-partitioning/
```sql
CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
WITH PARTITION COLUMNS (<PARTITION_COLUMN_LIST>)
OPTIONS (
  format = 'SOURCE_FORMAT',
  uris = ['GCS_URIS'],
  hive_partition_uri_prefix = 'GCS_URI_SHARED_PREFIX',
  require_hive_partition_filter = BOOLEAN
);
```
| Option | Type | Description |
|---|---|---|
| `hive_partition_uri_prefix` | STRING | A common prefix for all source URIs before the partition key encoding begins. Applies only to hive-partitioned external tables. Applies to Avro, CSV, JSON, Parquet, and ORC data. Example: `"gs://bucket/path"`. |
| `require_hive_partition_filter` | BOOL | If true, all queries over this table require a partition filter that can be used to eliminate partitions when reading data. Applies only to hive-partitioned external tables. Applies to Avro, CSV, JSON, Parquet, and ORC data. |
An example layout with a `day=` partition key:

```
gs://<bucket>/sprawozdawczosc/data/day=2020-09-08/dev_AcpBatchDataReplicationProcess_DataExtract_RL-BAZA-TRANSAKCJI_2023-09-09.07.02.40_28237260_pc68v.txt
gs://<bucket>/sprawozdawczosc/data/day=2020-09-08/dev_AcpBatchDataReplicationProcess_DataExtract_RL-BAZA-TRANSAKCJI-POS_2023-09-09.07.02.40_28237260_cj22v.txt
gs://<bucket>/sprawozdawczosc/data/day=2020-09-09/dev_AcpBatchDataReplicationProcess_DataExtract_RL-BAZA-TRANSAKCJI_2023-09-10.07.02.40_28237260_xy68v.txt
gs://<bucket>/sprawozdawczosc/data/day=2020-09-09/dev_AcpBatchDataReplicationProcess_DataExtract_RL-BAZA-TRANSAKCJI-POS_2023-09-10.07.02.40_28237260_xy65v.txt
```
For example, with files laid out as `gs://bucket/path/field_1=first/field_2=1/data.parquet`:

```sql
CREATE EXTERNAL TABLE dataset.CustomHivePartitionedTable
WITH PARTITION COLUMNS (
  field_1 STRING, -- column order must match the external path
  field_2 INT64)
OPTIONS (
  uris = ['gs://bucket/path/*'],
  format = 'PARQUET',
  hive_partition_uri_prefix = 'gs://bucket/path',
  require_hive_partition_filter = false);
```
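A hedged example query against the table above; filtering on the partition keys lets BigQuery skip directories that do not match (the values are taken from the example path):

```sql
-- Predicates on partition keys prune non-matching
-- field_1=/field_2= directories before any file is read.
SELECT *
FROM dataset.CustomHivePartitionedTable
WHERE field_1 = 'first'
  AND field_2 = 1;
```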
Hive partition keys appear as normal columns when you query data from Cloud Storage. The data must follow a default Hive partitioned layout. For example, the following files follow the default layout—the key-value pairs are configured as directories with an equal sign (=) as a separator, and the partition keys are always in the same order:
```
gs://myBucket/myTable/dt=2019-10-31/lang=en/partitionKey
gs://myBucket/myTable/dt=2018-10-31/lang=fr/partitionKey
```
The common source URI prefix in this example is `gs://myBucket/myTable`.
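As a sketch, an external table over that layout can rely on automatic key detection by omitting the column list after `WITH PARTITION COLUMNS` (the table name is illustrative, and `format = 'CSV'` is an assumption; adjust to the actual source format):

```sql
-- AUTO detection: dt and lang are inferred from the
-- key=value directory names under the shared prefix.
CREATE EXTERNAL TABLE mydataset.myTable
WITH PARTITION COLUMNS
OPTIONS (
  format = 'CSV',
  uris = ['gs://myBucket/myTable/*'],
  hive_partition_uri_prefix = 'gs://myBucket/myTable');
```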
Detection modes
BigQuery supports three modes of Hive partition schema detection:
- `AUTO`: key names and types are automatically detected. The following types can be detected: STRING, INTEGER, DATE, and TIMESTAMP.
- `STRINGS`: key names are automatically converted to STRING type.
- `CUSTOM`: the partition key schema is encoded as specified in the source URI prefix.
Limitations
- The Hive partitioning keys and the columns in the underlying files cannot overlap.
- Support is for GoogleSQL only.
- All limitations for querying external data sources stored on Cloud Storage apply.
```sql
SELECT
  table_name,
  ddl
FROM
  `dev-biz-acp-gcp-pr.sprawozdawczosc.INFORMATION_SCHEMA.TABLES`
WHERE
  table_name = 'RL-BAZA-TRANSAKCJI-POS-TEST'
```
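Relatedly, the table's individual OPTIONS (format, uris, and so on) can be inspected through the `INFORMATION_SCHEMA.TABLE_OPTIONS` view; a sketch using the same project and dataset as above:

```sql
-- Each OPTIONS entry of the table appears as one row
-- (option_name, option_type, option_value).
SELECT
  option_name,
  option_value
FROM
  `dev-biz-acp-gcp-pr.sprawozdawczosc.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE
  table_name = 'RL-BAZA-TRANSAKCJI-POS-TEST'
```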