This post in the study-note series covers Google BigQuery. All details were accurate at the time of writing; please refer to Google's documentation for current details.
- BQ has all window functions, string, array and struct functions, user-defined functions, etc.
- Access control can be managed via authorized views that live in a separate dataset. Access Control Lists (ACLs) are applied at the dataset level and cover all tables and views in that dataset
- BigQuery has CTE type concept WITH Clause
- Standard SQL supports ARRAY and STRUCT types; BigQuery stores nested and repeated data natively
- BigQuery can work with both native, internal storage and external files.
- In BQ, every column is stored in a separate file (columnar storage)
- Multiple owners in a project
- BQ ML enables users to create and execute machine learning models in BigQuery by using SQL queries
- BQ is based on Dremel, Google's internal query technology
- Immutable audit logs
- Older data is discounted (long-term storage pricing applies to tables not modified for 90 days)
- Storage cost is similar to Google Cloud Storage
- You can ingest JSON
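The WITH clause mentioned above works as in other standard-SQL dialects. A minimal sketch — the project, dataset, table, and column names here are hypothetical:

```sql
-- Define a CTE with WITH, then reference it like a table
WITH recent_orders AS (
  SELECT customer_id, order_value
  FROM `myproject.mydataset.orders`
  WHERE order_date >= '2019-01-01'
)
SELECT customer_id, SUM(order_value) AS total_value
FROM recent_orders
GROUP BY customer_id;
```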
Compatible Analytic Tools
Following are examples of tools with connectors to BigQuery:
- Data Studio
BQ is a fully managed, petabyte-scale analytics data warehouse. For workloads it does not suit, it gives way to:
- Bigtable (cheap, if it suits the access pattern)
- Spanner (expensive, many nodes)
BigQuery allows for nesting of data within the same field: nested records can be loaded from JSON and are stored in a RECORD-type column with a REPEATED mode
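A sketch of querying a REPEATED RECORD column with UNNEST — the table and its `line_items` column are hypothetical:

```sql
-- 'line_items' is assumed to be a REPEATED RECORD column;
-- UNNEST flattens it so each element becomes its own row
SELECT o.order_id, item.sku, item.qty
FROM `myproject.mydataset.orders` AS o,
  UNNEST(o.line_items) AS item
WHERE item.qty > 1;
```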
User Defined Functions
- Standard SQL UDFs are scalar
- Legacy SQL UDFs are tabular
For performance, prefer built-in functions first, then SQL-based UDFs, with JavaScript UDFs as a last resort.
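In standard SQL a scalar UDF can be written in SQL (or JavaScript); a minimal temporary SQL UDF sketch:

```sql
-- Temporary scalar SQL UDF, valid for the duration of the query
CREATE TEMP FUNCTION AddFourAndDivide(x INT64, y INT64) AS (
  (x + 4) / y
);
SELECT AddFourAndDivide(10, 2) AS result;  -- (10 + 4) / 2 = 7
```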
To ensure BQ performs:
- Don’t SELECT *
- Filter early and often
- Put the biggest table first in a join; avoid self-joins
- Low-cardinality GROUP BYs are faster
- Use approximate functions (e.g. APPROX_COUNT_DISTINCT) where possible
- Sort at the last step
- Use inbuilt functionality first, then SQL based UDFs finally JS
- Partitioning is good but introduces an overhead for users to understand
- Longer prefixes perform better when filtering with _TABLE_SUFFIX, because fewer tables need to be scanned
To analyse the performance of a query, use:
- Per-query explain plans (what did my query do?)
- Stackdriver (what is going on with all my resources in this project?)
A BigQuery slot is a unit of computational capacity required to execute SQL queries.
The following should be considered when estimating cost:
- Volume of data stored in tables
- Ingest rate
- Long-term storage discount on older data
- On Demand or Flat Rate
- On Demand is based on the amount of data processed
- The first 1 TB processed per month is free
- Opting in to run high compute queries
The following are free:
- Queries on Metadata
- Cached queries
- Queries with errors
Also, use the query validator to see an estimate of bytes processed (and hence cost) before running a query.
You manage costs by requesting a custom quota that specifies a limit on the amount of query data processed per day. You can control costs at the project-level or at the user level.
- Project-level custom quotas limit the total project usage.
- User-level custom quotas are separately applied to each user or service account
Renaming a column is not supported by the GCP Console, the classic BigQuery web UI, the command-line tool, or the API. However, you can ADD columns without recreating the table.
The query prefixes #legacySQL and #standardSQL can be used when writing SQL. They:
- Are NOT case-sensitive
- Must precede the query
- Must be separated from the query by a newline character
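For example, forcing a query to be parsed as standard SQL (the table here is a real public sample):

```sql
#standardSQL
SELECT word, word_count
FROM `bigquery-public-data.samples.shakespeare`
LIMIT 5;
```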
Standard vs. Legacy SQL
BQ has two SQL dialects. Legacy SQL qualifies table names with square brackets and a colon; standard SQL uses backticks and dots (backticks are required when the project name contains a dash):

Legacy SQL:

SELECT Product_code, Item_Qty, Item_Value
FROM [bigquery-public-data:samples.orders]
ORDER BY Item_Value DESC
LIMIT 10;

Standard SQL:

SELECT Product_code, Item_Qty, Item_Value
FROM `bigquery-public-data.samples.orders`
ORDER BY Item_Value DESC
LIMIT 10;
Other examples with different levels of qualifying:
SELECT * FROM Order;
SELECT * FROM dataset.Order;
SELECT * FROM project.dataset.Order;
BQ allows the use of wild cards when writing SQL. The following are examples in standard SQL:
| Pattern | Standard SQL fragment |
|---------|-----------------------|
| Streaming rows not yet partitioned | WHERE _PARTITIONTIME IS NULL |
| Date partition (timestamp) | WHERE _PARTITIONTIME = TIMESTAMP('2016-03-28') |
| Date partition (date) | WHERE _PARTITIONDATE = '2016-03-28' |
| All tables in a dataset | FROM `bigquery-public-data.noaa_gsod.*` |
| Tables matching a prefix | FROM `bigquery-public-data.noaa_gsod.gsod194*` |
With legacy SQL you need to do something like this:

SELECT name
FROM TABLE_DATE_RANGE([myproject-1234:mydata.people],
  TIMESTAMP('2014-03-25'), TIMESTAMP('2014-03-27'))
WHERE age >= 35
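The standard-SQL equivalent uses a wildcard table plus _TABLE_SUFFIX; this sketch assumes the same hypothetical dataset, with day-sharded tables named people20140325, people20140326, and so on:

```sql
-- Wildcard over date-sharded tables, restricted by _TABLE_SUFFIX
SELECT name
FROM `myproject-1234.mydata.people*`
WHERE _TABLE_SUFFIX BETWEEN '20140325' AND '20140327'
  AND age >= 35;
```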
Partitioned tables are possible and perform well, especially when using a date column for partitioning
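A sketch of querying an ingestion-time partitioned table — the table and column names are hypothetical:

```sql
-- Filtering on the pseudo-column restricts the scan to one partition
SELECT user_id, event
FROM `myproject.mydataset.events`
WHERE _PARTITIONDATE = '2019-06-01';
```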
Batch or Stream
BigQuery supports both batch and streaming ingestion, for example from Dataflow. Batch loading into BigQuery is free of charge; streaming inserts, on the other hand, are charged by size.
Schema Auto Detection
Schema auto-detection is available when you load data into BigQuery and when you query an external data source. BigQuery makes a best-effort attempt to infer the schema automatically for CSV and JSON files.
BigQuery supports UTF-8 encoding for both nested or repeated and flat data. BigQuery supports ISO-8859-1 encoding for flat data only for CSV files. By default, the BigQuery service expects all source data to be UTF-8 encoded. Explicitly specify the encoding when you import your data so that BigQuery can properly convert your data to UTF-8 during the import process.
An external data source (also known as a federated data source) is a data source that you can query directly even though the data is not stored in BigQuery. Instead of loading or streaming the data, you create a table that references the external data source. Take this approach when:
- You perform ETL operations on the data
- The data changes frequently
- The data is ingested periodically
Query results are not cached:
- When a destination table is specified, whether via the job configuration, the GCP Console, the classic web UI, the command-line tool, or the API
- If any of the referenced tables or logical views have changed since the results were previously cached
- When any of the tables referenced by the query have recently received streaming inserts
- If the query uses non-deterministic functions; for example, date and time functions such as CURRENT_TIMESTAMP()
- If you are querying multiple tables using a wildcard
- If the cached results have expired
- If the query runs against an external data source
If you are using the GCP Console or the classic BigQuery web UI, the result does not contain information about the number of processed bytes, and displays the word “cached”.
Typical cache lifetime is 24 hours, but the cached results are best-effort and may be invalidated sooner
| Role | Description | Scope |
|------|-------------|-------|
| Admin | Provides permissions to manage all resources within the project. Can manage all data within the project and can cancel jobs from other users running within the project. | Project |
| Data Owner | Read, update, and delete the dataset. Create, update, get, and delete the dataset's tables. When applied at the project or organization level, this role can also create new datasets. | Dataset |
| Data Editor | Read the dataset's metadata and list tables in the dataset. Create, update, get, and delete the dataset's tables. | Dataset |
| BigQuery User | Provides permissions to run jobs, including queries, within the project. Can enumerate their own jobs, cancel their own jobs, and enumerate datasets within a project. Additionally, allows the creation of new datasets within the project; the creator is granted the bigquery.dataOwner role for these new datasets. | Project |
| Job User | Can enumerate, run, and cancel their own jobs. | Project |
| Data Viewer | Read the dataset's metadata and list tables in the dataset. Read data and metadata from the dataset's tables. | Dataset |
| Metadata Viewer | List all datasets and read metadata for all datasets in the project. List all tables and views and read metadata for all tables and views in the project. | Project |