Bigtable - bobbae/gcp GitHub Wiki

Cloud Bigtable is Google's NoSQL Big Data database service. It's the same database that powers many core Google services, including Search, Analytics, Maps, and Gmail.

Overview

https://cloud.google.com/bigtable/docs/overview

Scalable Key-value store

Bigtable is a compressed, high performance, proprietary data storage system built on Google File System, Chubby Lock Service, SSTable (log-structured storage like LevelDB) and a few other Google technologies. Bigtable also underlies Google Cloud Datastore, which is available as a part of the Google Cloud Platform.

https://cloud.google.com/bigtable/docs

How Google scales ad personalization with Bigtable

https://cloud.google.com/blog/products/data-analytics/how-google-scales-ad-personalization-with-bigtable

History and Usage

Bigtable development began in 2004 and is now used by a number of Google applications, such as web indexing, MapReduce, which is often used for generating and modifying data stored in Bigtable, Google Maps, Google Book Search, "My Search History", Google Earth, Blogger.com, Google Code hosting, YouTube, and Gmail. Google's reasons for developing its own database include scalability and better control of performance characteristics.

https://www.youtube.com/watch?v=Lq9uDOM4whI

Bigtable basics

https://medium.com/google-cloud/getting-started-with-bigtable-on-gcp-adfb896e0b26

Migrating from HBase to Big table

https://medium.com/box-tech-blog/zero-downtime-migration-from-hbase-to-bigtable-4f00f835755f

https://cloud.google.com/blog/products/databases/create-bigtable-tables-from-existing-apache-hbase-tables

Bigtable code samples

https://cloud.google.com/bigtable/docs/samples

Bigtable Quickstarts

https://cloud.google.com/bigtable/docs/quickstarts

Import CSV data

https://medium.com/google-cloud/easy-csv-importing-into-cloud-bigtable-ed3f62139b89

https://cloud.google.com/blog/products/databases/bigtable-command-line-now-supports-csv-file-importing

HBase tutorial

https://www.guru99.com/hbase-tutorials.html

https://cloud.google.com/bigtable/docs/quickstart-hbase

Architecture

Wide Column Storage

Bigtable is one of the prototypical examples of a wide column store. It maps two arbitrary string values (row key and column key) and timestamp (hence three-dimensional mapping) into an associated arbitrary byte array. It is not a relational database and can be better defined as a sparse, distributed multi-dimensional sorted map.

Scale

Bigtable is designed to scale into the petabyte range across hundreds or thousands of machines, and to make it easy to add more machines to the system and automatically start taking advantage of those resources without any reconfiguration.

Autoscaling

https://cloud.google.com/blog/products/databases/cloud-bigtable-now-even-easier-to-manage-with-autoscaling

Storing a copy of the Web

For example, Google's copy of the web can be stored in a bigtable where the row key is a domain-reversed URL, and columns describe various properties of a web page, with one particular column holding the page itself. The page column can have several timestamped versions describing different copies of the web page timestamped by when they were fetched. Each cell of a bigtable can have zero or more timestamped versions of the data. Another function of the timestamp is to allow for both versioning and garbage collection of expired data.

Sparsely populated

Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, enabling you to store terabytes or even petabytes of data. A single value in each row is indexed; this value is known as the row key. Bigtable is ideal for storing very large amounts of single-keyed data with very low latency. It supports high read and write throughput at low latency, and it is an ideal data source for MapReduce operations.

Client API

Bigtable is exposed to applications through multiple client libraries, including a supported extension to the Apache HBase library for Java. As a result, it integrates with the existing Apache ecosystem of open-source Big Data software.

Backup

Bigtable backups let you save a copy of a table's schema and data, then restore from the backup to a new table at a later time.

Backups can help you recover from application-level data corruption or from operator errors such as accidentally deleting a table. You can restore from a backup to a new table in either the instance the backup was created in or a different instance.

Backups are not intended for protection against regional or zonal failures. Use replication if you need the ability to fail over to different regions or zones.

https://cloud.google.com/bigtable/docs/backups

HBase

Bigtable's powerful back-end servers offer several key advantages over a self-managed HBase installation.

Incredible scalability

Bigtable scales in direct proportion to the number of machines in your cluster. A self-managed HBase installation has a design bottleneck that limits the performance after a certain threshold is reached. Bigtable does not have this bottleneck, so you can scale your cluster up to handle more reads and writes.

Simple administration

Bigtable handles upgrades and restarts transparently, and it automatically maintains high data durability. To replicate your data, add a second cluster to your instance, and replication starts automatically. No more managing replicas or regions; just design your table schemas, and Bigtable will handle the rest for you.

Instances, clusters, nodes

To use Cloud Bigtable, you create instances, which contain up to 4 clusters that your applications can connect to. Each cluster contains nodes, the compute units that manage your data and perform maintenance tasks.

A Bigtable instance is a container for your data. Instances have one or more clusters, located in different zones. Each cluster has at least 1 node.

A table belongs to an instance, not to a cluster or node. If you have an instance with more than one cluster, you are using replication. This means you can't assign a table to an individual cluster or create unique garbage collection policies for each cluster in an instance. You also can't make each cluster store a different set of data in the same table.

Storage Types

When you create an instance, you must choose whether the instance's clusters will store data on solid-state drives (SSD) or hard disk drives (HDD).

Choosing between SSD and HDD

https://cloud.google.com/bigtable/docs/choosing-ssd-hdd

Application Profiles

An app profile specifies the settings that Bigtable uses to handle an instance's incoming requests.

https://cloud.google.com/bigtable/docs/app-profiles

Replication

Replication for Cloud Bigtable enables you to increase the availability and durability of your data by copying it across multiple regions or multiple zones within the same region. You can also isolate workloads by routing different types of requests to different clusters.

Bigtable supports up to 4 replicated clusters located in Google Cloud zones where Bigtable is available. An instance's clusters must each be in unique zones. You can create an additional cluster in any zone where Bigtable is available. Placing clusters in different zones or regions enables you to access your data even if one Google Cloud zone or region becomes unavailable.

https://cloud.google.com/bigtable/docs/replication-overview

Cluster resizing without downtime

You can increase the size of a Bigtable cluster for a few hours to handle a large load, then reduce the cluster's size again—all without any downtime. After you change a cluster's size, it typically takes just a few minutes under load for Bigtable to balance performance across all of the nodes in your cluster.

What it's good for

Bigtable is ideal for applications that need very high throughput and scalability for key/value data, where each value is typically no larger than 10 MB. Bigtable also excels as a storage engine for batch MapReduce operations, stream processing/analytics, and machine-learning applications.

You can use Bigtable to store and query all of the following types of data:

Time-series data, such as CPU and memory usage over time for multiple servers.
Marketing data, such as purchase histories and customer preferences.
Financial data, such as transaction histories, stock prices, and currency exchange rates.
Internet of Things data, such as usage reports from energy meters and home appliances.
Graph data, such as information about how users are connected to one another.

Bigtable storage model

Tables

Bigtable stores data in massively scalable tables, each of which is a sorted key/value map. The table is composed of rows, each of which typically describes a single entity, and columns, which contain individual values for each row. Each row is indexed by a single row key, and columns that are related to one another are typically grouped together into a column family. Each column is identified by a combination of the column family and a column qualifier, which is a unique name within the column family.

Tablets

Tables are split into multiple tablets – segments of the table are split at certain row keys so that each tablet is a few hundred megabytes or a few gigabytes in size.

When Table size threaten to grow beyond a specified limit, the tablets may be compressed using the algorithm BMDiff and the Zippy compression algorithm publicly known and open-sourced as Snappy, which is a less space-optimal variation of LZ77 but more efficient in terms of computing time.

Hot Tablets and hotspots

https://cloud.google.com/blog/products/databases/hotspots-and-performance-debugging-in-cloud-bigtable

META Tablets

The locations in the GFS of tablets are recorded as database entries in multiple special tablets, which are called "META1" tablets. META1 tablets are found by querying the single "META0" tablet, which typically resides on a server of its own since it is often queried by clients as to the location of the "META1" tablet which itself has the answer to the question of where the actual data is located. Like GFS's master server, the META0 server is not generally a bottleneck since the processor time and bandwidth necessary to discover and transmit META1 locations is minimal and clients aggressively cache locations to minimize queries.

Schema design

Designing a Bigtable schema is different than designing a schema for a relational database. In Bigtable, a schema is a blueprint or model of a table, including the structure of the following table components:

Row keys
Column families, including their garbage-collection policies
Columns

Schema design general concepts

The following general concepts apply to Bigtable schema design:

Bigtable is a key/value store, not a relational store. It does not support joins, and transactions are supported only within a single row.
Each table has only one index, the row key. There are no secondary indices. Each row key must be unique.
Rows are sorted lexicographically by row key, from the lowest to the highest byte string. Row keys are sorted in big-endian byte order (sometimes called network byte order), the binary equivalent of alphabetical order.
Column families are not stored in any specific order.
Columns are grouped by column family and sorted in lexicographic order within the column family. For example, in a column family called SysMonitor with column qualifiers of ProcessName, User, %CPU, ID, Memory, DiskRead, and Priority, Bigtable stores the columns in this order: %CPU, DiskRead, ID, Memory, Priority, ProcessName, User.
The intersection of a row and column can contain multiple timestamped cells. Each cell contains a unique, timestamped version of the data for that row and column.
All operations are atomic at the row level. This means that an operation affects either an entire row or none of the row.
Ideally, both reads and writes should be distributed evenly across the row space of a table.
Bigtable tables are sparse. A column doesn't take up any space in a row that doesn't use the column.

Schema Best Practices

https://cloud.google.com/bigtable/docs/schema-design#best-practices

Schema for time series data

A time series is a collection of data that consists of measurements and the times when the measurements are recorded.

There are different schema design patterns for storing time series data in Cloud Bigtable.

key salting

https://cloud.google.com/blog/products/databases/cloud-bigtable-schema-optimization-key-salting/

Key Visualizer

Key Visualizer is a tool that helps you analyze your Bigtable usage patterns. It generates visual reports for your tables that break down your usage based on the row keys that you access.

Getting started with Key Visualizer

Use Key Visualizer to check for performance issues that are related to your Cloud Bigtable usage patterns.

Key Visualizer is available for tables that contain at least 30 GB of data. It can take up to an hour after a table reaches that size for scans to be available.

https://cloud.google.com/bigtable/docs/keyvis-getting-started