Create Table - xqrzd/kudu-client-net GitHub Wiki

This page assumes you are familiar with Kudu's schema design. It only shows the C# client usage; refer to Kudu's documentation for an explanation of column types, partitioning, etc. Note that in this client, columns are created nullable by default.

Add Columns

A Kudu table consists of one or more columns, each with a defined type. Columns that are not part of the primary key may be nullable. Every column needs a name and a data type, and can be further configured with an optional delegate. See Kudu's schema design for more about the options available (encoding, compression, etc.).

var builder = new TableBuilder("table_name")
    .AddColumn("key_1", KuduType.Int32, opt => opt.Key(true))
    .AddColumn("key_2", KuduType.String, opt => opt.Key(true))
    .AddColumn("column_1", KuduType.Float, opt => opt
        .DefaultValue(5.3f)
        .Encoding(EncodingType.BitShuffle)
        .Compression(CompressionType.Lz4))
    .AddColumn("column_2", KuduType.String, opt => opt
        .Nullable(false)
        .Comment("non-nullable string"));

await client.CreateTableAsync(builder);

Add a Decimal Column

Decimal columns require an extra piece of configuration: precision and scale. You can read more about these values in Kudu's decimal schema design.

  • Decimal32: values with precision of 9 or less
  • Decimal64: values with precision of 10 through 18
  • Decimal128: values with precision greater than 18

var builder = new TableBuilder("table_name")
    .AddColumn("key_1", KuduType.Int32, opt => opt.Key(true))
    .AddColumn("decimal_32", KuduType.Decimal32, opt => opt
        .DecimalAttributes(precision: 5, scale: 2)
        .DefaultValue(123.99m))
    .AddColumn("decimal_64", KuduType.Decimal64, opt => opt
        .DecimalAttributes(precision: 12, scale: 8)
        .DefaultValue(1234.12345678m))
    .AddColumn("decimal_128", KuduType.Decimal128, opt => opt
        .DecimalAttributes(precision: 24, scale: 10)
        .DefaultValue(12345678901234.0123456789m));
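The precision ranges listed above can be turned into a small lookup. The following is a minimal sketch only: DecimalTypeFor is a hypothetical helper that is not part of kudu-client-net, and it returns the type name as a string so that it runs without the client library. The upper limit of 38 digits is taken from the 38-digit maximum unscaled value Kudu allows.

```csharp
using System;

// Hypothetical helper (not part of kudu-client-net): returns the name of the
// smallest Kudu decimal type able to hold the requested precision.
static string DecimalTypeFor(int precision)
{
    // Assumption: Kudu decimals support a precision of 1 through 38 digits.
    if (precision < 1 || precision > 38)
        throw new ArgumentOutOfRangeException(nameof(precision), "Precision must be between 1 and 38.");

    if (precision <= 9) return "Decimal32";
    if (precision <= 18) return "Decimal64";
    return "Decimal128";
}

Console.WriteLine(DecimalTypeFor(5));   // Decimal32
Console.WriteLine(DecimalTypeFor(12));  // Decimal64
Console.WriteLine(DecimalTypeFor(24));  // Decimal128
```

The three calls use the same precisions as the columns in the example above.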

Note: Kudu's 128-bit decimal has more precision than .NET's 96-bit decimal. This isn't an issue when writing data, as a Kudu 128-bit decimal can store any .NET decimal (ignoring scale). However, when reading data that was written to Kudu by another source, the .NET decimal may not be able to hold the value, in which case an OverflowException is thrown. RowResult.GetRawFixed can be used to retrieve the raw value instead.

Platform   Max Unscaled Decimal
Kudu       99999999999999999999999999999999999999
.NET       79228162514264337593543950335
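The size gap between the two limits can be reproduced with the base class library alone. This sketch does not touch the Kudu client or RowResult.GetRawFixed; it only shows that the largest Kudu unscaled value does not fit in a .NET decimal, using System.Numerics.BigInteger as a stand-in for the raw 128-bit value.

```csharp
using System;
using System.Numerics;

// Largest unscaled value of a Kudu 128-bit decimal: 38 nines.
BigInteger kuduMax = BigInteger.Pow(10, 38) - 1;

// Largest value a .NET decimal can hold (96-bit unscaled integer).
BigInteger dotnetMax = new BigInteger(decimal.MaxValue);

Console.WriteLine(kuduMax);    // 99999999999999999999999999999999999999
Console.WriteLine(dotnetMax);  // 79228162514264337593543950335

bool overflowed = false;
try
{
    // The explicit BigInteger -> decimal conversion throws when out of range.
    _ = (decimal)kuduMax;
}
catch (OverflowException)
{
    overflowed = true;
}

Console.WriteLine(overflowed); // True
```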

Add a Varchar Column

Varchar columns require a length. Values with more characters than this limit are truncated. The length must be between 1 and 65535 and has no default. More info is available in Kudu's varchar schema design.

var builder = new TableBuilder("table_name")
    .AddColumn("key_1", KuduType.Int32, opt => opt.Key(true))
    .AddColumn("varchar", KuduType.Varchar, opt => opt
        .VarcharAttributes(length: 100));
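To make the length rule concrete, here is a rough sketch of what truncating to a character limit could look like. TruncateToLength is a hypothetical helper, not part of kudu-client-net, and it assumes the limit counts Unicode code points rather than UTF-16 code units or bytes. Check Kudu's varchar documentation for the exact rule before relying on it.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of kudu-client-net): keeps at most maxLength
// Unicode code points of the input. Assumes the varchar limit is counted in
// code points; this is an illustration, not the client's actual logic.
static string TruncateToLength(string value, int maxLength)
{
    if (maxLength < 1 || maxLength > 65535)
        throw new ArgumentOutOfRangeException(nameof(maxLength), "Length must be between 1 and 65535.");

    var builder = new StringBuilder();
    int count = 0;
    foreach (Rune rune in value.EnumerateRunes())
    {
        if (count == maxLength) break;
        builder.Append(rune.ToString());
        count++;
    }
    return builder.ToString();
}

Console.WriteLine(TruncateToLength("kudu-client-net", 4)); // kudu
```

Enumerating runes rather than chars avoids cutting a surrogate pair in half, which matters for characters outside the Basic Multilingual Plane.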

Partitioning

This document doesn't explain when to use different partition strategies. See Kudu's schema design for that. These examples come from that document. They all assume this table:

CREATE TABLE metrics (
    host STRING NOT NULL,
    metric STRING NOT NULL,
    time INT64 NOT NULL,
    value DOUBLE NOT NULL,
    PRIMARY KEY (host, metric, time)
);

var builder = new TableBuilder("metrics")
    .AddColumn("host", KuduType.String, opt => opt.Key(true))
    .AddColumn("metric", KuduType.String, opt => opt.Key(true))
    .AddColumn("time", KuduType.UnixtimeMicros, opt => opt.Key(true))
    .AddColumn("value", KuduType.Double, opt => opt.Nullable(false));

Range Partitioning

A natural way to partition the metrics table is to range partition on the time column. Let’s assume that we want to have a partition per year, and the table will hold data for 2014, 2015, and 2016. There are at least two ways that the table could be partitioned: with unbounded range partitions, or with bounded range partitions.

First, specify which column(s) should be included in the range partition, using SetRangePartitionColumns. Then, add individual ranges with AddRangePartition, or split existing ranges using AddSplitRow.

Example 1 (Unbounded)

var builder = new TableBuilder("metrics")
    ...
    .SetRangePartitionColumns("time")
    .AddSplitRow(row => row.SetDateTime("time", DateTime.Parse("2015-01-01")))
    .AddSplitRow(row => row.SetDateTime("time", DateTime.Parse("2016-01-01")));

The image above shows the two ways the metrics table can be range partitioned on the time column. In the first example (in blue), the default range partition bounds are used, with splits at 2015-01-01 and 2016-01-01. This results in three tablets: the first containing values before 2015, the second containing values in the year 2015, and the third containing values from 2016 onward.
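The way split rows carve up the key space can be sketched without the client. TabletIndex below is a hypothetical helper that assumes each split row is the inclusive lower bound of the tablet that follows it, so a row stamped exactly 2016-01-01 lands in the third tablet.

```csharp
using System;

// Split points from Example 1, in ascending order.
DateTime[] splits =
{
    new DateTime(2015, 1, 1),
    new DateTime(2016, 1, 1),
};

// Hypothetical helper (not part of kudu-client-net): zero-based index of the
// tablet that would hold the given time, assuming each split row starts a
// new tablet and belongs to it.
static int TabletIndex(DateTime[] sortedSplits, DateTime time)
{
    int index = 0;
    foreach (DateTime split in sortedSplits)
    {
        if (time >= split) index++;
    }
    return index;
}

Console.WriteLine(TabletIndex(splits, new DateTime(2014, 6, 1)));   // 0
Console.WriteLine(TabletIndex(splits, new DateTime(2015, 12, 31))); // 1
Console.WriteLine(TabletIndex(splits, new DateTime(2016, 1, 1)));   // 2
```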

Example 2 (Bounded)

var builder = new TableBuilder("metrics")
    ...
    .SetRangePartitionColumns("time")
    .AddRangePartition((lower, upper) =>
    {
        lower.SetDateTime("time", DateTime.Parse("2014-01-01"));
        upper.SetDateTime("time", DateTime.Parse("2017-01-01"));
    })
    .AddSplitRow(row => row.SetDateTime("time", DateTime.Parse("2015-01-01")))
    .AddSplitRow(row => row.SetDateTime("time", DateTime.Parse("2016-01-01")));

The second example (in green) uses a range partition bound of [(2014-01-01), (2017-01-01)], and splits at 2015-01-01 and 2016-01-01. The second example could have equivalently been expressed through range partition bounds of [(2014-01-01), (2015-01-01)], [(2015-01-01), (2016-01-01)], and [(2016-01-01), (2017-01-01)], with no splits. The first example has unbounded lower and upper range partitions, while the second example includes bounds.
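The equivalence between one bounded range with two splits and three adjacent ranges can be checked mechanically. ExpandRanges is a hypothetical helper, not part of kudu-client-net. It assumes the splits are sorted and lie strictly inside the bounds, and it prints the ranges as half-open intervals on the assumption that lower bounds are inclusive and upper bounds exclusive.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper (not part of kudu-client-net): expands one bounded
// range plus its split points into the equivalent list of adjacent ranges.
static List<(DateTime Lower, DateTime Upper)> ExpandRanges(
    DateTime lower, DateTime upper, IEnumerable<DateTime> sortedSplits)
{
    var result = new List<(DateTime Lower, DateTime Upper)>();
    DateTime current = lower;
    foreach (DateTime split in sortedSplits)
    {
        result.Add((current, split));
        current = split;
    }
    result.Add((current, upper));
    return result;
}

var expanded = ExpandRanges(
    new DateTime(2014, 1, 1),
    new DateTime(2017, 1, 1),
    new[] { new DateTime(2015, 1, 1), new DateTime(2016, 1, 1) });

foreach (var (lo, hi) in expanded)
{
    string from = lo.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
    string to = hi.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
    Console.WriteLine($"[{from}, {to})");
}
// [2014-01-01, 2015-01-01)
// [2015-01-01, 2016-01-01)
// [2016-01-01, 2017-01-01)
```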

Hash Partitioning

Another way of partitioning the metrics table is to hash partition on the host and metric columns.

var builder = new TableBuilder("metrics")
    ...
    .AddHashPartitions(4, "host", "metric");

In the example above, the metrics table is hash partitioned on the host and metric columns into four buckets.
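Conceptually, hash partitioning maps the hashed column values of each row to one of the buckets. The sketch below illustrates only that idea. It uses FNV-1a as a stand-in hash, which is not the hash function Kudu uses, so the bucket numbers it prints will not match where Kudu actually places a row. Bucket is a hypothetical helper.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of kudu-client-net): assigns a row to a
// bucket from the values of its hashed columns. FNV-1a is used purely for
// illustration; Kudu applies its own hash to the encoded column values.
static int Bucket(int numBuckets, params string[] columnValues)
{
    unchecked
    {
        uint hash = 2166136261; // FNV-1a offset basis
        foreach (string value in columnValues)
        {
            foreach (byte b in Encoding.UTF8.GetBytes(value))
            {
                hash ^= b;
                hash *= 16777619; // FNV-1a prime
            }

            // Mix in a separator so ("ab", "c") and ("a", "bc") differ.
            hash ^= 0xFF;
            hash *= 16777619;
        }
        return (int)(hash % (uint)numBuckets);
    }
}

// Same host and metric always map to the same bucket in the range 0..3.
Console.WriteLine(Bucket(4, "host-1", "cpu"));
Console.WriteLine(Bucket(4, "host-2", "cpu"));
```

Because the bucket depends only on host and metric, all time values for one host/metric pair end up in the same bucket.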

Hash and Range Partitioning

Hash partitioning is good at maximizing write throughput, while range partitioning avoids issues of unbounded tablet growth. Both strategies can take advantage of partition pruning to optimize scans in different scenarios. Using multilevel partitioning, it is possible to combine the two strategies in order to gain the benefits of both, while minimizing the drawbacks of each.

var builder = new TableBuilder("metrics")
    ...
    .AddHashPartitions(4, "host", "metric")
    .SetRangePartitionColumns("time")
    .AddRangePartition((lower, upper) =>
    {
        lower.SetDateTime("time", DateTime.Parse("2014-01-01"));
        upper.SetDateTime("time", DateTime.Parse("2017-01-01"));
    })
    .AddSplitRow(row => row.SetDateTime("time", DateTime.Parse("2015-01-01")))
    .AddSplitRow(row => row.SetDateTime("time", DateTime.Parse("2016-01-01")));

In the example above, range partitioning on the time column is combined with hash partitioning on the host and metric columns.

Hash and Hash Partitioning

Kudu can support any number of hash partitioning levels in the same table, as long as the levels have no hashed columns in common.

var builder = new TableBuilder("metrics")
    ...
    .AddHashPartitions(4, "host")
    .AddHashPartitions(3, "metric");

In the example above, the table is hash partitioned on host into 4 buckets, and hash partitioned on metric into 3 buckets, resulting in 12 tablets.
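The tablet count is the product of the bucket counts of every hash level and the number of range partitions. A minimal sketch of that arithmetic follows. TabletCount is a hypothetical helper, and a table without range partitioning is treated as having a single range.

```csharp
using System;

// Hypothetical helper (not part of kudu-client-net): total tablets for a
// table with the given number of range partitions and hash levels.
static int TabletCount(int rangePartitions, params int[] hashBucketCounts)
{
    int total = rangePartitions;
    foreach (int buckets in hashBucketCounts)
    {
        total *= buckets;
    }
    return total;
}

// Hash (4 on host) and hash (3 on metric), no range partitioning.
Console.WriteLine(TabletCount(1, 4, 3)); // 12

// Hash (4 on host, metric) combined with three yearly range partitions.
Console.WriteLine(TabletCount(3, 4));    // 12
```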

Partitioning Examples from Apache Impala

Below are some examples of creating tables with Apache Impala, taken from Impala's documentation on using Impala with Kudu, each followed by the equivalent C# code.

Partition by Range

CREATE TABLE cust_behavior (
  id BIGINT PRIMARY KEY
)
PARTITION BY RANGE (id)
(
  PARTITION VALUES < 1439560049342,
  PARTITION 1439560049342 <= VALUES < 1439566253755,
  PARTITION 1439566253755 <= VALUES
)
STORED AS KUDU;

C#

var builder = new TableBuilder("cust_behavior")
    .AddColumn("id", KuduType.Int64, opt => opt.Key(true))
    .SetRangePartitionColumns("id")
    .AddRangePartition((lower, upper) =>
    {
        upper.SetInt64("id", 1439560049342);
    })
    .AddRangePartition((lower, upper) =>
    {
        lower.SetInt64("id", 1439560049342);
        upper.SetInt64("id", 1439566253755);
    })
    .AddRangePartition((lower, upper) =>
    {
        lower.SetInt64("id", 1439566253755);
    });

Partition by Single Value Range

CREATE TABLE customers (
  state STRING,
  name STRING,
  purchase_count INT,
  PRIMARY KEY (state, name)
)
PARTITION BY RANGE (state)
(
  PARTITION VALUE = 'al',
  PARTITION VALUE = 'ak',
  PARTITION VALUE = 'ar',
  PARTITION VALUE = 'wv',
  PARTITION VALUE = 'wy'
)
STORED AS KUDU;

C#

var builder = new TableBuilder("customers")
    .AddColumn("state", KuduType.String, opt => opt.Key(true))
    .AddColumn("name", KuduType.String, opt => opt.Key(true))
    .AddColumn("purchase_count", KuduType.Int32)
    .SetRangePartitionColumns("state")
    .AddRangePartition(row => row.SetString("state", "al"))
    .AddRangePartition(row => row.SetString("state", "ak"))
    .AddRangePartition(row => row.SetString("state", "ar"))
    .AddRangePartition(row => row.SetString("state", "wv"))
    .AddRangePartition(row => row.SetString("state", "wy"));

Partition by Hash and Range

CREATE TABLE cust_behavior (
  id BIGINT,
  sku STRING,
  PRIMARY KEY (id, sku)
)
PARTITION BY HASH (id) PARTITIONS 4,
RANGE (sku)
(
  PARTITION VALUES < 'g',
  PARTITION 'g' <= VALUES < 'o',
  PARTITION 'o' <= VALUES < 'u',
  PARTITION 'u' <= VALUES
)
STORED AS KUDU;

C#

var builder = new TableBuilder("cust_behavior")
    .AddColumn("id", KuduType.Int64, opt => opt.Key(true))
    .AddColumn("sku", KuduType.String, opt => opt.Key(true))
    .AddHashPartitions(4, "id")
    .SetRangePartitionColumns("sku")
    .AddRangePartition((lower, upper) => upper.SetString("sku", "g"))
    .AddRangePartition((lower, upper) =>
    {
        lower.SetString("sku", "g");
        upper.SetString("sku", "o");
    })
    .AddRangePartition((lower, upper) =>
    {
        lower.SetString("sku", "o");
        upper.SetString("sku", "u");
    })
    .AddRangePartition((lower, upper) => lower.SetString("sku", "u"));