Staging schema

cdc-sink automatically creates a number of staging and metadata tables in a database named _cdc_sink within the staging CockroachDB cluster. The database itself is not created automatically; create it beforehand with CREATE DATABASE _cdc_sink. It is also recommended that you run ALTER DATABASE _cdc_sink CONFIGURE ZONE USING gc.ttlseconds=300, since cdc-sink is essentially a queue-like workload that produces a relatively large number of MVCC tombstones, and a short GC TTL allows them to be garbage-collected sooner.
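The setup described above can be sketched as the following statements, run against the staging cluster before starting cdc-sink. The IF NOT EXISTS clause and the final SHOW statement are additions for convenience and are not required by cdc-sink.

```sql
-- Create the staging database; cdc-sink creates its tables inside it.
CREATE DATABASE IF NOT EXISTS _cdc_sink;

-- Shorten the GC TTL to five minutes so MVCC tombstones are collected sooner.
ALTER DATABASE _cdc_sink CONFIGURE ZONE USING gc.ttlseconds = 300;

-- Optional: confirm that the zone configuration took effect.
SHOW ZONE CONFIGURATION FROM DATABASE _cdc_sink;
```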

Mutations

For each target table, cdc-sink automatically creates a staging table that acts as temporary storage for unapplied mutations. Keys and indexes have been omitted here for clarity. See the internal/staging/stage package.

-- These tables are created automatically by cdc-sink and are documented
-- here for operator convenience.
CREATE TABLE _targetDB_targetSchema_targetTable
(
    nanos   INT    NOT NULL, -- Wall-time nanoseconds of the changefeed updated timestamp.
    logical INT    NOT NULL, -- Logical counter of the changefeed updated timestamp.
    key     STRING NOT NULL, -- A JSON representation of the mutation's primary-key columns.
    mut     JSONB  NOT NULL, -- The complete JSON blob of the mutation.
    before  BYTES      NULL, -- Supports conflict resolution.
    applied BOOL   NOT NULL DEFAULT false, -- Improves idempotency.
    lease   TIMESTAMPTZ NULL -- Supports best-effort modes.
);
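As a rough sketch, an operator could check how much work is still queued for one target table with a query like the one below. The table name is the placeholder used in the definition above and must be replaced with the actual staging table name; the public schema is an assumption.

```sql
-- Count staged mutations that have not been applied yet and find the
-- oldest one. Substitute the real staging table name for the placeholder.
SELECT count(*)   AS unapplied,
       min(nanos) AS oldest_nanos
FROM _cdc_sink.public._targetDB_targetSchema_targetTable
WHERE NOT applied;
```

Since keys and indexes are omitted from the definition above, this query may scan the whole staging table; treat it as an occasional diagnostic rather than something to poll.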

Resolved timestamps

Incoming resolved timestamps are written to a table that effectively forms a queue. See internal/stage/checkpoint for additional details.

CREATE TABLE public.resolved_timestamps
(
    target_schema     STRING NOT NULL, -- Name of a schema within the target.
    source_nanos      INT8   NOT NULL, -- Derived from the changefeed resolved timestamp.
    source_logical    INT8   NOT NULL, -- Derived from the changefeed resolved timestamp.
    target_applied_at TIMESTAMP NULL   -- Set once all mutations with lesser timestamps have been applied.
);
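Because target_applied_at stays NULL until a resolved timestamp has been fully applied, the pending part of the queue can be inspected with a query along these lines. This is a diagnostic sketch based only on the columns listed above, not a query that cdc-sink itself is known to run.

```sql
-- List resolved timestamps that have not been fully applied yet,
-- oldest first within each target schema.
SELECT target_schema, source_nanos, source_logical
FROM _cdc_sink.public.resolved_timestamps
WHERE target_applied_at IS NULL
ORDER BY target_schema, source_nanos, source_logical;
```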

Auxiliary tables

Several other auxiliary tables are used for internal cdc-sink coordination:

  • leases ensures that only a single instance of cdc-sink will resolve timestamps for any particular target schema.
  • memo is a catch-all for managing transient state.