
Move data

Introduction

Benchmark

Follow COPY progress : pg_stat_progress_copy Doc

UNLOGGED or TEMP

Changing data, not moving data

Sometimes you need to:

  • modify existing data, eg. reset ToS (CGU) acceptance in the users table;
  • based on existing data, eg. all users whose last logon date is more than 2 years old;
  • on a large dataset, eg. 2 million users;
  • in a one-shot fashion.

This is usually called "data correction", or "historic takeover".

What I hear from time to time is

  • extract a CSV file identifying which data should be modified
  • load this file into a working table;
  • modify existing data in a loop over fixed-size chunks (eg. 10 000 users), driven by the working table.

Because:

A CSV file is good for performance

Let me try to reformulate what "performance" means here, and which words to use.

We need to focus on speed, as we handle a big data set, but:

  • the CSV format is not fast;
  • file usage is not fast (writing to and reading from the file system is notoriously slow);
  • batching in small chunks is not fast.

It is actually faster to delete/create/update data in one operation in a database. It will use more resources than doing it in several operations, and can therefore degrade the service for other operations, but it is quicker.

Modifying all data in one operation can be called batching: this is a single batch.

But we may also call "batching" modifying one row at a time, which is the minimum batch size. This will be slower than modifying all rows at once, but may not use as many resources if we use throttling, eg. if each batch is executed serially with some inactivity time between batches. This way, we lessen the impact on the service level. The problem is that "batching" usually means "processing more than one row", but processing a few rows and processing all rows are not the same thing.

Worst, "batch" is also used to mean periodic automated processes (aka scheduling, or scripts), which is not what we were taking about in the first place : we need data to be modified only once.

Here, what we need is a snapshot of the data to be modified, to pin down the values, in order to modify each row once and no more. We can do that by extracting data to CSV and reloading it, sure, but we can also:

  • feed working tables bypassing CSV, using CREATE TABLE AS SELECT or INSERT INTO SELECT;
  • start a transaction to pin down the state and use a cursor, or a subquery in UPDATE (see the sketch below).
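For instance, here is a minimal sketch of the working-table approach, assuming hypothetical users(id, cgu_accepted, last_logon_date) columns (adapt the names to your schema):

BEGIN;

-- Pin down the rows to modify, bypassing any CSV file (CTAS)
CREATE TABLE users_to_reset AS
  SELECT id
  FROM users
  WHERE last_logon_date < now() - INTERVAL '2 years';

-- Drive the update from the working table
UPDATE users
SET cgu_accepted = FALSE
WHERE id IN (SELECT id FROM users_to_reset);

COMMIT;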

We also need to be sure of what happened, in case we modified the wrong data. We tend to think of a CSV file as the only way, stored in our ticketing tool or (worse) on our laptop, but:

  • we can create an archive table, which can be moved to cold storage;
  • we can extract a dump (see the sketch below).

Either way, this would prevent leaking sensitive data.
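A sketch of the archive-table option, reusing the hypothetical working table from the previous example (names are illustrative):

-- Keep a copy of the values before modifying them; this table can later be
-- moved to cold storage, or dropped once the correction is validated
CREATE TABLE users_cgu_archive AS
  SELECT u.id, u.cgu_accepted, now() AS archived_at
  FROM users u
  WHERE u.id IN (SELECT id FROM users_to_reset);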

How to move ?

Postgresql server structure

From doc

postgres is the PostgreSQL database server. In order for a client application to access a database it connects (over a network or locally) to a running postgres instance. The postgres instance then starts a separate server process to handle the connection.

One postgres instance always manages the data of exactly one database cluster. A database cluster is a collection of databases that is stored at a common file system location (the “data area”). More than one postgres instance can run on a system at one time, so long as they use different data areas and different communication ports.

Let's review this (you can check most of it from a psql session, see the sketch after this list):

  • on a host, which can be physical (a laptop), virtual (a virtual machine) or a container (docker);
  • on which the postgresql binaries are installed;
  • a command launches an OS process from the postgresql server binary;
  • the process name is postgres and it is called the postgresql server in the documentation (which is not the host itself);
  • it is also called a postgresql (server) instance in the documentation, as you can run several postgresql servers;
  • this process can access a data collection, called a database cluster;
  • this database cluster can be physically located outside the host, on a storage device such as a NAS;
  • the server starts listening on a port;
  • each client connection then opens one database from this cluster.
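A quick sketch of how to inspect this from psql (output obviously depends on your installation):

SELECT current_database();  -- the database this connection opened
SHOW port;                  -- the port the server listens on
SHOW data_directory;        -- the file system location of the database cluster

From psql, \l lists all the databases of the cluster this server manages.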

Here are the constraints:

  • a postgresql server cannot be run on several hosts (unlike Oracle);
  • a client connection cannot open more than one database at once;
  • a database cluster cannot be opened by several postgresql servers at once.

A database (in a cluster) is a collection of schemas. A schema is a collection of tables. A table is a collection of rows following the table's DDL.

Each database is created with a default schema named public. You can access a table using $SCHEMA_NAME.$TABLE_NAME, or skip $SCHEMA_NAME by using the search path (see the sketch below).
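A minimal sketch, assuming a schema named app containing a users table (names are illustrative):

-- Fully qualified access
SELECT count(*) FROM app.users;

-- Or put the schema on the search path and skip the qualification
SET search_path TO app, public;
SELECT count(*) FROM users;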

While multiple databases can be created within a single cluster, it is advised to consider carefully whether the benefits outweigh the risks and limitations

Doc

Here is the simplest case:

  • a laptop host contains the postgresql binaries and the database cluster;
  • the database cluster contains one database;
  • the database contains the default public schema;
  • the public schema contains a single table;
  • the command starts the postgresql binary on the cluster, opening the single database;
  • the server listens on port 5432;
  • the client on the same host connects on port 5432 and queries the table without using a schema.

Here is the most complex case:

  • a virtual host contains the postgresql binaries;
  • a NAS contains several database clusters;
  • each database cluster contains several databases;
  • each database contains several schemas;
  • each schema contains several tables;
  • a command starts the postgresql binary on one cluster, opening one database;
  • this server listens on port 5000;
  • another command starts the postgresql binary on another cluster, opening another database;
  • that server listens on port 6000;
  • a client on a different host connects on port 5000 to the opened database, and queries a table using a schema;
  • another client on yet another host connects on port 6000 to the opened database, and queries another table using another schema.

You have:

  • three hosts: one server and two clients;
  • two non-local clusters;
  • two postgresql servers;
  • two opened databases with several schemas;
  • two connections from clients to servers.

A cluster

backup PG files

Old-school. Here, the data is not limited to table data: it also includes indexes, roles and code, and covers all of the cluster's databases.

Steps:

  • in pg_cli, get directory SHOW data_directory;
  • stop pg sudo systemctl stop postgresql
  • check it's stopped sudo systemctl status postgresql, should give (..) Stopped PostgreSQL RDBMS.
  • backup the directory
  • start pg sudo systemctl start postgresql
  • check it's started sudo systemctl status postgresql, should give (..) Started PostgreSQL RDBMS.

pg_dumpall

On each and every database of the cluster

Exports everything

Steps:

  • export pg_dumpall -h HOST_NAME -p PORT_NUMBER -U USER_NAME > BACKUP_FILE_NAME
  • remove cluster
  • import psql -f BACKUP_FILE_NAME postgres

A schema

Key points:

  • on a single database
  • exports: all data (table data + DDL: table structure, index definitions);
  • does NOT export: the database container, users, roles, index data;
  • produces SQL queries, so it is fully compatible across versions/architectures.

Steps:

  • export pg_dump -h HOST_NAME -p PORT_NUMBER -U USER_NAME DATABASE_NAME > BACKUP_FILE_NAME
  • remove database
  • create database
  • import psql -h HOST_NAME -p PORT_NUMBER DATABASE_NAME < BACKUP_FILE_NAME
  • collect statistics (see the sketch below)
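A sketch of the surrounding SQL steps, run from psql (the export and import themselves are the shell commands in the list above; my_database is a placeholder):

-- remove and recreate the database, connected to another database (eg. postgres)
DROP DATABASE my_database;
CREATE DATABASE my_database;

-- after the import, connected to the restored database, refresh planner statistics
ANALYZE;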

Several tables

pg_restore

Dump content is in a binary (compressed) format, so to search it:

  • object names only: pg_restore -l <FILE-NAME>
  • search for an instruction: pg_restore <FILE-NAME> -f - | grep <INSTRUCTION>
  • extract everything: pg_restore <BINARY_DUMP_FILENAME>.pgsql -f - > <HUMAN_READABLE_DUMP_FILENAME>.sql

Import only the specified objects: pg_restore <DUMP-FILE-NAME> --use-list <OBJECT-LIST-FILE-NAME>

A table

To copy all data from a table to another in the same database, use SQL commands (see the sketch after this list):

  • INSERT INTO $TABLE (SELECT ...);
  • CREATE TABLE $TABLE AS SELECT ..., aka CTAS.
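A minimal sketch of both options, assuming a users table (target names are made up):

-- Copy into an existing table
INSERT INTO users_copy
  SELECT * FROM users;

-- Create the target table and copy the data in one statement (CTAS)
CREATE TABLE users_backup AS
  SELECT * FROM users;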

To copy all data from a table to another in a different database/host, use :

  • dump
  • COPY
  • \COPY
  • pg_bulkload

Benchmark

Some rows of a table (data chunks)

Overview

Do you have shell access to remote server ?

  • yes : use COPY
  • no :
    • have psql access : \COPY
    • no psql access, only jdbc:

Import to add if data is in a database or a file.

Tools

pg_dump / pg_restore

pg_dump is a client program that can also connect to older server versions. Dumps are upward compatible: a dump created with pg_dump version N can be restored with any pg_restore of version N or later.

To understand which parameters are compatible, check the doc, then the source.

Formats

Dumps can be output in script or archive file formats.

Script dumps are plain-text files containing the SQL commands required to reconstruct the database to the state it was in at the time it was saved. To restore from such a script, feed it to psql. Script files can be used to reconstruct the database even on other machines and other architectures; with some modifications, even on other SQL database products.

So, to prevent compatibility issues, use plain format (script)

pg_dump --format=plain --file=$FILE $DATABASE_NAME
psql --file=$FILE $DATABASE_NAME

The alternative archive file formats must be used with pg_restore to rebuild the database. They allow pg_restore to be selective about what is restored, or even to reorder the items prior to being restored. The archive file formats are designed to be portable across architectures.

pg_dump --format=custom --file=$FILE $DATABASE_NAME
pg_restore --dbname=$DATABASE_NAME $FILE

Errors you may run into:

Usual scenarios

Dump only data of a single table

This will extract the data in plain format, then load it into an existing table. (Parallel jobs are not available here: pg_dump only supports --jobs with the directory format.)

Extract

pg_dump \
    --host=$HOST --dbname=$DATABASE_NAME --username=$USER_NAME \
    --format=plain \
    --encoding="UTF-8" \
    --file=$DUMP_FILE_PATH \
    --data-only \
    --table=$TABLE_NAME --verbose

Transfer file.

Load data in existing table

psql \
    --host=$HOST --dbname=$DATABASE_NAME --username=$USER_NAME \
    --file=$DUMP_FILE_PATH

Dump structure and data of a single database

This will extract the data as compressed files in a single directory, then recreate it in another database. It will use $CONNEXION_COUNT connections to speed up both the extract and the load operations.

Extract

pg_dump \
    --host=$HOST --dbname=$DATABASE_NAME --username=$USER_NAME \
    --jobs=$CONNEXION_COUNT \
    --format=directory \
    --encoding="UTF-8" \
    --file=$DUMP_FILE_PATH --verbose

Transfer directory.

Create database

pg_restore \
    --host=$HOST --dbname=$DATABASE_NAME --username=$USER_NAME \
    --jobs=$CONNEXION_COUNT \
    $DUMP_FILE_PATH --verbose

COPY

Useful if :

  • you want to use the server filesystem;
  • and you can access the server filesystem.

slash-COPY

Useful if :

  • you want to use the client filesystem;
  • or you can't access the server filesystem.

Official documentation contains great tips about performance.

Temporary file

No filter
// Connect remotely to source database using psql
\COPY users TO 'users.csv' CSV HEADER

// Connect remotely to target database using psql
\COPY users FROM 'users.csv' WITH CSV HEADER;
With filter

Use a temporary file and a unix redirection to filter the data to be imported:

psql \
    -h localhost \
    -d your_primary_db \
    -U postgres \
    -c "\copy users (id, email, first_name, last_name) from STDIN with delimiter as ','" \
    < /tmp/users.csv

Using streams (buffers)

No filter
Syntax

From a remote database to another remote database

// SOURCE
psql \
    --username=source_user_name \
    --host=source_host \
    --dbname=source_database_name \
    --command="\\copy users to stdout" | \
// TARGET
psql \
    --username=target_user_name \
    --host=target_host \
    --dbname=target_database_name \
    --command="\\copy users from stdin"
Sample

I.e. on a localhost server, from the source database to the target database, on the foo table.

Create tables and data

CREATE DATABASE source;
\connect source
CREATE TABLE foo (ID INTEGER PRIMARY KEY);
INSERT INTO foo VALUES (1);
INSERT INTO foo VALUES (2);

CREATE DATABASE target;
\connect target
CREATE TABLE foo (ID INTEGER PRIMARY KEY);
INSERT INTO foo VALUES (0);
INSERT INTO foo VALUES (1);

Copy

psql \
    -U postgres \
    -h localhost \
    -d source \
    -c "\\copy foo to stdout" | \
psql -U postgres \
    -h localhost \
    -d target \
    -c "\\copy foo from stdin"

Note: \COPY does NOT:

  • create table structure
  • disable constraints (especially PK)
  • ROLLBACK on error

So, in the example above, you get the following error message and no row is imported:

ERROR:  duplicate key value violates unique constraint "foo_pkey"
DETAIL:  Key (id)=(1) already exists.
CONTEXT:  COPY foo, line 1

You need to delete the offending record from the target table to get it working:

DELETE FROM foo WHERE id=1;

Then try again

// launch copy, get 
COPY 2

// Check imported data
\connect target
SELECT * FROM foo;
 id
----
  0
  1
  2
(3 rows)
With filter
Syntax
\copy (SELECT * FROM foo WHERE id>=3) to stdout
Sample

To avoid this, filter upstream: use an SQL query as the source.

\connect source
INSERT INTO foo VALUES (3);

psql \
    -U postgres \
    -h localhost \
    -d source \
    -c "\\copy (SELECT * FROM foo WHERE id>=3) to stdout" | \
psql -U postgres \
    -h localhost \
    -d target \
    -c "\\copy foo from stdin"
COPY 1

Check imported data

\connect target
SELECT * FROM foo;
 id
----
  0
  1
  2
  3
(4 rows)

pg_bulkload

It is an extension to load files.

pg_bulkload is designed to load huge amount of data to a database. You can load data to table bypassing PostgreSQL shared buffers.

Other tools

See Change Data Capture, aka CDC

Performance issues in DIY

All tools specialized in data transfer/CDC are designed for performance.

If you move data by yourself, there are two paths:

  • you can afford a service outage: do one big batch, which will use all database resources and finish quickly (so the service is shut down for all users); you'll write some code (DDL) but you'll be sure your data is consistent; test on production-like data to get an accurate outage duration;
  • you can't afford a service outage: do small batches; you'll need to write much more code, be extra careful about consistency, and test on a production-like platform to check it doesn't actually degrade the service level.

If you choose service outage (see the sketch after this list):

  • drop all indexes before loading the table, even if it already contains data;
  • disable the constraints (primary key, foreign key);
  • load the data;
  • recreate the indexes;
  • reactivate the constraints;
  • gather statistics: ANALYZE $TABLE;
  • mark rows as frozen: VACUUM FREEZE $TABLE.
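A minimal sketch of that sequence, assuming a hypothetical table t with one index and one foreign key (PostgreSQL has no real "disable constraint" for PK/FK, so the sketch drops and recreates them; adapt names to your schema):

-- 1. Drop indexes and constraints so the load does not maintain them row by row
DROP INDEX t_col_idx;
ALTER TABLE t DROP CONSTRAINT t_other_id_fkey;

-- 2. Load the data (COPY, INSERT INTO ... SELECT, ...)
COPY t FROM '/tmp/t.csv' WITH (FORMAT csv, HEADER);

-- 3. Recreate indexes and constraints in one pass over the table
CREATE INDEX t_col_idx ON t (col);
ALTER TABLE t ADD CONSTRAINT t_other_id_fkey
  FOREIGN KEY (other_id) REFERENCES other (id);

-- 4. Gather statistics and mark the rows as frozen
ANALYZE t;
VACUUM FREEZE t;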

If you choose no service outage (see the sketch after this list):

  • do small batches;
  • move data through a hidden table;
  • offload some tasks out of the database;
  • create indexes concurrently;
  • validate constraints concurrently.
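A hedged sketch of some of these building blocks, reusing the hypothetical users/users_to_reset tables from earlier (COMMIT inside a procedure needs PostgreSQL 11+):

-- Small, throttled batches: process 1000 rows, commit, pause, repeat
CREATE PROCEDURE reset_cgu_in_batches()
LANGUAGE plpgsql AS $$
DECLARE
  rows_done integer;
BEGIN
  LOOP
    WITH batch AS (
      DELETE FROM users_to_reset
      WHERE id IN (SELECT id FROM users_to_reset LIMIT 1000)
      RETURNING id
    )
    UPDATE users u
    SET cgu_accepted = FALSE
    FROM batch b
    WHERE u.id = b.id;

    GET DIAGNOSTICS rows_done = ROW_COUNT;
    EXIT WHEN rows_done = 0;

    COMMIT;               -- release locks between batches
    PERFORM pg_sleep(1);  -- throttle to preserve the service level
  END LOOP;
END $$;

CALL reset_cgu_in_batches();

-- Build an index without blocking writers
CREATE INDEX CONCURRENTLY users_last_logon_idx ON users (last_logon_date);

-- Add a constraint without a long lock, then validate it while traffic keeps flowing
ALTER TABLE users ADD CONSTRAINT users_cgu_not_null CHECK (cgu_accepted IS NOT NULL) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_cgu_not_null;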

Here is a real-world example
