Move data
Follow COPY progress: `pg_stat_progress_copy`
Doc
Sometimes you need to:
- modify existing data, e.g. reset CGU acceptance in the `users` table;
- based on existing data, e.g. all users whose last logon date is more than 2 years old;
- on a large dataset, e.g. 2 million users;
- in a one-shot fashion.
This is usually called "data correction", or "historic takeover".
What I hear from time to time is:
- extract a CSV file identifying which data should be modified;
- load this file into a working table;
- modify existing data in fixed-size batches (e.g. 10 000 users at a time), driven by the working table.
Because "a CSV file is good for performance".
Let's reformulate what "performance" means here, and which words to use.
We need to focus on speed, as we handle a big dataset, but:
- the CSV format is not fast;
- file usage is not fast (writing to and reading from the file system is notoriously slow);
- batching in small chunks is not fast.
It is actually faster to delete/create/update data in one operation in a database. It will use more resources than doing it in several operations, and can therefore degrade the service for other operations, but it is quicker.
Modifying all data in one operation can be called batching: this is a single batch.
But we may also call "batching" modifying one row at a time, which is the minimum batch size. This will be slower than modifying all rows at once, but it may not use as many resources if we use throttling, e.g. if each batch is executed serially, with some inactivity time between batches. This way, we lessen the impact on service level. The problem here is that "batching" usually means "processing more than one row", but processing a few rows and processing all rows are not the same thing.
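To make throttled batching concrete, here is a minimal sketch; the `my_db` database, the `users_to_fix` working table, the `cgu_accepted_at` column, the batch size and the pause are assumptions for illustration, not part of the original procedure.

```bash
# Throttled batching: fix at most 10 000 users per iteration,
# pause between iterations, stop when the working table is exhausted.
while true; do
  updated=$(psql -qAt -d my_db -c "
    WITH batch AS (
      SELECT user_id FROM users_to_fix
      WHERE processed IS FALSE
      LIMIT 10000
      FOR UPDATE SKIP LOCKED
    ), fix AS (
      UPDATE users u
      SET cgu_accepted_at = NULL
      FROM batch b
      WHERE u.id = b.user_id
    ), mark AS (
      UPDATE users_to_fix t
      SET processed = TRUE
      FROM batch b
      WHERE t.user_id = b.user_id
      RETURNING 1
    )
    SELECT count(*) FROM mark;")
  [ "$updated" -eq 0 ] && break
  sleep 5   # throttling: leave room for regular traffic
done
```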
Worst, "batch" is also used to mean periodic automated processes (aka scheduling, or scripts), which is not what we were taking about in the first place : we need data to be modified only once.
Here, what we need is to get a snapshot of the data to be modified, to pin down the values, so that each row is modified once and no more. We can do that by extracting data to a CSV file and reloading it, sure, but we can also:
- feed working tables bypassing CSV, using `CREATE TABLE AS SELECT` or `INSERT INTO ... SELECT` (see the sketch below);
- start a transaction to pin down the state and use a cursor, or a subquery in `UPDATE`.
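As a minimal sketch of the CSV-free path (the `users` table, the `last_logged_at` and `cgu_accepted_at` columns and the `users_to_fix` working table are assumed names):

```sql
-- Build the working table directly from a query, bypassing any CSV file
CREATE TABLE users_to_fix AS
  SELECT id AS user_id, FALSE AS processed
  FROM users
  WHERE last_logged_at < now() - INTERVAL '2 years';

-- Single-batch variant: modify every targeted row in one statement
UPDATE users u
SET cgu_accepted_at = NULL
FROM users_to_fix t
WHERE u.id = t.user_id;
```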
We also need to be sure of what happened, in case we modified the wrong data. We tend to think of a CSV file as the only way, stored in our ticketing tool or (worse) on our laptop, but:
- we can create an archive table, which can be stored on cold storage (see the sketch below);
- we can extract a dump.
This would also prevent leaking sensitive data.
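A minimal sketch of the archive-table option (names are again assumptions):

```sql
-- Keep an auditable copy of the values before modifying them;
-- this table can later be dumped and moved to cold storage.
CREATE TABLE users_cgu_backup AS
  SELECT id AS user_id, cgu_accepted_at, now() AS archived_at
  FROM users
  WHERE last_logged_at < now() - INTERVAL '2 years';
```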
From doc
postgres is the PostgreSQL database server. In order for a client application to access a database it connects (over a network or locally) to a running postgres instance. The postgres instance then starts a separate server process to handle the connection.
One postgres instance always manages the data of exactly one database cluster. A database cluster is a collection of databases that is stored at a common file system location (the “data area”). More than one postgres instance can run on a system at one time, so long as they use different data areas and different communication ports.
Let's review this:
- on a `host`, which can be physical (a laptop), virtual (a virtual machine) or a container (Docker);
- on which PostgreSQL binaries are installed;
- a command launches an OS process from the PostgreSQL server binary;
- the process name is `postgres`, and it is called the `postgresql server` in the documentation (which is not the host itself);
- but it is also called a `postgresql (server) instance` in the documentation, as you can run several `postgresql server` processes;
- this process can access some data collection, called a `database cluster`;
- this `database cluster` can be physically located outside the `host`, on a storage device such as a NAS;
- it chooses by default a `database` from this `database cluster`;
- and starts listening on a port.
Here are the constraints:
- a `postgresql server` cannot be run on several hosts (unlike Oracle);
- a `postgresql server` cannot open more than one database at once;
- a `database cluster` cannot be opened by several `database server` processes at once?
A `database` (in a cluster) is a collection of database `schemas`.
A database `schema` is a collection of `tables`.
A database `table` is a collection of `rows` following the table DDL.
Each `database` is created with a default `schema` named `public`.
You can access a table using `$SCHEMA_NAME.$TABLE_NAME`, or skip `$SCHEMA_NAME` by using a search path.
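For instance (schema and table names are assumptions):

```sql
-- fully-qualified access
SELECT * FROM public.users;

-- rely on the search path instead
SET search_path TO public;
SELECT * FROM users;
```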
While multiple databases can be created within a single cluster, it is advised to consider carefully whether the benefits outweigh the risks and limitations
Here is the simplest case:
- a laptop host contains the PostgreSQL binaries and the database cluster;
- the database cluster contains one database;
- the database contains the default public schema;
- the public schema contains a single table;
- the command starts the PostgreSQL binary on the cluster, opening the single database;
- the server listens on port 5432;
- the client on the same host connects on port 5432 and queries the table without using a schema.
Here is the most complex case:
- a virtual host contains the PostgreSQL binaries;
- a NAS contains several database clusters;
- the database clusters contain several databases;
- each database contains several schemas;
- each schema contains several tables;
- a command starts the PostgreSQL binary on one cluster, opening one database;
- this server listens on port 5000;
- another command starts the PostgreSQL binary on another cluster, opening another database;
- this server listens on port 6000;
- a client on a different host connects on port 5000 to the opened database, and queries a table using a schema;
- another client on yet another host connects on port 6000 to the other opened database, and queries another table using another schema.
You have:
- three `host` machines: one server and two clients;
- two non-local `clusters`;
- two `postgresql server` processes;
- two opened `database` with several `schemas`;
- two connections from clients to the server.
Old-school. Here, data is not limited to table data: it also includes indexes, roles, code, and covers all the cluster's databases.
Steps:
- in pg_cli, get the data directory: `SHOW data_directory;`
- stop pg: `sudo systemctl stop postgresql`
- check it's stopped: `sudo systemctl status postgresql` should give `(..) Stopped PostgreSQL RDBMS.`
- backup the directory (see the sketch after this list)
- start pg: `sudo systemctl start postgresql`
- check it's started: `sudo systemctl status postgresql` should give `(..) Started PostgreSQL RDBMS.`
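A minimal sketch of the backup step, assuming `SHOW data_directory;` returned `/var/lib/postgresql/16/main` and that `/backup` exists (both are assumptions, adjust to your own setup):

```bash
sudo systemctl stop postgresql
# archive the whole data area: all the cluster's databases, indexes, roles, code
sudo tar czf /backup/pg_data_$(date +%F).tar.gz /var/lib/postgresql/16/main
sudo systemctl start postgresql
```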
On each and every database of the cluster. Exports everything.
Steps:
- export: `pg_dumpall -h HOST_NAME -p PORT_NUMBER -U USER_NAME > BACKUP_FILE_NAME` (concrete example after this list)
- remove the cluster
- import: `psql -f BACKUP_FILE_NAME postgres`
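For instance (host, port, user and file name are assumptions):

```bash
# export the whole cluster as SQL
pg_dumpall -h localhost -p 5432 -U postgres > cluster_backup.sql
# once the cluster has been re-initialized, replay it through any database
psql -f cluster_backup.sql postgres
```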
Key points:
- works on a single database
- exports: all database data (table data + DDL: table structure, index definitions)
- does NOT export: the database container, users, roles, index data
- produces SQL queries, so it is fully compatible across versions/architectures.
Steps (a concrete run is shown after this list):
- export: `pg_dump -h HOST_NAME -p PORT_NUMBER -U USER_NAME DATABASE_NAME > BACKUP_FILE_NAME`
- remove the database
- create the database
- import: `psql -h HOST_NAME -p PORT_NUMBER DATABASE_NAME < BACKUP_FILE_NAME`
- collect statistics
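A concrete run could look like this (the `my_db` database and connection settings are assumptions):

```bash
pg_dump  -h localhost -p 5432 -U postgres my_db > my_db.sql
dropdb   -h localhost -p 5432 -U postgres my_db
createdb -h localhost -p 5432 -U postgres my_db
psql     -h localhost -p 5432 -U postgres my_db < my_db.sql
# collect statistics so the planner has fresh estimates
psql     -h localhost -p 5432 -U postgres -d my_db -c "ANALYZE;"
```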
pg_restore
Dump content is a binary archive (not human-readable), so to search it:
- object names only: `pg_restore -l <FILE-NAME>`
- search for an instruction: `pg_restore <FILE-NAME> -f - | grep <INSTRUCTION>`
- extract everything: `pg_restore <ARCHIVE_DUMP_FILENAME>.pgsql -f - > <HUMAN_READABLE_DUMP_FILENAME>.sql`
Import only specified objects: `pg_restore <DUMP-FILE-NAME> --use-list <OBJECT-LIST-FILE-NAME>`
To copy all data from one table to another in the same database, use SQL commands (examples below):
- `INSERT INTO $TABLE (SELECT )`;
- `CREATE TABLE $TABLE AS SELECT`, aka CTAS.
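For example (table names are assumptions):

```sql
-- into an existing table
INSERT INTO users_archive SELECT * FROM users;

-- or create the target table on the fly (CTAS)
CREATE TABLE users_copy AS SELECT * FROM users;
```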
To copy all data from one table to another in a different database/host, use:
- a dump
- `COPY`
- `\COPY`
- `pg_bulkload`

Do you have shell access to the remote server?
- yes: use `COPY`
- no:
  - you have `psql` access: use `\COPY`
  - no `psql` access, only `jdbc`:
    - have

Import to add if data is in a database or a file.
`pg_dump` is a client program that can connect to any server version.
It is backward compatible: a dump created using `pg_dump` v1 can be imported using any `pg_restore` starting from v1.
To understand which parameters are compatible, check the doc, then the source.
Dumps can be output in script or archive file formats.
Script dumps are plain-text files containing the SQL commands required to reconstruct the database to the state it was in at the time it was saved. To restore from such a script, feed it to psql. Script files can be used to reconstruct the database even on other machines and other architectures; with some modifications, even on other SQL database products.
So, to prevent compatibility issues, use the plain format (script):
pg_dump --format=plain --file=$FILE $DATABASE_NAME
psql --file=$FILE $DATABASE_NAME
The alternative archive file formats must be used with pg_restore to rebuild the database. They allow pg_restore to be selective about what is restored, or even to reorder the items prior to being restored. The archive file formats are designed to be portable across architectures.
pg_dump --format=custom --file=$FILE $DATABASE_NAME
pg_restore --dbname=$DATABASE_NAME $FILE
Errors you may run into:
This will extract data in plain format, then load it into an existing table.
(Parallel `--jobs` connections are only supported with the directory format, so they are not used here; see the next example.)
Extract
pg_dump \
  --host=$HOST --dbname=$DATABASE_NAME --username=$USER_NAME \
  --format=plain \
  --encoding="UTF-8" \
  --file=$DUMP_FILE_PATH \
  --data-only \
  --table=$TABLE_NAME --verbose
Transfer file.
Load data into the existing table
psql \
--host=$HOST --dbname=$DATABASE_NAME --username=$USER_NAME \
--file=$DUMP_FILE_PATH
This will extract data as compressed files in a single directory, then recreate them in another database. It will use `$CONNEXION_COUNT` connections to speed up the extract and load operations.
Extract
pg_dump \
--host=$HOST --dbname=$DATABASE_NAME --username=$USER_NAME \
--jobs=$CONNEXION_COUNT \
--format=directory \
--encoding="UTF-8" \
--file=$DUMP_FILE_PATH --verbose
Transfer directory.
Create database
pg_restore \
--host=$HOST --dbname=$DATABASE_NAME --username=$USER_NAME \
--jobs=$CONNEXION_COUNT \
$DUMP_FILE_PATH --verbose
Useful if:
- you want to access the server filesystem;
- and you can access the server filesystem.
Useful if:
- you want to use the client filesystem;
- or you can't access the server filesystem.
Official documentation contains great tips about performance.
Connect remotely to the source database using psql:
\COPY users TO 'users.csv' CSV HEADER
Then connect remotely to the target database using psql:
\COPY users FROM 'users.csv' WITH CSV HEADER;
Use a temporary file and a Unix stream to filter the data to be imported?
psql \
  -h localhost \
  -d your_primary_db \
  -U postgres \
  -c "\copy users (id, email, first_name, last_name) from STDIN with delimiter as ','" \
  < /tmp/users.csv
From a remote database to another remote database
# SOURCE piped into TARGET
psql \
  --username=source_user_name \
  --host=source_host \
  --dbname=source_database_name \
  --command="\\copy users to stdout" \
| psql \
  --username=target_user_name \
  --host=target_host \
  --dbname=target_database_name \
  --command="\\copy users from stdin"
I.e. on a localhost server, from the `source` database to the `target` database, on the `foo` table.
Create tables and data
CREATE DATABASE source;
\connect source
CREATE TABLE foo (ID INTEGER PRIMARY KEY);
INSERT INTO foo VALUES (1);
INSERT INTO foo VALUES (2);

CREATE DATABASE target;
\connect target
CREATE TABLE foo (ID INTEGER PRIMARY KEY);
INSERT INTO foo VALUES (0);
INSERT INTO foo VALUES (1);
Copy
psql \
-U postgres \
-h localhost \
-d source \
-c "\\copy foo to stdout" | \
psql -U postgres \
-h localhost \
-d target \
-c "\\copy foo from stdin"
Note: `\COPY` does NOT:
- create the table structure
- disable constraints (especially the PK)
- ROLLBACK on error
So, in the example below, you get the error message and no row is imported:
ERROR: duplicate key value violates unique constraint "foo_pkey"
DETAIL: Key (id)=(1) already exists.
CONTEXT: COPY foo, line 1
You need to delete the offending record in the target database to get it working:
DELETE FROM foo WHERE id=1;
Then try again:
Launch the copy again, and you get:
COPY 2
Check the imported data:
\connect target
SELECT * FROM foo;
+------+
|  id  |
+------+
|   0  |
|   1  |
|   2  |
+------+
To avoid this, use a SQL query as the source, e.g. `\copy (SELECT * FROM foo WHERE id>=3) to stdout`.
Add a new row to the source table:
\connect source
INSERT INTO foo VALUES (3);
psql \
-U postgres \
-h localhost \
-d source \
-c "\\copy (SELECT * FROM foo WHERE id>=3) to stdout" | \
psql -U postgres \
-h localhost \
-d target \
-c "\\copy foo from stdin"
COPY 1
Check imported data
\connect target
SELECT * FROM foo;
+------+
|  id  |
+------+
|   0  |
|   1  |
|   2  |
|   3  |
+------+
It is an extension to load files.
pg_bulkload is designed to load huge amount of data to a database. You can load data to table bypassing PostgreSQL shared buffers.
See Change Data Capture, aka CDC
All tools specialized in data transfer/CDC are designed for performance.
If you move data by yourself, there are two paths:
- you can afford a service outage: do one big batch, which will use all the database resources and finish quickly (so shut the service down for all users); you'll write some code (DDL) but you'll be sure that your data will be consistent; test on production-like data to get an accurate outage duration;
- you can't afford a service outage: do small batches; you'll need to write much more code, be extra careful with consistency, and test on a production-like platform to check it doesn't actually degrade the service level.
If you choose the service outage path (see the sketch after this list):
- drop all indexes before loading the table, even if it contained data before loading;
- disable the constraints (primary key, foreign key);
- load the data;
- recreate the indexes;
- reactivate the constraints;
- gather statistics: `ANALYZE $TABLE`;
- mark rows as frozen: `VACUUM FREEZE $TABLE`.
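A minimal sketch of that sequence, assuming a `users` table with one index and one foreign key (all names are hypothetical):

```sql
ALTER TABLE users DROP CONSTRAINT users_team_id_fkey;
DROP INDEX users_email_idx;

-- load the data here, e.g. with COPY or INSERT INTO ... SELECT

CREATE INDEX users_email_idx ON users (email);
ALTER TABLE users ADD CONSTRAINT users_team_id_fkey
  FOREIGN KEY (team_id) REFERENCES teams (id);

ANALYZE users;
VACUUM (FREEZE) users;
```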
If you choose the no-service-outage path (see the sketch after this list):
- do small batches;
- move data into a hidden table;
- offload some tasks out of the database;
- create indexes concurrently;
- validate constraints concurrently.
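A minimal sketch of the concurrent variants (hypothetical names again):

```sql
-- build the index without blocking writes
CREATE INDEX CONCURRENTLY users_email_idx ON users (email);

-- add the constraint without scanning existing rows, then validate it separately
ALTER TABLE users ADD CONSTRAINT users_team_id_fkey
  FOREIGN KEY (team_id) REFERENCES teams (id) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_team_id_fkey;
```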
Here is a real-world example