postgresql load data

COPY
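
COPY is the built-in bulk-load command: the server-side form reads a file that must be accessible to the PostgreSQL server process, while psql's \copy variant streams a client-side file over the connection. A minimal sketch (table and path are hypothetical):

COPY target_table (a, b, c, d)
  FROM '/tmp/file.csv'
  WITH (FORMAT csv, HEADER true);

-- client-side equivalent, run from psql:
\copy target_table (a, b, c, d) from '/tmp/file.csv' with (format csv, header true)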

file_fdw contrib module

It lets you create a virtual (foreign) table that parses the text file every time it is scanned.
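
A minimal sketch, assuming a four-column CSV file with a header row (server and table names are hypothetical):

CREATE EXTENSION file_fdw;

CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE csv_ext (
    a bigint,
    b bigint,
    c char(2),
    d text
) SERVER csv_files
OPTIONS (filename '/tmp/file.csv', format 'csv', header 'true');

-- every scan re-parses /tmp/file.csv:
SELECT * FROM csv_ext;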

pg_restore

The pg_restore utility has an option to reload data in parallel, -j number_of_jobs, though this is only possible if the dump was produced using the custom (or directory) pg_dump format.
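
A minimal sketch (database name, file name, and job count are illustrative):

$ pg_dump -Fc -f mydb.dump mydb      # custom-format dump
$ pg_restore -j 4 -d mydb mydb.dump  # restore with 4 parallel jobs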

EDB*Loader

EDB*Loader is EnterpriseDB's bulk loader for EDB Postgres Advanced Server, modeled on Oracle's SQL*Loader. See https://www.enterprisedb.com/docs/epas/latest/epas_compat_tools_guide/02_edb_loader/.

pgloader

pgloader copes gracefully with errors. The COPY command loads all rows in a single transaction, so a single error is enough to abort the load. pgloader breaks an input file down into reasonably sized chunks and loads them piece by piece. If some rows in a chunk cause errors, pgloader splits the chunk iteratively until it has loaded all the good rows and skipped all the bad rows, which are saved in a separate rejects file for later inspection. This behavior is very convenient if you have large data files with a small percentage of bad rows: for instance, you can edit the rejects file, fix the rows, and load them with another pgloader run.

Versions in the 2.x series of pgloader were written in Python and connected to PostgreSQL through the standard Python client interface. Version 3.x is written in Common Lisp. pgloader is less efficient than loading data files with a plain COPY command, but COPY carries many more restrictions: the file has to be in the right place on the server, has to be in the right format, and must load without throwing errors. pgloader has extra overhead, but it can also load data using multiple parallel threads, so it can end up faster in practice. Its ability to reformat the data via user-defined functions is often essential; a straight COPY command may not be enough.
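
A hedged sketch of the parallelism knobs in a load script's WITH clause (availability of the workers and concurrency settings depends on the pgloader version and source type; the paths and column names are illustrative):

LOAD CSV
   FROM '/tmp/file.csv' (a, b, c, d)
   INTO postgresql://postgres@localhost:5432/postgres?csv (a, b, c, d)
   WITH workers = 4,      -- parallel reader/transformer threads
        concurrency = 2,  -- parallel COPY streams into PostgreSQL
        skip header = 1,
        fields terminated by ',';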

pgloader also allows loading from fixed-width files, which COPY does not support.
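
A hedged sketch of a fixed-width load script, where each field is declared by start position and length (path, offsets, and column names are hypothetical):

LOAD FIXED
   FROM '/tmp/file.fixed'
        (
          a from  0 for 10,
          b from 10 for  8,
          c from 18 for  2,
          d from 20 for 20 [trim right whitespace]
        )
   INTO postgresql://postgres@localhost:5432/postgres?fixed (a, b, c, d)
   WITH truncate;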

If you need to reload the table completely from scratch, then specify the WITH TRUNCATE clause in the pgloader script.
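
A complete example: the csv.load script below recreates the target table, then loads /tmp/file.csv: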

LOAD CSV
   FROM '/tmp/file.csv' (x, y, a, b, c, d)
   INTO postgresql://postgres@localhost:5432/postgres?csv (a, b, d, c)
     WITH truncate,
          skip header = 1,
          fields optionally enclosed by '"',
          fields escaped by double-quote,
          fields terminated by ','
      SET client_encoding to 'latin1',
          work_mem to '12MB',
          standard_conforming_strings to 'on'
   BEFORE LOAD DO
    $$ drop table if exists csv; $$,
    $$ create table csv (
        a bigint,
        b bigint,
        c char(2),
        d text
       );
  $$;
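
The input file, and the command that runs the load:
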
$ cat /tmp/file.csv
Header, with a © sign
"2.6.190.56","2.6.190.63","33996344","33996351","GB","United Kingdom"
"3.0.0.0","4.17.135.31","50331648","68257567","US","United States"
"4.17.135.32","4.17.135.63","68257568","68257599","CA","Canada"
"4.17.135.64","4.17.142.255","68257600","68259583","US","United States"
"4.17.143.0","4.17.143.15","68259584","68259599","CA","Canada"
"4.17.143.16","4.18.32.71","68259600","68296775","US","United States"

$ pgloader csv.load