postgres database nationality collation - ghdrako/doc_snipets GitHub Wiki

CREATE DATABASE ebkdb WITH OWNER postgres ENCODING 'UTF8' LC_COLLATE = 'pl_PL.UTF-8' LC_CTYPE = 'pl_PL.UTF-8' TEMPLATE template0;

Show encoding of your database

SELECT pg_encoding_to_char(encoding)
FROM pg_database
WHERE datname = 'yourdb';

select datname, datcollate from pg_database; -- SHOW COLLATION;
SHOW COLLATION LIKE 'utf32%';
SHOW COLLATION WHERE Charset = 'cp1251';

To check for non-default collations on columns, you can use the following query:

select table_schema, 
       table_name, 
       column_name,
       collation_name
from information_schema.columns
where collation_name is not null
order by table_schema,
         table_name,
         ordinal_position;

To find the collation of the database, you need to query pg_database:

select datname, 
       datcollate
from pg_database;

SELECT datcollate AS collation
FROM pg_database 
WHERE datname = current_database();

Change encoding/collation - safe way

$ pg_dump your_database > your_database.sql
psql> DROP DATABASE your_database  -- or ALTER DATABASE your_database RENAME TO your_database_backup;
psql> CREATE DATABASE your_database WITH ENCODING 'UNICODE' TEMPLATE=template0;
$ PGCLIENTENCODING=<YOUR_OLD_ENCODING psql> -f your_database.sql your_database

Collation

Collations in PostgreSQL are available depending on operating system support.

$ locale -a
select * from pg_collation;

LC_COLLATE and LC_CTYPE settings of a database cannot be changed after its creation. LC_COLLATE affects comparisons between strings. In practice, the most visible effect is the sort order.

The locale settings influence the following SQL features:

Sort order in queries using ORDER BY or the standard comparison operators on textual data
The upper, lower, and initcap functions
Pattern matching operators (LIKE, SIMILAR TO, and POSIX-style regular expressions); locales affect both case insensitive matching and the classification of characters by character-class regular expressions
The to_char family of functions
The ability to use indexes with LIKE clauses

Locale support is initialized when the database is created. After that, you can no longer change the locale, because the default collation selects LC_COLLATE and LC_CTYPE values are specified at database creation time. Appropriate language settings are defined in the below subcategories.

Name	Desc
LC_COLLATE	String sort order
LC_CTYPE	Character classification (What is a letter? Its upper-case equivalent?)
LC_MESSAGES	Language of messages
LC_MONETARY	Formatting of currency amounts
LC_NUMERIC	Formatting of numbers
LC_TIME	Formatting of dates and times

Set Collation for a Database

CREATE DATABASE polish_database  LC_COLLATE 'pl_PL.UTF-8' LC_CTYPE 'pl_PL.UTF-8'; -- error because the new database is created as a clone of the standard system database template1, which may contain encoding-specific or locale-specific data, whereas template0 doesn’t.
CREATE DATABASE polish_database TEMPLATE template0 LC_COLLATE 'pl_PL.UTF-8' LC_CTYPE 'pl_PL.UTF-8';

Set Collation for a Table

It’s possible to set collation for each column using COLLATE clause. If the clause is omitted, the collation is the default for the new column type.

Create table collated_polish (some_text text collate "pl_PL.utf8");

Create collation

When we don’t have the appropriate collation when doing initdb, but we do have the appropriate locale intalled on the system, we need to create a collation:

Create collation ru (locale = "ru_RU.utf8");

After that the collation will be added to the pg_collation table.

Set Collation for the Column:

CREATE TABLE dictionary (
   a text COLLATE "fr_FR",
   b text COLLATE "pl_PL",
   c text COLLATE "de_DE",
     ...
   ...
);

Changing the Definition of Collation

PostgreSQL provides the ALTER COLLATION statement which can:

change the name of collations

ALTER COLLATION collation_name RENAME TO new_name;

change the owner of collations

ALTER COLLATION collation_name OWNER TO new_owner;

change the schema of collations

ALTER COLLATION collation_name SET SCHEMA new_schema

Example

ALTER COLLATION “ru_RU” RENAME TO russian

Example from utf8 database

select firstname from (values ('bernard'), ('bérénice'), ('béatrice'), ('boris'))  AS l(firstname) order by firstname collate "fr_FR";
select firstname from (values ('bernard'), ('bérénice'), ('béatrice'), ('boris'))  AS l(firstname) order by firstname collate "en_US";
select firstname from (values ('bernard'), ('bérénice'), ('béatrice'), ('boris'))  AS l(firstname) order by firstname collate "C";

show lc_ctype;
select initcap('élysée');
select initcap('élysée' collate "C");
select 'élysée' ~ '^\w+$';
select 'élysée' COLLATE "C" ~ '^\w+$';

Having LC_CTYPE set to 'C' implies that C functions like isupper(c) or tolower(c) give expected results only for characters in the US-ASCII range (that is, up to codepoint 0x7F in Unicode).

locale -a   # list avaliable locale in system
sudo locale-gen es_PE.UTF-8    # locale-gen en_US.UTF-8  # create locale
locale -a|grep -i es
sudo systemctl restart postgresql # restart postgres
postgres=# create database db_izipay_prod with template=template0 encoding='utf8' lc_collate='es_ES.UTF-8' lc_ctype='es_ES.UTF-8' owner=postgres; # CREATE DATABASE db WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8';

CREATE COLLATION "pb_PB.utf8" (lc_collate = 'Portuguese_Brazil', lc_ctype = 'Portuguese_Brazil');
CREATE DATABASE pb WITH ENCODING 'utf8' LC_COLLATE='Portuguese_Brazil' LC_CTYPE='Portuguese_Brazil' TEMPLATE=template0;

CREATE DATABASE compte WITH
        TEMPLATE = template0
        encoding = 'utf8'
        LC_COLLATE = 'en_CA.utf8'
        LC_CTYPE = 'en_CA.utf8'
        ;

Change collation

update pg_database set datcollate='POSIX.UTF-8', datctype='POSIX.UTF-8' where datname='databasename';

After that rebuild of your string index. But not guarantee that will be ok.

A collation is a set of rules about how a set of objects should be ordered. PostgreSQL currently only supports one collation at a time, as fixed by the LC_COLLATE variable at the time the database cluster is initialised.

Show collation for all column and index in table

https://stackoverflow.com/questions/56934368/collation-in-postgresql-db-level-table-level-column-level

The collation of a column is stored in the attcollation of the column's pg_attribute row. A value of 100 signifies the “default collation”, which is the database collation stored in pg_database.

This is a way to find the collation of all columns for a table:

WITH defcoll AS (
   SELECT datcollate AS coll
   FROM pg_database
   WHERE datname = current_database()
)
SELECT a.attname,
       CASE WHEN c.collname = 'default'
            THEN defcoll.coll
            ELSE c.collname
       END AS collation
FROM pg_attribute AS a
   CROSS JOIN defcoll
   LEFT JOIN pg_collation AS c ON a.attcollation = c.oid
WHERE a.attrelid = 'test1c'::regclass
  AND a.attnum > 0
ORDER BY attnum;

 attname | collation 
---------+-----------
 id      | 
 content | C
(2 rows)

For indexes, the collations are stored in the column pg_index.indcollation:

WITH defcoll AS (
   SELECT datcollate AS coll
   FROM pg_database
   WHERE datname = current_database()
)
SELECT icol.pos,
       CASE WHEN c.collname = 'default'
            THEN defcoll.coll
            ELSE c.collname
       END AS collation
FROM pg_index AS i
   CROSS JOIN unnest(i.indcollation) WITH ORDINALITY AS icol(coll, pos)
   CROSS JOIN defcoll
   LEFT JOIN pg_collation AS c ON c.oid = icol.coll
WHERE i.indexrelid = 'idx_test'::regclass
ORDER BY icol.pos;

 pos | collation 
-----+-----------
   1 | C
(1 row)

update pg_database set encoding = pg_char_to_encoding('SQL_ASCII') where datname = 'mytable'

How to solve UTF8 invalid byte sequence copy errors on a restore, when the source database is encoded in UTF8?

Null character

https://confluence.atlassian.com/stashkb/database-migration-to-postgresql-fails-invalid-byte-sequence-for-encoding-utf8-0x00-696386363.html