postgres database nationality collation - ghdrako/doc_snipets GitHub Wiki
- https://learnsql.com/blog/collations-in-postgresql/
- http://peter.eisentraut.org/blog/2022/09/26/icu-features-in-postgresql-15
CREATE DATABASE ebkdb WITH OWNER postgres ENCODING 'UTF8' LC_COLLATE = 'pl_PL.UTF-8' LC_CTYPE = 'pl_PL.UTF-8' TEMPLATE template0;
Show encoding of your database
SELECT pg_encoding_to_char(encoding)
FROM pg_database
WHERE datname = 'yourdb';
select datname, datcollate from pg_database; -- SHOW COLLATION;
SHOW COLLATION LIKE 'utf32%';
SHOW COLLATION WHERE Charset = 'cp1251';
To check for non-default collations on columns, you can use the following query:
select table_schema,
table_name,
column_name,
collation_name
from information_schema.columns
where collation_name is not null
order by table_schema,
table_name,
ordinal_position;
To find the collation of the database, you need to query pg_database:
select datname,
datcollate
from pg_database;
SELECT datcollate AS collation
FROM pg_database
WHERE datname = current_database();
Change encoding/collation - safe way
$ pg_dump your_database > your_database.sql
psql> DROP DATABASE your_database -- or ALTER DATABASE your_database RENAME TO your_database_backup;
psql> CREATE DATABASE your_database WITH ENCODING 'UNICODE' TEMPLATE=template0;
$ PGCLIENTENCODING=<YOUR_OLD_ENCODING psql> -f your_database.sql your_database
Collation
- https://www.postgresql.org/docs/current/charset.html
- https://www.postgresql.org/docs/current/locale.html
- https://www.postgresql.org/docs/current/collation.html
- https://www.postgresql.org/docs/current/sql-createcollation.html
- https://pgpedia.info/l/lc_collate.html
- https://www.postgresql.org/docs/current/multibyte.html
Collations in PostgreSQL are available depending on operating system support.
$ locale -a
select * from pg_collation;
LC_COLLATE and LC_CTYPE settings of a database cannot be changed after its creation. LC_COLLATE affects comparisons between strings. In practice, the most visible effect is the sort order.
The locale settings influence the following SQL features:
- Sort order in queries using ORDER BY or the standard comparison operators on textual data
- The upper, lower, and initcap functions
- Pattern matching operators (LIKE, SIMILAR TO, and POSIX-style regular expressions); locales affect both case insensitive matching and the classification of characters by character-class regular expressions
- The to_char family of functions
- The ability to use indexes with LIKE clauses
Locale support is initialized when the database is created. After that, you can no longer change the locale, because the default collation selects LC_COLLATE and LC_CTYPE values are specified at database creation time. Appropriate language settings are defined in the below subcategories.
Name | Desc |
---|---|
LC_COLLATE | String sort order |
LC_CTYPE | Character classification (What is a letter? Its upper-case equivalent?) |
LC_MESSAGES | Language of messages |
LC_MONETARY | Formatting of currency amounts |
LC_NUMERIC | Formatting of numbers |
LC_TIME | Formatting of dates and times |
Set Collation for a Database
CREATE DATABASE polish_database LC_COLLATE 'pl_PL.UTF-8' LC_CTYPE 'pl_PL.UTF-8'; -- error because the new database is created as a clone of the standard system database template1, which may contain encoding-specific or locale-specific data, whereas template0 doesn’t.
CREATE DATABASE polish_database TEMPLATE template0 LC_COLLATE 'pl_PL.UTF-8' LC_CTYPE 'pl_PL.UTF-8';
Set Collation for a Table
It’s possible to set collation for each column using COLLATE clause. If the clause is omitted, the collation is the default for the new column type.
Create table collated_polish (some_text text collate "pl_PL.utf8");
Create collation
When we don’t have the appropriate collation when doing initdb, but we do have the appropriate locale intalled on the system, we need to create a collation:
Create collation ru (locale = "ru_RU.utf8");
After that the collation will be added to the pg_collation
table.
Set Collation for the Column:
CREATE TABLE dictionary (
a text COLLATE "fr_FR",
b text COLLATE "pl_PL",
c text COLLATE "de_DE",
...
...
);
Changing the Definition of Collation
PostgreSQL provides the ALTER COLLATION statement which can:
- change the name of collations
ALTER COLLATION collation_name RENAME TO new_name;
- change the owner of collations
ALTER COLLATION collation_name OWNER TO new_owner;
- change the schema of collations
ALTER COLLATION collation_name SET SCHEMA new_schema
Example
ALTER COLLATION “ru_RU” RENAME TO russian
Example from utf8 database
select firstname from (values ('bernard'), ('bérénice'), ('béatrice'), ('boris')) AS l(firstname) order by firstname collate "fr_FR";
select firstname from (values ('bernard'), ('bérénice'), ('béatrice'), ('boris')) AS l(firstname) order by firstname collate "en_US";
select firstname from (values ('bernard'), ('bérénice'), ('béatrice'), ('boris')) AS l(firstname) order by firstname collate "C";
show lc_ctype;
select initcap('élysée');
select initcap('élysée' collate "C");
select 'élysée' ~ '^\w+$';
select 'élysée' COLLATE "C" ~ '^\w+$';
Having LC_CTYPE set to 'C' implies that C functions like isupper(c) or tolower(c) give expected results only for characters in the US-ASCII range (that is, up to codepoint 0x7F in Unicode).
locale -a # list avaliable locale in system
sudo locale-gen es_PE.UTF-8 # locale-gen en_US.UTF-8 # create locale
locale -a|grep -i es
sudo systemctl restart postgresql # restart postgres
postgres=# create database db_izipay_prod with template=template0 encoding='utf8' lc_collate='es_ES.UTF-8' lc_ctype='es_ES.UTF-8' owner=postgres; # CREATE DATABASE db WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8';
CREATE COLLATION "pb_PB.utf8" (lc_collate = 'Portuguese_Brazil', lc_ctype = 'Portuguese_Brazil');
CREATE DATABASE pb WITH ENCODING 'utf8' LC_COLLATE='Portuguese_Brazil' LC_CTYPE='Portuguese_Brazil' TEMPLATE=template0;
CREATE DATABASE compte WITH
TEMPLATE = template0
encoding = 'utf8'
LC_COLLATE = 'en_CA.utf8'
LC_CTYPE = 'en_CA.utf8'
;
Change collation
update pg_database set datcollate='POSIX.UTF-8', datctype='POSIX.UTF-8' where datname='databasename';
After that rebuild of your string index. But not guarantee that will be ok.
A collation is a set of rules about how a set of objects should be ordered. PostgreSQL currently only supports one collation at a time, as fixed by the LC_COLLATE variable at the time the database cluster is initialised.
Show collation for all column and index in table
The collation of a column is stored in the attcollation of the column's pg_attribute row. A value of 100 signifies the “default collation”, which is the database collation stored in pg_database.
This is a way to find the collation of all columns for a table:
WITH defcoll AS (
SELECT datcollate AS coll
FROM pg_database
WHERE datname = current_database()
)
SELECT a.attname,
CASE WHEN c.collname = 'default'
THEN defcoll.coll
ELSE c.collname
END AS collation
FROM pg_attribute AS a
CROSS JOIN defcoll
LEFT JOIN pg_collation AS c ON a.attcollation = c.oid
WHERE a.attrelid = 'test1c'::regclass
AND a.attnum > 0
ORDER BY attnum;
attname | collation
---------+-----------
id |
content | C
(2 rows)
For indexes, the collations are stored in the column pg_index.indcollation:
WITH defcoll AS (
SELECT datcollate AS coll
FROM pg_database
WHERE datname = current_database()
)
SELECT icol.pos,
CASE WHEN c.collname = 'default'
THEN defcoll.coll
ELSE c.collname
END AS collation
FROM pg_index AS i
CROSS JOIN unnest(i.indcollation) WITH ORDINALITY AS icol(coll, pos)
CROSS JOIN defcoll
LEFT JOIN pg_collation AS c ON c.oid = icol.coll
WHERE i.indexrelid = 'idx_test'::regclass
ORDER BY icol.pos;
pos | collation
-----+-----------
1 | C
(1 row)
update pg_database set encoding = pg_char_to_encoding('SQL_ASCII') where datname = 'mytable'