postgres indexes - ghdrako/doc_snipets GitHub Wiki
- https://www.postgresql.org/docs/current/btree-implementation.html
- https://www.postgresql.org/docs/current/sql-createindex.html
- https://www.postgresql.org/docs/current/indexes.html
- Percona - PostgreSQL Indexes Tutorial Course
An index in PostgreSQL can be built on a single column or on multiple columns at once; PostgreSQL supports indexes with up to 32 columns. Note that when creating multi-column indexes, you should place the most selective columns first. PostgreSQL considers a multi-column index from the first column onward, so if the leading columns are the most selective, the index access method will be the cheapest.
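As a sketch of the column-ordering rule (the table and column names below are hypothetical), a lookup that always filters on a highly selective `customer_id` benefits from putting that column first:

```sql
-- Hypothetical table: orders(customer_id, status, created_at, ...)
-- customer_id is highly selective; status has only a few distinct values.
CREATE INDEX idx_orders_customer_status
    ON orders (customer_id, status);

-- This index can serve both of the following:
SELECT * FROM orders WHERE customer_id = 42;
SELECT * FROM orders WHERE customer_id = 42 AND status = 'shipped';

-- A query filtering only on status cannot use the leading column,
-- so it will likely fall back to a sequential scan:
SELECT * FROM orders WHERE status = 'shipped';
```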
CREATE INDEX index_name ON table_name [USING method]
(
column_name [ASC | DESC] [NULLS {FIRST | LAST }],
...
);
The method can be one of btree, hash, gist, spgist, gin, or brin; PostgreSQL uses btree by default.
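For illustration (table and column names are hypothetical), the access method is selected with USING:

```sql
-- B-tree is the default, so these two are equivalent:
CREATE INDEX idx_users_email  ON users (email);
CREATE INDEX idx_users_email2 ON users USING btree (email);

-- Hash indexes support equality lookups only:
CREATE INDEX idx_sessions_token ON sessions USING hash (token);

-- BRIN suits very large, naturally ordered tables (e.g. append-only logs):
CREATE INDEX idx_events_created ON events USING brin (created_at);
```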
When an existing index needs to be rebuilt, there are two choices:
- Use the REINDEX command (not suggested);
- Drop the index and build it again (suggested).
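A sketch of the drop-and-rebuild approach (index and table names are hypothetical); the CONCURRENTLY option avoids blocking writes during both steps:

```sql
DROP INDEX CONCURRENTLY IF EXISTS idx_posts_author;
CREATE INDEX CONCURRENTLY idx_posts_author ON posts (author_id);
```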
- Multicolumn indexes — An index that includes multiple columns in its definition
- Partial indexes — An expression in the index definition that reduces the set of rows covered for a table; shown with examples at https://hakibenita.com/postgresql-unused-index-size
- Covering indexes — An index that covers all columns needed for a query
- Functional indexes — An index built on the result of a function or expression
- Operator classes — Operators to be used by the indexes, specific to a column
https://www.postgresql.org/docs/current/indexes-index-only-scans.html
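A minimal sketch of the partial and functional index types listed above (table and column names are hypothetical):

```sql
-- Partial index: only rows matching the predicate are indexed.
CREATE INDEX idx_orders_pending ON orders (created_at)
    WHERE status = 'pending';

-- Functional (expression) index: indexes the result of an expression.
CREATE INDEX idx_users_lower_email ON users (lower(email));

-- The expression index is used by queries such as:
SELECT * FROM users WHERE lower(email) = 'alice@example.com';
```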
PostgreSQL version 11 added the INCLUDE keyword, which can be used to create covering indexes. The INCLUDE keyword specifies columns that only supply data, called “payload” columns. Payload columns cannot be used for filtering, as in WHERE clauses or JOIN conditions, but supply data for columns listed in the SELECT clause.
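A sketch of a covering index with INCLUDE (names are hypothetical); the payload column enables an index-only scan:

```sql
CREATE INDEX idx_orders_customer_incl
    ON orders (customer_id) INCLUDE (total_amount);

-- Can be answered from the index alone (index-only scan),
-- because total_amount is stored as payload:
SELECT customer_id, total_amount FROM orders WHERE customer_id = 42;
```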
SELECT PG_SIZE_PRETTY(PG_TOTAL_RELATION_SIZE('index_name'));
SELECT
pg_tables.schemaname AS schema,
pg_tables.tablename AS table,
pg_stat_all_indexes.indexrelname AS index,
pg_stat_all_indexes.idx_scan AS number_of_scans,
pg_stat_all_indexes.idx_tup_read AS tuples_read,
pg_stat_all_indexes.idx_tup_fetch AS tuples_fetched
FROM pg_tables
LEFT JOIN pg_class ON(pg_tables.tablename = pg_class.relname)
LEFT JOIN pg_index ON(pg_class.oid = pg_index.indrelid)
LEFT JOIN pg_stat_all_indexes USING(indexrelid)
WHERE pg_tables.schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_tables.schemaname, pg_tables.tablename;
Warning: **Some index accesses may not update the statistics**. You should evaluate these results carefully and not blindly trust them.
SELECT *
FROM pg_class, pg_index
WHERE pg_index.indisvalid = false
AND pg_index.indexrelid = pg_class.oid;
SELECT pg_size_pretty( pg_relation_size( 'posts' ) ) AS table_size,
       pg_size_pretty( pg_relation_size( 'idx_posts_date' ) ) AS idx_date_size,
       pg_size_pretty( pg_relation_size( 'idx_posts_author' ) ) AS idx_author_size;
The pg_stat_user_indexes view provides information about how many times an index has been used and how.
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'posts';
- idx_scan — how many times the index has been used
The PostgreSQL catalog view pg_stat_all_indexes shows the total number of index uses (scans, reads, and fetches) since the last statistics reset.
Note that some primary key indexes are never used for data retrieval; however, they are vital for data integrity and should not be removed.
The following query lists sessions holding ShareUpdateExclusiveLock, the lock taken by operations such as CREATE INDEX CONCURRENTLY and VACUUM:
SELECT l.relation::regclass,
       l.transactionid,
       l.mode,
       l.granted,
       s.query,
       s.query_start,
       age(now(), s.query_start) AS "age",
       s.pid
FROM pg_stat_activity s
JOIN pg_locks l ON l.pid = s.pid
WHERE l.mode = 'ShareUpdateExclusiveLock'
ORDER BY s.query_start;
- https://www.cybertec-postgresql.com/en/postgresql-parallel-create-index-for-better-performance/
The best tuning method for creating indexes is a very high value for maintenance_work_mem:
SET maintenance_work_mem TO '4 GB';
In PostgreSQL 11, parallel index creation is on by default. The parameter controlling this behavior is called max_parallel_maintenance_workers, which can be set in postgresql.conf:
SHOW max_parallel_maintenance_workers;
SET max_parallel_maintenance_workers TO 0; -- no multicore indexing
SET max_parallel_maintenance_workers TO 2;
ALTER TABLE t_demo SET (parallel_workers = 4);
SET max_parallel_maintenance_workers TO 4;
Set PostgreSQL to use larger checkpoint distances by setting the following values in postgresql.conf:
checkpoint_timeout = 120min
max_wal_size = 50GB
min_wal_size = 80MB
Those settings can be activated by reloading the config file:
SELECT pg_reload_conf();
Using tablespaces in PostgreSQL to speed up indexing
CREATE TABLESPACE indexspace LOCATION '/ssd1/tabspace1';
CREATE TABLESPACE sortspace LOCATION '/ssd2/tabspace2';
SET temp_tablespaces TO sortspace;
CREATE INDEX idx6 ON t_demo (data) TABLESPACE indexspace;
DDL changes like CREATE INDEX on tables with high row counts and high transaction volume (transactions per second (TPS) or queries per second) should be performed using the CONCURRENTLY option. This applies to both creating and dropping an index. If your index creation gets canceled, it leaves behind an INVALID index that must be dropped; you'll need to raise your statement_timeout and try again.
An index created concurrently will not lock the table against writes, which means you can still insert or update rows. In order to execute the (slightly slower) index creation, Postgres will do the following:
- Scan the table once to build the index;
- Scan the table a second time to index rows added or updated since the first pass.
CREATE INDEX CONCURRENTLY works without locking down the table, allowing concurrent updates. Building an index this way consists of three phases.
- Phase 1: At the start of the first phase, the system catalogs are populated with the new index information. This includes information about the columns used by the new index. As soon as information about the new index is available in the system catalog and is seen by other backends, they will start honouring the new index and ensure that the HOT chain's property is preserved. CIC must wait for all existing transactions to finish before starting the second phase of the index build. This guarantees that no new broken HOT chains are created after the second phase begins.
- Phase 2: So when the second phase starts, we guarantee that new transactions cannot create more broken HOT chains (i.e. HOT chains which do not satisfy the HOT property) with respect to the old indexes as well as the new index. We now take a new MVCC snapshot and start building the index by indexing every visible row in the table. While indexing we use the column value from the visible version and TID of the root of the HOT chain. Since all subsequent updates are guaranteed to see the new index, the HOT property is maintained beyond the version that we are indexing in the second phase. That means if a row is HOT updated, the new version will be reachable from the index entry just added (remember we indexed the root of the HOT chain).
During the second phase, if some other transaction updates the row without changing any indexed column, a HOT update is possible. On the other hand, if the update changes an indexed column, then a non-HOT update is performed.
- Phase 3: You must have realised that while second phase was running, there could be transactions inserting new rows in the table or updating existing rows in a non-HOT manner. Since the index was not open for insertion during phase 2, it will be missing entries for all these new rows. This is fixed by taking a new MVCC snapshot and doing another pass over the table. During this pass, we index all rows which are visible to the new snapshot, but are not already in the index. Since the index is now actively maintained by other transactions, we only need to take care of the rows missed during the second phase.
The index is fully ready when the third pass finishes. It's now being actively maintained by all other backends, following usual HOT rules. But the problem with old transactions, which could see rows indexed in neither the second nor the third phase, remains. After all, their snapshots could see rows older than what our snapshots used for building the index could see.
The new index is not usable for such old transactions. CIC deals with the problem by waiting for all such old transactions to finish before marking the index ready for queries. Remember that the index was marked for insertion at the end of the second phase, but it becomes usable for reads only after the third phase finishes and all old transactions are gone.
Once all old transactions are gone, the index becomes fully usable by all future transactions. The catalogs are once again updated with the new information and cache invalidation messages are sent to other processes.
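While a concurrent build runs, its progress through the phases described above can be observed in PostgreSQL 12+ via the pg_stat_progress_create_index view; a sketch (index and table names hypothetical):

```sql
-- Session 1: start a concurrent build on a large table.
CREATE INDEX CONCURRENTLY idx_posts_author ON posts (author_id);

-- Session 2: watch which phase the build is in and how far along it is.
SELECT pid, phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index;
```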
create index concurrently "index11" on test using btree (id);
create index concurrently "index12" on test using btree (id, subject_name);
create unique index concurrently "index14" on test using btree (id);
create index concurrently "index15" on toy using btree(availability) where availability is true;
create index concurrently on employee ((lower (name)));
SELECT
tablename,
indexname,
indexdef
FROM
pg_indexes
WHERE
tablename LIKE 'c%'
ORDER BY
tablename,
indexname;
\d employee;
Find out the current state of all your indexes
SELECT
t.tablename,
indexname,
c.reltuples AS num_rows,
pg_size_pretty(pg_relation_size(quote_ident(t.tablename)::text)) AS table_size,
pg_size_pretty(pg_relation_size(quote_ident(indexrelname)::text)) AS index_size,
CASE WHEN indisunique THEN 'Y'
ELSE 'N'
END AS UNIQUE,
idx_scan AS number_of_scans,
idx_tup_read AS tuples_read,
idx_tup_fetch AS tuples_fetched
FROM pg_tables t
LEFT OUTER JOIN pg_class c ON t.tablename=c.relname
LEFT OUTER JOIN
( SELECT c.relname AS ctablename, ipg.relname AS indexname, x.indnatts AS number_of_columns, idx_scan, idx_tup_read, idx_tup_fetch, indexrelname, indisunique FROM pg_index x
JOIN pg_class c ON c.oid = x.indrelid
JOIN pg_class ipg ON ipg.oid = x.indexrelid
JOIN pg_stat_all_indexes psai ON x.indexrelid = psai.indexrelid )
AS foo
ON t.tablename = foo.ctablename
WHERE t.schemaname='ebk'
ORDER BY 1,2;
When you declare a unique constraint, a primary key constraint, or an exclusion constraint, PostgreSQL creates an index for you.
PostgreSQL implements several index access methods. An access method is a generic algorithm with a clean API that can be implemented for compatible data types. Each algorithm is well adapted to particular use cases.
SELECT relname, relpages, reltuples,
i.indisunique, i.indisclustered, i.indisvalid,
pg_catalog.pg_get_indexdef(i.indexrelid, 0, true)
FROM pg_class c JOIN pg_index i on c.oid = i.indrelid
WHERE c.relname = 'posts';
The pg_get_indexdef() function provides a textual representation of the CREATE INDEX statement used to produce each index, and can be very useful to decode and learn how complex indexes are built.
To improve these filtering operations using the WHERE clause, Postgres keeps statistics about the selectivity of the values contained in each table column. This information is stored in the pg_statistic catalog, and there is a more human-readable view called pg_stats.
SELECT
    attname AS "column_name",
    n_distinct AS "distinct_rate",
    array_to_string(most_common_vals, E'\n') AS "most_common_values",
    array_to_string(most_common_freqs, E'\n') AS "most_common_frequencies"
FROM pg_stats
WHERE tablename = 'address'
  AND attname = 'postal_code';
- The distinct_rate shows the proportion of distinct values versus the total rows. In the case of a column with all unique values, such as a primary key, this rate is 100%, represented by the value -1. The column from the example is near -1, meaning almost all the values are distinct.
- The most_common_values gives insight into the postal_code values that repeat the most. In this case, four values: null, 22474, 9668, and 52137.
- Finally, the most_common_frequencies shows the frequency of each of the most common values; the nulls are 0.66% of the total values, and each of the other values represents 0.33% of the total values from the sample.
We can verify this information as follows:
pagila=# WITH total AS (
    SELECT count(*)::numeric AS cnt FROM address
)
SELECT
    postal_code,
    count(address.*),
    round(count(address.*)::numeric * 100 / total.cnt, 2) AS "percentage"
FROM address, total
GROUP BY address.postal_code, total.cnt
HAVING count(address.*) > 1
ORDER BY 2 DESC;
You need to be an administrator user to invalidate an index, even if you are the user that created it. This is because you need to manipulate the system catalog, an activity restricted to administrator users only.
UPDATE pg_index SET indisvalid = false
WHERE indexrelid = ( SELECT oid FROM pg_class
WHERE relkind = 'i'
AND relname = 'idx_author_created_on' );
The index is then marked as INVALID to indicate that PostgreSQL will never consider it in its execution plans. You can, of course, restore the index to its original status by making the same update and setting the indisvalid column back to true.
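A sketch of re-enabling the index by reversing the catalog update (the index name follows the preceding example):

```sql
-- Restore the index to a valid state so the planner considers it again.
UPDATE pg_index SET indisvalid = true
WHERE indexrelid = ( SELECT oid FROM pg_class
                     WHERE relkind = 'i'
                       AND relname = 'idx_author_created_on' );
```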
REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name
where option can be one of:
CONCURRENTLY [ boolean ] — rebuild the index concurrently (online)
TABLESPACE new_tablespace — rebuild the index in a new tablespace
VERBOSE [ boolean ] — print progress messages while reindexing
REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [CONCURRENTLY ] name
REINDEX TABLE payment;
You can decide to rebuild a single index by means of the INDEX argument followed by the name of the index, or you can rebuild all the indexes of a table by means of the TABLE argument followed, as you can imagine, by the table name. Going further, you can rebuild all the indexes of all the tables within a specific schema by means of the SCHEMA argument (followed by the name of the schema) or the whole set of indexes of a database using the DATABASE argument and the name of the database you want to reindex. Lastly, you can also rebuild indexes on system catalog tables by means of the SYSTEM argument. You can execute REINDEX within a transaction block but only for a single index or table, which means only for the INDEX and TABLE options. All the other forms of the REINDEX command cannot be executed in a transaction block. The CONCURRENTLY option prevents the command from acquiring exclusive locks on the underlying table in a way similar to that of building a new index.
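A few sketch invocations of the forms described above (index, table, and schema names are hypothetical, except payment from the earlier example):

```sql
REINDEX INDEX idx_posts_author;          -- rebuild one index
REINDEX (VERBOSE) TABLE payment;         -- rebuild all indexes of a table, with logs
REINDEX TABLE CONCURRENTLY payment;      -- avoid exclusive locks (PostgreSQL 12+)
REINDEX SCHEMA public;                   -- rebuild every index in a schema
```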
Check for time spent doing sequential scans of your data with a filter step, as that's the part that a proper index might be able to optimize.
If a column is of a numeric type, it can be modified by adding zero to its value. For example, the condition attr1+0=p_value will block the usage of an index on column attr1. For any data type, the coalesce() function will always block the usage of indexes, so assuming attr2 is not nullable, the condition can be modified to something like coalesce(t1.attr2, '0')=coalesce(t2.attr2, '0').
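A sketch of the index-blocking trick described above (table and column names are hypothetical):

```sql
-- An index on t1.attr1 exists, but we want the planner not to use it:
SELECT * FROM t1 WHERE attr1 + 0 = 100;  -- attr1+0 blocks the index

-- coalesce() blocks index usage for any data type
-- (assuming attr2 is not nullable, the join is unchanged semantically):
SELECT *
FROM t1
JOIN t2 ON coalesce(t1.attr2, '0') = coalesce(t2.attr2, '0');
```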
Operator Classes and Operator Families
CREATE INDEX name ON table (column opclass [ ( opclass_options ) ] [sort options] [, ...]);
SELECT * FROM directories WHERE path LIKE '/1/2/3/%'
CREATE INDEX CONCURRENTLY ON directories (path);
The default B-tree index created in PostgreSQL is not optimal for queries with LIKE on a text prefix, because this index works best with queries requiring a full ordering (from smallest to largest value). For “prefix” queries (searching by the initial fragment of a string), such an index may not be used effectively by PostgreSQL.
This index builds a structure that helps quickly find a specific value or range of values in order from beginning to end, which works well for queries such as:
SELECT * FROM directories WHERE path = '/1/2/3/'
or
SELECT * FROM directories WHERE path > '/1/2/3/'
In both cases PostgreSQL can use the B-tree index, because the = and > operations preserve the full ordering in which a B-tree is efficient.
For the query LIKE '/1/2/3/%', PostgreSQL needs to check all values that begin with /1/2/3/, without further ordering by the characters after that prefix. The default B-tree index does not efficiently support searches based only on the beginning of the text, so **PostgreSQL may ignore such an index** and perform a full table scan (which is slower).
Using the text_pattern_ops operator class tells PostgreSQL that you want to use the index for prefix matching. Such an index is optimized for “starts with” searches, so PostgreSQL can find results with a given prefix faster instead of scanning the whole table.
CREATE INDEX CONCURRENTLY ON directories (path text_pattern_ops);
text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops support B-tree indexes on the types text, varchar, and char respectively. The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard “C” locale.
Note that you should also create an index with the default operator class if you want queries involving ordinary <, <=, >, or >= comparisons to use an index. Such queries cannot use the xxx_pattern_ops operator classes. (Ordinary equality comparisons can use these operator classes, however.) It is possible to create multiple indexes on the same column with different operator classes. If you use the C locale, you do not need the xxx_pattern_ops operator classes, because an index with the default operator class is usable for pattern-matching queries in the C locale.
psql has the commands \dAc, \dAf, and \dAo to show operator classes, operator families, and their operators.
-- Find missing indexes (look at seq_scan counts)
-- https://stackoverflow.com/a/12818168/126688
SELECT
relname AS TableName,
TO_CHAR(seq_scan, '999,999,999,999') AS TotalSeqScan,
TO_CHAR(idx_scan, '999,999,999,999') AS TotalIndexScan,
TO_CHAR(n_live_tup, '999,999,999,999') AS TableRows,
PG_SIZE_PRETTY(PG_RELATION_SIZE(relname::REGCLASS)) AS TableSize
FROM pg_stat_all_tables
WHERE schemaname = 'public' -- change schema name, i.e. 'rideshare' if not 'public'
-- AND 50 * seq_scan > idx_scan -- more than 2%, add filters to narrow down results
-- AND n_live_tup > 10000 -- narrow down results for bigger tables
-- AND pg_relation_size(relname::REGCLASS) > 5000000
ORDER BY totalseqscan DESC;
-- missing indexes from GCP docs:
-- Optimize CPU usage
-- https://cloud.google.com/sql/docs/postgres/optimize-cpu-usage
SELECT
relname,
idx_scan,
seq_scan,
n_live_tup
FROM
pg_stat_user_tables
WHERE
seq_scan > 0
ORDER BY
n_live_tup DESC;
- https://www.cybertec-postgresql.com/en/b-tree-index-deduplication/ PostgreSQL 13 introduced B-tree index deduplication, which stores duplicate index entries (including NULLs) only once.
UPDATE creates a new version of a table row, and each row version has to be indexed. Consequently, even a unique index can contain the same index entry multiple times. The new feature is useful for unique indexes because it reduces the index bloat that normally occurs if the same table row is updated frequently
Deduplication results in a smaller index size for indexes with repeating entries. This saves disk space and, even more importantly, RAM when the index is cached in shared buffers. Scanning the index becomes faster, and index bloat is reduced.
If you want to make use of the new feature after upgrading with pg_upgrade, you'll have to REINDEX promising indexes that contain many duplicate entries. Upgrading with pg_dumpall and restore or using logical replication rebuilds the index, and deduplication is automatically enabled.
If you want to retain the old behavior, you can disable deduplication by setting deduplicate_items = off on the index.
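A sketch of toggling the storage parameter on an existing index (index name hypothetical):

```sql
-- Disable deduplication on one index:
ALTER INDEX idx_posts_author SET (deduplicate_items = off);

-- After upgrading with pg_upgrade, rebuild to enable deduplication:
REINDEX INDEX idx_posts_author;
```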