postgres indexes - ghdrako/doc_snipets GitHub Wiki
An index in PostgreSQL can be built on a single column or on multiple columns at once; PostgreSQL supports indexes with up to 32 columns. When creating multi-column indexes, always place the most selective columns first: PostgreSQL considers a multi-column index from the first column onward, so if the leading columns are the most selective, the index access method will be the cheapest.
```sql
CREATE INDEX index_name ON table_name [USING method]
(
    column_name [ASC | DESC] [NULLS { FIRST | LAST }],
    ...
);
```
The method can be one of btree, hash, gist, spgist, gin, and brin; PostgreSQL uses btree by default.
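For illustration, a couple of sketches with made-up table and column names — a B-tree suits range and equality queries, while a hash index only helps equality lookups:

```sql
-- B-tree (the default), good for range scans and ORDER BY
CREATE INDEX idx_posts_created_on ON posts (created_on DESC NULLS LAST);

-- Hash index, useful only for equality comparisons
CREATE INDEX idx_posts_token ON posts USING hash (token);
```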
To rebuild invalid indexes you have two choices:
- use the REINDEX command (not suggested);
- drop the index and build it again (suggested).
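A sketch of the suggested approach (index and table names are illustrative); using CONCURRENTLY for the rebuild avoids locking the table against writes:

```sql
DROP INDEX IF EXISTS idx_posts_author;
CREATE INDEX CONCURRENTLY idx_posts_author ON posts (author);
```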
Detect invalid indexes
```sql
SELECT *
FROM pg_class, pg_index
WHERE pg_index.indisvalid = false
  AND pg_index.indexrelid = pg_class.oid;
```
Index size
```sql
SELECT pg_size_pretty( pg_relation_size('posts') )            AS table_size,
       pg_size_pretty( pg_relation_size('idx_posts_date') )   AS idx_date_size,
       pg_size_pretty( pg_relation_size('idx_posts_author') ) AS idx_author_size;
```
Index usage
The pg_stat_user_indexes view provides information about how many times an index has been used, and how.
```sql
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'posts';
```
idx_scan shows how many times the index has been used.
The PostgreSQL catalog view pg_stat_all_indexes shows the total number of index uses (scans, reads, and fetches) since the last statistics reset.
Note that some primary key indexes are never used for data retrieval; however, they are vital for data integrity and should not be removed.
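Building on these statistics views, a query along these lines can surface candidate unused indexes; it excludes unique indexes, which may back constraints and must be kept:

```sql
SELECT s.indexrelname,
       s.idx_scan,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0       -- never used since the last statistics reset
  AND NOT i.indisunique    -- keep unique / primary key indexes
ORDER BY pg_relation_size(s.indexrelid) DESC;
```

Remember that idx_scan counts only since the last statistics reset, so a zero here is a hint, not proof the index is useless.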
Verify the progress of the create index script
```sql
SELECT l.relation::regclass,
       l.transactionid,
       l.mode,
       l.granted,
       s.query,
       s.query_start,
       age(now(), s.query_start) AS "age",
       s.pid
FROM pg_stat_activity s
JOIN pg_locks l ON l.pid = s.pid
WHERE l.mode = 'ShareUpdateExclusiveLock'
ORDER BY s.query_start;
```
Performance index creation
- https://www.cybertec-postgresql.com/en/postgresql-parallel-create-index-for-better-performance/
The best tuning knob for creating indexes is a very high value for maintenance_work_mem:

```sql
SET maintenance_work_mem TO '4GB';
```
Since PostgreSQL 11, parallel index creation is on by default. The parameter in charge of this is max_parallel_maintenance_workers, which can be set in postgresql.conf or per session:

```sql
SHOW max_parallel_maintenance_workers;
SET max_parallel_maintenance_workers TO 0;  -- disable multicore index builds
SET max_parallel_maintenance_workers TO 2;

-- raise the per-table limit, then allow more maintenance workers
ALTER TABLE t_demo SET (parallel_workers = 4);
SET max_parallel_maintenance_workers TO 4;
```
Set PostgreSQL to use larger checkpoint distances by changing postgresql.conf to the following values:

```
checkpoint_timeout = 120min
max_wal_size = 50GB
min_wal_size = 80MB
```
Those settings can be activated by reloading the config file:

```sql
SELECT pg_reload_conf();
```
Using tablespaces in PostgreSQL to speed up indexing
```sql
CREATE TABLESPACE indexspace LOCATION '/ssd1/tabspace1';
CREATE TABLESPACE sortspace  LOCATION '/ssd2/tabspace2';

SET temp_tablespaces TO sortspace;
CREATE INDEX idx6 ON t_demo (data) TABLESPACE indexspace;
```
Create Index Concurrently (CIC)
An index created concurrently will not lock the table against writes, so you can still insert or update rows while it is being built. To execute the (slightly slower) concurrent index creation, Postgres does the following:
- scans the table once to build the index;
- scans the table a second time to index rows added or updated since the first pass.
CREATE INDEX CONCURRENTLY works without locking down the table, allowing concurrent updates to it. Building an index concurrently consists of three phases.
- Phase 1: At the start of the first phase, the system catalogs are populated with the new index information. This obviously includes information about the columns used by the new index. As soon as information about the new index is available in the system catalog and is seen by other backends, they will start honouring the new index and ensure that the HOT chain's property is preserved. CIC must wait for all existing transactions to finish before starting the second phase of the index build. This guarantees that no new broken HOT chains are created after the second phase begins.
- Phase 2: So when the second phase starts, we guarantee that new transactions cannot create more broken HOT chains (i.e. HOT chains which do not satisfy the HOT property) with respect to the old indexes as well as the new index. We now take a new MVCC snapshot and start building the index by indexing every visible row in the table. While indexing we use the column value from the visible version and TID of the root of the HOT chain. Since all subsequent updates are guaranteed to see the new index, the HOT property is maintained beyond the version that we are indexing in the second phase. That means if a row is HOT updated, the new version will be reachable from the index entry just added (remember we indexed the root of the HOT chain).
During the second phase, if some other transaction updates a row such that no indexed column (of the old indexes or the new one) is changed, a HOT update is possible. On the other hand, if the update changes any indexed column, a non-HOT update is performed.
- Phase 3: You must have realised that while second phase was running, there could be transactions inserting new rows in the table or updating existing rows in a non-HOT manner. Since the index was not open for insertion during phase 2, it will be missing entries for all these new rows. This is fixed by taking a new MVCC snapshot and doing another pass over the table. During this pass, we index all rows which are visible to the new snapshot, but are not already in the index. Since the index is now actively maintained by other transactions, we only need to take care of the rows missed during the second phase.
The index is fully ready when the third pass finishes. It's now being actively maintained by all other backends, following the usual HOT rules. But the problem with old transactions, which could see rows that are indexed in neither the second nor the third phase, remains. After all, their snapshots could see rows older than what the snapshots used for building the index could see.
The new index is not usable for such old transactions. CIC deals with the problem by waiting for all such old transactions to finish before marking the index ready for queries. Remember that the index was marked for insertion at the end of the second phase, but it becomes usable for reads only after the third phase finishes and all old transactions are gone.
Once all old transactions are gone, the index becomes fully usable by all future transactions. The catalogs are once again updated with the new information and cache invalidation messages are sent to other processes.
```sql
CREATE INDEX CONCURRENTLY "index11" ON test USING btree (id);
CREATE INDEX CONCURRENTLY "index12" ON test USING btree (id, subject_name);
CREATE UNIQUE INDEX CONCURRENTLY "index14" ON test USING btree (id);
CREATE INDEX CONCURRENTLY "index15" ON toy USING btree (availability) WHERE availability IS true;
CREATE INDEX CONCURRENTLY ON employee ((lower(name)));
```
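If a concurrent build fails or is cancelled, the index is left behind marked INVALID. You can spot such leftovers and drop them before retrying, for example:

```sql
-- list invalid indexes
SELECT c.relname AS index_name
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE NOT i.indisvalid;

-- drop the failed build, then re-run CREATE INDEX CONCURRENTLY
DROP INDEX CONCURRENTLY "index12";
```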
```sql
SELECT tablename,
       indexname,
       indexdef
FROM pg_indexes
WHERE tablename LIKE 'c%'
ORDER BY tablename, indexname;
```

```
\d employee
```
Find out the current state of all your indexes
```sql
SELECT t.tablename,
       indexname,
       c.reltuples AS num_rows,
       pg_size_pretty(pg_relation_size(quote_ident(t.tablename)::text)) AS table_size,
       pg_size_pretty(pg_relation_size(quote_ident(indexrelname)::text)) AS index_size,
       CASE WHEN indisunique THEN 'Y' ELSE 'N' END AS unique,
       idx_scan AS number_of_scans,
       idx_tup_read AS tuples_read,
       idx_tup_fetch AS tuples_fetched
FROM pg_tables t
LEFT OUTER JOIN pg_class c ON t.tablename = c.relname
LEFT OUTER JOIN (
    SELECT c.relname AS ctablename, ipg.relname AS indexname,
           x.indnatts AS number_of_columns,
           idx_scan, idx_tup_read, idx_tup_fetch,
           indexrelname, indisunique
    FROM pg_index x
    JOIN pg_class c ON c.oid = x.indrelid
    JOIN pg_class ipg ON ipg.oid = x.indexrelid
    JOIN pg_stat_all_indexes psai ON x.indexrelid = psai.indexrelid
) AS foo ON t.tablename = foo.ctablename
WHERE t.schemaname = 'ebk'
ORDER BY 1, 2;
```
Indexing for constraints
When you declare a unique constraint, a primary key constraint, or an exclusion constraint, PostgreSQL creates an index for you.
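A minimal illustration (table and constraint names are made up): declaring the constraints implicitly creates the backing unique indexes, visible in pg_indexes:

```sql
CREATE TABLE users (
    id    integer PRIMARY KEY,                     -- creates unique index users_pkey
    email text CONSTRAINT users_email_key UNIQUE   -- creates unique index users_email_key
);

SELECT indexname FROM pg_indexes WHERE tablename = 'users';
```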
Index Access Methods
PostgreSQL implements several index access methods. An access method is a generic algorithm with a clean API that can be implemented for compatible data types. Each algorithm is well adapted to some use cases.
Index information
```sql
SELECT relname, relpages, reltuples,
       i.indisunique, i.indisclustered, i.indisvalid,
       pg_catalog.pg_get_indexdef(i.indexrelid, 0, true)
FROM pg_class c
JOIN pg_index i ON c.oid = i.indrelid
WHERE c.relname = 'posts';
```
The pg_get_indexdef() function provides a textual representation of the CREATE INDEX statement used to produce each index, and can be very useful to decode and learn how to build complex indexes.
To improve these filtering operations (the WHERE clause), Postgres keeps statistics about the selectivity of the values contained in each table column. This information is stored in the pg_statistic catalog, and there is a more human-readable view called pg_stats.
```sql
SELECT attname AS "column_name",
       n_distinct AS "distinct_rate",
       array_to_string(most_common_vals, E'\n')  AS "most_common_values",
       array_to_string(most_common_freqs, E'\n') AS "most_common_frequencies"
FROM pg_stats
WHERE tablename = 'address'
  AND attname = 'postal_code';
```
- distinct_rate shows the proportion of distinct values versus the total rows. For a column of unique values, such as a primary key, this rate is 100% and is represented with the value -1. The table column from the example is near -1, meaning almost all the values are distinct.
- most_common_values gives insight into the postal_code values that repeat the most; in this case, four values: null, 22474, 9668, and 52137.
- most_common_frequencies shows the frequency of each of the most common values: the nulls are 0.66% of the total values, and each of the other values represents 0.33% of the total values from the sample.
We can verify this information as follows:

```sql
WITH total AS (SELECT count(*)::numeric AS cnt FROM address)
SELECT postal_code,
       count(address.*),
       round(count(address.*)::numeric * 100 / total.cnt, 2) AS "percentage"
FROM address, total
GROUP BY address.postal_code, total.cnt
HAVING count(address.*) > 1
ORDER BY 2 DESC;
```
Invalidating an index
You need to be an administrator to invalidate an index, even if you are the user who created it. This is because invalidation means manipulating the system catalog, an activity restricted to administrator users only.
```sql
UPDATE pg_index SET indisvalid = false
WHERE indexrelid = ( SELECT oid FROM pg_class
                     WHERE relkind = 'i'
                       AND relname = 'idx_author_created_on' );
```
The index is then marked as INVALID to indicate that PostgreSQL will never consider it for its execution plans. You can, of course, restore the index to its original status by running the same update with the indisvalid column set to true.
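The reverse operation, re-enabling the index, is the same catalog update with indisvalid set back to true:

```sql
UPDATE pg_index SET indisvalid = true
WHERE indexrelid = ( SELECT oid FROM pg_class
                     WHERE relkind = 'i'
                       AND relname = 'idx_author_created_on' );
```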
Rebuilding an index
```
REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name
```

where option can be one of:
- CONCURRENTLY [ boolean ] – the reindex is performed online, concurrently with normal operations
- TABLESPACE new_tablespace – the reindexing is done into a new tablespace
- VERBOSE [ boolean ] – progress logs are printed while reindexing

```sql
REINDEX TABLE payment;
```
You can decide to rebuild a single index by means of the INDEX argument followed by the name of the index, or all the indexes of a table by means of the TABLE argument followed by the table name. Going further, you can rebuild all the indexes of all the tables within a specific schema with the SCHEMA argument (followed by the name of the schema), or the whole set of indexes of a database with the DATABASE argument and the name of the database you want to reindex. Lastly, you can also rebuild indexes on system catalog tables by means of the SYSTEM argument.

You can execute REINDEX within a transaction block, but only for a single index or table, which means only for the INDEX and TABLE options; all the other forms of the REINDEX command cannot be executed in a transaction block.

The CONCURRENTLY option prevents the command from acquiring exclusive locks on the underlying table, in a way similar to building a new index concurrently.
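A few concrete invocations (index, table, and tablespace names are illustrative; the tablespace `fastspace` is assumed to exist):

```sql
REINDEX INDEX idx_posts_author;                  -- rebuild a single index
REINDEX (VERBOSE) TABLE posts;                   -- all indexes of a table, with logging
REINDEX INDEX CONCURRENTLY idx_posts_author;     -- online rebuild, no exclusive lock
REINDEX (TABLESPACE fastspace) TABLE posts;      -- rebuild into another tablespace (PostgreSQL 14+)
```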
Indexing strategy
Check for time spent doing sequential scans of your data with a filter step, as that is the part that a proper index might be able to optimize.
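The pg_stat_user_tables view is one way to spot tables that get many sequential scans and might benefit from an index:

```sql
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 10;
```

A high seq_tup_read with a low idx_scan is a hint to inspect the queries hitting that table with EXPLAIN.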
Avoid index
If a column is of a numeric type, it can be modified by adding zero to its value: the condition attr1 + 0 = p_value will block the usage of an index on column attr1. For any data type, wrapping the column in the coalesce() function will always block the usage of indexes, so assuming attr2 is not nullable, the condition can be modified to something like coalesce(t1.attr2, '0') = coalesce(t2.attr2, '0').
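A sketch of the effect (the table `t1` and column `attr1` are illustrative, with an index on attr1 assumed): EXPLAIN shows the planner falling back to a sequential scan when the indexed column is wrapped in an expression.

```sql
-- Can use the index on attr1:
EXPLAIN SELECT * FROM t1 WHERE attr1 = 42;

-- The added "+ 0" hides the column from the planner's index matching,
-- so this query is executed with a sequential scan instead:
EXPLAIN SELECT * FROM t1 WHERE attr1 + 0 = 42;
```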