postgres indexes - ghdrako/doc_snipets GitHub Wiki

An index in PostgreSQL can be built on a single column or on multiple columns at once; PostgreSQL supports indexes with up to 32 columns. When creating multi-column indexes, always place the most selective columns first. PostgreSQL considers a multi-column index from the first column onward, so if the leading columns are the most selective, the index access method will be the cheapest.

CREATE INDEX index_name ON table_name [USING method]
(
    column_name [ASC | DESC] [NULLS {FIRST | LAST }],
    ...
);
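
A minimal sketch of that advice, assuming a hypothetical orders table where customer_id is far more selective than status:

CREATE INDEX idx_orders_customer_status ON orders USING btree
(
    customer_id,   -- highly selective: many distinct values, goes first
    status         -- poorly selective: only a handful of distinct values
);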

The method can be one of btree, hash, gist, spgist, gin, or brin; PostgreSQL uses btree by default.
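
Two hedged examples, assuming hypothetical tables with a text token column and a jsonb payload column:

-- hash indexes only support equality comparisons (token = '...')
CREATE INDEX idx_sessions_token ON sessions USING hash (token);

-- GIN suits containment queries on jsonb (payload @> '{"kind": "click"}')
CREATE INDEX idx_events_payload ON events USING gin (payload);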

To rebuild invalid indexes you have two choices:

  • Use the REINDEX command (not suggested);
  • Drop the index and try to re-build it again (suggested).
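
A minimal sketch of the suggested approach, assuming the invalid index is idx_posts_date on a hypothetical created_on column of posts:

DROP INDEX IF EXISTS idx_posts_date;
CREATE INDEX CONCURRENTLY idx_posts_date ON posts (created_on);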

Detect invalid indexes

SELECT * 
  FROM pg_class, pg_index 
 WHERE pg_index.indisvalid = false 
   AND pg_index.indexrelid = pg_class.oid;

Index size

SELECT pg_size_pretty( pg_relation_size( 'posts' ) )            AS table_size,
       pg_size_pretty( pg_relation_size( 'idx_posts_date' ) )   AS idx_date_size,
       pg_size_pretty( pg_relation_size( 'idx_posts_author' ) ) AS idx_author_size;

Index usage

The pg_stat_user_indexes view provides information about how many times an index has been used and how.

SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch 
  FROM pg_stat_user_indexes 
 WHERE relname = 'posts';
  • idx_scan: how many times the index has been used (number of index scans started on it);
  • idx_tup_read: number of index entries returned by scans on the index;
  • idx_tup_fetch: number of live table rows fetched by simple index scans using the index.
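
As a sketch, the same view can be used to spot indexes that have never been scanned since the statistics were last reset:

SELECT schemaname, relname, indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
  FROM pg_stat_user_indexes
 WHERE idx_scan = 0
 ORDER BY pg_relation_size(indexrelid) DESC;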

Verify the progress of a CREATE INDEX statement

SELECT l.relation::regclass,
       l.transactionid, 
       l.mode, 
       l.GRANTED, 
       s.query, 
       s.query_start, 
       age(now(), s.query_start) AS "age", 
       s.pid 
 FROM pg_stat_activity s JOIN pg_locks l 
   ON l.pid = s.pid 
WHERE l.mode = 'ShareUpdateExclusiveLock' 
ORDER BY s.query_start;
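
On PostgreSQL 12 and later, the pg_stat_progress_create_index view reports the build phase directly; a minimal sketch:

SELECT p.pid,
       p.phase,           -- e.g. building index, waiting for old snapshots
       p.blocks_done,
       p.blocks_total,
       p.tuples_done,
       p.tuples_total
  FROM pg_stat_progress_create_index p;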

Index creation performance

SET maintenance_work_mem TO '4GB';

Since PostgreSQL 11, parallel index creation is on by default. The parameter that controls it is max_parallel_maintenance_workers, which can be set in postgresql.conf:

SHOW max_parallel_maintenance_workers;
SET max_parallel_maintenance_workers TO 0; -- no multicore indexing is available
SET max_parallel_maintenance_workers TO 2;

ALTER TABLE t_demo SET (parallel_workers = 4);
SET max_parallel_maintenance_workers TO 4;

To make PostgreSQL use larger checkpoint distances, set the following values in postgresql.conf:

checkpoint_timeout = 120min
max_wal_size = 50GB
min_wal_size = 80MB

Those settings can be activated by reloading the config file:

SELECT pg_reload_conf();

Using tablespaces in PostgreSQL to speed up indexing

CREATE TABLESPACE indexspace LOCATION '/ssd1/tabspace1';
CREATE TABLESPACE sortspace LOCATION '/ssd2/tabspace2';
SET temp_tablespaces TO sortspace;
CREATE INDEX idx6 ON t_demo (data) TABLESPACE indexspace;

Create Index Concurrently (CIC)

An index created concurrently will not lock the table against writes, which means you can still insert or update rows while the index is being built. To execute the (slightly slower) index creation, Postgres will do the following:

  • Scan the table once to build the index;
  • Scan the table a second time to add whatever was inserted or updated since the first pass.

CREATE INDEX CONCURRENTLY works without locking the table against writes, allowing concurrent updates to the table while the index is built. Building the index consists of three phases.

  • Phase 1: At the start of the first phase, the system catalogs are populated with the new index information. This obviously includes information about the columns used by the new index. As soon as information about the new index is available in the system catalog and is seen by other backends, they will start honouring the new index and ensure that the HOT chain's property is preserved. CIC must wait for all existing transactions to finish before starting the second phase of the index build. This guarantees that no new broken HOT chains are created after the second phase begins.
  • Phase 2: So when the second phase starts, we guarantee that new transactions cannot create more broken HOT chains (i.e. HOT chains which do not satisfy the HOT property) with respect to the old indexes as well as the new index. We now take a new MVCC snapshot and start building the index by indexing every visible row in the table. While indexing we use the column value from the visible version and TID of the root of the HOT chain. Since all subsequent updates are guaranteed to see the new index, the HOT property is maintained beyond the version that we are indexing in the second phase. That means if a row is HOT updated, the new version will be reachable from the index entry just added (remember we indexed the root of the HOT chain).

During the second phase, if some other transaction updates a row without changing any indexed column (of either the existing indexes or the new index), a HOT update is possible. On the other hand, if the update changes any indexed column, a non-HOT update is performed.

  • Phase 3: You must have realised that while the second phase was running, there could be transactions inserting new rows into the table or updating existing rows in a non-HOT manner. Since the index was not open for insertion during phase 2, it will be missing entries for all these new rows. This is fixed by taking a new MVCC snapshot and doing another pass over the table. During this pass, we index all rows that are visible to the new snapshot but are not already in the index. Since the index is now actively maintained by other transactions, we only need to take care of the rows missed during the second phase.

The index is fully ready when the third pass finishes. It is now being actively maintained by all other backends, following the usual HOT rules. But the problem with old transactions, which could see rows that are indexed in neither the second nor the third phase, remains. After all, their snapshots could see rows older than what the snapshots used for building the index could see.

The new index is not usable for such old transactions. CIC deals with the problem by waiting for all such old transactions to finish before marking the index ready for queries. Remember that the index was marked for insertion at the end of the second phase, but it becomes usable for reads only after the third phase finishes and all old transactions are gone.

Once all old transactions are gone, the index becomes fully usable by all future transactions. The catalogs are once again updated with the new information and cache invalidation messages are sent to other processes.
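
The two flags involved can be inspected directly in pg_index; a sketch, assuming the index is named index11 as in the examples that follow:

SELECT c.relname,
       i.indisready,   -- true once the index accepts insertions (end of phase 2)
       i.indisvalid    -- true once the index is usable for reads
  FROM pg_index i
  JOIN pg_class c ON c.oid = i.indexrelid
 WHERE c.relname = 'index11';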

create index concurrently "index11" on test using btree (id);
create index concurrently "index12" on test using btree (id, subject_name);
create unique index concurrently "index14" on test using btree (id);
create index concurrently "index15" on toy using btree(availability) where availability is true;
create index concurrently on employee ((lower (name)));
SELECT
    tablename,
    indexname,
    indexdef
FROM
    pg_indexes
WHERE
    tablename LIKE 'c%'
ORDER BY
    tablename,
    indexname;

\d employee

Find out the current state of all your indexes

SELECT
  t.tablename,
  indexname,
  c.reltuples AS num_rows,
  pg_size_pretty(pg_relation_size(quote_ident(t.tablename)::text)) AS table_size,
  pg_size_pretty(pg_relation_size(quote_ident(indexrelname)::text)) AS index_size,
  CASE WHEN indisunique THEN 'Y'
    ELSE 'N'
  END AS UNIQUE,
  idx_scan AS number_of_scans,
  idx_tup_read AS tuples_read,
  idx_tup_fetch AS tuples_fetched
FROM pg_tables t
  LEFT OUTER JOIN pg_class c ON t.tablename=c.relname
  LEFT OUTER JOIN
    ( SELECT c.relname AS ctablename, ipg.relname AS indexname, x.indnatts AS number_of_columns, idx_scan, idx_tup_read, idx_tup_fetch, indexrelname, indisunique FROM pg_index x
      JOIN pg_class c ON c.oid = x.indrelid
      JOIN pg_class ipg ON ipg.oid = x.indexrelid
      JOIN pg_stat_all_indexes psai ON x.indexrelid = psai.indexrelid )
    AS foo
  ON t.tablename = foo.ctablename
WHERE t.schemaname='ebk'
ORDER BY 1,2;

Indexing for constraints

When you declare a unique constraint, a primary key constraint, or an exclusion constraint, PostgreSQL creates the supporting index for you.
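
A minimal sketch with a hypothetical customer table: both constraints below get a unique index automatically, without any explicit CREATE INDEX.

CREATE TABLE customer (
    customer_id integer PRIMARY KEY,                -- creates unique index customer_pkey
    email       text,
    CONSTRAINT customer_email_uq UNIQUE (email)     -- creates unique index customer_email_uq
);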

Index Access Methods

PostgreSQL implements several index access methods. An access method is a generic algorithm with a clean API that can be implemented for compatible data types. Each algorithm is well adapted to specific use cases.
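
The access methods installed in a cluster can be listed from the pg_am catalog (on PostgreSQL 9.6 and later, amtype = 'i' keeps only index access methods):

SELECT amname FROM pg_am WHERE amtype = 'i';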

Index information

SELECT relname, relpages, reltuples,
i.indisunique, i.indisclustered, i.indisvalid,
pg_catalog.pg_get_indexdef(i.indexrelid, 0, true)
FROM pg_class c JOIN pg_index i on c.oid = i.indrelid
WHERE c.relname = 'posts';

The pg_get_indexdef() special function provides a textual representation of the CREATE INDEX statement used to produce every index and can be very useful to decode and learn how to build complex indexes.

To improve these filtering operations using the WHERE clause, PostgreSQL keeps statistics about the selectivity of the values contained in each table column. This information is stored in the pg_statistic catalog, and there is a more human-readable view called pg_stats.

SELECT attname                                   AS "column_name",
       n_distinct                                AS "distinct_rate",
       array_to_string(most_common_vals, E'\n')  AS "most_common_values",
       array_to_string(most_common_freqs, E'\n') AS "most_common_frequencies"
  FROM pg_stats
 WHERE tablename = 'address'
   AND attname = 'postal_code';
  • The distinct_rate shows the proportion of distinct values versus the total rows. In the case of a unique value column, such as the Primary Key, this rate is 100%, and it is represented with the value -1. The table column from the example is near -1, meaning almost all the values are distinct.
  • The most_common_values gives insight into the postal_code values that repeat the most. In this case, four values: null, 22474, 9668, and 52137.
  • Finally, the most_common_frequencies shows the frequency of each of the most common values, the nulls are 0.66% of the total values, and each of the other values represents 0.33% of the total values from the sample.

We can verify this information in the following way:

pagila=# WITH total AS (
    SELECT count(*)::numeric AS cnt FROM address
)
SELECT postal_code,
       count(address.*),
       round(count(address.*)::numeric * 100 / total.cnt, 2) AS "percentage"
  FROM address, total
 GROUP BY address.postal_code, total.cnt
HAVING count(address.*) > 1
 ORDER BY 2 DESC;

Invalidating an index

You need to invalidate an index as an administrator user, even if you are the user who created the index. This is because you need to manipulate the system catalog, which is an activity restricted to administrator users.

UPDATE pg_index SET indisvalid = false
WHERE indexrelid = ( SELECT oid FROM pg_class
WHERE relkind = 'i'
AND relname = 'idx_author_created_on' );

The index is then marked as INVALID to indicate that PostgreSQL will never consider it for its execution plans. You can, of course, reset the index to its original status by running the same UPDATE as above with the indisvalid column set to true.
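
A sketch of the reverse operation for the same index:

UPDATE pg_index SET indisvalid = true
WHERE indexrelid = ( SELECT oid FROM pg_class
                     WHERE relkind = 'i'
                       AND relname = 'idx_author_created_on' );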

Rebuilding an index

REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name

where option can be one of:
  CONCURRENTLY [ boolean ]   -- the reindex is performed concurrently, online
  TABLESPACE new_tablespace  -- the reindexing is done into a new tablespace
  VERBOSE [ boolean ]        -- progress is logged while reindexing

REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name
REINDEX TABLE payment;

You can decide to rebuild a single index by means of the INDEX argument followed by the name of the index, or you can rebuild all the indexes of a table by means of the TABLE argument followed by the table name. Going further, you can rebuild all the indexes of all the tables within a specific schema by means of the SCHEMA argument (followed by the name of the schema), or the whole set of indexes of a database using the DATABASE argument and the name of the database you want to reindex. Lastly, you can also rebuild indexes on system catalog tables by means of the SYSTEM argument.

You can execute REINDEX within a transaction block, but only for a single index or table, which means only for the INDEX and TABLE options. All the other forms of the REINDEX command cannot be executed in a transaction block. The CONCURRENTLY option prevents the command from acquiring exclusive locks on the underlying table, in a way similar to building a new index concurrently.
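
A few sketches of the different scopes, reusing names from the examples above (the schema name is an assumption; REINDEX ... CONCURRENTLY needs PostgreSQL 12 or later):

REINDEX INDEX CONCURRENTLY idx_posts_date;   -- single index, rebuilt online
REINDEX TABLE posts;                         -- all indexes of one table
REINDEX SCHEMA public;                       -- all indexes in a schema
REINDEX (VERBOSE) DATABASE pagila;           -- every index in the current database, with progress logs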

Indexing strategy

Check for time spent doing sequential scans of your data with a filter step, as that is the part that a proper index might be able to optimize.
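
A sketch that spots such tables from the statistics collector: tables with many sequential scans and a large number of tuples read sequentially are the first candidates for a better index.

SELECT relname,
       seq_scan,
       seq_tup_read,
       idx_scan,
       pg_size_pretty(pg_relation_size(relid)) AS table_size
  FROM pg_stat_user_tables
 ORDER BY seq_tup_read DESC
 LIMIT 10;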