postgres index types - ghdrako/doc_snipets GitHub Wiki
Advenced indexes
- Multicolumn indexes - upto 32 columns
- Indexes and ORDER BY
- Combining multiple indexes
- Unique indexes
- Indexes on expressions - In order to be used in an index, a user-defined function must be declared as
IMMUTABLE
, which means its output must be the same for the very same input. - Partial indexes
- Partial unique indexes
- Index-only scans
Algorithms supported:
- B‑tree (by defaut)
- Hash (dengerous in version < 10)
- GiST / SP‑GiST
- GIN – BRIN (version 9.5)
- Bloom (version 9.6)
Index types
- Balanced Tree (B-Tree) - default index PostgreSQL uses. Along with equal-to(=) BTREE works well with the below operators as well:
<
>
<=
>=
IS NULL
IS NOT NULL
BETWEEN
IN
B-tree indexes can also be used to retrieve data in sorted order.B-Tree indexes are sorted in ascending order by default.NULL values are ordered in the front by default. To put them in the back, use NULLS LAST in SQL
- https://evgeniydemin.medium.com/postgresql-indexes-hash-vs-b-tree-84b4f6aa6d61
- https://devcenter.heroku.com/articles/postgresql-indexes#b-trees-and-sorting
CREATE INDEX trips_completed_at_index ON trips(completed_at DESC NULLS LAST);
indeks B-drzewo może zawierać trzy rodzaje węzłów:
- korzeń: jest unikalny i stanowi podstawę drzewa;
- węzły wewnętrzne: może być kilka poziomów;
- liście: zawierają:
- indeksowane (posortowane!) wartości ;
- uwzględnione wartości (jeśli dotyczy);
- pozycje fizyczne (ctid), tutaj w nawiasach i w formie skróconej, ponieważ rzeczywista postać to (numer bloku, pozycja linii w bloku) ;
- adres poprzedniego arkusz i następny arkusz.
Korzenie i węzły wewnętrzne zawierają zapisy, które opisują minimalną wartość każdego bloku niższego poziomu i ich adres (CTID).
Kiedy indeks jest tworzony, zawiera tylko jeden liść. Kiedy ten liść się zapełnia, dzieli się na dwie części, a na nim tworzony jest węzeł główny. Liście następnie stopniowo wypełniają się i oddzielają, gdy są pełne. Proces ten stopniowo wypełnia korzeń. Gdy korzeń jest pełny, dzieli się na dwa wewnętrzne węzły, a na górze tworzy się nowy korzeń. Proces ten pomaga zachować równowagę drzewa.
Przeglądając korzeń, szukamy nagrania, którego wartość jest ściśle większa niż wartość, której szukamy. Tutaj 22 jest mniejsze niż 24: Dlatego badamy węzeł po lewej stronie. - To odniesienie do trzech dolnych węzłów (tutaj liście). Porównujemy ponownie wartość poszukiwaną (posortowaną) węzła: dla każdego przedziału wartości istnieje inny węzeł inny nos aarbre. Drzewo może oczywiście mieć większą głębokość, w którym to przypadku poprzedni krok jest powtarzany. Tutaj liść wskazuje nam, że przy wartości 22 cor - oddychaj dwie linie w pozycjach 2 i 17. Gdy poszukiwana wartość jest większa lub równa największej wartości bloku, PostgreSQL odczytuje również następujący blok. Ten scenariusz może wystąpić, jeśli PostgreSQL podzielił arkusz na pół przed lub nawet podczas wykonywanego wyszukiwania. Byłoby tak, na przykład, gdybyśmy szukali wartości 30. - Aby znaleźć nazwy nazwy, musisz iść i przeszukać tabelę nawet wierszy znalezione w indeksie. Z drugiej strony informacje o widoczności linii należy również znaleźć w tabeli. (Są przypadki, w których badania mogą uniknąć tego ostatniego kroku: są to jedyny wskaźnik skanowania).
W przypadku zapytań bez filtra index nadal może pomoc
SELECT id FROM my_table ORDER BY id ;
Indeks może nam pomóc odpowiedzieć na to pytanie. W rzeczywistości wszystkie liście są ze sobą połączone, co pozwala na zachowanie uporządkowanej ścieżki. Musimy więc zlokalizować pierwszy arkusz (ten najbardziej po lewej) i dla każdego klucza pobrać odpowiednie wiersze. Po przetworzeniu kluczy arkusza wystarczy podążać za wskaźnikiem do następnego arkusza i zacząć od nowa.
Alternatywa polegałaby na przeglądaniu całej tabeli i sortowaniu wszystkich wierszy w celu uzyskania ich we właściwej kolejności. Takie sortowanie może być bardzo drogie, w pamięci, jak w czasie procesora. Ponadto takie sortowania są bardzo często przechowywane na dysku (za pomocą plików tymczasowych), aby nie zachować wszystkich danych w pamięci.
Użycie indeksu w zakresie wartosci
SELECT * FROM ma_table WHERE id <= 10 AND id >= 4 ;
Wystarczy użyć właściwości sort indeksu, aby przejść przez liście, zaczynając od dolnej granicy, do górnej granicy. Ostatnia uwaga: ten diagram pokazuje tylko jeden wpis indeksu dla 22, chociaż wskazuje na dwie linie. W rzeczywistości przed wydaniem PostgreSQL 13 istniały rzeczywiście dwa wpisy dla 22. Od tej wersji PostgreSQL może usuwać duplikaty wpisów.
Indeks wielokolumnowh
-
bezpośredni dostęp do pierwszych kolumn indeksu
-
w przypadku pozostałych kolumn PostgreSQL odczyta cały indeks lub zignoruje indeks.
-
hash index - this index is built on the result of a 32-bit hash function for the value of the column(s). It is important to note that the hash index can be used only for equality operators, not for range nor disequality operators. In fact, being an index built on a hash function, the index cannot compare two hash values to understand their ordering; only the equality (which produces the very same hash value) can be evaluated.
CREATE INDEX invoices_uniqid ON invoices USING HASH (uniqid)
When you plan to search for a column only with equality checks like UUIDs or strings, you can optimize your indexes by using hash indexes. Compared to a standard b-tree index, a hash index will be a tiny bit faster for inserting and querying data. But the index will be much smaller, which is a significant improvement. However, unique hash indexes are not yet supported so you can not use them to enforce uniqueness constraints.
-
https://evgeniydemin.medium.com/postgresql-indexes-hash-vs-b-tree-84b4f6aa6d61
-
finally fixed in Postgres 10 Restriction
-
Equality comparisons (=) only operator support!
-
Can’t enforce unique constraints[215]
-
May be much slower to insert new entries
-
It is mutch smoller then B-Tree index , especially for large key values
-
Very good for long values on which = is primary operation in example index on url, sha-256 values
-
Block Range Index (BRIN) is a particular type of index that is based on the range of values in data blocks on storage. The idea is that every block has a minimal and maximal value, and the index then stores a couple of values for every data block on the storage. When a particular value is searched from a query, the index knows in which data block the values can be found, but all the tuples in the block must be evaluated.
A table for which a BRIN index is created is considered a sequence of block ranges, where each range consists of a fixed number of adjacent blocks. For each range, a BRIN index entry contains a summary of column values contained in the block range. For example, a summary may contain the minimum and maximum values of the timestamp column in an event log table. To find any value of the indexed attribute, it is sufficient to find an appropriate block range (using the index) and then scan all blocks in the range.
The structure of the summarization method depends on the type of the column being indexed. For intervals, a summary may be an interval containing all intervals contained in the block range. For spatial data, a summary can be a bounding box containing all boxes in the block range. If the column values are not ordered or rows are not ordered in the table, a scan of a BRIN index will return multiple block ranges to be scanned. The summarization is expensive. Therefore, PostgreSQL provides multiple choices for BRIN index maintenance: a BRIN index can be updated automatically with triggers; alternatively, delayed summarization can be done automatically together with vacuum or started manually.
-- Set a seed
SELECT SETSEED(0.5);
BEGIN ;
SET LOCAL statement_timeout = '60s' ;
INSERT INTO trip_positions ( position, trip_id, created_at, updated_at)
SELECT POINT ( '(37.769233' || FLOOR(RANDOM() * 10 + 1):: TEXT || ',-122.3890705)' ),
( SELECT MIN(id) FROM trips), seq, seq
FROM GENERATE_SERIES( '2023-08-01 0:00' ::TIMESTAMPTZ, '2023-10-01 0:00' ::TIMESTAMPTZ, '1 second' :: INTERVAL ) AS t(seq);
COMMIT;
CREATE INDEX trip_positions_created_at_brin ON trip_positions USING BRIN (created_at);
CREATE INDEX trip_positions_created_at_btree ON trip_positions (created_at);
-- B-Tree
SELECT PG_SIZE_PRETTY( PG_RELATION_SIZE( 'trip_positions_created_at_btree' ));
-- BRIN
SELECT PG_SIZE_PRETTY( PG_RELATION_SIZE( 'trip_positions_created_at_brin' ));
For this data set size, the B-Tree index was around 113 MB while the BRIN index is much smaller, just 32 KB. That’s 99.97 percent smaller!
- Generalized Inverted Index (GIN) is a type of index that instead of pointing to a single tuple points to multiple values, and to some extent, to an array of values. Usually, this kind of index is used in full-text search scenarios, where you are indexing a written text where there are multiple duplicated keys (for example, the same word or term) that point to different places (for example, the same word in different phrases and lines).
- Gin is designed for handling cases where the items to be indexed are com-posite values, and the queries to be handled by the index need to search for element values that appear within the composite items. For example, the items could be documents, and the queries could be searches for docu-ments containing speci??c words.
- Gin indexes are “inverted indexes” which are appropriate for data values that contain multiple component values, such as arrays. An inverted index contains a separate entry for each component value. Such an index can efficiently handle queries that test for the presence of speciffc component values.
- The GIN access method is the foundation for the PostgreSQL Full Text Search support.
- https://www.postgresql.org/docs/current/gin-intro.html
- https://pganalyze.com/blog/gin-index
- common use: indexing JSONB columns, or to support Postgres full text search.
- The GIN index type was designed to deal with data types that are subdividable and you want to search for individual component values (array elements, lexemes in a text document, etc)”
- GIN indexes only support Bitmap Index Scans (not Index Scan or Index Only Scan), due to the fact that they only store parts of the row values in each index page. Don’t be surprised when EXPLAIN always shows Bitmap Index / Heap Scans for your GIN indexes.
- GIN index can have very different index contents depending on which data type and operator class you are using. Some data types, such as JSONB have more than one GIN operator class to support the most optimal index structure for specific query patterns.
- GIN index is balanced tree
- Efficient for
<@,&&,@@@
operators
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', body));
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
As described in the Postgres documentation, the tsvector focused on lexemes:
“GIN indexes are the preferred text search index type. As inverted indexes, they contain an index entry for each word (lexeme), with a compressed list of matching locations. Multi-word searches can find the first match, then use the index to remove rows that are lacking additional words.”
GIN indexes are the best starting point when using Postgres Full Text Search. There are situations where a GIST index might be preferred (see the Postgres documentation for details), and if you run your own server you could also consider the newer RUM index types available through an extension.
Gin index en be use in like search. In example:
SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';
CREATE INDEX trgm_idx ON test_trgm USING gin (t gin_trgm_ops);
EXPLAIN SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';
Postgres GIN index for JSONB columns using jsonb_ops
and jsonb_path_ops
CREATE TABLE test (
id bigserial PRIMARY KEY,
data jsonb
);
INSERT INTO test(data) VALUES ('{"field": "value1"}');
INSERT INTO test(data) VALUES ('{"field": "value2"}');
INSERT INTO test(data) VALUES ('{"other_field": "value42"}');
CREATE INDEX ON test USING gin(data);
EXPLAIN SELECT * FROM test WHERE data ? 'field';-- querying for all rows that have the field key defined:
default jsonb_ops
operator class would store the following values in the GIN index, as separate entries: field, other_field, value1, value2, value42.
The jsonb_path_ops class is intended to efficiently support containment queries.
CREATE INDEX ON test USING gin(data jsonb_path_ops);
EXPLAIN SELECT * FROM test WHERE data @> '{"field": "value1"}';
we could use a B-tree expression index to index the field keys:
CREATE INDEX ON test USING btree ((data ->> 'field'));
The Postgres query planner will then use the specific expression index behind the scenes, if your query matches the expression:
EXPLAIN SELECT * FROM test WHERE data->>'field' = 'value1';
Downside: Expensive Updates Gin often contain multiple index entries per single row that is being inserted. Due to the fact that a single row can cause 10s or worst case 100s of index entries to be updated, it’s important to understand the special fastupdate mechanism of GIN indexes.
By default fastupdate is enabled for GIN indexes, and it causes index updates to be deferred, so they can occur at a point where multiple updates have to be made, reducing the overhead for a single UPDATE, at the expense of having to do the work at a later point.
The data that is deferred is kept in the special pending list, which then gets flushed to the main index structure in one of three situations:
- The gin_pending_list_limit (default of 4MB) is reached during a regular index update
- Explicit call to the gin_clean_pending_list function
- Autovacuum on the table with the GIN index (GIN pending list cleanup happens at the end of vacuum) As you can imagine this can be quite an expensive operation, which is why one symptom of index write overhead with GIN can be that every Nth INSERT or UPDATE statement suddenly is a lot slower, in case you run into the first scenario above, where the gin_pending_list_limit is reached.
Strategies for dealing with GIN pending list update issues There are multiple alternate ways you can resolve issues like the one GitLab encountered:
(1) Reduce gin_pending_list_limit Have more frequent, smaller flushes This may sound odd - but gin_pending_list_limit started out as being determined by work_mem (instead of being its own setting), and is only configurable separately since Postgres 9.5 - explaining the 4MB default, which may be too high in some cases (2) Increase gin_pending_list_limit Have more opportunities to cleanup the list outside of the regular workload (3) Turning off fastupdate Taking the overhead with each individual INSERT/UPDATE (4) Tune autovacuum to run more often on the table, in order to clean the pending list (5) Explicitly calling gin_clean_pending_list(), instead of relying on Autovacuum (6) Drop the GIN index If you have alternate ways of indexing the data, for example using expression indexes Depending on your workload one or multiple of these approaches could be a good fit.
In addition, it’s important to ensure you have sufficient memory available during the GIN pending list cleanup. The memory limit used for the pending list flush can be confusing, and is not related to the size of gin_pending_list_limit. Instead it uses the following Postgres settings:
work_mem during regular INSERT/UPDATE maintenance_work_mem during gin_clean_pending_list() call autovacuum_work_mem during autovacuum Last but not least, you may want to consider partitioning or sharding a table that encounters problems like this. It may not be the easiest thing to do, but scaling GIN indexes to heavy write workloads is quite a tricky business.
- Generalized Index Search Tree (GIST), which is a platform on top of which new index types can be built. The idea is to provide a pluggable infrastructure where you can define operators and features that can index a data structure.Its implementation in PostgreSQL allows support for 2-dimensional data types such as the geometry point or the rang data types. Those data types don’t support a total order and as a consequence can’t be indexed properly in a B-tree index.
GIST is a family of index structures, each of which supports a certain data type and can be configured to implement several different tree-based index structures. Specifically, it implements index structure for spatial data known as an R-tree. Support for R-tree and a few other indexes is included in the PostgreSQL distribution;
An R-tree index supports a search on spatial data. An index key for an R-tree always represents a rectangle in a multidimensional space. A search returns all objects having a non-empty intersection with the query rectangle. The structure of an R-tree is similar to the structure of a B-tree; however, splitting overflowed nodes is much more complicated. R-tree indexes are efficient for a small number of dimensions (typically, two to three).
- Spaced Partitioned gist SP-GIST, a spatial index used in geographical applications.SP-GiSTindexes are the only PostgreSQL index access method imple-mentation that support non-balanced disk-based data structures, such as quadtrees, k-d trees, and radix trees (tries). This is useful when you want to index 2-dimensional data with very di?ferent densities.
GIST index is efficient for collections of documents where the total number of different terms is small. This is uncommon with texts in natural languages, so GIN indexes are usually more efficient in this case.
- Bloom filters is a space-efficient data structure that is used to test whether an element is a member of a set. In the case of an index access method, it allows fast exclusion of non-matching tuples via signatures whose size is determined at index creation. This type of index is most useful when a table has many attributes and queries test arbitrary combinations of them. A traditional B-tree index is faster than a Bloom index, but it can require many B-tree indexes to support all possible queries where one needs only a single Bloom index. Note however that Bloom indexes only support equality queries, whereas B-tree indexes can also perform inequality and range searches.
- The Bloom filter index is implemented as a PostgreSQL extension starting in PostgreSQL 9.6, and so to be able to use thisaccess methodit’s necessary to first
create extension bloom
.
- The Bloom filter index is implemented as a PostgreSQL extension starting in PostgreSQL 9.6, and so to be able to use thisaccess methodit’s necessary to first
Both Bloom indexes and BRIN indexes are mostly useful when covering mutliple columns. In the case of Bloom indexes, they are useful when the queries themselves are referencing most or all of those columns in equality comparisons.
CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] name ] ON [
ONLY ] table_name [ USING method ]
( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [
ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
[ INCLUDE ( column_name [, ...] ) ]
[ WITH ( storage_parameter = value [, ... ] ) ]
[ TABLESPACE tablespace_name ]
[ WHERE predicate ]
It is possible to store an index in another tablespace than that of the underlying table, and this can be useful to store important indexes in faster storage.
INCLUDE
clause allows you to specify some extra columns of the underlying table that are going to be stored in the index, even if not indexed. The idea is that if the index is useful for an index-only scan, you can still get extra information without the trip to the underlying table. Of course, having a covering index (which is the name of an INCLUDE clause index) means that the index is going to grow in size and, at the same time, every tuple update could require extra index update effort.USING
clause allows the specification of the type of index to be built, and if none is specified, the default B-Tree is used.CONCURRENTLY
clause allows the creation of an index in a concurrent way: when an index is in its building phase, the underlying table is locked against changes so that the index can finish its job of indexing the tuple values. In a concurrent index creation, the table allows changes even during index creation, but once the index has been built, another pass on the underlying table is required to “adjust” what has changed in the meantime.
CREATE INDEX CONCURRENTLY index_name ON table_name using btree (column);