Postgres FTS (Full Text Search)
Full Text Search is built into PostgreSQL. Since PostgreSQL 8.3, the following are available natively:
- tsvector – a data type representing a searchable document
- tsquery – a data type representing a search query
- Functions: to_tsvector, to_tsquery, plainto_tsquery, ts_rank, ts_headline, etc.
- GIN/GiST indexes – to speed up search queries
- Support for multiple languages (e.g. 'english', 'polish')
SELECT to_tsvector('polish', 'To jest przykładowy tekst do wyszukania');
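To see what to_tsvector actually produces, here is a sketch using the english configuration (available by default, whereas 'polish' requires the corresponding dictionary to be installed): tokens are stemmed into lexemes, positions are recorded, and stop words are dropped.

```sql
SELECT to_tsvector('english', 'The quick brown foxes jumped over the lazy dog');
-- Output resembles (lexemes sorted alphabetically, with token positions):
--  'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2
-- "the" and "over" are removed as stop words; "foxes" is stemmed to "fox".
```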
On-the-fly search
SELECT *
FROM articles
WHERE to_tsvector('polish', content) @@ plainto_tsquery('polish', 'szukana fraza');
Searching a dedicated, indexed column
CREATE TABLE articles (
  id serial PRIMARY KEY,
  title text,
  content text,
  document_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('polish', coalesce(title, '') || ' ' || coalesce(content, ''))) STORED
);
CREATE INDEX idx_articles_document_tsv ON articles USING GIN(document_tsv);
SELECT *
FROM articles
WHERE document_tsv @@ plainto_tsquery('polish', 'szukana fraza');
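The ts_rank and ts_headline functions mentioned earlier are typically combined with such a query to build a results page. This is a sketch against the articles table defined above; the LIMIT and the snippet column name are arbitrary choices:

```sql
SELECT id, title,
       ts_rank(document_tsv, query) AS rank,          -- relevance score
       ts_headline('polish', content, query) AS snippet  -- matched fragment with highlights
FROM articles,
     plainto_tsquery('polish', 'szukana fraza') AS query
WHERE document_tsv @@ query
ORDER BY rank DESC
LIMIT 10;
```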
Searching
The plainto_tsquery function is useful in scenarios where all lexemes from a generated query must be present in a target tsvector value derived from our dataset. However, if we need more advanced filtering during a search, such as using the & (AND), | (OR), or <-> (FOLLOWED BY) operators, or any combination of these, we can use the to_tsquery function. The to_tsquery function is defined as follows:
to_tsquery([ config regconfig, ] querytext text) returns tsquery
The function returns a tsquery value created from querytext, which must consist of text tokens/words separated by the & (AND), | (OR), ! (NOT), and <-> (FOLLOWED BY) operators; tokens can also be grouped using parentheses. Like other full-text search functions, to_tsquery accepts an optional regconfig argument if we need to specify a full-text search configuration different from the default one.
Using a combination of the AND and OR operators
SELECT id, name
FROM omdb.movies
WHERE lexemes @@ to_tsquery('computer & animated & (lion | clownfish | donkey)');
Using the NOT operator and filtering by phrase
SELECT id, name
FROM omdb.movies
WHERE lexemes @@ to_tsquery('lion & !''The Lion King''');
This query uses the ! (NOT) operator to exclude any movies with the phrase "The Lion King" in their description. The phrase must be enclosed in quotes (written as doubled single quotes, '' '', inside the SQL string literal) to ensure it is treated as a single unit.
We can see how Postgres checks that the phrase "The Lion King" is not present in the movie description by looking at the result of the to_tsquery('lion & !''The Lion King''') function call:
SELECT * FROM to_tsquery('lion & !''The Lion King''');
to_tsquery
---------------------------------
'lion' & !( 'lion' <-> 'king' )
(1 row)
As the output shows, the to_tsquery function converts the phrase into two lexemes ("lion" and "king") and removes "the" because it is a stop word. The function also adds the <-> (FOLLOWED BY) operator between the lexemes, indicating that "king" must immediately follow "lion" in the description.
The <N> form of the FOLLOWED BY operator allows specifying the distance between lexemes. To find a movie where "king" is the third token after "return", we can set N to 3, as shown in this example:
SELECT id, name
FROM omdb.movies
WHERE lexemes @@ to_tsquery('return <3> king');
This query returns a single match:
id | name
-----+-----------------------------------------------
122 | The Lord of the Rings: The Return of the King
(1 row)
This result satisfies the search criteria because the description includes the token "return", followed by "of", "the", and "king", with "king" being the third token after "return".
Postgres provides two additional functions that you might find useful in your applications:
- The phraseto_tsquery function works similarly to plainto_tsquery, except it inserts the <-> operator between lexemes instead of the & operator. It also accounts for stop words by inserting <N> operators instead of <-> whenever applicable.
- The websearch_to_tsquery function is ideal for processing raw input directly from users. Users can include simple search operators (like AND, OR, or quotes for phrases) without worrying about special formatting, as the function is designed to handle such input without raising syntax errors.
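As a sketch, reusing the omdb.movies table from the examples above, a raw user query can be passed straight to websearch_to_tsquery without sanitization:

```sql
-- Quoted text becomes a phrase (<->), "or" becomes |, unquoted words are ANDed.
SELECT id, name
FROM omdb.movies
WHERE lexemes @@ websearch_to_tsquery('"lion king" or clownfish');
-- websearch_to_tsquery('"lion king" or clownfish') parses to roughly:
--   'lion' <-> 'king' | 'clownfish'
```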
Trigger-based solution
- Add the column:
ALTER TABLE articles ADD COLUMN content_tsv tsvector;
- Populate it:
UPDATE articles
SET content_tsv = to_tsvector('polish', content);
- Add a GIN index (it greatly speeds up queries):
CREATE INDEX idx_articles_content_tsv ON articles USING GIN(content_tsv);
- Add a trigger so that the content_tsv column is always up to date:
CREATE FUNCTION update_content_tsv() RETURNS trigger AS $$
BEGIN
  NEW.content_tsv := to_tsvector('polish', NEW.content);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_update_content_tsv
BEFORE INSERT OR UPDATE ON articles
FOR EACH ROW EXECUTE FUNCTION update_content_tsv();
- Search:
SELECT *
FROM articles
WHERE content_tsv @@ plainto_tsquery('polish', 'szukana fraza');
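If the search should cover both title and content (as in the generated-column variant shown earlier), the trigger function can combine both fields and weight them with setweight, so that ts_rank scores title matches higher. This is a sketch; the A/B weight assignment is an assumption, not part of the original setup:

```sql
CREATE OR REPLACE FUNCTION update_content_tsv() RETURNS trigger AS $$
BEGIN
  -- Weight title matches ('A') above body matches ('B'); ts_rank honors these weights.
  NEW.content_tsv :=
      setweight(to_tsvector('polish', coalesce(NEW.title, '')), 'A')
   || setweight(to_tsvector('polish', coalesce(NEW.content, '')), 'B');
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```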
Overall, full-text search in Postgres works as follows:
- Tokenization: The original text data, usually referred to as a document, is parsed into tokens, which are smaller text components such as words or phrases. For instance, the sentence “5 explorers are travelling” is tokenized into: “5,” “explorers,” “are,” and “travelling.”
- Normalization: The database converts tokens into lexemes, which are the basic units of meaning in a language. During normalization, Postgres removes case sensitivity, stems tokens to their root forms, and removes stop words such as articles or prepositions. For example, after normalizing the four tokens from the previous step, the database can produce the following three lexemes: “5,” “explor,” and “travel.” The token “travelling” is stemmed to its root form, “travel,” and “explorers” is converted to the lexeme “explor.” The token “are” is removed because it is a stop word with no significant meaning. The token “5” remains unchanged.
- Storing and indexing: The lexemes are stored in the database to avoid repeating the tokenization and normalization steps. Postgres provides a special data type called tsvector, specifically designed to store lexemes. Additionally, lexemes can be indexed to improve the performance of full-text search queries.
- Searching: A tsquery value is matched against the stored tsvector values using the @@ match operator; functions such as ts_rank can then order the results by relevance.
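The whole pipeline can be reproduced in a single query. Using the english configuration and the sentence from the tokenization step above, both the document and the query are normalized to the same lexemes, so the match succeeds:

```sql
-- 'explorers' and 'explorer' both normalize to 'explor'; 'travelling' and
-- 'travel' both normalize to 'travel'; 'are' is dropped as a stop word.
SELECT to_tsvector('english', '5 explorers are travelling')
       @@ to_tsquery('english', 'explorer & travel') AS matches;
-- matches = t
```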