Postgres FTS (Full Text Search)
Full Text Search is built into PostgreSQL. Since PostgreSQL 8.3, the following are available natively:
- tsvector – a data type representing a searchable document
- tsquery – a data type representing a search query
- Functions: to_tsvector, to_tsquery, plainto_tsquery, ts_rank, ts_headline, etc.
- GIN/GiST indexes – to speed up search queries
- Support for multiple languages (e.g. 'english', 'polish')
SELECT to_tsvector('polish', 'To jest przykładowy tekst do wyszukania');
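To see what to_tsvector actually produces, here is a sketch using the english configuration (available by default, whereas 'polish' requires the corresponding dictionary to be installed): tokens are stemmed into lexemes, positions are recorded, and stop words are dropped.

```sql
SELECT to_tsvector('english', 'The quick brown foxes jumped over the lazy dog');
-- Output resembles (lexemes sorted alphabetically, with token positions):
--  'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2
-- "the" and "over" are removed as stop words; "foxes" is stemmed to "fox".
```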
On-the-fly search
SELECT *
FROM articles
WHERE to_tsvector('polish', content) @@ plainto_tsquery('polish', 'szukana fraza');
Searching a dedicated, indexed column
CREATE TABLE articles (
  id serial PRIMARY KEY,
  title text,
  content text,
  document_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('polish', coalesce(title, '') || ' ' || coalesce(content, ''))) STORED
);
CREATE INDEX idx_articles_document_tsv ON articles USING GIN(document_tsv);
SELECT *
FROM articles
WHERE document_tsv @@ plainto_tsquery('polish', 'szukana fraza');
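The ts_rank and ts_headline functions mentioned earlier are typically combined with such a query to build a results page. This is a sketch against the articles table defined above; the LIMIT and the snippet column name are arbitrary choices:

```sql
SELECT id, title,
       ts_rank(document_tsv, query) AS rank,          -- relevance score
       ts_headline('polish', content, query) AS snippet  -- matched fragment with highlights
FROM articles,
     plainto_tsquery('polish', 'szukana fraza') AS query
WHERE document_tsv @@ query
ORDER BY rank DESC
LIMIT 10;
```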
Searching
The plainto_tsquery function is useful in scenarios where all lexemes from a generated query must be present in a target tsvector value derived from our dataset. However, if we need more advanced filtering during a search, such as using the & (AND), | (OR), or <-> (FOLLOWED BY) operators, or any combination of these, we can use the to_tsquery function. The to_tsquery function is defined as follows:
to_tsquery([ config regconfig, ] querytext text) returns tsquery
The function returns a tsquery value created from querytext, which must consist of text tokens/words separated by the & (AND), | (OR), ! (NOT), and <-> (FOLLOWED BY) operators; tokens can also be grouped using parentheses. Like other full-text search functions, to_tsquery accepts an optional regconfig argument if we need to specify a full-text search configuration different from the default one.
Using a combination of the AND and OR operators
SELECT id, name
FROM omdb.movies
WHERE lexemes @@ to_tsquery('computer & animated & (lion | clownfish | donkey)');
Using the NOT operator and filtering by phrase
SELECT id, name
FROM omdb.movies
WHERE lexemes @@ to_tsquery('lion & !''The Lion King''');
This query uses the ! (NOT) operator to exclude any movies with the phrase "The Lion King" in their description. The phrase must be enclosed in quotes (written as doubled single quotes, '' '', inside the SQL string literal) to ensure it is treated as a single unit.
We can see how Postgres checks that the phrase "The Lion King" is not present in the movie description by looking at the result of the to_tsquery('lion & !''The Lion King''') function call:
SELECT * FROM to_tsquery('lion & !''The Lion King''');
to_tsquery
---------------------------------
'lion' & !( 'lion' <-> 'king' )
(1 row)
As the output shows, the to_tsquery function converts the phrase into two lexemes ("lion" and "king") and removes "the" because it is a stop word. The function also adds the <-> (FOLLOWED BY) operator between the lexemes, indicating that "king" must immediately follow "lion" in the description.
The <N> form of the FOLLOWED BY operator allows specifying the distance between lexemes. To find a movie where "king" is the third token after "return", we can set N to 3, as shown in this example:
SELECT id, name
FROM omdb.movies
WHERE lexemes @@ to_tsquery('return <3> king');
This query returns a single match:
id | name
-----+-----------------------------------------------
122 | The Lord of the Rings: The Return of the King
(1 row)
This result satisfies the search criteria because the description includes the token "return", followed by "of", "the", and "king", with "king" being the third token after "return".
Postgres provides two additional functions that you might find useful in your applications:
- The phraseto_tsquery function works similarly to plainto_tsquery, except it inserts the <-> operator between lexemes instead of the & operator. It also accounts for stop words by inserting <N> operators instead of <-> whenever applicable.
- The websearch_to_tsquery function is ideal for processing raw input directly from users. Users can include simple search operators (like AND, OR, or quotes for phrases) without worrying about special formatting, as the function is designed to handle such input without raising syntax errors.
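As a sketch, reusing the omdb.movies table from the examples above, a raw user query can be passed straight to websearch_to_tsquery without sanitization:

```sql
-- Quoted text becomes a phrase (<->), "or" becomes |, unquoted words are ANDed.
SELECT id, name
FROM omdb.movies
WHERE lexemes @@ websearch_to_tsquery('"lion king" or clownfish');
-- websearch_to_tsquery('"lion king" or clownfish') parses to roughly:
--   'lion' <-> 'king' | 'clownfish'
```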
Trigger-based solution
- Add the column:
ALTER TABLE articles ADD COLUMN content_tsv tsvector;
- Populate it:
UPDATE articles
SET content_tsv = to_tsvector('polish', content);
- Add a GIN index (it greatly speeds up queries):
CREATE INDEX idx_articles_content_tsv ON articles USING GIN(content_tsv);
- Add a trigger so that the content_tsv column is always up to date:
CREATE FUNCTION update_content_tsv() RETURNS trigger AS $$
BEGIN
  NEW.content_tsv := to_tsvector('polish', NEW.content);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_update_content_tsv
BEFORE INSERT OR UPDATE ON articles
FOR EACH ROW EXECUTE FUNCTION update_content_tsv();
- Search:
SELECT *
FROM articles
WHERE content_tsv @@ plainto_tsquery('polish', 'szukana fraza');
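If the search should cover both title and content (as in the generated-column variant shown earlier), the trigger function can combine both fields and weight them with setweight, so that ts_rank scores title matches higher. This is a sketch; the A/B weight assignment is an assumption, not part of the original setup:

```sql
CREATE OR REPLACE FUNCTION update_content_tsv() RETURNS trigger AS $$
BEGIN
  -- Weight title matches ('A') above body matches ('B'); ts_rank honors these weights.
  NEW.content_tsv :=
      setweight(to_tsvector('polish', coalesce(NEW.title, '')), 'A')
   || setweight(to_tsvector('polish', coalesce(NEW.content, '')), 'B');
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```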
Overall, full-text search in Postgres works as follows:
- Tokenization: The original text data, usually referred to as a document, is parsed into tokens, which are smaller text components such as words or phrases. For instance, the sentence “5 explorers are travelling” is tokenized into: “5,” “explorers,” “are,” and “travelling.”
- Normalization: The database converts tokens into lexemes, which are the basic units of meaning in a language. During normalization, Postgres removes case sensitivity, stems tokens to their root forms, and removes stop words such as articles or prepositions. For example, after normalizing the four tokens from the previous step, the database can produce the following three lexemes: “5,” “explor,” and “travel.” The token “travelling” is stemmed to its root form, “travel,” and “explorers” is converted to the lexeme “explor.” The token “are” is removed because it is a stop word with no significant meaning. The token “5” remains unchanged.
- Storing and indexing: The lexemes are stored in the database to avoid repeating the tokenization and normalization steps. Postgres provides a special data type called tsvector, specifically designed to store lexemes. Additionally, lexemes can be indexed to improve the performance of full-text search queries.
- Searching: A tsquery value is matched against the stored tsvector values using the @@ match operator; functions such as ts_rank can then order the results by relevance.
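The whole pipeline can be reproduced in a single query. Using the english configuration and the sentence from the tokenization step above, both the document and the query are normalized to the same lexemes, so the match succeeds:

```sql
-- 'explorers' and 'explorer' both normalize to 'explor'; 'travelling' and
-- 'travel' both normalize to 'travel'; 'are' is dropped as a stop word.
SELECT to_tsvector('english', '5 explorers are travelling')
       @@ to_tsquery('english', 'explorer & travel') AS matches;
-- matches = t
```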