postgres statistics - ghdrako/doc_snipets GitHub Wiki

Postgres internal statistics

  • activity statistics
  • data statistics (data distribution)

In PostgreSQL 9.4 there are about 104 metrics.

pg_stat_database

  • a cache hit ratio of around 90% is OK
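
A minimal query sketch for this, using only standard pg_stat_database columns (the nullif guards against division by zero on fresh databases):

SELECT datname,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname IS NOT NULL;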

  • anomalies

pg_stat_bgwriter

  • checkpoints
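
A sketch of the checkpoint counters, assuming PostgreSQL 16 or older (from version 17 these columns move to pg_stat_checkpointer). A high share of checkpoints_req relative to checkpoints_timed usually means checkpoints are forced by WAL volume, i.e. max_wal_size is too small:

SELECT checkpoints_timed,      -- checkpoints triggered by checkpoint_timeout
       checkpoints_req,        -- checkpoints forced by WAL volume
       checkpoint_write_time,
       checkpoint_sync_time
FROM pg_stat_bgwriter;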

  • Background writer
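
Background writer counters from the same view (again, columns as in PostgreSQL 16 and older; newer versions move backend writes to pg_stat_io):

SELECT buffers_clean,      -- buffers written by the background writer
       maxwritten_clean,   -- times the bgwriter stopped after hitting bgwriter_lru_maxpages
       buffers_backend,    -- buffers written directly by backends (ideally low)
       buffers_alloc
FROM pg_stat_bgwriter;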

pg_stat_replication

  • whether replication keeps up with the changes
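
A possible lag query, using the column and function names of PostgreSQL 10+ (on 9.x the equivalents are pg_current_xlog_location() and the *_location columns):

SELECT application_name,
       state,
       sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS sent_lag_bytes,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;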

  • reasons why a replica falls behind

pg_stat_all_tables

  • sequential scans - can be tuned by adding indexes
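
One way to spot candidates is to look for tables with many sequential scans reading many rows (pg_stat_user_tables is the same view restricted to user tables):

SELECT relname,
       seq_scan,
       seq_tup_read,
       idx_scan,
       n_live_tup
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;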

  • table size
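
Table sizes can be pulled via the relid column of the same view:

SELECT relname,
       pg_size_pretty(pg_relation_size(relid))       AS heap_size,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size   -- includes indexes and TOAST
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;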

  • write activity
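
A rough write-activity ranking from the tuple counters:

SELECT relname,
       n_tup_ins,
       n_tup_upd,
       n_tup_del
FROM pg_stat_user_tables
ORDER BY n_tup_ins + n_tup_upd + n_tup_del DESC
LIMIT 10;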

  • HOT update - an update that does not require an index change. When the new row version (an update is internally an insert) lands on the same page, the index entry that pointed to the row does not have to change (no index delete and insert), because it still points to the correct page
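
The HOT share per table can be estimated from n_tup_upd vs n_tup_hot_upd:

SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 2) AS hot_update_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 10;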

  • fillfactor - the percentage of a page that inserts may fill; the rest is reserved for updates so that they can be performed as HOT updates
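
A sketch of setting it on a hypothetical measurements table (the new value only affects newly written pages, so a rewrite is needed to apply it to existing data):

ALTER TABLE measurements SET (fillfactor = 70);   -- leave 30% of each page free for HOT updates
VACUUM FULL measurements;                         -- rewrites the table so existing pages respect the setting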

  • autovacuum queue
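
A simplified sketch of which tables are over the autovacuum threshold (it ignores per-table reloption overrides, so treat it as an approximation):

SELECT s.relname,
       s.n_dead_tup,
       round(current_setting('autovacuum_vacuum_threshold')::numeric
             + current_setting('autovacuum_vacuum_scale_factor')::numeric * c.reltuples::numeric) AS autovacuum_threshold
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE s.n_dead_tup > current_setting('autovacuum_vacuum_threshold')::numeric
                     + current_setting('autovacuum_vacuum_scale_factor')::numeric * c.reltuples::numeric
ORDER BY s.n_dead_tup DESC;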

pg_stat_all_indexes

  • using indexes
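
A common check is for indexes that were never scanned (skipping unique indexes, which enforce constraints even when unused):

SELECT s.schemaname,
       s.relname,
       s.indexrelname,
       s.idx_scan,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
ORDER BY pg_relation_size(s.indexrelid) DESC;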

pg_stat_activity

  • database activity

  • long-running transactions

  • idle-in-transaction sessions

  • locks and blocking (see the query sketch after this list)
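
A combined sketch covering the points above (the wait_event columns exist since 9.6; pg_blocking_pids() lists the sessions holding the locks a blocked pid is waiting for):

SELECT pid,
       state,
       wait_event_type,            -- 'Lock' means the session is blocked on a lock
       wait_event,
       now() - xact_start  AS xact_age,   -- large values = long-running transaction
       now() - query_start AS query_age,
       left(query, 60)     AS query
FROM pg_stat_activity
WHERE state <> 'idle'               -- use state = 'idle in transaction' to list only those sessions
ORDER BY xact_start NULLS LAST;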

Data Statistics

PG_STAT Views

User-friendly views; the underlying pg_statistic catalog is optimized for on-disk storage and is hard to read directly.

Statistics gathered by default:

Statistics used by the query planner

The statistics collector gathers various statistics about tables, indexes, and database activity, providing valuable insight into data distribution and access patterns. This information is then refreshed by ANALYZE, which samples the data and updates the planner statistics accordingly. By enabling these components and analyzing tables regularly, database administrators can ensure that the query planner makes accurate decisions based on the actual data distribution, resulting in more efficient query execution and better overall PostgreSQL performance.

  • Enabling the statistics collector in postgresql.conf:
track_counts = on
  • Querying statistics
SELECT * FROM pg_stat_all_tables WHERE relname = 'testings';
ANALYZE <Table_Name>;
VACUUM ANALYZE <Table_Name>; -- command to perform both vacuuming and analysis simultaneously.
ALTER TABLE users
ALTER COLUMN first_name SET STATISTICS 5000;
ANALYZE users;
SELECT
attname,
n_distinct,
most_common_vals
FROM pg_stats
WHERE schemaname = 'rideshare'
AND tablename = 'users'
AND attname = 'first_name';
-- -[ RECORD 1 ]----+---------------------------------------------------
-- attname | first_name
-- n_distinct | -0.9931146
-- most_common_vals | {Alonso,Fred,Harris,Corey,
-- Dominique,Jamal,Marion,Mary,Reed,Sam,Sandy,Stevie}
-- Reset the per-column statistics target back to the default
ALTER TABLE users
ALTER COLUMN first_name SET STATISTICS -1; -- -1 means use default_statistics_target (default 100)

pg_stats system view

Extended statistics

You can define additional statistics that will be collected when you next analyze the table:

CREATE STATISTICS cust_stat1 ON state, area_code FROM cust;

The execution time of ANALYZE will increase to collect the additional stats information, plus there is a small increase in query planning time, so use this sparingly when you can confirm this will make a difference. If there is no benefit, use DROP STATISTICS to remove them again. By default, multiple types of statistics will be collected—you can fine-tune this by specifying just a few types of statistics if you know what you are doing. Unfortunately, the statistics command doesn’t automatically generate names, so include the table name in the statistics you create since the name is unique within the database and cannot be repeated on different tables.
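
To review which extended statistics objects exist and drop one that does not pay off (pg_statistic_ext is the catalog that stores them; cust_stat1 is the object created above):

SELECT stxname,
       stxkind    -- 'f' = dependencies, 'd' = ndistinct, 'm' = MCV list, 'e' = expressions
FROM pg_statistic_ext;

DROP STATISTICS cust_stat1;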

Additionally, you cannot collect statistics on individual fields within JSON documents at the moment, nor collect dependency information between them; this command only applies to whole column values at this time.

The goal of CREATE STATISTICS is to mitigate problems with correlated columns, which cause wrong estimates and therefore wrong plans. You can manually tell Postgres that two columns are linked. There are three types of extended statistics you can create:

  • Functional dependencies
  • Multivariate n-distinct counts
  • Multivariate MCV lists
Dependencies

Describes a dependency between two columns:

  • Related columns, for example country and city
  • The values of the two columns vary together (e.g. column a = column b + 1)
CREATE STATISTICS facilities_country_city_stat (dependencies) ON country, city FROM facilities;
Multivariate n-distinct count

For each column, Postgres keeps an n_distinct estimate.

During a GROUP BY, the estimated number of distinct groups can be wrong when the columns are linked. To improve the group-count estimate you can do:

CREATE STATISTICS baseline_measurements_ndistinct_stat (ndistinct) ON category, title FROM baseline_measurements;
Multivariate MCV list

This is the same idea as MCVs (most common values), but for more than one column. It aggregates the frequencies of combined column values.

The difference from functional dependencies and n-distinct statistics is that MCV lists also support operators such as <, >, <= and >=, while the other two only support =. This makes MCV lists the preferred type in most cases.

CREATE STATISTICS reported_events_mcv_stat (mcv) ON organ_system, adverse_event_term FROM reported_events;
Extended statistics limitations

Because histograms aren't supported in extended statistics, they are only accurate:

  • For MCVs
  • If the rest of your dataset is evenly distributed

Configuration to improve statistics

  • default_statistics_target: default is 100, can range from 1 to 10000 - how many most-common values (and histogram buckets) are kept per column, database-wide
  • ALTER TABLE reported_events ALTER COLUMN organ_system SET STATISTICS 1000; - applies only to the specified column
  • ALTER TABLE foo ALTER COLUMN bar SET (n_distinct = value);
    • Positive value = exact number of distinct values
    • Negative value = fraction of the total row count; -1 = every value is unique; -0.5 = each value appears about twice
  • Analyzing only specific columns: ANALYZE foo (bar);
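
A hedged example of combining these knobs on a hypothetical foo table with a bar column:

ALTER TABLE foo ALTER COLUMN bar SET STATISTICS 1000;      -- larger per-column statistics target
ALTER TABLE foo ALTER COLUMN bar SET (n_distinct = -0.5);  -- declare that each value appears about twice
ANALYZE foo (bar);                                         -- refresh statistics for that column only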

Gather statistics

While analyzing a table, Postgres gathers statistics on it in these steps:

  1. Gather the sample rows
  2. Compute statistics on those sample rows according to the data type
  3. Insert or update the results in the pg_statistic catalog

The number of sampled rows depends on how many values we want to compute (the default_statistics_target parameter). By default the sample size is 300 * default_statistics_target = 300 * 100 = 30,000 rows.

Postgres scans the table using a reservoir sampling algorithm (Vitter):

  1. Initialize the list of sample rows
  2. Scan rows until the list is full
  3. Each subsequent row is selected with (to simplify greatly) a probability of roughly sample_size / rows_seen_so_far
  4. If a row is selected, it replaces a random row in the existing list