postgres statistics - ghdrako/doc_snipets GitHub Wiki

Postgres internal statistics

  • activity statistics
  • data statistics (data distribution)

In PostgreSQL 9.4 there are about 104 metrics.

pg_stat_database

  • a cache hit ratio of around 90% is OK
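
A minimal query sketch for this, using only standard pg_stat_database columns (the nullif guards against division by zero on fresh databases):

SELECT datname,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname IS NOT NULL;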

  • anomalies

pg_stat_bgwriter

  • checkpoints
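
A sketch of the checkpoint counters, assuming PostgreSQL 16 or older (from version 17 these columns move to pg_stat_checkpointer). A high share of checkpoints_req relative to checkpoints_timed usually means checkpoints are forced by WAL volume, i.e. max_wal_size is too small:

SELECT checkpoints_timed,      -- checkpoints triggered by checkpoint_timeout
       checkpoints_req,        -- checkpoints forced by WAL volume
       checkpoint_write_time,
       checkpoint_sync_time
FROM pg_stat_bgwriter;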

  • Background writer
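
Background writer counters from the same view (again, columns as in PostgreSQL 16 and older; newer versions move backend writes to pg_stat_io):

SELECT buffers_clean,      -- buffers written by the background writer
       maxwritten_clean,   -- times the bgwriter stopped after hitting bgwriter_lru_maxpages
       buffers_backend,    -- buffers written directly by backends (ideally low)
       buffers_alloc
FROM pg_stat_bgwriter;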

pg_stat_replication

  • whether replication keeps up with the changes
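
A possible lag query, using the column and function names of PostgreSQL 10+ (on 9.x the equivalents are pg_current_xlog_location() and the *_location columns):

SELECT application_name,
       state,
       sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS sent_lag_bytes,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;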

  • reasons why a replica falls behind

pg_stat_all_tables

  • sequential scans - can be tuned by adding indexes
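
One way to spot candidates is to look for tables with many sequential scans reading many rows (pg_stat_user_tables is the same view restricted to user tables):

SELECT relname,
       seq_scan,
       seq_tup_read,
       idx_scan,
       n_live_tup
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;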

  • table size
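
Table sizes can be pulled via the relid column of the same view:

SELECT relname,
       pg_size_pretty(pg_relation_size(relid))       AS heap_size,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size   -- includes indexes and TOAST
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;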

  • write activity
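
A rough write-activity ranking from the tuple counters:

SELECT relname,
       n_tup_ins,
       n_tup_upd,
       n_tup_del
FROM pg_stat_user_tables
ORDER BY n_tup_ins + n_tup_upd + n_tup_del DESC
LIMIT 10;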

  • HOT update - an update that does not require an index change. When the new row version (an update is internally an insert) lands on the same page, the index entry that pointed to the row does not have to change (no index delete and insert), because it still points to the correct page
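
The HOT share per table can be estimated from n_tup_upd vs n_tup_hot_upd:

SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 2) AS hot_update_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 10;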

  • fillfactor - the percentage of a page that inserts may fill; the rest is reserved for updates so that they can be performed as HOT updates
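
A sketch of setting it on a hypothetical measurements table (the new value only affects newly written pages, so a rewrite is needed to apply it to existing data):

ALTER TABLE measurements SET (fillfactor = 70);   -- leave 30% of each page free for HOT updates
VACUUM FULL measurements;                         -- rewrites the table so existing pages respect the setting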

  • autovacuum queue
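
A simplified sketch of which tables are over the autovacuum threshold (it ignores per-table reloption overrides, so treat it as an approximation):

SELECT s.relname,
       s.n_dead_tup,
       round(current_setting('autovacuum_vacuum_threshold')::numeric
             + current_setting('autovacuum_vacuum_scale_factor')::numeric * c.reltuples::numeric) AS autovacuum_threshold
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE s.n_dead_tup > current_setting('autovacuum_vacuum_threshold')::numeric
                     + current_setting('autovacuum_vacuum_scale_factor')::numeric * c.reltuples::numeric
ORDER BY s.n_dead_tup DESC;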

pg_stat_all_indexes

  • using indexes
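
A common check is for indexes that were never scanned (skipping unique indexes, which enforce constraints even when unused):

SELECT s.schemaname,
       s.relname,
       s.indexrelname,
       s.idx_scan,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
ORDER BY pg_relation_size(s.indexrelid) DESC;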

pg_stat_activity

  • database activity

  • long-running transactions

  • idle-in-transaction sessions

  • locks and blocking (see the query sketch after this list)
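
A combined sketch covering the points above (the wait_event columns exist since 9.6; pg_blocking_pids() lists the sessions holding the locks a blocked pid is waiting for):

SELECT pid,
       state,
       wait_event_type,            -- 'Lock' means the session is blocked on a lock
       wait_event,
       now() - xact_start  AS xact_age,   -- large values = long-running transaction
       now() - query_start AS query_age,
       left(query, 60)     AS query
FROM pg_stat_activity
WHERE state <> 'idle'               -- use state = 'idle in transaction' to list only those sessions
ORDER BY xact_start NULLS LAST;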

Data Statistics

PG_STAT Views

User-friendly views; the underlying pg_statistic catalog is optimized for on-disk storage and is hard to read directly.

Statistics gathered by default:

Statistics used by the query planner

The statistics collector gathers various statistics about tables, indexes, and database activity, providing valuable insight into data distribution and access patterns. This information is then refreshed by ANALYZE, which samples the data and updates the planner statistics accordingly. By enabling these components and analyzing tables regularly, database administrators can ensure that the query planner makes accurate decisions based on the actual data distribution, resulting in more efficient query execution and better overall PostgreSQL performance.

  • Enabling the statistics collector in postgresql.conf:
track_counts = on
  • Querying statistics
SELECT * FROM pg_stat_all_tables WHERE relname = 'testings';
ANALYZE <Table_Name>;
VACUUM ANALYZE <Table_Name>; -- command to perform both vacuuming and analysis simultaneously.
ALTER TABLE users
ALTER COLUMN first_name SET STATISTICS 5000;
ANALYZE users;
SELECT
attname,
n_distinct,
most_common_vals
FROM pg_stats
WHERE schemaname = 'rideshare'
AND tablename = 'users'
AND attname = 'first_name';
-- -[ RECORD 1 ]----+---------------------------------------------------
-- attname | first_name
-- n_distinct | -0.9931146
-- most_common_vals | {Alonso,Fred,Harris,Corey,
-- Dominique,Jamal,Marion,Mary,Reed,Sam,Sandy,Stevie}
-- Reset the per-column statistics target back to the default
ALTER TABLE users
ALTER COLUMN first_name SET STATISTICS -1; -- -1 means use default_statistics_target (default 100)

pg_stats system view

Extended statistics

You can define additional statistics that will be collected when you next analyze the table:

CREATE STATISTICS cust_stat1 ON state, area_code FROM cust;

The execution time of ANALYZE will increase to collect the additional stats information, plus there is a small increase in query planning time, so use this sparingly when you can confirm this will make a difference. If there is no benefit, use DROP STATISTICS to remove them again. By default, multiple types of statistics will be collected—you can fine-tune this by specifying just a few types of statistics if you know what you are doing. Unfortunately, the statistics command doesn’t automatically generate names, so include the table name in the statistics you create since the name is unique within the database and cannot be repeated on different tables.
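
To review which extended statistics objects exist and drop one that does not pay off (pg_statistic_ext is the catalog that stores them; cust_stat1 is the object created above):

SELECT stxname,
       stxkind    -- 'f' = dependencies, 'd' = ndistinct, 'm' = MCV list, 'e' = expressions
FROM pg_statistic_ext;

DROP STATISTICS cust_stat1;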

Additionally, you cannot collect statistics on individual fields within JSON documents at the moment, nor collect dependency information between them; this command only applies to whole column values at this time.

The goal of CREATE STATISTICS is to mitigate problems with correlated columns, which cause wrong estimates and therefore wrong plans. You can manually tell Postgres that two columns are linked. There are three types of extended statistics you can create:

  • Functional dependencies
  • Multivariate n-distinct counts
  • Multivariate MCV lists
Dependencies

Describes a dependency between two columns:

  • Related columns, for example country and city
  • The values of the two columns vary together (e.g. column a = column b + 1)
CREATE STATISTICS facilities_country_city_stat (dependencies) ON country, city FROM facilities;
Multivariate n-distinct count

For each column, Postgres keeps an n_distinct estimate.

During a GROUP BY, the estimated number of distinct groups can be wrong when the columns are linked. To improve the group-count estimate you can do:

CREATE STATISTICS baseline_measurements_ndistinct_stat (ndistinct) ON category, title FROM baseline_measurements;
Multivariate MCV list

This is the same idea as MCVs (most common values), but for more than one column. It aggregates the frequencies of combined column values.

The difference from functional dependencies and n-distinct statistics is that MCV lists also support operators such as <, >, <= and >=, while the other two only support =. This makes MCV lists the preferred type in most cases.

CREATE STATISTICS reported_events_mcv_stat (mcv) ON organ_system, adverse_event_term FROM reported_events;
Extended statistics limitations

Because histograms aren't supported in extended statistics, they are only accurate:

  • For MCVs
  • If the rest of your dataset is evenly distributed

Configuration to improve statistics

  • default_statistics_target: default is 100, can range from 1 to 10000 - how many most-common values (and histogram buckets) are kept per column, database-wide
  • ALTER TABLE reported_events ALTER COLUMN organ_system SET STATISTICS 1000; - applies only to the specified column
  • ALTER TABLE foo ALTER COLUMN bar SET (n_distinct = value);
    • Positive value = exact number of distinct values
    • Negative value = fraction of the total row count; -1 = every value is unique; -0.5 = each value appears about twice
  • Analyzing only specific columns: ANALYZE foo (bar);
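
A hedged example of combining these knobs on a hypothetical foo table with a bar column:

ALTER TABLE foo ALTER COLUMN bar SET STATISTICS 1000;      -- larger per-column statistics target
ALTER TABLE foo ALTER COLUMN bar SET (n_distinct = -0.5);  -- declare that each value appears about twice
ANALYZE foo (bar);                                         -- refresh statistics for that column only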

Gather statistics

While analyzing a table, Postgres gathers statistics on it in these steps:

  1. Gather the sample rows
  2. Compute statistics on those sample rows according to the data type
  3. Insert or update the results in the pg_statistic catalog

The number of sampled rows depends on how many values we want to compute (the default_statistics_target parameter). By default the sample size is 300 * default_statistics_target = 300 * 100 = 30,000 rows.

Postgres scans the table using a reservoir sampling algorithm (Vitter):

  1. Initialize the list of sample rows
  2. Scan rows until the list is full
  3. Each subsequent row is selected with (to simplify greatly) a probability of roughly sample_size / rows_seen_so_far
  4. If a row is selected, it replaces a random row in the existing list