postgres optimizer - ghdrako/doc_snipets GitHub Wiki

default_statistics_target

default jest 100 powino byc co najmniej 1000

Monitor progres operation

select * from pg_stat_progress_create_index

Query procesing

  • Parsing
  • Rewriting
  • Optimize the logical plan and convert it into an execution plan.
  • Execute (interpret) the plan and return results.

Optimalize

The DBA can only interact with the database in the optimization phase, trying to help PostgreSQL better understand the statement and optimize it correctly whenever PostgreSQL is not doing an optimal job

If the statement involves *more than 12 table joins, the optimizer does not iterate all the possibilities but rather executes a genetic algorithm to find a compromise way to access the data. The compromise is between the time spent in computing the path to the data and finding a not-too-bad access path.

The optimizer divides the set of actions to pass to the executor in nodes; a node is an action to execute in order to provide the final or an intermediate result.

Nodes that the optimizer uses

  • Sequential nodes
    • Sequential Scan (Seq Scan)
    • Index Scan, Index-Only Scan, and Bitmap Index Scan
    • Nested Loop, Hash Join, and Merge Join
    • The Gather and Merge parallel nodes
  • Index nodes
    • Index Scan - reads the chosen index, and from that, it goes seeking the tuples, reading again from the storage.
    • Index-Only Scan - use when can extract all the required information directly from the index.
    • Bitmap Index Scan - builds a memory bitmap of where tuples that satisfy the statement clauses are, and then this bitmap is used to locate those tuples. Bitmap Index Scan is usually associated with Bitmap Heap Scan
  • Join nodes
    • Nested Loop: both tables are scanned in a sequential or in-dexed-based method and every tuple is checked to see whether there is a match.Nested Loop is not forced to perform a sequential scan on both tables, and, in fact, depending on the context, every table could be walked in a sequential or indexed-based access method.
    • Hash Join node: the inner table is mapped into a hash, which is a set of buckets containing the tuples of the table; the outer table is then walked and for every tuple extracted from the outer table, the hash is searched to see whether there is a match.
    • Merge Join involves a step of sorting: both the tables are fi rst sorted by the join key(s), and then they are walked sequentially. For every tuple of the outer table, all the tuples that do match in the inner
  • Parallel nodes - setup time to distribute the job among parallel processes, as well as the time and resources needed to return the results of every single process.
    • Gather nodes - A parallel execution plan always involves two types of Gather nodes: a plain Gather node and a Gather Merge node.
    • Parallel scans - Parallel Seq scan, or index scans like Parallel Index and Parallel Index-Only scans and, Parallel Bitmap Heap scans
    • Parallel joins - tries to keep the inner table accessed in a non-parallel way (assuming such a table is small enough) and performs parallel access to the outer table.Parallel Hash Join in the case of Hash Join, the inner table is computed as a hash by every parallel process, which therefore requires every parallel process working on the outer table to compute the same results for the inner table.
    • Parallel aggregations When the fi nal result set is made by the aggregation of different parallel subqueries, there must be a parallel aggregation, which is the aggregation of every single parallel part. This aggregation happens in different steps: fi rst, there is a Partial Aggregate node, done by every parallel process that produces a partial result set. After that, a Gather node (or Gather Merge) collects all the partial results and passes the whole set to the Finalize Aggregate node, which sequentially assembles the fi nal result.

You can force PostgreSQL to perform a parallel plan, even if the preceding values are not satisf i ed, with an extra conf i guration parameter – debug_parallel_query – which is set to off by default:

SHOW min_parallel_table_scan_size;
SHOW min_parallel_index_scan_size;
SHOW debug_parallel_query ;
  • Utility nodes - used in a plan to achieve the fi nal result.
    • Sort node - statement involves an ordering of the result that is a clause such as ORDER BY
    • Limit node - query has an output limitation, such as a LIMIT clause
    • Append node - UNION ALL
    • Distinct node - Union, Distinct
    • GroupAggregate node - Group BY
    • WindowAgg node - managing the tuple aggregation required by the window function.
    • CTEScan node responsible for the join between the CTE subquery and the real table.
    • Materialize node - If a CTE join requires the materi-alization of a dataset

Node cost

Every node is associated with a cost, which is the estimation of how expensive, in terms of com-putational resources, the execution of the node will be. Of course, every node has a variable cost that depends on the type and quantity of the input, as well as the node type. PostgreSQL provides a list of costs, expressed in arbitrary units, for the main type of operations that a node can perform. Computing the cost of a node is, therefore, the computation of the cost of the single operations that the node performs multiplied by the number of times these operations are repeated, and this depends on the size of the data that the node has to evaluate.

SELECT name, setting  
  FROM pg_settings  
  WHERE name LIKE 'cpu%\_cost'  OR name LIKE '%page\_cost'  
  ORDER BY setting DESC;

Costs can change depending on the computation power of your system; in particular, having enterprise-level SSD storage disks can decrease your random_page_cost to 1.5, which is almost the same as a sequential page cost.

Data Access Algorithms:

  • Full scan There are two separate physical operations used by PostgreSQL to retrieve rows via indexes:
  • index scan
  • bitmap heap scan In an index scan, the database engine reads each entry of the index that satisfies the filter condition and retrieves blocks in index order. Because the underlying table is a heap, multiple index entries might point to the same block. To avoid multiple reads of the same block, the bitmap heap scan implemented in PostgreSQL builds a bitmap indicating the blocks that contain needed rows. Then all rows in these blocks are filtered. An advantage of the PostgreSQL implementation is that it makes it easy to use multiple indexes on the same table within the same query, by applying logical ANDs and ORs on the block bitmaps generated by each index.
  • index-only scan

For relatively slow rotating drives, index-based access is preferable only if selectivity does not exceed 2–5%. For SSDs or virtual environments, this value can be higher.