postgres explain performance query auto_explain - ghdrako/doc_snipets GitHub Wiki
Description
- https://www.pgmustard.com/docs/explain
- https://pganalyze.com/docs/explain/setup/google_cloud_sql/01_auto_explain_check
- https://www.postgresql.org/docs/current/using-explain.html
Visualizers
- A query plan visualizer is built-in to the PgAdmin
- https://www.depesz.com/tag/unexplainable/
- https://explain.depesz.com/
- https://www.pgmustard.com
Fast Row Count Estimates
EXPLAIN SELECT * FROM movies WHERE rating = 'NC-17' AND price < 4.99;
The first row in the plan is always a node, and the other node rows are indented to the right and have an arrow as a prefix (->). The other lines in the plan provide information about the node they are under; therefore, in the preceding example, the Sort Key row is additional information to the Sort node.
Another approach to distinguish node rows from additional information is to consider that every node line has the cost, rows, and width attributes in parentheses.
Once you have discovered the nodes of the query, you have to find out the very first node, which is usually the most indented one, and also the one with the lowest startup cost;
EXPLAIN ( FORMAT JSON ) SELECT * FROM categories;
outputs in TEXT
, XML
, JSON
, and YAML
JSON
format provides not only a different structure to the query plan but also a different and more rich set of information.
EXPLAIN ANALYZE
The ANALYZE
mode of EXPLAIN
enhances the command by effectively running the query to explain. It prints out the best plan to execute the query and it runs the query, also reporting back some statistical information.
While EXPLAIN
can only estimate the costs of a node, the EXPLAIN ANALYZE
provides feedback on the execution time (expressed in milliseconds), the effective number of rows, and how many times a node has been executed (loops).
EXPLAIN ANALYZE SELECT * FROM posts;
EXPLAIN ANALYZE
can also be invoked by passing ANALYZE as an option to EXPLAIN, as follows:
EXPLAIN ( ANALYZE ) SELECT * FROM categories ORDER BY title DESC;
The option form of EXPLAIN ANALYZE
is handy when you want to add other options to EXPLAIN,
Explain options
VERBOSE
option
EXPLAIN (VERBOSE on) SELECT * FROM categories;
SETTING
option
SET work_mem TO '32MB';
EXPLAIN (SETTINGS on) SELECT * FROM posts ORDER BY created_on DESC;
BUFFERS
option Add the BUFFERS argument to EXPLAIN to get insight into the amount of data being fetched (the IO). The amount of data being fetched adds latency to retrieval speed. Selecting less data will likely result in faster queries.
EXPLAIN (ANALYZE, BUFFERS on) SELECT * FROM posts;
WAL
option
EXPLAIN (ANALYZE on, WAL on, FORMAT yaml) INSERT INTO...
EXPLAIN (analyze, verbose, costs, timing, buffers)
- analyze true will actually execute the query
- verbose true will add some more information to the plan (such as column information). costs true will show information about costs.
- timing true is equally important, as it will provide us with good runtime data so that we can see where in the plan time gets lost.
- buffers true, which can be very enlightening.
auto_explain
auto_explain can log plans when a statement exceeds a certain duration.
auto_explain does add some overhead.
Only on dev en we can set auto_explain.log_min_duration to 0, so that all queries are logged.
auto-explain triggers when a running query is slower than a specified threshold, and then dumps the execution plan of the query in the PostgreSQL logs.
The auto-explain module is configured via a set of auto_explain
parameter options that can be inserted in the PostgreSQL configuration (the postgresql.conf
file), but you need to remember that in order to activate the module, you need to restart the cluster.
All the settings for auto-explain are named in the namespace auto_explain
, therefore any parameter has a prefix that starts with auto_explain.; the main settings are the following ones:
auto_explain.log_min_duration
is the threshold of time a statement must take before it is logged. Any statement requiring more time than this setting will appear in the cluster logs with its EXPLAIN output.auto_explain.log_format
andauto_explain.log_level
control the format of the out-put, in terms of text, JSON, YAML, XML and the level at which such output will be logged (e.g., INFO).auto_explain.log_verbose
, if turned on, provides more verbose information in the out-put.auto_explain.sample_rate
is a value between 0 and 1 indicating the sampling rate of a session. For example, 0.5 means that one statement out of two will be logged.auto_explain.log_nested_statements
is a boolean value that determines whether “in-ner” statements have to be logged on their own. For example, if this option is turned on when logging a function call, statements that also happen inside the function will be logged and explained.auto_explain.log_analyze
is a boolean value that indicates whether the logged state-ment must also report EXPLAIN ANALYZE values, mainly the actual timing. Be aware that taking per-node timing information can be resource-demanding and thus can slow down the whole query. When this parameter is turned on, other settings can be turned on to provide the same information that EXPLAIN ANALYZE does:auto_explain.log_buffers
, when turned on, provides information about the buffer’s utilization.auto_explain.log_wal
, when turned on, provides information about the WAL produced by a query.auto_explain.log_timing
, when turned on, provides per-node timing information.auto_explain.log_triggers
, when turned on, provides information about trigger exe-cutions within a statement.auto_explain.log_settings
, when turned on, reports settings different from the clus-ter-wide conf i guration.
In order to install and configure the module, let’s start simple and add the following two settings to the cluster configuration in postgresql.conf
:
session_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '100ms'
The first line tells PostgreSQL to load the library related to the auto-explain module, while the second instruments the module to trigger whenever a query takes longer than 100 milliseconds to conclude. Of course, you can raise the query duration or lower it, depending on your needs.
test=# LOAD 'auto_explain';
test=# SET auto_explain.log_analyze TO on;
test=# SET auto_explain.log_min_duration TO 500;
session_preload_libraries = 'auto_explain';
EXPLAIN Output
When you execute a query with EXPLAIN ANALYZE, PostgreSQL generates an execution plan that consists of several components:
- Plan Tree: The execution plan is represented as a tree structure. Each node in the tree corresponds to an operation (e.g., scan, join, filter) performed during query execution.
- Cost Estimates: PostgreSQL estimates the cost of each operation based on factors like I/O, CPU, and memory usage.
- Actual Execution Times: The actual time spent executing each operation during query execution.
- Statistics: Information about the number of rows processed, memory usage, and other relevant metrics.
Common Output Columns
- Node Type: Describes the type of operation (e.g., Seq Scan, Index Scan, Hash Join).
- Relation Name: The name of the table or index being accessed.
- Startup Cost: The estimated cost before starting the operation.
- Total Cost: The total estimated cost for the entire query.
- Actual Rows: The actual number of rows processed.
- Actual Time: The actual time spent executing the operation.
Node start with an arrow (->) except first one node (highest) called root
Node types
- Sequential Scan (Seq Scan) - full table scan
- Index Scan (Index Only Scan) - Indicates an index lookup
Bitmap Index Scan first, means it will scan the index first and form a bitmap of which pages need to be pulled data from that will happen in the upper level (Bitmap Heap Scan).Bitmap Heap Scan it means instead of looking for the entire disk, it will only look through the bitmap disk. The result is mapped in memory based on the result of the Bitmap Index Scan.
The Bitmap Index Scan happens when the result set will have a high selectivity rate with respect to the search conditions (i.e., there is a high percentage of rows that satisfy the search criteria). In this case, the planner will plan to scan the entire index, forming a bitmap of which pages on disk to pull the data out from (which happens during the Bitmap Heap Scan step). This is better than a Sequential Scan because it only scans the relevant pages on disk, skipping the pages that it knows relevant data does not exist. Depending on the statistics available to the optimizer, it may not be advantageous to do an Index Scan or an Index-Only Scan, but it is still better than a Sequential Scan.
Index-Only Scan is a scan of the index that will pull the relevant data without having to visit the actual table. This is because the relevant data is already in the index.