Database

Speed up MySQL

Index

Index All Columns Used in 'where', 'order by', and 'group by' Clauses

Indexes make selection faster, but they make modifications/DML operations slower.

For selection: a clustered index is the fastest; a non-clustered index can sometimes be even worse than no index at all (e.g. when it forces many key lookups).

Don't have unused indexes lying around.
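
For example, a minimal sketch with a hypothetical orders table (all names are illustrative):

-- Supports queries like: SELECT ... FROM orders WHERE customer_id = 42 ORDER BY created_at;
CREATE INDEX idx_orders_customer_created ON orders (customer_id, created_at);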

Optimize Like Statements With Union Clause

If the WHERE clause contains too many or too complicated ORs, the optimizer may incorrectly choose a full table scan to retrieve the rows.
Use UNION instead.
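
A minimal sketch of the rewrite (hypothetical names); each branch can use its own index, where the OR version might fall back to a full scan:

-- Instead of:
SELECT * FROM customers WHERE city = 'Paris' OR postcode = '75001';

-- Try:
SELECT * FROM customers WHERE city = 'Paris'
UNION
SELECT * FROM customers WHERE postcode = '75001';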

Avoid Like Expressions With Leading Wildcards

Don't write WHERE column LIKE '%xxx'. With a leading wildcard, MySQL can't use an index on that column and will do a full table scan instead.

Don’t do negative searches

Instead of: SELECT * FROM Customers WHERE RegionID <> 3 (which can't utilize indexes),

Use: SELECT * FROM Customers WHERE RegionID < 3 UNION ALL SELECT * FROM Customers WHERE RegionID > 3 (which can utilize indexes).

Don't use function inside WHERE clause

If you do, the server will scan the table instead of seeking it via an index.
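
A common example is wrapping an indexed date column in a function; rewriting the predicate as a range keeps it index-friendly (illustrative names):

-- Scan: the index on created_at can't be used
SELECT * FROM orders WHERE YEAR(created_at) = 2023;

-- Seek: the same logic expressed as a range over the indexed column
SELECT * FROM orders WHERE created_at >= '2023-01-01' AND created_at < '2024-01-01';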

Take Advantage of MySQL Full-Text Searches

If you have to use wildcards, utilize full text index instead.

https://github.com/atabegruslan/Notes/wiki/Database#full-text-vs-metadata-search
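
A minimal MySQL sketch (hypothetical articles table):

CREATE FULLTEXT INDEX idx_articles_text ON articles (title, body);

SELECT * FROM articles
WHERE MATCH (title, body) AGAINST ('database performance' IN NATURAL LANGUAGE MODE);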

IN vs UNION ALL

(Sidenote: UNION ALL keeps duplicate rows; UNION removes them.)

When filtering rows of data on multiple values in tables with skewed distributions and non-covering indexes, writing your logic into multiple statements joined with UNION ALLs can sometimes generate more efficient execution plans than just using IN or ORs.
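
A sketch of the rewrite (hypothetical names); since both branches filter on different values of the same column, they can't overlap, so UNION ALL returns the same rows as IN:

-- Instead of:
SELECT * FROM orders WHERE status IN ('new', 'pending');

-- Try:
SELECT * FROM orders WHERE status = 'new'
UNION ALL
SELECT * FROM orders WHERE status = 'pending';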

DISTINCT with few unique values

Using the DISTINCT operator is not always the fastest way to return the unique values in a dataset.

Using recursive CTEs to return distinct values on large datasets with relatively few unique values: https://sqlperformance.com/2014/10/t-sql-queries/performance-tuning-whole-plan

What are Common Table Expressions: https://www.sqlservertutorial.net/sql-server-basics/sql-server-cte/
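
A sketch of the technique from the linked article, adapted to a hypothetical big_table(category) column: the recursive CTE skips through an index on category, touching roughly one row per distinct value instead of scanning the whole table.

WITH RECURSIVE distinct_vals AS (
    SELECT MIN(category) AS category FROM big_table
    UNION ALL
    SELECT (SELECT MIN(b.category) FROM big_table b WHERE b.category > d.category)
    FROM distinct_vals d
    WHERE d.category IS NOT NULL
)
SELECT category FROM distinct_vals WHERE category IS NOT NULL;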

Optimize Your Database Schema

Avoid Use of Non-correlated Scalar Sub Query

Non-correlated vs correlated subqueries:
Non-correlated means subquery is independent of parent query. Correlated means subquery depends on input from parent query.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/Queries/Subqueries/NoncorrelatedAndCorrelatedSubqueries.htm

Scalar subqueries:
A subquery that selects only one column or expression and returns one row.
https://docs.actian.com/ingres/11.0/index.html#page/SQLRef/Scalar_Subqueries.htm

If the subquery is independent of the parent query, then write it as an independent query. It's less complicated for the optimizer.
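
For example, a non-correlated scalar subquery can be computed once up front instead of being embedded (illustrative names):

-- Instead of embedding it:
SELECT o.id, o.amount, (SELECT AVG(amount) FROM orders) AS avg_amount
FROM orders o;

-- Compute it once, then use the result:
SET @avg_amount = (SELECT AVG(amount) FROM orders);
SELECT o.id, o.amount, @avg_amount AS avg_amount
FROM orders o;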

Derived tables instead of correlated subqueries

What are derived tables: https://www.mysqltutorial.org/mysql-derived-table/

Derived-table queries often produce better performance due to their set-based nature.
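
A sketch (hypothetical names): the correlated version runs its subquery once per customer, while the derived table aggregates once and joins.

-- Correlated subquery, evaluated per row:
SELECT c.id,
       (SELECT SUM(o.amount) FROM orders o WHERE o.customer_id = c.id) AS total
FROM customers c;

-- Derived table, aggregated once then joined:
SELECT c.id, t.total
FROM customers c
LEFT JOIN (
    SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id
) t ON t.customer_id = c.id;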

GROUP BY instead of Window functions

What are window functions: https://www.sqltutorial.org/sql-window-functions/

Sometimes window functions rely a little too much on tempdb and blocking operators to accomplish what you ask of them. While using them is always my first choice because of their simple syntax, if they perform poorly you can usually rewrite them as an old-fashioned GROUP BY to achieve better performance.
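
For example, when you only need the aggregate per group (not per detail row), a plain GROUP BY does the same job (illustrative names):

-- Window function version:
SELECT DISTINCT customer_id,
       SUM(amount) OVER (PARTITION BY customer_id) AS total
FROM orders;

-- GROUP BY version, usually cheaper:
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id;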

MySQL Query Caching

Cache: https://www.digitalocean.com/community/tutorials/how-to-optimize-mysql-with-query-cache-on-ubuntu-18-04

Cache refreshing:

  1. On write. Update cache upon updating original data.
  2. Application polls original data to update cache.
  3. Original DB pushes/notifies cache upon change.

Issue about UPDATE

An UPDATE statement is logged, which means it has to write twice for every single write to the table.

Example:

Situation: Reading from the Customer table and you want anyone with more than $100,000 in orders to be labeled as "Preferred".

Bad solution: Insert the data into the table. Then run an UPDATE statement to set the CustomerRank column to "Preferred" for anyone who has more than $100,000 in orders.

Good solution:

SELECT ..., ...,
CASE
    WHEN amount > 100000 THEN 'Preferred'
    ELSE ''
END AS CustomerRank
FROM Customer;

and then insert it.

Don’t blindly reuse code

When copying code, remove the parts you don't need, e.g. joins and WHERE conditions left over from the original query.

Be selective for what columns and rows you retrieve

Regarding columns: Don't use SELECT *, just select what you need.

Don’t double-dip

Query large tables only once whenever possible.

Do pre-stage data

This is one of my favorite topics because it’s an old technique that’s often overlooked. If you have a report or a procedure (or better yet, a set of them) that will do similar joins to large tables, it can be a benefit for you to pre-stage the data by joining the tables ahead of time and persisting them into a table. Now the reports can run against that pre-staged table and avoid the large join.

You’re not always able to use this technique, but when you can, you’ll find it is an excellent way to save server resources.

Note that many developers get around this join problem by concentrating on the query itself and creating a view-only around the join so that they don’t have to type the join conditions again and again. But the problem with this approach is that the query still runs for every report that needs it. By pre-staging the data, you run the join just once (say, 10 minutes before the reports) and everyone else avoids the big join. I can’t tell you how much I love this technique; in most environments, there are popular tables that get joined all the time, so there’s no reason why they can’t be pre-staged.

Temporary Staging Tables

Sometimes the query optimizer struggles to generate an efficient execution plan for complex queries. Breaking a complex query into multiple steps that utilize temporary staging tables can provide SQL Server with more information about your data. They also cause you to write simpler queries which can cause the optimizer to generate more efficient execution plans as well as allow it to reuse result sets more easily.
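
A sketch of the step-wise approach (illustrative names):

-- Step 1: materialize the expensive intermediate result
CREATE TEMPORARY TABLE tmp_recent_totals AS
SELECT customer_id, SUM(amount) AS total
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id;

-- Step 2: the final query stays simple
SELECT c.name, t.total
FROM tmp_recent_totals t
JOIN customers c ON c.id = t.customer_id;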

Do delete and update in batches

Don't delete or update too many rows in one go (i.e. in a single transaction). It can cause blockages.
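
In MySQL this can be as simple as repeating a LIMITed DELETE until no more rows are affected (hypothetical table):

-- Repeat until the affected row count is 0:
DELETE FROM audit_log
WHERE created_at < '2020-01-01'
LIMIT 10000;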

Don't do large ops on many tables in the same batch

Don't operate on too many tables on 1 transaction. Remember that each table will be locked. It can cause blockages.

Avoid cursors

What are cursors: https://www.mysqltutorial.org/mysql-cursor/

Cursors can cause your operation to block other operations for a lot longer than is necessary. This greatly decreases concurrency in your system.

  • Solutions/Mitigations:
    • Use set-based queries: set-based queries are more efficient than cursor-based ones.
    • If you really need to use cursors:
      • Avoid dynamic cursors: a dynamic cursor limits the optimizer to nested loop joins.
      • Do the cursor operations against a temp table:

Eg, a cursor that goes through a table and updates a couple of columns based on some comparison results. Instead of doing the comparison against the live table, you may be able to put that data into a temp table and do the comparison against that instead. Then you have a single UPDATE statement against the live table that’s much smaller and holds locks only for a short time.
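
A sketch of that pattern in set-based form (illustrative names):

-- Do the heavy comparison work off the live table:
CREATE TEMPORARY TABLE tmp_totals AS
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id;

-- Then one short UPDATE holds locks on the live table only briefly:
UPDATE customers c
JOIN tmp_totals t ON t.customer_id = c.id
SET c.customer_rank = CASE WHEN t.total > 100000 THEN 'Preferred' ELSE '' END;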

Don't nest views

What are views: https://www.mysqltutorial.org/mysql-views-tutorial.aspx
A nested view isn't a 'cache'; it's a 'remembered query'.

Avoid nested views:

  1. You will probably get much more data back than you need.
  2. The query optimizer may give up and return a bad query plan.

Indexed Views

When you can't add new indexes to existing tables, you might be able to get away with creating a view on those tables and indexing the view instead. This works great for vendor databases where you can't touch any of the existing objects.
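
A minimal SQL Server sketch (hypothetical names; indexed views need SCHEMABINDING, COUNT_BIG(*) when aggregating, and a unique clustered index):

CREATE VIEW dbo.vw_order_totals
WITH SCHEMABINDING
AS
SELECT customer_id, SUM(amount) AS total, COUNT_BIG(*) AS cnt  -- amount assumed NOT NULL
FROM dbo.orders
GROUP BY customer_id;
GO
CREATE UNIQUE CLUSTERED INDEX ix_vw_order_totals ON dbo.vw_order_totals (customer_id);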

Do use partitioning to avoid large data moves

What is table partitioning: https://www.sqlshack.com/database-table-partitioning-sql-server/

Not everyone will be able to take advantage of this tip, which relies on partitioning in SQL Server Enterprise, but for those of you who can, it’s a great trick. Most people don’t realize that all tables in SQL Server are partitioned. You can separate a table into multiple partitions if you like, but even simple tables are partitioned from the time they’re created; however, they’re created as single partitions. If you’re running SQL Server Enterprise, you already have the advantages of partitioned tables at your disposal.

This means you can use partitioning features like SWITCH to archive large amounts of data from a warehousing load. Let’s look at a real example from a client I had last year. The client had the requirement to copy the data from the current day’s table into an archive table; in case the load failed, the company could quickly recover with the current day’s table. For various reasons, it couldn’t rename the tables back and forth every time, so the company inserted the data into an archive table every day before the load, then deleted the current day’s data from the live table.

This process worked fine in the beginning, but a year later, it was taking 1.5 hours to copy each table -- and several tables had to be copied every day. The problem was only going to get worse. The solution was to scrap the INSERT and DELETE process and use the SWITCH command. The SWITCH command allowed the company to avoid all of the writes because it assigned the pages to the archive table. It’s only a metadata change. The SWITCH took on average between two and three seconds to run. If the current load ever fails, you SWITCH the data back into the original table.

This is a case where understanding that all tables are partitions slashed hours from a data load.
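
The core of the trick is a metadata-only SWITCH (SQL Server syntax; names and partition numbers are illustrative, and both tables must have compatible schemas and partition layouts):

-- Reassigns the pages of partition 1 to the archive table; no rows are copied:
ALTER TABLE dbo.daily_load SWITCH PARTITION 1 TO dbo.daily_load_archive PARTITION 1;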

If you must use ORMs, use stored procedures

An ORM machine-generates queries, which are never as good as those written by a programmer who knows what they're doing.

Stored procedures and User defined functions

What they are: https://github.com/atabegruslan/Notes/wiki/Database#stored-procedures--udfs

UDFs are further separated into scalar UDFs and table-valued functions.

UDFs can cause query plans to serialize, which can obviously slow things down (but not always).
Sometimes, on the contrary, a poorly configured server will parallelize queries too frequently and perform worse than the equivalent serial plan.

Do use table-valued functions

This is one of my favorite tricks of all time because it is truly one of those hidden secrets that only the experts know. When you use a scalar function in the SELECT list of a query, the function gets called for every single row in the result set. This can reduce the performance of large queries by a significant amount. However, you can greatly improve the performance by converting the scalar function to a table-valued function and using a CROSS APPLY in the query. This is a wonderful trick that can yield great improvements.

What is CROSS APPLY: https://www.youtube.com/watch?v=kVogo0AbatM
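
A sketch in SQL Server syntax (hypothetical function and tables):

-- An inline table-valued function instead of a scalar UDF:
CREATE FUNCTION dbo.fn_order_total (@customer_id INT)
RETURNS TABLE
AS
RETURN (SELECT SUM(amount) AS total FROM dbo.orders WHERE customer_id = @customer_id);
GO
-- CROSS APPLY lets the optimizer expand the function into the query:
SELECT c.id, t.total
FROM dbo.customers c
CROSS APPLY dbo.fn_order_total(c.id) AS t;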

But avoid Multi-statement Table Valued Functions (TVFs)

Multi-statement TVFs are more costly than inline TVFs. SQL Server expands inline TVFs into the main query like it expands views, but it evaluates multi-statement TVFs in a separate context from the main query and materializes their results into temporary work tables. The separate context and work tables are what make multi-statement TVFs costly.

Don't use triggers

What are triggers: https://github.com/atabegruslan/Notes/wiki/Database#triggers

Avoid triggers!

Reason: if you update a table and the trigger updates another table, locks are taken on both tables. The original update and the triggered update count as one transaction, and the locks won't be released until the transaction is done. This can cause blockage and slow-downs.

Instead: make the original and the triggered update two stored procedures, run as separate transactions.

Don’t cluster on GUID

What is GUID/UUID: https://www.mysqltutorial.org/mysql-uuid/ , https://www.sqlservertutorial.net/sql-server-basics/sql-server-guid/
A GUID (globally unique identifier) is a 16-byte randomly generated number.

Ordering your table's data on this clustered GUID will cause your table to fragment much faster than using a steadily increasing value like DATE or IDENTITY.

This applies to more than GUIDs -- it goes for any volatile column.

About UUID

Position a Column in an Index

The order (position) of a column in an index plays a vital role in SQL query performance. An index can help a query only if the query's criteria match the leftmost columns of the index key.

As a best practice, the most selective columns should be placed leftmost in the key of a non-clustered index.
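
For example, with a composite index, only queries that filter on the leftmost column(s) can seek it (illustrative names):

CREATE INDEX idx_orders_cust_date ON orders (customer_id, order_date);

-- Can seek the index (leftmost column present):
SELECT * FROM orders WHERE customer_id = 42 AND order_date = '2024-01-01';
SELECT * FROM orders WHERE customer_id = 42;

-- Can't seek it (leftmost column missing):
SELECT * FROM orders WHERE order_date = '2024-01-01';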

Don't count all rows if you only need to see if data exists

Don't use SELECT COUNT(*) to check existence. Use IF EXISTS instead.

-- Bad: counts every row just to test for existence
SET @CT = (SELECT COUNT(*) FROM dbo.T1);
IF @CT > 0
BEGIN
<Do something>
END

-- Good: stops at the first matching row
IF EXISTS (SELECT 1 FROM dbo.T1)
BEGIN
<Do something>
END

SELECT COUNT(*) takes a LOT more reads.

If you really do need a row count on the table, and it's really big, another technique is to pull it from the system table. SELECT rows from sysindexes will get you the row counts for all of the indexes. And because the clustered index represents the data itself, you can get the table rows by adding WHERE indid = 1. Then simply include the table name and you're golden. So the final query is:

SELECT rows FROM sysindexes WHERE object_name(id) = 'T1' AND indid = 1

In my 270 million row table, this returned sub-second and had only six logical reads. Now that's performance.

Default filegroup settings

Filegroups are "directories" where the DB stores its files.

Advantage of more filegroups: more parallelization.

Specify these 2 to make insert faster:

Statistics

Turn on statistics, as they help the query optimizer.

Data Compression

Not only does data compression save space, but on certain workloads it can actually improve performance. Since compressed data fits in fewer pages, disk reads are faster; maybe more importantly, compression lets more data fit in SQL Server's buffer pool, increasing the chance that SQL Server can reuse data already in memory.

Switch cardinality estimators

The newer cardinality estimator introduced in SQL Server 2014 improves the performance of many queries. However, in some specific cases it can make queries perform more slowly.

In those cases, a simple query hint is all you need to force SQL Server to change back to the legacy cardinality estimator: https://blog.sqlauthority.com/2019/02/09/sql-server-enabling-older-legacy-cardinality-estimation/
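
On SQL Server 2016 SP1 and later, the hint looks like this (table name is illustrative):

SELECT id, amount
FROM dbo.orders
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));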

Copy the data

If you can't get better performance by rewriting a query, you can always copy the data you need to a new table in a location where you CAN create indexes and do whatever other helpful transformations you need to do ahead of time.


Eager vs Lazy load


Security


General Basics

Storage Engines

  • Locking: InnoDB uses row-level locking; MyISAM uses full table-level locking.
  • Writes: InnoDB has faster inserts and updates; MyISAM has faster reads in some situations.
  • SELECT COUNT(*): less efficient in InnoDB; efficient in MyISAM (the row count is saved at table level).
  • Referential integrity: InnoDB enforces it (RDBMS); MyISAM doesn't (DBMS).
  • ACID: InnoDB is ACID-compliant; MyISAM is not.
  • Transactions: InnoDB has transactions, logs and rollback; MyISAM has no transactions nor crash recovery.
  • Suited for: InnoDB for big projects; MyISAM for small projects with a small footprint.
  • FULLTEXT: InnoDB only since MySQL 5.6 (it uses inverted lists for FULLTEXT indexes); MyISAM has FULLTEXT search indexes.
  • AUTO_INCREMENT: InnoDB requires the AUTO_INCREMENT column to be the first column of an index; MyISAM also allows it as a later part of a composite index.
  • Recovery: deleted InnoDB tables can't be re-established; MyISAM tables live in standalone files, which makes per-table restore easier.
  • Most MySQL indexes (PRIMARY KEY, UNIQUE, INDEX, and FULLTEXT) are stored in B-trees.
  • Indexes on spatial data types use R-trees.
  • MEMORY tables also support hash indexes.

Index

Auto Increment

https://trebleclick.blogspot.com/2009/01/mysql-set-auto-increment-in-phpmyadmin.html

Full-text vs metadata search

Full-text search is distinguished from searches based on metadata or on parts of the original texts.

In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria.

In CJK languages

Execution order

https://dba.stackexchange.com/a/162756
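
In short, the logical evaluation order is: FROM/JOIN → WHERE → GROUP BY → HAVING → SELECT → DISTINCT → ORDER BY → LIMIT.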

Data Types

https://www.mysqltutorial.org/mysql-data-types.aspx

Spatial

Binary

Prepared Statements

https://www.w3schools.com/php/php_mysql_prepared_statements.asp
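
Prepared statements also exist at the SQL level in MySQL; a minimal sketch:

PREPARE stmt FROM 'SELECT id, name FROM users WHERE id = ?';
SET @user_id = 1;
EXECUTE stmt USING @user_id;
DEALLOCATE PREPARE stmt;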

Atomicity

Basic transaction and rollback:

try
{
    // Connect, with exceptions enabled so failures can be caught
    $dsn = DB_DRIVER.":host=".DB_SERVER.";dbname=".DB_DATABASE.";charset=UTF8";
    $dbh = new PDO($dsn, DB_USERNAME, DB_PASSWORD);
    $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
}
catch (PDOException $e)
{
    die($e->getMessage());
}

try
{
    $dbh->beginTransaction();

    $sql = 'SELECT col1, col2 FROM `table` WHERE id = :key';
    $sth = $dbh->prepare($sql);
    $sth->execute(array(":key" => 1));

    $dbh->commit();
}
catch (PDOException $e)
{
    // Undo everything in the transaction if any statement failed
    $dbh->rollBack();
    die($e->getMessage());
}

Concurrency

Table lock: https://www.mysqltutorial.org/mysql-table-locking/

https://github.com/atabegruslan/Notes/blob/main/notes/storage/db_concurrency_2pl_or_timestamp.pdf

Locking

Triggers

Examples

  • MySQL: https://www.mysqltutorial.org/mysql-triggers.aspx
  • PGSQL: https://www.postgresqltutorial.com/postgresql-triggers/creating-first-trigger-postgresql , https://www.postgresql.org/docs/current/plpgsql-errors-and-messages.html , https://www.cybertec-postgresql.com/en/postgresql-how-to-write-a-trigger

Calling API from Trigger

Stored procedures & UDFs

  • A stored function (UDF) must return a value; a stored procedure can return nothing.
  • A function can only have input parameters; a procedure can have both input and output parameters.
  • Functions can be called from procedures; procedures cannot be called from functions.
  • A function can't edit data, only read (i.e. return) it; a procedure can edit data, and should be used for editing data and making transactions.

Full list of differences:

UDF Examples

  • MySQL: https://www.mysqltutorial.org/mysql-stored-function , https://www.sqlservertutorial.net/sql-server-user-defined-functions (SQL Server)
  • PGSQL: https://www.postgresqltutorial.com/postgresql-plpgsql/postgresql-create-function

Other points of interest from above example:

Stored Procedures Examples

  • MySQL: https://www.mysqltutorial.org/mysql-stored-procedure , https://www.mysqltutorial.org/getting-started-with-mysql-stored-procedures.aspx , https://www.youtube.com/watch?v=LgSgEt1mSFk
  • PGSQL: https://www.postgresqltutorial.com/postgresql-plpgsql/postgresql-create-procedure

Partitioning and Sharding

  • MySQL: https://www.percona.com/blog/what-is-mysql-partitioning , https://viblo.asia/p/gioi-thieu-ve-partitioning-trong-mysql-jvEla6kz5kw , https://www.devart.com/dbforge/mysql/studio/partition-mysql.html , https://planetscale.com/blog/what-is-mysql-partitioning
  • PGSQL: https://www.enterprisedb.com/postgres-tutorials/how-use-table-partitioning-scale-postgresql , https://www.postgresql.org/docs/current/ddl-partitioning.html

Character Sets and Collations

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.
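
In MySQL, for example, they can be set per table (or per column):

CREATE TABLE messages (
    id INT PRIMARY KEY,
    body TEXT
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;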

Encoding

https://github.com/atabegruslan/Notes/wiki/Encoding#encodings-commonly-used-in-web-development

Import and Export DB via Terminal

  • Export all: mysqldump -u root -p mydatabase > /home/myuser/database-dump.sql
  • Export data only: mysqldump -u [user] -p[pass] --no-create-info mydb > mydb.sql
  • Export structure only: mysqldump -u [user] -p[pass] --no-data mydb > mydb.sql
  • Import: mysql -u [user] -p[pass] mydb < mydb.sql

Good to know techniques

Temporarily disable foreign constraints

MySQL

SET FOREIGN_KEY_CHECKS = 0;
...
SET FOREIGN_KEY_CHECKS = 1;

PGSQL

BEGIN;
ALTER TABLE b DISABLE TRIGGER ALL;
...
ALTER TABLE b ENABLE TRIGGER ALL;
COMMIT;

https://stackoverflow.com/questions/38112379/disable-postgresql-foreign-key-checks-for-migrations

DB Platforms

DB Types

Relational

NoSQL

BigData

Hadoop

Multi Dimensional

  • OLTP (Online Transaction Processing): Inserts, Updates, Deletes.
  • OLAP (Online Analytic Processing): Answer MultiDimensional queries, i.e.: Selects.
  • Example of a MultiDimensional DBMS: https://en.wikipedia.org/wiki/Essbase

Cloud