Database

Speed up MySQL

Index

Index All Columns Used in 'where', 'order by', and 'group by' Clauses

Indexes make selection faster, but they make modifications/DML operations slower.

For selection: a clustered index is the fastest; a non-clustered index can sometimes be even worse than no index at all (e.g. when it forces many key lookups).

Don't have unused indexes lying around.
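
For example, a minimal sketch with a hypothetical orders table (all names are illustrative):

-- Supports queries like: SELECT ... FROM orders WHERE customer_id = 42 ORDER BY created_at;
CREATE INDEX idx_orders_customer_created ON orders (customer_id, created_at);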

Optimize Like Statements With Union Clause

If the WHERE clause contains too many or too complicated ORs, the optimizer may incorrectly choose a full table scan to retrieve the rows.
Use UNION instead.
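
A minimal sketch of the rewrite (hypothetical names); each branch can use its own index, where the OR version might fall back to a full scan:

-- Instead of:
SELECT * FROM customers WHERE city = 'Paris' OR postcode = '75001';

-- Try:
SELECT * FROM customers WHERE city = 'Paris'
UNION
SELECT * FROM customers WHERE postcode = '75001';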

Avoid Like Expressions With Leading Wildcards

Don't write WHERE column LIKE '%xxx'. With a leading wildcard, MySQL can't use an index on that column and will do a full table scan instead.

Don’t do negative searches

Instead of: SELECT * FROM Customers WHERE RegionID <> 3 (which can't utilize indexes),

Use: SELECT * FROM Customers WHERE RegionID < 3 UNION ALL SELECT * FROM Customers WHERE RegionID > 3 (which can utilize indexes).

Don't use function inside WHERE clause

If you do, the server will scan the table instead of seeking it via an index.
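
A common example is wrapping an indexed date column in a function; rewriting the predicate as a range keeps it index-friendly (illustrative names):

-- Scan: the index on created_at can't be used
SELECT * FROM orders WHERE YEAR(created_at) = 2023;

-- Seek: the same logic expressed as a range over the indexed column
SELECT * FROM orders WHERE created_at >= '2023-01-01' AND created_at < '2024-01-01';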

Take Advantage of MySQL Full-Text Searches

If you have to use wildcards, utilize full text index instead.

https://github.com/atabegruslan/Notes/wiki/Database#full-text-vs-metadata-search
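
A minimal MySQL sketch (hypothetical articles table):

CREATE FULLTEXT INDEX idx_articles_text ON articles (title, body);

SELECT * FROM articles
WHERE MATCH (title, body) AGAINST ('database performance' IN NATURAL LANGUAGE MODE);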

IN vs UNION ALL

(Sidenote: UNION ALL keeps duplicate rows; UNION removes them.)

When filtering rows of data on multiple values in tables with skewed distributions and non-covering indexes, writing your logic into multiple statements joined with UNION ALLs can sometimes generate more efficient execution plans than just using IN or ORs.
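
A sketch of the rewrite (hypothetical names); since both branches filter on different values of the same column, they can't overlap, so UNION ALL returns the same rows as IN:

-- Instead of:
SELECT * FROM orders WHERE status IN ('new', 'pending');

-- Try:
SELECT * FROM orders WHERE status = 'new'
UNION ALL
SELECT * FROM orders WHERE status = 'pending';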

DISTINCT with few unique values

Using the DISTINCT operator is not always the fastest way to return the unique values in a dataset.

Using recursive CTEs to return distinct values on large datasets with relatively few unique values: https://sqlperformance.com/2014/10/t-sql-queries/performance-tuning-whole-plan

What are Common Table Expressions: https://www.sqlservertutorial.net/sql-server-basics/sql-server-cte/
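
A sketch of the technique from the linked article, adapted to a hypothetical big_table(category) column: the recursive CTE skips through an index on category, touching roughly one row per distinct value instead of scanning the whole table.

WITH RECURSIVE distinct_vals AS (
    SELECT MIN(category) AS category FROM big_table
    UNION ALL
    SELECT (SELECT MIN(b.category) FROM big_table b WHERE b.category > d.category)
    FROM distinct_vals d
    WHERE d.category IS NOT NULL
)
SELECT category FROM distinct_vals WHERE category IS NOT NULL;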

Optimize Your Database Schema

Avoid Use of Non-correlated Scalar Sub Query

Non-correlated vs correlated subqueries:
Non-correlated means subquery is independent of parent query. Correlated means subquery depends on input from parent query.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/Queries/Subqueries/NoncorrelatedAndCorrelatedSubqueries.htm

Scalar subqueries:
A subquery that selects only one column or expression and returns one row.
https://docs.actian.com/ingres/11.0/index.html#page/SQLRef/Scalar_Subqueries.htm

If the subquery is independent of the parent query, then write it as an independent query. It's less complicated for the optimizer.
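
For example, a non-correlated scalar subquery can be computed once up front instead of being embedded (illustrative names):

-- Instead of embedding it:
SELECT o.id, o.amount, (SELECT AVG(amount) FROM orders) AS avg_amount
FROM orders o;

-- Compute it once, then use the result:
SET @avg_amount = (SELECT AVG(amount) FROM orders);
SELECT o.id, o.amount, @avg_amount AS avg_amount
FROM orders o;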

Derived tables instead of correlated subqueries

What are derived tables: https://www.mysqltutorial.org/mysql-derived-table/

Derived-table queries often produce better performance due to their set-based nature.
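
A sketch (hypothetical names): the correlated version runs its subquery once per customer, while the derived table aggregates once and joins.

-- Correlated subquery, evaluated per row:
SELECT c.id,
       (SELECT SUM(o.amount) FROM orders o WHERE o.customer_id = c.id) AS total
FROM customers c;

-- Derived table, aggregated once then joined:
SELECT c.id, t.total
FROM customers c
LEFT JOIN (
    SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id
) t ON t.customer_id = c.id;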

GROUP BY instead of Window functions

What are window functions: https://www.sqltutorial.org/sql-window-functions/

Sometimes window functions rely a little too much on tempdb and blocking operators to accomplish what you ask of them. While using them is always my first choice because of their simple syntax, if they perform poorly you can usually rewrite them as an old-fashioned GROUP BY to achieve better performance.
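
For example, when you only need the aggregate per group (not per detail row), a plain GROUP BY does the same job (illustrative names):

-- Window function version:
SELECT DISTINCT customer_id,
       SUM(amount) OVER (PARTITION BY customer_id) AS total
FROM orders;

-- GROUP BY version, usually cheaper:
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id;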

MySQL Query Caching

Cache: https://www.digitalocean.com/community/tutorials/how-to-optimize-mysql-with-query-cache-on-ubuntu-18-04

Cache refreshing:

  1. On write. Update cache upon updating original data.
  2. Application polls original data to update cache.
  3. Original DB pushes/notifies cache upon change.

Issue about UPDATE

An UPDATE statement is logged, which means it has to write twice for every single write to the table.

Example:

Situation: Reading from the Customer table and you want anyone with more than $100,000 in orders to be labeled as "Preferred".

Bad solution: Insert the data into the table. Then run an UPDATE statement to set the CustomerRank column to "Preferred" for anyone who has more than $100,000 in orders.

Good solution:

SELECT ..., ...,
CASE
    WHEN amount > 100000 THEN 'Preferred'
    ELSE ''
END AS CustomerRank
FROM Customer;

and then insert it.

Don’t blindly reuse code

When copying code, remove the parts you don't need, e.g. joins and WHERE conditions left over from the original query.

Be selective for what columns and rows you retrieve

Regarding columns: Don't use SELECT *, just select what you need.

Don’t double-dip

Query large tables only once whenever possible.

Do pre-stage data

This is one of my favorite topics because it’s an old technique that’s often overlooked. If you have a report or a procedure (or better yet, a set of them) that will do similar joins to large tables, it can be a benefit for you to pre-stage the data by joining the tables ahead of time and persisting them into a table. Now the reports can run against that pre-staged table and avoid the large join.

You’re not always able to use this technique, but when you can, you’ll find it is an excellent way to save server resources.

Note that many developers get around this join problem by concentrating on the query itself and creating a view-only around the join so that they don’t have to type the join conditions again and again. But the problem with this approach is that the query still runs for every report that needs it. By pre-staging the data, you run the join just once (say, 10 minutes before the reports) and everyone else avoids the big join. I can’t tell you how much I love this technique; in most environments, there are popular tables that get joined all the time, so there’s no reason why they can’t be pre-staged.

Temporary Staging Tables

Sometimes the query optimizer struggles to generate an efficient execution plan for complex queries. Breaking a complex query into multiple steps that utilize temporary staging tables can provide SQL Server with more information about your data. They also cause you to write simpler queries which can cause the optimizer to generate more efficient execution plans as well as allow it to reuse result sets more easily.
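
A sketch of the step-wise approach (illustrative names):

-- Step 1: materialize the expensive intermediate result
CREATE TEMPORARY TABLE tmp_recent_totals AS
SELECT customer_id, SUM(amount) AS total
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id;

-- Step 2: the final query stays simple
SELECT c.name, t.total
FROM tmp_recent_totals t
JOIN customers c ON c.id = t.customer_id;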

Do delete and update in batches

Don't delete or update too many rows in one go (i.e. in a single transaction). It can cause blockages.
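
In MySQL this can be as simple as repeating a LIMITed DELETE until no more rows are affected (hypothetical table):

-- Repeat until the affected row count is 0:
DELETE FROM audit_log
WHERE created_at < '2020-01-01'
LIMIT 10000;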

Don't do large ops on many tables in the same batch

Don't operate on too many tables on 1 transaction. Remember that each table will be locked. It can cause blockages.

Avoid cursors

What are cursors: https://www.mysqltutorial.org/mysql-cursor/

Cursors can cause your operation to block other operations for a lot longer than is necessary. This greatly decreases concurrency in your system.

  • Solutions/Mitigations:
    • Use set-based queries: set-based queries are more efficient than cursor-based ones.
    • If you really need to use cursors:
      • Avoid dynamic cursors: a dynamic cursor limits the optimizer to nested loop joins.
      • Do the cursor operations against a temp table:

Eg, a cursor that goes through a table and updates a couple of columns based on some comparison results. Instead of doing the comparison against the live table, you may be able to put that data into a temp table and do the comparison against that instead. Then you have a single UPDATE statement against the live table that’s much smaller and holds locks only for a short time.
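
A sketch of that pattern in set-based form (illustrative names):

-- Do the heavy comparison work off the live table:
CREATE TEMPORARY TABLE tmp_totals AS
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id;

-- Then one short UPDATE holds locks on the live table only briefly:
UPDATE customers c
JOIN tmp_totals t ON t.customer_id = c.id
SET c.customer_rank = CASE WHEN t.total > 100000 THEN 'Preferred' ELSE '' END;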

Don't nest views

What are views: https://www.mysqltutorial.org/mysql-views-tutorial.aspx
A nested view isn't a 'cache'; it's a 'remembered query'.

Avoid nested views:

  1. You will probably get much more data back than you need.
  2. The query optimizer may give up and return a bad query plan.

Indexed Views

When you can't add new indexes to existing tables, you might be able to get away with creating a view on those tables and indexing the view instead. This works great for vendor databases where you can't touch any of the existing objects.
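
A minimal SQL Server sketch (hypothetical names; indexed views need SCHEMABINDING, COUNT_BIG(*) when aggregating, and a unique clustered index):

CREATE VIEW dbo.vw_order_totals
WITH SCHEMABINDING
AS
SELECT customer_id, SUM(amount) AS total, COUNT_BIG(*) AS cnt  -- amount assumed NOT NULL
FROM dbo.orders
GROUP BY customer_id;
GO
CREATE UNIQUE CLUSTERED INDEX ix_vw_order_totals ON dbo.vw_order_totals (customer_id);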

Do use partitioning to avoid large data moves

What is table partitioning: https://www.sqlshack.com/database-table-partitioning-sql-server/

Not everyone will be able to take advantage of this tip, which relies on partitioning in SQL Server Enterprise, but for those of you who can, it’s a great trick. Most people don’t realize that all tables in SQL Server are partitioned. You can separate a table into multiple partitions if you like, but even simple tables are partitioned from the time they’re created; however, they’re created as single partitions. If you’re running SQL Server Enterprise, you already have the advantages of partitioned tables at your disposal.

This means you can use partitioning features like SWITCH to archive large amounts of data from a warehousing load. Let’s look at a real example from a client I had last year. The client had the requirement to copy the data from the current day’s table into an archive table; in case the load failed, the company could quickly recover with the current day’s table. For various reasons, it couldn’t rename the tables back and forth every time, so the company inserted the data into an archive table every day before the load, then deleted the current day’s data from the live table.

This process worked fine in the beginning, but a year later, it was taking 1.5 hours to copy each table -- and several tables had to be copied every day. The problem was only going to get worse. The solution was to scrap the INSERT and DELETE process and use the SWITCH command. The SWITCH command allowed the company to avoid all of the writes because it assigned the pages to the archive table. It’s only a metadata change. The SWITCH took on average between two and three seconds to run. If the current load ever fails, you SWITCH the data back into the original table.

This is a case where understanding that all tables are partitions slashed hours from a data load.
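
The core of the trick is a metadata-only SWITCH (SQL Server syntax; names and partition numbers are illustrative, and both tables must have compatible schemas and partition layouts):

-- Reassigns the pages of partition 1 to the archive table; no rows are copied:
ALTER TABLE dbo.daily_load SWITCH PARTITION 1 TO dbo.daily_load_archive PARTITION 1;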

If you must use ORMs, use stored procedures

An ORM machine-generates queries, which are never as good as those written by a programmer who knows what they're doing.

Stored procedures and User defined functions

What they are: https://github.com/atabegruslan/Notes/wiki/Database#stored-procedures--udfs

UDFs are further separated into scalar UDFs and table-valued functions.

UDFs can cause query plans to serialize, which can obviously slow things down (but not always).
Sometimes, on the contrary, a poorly configured server will parallelize queries too frequently and perform worse than the equivalent serial plan.

Do use table-valued functions

This is one of my favorite tricks of all time because it is truly one of those hidden secrets that only the experts know. When you use a scalar function in the SELECT list of a query, the function gets called for every single row in the result set. This can reduce the performance of large queries by a significant amount. However, you can greatly improve the performance by converting the scalar function to a table-valued function and using a CROSS APPLY in the query. This is a wonderful trick that can yield great improvements.

What is CROSS APPLY: https://www.youtube.com/watch?v=kVogo0AbatM
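
A sketch in SQL Server syntax (hypothetical function and tables):

-- An inline table-valued function instead of a scalar UDF:
CREATE FUNCTION dbo.fn_order_total (@customer_id INT)
RETURNS TABLE
AS
RETURN (SELECT SUM(amount) AS total FROM dbo.orders WHERE customer_id = @customer_id);
GO
-- CROSS APPLY lets the optimizer expand the function into the query:
SELECT c.id, t.total
FROM dbo.customers c
CROSS APPLY dbo.fn_order_total(c.id) AS t;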

But avoid Multi-statement Table Valued Functions (TVFs)

Multi-statement TVFs are more costly than inline TVFs. SQL Server expands inline TVFs into the main query like it expands views, but it evaluates multi-statement TVFs in a separate context from the main query and materializes their results into temporary work tables. The separate context and work tables are what make multi-statement TVFs costly.

Don't use triggers

What are triggers: https://github.com/atabegruslan/Notes/wiki/Database#triggers

Avoid triggers!

Reason: if you update a table and the trigger updates another table, locks are taken on both tables. The original update and the triggered update count as one transaction, and the locks won't be released until the transaction is done. This can cause blockage and slow-downs.

Instead: make the original and the triggered update two stored procedures, run as separate transactions.

Don’t cluster on GUID

What is GUID/UUID: https://www.mysqltutorial.org/mysql-uuid/ , https://www.sqlservertutorial.net/sql-server-basics/sql-server-guid/
A GUID (globally unique identifier) is a 16-byte randomly generated number.

Ordering your table's data on this clustered GUID will cause your table to fragment much faster than using a steadily increasing value like DATE or IDENTITY.

This applies to more than GUIDs -- it goes for any volatile column.

About UUID

Position a Column in an Index

The order (position) of a column in an index plays a vital role in SQL query performance. An index can help a query only if the query's criteria match the leftmost columns of the index key.

As a best practice, the most selective columns should be placed leftmost in the key of a non-clustered index.
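
For example, with a composite index, only queries that filter on the leftmost column(s) can seek it (illustrative names):

CREATE INDEX idx_orders_cust_date ON orders (customer_id, order_date);

-- Can seek the index (leftmost column present):
SELECT * FROM orders WHERE customer_id = 42 AND order_date = '2024-01-01';
SELECT * FROM orders WHERE customer_id = 42;

-- Can't seek it (leftmost column missing):
SELECT * FROM orders WHERE order_date = '2024-01-01';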

Don't count all rows if you only need to see if data exists

Don't use SELECT COUNT(*) to check existence. Use IF EXISTS instead.

-- Bad: counts every row just to test for existence
SET @CT = (SELECT COUNT(*) FROM dbo.T1);
IF @CT > 0
BEGIN
<Do something>
END

-- Good: stops at the first matching row
IF EXISTS (SELECT 1 FROM dbo.T1)
BEGIN
<Do something>
END

SELECT COUNT(*) takes a LOT more reads.

If you really do need a row count on the table, and it's really big, another technique is to pull it from the system table. SELECT rows from sysindexes will get you the row counts for all of the indexes. And because the clustered index represents the data itself, you can get the table rows by adding WHERE indid = 1. Then simply include the table name and you're golden. So the final query is:

SELECT rows FROM sysindexes WHERE object_name(id) = 'T1' AND indid = 1

In my 270 million row table, this returned sub-second and had only six logical reads. Now that's performance.

Default filegroup settings

Filegroups are "directories" where the DB stores its files.

Advantage of more filegroups: more parallelization.

Specify these 2 to make insert faster:

Statistics

Turn on statistics, as they help the query optimizer.

Data Compression

Not only does data compression save space, but on certain workloads it can actually improve performance. Since compressed data fits in fewer pages, disk reads are faster; maybe more importantly, compression lets more data fit in SQL Server's buffer pool, increasing the chance that SQL Server can reuse data already in memory.

Switch cardinality estimators

The newer cardinality estimator introduced in SQL Server 2014 improves the performance of many queries. However, in some specific cases it can make queries perform more slowly.

In those cases, a simple query hint is all you need to force SQL Server to change back to the legacy cardinality estimator: https://blog.sqlauthority.com/2019/02/09/sql-server-enabling-older-legacy-cardinality-estimation/
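
On SQL Server 2016 SP1 and later, the hint looks like this (table name is illustrative):

SELECT id, amount
FROM dbo.orders
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));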

Copy the data

If you can't get better performance by rewriting a query, you can always copy the data you need to a new table in a location where you CAN create indexes and do whatever other helpful transformations you need to do ahead of time.


Eager vs Lazy load


Security


General Basics

Storage Engines

  • Locking: InnoDB uses row-level locking; MyISAM uses full table-level locking.
  • Writes: InnoDB has faster inserts and updates; MyISAM has faster reads in some situations.
  • SELECT COUNT(*): less efficient in InnoDB; efficient in MyISAM (the row count is saved at table level).
  • Referential integrity: InnoDB enforces it (RDBMS); MyISAM doesn't (DBMS).
  • ACID: InnoDB is ACID-compliant; MyISAM is not.
  • Transactions: InnoDB has transactions, logs and rollback; MyISAM has no transactions nor crash recovery.
  • Suited for: InnoDB for big projects; MyISAM for small projects with a small footprint.
  • FULLTEXT: InnoDB only since MySQL 5.6 (it uses inverted lists for FULLTEXT indexes); MyISAM has FULLTEXT search indexes.
  • AUTO_INCREMENT: InnoDB requires the AUTO_INCREMENT column to be the first column of an index; MyISAM also allows it as a later part of a composite index.
  • Recovery: deleted InnoDB tables can't be re-established; MyISAM tables live in standalone files, which makes per-table restore easier.
  • Most MySQL indexes (PRIMARY KEY, UNIQUE, INDEX, and FULLTEXT) are stored in B-trees.
  • Indexes on spatial data types use R-trees.
  • MEMORY tables also support hash indexes.

Index

Auto Increment

https://trebleclick.blogspot.com/2009/01/mysql-set-auto-increment-in-phpmyadmin.html

Full-text vs metadata search

Full-text search is distinguished from searches based on metadata or on parts of the original texts.

In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria.

In CJK languages

Execution order

https://dba.stackexchange.com/a/162756
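
In short, the logical evaluation order is: FROM/JOIN → WHERE → GROUP BY → HAVING → SELECT → DISTINCT → ORDER BY → LIMIT.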

Data Types

https://www.mysqltutorial.org/mysql-data-types.aspx

Spatial

Binary

Prepared Statements

https://www.w3schools.com/php/php_mysql_prepared_statements.asp
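
Prepared statements also exist at the SQL level in MySQL; a minimal sketch:

PREPARE stmt FROM 'SELECT id, name FROM users WHERE id = ?';
SET @user_id = 1;
EXECUTE stmt USING @user_id;
DEALLOCATE PREPARE stmt;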

Atomicity

Basic transaction and rollback:

try
{
    // Connect, with exceptions enabled so failures can be caught
    $dsn = DB_DRIVER.":host=".DB_SERVER.";dbname=".DB_DATABASE.";charset=UTF8";
    $dbh = new PDO($dsn, DB_USERNAME, DB_PASSWORD);
    $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
}
catch (PDOException $e)
{
    die($e->getMessage());
}

try
{
    $dbh->beginTransaction();

    $sql = 'SELECT col1, col2 FROM `table` WHERE id = :key';
    $sth = $dbh->prepare($sql);
    $sth->execute(array(":key" => 1));

    $dbh->commit();
}
catch (PDOException $e)
{
    // Undo everything in the transaction if any statement failed
    $dbh->rollBack();
    die($e->getMessage());
}

Concurrency

Table lock: https://www.mysqltutorial.org/mysql-table-locking/

https://github.com/atabegruslan/Notes/blob/main/notes/storage/db_concurrency_2pl_or_timestamp.pdf

Locking

Triggers

Examples

  • MySQL: https://www.mysqltutorial.org/mysql-triggers.aspx
  • PGSQL: https://www.postgresqltutorial.com/postgresql-triggers/creating-first-trigger-postgresql , https://www.postgresql.org/docs/current/plpgsql-errors-and-messages.html , https://www.cybertec-postgresql.com/en/postgresql-how-to-write-a-trigger

Calling API from Trigger

Stored procedures & UDFs

  • A stored function (UDF) must return a value; a stored procedure can return nothing.
  • A function can only have input parameters; a procedure can have both input and output parameters.
  • Functions can be called from procedures; procedures cannot be called from functions.
  • A function can't edit data, only read (i.e. return) it; a procedure can edit data, and should be used for editing data and making transactions.

Full list of differences:

UDF Examples

  • MySQL: https://www.mysqltutorial.org/mysql-stored-function , https://www.sqlservertutorial.net/sql-server-user-defined-functions (SQL Server)
  • PGSQL: https://www.postgresqltutorial.com/postgresql-plpgsql/postgresql-create-function

Other points of interest from above example:

Stored Procedures Examples

  • MySQL: https://www.mysqltutorial.org/mysql-stored-procedure , https://www.mysqltutorial.org/getting-started-with-mysql-stored-procedures.aspx , https://www.youtube.com/watch?v=LgSgEt1mSFk
  • PGSQL: https://www.postgresqltutorial.com/postgresql-plpgsql/postgresql-create-procedure

Partitioning and Sharding

  • MySQL: https://www.percona.com/blog/what-is-mysql-partitioning , https://viblo.asia/p/gioi-thieu-ve-partitioning-trong-mysql-jvEla6kz5kw , https://www.devart.com/dbforge/mysql/studio/partition-mysql.html , https://planetscale.com/blog/what-is-mysql-partitioning
  • PGSQL: https://www.enterprisedb.com/postgres-tutorials/how-use-table-partitioning-scale-postgresql , https://www.postgresql.org/docs/current/ddl-partitioning.html

Character Sets and Collations

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.
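
In MySQL, for example, they can be set per table (or per column):

CREATE TABLE messages (
    id INT PRIMARY KEY,
    body TEXT
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;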

Encoding

https://github.com/atabegruslan/Notes/wiki/Encoding#encodings-commonly-used-in-web-development

Import and Export DB via Terminal

  • Export all: mysqldump -u root -p mydatabase > /home/myuser/database-dump.sql
  • Export data only: mysqldump -u [user] -p[pass] --no-create-info mydb > mydb.sql
  • Export structure only: mysqldump -u [user] -p[pass] --no-data mydb > mydb.sql
  • Import: mysql -u [user] -p[pass] mydb < mydb.sql

Good to know techniques

Temporarily disable foreign constraints

MySQL

SET FOREIGN_KEY_CHECKS = 0;
...
SET FOREIGN_KEY_CHECKS = 1;

PGSQL

BEGIN;
ALTER TABLE b DISABLE TRIGGER ALL;
...
ALTER TABLE b ENABLE TRIGGER ALL;
COMMIT;

https://stackoverflow.com/questions/38112379/disable-postgresql-foreign-key-checks-for-migrations

DB Platforms

DB Types

Relational

NoSQL

BigData

Hadoop

Multi Dimensional

  • OLTP (Online Transaction Processing): Inserts, Updates, Deletes.
  • OLAP (Online Analytic Processing): Answer MultiDimensional queries, i.e.: Selects.
  • Example of a MultiDimensional DBMS: https://en.wikipedia.org/wiki/Essbase

Cloud