Database technical implementation - Kosudo/nextSIS GitHub Wiki

###UTF-8, CJK and utf8mb4_unicode_ci As mentioned on the [Development Framework] (Development-Framework) page, the database uses the InnoDB storage engine. Collations (where required) are utf8mb4_unicode_ci - as we need UTF-8 Unicode to ensure [proper sorting in a multilingual application] (http://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci).

A note about UTF-8 - all unicode is formed from 8-bit units - but encoding certain Asian languages (most notably Chinese, Japanese and Korean - so called ['CJK' characters] (https://en.wikipedia.org/wiki/CJK_characters)) [may require up to 4 bytes] (http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings) to represent a character whereas UTF-16 can do them in 2-6, potentially making it more efficient. However, there is a negative impact on most projected use cases for nextSIS, UTF-16 isn't ASCII compatible, and there's a considerable body of opinion that [it represents the worst] (http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful) of UTF-8 and the potentially preferred UTF-32.

Almost everything (there are a few obscure exceptions) that can be represented in UTF-16 can be represented in UTF-8, and we sacrifice efficiency in some languages for compatibility and ease-of-use.

However, MySQL's implementation of UTF-8 under utf8-unicode-ci [only uses three-byte characters] (https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html) - not four - so the correct collation to use is utf8mb4_unicode_ci which was introduced in MySQL version 5.5.3 ([released in March 2010] (https://dev.mysql.com/doc/refman/5.5/en/news-5-5-x.html)). This may further restrict our pool of users on non-updated servers, but we can't compromise the fundamental design of the system to the detriment of CJK users when people should be in a position to upgrade by now and certainly over time probably will even if they haven't yet.

###Indexes Indexes take up storage space and slow down INSERTs and UPDATEs so I'm using them on fields users are likely to frequently search on, otherwise I'm not.

###VARCHAR length/values The length/values used in VARCHAR types after MySQL 4.1 refer to character length not the number of bytes (this is important to note when dealing with CJK-type languages). VARCHAR lengths are chosen with reasonably high values to ensure they cope with all conceivable scenarios. This may raise the question of why not to set most of them arbitrarily to 255 (the maximum length prior to MySQL 5.0.3 although now it's 65,535) - but when sorting and grouping the database creates a temporary table and converts VARCHAR to CHAR for processing efficiency - which has a significant negative impact if 255 has normally been chosen.

###VARCHAR NULL versus None/Empty I'm setting any VARCHAR field which allows NULL to a default value of NULL. The length of NULL is NULL, whereas the length of an empty string is zero, and it distinguishes between putting no data (no action), and putting empty data (an action). Despite this minor differences, there isn't a compelling argument one way or the other, but a decision has to be made one way or the other. In practice at code level we will check for NULL and empty strings where required I suspect.