database normalization - ghdrako/doc_snipets GitHub Wiki

obraz

obraz

Steps to normalization

Normalization operates on specific assumptions. Before initiating the normalization process, it is crucial to validate the accuracy of these assumptions.

  • ensure that there is only one fact per column. For instance, a full_name column where users can input both first and last names contains multiple strings that could constitute separate facts. As this information could be used separately in different parts of the business process, it is preferable to split it into separate columns
  • check that no data were lost during the transition to tabular form, especially in the case of unstructured data, such as text.
  • remove data that can be derived, deduced, or calculated from other fields unless needed for performance reasons.
  • each table must have a primary key. A primary key is defined as the minimal set of columns that contain a unique combination of values for each row in the table, allowing for the distinct identification of each record.

First normal form and repeating groups

The first normal form requires each table cell to have only one value and the records to be unique

  • Befor normalization obraz
  • After - table in 1NF obraz

In the preceding table, it is possible to find unique records using a combination of {Name, Phone, Sport} fields.

Putting data into table form and identifying and removing repeating groups. This can be done by splitting the original table into several tables—one for the primary data and one for each repeating group. First, we remove each distinct set of repeating group columns and relocate them to a new table. It is essential to include the primary key of the core table in each new table, serving as a cross-reference. In the repeating group tables, this is referred to as a foreign key since it is the primary key of a foreign table. It is crucial to determine if the sequence of the repeated groups holds business value and to add a sequence column accordingly. obraz obraz

After defining the new tables, we assign relevant names to them and identify the new primary key for each table. Identifying this new key may be the most challenging aspect of the normalization process. If we cannot find a new set of columns uniquely identifying each record, we can create a new unique key. Following this first step, our tables are technically in the first form of normalization, also known as the first normal form; this means that all data of the same kind are held in the same place, and the repeating groups are now only limited by the maximum number of possible rows in their own table.

Second normal form

In order to normalize a database into the second normal form, it should already be in 1NF and should have single-column primary key.

Identifying determinants

Upon further examination of our data, we may discover that the content of one column influences other columns in the table, meaning they are determinants of the other columns.

Determinants are not necessarily single columns; they can also be a combination of columns. For instance, the first_name and last_name columns can determine the birth_date, gender, and all personally identifiable information columns in a table. In the second step, we need to identify any determinants besides the primary key and the columns they influence. Next, establish a new table for each determinant and its associated columns. Assign a suitable name for the new table and use the determinant column as the new primary key. In the original table, eliminate the determined columns, retaining only the determinants to act as a foreign key used as a reference between tables.

obraz

Third normal form (3NF)

For a database to be normalized into the third normal form, it should already be in 2NF and should have no transitive functional dependencies.

Many texts recommend beginning by identifying the determinants that are part of the table’s primary key. Once these determinants are moved to new tables, the table reaches its second normal form (2NF). However, the 2NF is merely a stage in the process, as we also need to relocate all determinants that are not part of the primary key. After addressing all determinants, our tables will be in their 3NF.

In more complex scenarios, where every column in the table is part of the primary key or where tables contain time-dependent data, other normalization forms are used. However, the 3NF is usually enough to achieve a robust data model.