database normalization - ghdrako/doc_snipets GitHub Wiki
Steps to normalization
Normalization operates on specific assumptions. Before initiating the normalization process, it is crucial to validate the accuracy of these assumptions.
- ensure that there is only one fact per column. For instance, a full_name column where users can input both first and last names contains multiple strings that could constitute separate facts. As this information could be used separately in different parts of the business process, it is preferable to split it into separate columns
- check that no data were lost during the transition to tabular form, especially in the case of unstructured data, such as text.
- remove data that can be derived, deduced, or calculated from other fields unless needed for performance reasons.
- each table must have a primary key. A primary key is defined as the minimal set of columns that contain a unique combination of values for each row in the table, allowing for the distinct identification of each record.
First normal form and repeating groups
Tabela (encja) jest w pierwszej postaci normalnej, kiedy wiersz przechowuje informacje o pojedynczym obiekcie, nie zawiera kolekcji, posiada klucz główny (kolumnę lub grupę kolumn jednoznacznie identyfikujących go w zbiorze) a dane są atomowe. Wszystko tak naprawdę zależy od definicji atomowości, jaka nas satysfakcjonuje. Jeśli atomowa informacja o adresie to dla nas nazwa ulicy z numerem to wszystko jest ok. Co innego jeśli wymogi biznesowe potrzebują rozdzielenia tych informacji – np. system zleceń dla firmy kurierskiej. Obsługa jednej ulicy może być realizowana przez kilku kurierów (np. długa ulica Piotrkowska w Łodzi) – tu z pewnością trzeba by rozdzielić te informacje na dwie kolumny
The first normal form requires each table cell to have only one value and the records to be unique
- Befor normalization
- After - table in 1NF
In the preceding table, it is possible to find unique records using a combination of {Name, Phone, Sport} fields.
Putting data into table form and identifying and removing repeating groups. This can be done by splitting the original
table into several tables—one for the primary data and one for each repeating group.
First, we remove each distinct set of repeating group columns and relocate them to a new table. It is
essential to include the primary key of the core table in each new table, serving as a cross-reference.
In the repeating group tables, this is referred to as a foreign key since it is the primary key of a foreign
table. It is crucial to determine if the sequence of the repeated groups holds business value and to add
a sequence column accordingly.
After defining the new tables, we assign relevant names to them and identify the new primary key for each table. Identifying this new key may be the most challenging aspect of the normalization process. If we cannot find a new set of columns uniquely identifying each record, we can create a new unique key. Following this first step, our tables are technically in the first form of normalization, also known as the first normal form; this means that all data of the same kind are held in the same place, and the repeating groups are now only limited by the maximum number of possible rows in their own table.
Second normal form
2NF mowi ze każda tabela powinna przechowywać dane dotyczące tylko konkretnej klasy obiektów. Normalizując do 2NF, wydzielić należy zbiór atrybutów (kolumn) który jest zależny tylko od klucza głównego. Wszystkie atrybuty informacyjne (nie należące do klucza), muszą zawierać informacje o elementach tej konkretnej klasy (encji, tabeli) a nie żadnej innej. Kolumny opisujące inne obiekty, powinny trafić do właściwych encji (tabel) w których te obiekty będziemy przechowywać.
In order to normalize a database into the second normal form, it should already be in 1NF and should have single-column primary key.
Identifying determinants
Upon further examination of our data, we may discover that the content of one column influences other columns in the table, meaning they are determinants of the other columns.
Determinants are not necessarily single columns; they can also be a combination of columns. For instance, the first_name and last_name columns can determine the birth_date, gender, and all personally identifiable information columns in a table. In the second step, we need to identify any determinants besides the primary key and the columns they influence. Next, establish a new table for each determinant and its associated columns. Assign a suitable name for the new table and use the determinant column as the new primary key. In the original table, eliminate the determined columns, retaining only the determinants to act as a foreign key used as a reference between tables.
Third normal form (3NF)
W 3NF kazda kolumna informacyjna nie należąca do klucza nie zależy też od innej kolumny informacyjnej, nie należącej do klucza. Czyli każdy niekluczowy argument jest bezpośrednio zależny tylko od klucza głównego a nie od innej kolumny. Np kolumny Kwota_netto,Kwota_brutto,Vat. Pole Kwota_brutto zalezy od Kwota_netto i Vat wiec moze zostac usunieta bez utraty informacji. Czesta łamie sie swiadomie 3NF w przypadku czasochlonych wyliczen zeby zaoszczedzic czas.
For a database to be normalized into the third normal form, it should already be in 2NF and should have no transitive functional dependencies.
Many texts recommend beginning by identifying the determinants that are part of the table’s primary key. Once these determinants are moved to new tables, the table reaches its second normal form (2NF). However, the 2NF is merely a stage in the process, as we also need to relocate all determinants that are not part of the primary key. After addressing all determinants, our tables will be in their 3NF.
In more complex scenarios, where every column in the table is part of the primary key or where tables contain time-dependent data, other normalization forms are used. However, the 3NF is usually enough to achieve a robust data model.