Relational databases - SYSC3020-Winter2016/SYSC3020LectureNotes GitHub Wiki
Relational databases are the classic type of database, used since the 1970s.
Useful non-functional properties of relational databases:
- Scalable (used to store TBs of data)
- Safe and reliable
- Must resist system crashes, power outages, hacking...
- Handle multiple concurrent access
- Efficient: fast query processing (can handle thousands of queries per second)
- Convenient:
- declarative queries
- Interoperability standards (ODBC)
The relational model was introduced by E.F. Codd in 1970. The paper that introduced the model is considered one of the most influential computer science research papers.
Data in relational databases is stored in tables. Each table has a predefined list of columns, with a specified data type.
Rows (known as records) are added to these columns, with one value per column. In some cases, a record can have missing values, in which case the corresponding column contains the special value null. A set of tables with the associated columns and datatypes is called a relational schema.
A mathematical view of database records is that the data is made up of tuples, and logically a tuple represents an n-ary relation. Relational database theory has strong mathematical foundations in relational algebra and relational calculus (out of scope of this course).
The ER model is a modeling technique that is usually used to model data that is to be stored for persistence in a relational database. This technique is similar to that of modeling a domain using a class diagram, except there is no distinction in the relationship semantics (aggregation, etc.).
In the ER model, the domain is modeled by entities related by relations. Entities and relations can both have attributes. When an attribute or a set of attributes uniquely identifies an entity, it is called a primary key of the entity. The multiplicities of relations must be specified, as they will have an impact on the way this data will be mapped to the relational model (i.e. to tables in the database).
An example ER diagram:
The corresponding class diagram:
ER models can be mapped to database tables following some simple rules:
- Entity => table
- Attributes => columns
- Introduce unique id attribute for each table
- 1-to-many relation => column in the table representing the entity on the “many” side
- Many-to-many relation => table
- Relation attributes => columns in the relation table (many-to-many relation) or alongside the column storing the relation.
The above example produces the following tables:
- table moodle_user, columns: (id, firstname, lastname, email, password)
- table socialwiki_page, columns (id, title, content, parent, userid)
- table socialwiki_page, columns (id, title, content, parent, userid)
Queries for relational databases are usually expressed in the query language SQL (often said "sequel").
SQL is a declarative language, meaning the user expresses what he/she wants and not how to obtain it. A language that describes the procedure to obtain the data would be a procedural language.
SQL (decalrative) queries are translated to query plans, that are procedural descriptions of how to obtain the data. For a given query there are several possible query plans, and database systems are equipped with a subsystem that optimizes the query plan, i.e. that selects the query plan that will correctly answer the query at a minimal cost.
There are two notions of consistency. One is related to the notion of integrity constraint, and the other is related to replication in a distributed database.
Along a data schema, an application may require the data to respect specific constraints, known as integrity constraints. these constraints are usually determined by "business rules".
In theory, an integrity constraint can be any predicate applied to the data, although in practice the main types of constraints are unicity of values and the validity of references from one table to another.
For example, we have a human resources database, with a table listing the employees and a table listing the departments of the company.
We could then have the following integrity constraints: - the employee id is unique (a primary key) - the department name is unique (a primary key) - every employee must work for a valid department, i.e. a department that exists in the department table.
In large scale databases, data is often replicated: copies exist on different computers. This ensures that when a computer fails, the data is still available in another location.
The main problem is maintaining the consistency of these copies: if one copy lists employee Alice (id: 1234) with a salary of $50,000 and the other lists her with a salary of $65,000, they are inconsistent. When data is updated, replicas must be updated immediately to ensure consistency, and this additional processing overhead is an example of a tradeoff between non-functional requirements: we gain reliability/availability by having two copies of the data, but the data requres more memory and updates become more costly, therefore we lose in efficiency.
An important notion to guarantee the integrity of data in a relational database is the notion of a transaction.
When a database is modified, often the data must be changed in several places to maintain the integrity of the data. For example, in a bank database, if money is transferred between two accounts, the source account must be debited and the recipient account must be credited. For the operation to be correct, this set of operations must be done completely, or not at all. Such groups of interdepent operations are grouped into a transaction.
Database management systems guarantee that transactions are processed atomically (all or nothing), and in isolation (they may happen concurrently but the effect on the database is as if they happened sequentially).
Along with the notions of consistency and durability, these principles form the so-called ACID properties of a database system, key non-functional properties of databases.
Data Warehousing describes the practice of storing large amounts of data in a relational database for the purpose of performing complex analytical queries. Typically, a business will have smaller "online" databases, that will handle day-to-day operational queries, and will periodically upload their data to a large "warehouse" database, where it will then be reorganized and used for analytics. In a retail business with many stores, the data warehouse will for example be used to analyze sales across the different stores and over a long time period. Data is usually organized into "dimensions" (e.g. products, time, regions), and summarized over each dimension: for all products, it can provide the sales per region over time, or over all regions, the sales per product over time, etc.
Relational databases provide strong consistency guarantees, and can accomodate any data schema. These desirable properties come at the expense of performance (when dealing with internet-scale data (Petabytes of data = 10¹⁵ bytes)) and partition-tolerance (the ability for a distributed system to function when the network is down). Scalabiltiy, availability and performance requirements are the main drivers of the recent development of NOSQL databases.