Relational Database Design#

Midjourney: A walk in the park, ref. Georges Seurat

Simple things should be simple. Complex things should be possible.

  • Alan Kay

Slides/PDF#

Design Approach#

The classic design approach for databases is similar to the waterfall model in software design. In this approach, the requirements and use cases are defined first:

  • Requirements analysis: What requirements and use cases does the database have?

  • Conceptual design: High-level design in which the required entities, attributes, and relationships are mapped in the ER diagram

  • Logical design: Detailed design of the concrete database schema for a specific DBMS that maps the actual tables in the database and is normalized

  • Physical design: Final determinations of primary indexes and additional search indexes for access optimization

  • Implementation and testing: Creating the database with SQL

  • Deployment and maintenance: Use of the database

What is decisive in the approach is that there is a high-level design at the conceptual stage, in which the functionally necessary entities are designed, and a subsequent detailed design with the logical and physical design, in which the actual tables in a DBMS are derived, which usually differ, as shown below.

Steps in Database Design

ER Diagrams#

To represent database designs, one uses Entity-Relationship diagrams. In them, one visualizes the entities with their attributes and the relationships between the entities. There are different variants of ER diagrams, as shown in the figure below. What is common to all of them is that they depict entities, attributes (not shown here) and relationships with cardinalities. They differ primarily in notation rather than in the use of symbols.

ER diagrams in different notations (https://de.wikipedia.org/wiki/Datei:ERD_Darstellungen.png)

We will use the UML notation below, which is based on UML class diagrams.

  • Classes receive the annotation <<Entity>> to distinguish them from normal classes

  • Attributes (table columns) are represented as class attributes with their data type

  • Primary keys are marked with PK

  • Relationship attributes (foreign keys) are modeled with associations

  • The reading direction of relationship types is indicated with << or >>

  • Numbers on relationships indicate cardinality

An important difference to UML diagrams is that in ER diagrams there are no methods, inheritance, polymorphism, encapsulation, or class hierarchies. Therefore we do not apply corresponding symbols.

A simple example is shown in the following image. We have a person who lives in a place. The person has the attributes Ausweisnummer and a name. We use the Ausweisnummer as the primary key, since it is unique. The place has a postal code and a name. The postal code serves here as the primary key for simplicity (since a postal code can cover many municipalities or a place can have multiple postal codes, this is not ideal). As a relationship we denote here that a person lives in exactly one place. Note the reading direction >> here and the cardinalities 1-to-1 that appear at the start and end of the relationship.

Cardinalities#

In ER diagrams, the cardinality of relationships plays a key role, since it leads to different table representations. Cardinalities describe the number of minimal and maximal possible relationships between entities for this relationship. In general, the following cardinalities are distinguished:

Cardinality Example ER Diagram
1 to 1 A person is born in at least one, at most one place.
1 to 0..1 A person has died in zero or one place.
1 to 0..* A person goes on vacation to zero or more places.
1 to 1..* A person has previously been to at least one, at most many places.

It should be noted that the reading direction (<< or >>) is important, since no statement is made about the cardinality of the reverse direction! In the above example, only a statement is made about how many places one person was in, and not about how many people can be in places.

Example: ER Conceptual Design for Geometric Objects#

As a design example, we refer to the geometry objects example discussed in the chapter Objektorientierung. We have the classes Point, Line and Polygon with the variants Triangle, Tetragon and Pentagon.

The Point has attributes the x and y coordinates of type float, both of which are public as annotated with +. Furthermore, there are different methods in Point, Line, and Polygon for calculating the distance, area, and length of the respective objects.

In the UML diagram we have depicted that a Polygon has the aggregation association with multiplicity 1 to 3..* to the class Point, i.e., a polygon is linked to at least three points. The Line is, in turn, associated with the Point class with the multiplicity 1 to 2, i.e., a line is associated with exactly two points.

The classes Triangle, Tetragon and Pentagon are subclasses of Polygon, which add no further functionality.

To design a database schema for these objects, we map them into an ER diagram by applying the conversion rules shown above.

  • We annotate the classes first with <<Entity>> to mark them as such.

  • We remove all methods, since ER diagrams do not support them.

  • We remove all encapsulation annotations +, -, or # before attributes

  • We add primary keys to all entities

  • We convert the special association types Aggregation and Composition into simple associations

Since there is no inheritance in ER diagrams and DBMS, we need to decide how to handle the subclasses Triangle, Tetragon, and Pentagon. One possibility would be to introduce a new entity for each, built just like a Polygon. But this would lead to several redundant tables. This is necessary if the subclasses introduce their own attributes. Since they do not do this, we also do not need separate entities, but can incorporate them into the generic Polygon entity by introducing the attribute PolygonType, in which we can store whether it is a Triangle, Tetragon, Pentagon, or subtype.

We obtain the following conceptual design:

In the conceptual design there are three entity types. The entity Point still has x and y coordinates as attributes of type float. The entity Line has two relations to Point for the start point and end point. The entity Polygon has the relation to Point with cardinality 1 to 3..*, i.e., at least three points. It also has a new attribute PolygonType of type str to store the subtype. For all three entities we have defined a numeric primary key, since numeric indexes are the most efficient.

Normalization of Conceptual Designs to Logical Designs#

The normalization is an important step in the process of mapping a conceptual data model to a logical and physical data model. Its purpose is to minimize redundancies (storing the same fact multiple times) by creating new tables and linking them via new relations for

  • complex attributes (lists, dictionaries)

  • relations with high cardinality

  • redundant data (e.g., polygon type)

There are several normal forms, defined by progressively stricter conditions on the database schema:

  • 1NF - First Normal Form: All attribute values are atomic — that is, each piece of information within a table gets its own column, and related information, such as the postal code and the city, or complex attributes like lists or dictionaries, are not stored in a single column.

  • 2NF - Second Normal Form: Non-key attributes are fully functionally dependent on all primary keys — that is, only information that functionally depends on the key is stored in an entity. For example, for an invoice, the customer’s name is not stored directly, but belongs to a separate Customer entity, since a customer can have multiple invoices. When drawing an ER diagram and thinking about the entities, you usually automatically apply the Second Normal Form.

  • 3NF - Third Normal Form: Non-key attributes depend only on the primary key — that is, redundant attributes that belong together should be stored together. So ZIP codes and place names tend to occur in pairs and should form a separate entity.

  • BCNF - Boyce-Codd Normal Form: All attributes on which other attributes depend are keys — that is, if several foreign keys occur in an entity and these depend on each other, this dependency is moved into its own entity. For example, if you store place names and ZIP codes in separate entities (because place names duplicate), you should not store the foreign keys ZIP and city name in the Customer entity, but still need the ZIP–City entity.

  • 4NF - Fourth Normal Form: There are only trivial multivalued dependencies remaining — that is, if multivalued attributes occur with foreign keys or other multivalued attributes that have no functional relation, they are moved out. For example, in 4NF it is incorrect to store in one table an order’s shipping address, customer number, and item numbers, since the shipping address depends on the customer number but not on the item numbers. The correct design is two tables with shipping address and customer number, and one with item number and customer number.

  • 5NF - Fifth Normal Form: There are no join dependencies remaining — that is, every join dependency between attributes and foreign keys is moved into its own table.

Usually only the first three normal forms are relevant in database design. Higher normal forms result in a very large number of small tables, which even for simple queries must be joined together, drastically limiting usability.

In our geometry example, we want to look at a few examples of violations of the first through third normal forms.

In the following ER diagram, we violate the first normal form because we define a complex attribute List<Punkt> instead of decomposing it into a relation.

In the second ER diagram we violate the second normal form, because we store the coordinates of the start and end points directly in the line, even though these points are represented as points and thus have a functional dependency. Here is the solution: move them into the Punkt entity and create relationships.

In the last example, we erroneously store the PolygonTyp attribute in the Point entity, even though it belongs to the Polygon entity.

Normalization of Cardinalities#

According to the first normal form, we must map cardinalities differently according to their degree. Simple cardinalities (0 or 1) can be stored as an attribute, while higher cardinalities >1 are mapped as a relation in a new entity. In both logical and physical design, it matters whether an attribute when storing a new entity in the database does not have to be specified (Nullable) or must be specified (Not Null).

Cardinality Example ER Diagram
1 to 1 The relationships are represented by foreign keys in the entity from which the relationship starts, denoted by `>>`. The foreign key is `NOT NULL`, so an entry must ALWAYS be present.
1 to 0..1 The relation is represented by a foreign key in the entity. The foreign key is `NULLABLE`, so an entry does not have to be present.
1 to 0..* The relation is represented as a new entity with a foreign key.
1 to 1..* The relation is represented as a new entity with a foreign key. At least one entry should exist (this is a consistency rule that cannot be enforced by a DBMS alone (there is otherwise a chicken-and-egg problem)).

Example Geometry#

To convert the above-mentioned conceptual design for the geometry objects into a logical design, we must normalize the design. Let’s take another look at the design.

We have the three entities Linie, Punkt and Polygon with two 1:1 relationships from Linie to Punkt and a 1:3..*-relationship from Polygon to Punkt.

First, by applying the rules of 1NF and decomposing these relations according to the table mentioned above, we can represent the two 1:1 relationships from Linie to Punkt using foreign keys in the Linie entity.

The 1:3..*-relationships from Polygon to Punkt, however, cannot be directly mapped according to 1NF, since the multiplicity is greater than 1 and we must introduce a new entity. This entity (table) lists which polygons use which points. The new entity thus contains the two foreign keys and a new primary key. In the logical design we also include the foreign keys as attributes to represent them unambiguously.

In this draft, 2NF and 3NF are already satisfied.

However, we have many duplications (redundancies) with the attribute PolygonTyp, which can only take the values Triangle, Tetragon, and Pentagon, but must be stored as a new string each time. At the same time, we also want this attribute to only take these values and not be filled with gibberish.

Therefore, it makes sense to constrain the values here and to avoid duplication of strings by introducing a new entity PolygonType. This defines the base types Triangle, Tetragon, and Pentagon and assigns them a numeric primary key, which we reference in the entity Polygon as a foreign key.

Lastly, we also define for all attributes and foreign keys whether they are Nullable or NotNull.

We obtain the final logical design:

Wichtig zu erkennen in dem logischen Entwurf ist, dass wir jetzt nur noch 1:1-Relationen haben, also alle höherwertigen Kardinalitäten aufgelöst haben. Zusätzlich haben wir Redundanzen aufgelöst und haben jetzt ein sehr effizienten Datenbankentwurf.

Im Physikalischen Entwurf würde man diesem Entwurf jetzt noch weitere Informationen hinzufügen, wie zum Beispiel zusätzliche Zugangangsbeschränkungen oder Abfrageindizies. Zugangsbeschränkungen gibt es in den meisten DBMSs nur auf Entitätsebene und nicht auf Attributsebene (wie in UML), man kann also Beschränken welche Nutzer, welche Tabellen sehen, lesen, erzeugen, oder verändern können, aber nicht welche Spalten sie sehen.

Beim Entwurf von Abfrageindizies ist die Regel, dass man versucht alle häufig genutzten WHERE-Bedingungen, die man in SELECT-Anfragen in SQL benutzt, durch entsprechende Indizies beschleunigt werden.

Damit sind wir auch schon direkt im SQL Thema wo wir uns mit der Erzeugung von Tabellen beschäftigen wollen.

Quiz#

--- shuffleQuestions: true shuffleAnswers: true --- ### Which steps belong to the classic database design? - [x] Requirements analysis, conceptual design, logical design - [ ] Only conceptual and physical design - [x] Physical design, implementation and testing - [ ] Only implementation and maintenance ### What is the main difference between conceptual and logical design? - [x] Conceptual design is a rough outline, logical is a detailed design - [ ] Conceptual design uses SQL, logical does not - [ ] There is no difference - [ ] Logical design comes before the conceptual ### Sort the steps of database design in the correct order: 1. Requirements analysis 2. Conceptual design 3. Logical design 4. Physical design 5. Implementation and testing ### What characterizes an entity in the UML notation for ER diagrams? - [x] The annotation `<<Entity>>` - [ ] The name `Entity` - [ ] The annotation `<Table>` - [ ] The special symbol `ER` - [ ] No special designation ### Which UML elements are NOT used in ER diagrams? - [x] Methods and inheritance - [ ] Attributes and classes - [x] Polymorphism and encapsulation - [ ] Associations ### What does the symbol `>>` mean in a relation? - [x] The relation is read from left to right - [ ] The relation is bidirectional - [ ] It is a notation error - [ ] The relation is read from right to left ### What does the cardinality "1 to 0..*" mean? - [x] An entity can be linked to zero to many others - [ ] An entity must be linked to at least one other - [ ] Exactly one link is allowed - [ ] The cardinality is invalid ### In which cardinality are new tables needed for relations? - [ ] 1 to 1 - [ ] 1 to 0..1 - [x] 1 to 0..* - [x] 1 to 1..* ### Which error is present in this ER diagram fragment? ``` Person >>lives in>> Location 1 1..* ``` - [x] The reading direction does not make sense for this cardinality - [ ] People cannot live in locations - [ ] The notation is completely wrong - [ ] Attributes are missing ### What is the goal of normalization? - [x] Minimize redundancies - [x] Ensure data integrity - [ ] Slow down queries - [ ] Create more tables ### What does the 1st normal form (1NF) require? - [x] All attribute values are atomic - [ ] All attributes are primary keys - [x] No complex attributes like lists - [ ] At least three tables ### Which normal form is violated if customer data is stored directly in the invoice table? - [ ] 1NF - [x] 2NF - [ ] 3NF - [ ] No violation ### Sort the normal forms by their strictness (from less to more strict): 1. 1NF (First Normal Form) 2. 2NF (Second Normal Form) 3. 3NF (Third Normal Form) 4. BCNF (Boyce-Codd Normal Form) ### Which error is present in this table design? ```sql CREATE TABLE Customers ( id INTEGER, name TEXT, addresses LIST<TEXT> ) ``` - [x] The attribute `addresses` violates 1NF (not atomic) - [ ] The data type INTEGER is incorrect - [ ] A primary key is missing - [ ] The table name is invalid ### How is a 1:1 relation mapped in tables? - [x] By a foreign key in one of the tables - [ ] By a separate relation table - [ ] By duplicating all attributes - [ ] Not at all, because it is invalid ### When is a foreign key "Nullable"? - [x] With cardinality 1 to 0..1 - [ ] With cardinality 1 to 1 - [x] When the relationship is optional - [ ] Never ### Sort these steps for normalizing a 1:n relation: 1. Create a new relation table 2. Add foreign keys to both original tables 3. Define the primary key for the relation table 4. Check cardinality ### Which SQL command creates a new table? - [x] CREATE TABLE - [ ] INSERT TABLE - [ ] NEW TABLE - [ ] MAKE TABLE ### What does AUTOINCREMENT do for a primary key? - [x] Automatically assigns unique values - [ ] Automatically deletes old entries - [x] Facilitates inserting new records - [ ] Prevents data changes ### Which error is present in this SQL code? ```sql CREATE TABLE Points ( point_id INTEGER PRIMARY KEY, x REAL NOT NULL, y REAL NOT NULL, FOREIGN KEY(x) REFERENCES Lines(line_id) ) ``` - [x] An attribute (x) cannot be both a coordinate and a foreign key - [ ] REAL is not a valid data type - [ ] PRIMARY KEY is misspelled - [ ] NOT NULL is redundant ### Sort these SQL commands in the logical order for creating a database: 1. CREATE TABLE for main entities 2. CREATE TABLE for foreign-key dependent tables 3. INSERT core data 4. INSERT linked data ### With which SQL command do you insert data? - [x] INSERT INTO - [ ] ADD DATA - [ ] CREATE DATA - [ ] PUT INTO ### What happens with this SQL command? ```sql DELETE FROM Points WHERE x > 50; ``` - [x] All points with x-coordinate > 50 are deleted - [ ] The Points table is deleted - [ ] Only the first found point will be deleted - [ ] It is a syntax error ### Which error is present in this INSERT statement? ```sql INSERT INTO Lines VALUES(1, 99, 100); ``` *Assume the points with IDs 99 and 100 do not exist* - [x] Foreign key constraint will be violated - [ ] Too many values provided - [ ] INTEGER is not a valid data type - [ ] INSERT is misspelled ### Sort these steps for safe deletion of data: 1. Create a backup of the table 2. Test the DELETE statement with a WHERE condition 3. Perform the actual deletion 4. Verify the result ### With which command do you rename a table? - [x] ALTER TABLE ... RENAME TO - [ ] CHANGE TABLE NAME - [ ] RENAME TABLE - [ ] UPDATE TABLE NAME ### What is the difference between DELETE FROM table and DROP TABLE? - [x] DELETE deletes only data, DROP deletes the entire table - [ ] There is no difference - [x] DROP also removes the table structure - [ ] DELETE is faster ### Which error is present in this ALTER statement? ```sql ALTER TABLE Points ADD COLUMN z REAL; ``` *Assume there is already data in the table* - [ ] ADD COLUMN is wrong - [ ] REAL is not a valid data type - [x] A DEFAULT value is missing for existing rows - [ ] The column name z is too short ### Sort these steps for safe table modification: 1. Create a backup of the table 2. Plan the ALTER statement 3. Perform the change 4. Check data integrity