Database types#

Midjourney: Database Tree, ref. Gustav Klimt

Eventually, you have to throw the data over a wall.

— Pat Helland

Slides/PDF#

Fundamentals and Definitions#

In many software solutions, the data to be managed are so large that it is no longer efficient to manage them in files. Instead, they are stored in databases. Databases allow different applications to access the same data concurrently without conflicts. They can manage very large data volumes and organize them into structured formats. Databases are defined as follows:

📘 Definition: Database

A Database (DB) refers to the logically related data that are managed by a DBMS (Database Management System).

📘 Definition: Database System

A database and a DBMS together are referred to as a Database System.

A database comprises all logically related data that are managed by a database management system. The database system denotes the combination of database and DBMS. In addition to the actual user data, the database includes all administrative objects such as indexes and log files that the system needs for efficient management. Long-term storage is ensured through persistence, while indexing enables fast search operations. Transactions ensure safe data changes, and logs document all operations in a traceable manner.

For engineers this means, for example, that project data, cost estimates, and schedules can be centrally managed, with multiple team members able to work concurrently without data inconsistencies arising.

File-Based Systems versus Databases#

In file-based systems, each application structures its data independently according to its own requirements. This often leads to redundancies and inconsistencies, as the same information is stored in multiple formats. Databases solve this problem with a unified data structure that all applications access. The DBMS coordinates all accesses and ensures that changes are synchronized and logged.

A practical example from civil engineering clearly shows the difference: Traditionally, accounting might manage cost data in Excel, while project management maintains separate schedules in specialized software. In a database-based approach, both access the same central data source, which automatically produces consistent and up-to-date reports on project costs and progress.

Codd’s Rules#

Edgar Codd defined nine fundamental principles for database systems. His motivation was to define clear, verifiable criteria that databases must satisfy. From this examination he derived precise rules designed to ensure that a system consistently implements data integrity, data independence, and declarative access.

  • Integration: unified, non-redundant data management

  • Operations: Storing, Searching, Updating

  • Catalog: Access to database descriptions in the data dictionary

  • User Views: Each user sees only the data they are permitted to view, in the way they want to view it

  • Integrity Assurance: Correctness of the database contents

  • Data Security: Prevention of unauthorized access; only authorized users

  • Transactions: multiple database operations as a single unit (all or nothing)

  • Synchronization: coordinating concurrent transactions

  • Data Recovery: Restoring data after system failures

Database Types#

Over the decades, different database types have emerged that are particularly well suited for different kinds of data and applications.

Distribution of database types

Relational Databases#

Relational database management systems have dominated since the 1980s and are based on the tabular data model. Each table has a fixed schema with defined columns and data types. This strong structuring makes relational databases ideal for well-structured business data.

SQL serves as the standard language for all database operations. A typical application is a project database with tables for projects, employees, materials, and costs, where complex queries such as “all projects that are over budget and have a duration of more than six months” are easily realizable.

Popular systems: Oracle, MySQL, Microsoft SQL Server, PostgreSQL, IBM Db2

NoSQL Databases#

NoSQL systems dispense with the relational data model and have grown in importance since 2009. They emerged in response to the challenges of modern web applications that require high scalability, fault tolerance, and the processing of big data. Their ability to handle semi-structured data such as JSON, which cannot be squeezed into a rigid tabular schema, is particularly important.

Popular systems: MongoDB, CouchDB, Cassandra, Redis, Neo4j, Amazon DynamoDB, HBase, OrientDB

Document-oriented Databases#

Document stores offer maximum flexibility through schema-free organization. Documents can have completely different structures; fields can contain different data types or even include arrays. Nesting enables hierarchical data structures, usually represented in JSON format.

For environmental engineers, this is especially valuable for documenting environmental samples, where different sites may have different measurement parameters. One aquatic site could measure pH, temperature, and dissolved oxygen, while another site also records heavy metal concentrations. Each site can map its specific data structure in the document-oriented database.

Popular systems: MongoDB, Amazon DynamoDB, Databricks, Azure Cosmos DB, Couchbase

Key-Value Databases#

Key-Value Stores work similarly to Python dictionaries and store simple key-value pairs. Their simplicity makes them attractive for resource-constrained systems like embedded PCs and for rapid development of web interfaces. They are particularly suitable for caching applications and session management.

In smart-building systems, sensor data can be stored efficiently as key-value pairs. A temperature sensor in Room 101 would store the current value under the key “Sensor_Room_101_Temperature”, enabling quick access for building control.

Popular systems: Redis, Amazon DynamoDB, Azure Cosmos DB, Memcached, Hazelcast

Search Engine Databases#

These specialized NoSQL systems are optimized for searching text content. They support complex search terms with boolean operators, perform full-text searches, and can stem words to their root forms. The results are sorted by relevance and can be grouped, with the search distributed across multiple servers.

Environmental agencies benefit particularly from such systems when they need to make thousands of assessments, reports, and studies searchable. A search query for groundwater contamination in industrial areas would find all relevant documents and rank them by their relevance to the specific question.

Popular systems: Elasticsearch, Splunk, Solr, OpenSearch, MarkLogic

Graph databases#

Graph databases model data as a network of nodes and edges, making them ideal for analyzing relationships. They enable network analyses, the modeling of social structures, and the development of recommender systems.

In construction projects, graph databases can map the complex dependencies between infrastructure elements. In an urban redevelopment, nodes would represent streets, buildings, and utility lines, while the connections would represent the dependencies between the renovation tasks. This enables optimal planning of the sequence of work.

Selection Criteria for Database Types#

The choice of the appropriate database type depends on several factors. Structured data with clearly defined relationships is well suited for relational databases, while flexible or evolving data structures favor NoSQL solutions. Scaling requirements play an important role, as does the complexity of the queries required.

Modern infrastructure projects often use multiple database types simultaneously. A Smart City project could combine relational databases for administrative data, graph databases for traffic-flow analyses, key-value stores for sensor caching, and search-engine databases for document management.

Quiz#

--- shuffleQuestions: true shuffleAnswers: true --- ### Welche Python-Pakete werden zum Laden von CSV-Daten aus dem Internet verwendet? - [x] `urllib` und `pandas` - [ ] `requests` und `numpy` - [ ] `json` und `matplotlib` - [ ] `sqlite3` und `os` ### Was ist SQL? - [x] Eine Sprache zur Abfrage und Manipulation von Datenbanken - [ ] Ein Programm zur Datenvisualisierung - [ ] Eine Programmiersprache wie Python - [ ] Ein Datenformat wie JSON ### Was ist SQLite? - [x] Eine lokale relationale Datenbank ohne Server - [ ] Ein Cloud-Datenbank-Service - [ ] Ein NoSQL-Datenbanksystem - [ ] Ein Datenvisualisierungstool ### Sortiere die folgenden Zeilen für das Speichern einer Pandas-Tabelle in SQLite: ```python import sqlite3 import pandas as pd ``` 1. `df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})` 2. `con = sqlite3.connect("test.sqlite")` 3. `df.to_sql("tabelle", con, if_exists="replace")` ### Welcher Fehler steckt in diesem Beispiel? ```python con = sqlite3.connect("test.sqlite") df.to_sql("tabelle", con, if_exists="append") ``` - [ ] `sqlite3` ist nicht importiert - [ ] Der Datenbankname ist zu kurz - [x] Kein Fehler - der Code ist korrekt - [ ] `to_sql` existiert nicht ### Was macht der SQL-Befehl `SELECT * FROM baustellen;`? - [ ] Löscht alle Daten aus der Tabelle - [x] Gibt alle Spalten und Zeilen der Tabelle zurück - [ ] Erstellt eine neue Tabelle - [ ] Zählt die Zeilen in der Tabelle ### Wie führt man einen SQL-Befehl in Python aus? - [x] Mit `cur.execute(sql)` nach Erstellung eines Cursors - [ ] Mit `con.run(sql)` - [ ] Mit `sqlite3.query(sql)` - [ ] Mit `pd.sql(sql)` ### Sortiere die folgenden Zeilen für eine SQL-Abfrage: ```python con = sqlite3.connect("test.sqlite") ``` 1. `cur = con.cursor()` 2. `sql = 'SELECT * FROM tabelle;'` 3. `for row in cur.execute(sql):` 4. ` print(row)` ### Welcher SQL-Befehl filtert Baustellen mit Baubeginn ab 2023? - [ ] `FILTER baubeginn >= '2023-01-01'` - [x] `WHERE baubeginn >= '2023-01-01 00:00:00+01'` - [ ] `IF baubeginn >= '2023-01-01'` - [ ] `SELECT baubeginn >= '2023-01-01'` ### Wie verknüpft man mehrere Bedingungen in SQL? - [x] Mit `AND`, `OR`, `NOT` - [ ] Mit `&&`, `||`, `!` - [ ] Mit `+`, `-`, `*` - [ ] Mit `COMBINE`, `MERGE`, `JOIN` ### Welcher Fehler steckt in diesem SQL-Beispiel? ```sql SELECT * FROM baustellen WHERE baubeginn = '2023-01-01' AND bauende = '2024-01-01' ``` - [ ] `AND` ist nicht erlaubt - [ ] Datumsformat ist falsch - [x] Kein Fehler - der Code ist korrekt - [ ] `WHERE` muss am Ende stehen ### Was macht die SQL-Funktion `count(*)`? - [ ] Summiert alle Werte - [x] Zählt die Anzahl der Zeilen - [ ] Findet den Maximalwert - [ ] Berechnet den Durchschnitt ### Welche Aggregatfunktionen gibt es in SQL? - [x] `min()`, `max()`, `avg()`, `sum()`, `count()` - [ ] `first()`, `last()`, `middle()` - [ ] `add()`, `subtract()`, `multiply()` - [ ] `top()`, `bottom()`, `center()` ### Sortiere die folgenden Teile einer SQL-Aggregationsabfrage: 1. `SELECT sparte, count(*), avg(baudauer)` 2. `FROM baustellen` 3. `GROUP BY sparte` ### Was macht `GROUP BY sparte` in einer SQL-Abfrage? - [ ] Sortiert nach Sparte - [x] Fasst Zeilen mit gleicher Sparte zusammen - [ ] Filtert nach Sparte - [ ] Löscht doppelte Sparten ### Wie sortiert man Ergebnisse in SQL absteigend? - [ ] `SORT BY spalte DOWN` - [x] `ORDER BY spalte DESC` - [ ] `ARRANGE BY spalte REVERSE` - [ ] `DESCEND BY spalte` ### Was bewirkt `LIMIT 3` in einer SQL-Abfrage? - [ ] Begrenzt auf 3 Spalten - [x] Gibt maximal 3 Ergebniszeilen zurück - [ ] Sortiert die ersten 3 Einträge - [ ] Filtert nach Werten kleiner 3 ### Welcher Fehler steckt in diesem SQL-Beispiel? ```sql SELECT sparte, count(*) FROM baustellen ORDER BY count(*) GROUP BY sparte ``` - [x] `ORDER BY` muss nach `GROUP BY` stehen - [ ] `count(*)` kann nicht sortiert werden - [ ] `GROUP BY` ist falsch positioniert - [ ] Es fehlt ein `WHERE` ### Was macht ein `JOIN` in SQL? - [ ] Kombiniert Spalten einer Tabelle - [x] Verknüpft Daten aus mehreren Tabellen - [ ] Sortiert Tabellendaten - [ ] Löscht doppelte Einträge ### Warum verwendet man Tabellenpräfixe wie `b.spalte` bei JOINs? - [ ] Zur Verschönerung des Codes - [x] Zur eindeutigen Identifikation bei gleichnamigen Spalten - [ ] Zur Beschleunigung der Abfrage - [ ] Zur Fehlerbehandlung ### Was macht `DISTINCT` in einer SQL-Abfrage? - [ ] Sortiert die Ergebnisse - [ ] Begrenzt die Anzahl der Ergebnisse - [x] Entfernt doppelte Zeilen aus den Ergebnissen - [ ] Verbindet mehrere Tabellen ### Was ist der Unterschied zwischen INNER JOIN und LEFT JOIN? - [x] INNER JOIN gibt nur übereinstimmende Zeilen zurück, LEFT JOIN alle Zeilen der linken Tabelle - [ ] LEFT JOIN gibt nur übereinstimmende Zeilen zurück, INNER JOIN alle Zeilen der linken Tabelle - [ ] Es gibt keinen Unterschied, beide sind gleich - [ ] INNER JOIN ist schneller als LEFT JOIN ### Sortiere die folgenden Teile einer JOIN-Abfrage: 1. `SELECT b.spalte1, a.spalte2` 2. `FROM baustellen AS b` 3. `JOIN adressenliste AS a` 4. `WHERE b.id = a.id` ### Was ist der Unterschied zwischen verschiedenen JOIN-Typen? - [x] Sie bestimmen, welche Zeilen bei fehlenden Übereinstimmungen zurückgegeben werden - [ ] Sie beeinflussen die Geschwindigkeit der Abfrage - [ ] Sie ändern die Sortierung der Ergebnisse - [ ] Sie bestimmen die Anzahl der verknüpften Tabellen ### Welcher Fehler steckt in diesem JOIN-Beispiel? ```sql SELECT * FROM baustellen NATURAL JOIN adressenliste, gemeinden ``` - [ ] `NATURAL JOIN` existiert nicht - [ ] Zu viele Tabellen im JOIN - [x] Potentielle Kreuzverknüpfung durch mehrere Tabellen ohne explizite Bedingungen - [ ] Falsche Syntax bei `NATURAL JOIN`