Database Normalization

Table of Contents

What is Database Normalization?
- The drawbacks of data redundancy
Denormalization
Types of database normalization

What is Database Normalization?

https://en.wikipedia.org/wiki/Database_normalization

Database normalization is the process of restructuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by Edgar F. Codd as an integral part of his relational model.

Normalization entails organizing the columns (attributes) and tables (relations) of a database to ensure that their dependencies are properly enforced by database integrity constraints. It is accomplished by applying some formal rules either by a process of synthesis (creating a new database design) or decomposition (improving an existing database design).

The drawbacks of data redundancy

Data maintenance becomes tedious – data deletion and data updates become problematic
It creates data inconsistencies
Insert, Update and Delete anomalies become frequent. An update anomaly, for example, means that the versions of the same record, duplicated in different places in the database, will all need to be updated to keep the record consistent
Redundant data inflates the size of a database and takes up an inordinate amount of space on disk

Denormalization

https://en.wikipedia.org/wiki/Denormalization

Denormalization is a strategy used on a previously-normalized database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data. It is often motivated by performance or scalability in relational database software needing to carry out very large numbers of read operations. Denormalization should not be confused with Unnormalized form. Databases/tables must first be normalized to efficiently denormalize them.

Types of database normalization

The database normalization process is further categorized into the following types:

First Normal Form (1 NF)
Second Normal Form (2 NF)
Third Normal Form (3 NF)
Boyce Codd Normal Form or Fourth Normal Form ( BCNF or 4 NF)
Fifth Normal Form (5 NF)
Sixth Normal Form (6 NF)

First Normal Form (1NF)

The first normal form requires that a table satisfies the following conditions:

Rows are not ordered
Columns are not ordered
There is duplicated data
Row-and-column intersections always have a unique value
All columns are “regular” with no hidden values

A table that violates 1NF. This contains more than one value in the Dept column.

EmpID     Employee   Age    Dept
-----     --------   ---    ----
1001      ABC        30     Sales, Finance
1002      DEF        40     Sales, Finance, DevOps

Now, normalize this.

Department table

DeptID    DeptName
-----     ------------    ----
1         Sales
2         Finance
3         DevOps

Employee table

EmpID     Employee   Age    DeptID
-----     --------   ---    ----
1001      ABC        30     1
1001      ABC        30     2
1002      DEF        40     1
1002      DEF        40     2
1002      DEF        40     3

Second Normal Form (2NF)

An entity is in a second normal form if all of its attributes depend on the whole primary key. So this means that the values in the different columns have a dependency on the other columns.

The table must be already in 1 NF and all non-key columns of the tables must depend on the PRIMARY KEY
The partial dependencies are removed and placed in a separate table

Third Normal Form (3NF)

The third normal form states that you should eliminate fields in a table that do not depend on the key.

A Table is already in 2 NF
Non-Primary key columns shouldn’t depend on the other non-Primary key columns
There is no transitive functional dependency

Find Nth highest salary from a table

Consider an Employee table

employeeName     Salary
------------     ------
A                23000
B                31000
C                24500
D                35000
E                28500
F                31500
G                39800
H                51000
I                39800

Using Limit function

select * from (
    (select * from Employee
     ORDER BY `sal` DESC limit 6 ) AS T
)
ORDER BY T.`sal` ASC limit 1;

select * from Employee
ORDER BY `sal` DESC limit 5,1;
// will return 6th highest

Limit clause has two components, First component is to skip number of rows from top and second component is display number of rows we want.

Using DENSE_RANK

select * from (
    select ename, sal, dense_rank()
    over(order by sal desc)r from Employee
)
where r=&n;

DENSE_RANK computes the rank of a row in an ordered group of rows and returns the rank as a NUMBER. The ranks are consecutive integers beginning with 1.
This function accepts arguments as any numeric data type and returns NUMBER.
As an analytic function, DENSE_RANK computes the rank of each row returned from a query with respect to the other rows, based on the values of the value_exprs in the order_by_clause.
In the above query the rank is returned based on sal of the employee table. In case of tie, it assigns equal rank to all the rows.

To find to the 2nd highest sal set n = 2 To find 3rd highest sal set n = 3 and so on.

Using DISTINCT and MIN functions

Find the employees with top N distinct salaries.
Find the lowest salary among the salaries fetched by the above query, this will give us the Nth highest salary.
Find the details of the employee whose salary is the lowest salary fetched by the above query.

SELECT * FROM Employee WHERE sal =
        (
           SELECT MIN(sal) FROM Employee
           WHERE  sal IN (
                                SELECT DISTINCT TOP N
                                    sal FROM Employee
                                        ORDER BY sal DESC
                            )
       )

Links to this note

Databases