Database Normalization
What is Database Normalization?
https://en.wikipedia.org/wiki/Database_normalization
Database normalization is the process of restructuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by Edgar F. Codd as an integral part of his relational model.
Normalization entails organizing the columns (attributes) and tables (relations) of a database to ensure that their dependencies are properly enforced by database integrity constraints. It is accomplished by applying some formal rules either by a process of synthesis (creating a new database design) or decomposition (improving an existing database design).
The drawbacks of data redundancy
- Data maintenance becomes tedious – data deletion and data updates become problematic
- It creates data inconsistencies
- Insert, Update and Delete anomalies become frequent. An update anomaly, for example, means that the versions of the same record, duplicated in different places in the database, will all need to be updated to keep the record consistent
- Redundant data inflates the size of a database and takes up an inordinate amount of space on disk
Denormalization
https://en.wikipedia.org/wiki/Denormalization
Denormalization is a strategy used on a previously-normalized database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data. It is often motivated by performance or scalability in relational database software needing to carry out very large numbers of read operations. Denormalization should not be confused with Unnormalized form. Databases/tables must first be normalized to efficiently denormalize them.
Types of database normalization
The database normalization process is further categorized into the following types:
- First Normal Form (1 NF)
- Second Normal Form (2 NF)
- Third Normal Form (3 NF)
- Boyce Codd Normal Form or Fourth Normal Form ( BCNF or 4 NF)
- Fifth Normal Form (5 NF)
- Sixth Normal Form (6 NF)
First Normal Form (1NF)
The first normal form requires that a table satisfies the following conditions:
- Rows are not ordered
- Columns are not ordered
- There is duplicated data
- Row-and-column intersections always have a unique value
- All columns are “regular” with no hidden values
A table that violates 1NF. This contains more than one value in the Dept column.
EmpID Employee Age Dept
----- -------- --- ----
1001 ABC 30 Sales, Finance
1002 DEF 40 Sales, Finance, DevOps
Now, normalize this.
Department table
DeptID DeptName
----- ------------ ----
1 Sales
2 Finance
3 DevOps
Employee table
EmpID Employee Age DeptID
----- -------- --- ----
1001 ABC 30 1
1001 ABC 30 2
1002 DEF 40 1
1002 DEF 40 2
1002 DEF 40 3
Second Normal Form (2NF)
An entity is in a second normal form if all of its attributes depend on the whole primary key. So this means that the values in the different columns have a dependency on the other columns.
- The table must be already in 1 NF and all non-key columns of the tables must depend on the PRIMARY KEY
- The partial dependencies are removed and placed in a separate table
Third Normal Form (3NF)
The third normal form states that you should eliminate fields in a table that do not depend on the key.
- A Table is already in 2 NF
- Non-Primary key columns shouldn’t depend on the other non-Primary key columns
- There is no transitive functional dependency
Find Nth highest salary from a table
Consider an Employee table
employeeName Salary
------------ ------
A 23000
B 31000
C 24500
D 35000
E 28500
F 31500
G 39800
H 51000
I 39800
Using Limit function
select * from (
(select * from Employee
ORDER BY `sal` DESC limit 6 ) AS T
)
ORDER BY T.`sal` ASC limit 1;
or
select * from Employee
ORDER BY `sal` DESC limit 5,1;
// will return 6th highest
Limit clause has two components, First component is to skip number of rows from top and second component is display number of rows we want.
Using DENSE_RANK
select * from (
select ename, sal, dense_rank()
over(order by sal desc)r from Employee
)
where r=&n;
- DENSE_RANK computes the rank of a row in an ordered group of rows and returns the rank as a NUMBER. The ranks are consecutive integers beginning with 1.
- This function accepts arguments as any numeric data type and returns NUMBER.
- As an analytic function, DENSE_RANK computes the rank of each row returned from a query with respect to the other rows, based on the values of the value_exprs in the order_by_clause.
- In the above query the rank is returned based on sal of the employee table. In case of tie, it assigns equal rank to all the rows.
To find to the 2nd highest sal set n = 2 To find 3rd highest sal set n = 3 and so on.
Using DISTINCT and MIN functions
- Find the employees with top N distinct salaries.
- Find the lowest salary among the salaries fetched by the above query, this will give us the Nth highest salary.
- Find the details of the employee whose salary is the lowest salary fetched by the above query.
SELECT * FROM Employee WHERE sal =
(
SELECT MIN(sal) FROM Employee
WHERE sal IN (
SELECT DISTINCT TOP N
sal FROM Employee
ORDER BY sal DESC
)
)