Data Normalization, refers to creating a structure to store your data in that results in reduction/elimination of redundant data. In a pure normalized form, a piece of data exists only in one place. Anywhere that the data is displayed is only making a reference to that data and joining tables together to recreate a human readable data structure. You gain several things from having a well normalized database schema, and you lose some things. Things that you gain are forced integrity of your data, everything belongs to something, attribute values are enforced to be in a known set of choices, and things are generally more clean and structured. Things you lose.. simplicity. The process of normalization results in generation of more tables, and so querying against data stored in these becomes more complex as you need to start doing joins and knowing what data will be there, what data might not be there; you can sometimes start getting very odd results if you are not careful.
As with many other subjects, there is a set of vocabulary terms that are used to express complex concepts succintly; data normalization is no different. There are varying degrees of normalization, each one being progressively more strict about what conditions must be met to attain a given level; these are referred to as "Normal Forms".
First Normal Form - typically abbreviated as 1NF, this is the most basic form, stating simply that a table must enforce that each row be unique, and not allow nulls in any column. So, an example to illustrate this lets take a table that most of you will have worked with at one point or another: Addresses. We will define the structure of the table as such:
create table Addresses
(
AddressId int identity(1,1) not null primary key,
StreetAddress1 nvarchar(100) not null,
StreetAddress2 nvarchar(100) not null,
City nvarchar(100) not null,
State nvarchar(100) not null,
Zip nvarchar(100) not null,
IsMailingAddress bit not null,
IsBillingAddress bit not null,
IsPrimaryAddress bit not null
)
go
This table adheres to the basic definition of 1NF, it has an identity column that has a unique key on it, which by definition, enforces that the row will be unique, even if the data is duplicated in every other column. If we changed the table to allow the IsMailingAddress (or any other column for that matter to contain a null value, then this table would no longer be properly normalized. There is some degree of contention about 1NF though; some people argue that Dr. Edgar F. Codd (the man who worked for IBM and invented the idea of relational data models) stated that null values must be allowed in a relational database. If you go with this concept, you will never be able to achieve a fully normalized database. For the sake of this discussion though, we will avoid a detailed explanation of this so as to keep things in scope.
Second Normal Form - also referred to as 2NF, this states that, first, the table must meet the guidelines of 1NF to be considered 2NF compliant. Further, that all columns in the table must rely on the key to express their uniqueness as a row in the table. Lets think about it like this if you have a table [Employees] with the columns [EmployeeId],[EmployeeName], and [Skill], and that a row in the table relied on a combination of EmployeeId and Skill to express uniqueness as a row, then the table is not meet the guidelines for 2NF. Due to the fact that EmployeeName does not depend on Skill, it only depends on EmployeeId (other employees can have the same skill, but other employees cannot have the same EmployeeId)
Third Normal Form - 3NF is the more commonly used structure when designing a new data model. It states that first a table must meet the requirements of 2NF, and goes one step further to state that all transitive or otherwise extraneous attributes have to be eliminated, and split off into what are commonly called Type tables. Let's look at again at the table above; notice how we have 3 columns that are attributes of an address IsMailingAddress, IsBillingAddress, and IsPrimaryAddress. These 3 columns are transitive properties, and would need to be removed from this table. We would first create a table AddressType, and have the columns AddressTypeId, AddressTypeName, and perhaps an AddressTypeDescription column. Next, we would create a table AddressTypes (note, these naming conventions can be adapted to best suit you, this is simply a convention that I have become accustomed to and ultimately I prefer a verbose name that directly implies what type of data we are working with), this table would have only two columns AddressId, and AddressTypeId. The goal of this is to only explicitly state what types the address IS, not also specify what they are NOT. Since not every address is a billing address, its extraneous data... we just don't care.
In most cases, 3NF is the most you will ever need to know about. There are additional ones though, but their definitions are difficult to explain in concise terms to those new to database design, so I will address those later in detail. Briefly though, they are Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF), Fifth Normal Form (5NF), and Sixth Normal Form (6NF).
Posted
Jul 30 2008, 05:05 PM
by
dacrowlah