Archive for August, 2012

Why Normalization in database design is not so well understood? And are Data-Warehouses really Denormalized?

August 12, 2012 2 comments

Over last three months, four software professionals I met at various places asked me this question: “Why is there so much hype about normalization technique,  when you can get away with a wide table (large number of columns) for almost anything”. Yes, most of the database vendors these days support large number of columns so you can indeed write one wide table instead of paying big bucks to the database designers.

Here is the catch; as soon as any software application becomes worth its salt, one wide table will start falling short. Without getting into the formal definitions of normalization/denormalization, I will provide an example:

Consider designing a database for maintaining transaction on the checkout counters in a retail store chain. Let us assume a simplified case, the details printed on your recipt has “store”, “product” and “salesman” information. Behind the scenes, there are four tables, one for transactions and three others for the “stores”, “products” and “salesmen”. Let us say, we want to just maintain one WIDE table for this work. This wide table has these columns: transaction-date&time, store-name, product-name and salesman-name. Here is what we will experience:
1. We cannot enter the details of a store, until the first transaction happens in that store. Same is the case for product and salesman.   C of CRUD (Create Retrieve, Update Delete) is compromised here.
2. You want to list of products. It is not straight forward as reading from a product table. You will have to scan all transaction rows (slow operation), pick upall  the products and then do a “distinct” on them. R of CRUD is compromised.
3. Let us say a product is sold out, ideally in the product table we would have toggled a flag. One row would have got impacted. Here that column needs to be updated for all the transactions that had that product. It could be updating potentially millions of rows. U of CRUD is compromised.
4. A product needs to be deleted. There is no straightforward way to do it. D of CRUD is compromised.
In summary, normalization is not a choice. It is a necessity for maintaining the basics of CRUD.
Coming to Data-Warehouses; are they really denormalized? Don’t they need to maintain the basics of CRUD? Short answer is data-warehouses have different set of requirements as compared to OLTP applications.  maintaining the basics of CRUD is a necessity for OLTP applications. In data-warehouses, U and D of CRUD usually doesn’t happen. C and R definitely happens, that is why, “dimensions” are NORMALIZED from the “facts”. So, “dimensions” within themselves are denormalized. “Facts” within themselves are also denormalized. But the fact that “dimensons” are taken out of “facts”, it conveys that normaliztion does happen in data-warehouses also.