Why Normalization in database design is not so well understood? And are Data-Warehouses really Denormalized?

August 12, 2012 2 comments

Over last three months, four software professionals I met at various places asked me this question: “Why is there so much hype about normalization technique,  when you can get away with a wide table (large number of columns) for almost anything”. Yes, most of the database vendors these days support large number of columns so you can indeed write one wide table instead of paying big bucks to the database designers.

Here is the catch; as soon as any software application becomes worth its salt, one wide table will start falling short. Without getting into the formal definitions of normalization/denormalization, I will provide an example:

Consider designing a database for maintaining transaction on the checkout counters in a retail store chain. Let us assume a simplified case, the details printed on your recipt has “store”, “product” and “salesman” information. Behind the scenes, there are four tables, one for transactions and three others for the “stores”, “products” and “salesmen”. Let us say, we want to just maintain one WIDE table for this work. This wide table has these columns: transaction-date&time, store-name, product-name and salesman-name. Here is what we will experience:
1. We cannot enter the details of a store, until the first transaction happens in that store. Same is the case for product and salesman.   C of CRUD (Create Retrieve, Update Delete) is compromised here.
2. You want to list of products. It is not straight forward as reading from a product table. You will have to scan all transaction rows (slow operation), pick upall  the products and then do a “distinct” on them. R of CRUD is compromised.
3. Let us say a product is sold out, ideally in the product table we would have toggled a flag. One row would have got impacted. Here that column needs to be updated for all the transactions that had that product. It could be updating potentially millions of rows. U of CRUD is compromised.
4. A product needs to be deleted. There is no straightforward way to do it. D of CRUD is compromised.
In summary, normalization is not a choice. It is a necessity for maintaining the basics of CRUD.
Coming to Data-Warehouses; are they really denormalized? Don’t they need to maintain the basics of CRUD? Short answer is data-warehouses have different set of requirements as compared to OLTP applications.  maintaining the basics of CRUD is a necessity for OLTP applications. In data-warehouses, U and D of CRUD usually doesn’t happen. C and R definitely happens, that is why, “dimensions” are NORMALIZED from the “facts”. So, “dimensions” within themselves are denormalized. “Facts” within themselves are also denormalized. But the fact that “dimensons” are taken out of “facts”, it conveys that normaliztion does happen in data-warehouses also.

Irony of doing data warehousing on sharded data

Data is sharded for cost and performance reasons. If cost is not a problem, one can always get a powerful hardware to get whatever performance is needed. Sharding works because the business-needs align with the sharding key. Sharding key partitions the data in a self sufficient way. If business grows and this self-sufficiency is not maintained, sharding will lose its meaning.

 Data warehousing on sharded data is a tricky thing. Data warehousing unifies data and aligns business definitions. Data warehousing will attempt to find distincts and aggregates across all the shards. Well, it is not possible without a performance hit. So, real time data warehousing is out of question. For a not so excruciating performance, data warehousing will need to bring in all the data:

  • Either on a single physical computer. This will have a cost hit. You avoided the cost hit for your OLTP app, but if you need a good data-warehouse, you have just deferred your expenditure for future. It could be well thought out decision or a shocking discovery; depending on how much planning you have done ahead of time. One saving grace is that the cost hit on this data- warehousing hardware can be relatively less if you don’t need the solutions in (near) real time.
  • Or on a single logical computer (hadoop or in-memory kind of solutions). It is still sharding in some sense, isn’t it 🙂 And you have lost the real-time-ness and the ACID-ness 🙂


Thoughtless or Very Thoughtful: MySQL On-disc Storage System

A discovery into how schema and data are stored on disc for various mysql storage engines will definitely evoke many varied opinions. For some it could mean flexibility or plug and play architecture. For some others it could just mean a lack of well thought out architecture. Here is a summary (click on the table below for better resolution):

Categories: MySQL Tags: , ,

The Inseparables: Data Warehousing and Scrum

In spite of a simple, elegant and straight forward explanation of scrum at http://www.scrum.org/storage/scrumguides/Scrum_Guide.pdf, I continue to see so many professionals  talking and writing so much about scrum in so many forums without it making any sense to anybody. Everyone just goes for a certificate and there is no dearth of certified instructors to meet that demand.

Some of the leaders of scrum in the industry have whispered this over and over again: Scrum is only for young people. I must qualify it as “for young-at-heart people”.

And as per the scrum guide I quoted above: “Scrum is founded on empirical process control theory, or empiricism. Three pillars uphold every implementation of empirical process control: transparency, inspection, and adaptation.”

It is true that only young-at-heart people can go for transparency->inspection->adaptation.

Why do so many data warehousing projects fail? The straight forward answer is: because they don’t follow scrum. Yes, contrary to the popular belief that scrum is for regular software development projects, scrum is most pertinent for the data-warehousing projects. I refuse to listen to all those executives, who have been trying hard to cover up their data warehousing failures by emphasizing that data warehousing is unique, you can’t follow scrum because data warehousing is a huge-lifecycle project, scrum is about customer-deliverables and data-warehousing will take a long time before customers get a taste of it.

Data-Warehousing attempts to bring a unified picture of company’s business strategy and company’s data. When there is no real business strategy, nobody dares to accept it and the project fails. When there was no thought given to data in the first place and it was all done to mitigate a moment’s pain, data-warehousing will not be able integrate the data; it will fail. Most of the data warehousing projects fail because:

  1. They don’t follow scrum; they are not transparent. How can they be when different departments of the company don’t even talk to each other?
  2. They don’t follow scrum; they don’t welcome inspection. Of course not J
  3. They don’t follow scrum; they are not adaptable. Even if one department tries to adapt, the others will knock it down, isn’t it?

Data warehousing failures are a litmus test of the company’s overall health. There is no real successful company that doesn’t have a successful data-warehousing.

Meetup vs Facebook

I like Meetup.com more than Facebook.com. Meetup tends to facilitate human interaction, Facebook tends to replace it. We must continue to engage in more experiences that can’t be analyzed as compared to the ones that can be. We must continue to live in the moment as much as we can, and then forget about as many of them as we can, the next moment.

In meetup’s own words: Use the Internet to Get Off the Internet


Recently a friend of mine asked me the definitions of and differences between MOLAP, ROLAP and HOLAP. I had never pondered over these questions formally and had assumed that my passive knowledge was the right knowledge. I shared my passive knowledge with my friend with a disclaimer that formal defintions and differences could have more to them.

Fortunately for me, my passive knowledge was not too far from the formal definitions. While, one can google for these terms and can get to the details directly; I will summarize the definitions/differences as they make the most sense to me and the professionals who gel well with my style 🙂

Here it is:

MOLAP (Multidimensional OLAP): The technology of creating reports/dashboards/adhoc-analytics based on multi-dimensional storage. Whatever you ask for (even after group-bys and wheres) has been precalculated and stored in the data source you are accessing. Example: When PerformancePoint is used to create reports of an SSAS cube, you are doing MOLAP.

ROLAP (Relational OLAP): In formal computer science, relations are not relationships, they are synonyms for tables. Also in formal computer science, relationships are just constraints. ROLAP is the technology of creating reports/dashboards/adhoc-analytics based on relational-storage (data stored in traditional database tables). If whatever you ask for, needs to be fetched using a set of SQL queries with wheres, joins, group-bys, unions, havings, etc, you are using a ROLAP solution. Example: When SSRS reports use a dataset that is making a stored procedure call for a DataSet, you are doing ROLAP.

HOLAP (Hybrid OLAP): When your reports/dashboards/adhoc-analytics use both MOLAP and ROLAP technologies, your end-product is using HOLAP. MOLAP is used mainly for aggregated data and the details are usually fetched using ROLAP.  HOLAP is used a lot in reports/dashboards with drill-through/drill-down/sub-reports features. A sharepoint deployment with PPS reports built on SSAS cubes and SSRS reports built using DB-stored-procedure calls, is also a HOLAP solution.

If possible, go MOLAP. If not, do HOLAP. Never settle for just ROLAP unless you are doing trivial things or you have an enviable hardware.


As a side note for more mere mortals of software engineering (those who don’t know much about databases and those who don’t care much about databases), MOLAP vs. ROLAP is analogous to Hash-tables vs. Arrays 🙂


BI Benchmarks

February 11, 2012 Leave a comment

If it is not a single click or a double, it is too complex.
If it takes more than a second or two to load, it is too slow.
If it is more than one page, it is too much,
If audiences don’t spend more than 15 seconds on it, it is useless.
If audiences have to spend more than a minute on it, it is still useless.