Why Normalization in database design is not so well understood? And are Data-Warehouses really Denormalized?
Over last three months, four software professionals I met at various places asked me this question: “Why is there so much hype about normalization technique, when you can get away with a wide table (large number of columns) for almost anything”. Yes, most of the database vendors these days support large number of columns so you can indeed write one wide table instead of paying big bucks to the database designers.
Here is the catch; as soon as any software application becomes worth its salt, one wide table will start falling short. Without getting into the formal definitions of normalization/denormalization, I will provide an example:
Data is sharded for cost and performance reasons. If cost is not a problem, one can always get a powerful hardware to get whatever performance is needed. Sharding works because the business-needs align with the sharding key. Sharding key partitions the data in a self sufficient way. If business grows and this self-sufficiency is not maintained, sharding will lose its meaning.
Data warehousing on sharded data is a tricky thing. Data warehousing unifies data and aligns business definitions. Data warehousing will attempt to find distincts and aggregates across all the shards. Well, it is not possible without a performance hit. So, real time data warehousing is out of question. For a not so excruciating performance, data warehousing will need to bring in all the data:
- Either on a single physical computer. This will have a cost hit. You avoided the cost hit for your OLTP app, but if you need a good data-warehouse, you have just deferred your expenditure for future. It could be well thought out decision or a shocking discovery; depending on how much planning you have done ahead of time. One saving grace is that the cost hit on this data- warehousing hardware can be relatively less if you don’t need the solutions in (near) real time.
- Or on a single logical computer (hadoop or in-memory kind of solutions). It is still sharding in some sense, isn’t it :) And you have lost the real-time-ness and the ACID-ness :)
A discovery into how schema and data are stored on disc for various mysql storage engines will definitely evoke many varied opinions. For some it could mean flexibility or plug and play architecture. For some others it could just mean a lack of well thought out architecture. Here is a summary (click on the table below for better resolution):
In spite of a simple, elegant and straight forward explanation of scrum at http://www.scrum.org/storage/scrumguides/Scrum_Guide.pdf, I continue to see so many professionals talking and writing so much about scrum in so many forums without it making any sense to anybody. Everyone just goes for a certificate and there is no dearth of certified instructors to meet that demand.
Some of the leaders of scrum in the industry have whispered this over and over again: Scrum is only for young people. I must qualify it as “for young-at-heart people”.
And as per the scrum guide I quoted above: “Scrum is founded on empirical process control theory, or empiricism. Three pillars uphold every implementation of empirical process control: transparency, inspection, and adaptation.”
It is true that only young-at-heart people can go for transparency->inspection->adaptation.
Why do so many data warehousing projects fail? The straight forward answer is: because they don’t follow scrum. Yes, contrary to the popular belief that scrum is for regular software development projects, scrum is most pertinent for the data-warehousing projects. I refuse to listen to all those executives, who have been trying hard to cover up their data warehousing failures by emphasizing that data warehousing is unique, you can’t follow scrum because data warehousing is a huge-lifecycle project, scrum is about customer-deliverables and data-warehousing will take a long time before customers get a taste of it.
Data-Warehousing attempts to bring a unified picture of company’s business strategy and company’s data. When there is no real business strategy, nobody dares to accept it and the project fails. When there was no thought given to data in the first place and it was all done to mitigate a moment’s pain, data-warehousing will not be able integrate the data; it will fail. Most of the data warehousing projects fail because:
- They don’t follow scrum; they are not transparent. How can they be when different departments of the company don’t even talk to each other?
- They don’t follow scrum; they don’t welcome inspection. Of course not J
- They don’t follow scrum; they are not adaptable. Even if one department tries to adapt, the others will knock it down, isn’t it?
Data warehousing failures are a litmus test of the company’s overall health. There is no real successful company that doesn’t have a successful data-warehousing.
I like Meetup.com more than Facebook.com. Meetup tends to facilitate human interaction, Facebook tends to replace it. We must continue to engage in more experiences that can’t be analyzed as compared to the ones that can be. We must continue to live in the moment as much as we can, and then forget about as many of them as we can, the next moment.
In meetup’s own words: Use the Internet to Get Off the Internet
Recently a friend of mine asked me the definitions of and differences between MOLAP, ROLAP and HOLAP. I had never pondered over these questions formally and had assumed that my passive knowledge was the right knowledge. I shared my passive knowledge with my friend with a disclaimer that formal defintions and differences could have more to them.
Fortunately for me, my passive knowledge was not too far from the formal definitions. While, one can google for these terms and can get to the details directly; I will summarize the definitions/differences as they make the most sense to me and the professionals who gel well with my style :)
Here it is:
MOLAP (Multidimensional OLAP): The technology of creating reports/dashboards/adhoc-analytics based on multi-dimensional storage. Whatever you ask for (even after group-bys and wheres) has been precalculated and stored in the data source you are accessing. Example: When PerformancePoint is used to create reports of an SSAS cube, you are doing MOLAP.
ROLAP (Relational OLAP): In formal computer science, relations are not relationships, they are synonyms for tables. Also in formal computer science, relationships are just constraints. ROLAP is the technology of creating reports/dashboards/adhoc-analytics based on relational-storage (data stored in traditional database tables). If whatever you ask for, needs to be fetched using a set of SQL queries with wheres, joins, group-bys, unions, havings, etc, you are using a ROLAP solution. Example: When SSRS reports use a dataset that is making a stored procedure call for a DataSet, you are doing ROLAP.
HOLAP (Hybrid OLAP): When your reports/dashboards/adhoc-analytics use both MOLAP and ROLAP technologies, your end-product is using HOLAP. MOLAP is used mainly for aggregated data and the details are usually fetched using ROLAP. HOLAP is used a lot in reports/dashboards with drill-through/drill-down/sub-reports features. A sharepoint deployment with PPS reports built on SSAS cubes and SSRS reports built using DB-stored-procedure calls, is also a HOLAP solution.
If possible, go MOLAP. If not, do HOLAP. Never settle for just ROLAP unless you are doing trivial things or you have an enviable hardware.
As a side note for more mere mortals of software engineering (those who don’t know much about databases and those who don’t care much about databases), MOLAP vs. ROLAP is analogous to Hash-tables vs. Arrays :)
If it is not a single click or a double, it is too complex.
If it takes more than a second or two to load, it is too slow.
If it is more than one page, it is too much,
If audiences don’t spend more than 15 seconds on it, it is useless.
If audiences have to spend more than a minute on it, it is still useless.