
Archive for the ‘Data Warehousing’ Category

Why Is Normalization in Database Design Not So Well Understood? And Are Data-Warehouses Really Denormalized?

August 12, 2012

Over the last three months, four software professionals I met at various places asked me the same question: “Why is there so much hype about the normalization technique when you can get away with a wide table (large number of columns) for almost anything?” Yes, most database vendors these days support a large number of columns, so you can indeed create one wide table instead of paying big bucks to database designers.

Here is the catch: as soon as any software application becomes worth its salt, one wide table will start falling short. Without getting into the formal definitions of normalization/denormalization, I will provide an example:

Consider designing a database for maintaining transactions at the checkout counters of a retail store chain. Let us assume a simplified case: the details printed on your receipt carry “store”, “product” and “salesman” information. Behind the scenes, there are four tables: one for transactions and three others for the “stores”, “products” and “salesmen”. Let us say we want to maintain just one WIDE table for this work. This wide table has these columns: transaction-date&time, store-name, product-name and salesman-name. Here is what we will experience:
1. We cannot enter the details of a store until the first transaction happens in that store. The same is true for a product and a salesman. The C of CRUD (Create, Retrieve, Update, Delete) is compromised here.
2. You want a list of products. It is not as straightforward as reading from a product table. You will have to scan all transaction rows (a slow operation), pick up all the products and then do a “distinct” on them. The R of CRUD is compromised.
3. Let us say a product is sold out. Ideally we would have toggled a flag in the product table and one row would have been impacted. Here that column needs to be updated in every transaction that had that product, potentially updating millions of rows. The U of CRUD is compromised.
4. A product needs to be deleted. There is no straightforward way to do it. The D of CRUD is compromised.
In summary, normalization is not a choice. It is a necessity for maintaining the basics of CRUD.
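To make this concrete, here is a minimal SQL sketch of the two designs discussed above (all table and column names are illustrative):

```sql
-- The single wide table: every transaction repeats store/product/salesman details.
CREATE TABLE SaleWide (
    TransactionDateTime  DATETIME,
    StoreName            VARCHAR(100),
    ProductName          VARCHAR(100),
    SalesmanName         VARCHAR(100)
);

-- The normalized design: one row per store/product/salesman, referenced by key.
CREATE TABLE Store    (StoreId    INT PRIMARY KEY, StoreName    VARCHAR(100));
CREATE TABLE Product  (ProductId  INT PRIMARY KEY, ProductName  VARCHAR(100), IsSoldOut BIT);
CREATE TABLE Salesman (SalesmanId INT PRIMARY KEY, SalesmanName VARCHAR(100));

CREATE TABLE SaleTransaction (
    TransactionDateTime  DATETIME,
    StoreId              INT REFERENCES Store(StoreId),
    ProductId            INT REFERENCES Product(ProductId),
    SalesmanId           INT REFERENCES Salesman(SalesmanId)
);

-- R: listing products from the wide table means a full scan plus a "distinct" ...
SELECT DISTINCT ProductName FROM SaleWide;
-- ... versus a plain read from the normalized product table.
SELECT ProductName FROM Product;

-- U: marking a product as sold out touches one row, not millions of transactions.
UPDATE Product SET IsSoldOut = 1 WHERE ProductId = 42;
```

With the normalized design, each of the CRUD operations touches exactly the table it should.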
Coming to data warehouses: are they really denormalized? Don’t they need to maintain the basics of CRUD? The short answer is that data warehouses have a different set of requirements compared to OLTP applications. Maintaining the basics of CRUD is a necessity for OLTP applications. In data warehouses, the U and D of CRUD usually don’t happen. C and R definitely happen; that is why “dimensions” are NORMALIZED out of the “facts”. So, “dimensions” within themselves are denormalized. “Facts” within themselves are also denormalized. But the very fact that “dimensions” are taken out of “facts” conveys that normalization does happen in data warehouses too.

Irony of doing data warehousing on sharded data


Data is sharded for cost and performance reasons. If cost is not a problem, one can always get powerful enough hardware to deliver whatever performance is needed. Sharding works because the business needs align with the sharding key. The sharding key partitions the data in a self-sufficient way. If the business grows and this self-sufficiency is not maintained, sharding will lose its meaning.

Data warehousing on sharded data is a tricky thing. Data warehousing unifies data and aligns business definitions. It will attempt to find distincts and aggregates across all the shards, and that is not possible without a performance hit. So, real-time data warehousing is out of the question. To keep the performance hit from becoming excruciating, data warehousing will need to bring in all the data (a sketch of the cross-shard pain follows the list below):

  • Either on a single physical computer. This will have a cost hit. You avoided the cost hit for your OLTP app, but if you need a good data warehouse, you have just deferred your expenditure to the future. It could be a well-thought-out decision or a shocking discovery, depending on how much planning you have done ahead of time. One saving grace is that the cost hit on this data-warehousing hardware can be relatively small if you don’t need the solution in (near) real time.
  • Or on a single logical computer (Hadoop or in-memory kinds of solutions). It is still sharding in some sense, isn’t it 🙂 And you have lost the real-time-ness and the ACID-ness 🙂
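A minimal sketch of that cross-shard pain, assuming orders are sharded across three hypothetical databases Shard1..Shard3: even a simple distinct-and-aggregate question has to union every shard before it can be answered.

```sql
-- Hypothetical shard databases; a warehouse-style question has to touch all of them.
WITH AllOrders AS (
    SELECT CustomerId, OrderAmount FROM Shard1.dbo.Orders
    UNION ALL
    SELECT CustomerId, OrderAmount FROM Shard2.dbo.Orders
    UNION ALL
    SELECT CustomerId, OrderAmount FROM Shard3.dbo.Orders
)
SELECT COUNT(DISTINCT CustomerId) AS distinct_customers,
       SUM(OrderAmount)           AS total_revenue
FROM   AllOrders;
```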

Thoughts?

The Inseparables: Data Warehousing and Scrum


In spite of the simple, elegant and straightforward explanation of scrum at http://www.scrum.org/storage/scrumguides/Scrum_Guide.pdf, I continue to see so many professionals talking and writing so much about scrum in so many forums without it making any sense to anybody. Everyone just goes for a certificate, and there is no dearth of certified instructors to meet that demand.

Some of the leaders of scrum in the industry have whispered this over and over again: Scrum is only for young people. I must qualify it as “for young-at-heart people”.

And as per the scrum guide I quoted above: “Scrum is founded on empirical process control theory, or empiricism. Three pillars uphold every implementation of empirical process control: transparency, inspection, and adaptation.”

It is true that only young-at-heart people can go for transparency->inspection->adaptation.

Why do so many data warehousing projects fail? The straightforward answer is: because they don’t follow scrum. Yes, contrary to the popular belief that scrum is for regular software development projects, scrum is most pertinent for data-warehousing projects. I refuse to listen to all those executives who have been trying hard to cover up their data warehousing failures by emphasizing that data warehousing is unique, that you can’t follow scrum because data warehousing is a long-lifecycle project, that scrum is about customer deliverables and data warehousing will take a long time before customers get a taste of it.

Data warehousing attempts to bring a unified picture of a company’s business strategy and its data. When there is no real business strategy, nobody dares to accept it and the project fails. When no thought was given to data in the first place and everything was done to mitigate a moment’s pain, data warehousing will not be able to integrate the data; it will fail. Most data warehousing projects fail because:

  1. They don’t follow scrum; they are not transparent. How can they be when different departments of the company don’t even talk to each other?
  2. They don’t follow scrum; they don’t welcome inspection. Of course not 🙂
  3. They don’t follow scrum; they are not adaptable. Even if one department tries to adapt, the others will knock it down, isn’t it?

Data warehousing failures are a litmus test of a company’s overall health. There is no truly successful company that doesn’t have successful data warehousing.

MOLAP, ROLAP and HOLAP


Recently a friend of mine asked me the definitions of, and differences between, MOLAP, ROLAP and HOLAP. I had never pondered over these questions formally and had assumed that my passive knowledge was the right knowledge. I shared my passive knowledge with my friend, with a disclaimer that the formal definitions and differences could have more to them.

Fortunately for me, my passive knowledge was not too far from the formal definitions. While one can google these terms and get to the details directly, I will summarize the definitions/differences as they make the most sense to me and to the professionals who gel well with my style 🙂

Here it is:

MOLAP (Multidimensional OLAP): The technology of creating reports/dashboards/adhoc-analytics based on multi-dimensional storage. Whatever you ask for (even after the group-bys and wheres) has been precalculated and stored in the data source you are accessing. Example: when PerformancePoint is used to create reports on top of an SSAS cube, you are doing MOLAP.

ROLAP (Relational OLAP): In formal computer science, relations are not relationships; they are synonyms for tables. Also in formal computer science, relationships are just constraints. ROLAP is the technology of creating reports/dashboards/adhoc-analytics based on relational storage (data stored in traditional database tables). If whatever you ask for needs to be fetched using a set of SQL queries with wheres, joins, group-bys, unions, havings, etc., you are using a ROLAP solution. Example: when an SSRS report uses a dataset that makes a stored procedure call, you are doing ROLAP.
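For illustration, a minimal sketch of the kind of SQL a ROLAP report ends up issuing at query time (table and column names are made up):

```sql
-- Sales by region and month, computed at query time from relational tables.
SELECT   s.Region,
         DATEPART(year,  f.SaleDate) AS SaleYear,
         DATEPART(month, f.SaleDate) AS SaleMonth,
         SUM(f.SaleAmount)           AS TotalSales
FROM     FactSales AS f
JOIN     DimStore  AS s ON s.StoreKey = f.StoreKey
GROUP BY s.Region, DATEPART(year, f.SaleDate), DATEPART(month, f.SaleDate);
```

A MOLAP source would have these totals precalculated; here the database does the work on every request.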

HOLAP (Hybrid OLAP): When your reports/dashboards/adhoc-analytics use both MOLAP and ROLAP technologies, your end product is using HOLAP. MOLAP is used mainly for aggregated data, and the details are usually fetched using ROLAP. HOLAP is used a lot in reports/dashboards with drill-through/drill-down/sub-report features. A SharePoint deployment with PPS reports built on SSAS cubes and SSRS reports built using DB stored-procedure calls is also a HOLAP solution.

If possible, go MOLAP. If not, do HOLAP. Never settle for just ROLAP unless you are doing trivial things or you have enviable hardware.

———————————————————————————————————————————

As a side note for the mere mortals of software engineering (those who don’t know much about databases and those who don’t care much about databases), MOLAP vs. ROLAP is analogous to hash-tables vs. arrays 🙂

———————————————————————————————————————————

Replicated DBs as Data Sources: Are you sure?

December 2, 2011

Let us assume you have multiple OLTP databases catering to various kinds of applications, and you are provided a replicated database for each of these OLTP databases. First-order thinking makes you feel good about it. You have real-time data to tap into; based on your ETL system’s capabilities, you may be able to provide near real-time solutions. You are right, as long as you are not looking for 100% accuracy at all times.

The problem with working on replications is that the data changes every instant. A few scenarios:
1. If you tapped an OLTP table for customer records to create your customer dimension, then by the time you are filling data into a fact table that carries the customer dimension key, the fact table may be referring to a new customer that didn’t exist when you populated your customer dimension.
2. You got the list of customers with certain statuses. By the time the fact table is getting populated, the statuses of some of those customers have changed. Your report will not reflect the true picture as of the time the fact table was processed.
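A hedged sketch of scenario 1, with made-up table names: a check the ETL can run to find customers that already appear in staged transactions but are missing from the customer dimension, because the replica kept moving after the dimension load finished.

```sql
-- Customers referenced by staged transactions that the dimension load never saw.
SELECT DISTINCT t.CustomerId
FROM   StagingTransactions AS t
LEFT JOIN DimCustomer      AS d ON d.CustomerId = t.CustomerId
WHERE  d.CustomerId IS NULL;
```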

The root cause of all these scenarios is that we are not working with one particular state of the source DBs at one particular point in time. The OLTP applications themselves are okay working directly on these DBs, but data warehouses tap various unrelated domain areas of the OLTP systems and depend on a synchronized state.

Replications are also prone to lags, and while you may assume that you processed all the data for a day or an hour, you may actually have got only a subset of the data.

For mission-critical data warehouses, I don’t recommend using replicated DBs as data sources (of course there are always exceptions, if you know exactly what you are doing). Backup-restores or snapshots, with a properly defined backup/snapshot datetime that is clearly documented and agreed upon in formal SLAs, are the good solutions. You give up faster data availability, but you reduce the pains of inaccuracy, potential lawsuits and painful justifications to auditors.
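For illustration, a hedged sketch of the snapshot approach using SQL Server database snapshots (database names and file paths are made up): the ETL reads every table from the snapshot, so all of them reflect the same agreed-upon instant.

```sql
-- Create a snapshot of the source database at the documented, agreed-upon time.
CREATE DATABASE SalesOLTP_Snap_20111202_0200 ON
    ( NAME = SalesOLTP_Data,   -- logical data-file name of the source database
      FILENAME = 'D:\Snapshots\SalesOLTP_Snap_20111202_0200.ss' )
AS SNAPSHOT OF SalesOLTP;

-- ETL queries read from the snapshot, never from the live or replicated DB.
SELECT CustomerId, CustomerStatus
FROM   SalesOLTP_Snap_20111202_0200.dbo.Customer;
```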

Schema Validation Framework (SVF)

November 16, 2011

SVF is the heart of any data warehousing project. I just coined this acronym, but I invented this phase of the data warehousing processing engine many years ago and have been using it successfully. Most data warehousing projects don’t even get beyond the start line because they don’t have an SVF infrastructure in place; they begin, they struggle and they eventually die. I will explain in a bit how that happens.

SVF is a way to protect your ETL from any/all schema changes in the source systems.

Most DBMSs provide built-in, detailed metadata about tables, relationships, constraints, columns, data types, keys, etc. In some of the major DBMSs this metadata is exposed through a set of system views called INFORMATION_SCHEMA. Selective, transformed, yet detailed information from this metadata for a source-system database can be stored in the SVF layer, which executes prior to the ETL layer. On subsequent processing of the source-system database, SVF will fetch the metadata again and compare it with the previously stored version that the ETL has been made compatible with. If no serious mutations are found, the ETL can proceed; else it can gracefully exit after logging and alerting about the mutations.
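A minimal sketch of such a comparison, assuming a hypothetical baseline table dbo.SvfColumnBaseline that was captured from INFORMATION_SCHEMA.COLUMNS when the ETL was last certified:

```sql
-- Any row returned is a "mutation": a column added, dropped or retyped in the source.
SELECT  COALESCE(cur.TABLE_NAME,  base.TABLE_NAME)  AS TableName,
        COALESCE(cur.COLUMN_NAME, base.COLUMN_NAME) AS ColumnName,
        cur.DATA_TYPE   AS CurrentType,
        base.DATA_TYPE  AS BaselineType
FROM INFORMATION_SCHEMA.COLUMNS AS cur
FULL OUTER JOIN dbo.SvfColumnBaseline AS base
       ON  base.TABLE_NAME  = cur.TABLE_NAME
       AND base.COLUMN_NAME = cur.COLUMN_NAME
WHERE  base.COLUMN_NAME IS NULL            -- column added in the source
   OR  cur.COLUMN_NAME  IS NULL            -- column dropped from the source
   OR  cur.DATA_TYPE   <> base.DATA_TYPE;  -- column retyped
```

If the query returns no rows, the ETL run can proceed; otherwise the SVF layer logs the differences, alerts and exits gracefully.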

I have seen companies and consultants spend endless time discussing the merits/demerits of 3NF vs. Dimensional Modeling, real-time vs. delayed data processing, Inmon vs. Kimball, ETL tools vs. SQL queries, OLTP vs. OLAP, Business Objects vs. Microstrategy, … and the debates go on. I haven’t seen many folks discussing data procurement in detail. Yet, as soon as a data warehousing project starts and the ETL passes its first phase, the source-system database changes. The data warehouse team pushes to put a process in place whereby the application development team informs the data warehouse team about any database schema changes prior to deployment. The application development team almost always fails to do so, for obvious and mostly justifiable reasons; there isn’t much ROI on the effort involved in following this process (especially when SVF can automate it). But because SVF is not given a thought, the blame game starts, the problem continues to exist, the ETL continues to break and the data warehousing SLAs continue to lose their meaning…

I am tempted to write a little bit about the inappropriateness of slave/subscriber databases as source data for serious data warehousing projects, but I will hold that temptation until my next blog. Meanwhile, it is good food for thought for those who are curious enough 🙂

NULLs in Dimensional Modeling


As a rule, it is impossible to always replace NULLs with some other default representation, either in dimension tables or in fact tables. Here are the possible scenarios:

1. ‘Data Not Available’ kind of values can be assigned to a column in a dimension table only if the data type of that column is a string of at least 18 characters. A table usually has a mix of data types across its columns, and not every column can take a default string value that long. So, I experimented with NULLs for numeric columns (because anything other than NULL can carry an unintended meaning) and ‘N/A’ for string columns, until I hit string columns of data type char(2), char(1), etc. I then ended up using ‘-‘ as my default for char(1), ‘–‘ for char(2) and ‘N/A’ for all other strings (see the sketch after point 2 below).

 2. For fact tables, you have four things to take care of:

  • Non-identifying foreign keys: should be replaced with -1, -2, -3, etc. (these are surrogate ids of the dimension record in case of no match).
  • Identifying foreign keys: can’t be null by definition.
  • Degenerate dimensions: it depends; the same criteria as for dimension-table attributes apply.
  • Measures: can be and should be null when no value is known/available.
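A minimal sketch of both conventions, with hypothetical table and column names: dimension string columns get width-appropriate defaults, numeric dimension columns and measures stay NULL, and unmatched non-identifying foreign keys fall back to a negative surrogate key.

```sql
-- Dimension load: '-' for char(1), '--' for char(2), 'N/A' for wider strings,
-- numeric columns left NULL (anything else could carry an unintended meaning).
INSERT INTO DimCustomer (CustomerName, Gender, StateCode, CreditScore)
SELECT  COALESCE(src.CustomerName, 'N/A'),  -- varchar(100)
        COALESCE(src.Gender,       '-'),    -- char(1)
        COALESCE(src.StateCode,    '--'),   -- char(2)
        src.CreditScore                     -- numeric: leave NULL
FROM    StagingCustomer AS src;

-- Fact load: -1 means "no matching dimension member"; the measure stays NULL
-- when no value is known.
INSERT INTO FactSales (CustomerKey, ProductKey, SaleAmount)
SELECT  COALESCE(dc.CustomerKey, -1),       -- non-identifying FK: default surrogate
        COALESCE(dp.ProductKey,  -1),
        src.SaleAmount                      -- measure: NULL when unknown
FROM    StagingSales AS src
LEFT JOIN DimCustomer AS dc ON dc.CustomerId = src.CustomerId
LEFT JOIN DimProduct  AS dp ON dp.ProductId  = src.ProductId;
```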

Also read:

http://www.kimballgroup.com/html/designtipsPDF/DesignTips2003/KimballDT43DealingWith.pdf