Posts Tagged ‘Data Warehousing’

Irony of doing data warehousing on sharded data


Data is sharded for cost and performance reasons. If cost is not a problem, one can always buy powerful hardware to get whatever performance is needed. Sharding works because the business needs align with the sharding key: the key partitions the data in a self-sufficient way. If the business grows and this self-sufficiency is not maintained, sharding loses its meaning.

Data warehousing on sharded data is a tricky thing. Data warehousing unifies data and aligns business definitions, so it will attempt to find distincts and aggregates across all the shards. That is not possible without a performance hit, so real-time data warehousing is out of the question. For performance that is not excruciating, data warehousing will need to bring in all the data:

  • Either on a single physical computer. This will have a cost hit. You avoided the cost hit for your OLTP app, but if you need a good data warehouse, you have just deferred the expenditure to the future. It could be a well-thought-out decision or a shocking discovery, depending on how much planning you have done ahead of time. One saving grace is that the cost hit on this data-warehousing hardware can be relatively small if you don’t need the solutions in (near) real time.
  • Or on a single logical computer (Hadoop or in-memory kinds of solutions). It is still sharding in some sense, isn’t it? 🙂 And you have lost the real-time-ness and the ACID-ness 🙂
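
The point about distincts and aggregates across shards can be made concrete with a small sketch. The data below is hypothetical: three shards of (customer, amount) rows, sharded by something other than customer, so the same customer can appear on more than one shard. The function names are made up for illustration. A sum can be computed shard by shard and added up, but a distinct count cannot; the keys themselves have to be brought together in one place.

```python
# Hypothetical rows: (customer_id, amount). The sharding key is NOT the
# customer, so one customer may have rows on several shards.
shards = [
    [("alice", 10), ("bob", 5)],
    [("alice", 7), ("carol", 3)],
    [("bob", 2), ("dave", 8)],
]


def total_amount(shards):
    # SUM decomposes cleanly: add up the per-shard partial sums.
    return sum(sum(amount for _, amount in shard) for shard in shards)


def naive_distinct_customers(shards):
    # Adding per-shard distinct counts overcounts any customer
    # whose rows span more than one shard.
    return sum(len({cust for cust, _ in shard}) for shard in shards)


def exact_distinct_customers(shards):
    # An exact distinct count needs every shard's keys in one place.
    seen = set()
    for shard in shards:
        seen.update(cust for cust, _ in shard)
    return len(seen)


print(total_amount(shards))              # 35
print(naive_distinct_customers(shards))  # 6 (overcounted)
print(exact_distinct_customers(shards))  # 4
```

With a handful of rows the merge is trivial. With real volumes, shipping every key from every shard is exactly the performance hit described above.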

Thoughts?

The Inseparables: Data Warehousing and Scrum


In spite of a simple, elegant and straightforward explanation of scrum at http://www.scrum.org/storage/scrumguides/Scrum_Guide.pdf, I continue to see so many professionals talking and writing so much about scrum in so many forums without it making any sense to anybody. Everyone just goes for a certificate, and there is no dearth of certified instructors to meet that demand.

Some of the leaders of scrum in the industry have whispered this over and over again: Scrum is only for young people. I must qualify it as “for young-at-heart people”.

And as per the scrum guide I quoted above: “Scrum is founded on empirical process control theory, or empiricism. Three pillars uphold every implementation of empirical process control: transparency, inspection, and adaptation.”

It is true that only young-at-heart people can go for transparency->inspection->adaptation.

Why do so many data warehousing projects fail? The straightforward answer is: because they don’t follow scrum. Yes, contrary to the popular belief that scrum is for regular software development projects, scrum is most pertinent for data-warehousing projects. I refuse to listen to all those executives who try hard to cover up their data warehousing failures by insisting that data warehousing is unique: that you can’t follow scrum because data warehousing is a huge-lifecycle project, that scrum is about customer deliverables, and that data warehousing will take a long time before customers get a taste of it.

Data-Warehousing attempts to bring a unified picture of the company’s business strategy and the company’s data. When there is no real business strategy, nobody dares to admit it and the project fails. When no thought was given to data in the first place and everything was done to mitigate a moment’s pain, data-warehousing will not be able to integrate the data; it will fail. Most data warehousing projects fail because:

  1. They don’t follow scrum; they are not transparent. How can they be when different departments of the company don’t even talk to each other?
  2. They don’t follow scrum; they don’t welcome inspection. Of course not 🙂
  3. They don’t follow scrum; they are not adaptable. Even if one department tries to adapt, the others will knock it down, won’t they?

Data warehousing failures are a litmus test of the company’s overall health. There is no truly successful company that doesn’t have successful data warehousing.

What is the difference between Data Warehousing, Business Intelligence and Data Mining?

May 14, 2011

Data-Warehousing and Business-Intelligence are often used interchangeably in day-to-day life. There is, however, a significant difference. Business-Intelligence drives the Data-Warehousing requirements and consumes the end product that Data-Warehousing produces. Data-Mining is an advanced level of Data-Warehousing and Business-Intelligence put together.

Data-Warehousing is the process of centralizing all the data sources available in an organization/company (or, at the least, centralizing access to them). This centralization, of course, includes history preservation, removal of ambiguities and optimization for fast access, amongst other things. Data-Warehousing produces a Data-Warehouse: a centralized, unambiguous and easily accessible historical set of all the data sources.

Although commonly understood as the act of creating reports and dashboards, Business-Intelligence is in fact the act of identifying KPIs for various business verticals and their interdependence. Business-Intelligence is the guiding force behind the Data-Warehousing requirements. Business-Intelligence is also a process of discovering expected or unexpected actionable data-points from the Data-Warehouse that are of direct benefit to the business. Creation of reports and dashboards falls more under the scope of Data-Warehousing than Business-Intelligence.

Data-Mining begins where Data-Warehousing and Business-Intelligence end. Data-Mining has not yet been split into two separate segments the way Data-Warehousing (technical work) and Business-Intelligence (business-related work) have. Data-Mining uses the Data-Warehouse in addition to preparing its own sets of sparse/dense, wide and/or normalized data. Data-Mining may also use publicly available data for benchmarking and comparing company data. Like Business-Intelligence, Data-Mining too discovers actionable data-points from the Data-Warehouse that are of direct benefit to the business, but, in addition, it analyzes all the data using sophisticated mathematical/statistical/algorithmic techniques to make startling discoveries that are used more by the central strategic divisions in the company than by the individual business units.