Home > Data Warehousing > What is Data Warehousing: Are you bleeding or somebody is competing?

What is Data Warehousing: Are you bleeding or somebody is competing?


Often folks who are not so proficient with the world of Data-Warehousing ask: What is data warehousing? And I know they mean to ask what is Data-Warehousing/Business-Intelligence/Data-Mining/… Let me concentrate on Data-Warehousing in this post. For the differences and the connections between Data-Warehousing, Business-Intelligence and Data-Mining, please refer to my another post.

Apparently, this seemingly simple question should have a simple answer. But, in fact, sometimes it is difficult to explain it even to seasoned software engineers who are not familiar with the world of Data-Warehousing.

For a not so data-warehousing savvy person, Data-Warehousing is a slightly time-lagged copy of ALL the company’s data from all the sources which is centralized, clean, profiled, standardized and reorganized for faster access. This reorganized and centralized copy makes it easier to connect the relationships between various divisions of the company. Let us take an example; your company uses Siebel CRM with Oracle database for managing the customer’s and lead’s information. A paying customer logs on to companies website to use the company’s online applications which is built on SQL-Server database. The customer is billed for the services provided using an open-source financial package which is build on MySQL database. Now, the customer information is managed in three databases. If you want some information from all the three databases, it is not easy and it is not instant. Data warehousing will make a copy of the relevant data from all the three databases in a centralized Netezza platform. Joining and copying data from different systems take time, so Data-Warehousing solutions are a little time lagged. Once the data is in Netezza platform, it can be queried for instant responses. While copying data to Netezza on a regular basis, it is not necessary to overwrite data. History can be preserved and data from many points in past can be compared together to see a trend. This is another thing data warehousing is good at. Now in Netezza, data can be pre-aggregated during the process of copying, some derived and calculated columns can be provided with some easy visibility into business. Essentially, a data warehouse is more or less same data as consumed and produced by the software applications, but with a twist.

It is not necessary to get into too much technical details, but it is essential to explain “the twist”. Without this explanation, it sounds very suspicious. So let me first talk about hard-drives and RAM in a PC. RAM lets us do the same things on the same data that Hard Drive has or can have. Then why do we need both hard-drive-storage and temporary-storage (RAM)? Because computer engineering is not advanced enough to produce hard-drives as fast as RAM. So RAM is a compromise for slow hard-drives. Similarly, Data-Warehousing is a necessity created by the inability of the OLTP systems to give the business and strategic units of the company what they want. And OLTP systems are not able to produce what all we want because of two things: The inherent architecture and nature of the OLTP system and the need to compete and produce things as fast as possible in a ruthless and competitive market.

Now I know that RAM is not time-lagged and DW/BI/DM is not temporary. This was, after all, just an analogy 🙂

Now getting back to those who are sufficiently advanced in the field of DW/BI/DM, but still haven’t been able to tie all the pieces together. Literally, Data-Warehousing is putting data in a warehouse where “everything” is very easily visible and actionable. This Data-Warehousing doesn’t necessarily has to be a physical copy. A non-clustered index does produce real results even though it doesn’t reorganize the underlying table data. Similarly, data-warehousing solutions can be built on the source-system data itself without having to copy data elsewhere. This is the main philosophy behind the concept of Meta-Data Management (MeDM) which I will touch upon in a different discussion.  Yes, these MeDM solutions could be very RAM intensive and may just reside in RAM only.

Now why would somebody put entire company’s data in a warehouse? As to anything in life, it starts with a necessity and ends up as a fashion. Data warehousing is no exception. Let’s review the real reasons why somebody would do Data Warehousing. When software applications are built, the data consumed and the data produced is stored in databases, log files, emails, excel files, media files, etc.  This data may need to be accessed again for various reasons in various consolidated forms. When this access becomes difficult or impossible, it bleeds and bleeding need to be stopped. OR maybe your competitor is giving more value for the money to its customers by providing the consolidated data in a very attractive and useful way. It is driving your customers away. Somebody is competing and you can’t just sit and watch.

So let’s see the root cause reasons what makes it bleed and how you feel the competition:

1. The databases didn’t have good enough design, data got stored easily, but access to the old data over a sufficiently large date range is too cumbersome and too slow. Bleeding!

2. The teams were big and members didn’t collaborate in a perfect way, so multiple copies of same data exist in different set of schemas. Which one should be considered the source of truth? Bleeding!

3. There is no way to access data because the table joins were not enforced and the foreign key logic is all in the applications which were written by Smart john who doesn’t work for The Perfect Solutions anymore. Bleeding!

4. Each data-set has  a life-cycle. And it continues to move from one state to another. Not all OLTP systems need to maintain all states all the time. Previous state usually gets overwritten by the next state. If I need some visibility into the entire life-cycle, it is not possible because old states are now wiped out. Bleeding!

5. I need to look at the trend in which in my data changes. I also need to do regression analysis and need to discover correlations within my data because the other company is doing it. I need to archive data for longer duration and also need to create some statistical models to discover correlations. Somebody is competing!

6. The business is doing well and we acquired many companies. Each one of them is built on a disjoint set of technology suite. CEO needs to look at inter-dependent consolidated summaries across all these systems which is not trivially possible. Bleeding!

7. Though all is well and all the information and data is available, but it is not all that presentable. The Executives are always short of time and if my presentations don’t catch their attention in 30 seconds, it is not a good solution. We need to produce some good charts, dashboards, widgets, etc.  Bleeding!

In short, DW is to integrate all the data and to give answers to all the questions instantly; anything not instant is slow. It is also about finding patterns to enable business.

Advertisements
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: