Archive for May, 2011

NULLs in Dimensional Modeling

As a rule, it is impossible to always replace NULLs with some other default representations either in Dimension tables or Fact tables. Here are the possible scenerios:

1. ‘Data Not Available’ kind of values can be assigned to a column in a dimension table only if the data type of that column is string of 18 characters at least. A table usually has mix and match of data-types for its columns, not every column can get a default string value so long. So, I experimented with NULLs for numeric columns (because anything other than NULL can have an unintended meaning) and ‘N/A’ for string columns till I hit upon string columns of data type char(2), char(1) etc. And then I ended up using ‘-‘ as my default for the char(1), ‘–‘ for char(2) and ‘N/A’ for all other strings.

 2. For fact tables, you have four things to take care of:

  • Non identifying foreign keys: should be replaced with -1, -2, -3, etc (these are surrogate-ids of the dimension record in case of no match).
  • Identifying foreign keys: Can’t be null by definition.
  • Degenrate dimensions: Depends: same criteria as dimension table attributes.
  • Measures: Can be and should be null when no value is known/available.

Also read:

Different Combinations of Microsoft Products for DW/BI solutions: The hide and seek game never ends.

May 19, 2011 1 comment

It has never quite reached its steady state for DW/BI profesionals since Microsoft released Visual Studio 2008.

1. Visual Studio 2008 worked well with Sql Server 2008, it had support for BIDS (Business Intelligence Development Studio which is essentially SSIS, SSAS and SSRS), but it missed support for SQL Server 2008 Database Projects. It only supported SQL Server 2005 Database projects.

2. So Visual Studio 2008 Service Pack 1 came and nothing changed for DW/BI professionals.

3. And then came  Visual Studio 2008 GDR R2. It fixed all the problems, but it was too late and SQL Server 2008 R2 was ready to kick out Sql Server 2008.

4. Visual Studio 2008 GDR R2 didn’t like Sql Server 2008 R2 for database projects and life was back to mix and match of products; lose some gain some.

5. Visual Studio 2010 came and started working well with Sql Server 2008 R2 database projects. Guess what it stopped supporting BIDS 🙂 They had to do it.

So I now hoping that that Visual Studio vNext will support the upcoming SQL Server 2011 (code named DENALI) and it will also have support for BIDS.

These days, I work on Visual Studio 2008 SP1 GDR2 to be able to connect to SQL Server 2008 R2 for BIDS projects. And I use Visual Studio 2010 for working on the SQL Server 2008 R2 database projects.

So much so for all this product versions from Microsoft.

What is the difference between Data Warehousing, Business Intelligence and Data Mining.

May 14, 2011 2 comments

Often Data-Warehousing and Business-Intelligence are used interchangeably in day-to-day life. There is however a significant difference. Business-Intelligence drives Data-Warehousing requirements and consumes the end product that Data-Warehousing produces. And Data-Mining is an advanced level of Data-Warehousing and Business-Intelligence put together.

Data-Warehousing is the process of centralizing (at the least, the access of) all the data sources available in an organization/company. This centralization, of course, includes history-preservation, removal-of-ambiguities and optimization-for-fast-access amongst other things.  Data-Warehousing produces a Data-Warehouse; a centralized non-ambiguous and easily accessible historical set of all the data-sources.

Unlike commonly understood as an act of creating reports and dashboards, Business-Intelligence is in fact an act of identifying KPIs for various business verticals and their inter-dependence. Business-Intelligence is the guiding force behind the Data-Warehousing requirements. Business-Intelligence is also a process of discovering expected or unexpected actionable data-points from the Data-Warehouse that are of direct benefit to the business. Creation of reports and dashboards falls more under the scope of Data-Warehousing than Business-Intelligence.

Data-Mining begins where Data-Warehousing and Business-Intelligence ends. Data-Mining has not yet been classified into two separate segments like Data-Warehousing (for technical work) and Business-Intelligence (business related work). Data Mining uses the Data Warehouse in addition to preparing its own sets of sparse/dense wide and/or normalized data. Data-Mining may also use publically available data for benchmarking, comparing company data. Like Business-Intelligence, Data-Mining too discovers actionable data-points from the Data-Warehouse that are of direct benefit to the business,  but, in addition, it analyzes all the data using sophisticated mathematical/statistical/algorithmic techniques for making startling discoveries that are used more by the central strategic divisions in the company rather than the individual business units.

What is Data Warehousing: Are you bleeding or somebody is competing?

Often folks who are not so proficient with the world of Data-Warehousing ask: What is data warehousing? And I know they mean to ask what is Data-Warehousing/Business-Intelligence/Data-Mining/… Let me concentrate on Data-Warehousing in this post. For the differences and the connections between Data-Warehousing, Business-Intelligence and Data-Mining, please refer to my another post.

Apparently, this seemingly simple question should have a simple answer. But, in fact, sometimes it is difficult to explain it even to seasoned software engineers who are not familiar with the world of Data-Warehousing.

For a not so data-warehousing savvy person, Data-Warehousing is a slightly time-lagged copy of ALL the company’s data from all the sources which is centralized, clean, profiled, standardized and reorganized for faster access. This reorganized and centralized copy makes it easier to connect the relationships between various divisions of the company. Let us take an example; your company uses Siebel CRM with Oracle database for managing the customer’s and lead’s information. A paying customer logs on to companies website to use the company’s online applications which is built on SQL-Server database. The customer is billed for the services provided using an open-source financial package which is build on MySQL database. Now, the customer information is managed in three databases. If you want some information from all the three databases, it is not easy and it is not instant. Data warehousing will make a copy of the relevant data from all the three databases in a centralized Netezza platform. Joining and copying data from different systems take time, so Data-Warehousing solutions are a little time lagged. Once the data is in Netezza platform, it can be queried for instant responses. While copying data to Netezza on a regular basis, it is not necessary to overwrite data. History can be preserved and data from many points in past can be compared together to see a trend. This is another thing data warehousing is good at. Now in Netezza, data can be pre-aggregated during the process of copying, some derived and calculated columns can be provided with some easy visibility into business. Essentially, a data warehouse is more or less same data as consumed and produced by the software applications, but with a twist.

It is not necessary to get into too much technical details, but it is essential to explain “the twist”. Without this explanation, it sounds very suspicious. So let me first talk about hard-drives and RAM in a PC. RAM lets us do the same things on the same data that Hard Drive has or can have. Then why do we need both hard-drive-storage and temporary-storage (RAM)? Because computer engineering is not advanced enough to produce hard-drives as fast as RAM. So RAM is a compromise for slow hard-drives. Similarly, Data-Warehousing is a necessity created by the inability of the OLTP systems to give the business and strategic units of the company what they want. And OLTP systems are not able to produce what all we want because of two things: The inherent architecture and nature of the OLTP system and the need to compete and produce things as fast as possible in a ruthless and competitive market.

Now I know that RAM is not time-lagged and DW/BI/DM is not temporary. This was, after all, just an analogy 🙂

Now getting back to those who are sufficiently advanced in the field of DW/BI/DM, but still haven’t been able to tie all the pieces together. Literally, Data-Warehousing is putting data in a warehouse where “everything” is very easily visible and actionable. This Data-Warehousing doesn’t necessarily has to be a physical copy. A non-clustered index does produce real results even though it doesn’t reorganize the underlying table data. Similarly, data-warehousing solutions can be built on the source-system data itself without having to copy data elsewhere. This is the main philosophy behind the concept of Meta-Data Management (MeDM) which I will touch upon in a different discussion.  Yes, these MeDM solutions could be very RAM intensive and may just reside in RAM only.

Now why would somebody put entire company’s data in a warehouse? As to anything in life, it starts with a necessity and ends up as a fashion. Data warehousing is no exception. Let’s review the real reasons why somebody would do Data Warehousing. When software applications are built, the data consumed and the data produced is stored in databases, log files, emails, excel files, media files, etc.  This data may need to be accessed again for various reasons in various consolidated forms. When this access becomes difficult or impossible, it bleeds and bleeding need to be stopped. OR maybe your competitor is giving more value for the money to its customers by providing the consolidated data in a very attractive and useful way. It is driving your customers away. Somebody is competing and you can’t just sit and watch.

So let’s see the root cause reasons what makes it bleed and how you feel the competition:

1. The databases didn’t have good enough design, data got stored easily, but access to the old data over a sufficiently large date range is too cumbersome and too slow. Bleeding!

2. The teams were big and members didn’t collaborate in a perfect way, so multiple copies of same data exist in different set of schemas. Which one should be considered the source of truth? Bleeding!

3. There is no way to access data because the table joins were not enforced and the foreign key logic is all in the applications which were written by Smart john who doesn’t work for The Perfect Solutions anymore. Bleeding!

4. Each data-set has  a life-cycle. And it continues to move from one state to another. Not all OLTP systems need to maintain all states all the time. Previous state usually gets overwritten by the next state. If I need some visibility into the entire life-cycle, it is not possible because old states are now wiped out. Bleeding!

5. I need to look at the trend in which in my data changes. I also need to do regression analysis and need to discover correlations within my data because the other company is doing it. I need to archive data for longer duration and also need to create some statistical models to discover correlations. Somebody is competing!

6. The business is doing well and we acquired many companies. Each one of them is built on a disjoint set of technology suite. CEO needs to look at inter-dependent consolidated summaries across all these systems which is not trivially possible. Bleeding!

7. Though all is well and all the information and data is available, but it is not all that presentable. The Executives are always short of time and if my presentations don’t catch their attention in 30 seconds, it is not a good solution. We need to produce some good charts, dashboards, widgets, etc.  Bleeding!

In short, DW is to integrate all the data and to give answers to all the questions instantly; anything not instant is slow. It is also about finding patterns to enable business.