Replicated DBs as Data Sources: Are you sure?

December 2, 2011

Let us assume you have multiple OLTP databases catering to various kinds of applications, and you are provided a replicated database for each of them. First-order thinking makes you feel good about it: you have real-time data to tap into, and depending on your ETL system's capabilities, you may be able to provide near real-time solutions. You are right, as long as you are not looking for 100% accuracy at all times.

The problem with working on replicas is that the data changes every instant. A few scenarios:
1. If you tapped an OLTP table for customer records to build your customer dimension, then by the time you are loading data into a fact table keyed on that dimension, the fact table may refer to a new customer that did not exist when you populated the customer dimension (the sketch after this list shows one way to detect such orphans).
2. You got the list of customers with certain statuses. By the time the fact table is being populated, the statuses of some customers have changed, and your report will not reflect the true picture at the time the fact table was processed.
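
As an illustration of the first scenario, here is a minimal sketch (all table and column names are hypothetical) that flags staged fact rows whose customer has not yet made it into the dimension; such rows can be parked or mapped to an 'unknown' member rather than silently breaking referential integrity:

-- Staged fact rows referring to customers that appeared in the replica
-- after the customer dimension was loaded.
SELECT f.order_id, f.customer_id
FROM   stg_order_facts AS f
LEFT JOIN dim_customer AS d
       ON d.customer_id = f.customer_id
WHERE  d.customer_id IS NULL;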

The root cause of all these scenarios is that we are not working with one particular state of the source DBs at one particular point in time. OLTP applications are fine working against such databases because each transaction touches a narrow slice of data, but data warehouses tap various unrelated domain areas in the OLTP systems and depend on a synchronized state.

Replicas are also prone to lag: while you assume you have processed all the data for a day or an hour, you may actually have received only a subset of it.
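
If you must consume a replica, at least make the lag measurable. A common pattern, sketched below under hypothetical names, is a heartbeat table: the master updates a timestamp row every minute, and on the replica the age of that row bounds the replication lag, so the ETL can refuse to run while the replica is behind:

-- Runs on the replica; replication_heartbeat is written only on the master.
-- TIMESTAMPDIFF is MySQL syntax; use DATEDIFF(second, ...) on SQL Server.
SELECT TIMESTAMPDIFF(SECOND, heartbeat_at, NOW()) AS lag_seconds
FROM   replication_heartbeat
WHERE  id = 1;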

For mission-critical data warehouses, I do not recommend using replicated DBs as data sources (of course, there are always exceptions if you know exactly what you are doing). Backup-restores or snapshots, with a properly defined backup/snapshot datetime that is clearly documented and agreed upon in formal SLAs, are the better solution. This sacrifices faster data availability, but it reduces the pain of inaccuracy, potential lawsuits and painful justifications to auditors.

Schema Validation Framework (SVF)

November 16, 2011

SVF is the heart of any data warehousing project. I just coined the acronym, but I invented this phase of the data warehousing processing engine many years ago and have been using it successfully. Most data warehousing projects don't even get past the start line because they don't have an SVF infrastructure in place; they begin, they struggle and they eventually die. I will explain in a bit how that happens.

SVF is a way to protect your ETL from any/all schema changes in the source systems.

Most DBMSs provide detailed built-in metadata about tables, relationships, constraints, columns, data types, keys, etc. In some of the major DBMSs this metadata is exposed as a separate database (or schema) called INFORMATION_SCHEMA. Selective, transformed, yet detailed information from this metadata for a source-system database can be stored in the SVF layer, which executes prior to the ETL layer. On each subsequent processing run of the source-system database, SVF fetches the metadata and compares it with the previously stored version that the ETL has been made compatible with. If no serious mutations are found, the ETL can proceed; otherwise it can gracefully exit after logging and alerting about the mutations.
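
A minimal SVF sketch in SQL Server's dialect (the database and table names are hypothetical, and a production SVF would also track keys, constraints and relationships):

-- Step 1, run once per ETL release: snapshot the metadata the ETL
-- was built and tested against.
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE
INTO   svf_schema_snapshot
FROM   SourceDB.INFORMATION_SCHEMA.COLUMNS;

-- Step 2, run before every ETL cycle: list mutations since the snapshot.
-- A non-empty result means SVF should log, alert and exit gracefully.
SELECT COALESCE(c.TABLE_NAME,  s.TABLE_NAME)  AS table_name,
       COALESCE(c.COLUMN_NAME, s.COLUMN_NAME) AS column_name,
       CASE WHEN s.COLUMN_NAME IS NULL THEN 'added'
            WHEN c.COLUMN_NAME IS NULL THEN 'dropped'
            ELSE 'changed' END                AS mutation
FROM   SourceDB.INFORMATION_SCHEMA.COLUMNS AS c
FULL OUTER JOIN svf_schema_snapshot AS s
       ON  s.TABLE_NAME  = c.TABLE_NAME
       AND s.COLUMN_NAME = c.COLUMN_NAME
WHERE  s.COLUMN_NAME IS NULL
   OR  c.COLUMN_NAME IS NULL
   OR  s.DATA_TYPE   <> c.DATA_TYPE
   OR  s.IS_NULLABLE <> c.IS_NULLABLE;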

I have seen companies and consultants spending endless time debating the merits and demerits of 3NF vs. Dimensional-Modeling, real-time vs. delayed data processing, Inmon vs. Kimball, ETL tools vs. SQL queries, OLTP vs. OLAP, Business Objects vs. MicroStrategy, and so on. I haven't seen many folks discussing data procurement in detail. Yet, as soon as a data warehousing project starts and the ETL passes its first phase, the source-system database changes. The data warehouse team pushes to put a process in place whereby the application development team informs the data warehouse team about any database schema changes prior to deployment. The application development team almost always fails to do so, for obvious and mostly justifiable reasons: there isn't much ROI on the effort involved in following this process (especially when SVF can automate it). But because SVF is not given a thought, the blame game starts, the problem continues to exist, the ETL continues to break, and the data warehousing SLAs continue to lose their meaning.

I am tempted to write a little about the inappropriateness of slave/subscriber databases as source data for serious data warehousing projects, but I will hold that temptation until my next blog. Meanwhile, it is good food for thought for those who are curious enough :)

A Web Survey: Just a Symbolic Study of Star-Schemas


Download the document below to read the post:

A Web Survey – Just a Symbolic Study of Star-Schemas.docx

Microsoft SQL Server to MySQL: Three important differences that matter.

June 6, 2011

I recently started working on MySQL after close to 10 years of association with MS SQL Server. I am sharing my experience from the initial days: my frustrations, my excitements and some real differences between the two DBMSs.

MySQL has a flexible feel, almost like that of a programming language; somehow you get this instinctive confidence that you can find a workaround for almost anything. There are lots of curious things in MySQL when compared with MS SQL Server, e.g. support for a zero date, eccentric stored-procedure syntax, ENUM data types and so on. The three things that caught my explicit attention, and that could be deciding factors while designing applications, are:

1. SQL Server has mainly two kinds of tables: regular tables and temporary tables. Oh well, we also have table variables, table data types, sparse wide tables, upcoming columnar tables, etc. MySQL has storage engines instead. When you create a table, you specify the storage engine used to create it (see the sketch after this list). For a regular OLTP application, the InnoDB engine supports row-level locking, transactions, etc. But if you don't need all that, you can go for the MEMORY engine, the MyISAM engine or the CSV storage engine, depending on your specific need. It gives you more control, flexibility, optimization and performance. InfiniDB is another MySQL storage engine, mainly columnar, developed by a third-party vendor, and it integrates well with the MySQL query language and processing engine. This open-source thing is cool stuff. And this storage-engine stuff is cooler. There is the XtraDB storage engine, the Aria engine from MariaDB, and you can write and plug in your own if you have the need and the skills. Explore more on MySQL storage engines at http://dev.mysql.com/doc/refman/5.5/en/storage-engines.html

2. Sadly, there are no linked servers in MySQL. Instead there is the Federated storage engine, so you have to replicate the callee tables' schema in the caller database and maintain the sync yourself. I could, however, create a linked server from MS SQL Server 2008 R2 to MySQL. MS SQL Server is definitely a few steps ahead here, including support for the OPENQUERY feature. But then SSIS is a few steps behind in talking to MySQL servers.

3. MySQL has a FIND_IN_SET function. FIND_IN_SET() is optimized to use bit arithmetic (when the first argument is a constant and the second is a SET column) and is used to store and query "multi-value" column data. A definite win over MS SQL Server. It can be a make-or-break decision for lots of applications.
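
A minimal MySQL sketch of the three points above (table names, the connection string and the SET values are all hypothetical):

-- Point 1: the storage engine is chosen per table at creation time.
CREATE TABLE orders (
  order_id INT NOT NULL PRIMARY KEY,
  amount   DECIMAL(10,2)
) ENGINE=InnoDB;   -- row-level locking, transactions

CREATE TABLE session_cache (
  session_id CHAR(32) NOT NULL PRIMARY KEY,
  payload    VARCHAR(255)
) ENGINE=MEMORY;   -- volatile and in-memory, but very fast

-- Point 2: a Federated table mirrors a remote table's schema; the local
-- definition must be kept in sync with the remote one by hand.
CREATE TABLE remote_customers (
  customer_id INT NOT NULL,
  name        VARCHAR(64)
) ENGINE=FEDERATED
  CONNECTION='mysql://user:pass@remote_host:3306/crm/customers';

-- Point 3: FIND_IN_SET() queries "multi-value" column data.
CREATE TABLE articles (
  article_id INT NOT NULL PRIMARY KEY,
  tags       SET('mysql','mssql','etl','olap')
);

SELECT article_id
FROM   articles
WHERE  FIND_IN_SET('etl', tags) > 0;   -- rows whose tag set contains 'etl'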

During this initial part of the journey, I have been riding a sine wave: I feel good about a few things and not so good about others. But I am happy that I have both toys at hand to play with. There is no clear winner. If we take out the cost factor, MS SQL Server has a more professional feel and more management tools; if we bring the cost factor in, you get an almost equivalent DBMS that needs a little extra taming. Beauty before age, or the other way round? The jury is still out.

NULLs in Dimensional Modeling


As a rule, it is impossible to always replace NULLs with some other default representation, whether in dimension tables or fact tables. Here are the possible scenarios:

1. A 'Data Not Available' kind of value can be assigned to a column in a dimension table only if that column's data type is a string at least 18 characters long. A table usually has a mix of data types across its columns, so not every column can take such a long default string. So I experimented with NULLs for numeric columns (because anything other than NULL can carry an unintended meaning) and 'N/A' for string columns, until I hit string columns of type char(2), char(1), etc. I ended up using '-' as my default for char(1), '--' for char(2) and 'N/A' for all other strings.

2. For fact tables, you have four things to take care of (see the sketch after this list):

• Non-identifying foreign keys: should be replaced with -1, -2, -3, etc. (the surrogate IDs of the dimension records reserved for no-match cases).
• Identifying foreign keys: can't be NULL by definition.
• Degenerate dimensions: it depends; the same criteria as dimension table attributes apply.
• Measures: can and should be NULL when no value is known/available.
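
A minimal sketch of these rules (all table and column names are hypothetical): the dimension carries a reserved 'no match' member, and the fact load falls back to its surrogate key instead of a NULL foreign key, while measures stay NULL when unknown:

-- The reserved member, using the char(1)/char(2)/'N/A' defaults from point 1.
INSERT INTO dim_customer (customer_key, customer_id, name, state_cd, gender_cd)
VALUES (-1, NULL, 'N/A', '--', '-');

-- Fact load: a failed dimension lookup maps to -1 rather than NULL.
INSERT INTO fact_sales (customer_key, order_id, sale_amt, discount_amt)
SELECT COALESCE(d.customer_key, -1),   -- non-identifying FK: default member
       s.order_id,                     -- degenerate dimension
       s.sale_amt,
       s.discount_amt                  -- measure: stays NULL when unknown
FROM   stg_sales AS s
LEFT JOIN dim_customer AS d
       ON d.customer_id = s.customer_id;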

Also read:

http://www.kimballgroup.com/html/designtipsPDF/DesignTips2003/KimballDT43DealingWith.pdf

Different Combinations of Microsoft Products for DW/BI solutions: The hide-and-seek game never ends.

May 19, 2011

Things have never quite reached a steady state for DW/BI professionals since Microsoft released Visual Studio 2008.

1. Visual Studio 2008 worked well with SQL Server 2008; it had support for BIDS (Business Intelligence Development Studio, which is essentially SSIS, SSAS and SSRS), but it missed support for SQL Server 2008 database projects. It only supported SQL Server 2005 database projects.

2. So Visual Studio 2008 Service Pack 1 came, and nothing changed for DW/BI professionals.

3. And then came Visual Studio 2008 GDR R2. It fixed all the problems, but it was too late: SQL Server 2008 R2 was ready to kick out SQL Server 2008.

4. Visual Studio 2008 GDR R2 didn't like SQL Server 2008 R2 for database projects, and life was back to mix-and-match of products; lose some, gain some.

5. Visual Studio 2010 came and started working well with SQL Server 2008 R2 database projects. Guess what: it stopped supporting BIDS :) They had to do it.

So now I am hoping that Visual Studio vNext will support the upcoming SQL Server 2011 (code-named Denali) and will also have support for BIDS.

These days, I work in Visual Studio 2008 SP1 GDR2 to be able to connect to SQL Server 2008 R2 for BIDS projects, and I use Visual Studio 2010 for the SQL Server 2008 R2 database projects.

So much for all these product versions from Microsoft.

What is the difference between Data Warehousing, Business Intelligence and Data Mining?

May 14, 2011

Data-Warehousing and Business-Intelligence are often used interchangeably in day-to-day life. There is, however, a significant difference: Business-Intelligence drives Data-Warehousing requirements and consumes the end product that Data-Warehousing produces. And Data-Mining is an advanced level of Data-Warehousing and Business-Intelligence put together.

Data-Warehousing is the process of centralizing (at the least, the access to) all the data sources available in an organization. This centralization, of course, includes history preservation, removal of ambiguities and optimization for fast access, among other things. Data-Warehousing produces a Data-Warehouse: a centralized, non-ambiguous and easily accessible historical set of all the data sources.

Contrary to the common understanding of it as the act of creating reports and dashboards, Business-Intelligence is in fact the act of identifying KPIs for various business verticals and their inter-dependence. Business-Intelligence is the guiding force behind Data-Warehousing requirements. It is also the process of discovering expected or unexpected actionable data points in the Data-Warehouse that are of direct benefit to the business. Creating reports and dashboards falls more under the scope of Data-Warehousing than Business-Intelligence.

Data-Mining begins where Data-Warehousing and Business-Intelligence end. Data-Mining has not yet been split into two separate segments the way Data-Warehousing (the technical work) and Business-Intelligence (the business-related work) have. Data-Mining uses the Data-Warehouse in addition to preparing its own sets of sparse/dense, wide and/or normalized data. It may also use publicly available data for benchmarking and comparing company data. Like Business-Intelligence, Data-Mining discovers actionable data points in the Data-Warehouse that are of direct benefit to the business, but in addition it analyzes all the data using sophisticated mathematical/statistical/algorithmic techniques to make startling discoveries that are used more by the central strategic divisions of the company than by individual business units.