Posts Tagged ‘shards’

Irony of doing data warehousing on sharded data

Data is sharded for cost and performance reasons. If cost is not a problem, one can always get a powerful hardware to get whatever performance is needed. Sharding works because the business-needs align with the sharding key. Sharding key partitions the data in a self sufficient way. If business grows and this self-sufficiency is not maintained, sharding will lose its meaning.

ย Data warehousing on sharded data is a tricky thing. Data warehousing unifies data and aligns business definitions. Data warehousing will attempt to find distincts and aggregates across all the shards. Well, it is not possible without a performance hit. So, real time data warehousing is out of question. For a not so excruciating performance, data warehousing will need to bring in all the data:

  • Either on a single physical computer. This will have a cost hit. You avoided the cost hit for your OLTP app, but if you need a good data-warehouse, you have just deferred your expenditure for future. It could be well thought out decision or a shocking discovery; depending on how much planning you have done ahead of time. One saving grace is that the cost hit on this data- warehousing hardware can be relatively less if you don’t need the solutions in (near) real time.
  • Or on a single logical computer (hadoop or in-memory kind of solutions). It is still sharding in some sense, isn’t it ๐Ÿ™‚ And you have lost the real-time-ness and the ACID-ness ๐Ÿ™‚