What are the main reasons to split a Data Warehouse into multiple databases?

You are definitely on the right track! 320GB is not huge for a database, particularly a DW.

1) Current db is poorly optimized, little documentation, sub-optimal datatypes, sub-optimal indices.

My rookie thoughts: Aren't these problems irrelevant to splitting the database? They are simply problems that need to be solved on their own either way.

This is bang on the money. Splitting one large(ish) poorly organised, optimised and documented database into 7 poorly organised, optimised and documented databases is a waste of time! You need to tackle the root of the problem!

2) There are 372 objects in the current database which makes it slow.

My thoughts: This hardly seems large in my opinion.

Again, you are correct! 372 is positively small in terms of number of objects - many large servers have tens of thousands. From here:

The sum of the number of all objects in a database cannot exceed 2,147,483,647.

Your 372 divided by ~2.1E9 is ~1.7E-7 - so no worries on that score! :-)
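
If you want to check that arithmetic yourself, here is a throwaway sketch in Python. The only inputs are the limit quoted above and the object count from your question; the variable names are just mine.

```python
# Limit quoted above: the total number of objects in a database (2**31 - 1).
MAX_OBJECTS = 2_147_483_647
# Object count reported in the question.
current_objects = 372

# Fraction of the documented limit that the current database uses.
fraction_used = current_objects / MAX_OBJECTS
print(f"{fraction_used:.1e}")  # prints 1.7e-07
```

In other words, you are using well under a millionth of the documented ceiling.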

3) One database is harder to document and draw schema diagrams for than 7 databases (we will have views that will span multiple database).

My thoughts: .... This seems completely ridiculous to me, but maybe I'm wrong. We've already organized our data warehouse by 13 'source system' schemas.

Again, you're correct. If there are 372 entities with inter-relationships between them, you'll need to document and diagram them. It's going to have an inherent degree of complexity. What you can do is try to split your overall system into subsystems and document them and then try and fit them into the bigger picture - great oaks from little acorns grow!

4) One database will lead to more database deadlocks.

-- Isn't this problem also completely irrelevant to having multiple databases? It's my understanding that deadlocks occur at the table level (actually usually even just the row level, but eh). Even then, all our data inserts happen at midnight, all our selects downstream to the BI happen at 2 am. Having two processes update the same table is irrelevant to multiple databases, is it not (deadlock would happen either way)? Also, I personally have seen no evidence of table deadlocks occurring during normal operations.

What you will lose in the multiple-database scenario is ACID transactions within the same schema - OK, you can have two-phase commit across databases, but it's not as robust as transactions within the same schema (IMHO). I'm not sure there is a valid reason to hive off tables if they're necessary for your requirements.
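
To make the single-database guarantee concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for your server (an assumption on my part, since I don't know which engine you run). The table names `dim_customer` and `fact_sales` are hypothetical. A load that touches two tables either lands in both or in neither:

```python
import sqlite3

def load_batch(conn, fail_midway):
    """Insert into two tables inside one transaction; both roll back on error."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO dim_customer (id, name) VALUES (1, 'Acme')")
            if fail_midway:
                raise RuntimeError("simulated failure between the two inserts")
            conn.execute("INSERT INTO fact_sales (customer_id, amount) VALUES (1, 9.99)")
    except RuntimeError:
        pass  # the failed batch left no partial rows behind

def row_counts(conn):
    """Return (rows in dim_customer, rows in fact_sales)."""
    dim = conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
    fact = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
    return dim, fact

# Hypothetical two-table warehouse living in ONE database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE fact_sales (customer_id INTEGER, amount REAL)")
conn.commit()

load_batch(conn, fail_midway=True)
after_failure = row_counts(conn)   # (0, 0): the first insert was rolled back too
load_batch(conn, fail_midway=False)
after_success = row_counts(conn)   # (1, 1): both inserts committed together
print(after_failure, after_success)
```

Put those two tables in separate databases and the same guarantee needs a distributed (two-phase) commit, which is more moving parts for no obvious gain.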

You appear to be talking about writes blocking reads? Well, you also appear to have a batch process at midnight followed by a querying process at 02:00? If you can make transactions/tables read only, this will take some load off the server engine as it is processing your data. Only you can tell if this can be applied to your scenario!

5) Database technical ownership/ ownership.

It's only the two of us that work on the database. It's possible he wants to really segregate our 'fiefdoms'. Really, hasn't been an issue, but can't user permissions be determined at the schema level anyway?

Certainly, ownership is at the table level and access can, depending on your server/version, be granted on a column and/or a row level - so the business of ownership is a complete red herring! If you are a server DBA performing a reorganisation (as opposed to simply scheduling backups and other mundane tasks), then you will need "access all areas"!

You should have a comment on every table and field in your system - you can put "ownership" (in the organisational as opposed to database sense of things) in there - commenting tables and fields is an excellent first step to documenting a system - it becomes self-documenting!
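
As a rough illustration of what that commenting step might look like, here is a small Python sketch that generates `COMMENT ON` statements from a dictionary. Big assumptions: your server accepts the `COMMENT ON TABLE ... IS '...'` syntax (PostgreSQL and Oracle do; other engines use different mechanisms), and the helper `comment_statements`, the table `sales.fact_orders` and its columns are all made up for the example.

```python
def comment_statements(table, table_comment, column_comments):
    """Build COMMENT ON statements for one table and its columns."""
    def quote(text):
        # Escape single quotes by doubling them, as in standard SQL string literals.
        return "'" + text.replace("'", "''") + "'"

    statements = [f"COMMENT ON TABLE {table} IS {quote(table_comment)};"]
    for column, comment in column_comments.items():
        statements.append(f"COMMENT ON COLUMN {table}.{column} IS {quote(comment)};")
    return statements

# Hypothetical table: the organisational owner is recorded in the table comment.
statements = comment_statements(
    "sales.fact_orders",
    "Owner: BI team. One row per order line.",
    {"order_id": "Source system's order key.", "amount": "Net amount in EUR."},
)
for statement in statements:
    print(statement)
```

Keeping the descriptions in a file like this also means they can live in version control and be reapplied after a rebuild.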

What ARE valid reasons for splitting a Data Warehouse into multiple databases?

There can be many reasons - some are associated with multi-tenancy (both in terms of machine resources (CPU, RAM, HDD and Network) and client confidentiality or requirements). Have a look here and also google "database multi-tenancy" or similar.

Everyone says it, but it is a struggle - "documentation is very important"! As a first step, document your tables and fields in the comments. Produce ERD diagrams for all of your subsystems. Don't let anything new into the system without these steps being implemented. Best of luck in your new role!


While it sounds like a classic straw man tactic being taken by your colleague, could he or she mean the creation of formal data marts when saying "splitting up the data warehouse"?

The two main approaches to Data Warehousing are attributed to Ralph Kimball and Bill Inmon. Here are a couple of high-level overviews ([1], [2]) on the difference between these two common approaches if you've got a few minutes to burn.

What I believe may be applicable to your situation is that Bill Inmon's approach calls for the formal creation of Data Marts that the reporting tool(s) pull(s) data from. These Data Marts are designed for specific departments or business units to access exclusively, and I think this may be what your colleague is trying to move towards. The identical nature of the copies is odd, but it may be easier to create a copy of the data warehouse in its current form and then only load a specific department's data into said copy going forward?

From what you've provided, it sounds like your current data warehouse is using Kimball's approach where the Data Marts are a logical subset of data within the dimensional data warehouse that your reporting tool accesses directly. These two design approaches have their pros and cons, and hopefully the crux of your colleague's issue is that he or she is just more comfortable with Inmon's approach.

Hopefully this is just a misunderstanding of terms, and an in-depth discussion of these two different approaches with your colleague will lead to some clarifications about the hurdles he or she is trying to move past.