Data warehouse server. How do you calculate RAM/CPU specifications?

Great question, and I did a session about this at TechEd a few years ago called Building the Fastest SQL Servers:

https://channel9.msdn.com/Events/TechEd/NorthAmerica/2012/DBI328

In it, I explain that for data warehouses, you need storage that can provide data fast enough for SQL Server to consume it. Microsoft built a great series of white papers called the Fast Track Data Warehouse Reference Architecture that goes into hardware details, but the basic idea is that your storage needs to be able to provide 200-300MB/sec sequential read performance, per CPU core, in order to keep the CPUs busy.
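As a quick sanity check, you can turn that rule of thumb into a target figure for a given box straight from a DMV. A minimal sketch (the 200-300 numbers are the rule above; everything else is just arithmetic):

    -- Translate the 200-300MB/sec-per-core rule of thumb into a target
    -- sequential read figure for this server. cpu_count is logical processors,
    -- so adjust for hyper-threading if you size per physical core.
    SELECT
        cpu_count              AS logical_cpus,
        cpu_count * 200        AS target_mb_per_sec_low,
        cpu_count * 300        AS target_mb_per_sec_high
    FROM sys.dm_os_sys_info;

On a 4-core box that works out to roughly 800-1,200MB/sec, which is where the list below gets its number.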

The more of your data that you can cache in memory, the slower storage you can get away with. But you've got less memory than required to cache the fact tables that you're dealing with, so storage speed becomes very important.

Here are your next steps:

  • Watch that video
  • Test your storage with CrystalDiskMark (Here's how) - there's also a rough in-SQL cross-check sketched just after this list
  • With 4 cores, you'll want at least 800MB/sec of sequential read throughput
  • If you don't have that, consider adding memory until the pain goes away (and caching the entire database in RAM isn't unthinkable)
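If you want that cross-check without leaving SQL Server (assuming the instance has been up for a while doing representative work), the file stats DMV shows cumulative read volume and average read latency per data file since startup. It's no substitute for a proper CrystalDiskMark run, but ugly read latencies here are a red flag:

    -- Cumulative reads and average read latency per data file since instance startup
    SELECT
        DB_NAME(vfs.database_id)                             AS database_name,
        mf.physical_name,
        vfs.num_of_bytes_read / 1048576                      AS mb_read_since_startup,
        vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0)   AS avg_read_latency_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id
    ORDER BY avg_read_latency_ms DESC;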

Say you've got a 200GB database that you're dealing with, and you can't get enough storage throughput to keep your cores busy. It's not unthinkable to need not just 200GB of RAM, but even more - because after all, SSIS and SSAS really want to do their work in memory, so you have to have the engine's data available, plus work space for SSIS and SSAS.

This is also why people try to separate out SSIS and SSAS onto different VMs - they all need memory simultaneously.


The Fast Track Data Warehouse Reference Guide for SQL Server 2012 is actually a bit out of date, especially if you're moving to SQL Server 2016 (really? Call me) - not just in terms of age, but also features.

In SQL Server 2012, the version on which Fast Track is based, you could only have non-clustered columnstore indexes. These are separate structures from the main table, so they incur additional storage and processing overhead because they hold extra (albeit compressed) copies of the data.

From SQL Server 2014 onwards, you can have clustered columnstore indexes. These offer massive compression and a potential performance boost for aggregate/summary queries. They are absolutely appropriate for fact tables, so your 32GB fact table could look more like ~8-12GB. YMMV. That changes the landscape slightly, doesn't it?

Looking at your table (and thumb in the air), you could maybe get away with 32GB, but I would shoot for 64GB (it's not like you're asking for 1TB), the justification being that this lets you hold your largest table in memory while leaving room for growth and for the other services. You don't have to tell them about the compression. One thing to bear in mind with sizing is that you're not just sizing for your data now, but for how it will look, say, a year from now.

Also note, however, that performance for point lookups can be horrendous on columnstore. As you're moving to SQL Server 2016 you can add additional indexes, or you could always consider Columnstore Indexes for Real-Time Operational Analytics, although you'll need more memory for that : )
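If you do get the chance to test this, converting a copy of the fact table is close to a one-liner. A hedged sketch - dbo.FactSales is a made-up name, and if the table already has a rowstore clustered index you'll need to drop it first (or recreate it by name WITH (DROP_EXISTING = ON)):

    EXEC sp_spaceused N'dbo.FactSales';        -- size before

    -- Requires SQL Server 2014 or later
    CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales ON dbo.FactSales;

    EXEC sp_spaceused N'dbo.FactSales';        -- typically a fraction of the rowstore size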

How are you getting on with the CTPs, by the way? Currently at CTP 3.3, it has most of the features you might want to use available. You say you don't have the resources for trials, but you could get a Windows Azure trial, spin up a VM, create some sample data, and test the compression and the performance of key features and queries, all for free. Or if you have an MSDN license, this is built in.

In summary, size to allow your largest table to be in memory (plus other stuff) or set up a simple trial (for free in the cloud) to get the hard evidence you are after. Remember to deallocate your VM when you're done : )


Presumably, while developing and maintaining the ETL packages on local development machines, you sometimes use test data of similar or larger scale to what you expect in production - and if not, perhaps you should consider doing so (anonymised real data, or algorithmically generated test data if your real data is at all sensitive).
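If generating the data is the sticking point, a few million rows of synthetic fact data only take a minute to knock up; everything below (table name, columns, distributions) is made up purely for illustration:

    -- Numbers CTE cross-joined with itself: 2,000 x 2,000 = 4 million rows
    WITH nums AS (
        SELECT TOP (2000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
        FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b
    )
    SELECT
        a.n                                    AS customer_key,
        b.n                                    AS product_key,
        DATEADD(DAY, a.n % 730, '20140101')    AS order_date,
        CAST(b.n % 500 AS decimal(10, 2))      AS amount
    INTO dbo.FactSales_Test                    -- hypothetical target table
    FROM nums AS a CROSS JOIN nums AS b;

Skew the values however you like so the shape resembles your real data.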

If this is the case, you can run the process under various memory conditions and profile it to find the point where more RAM stops making a massive difference. As useful as rules of thumb and educated guesswork are, benchmarking and profiling can provide much more concrete answers, and as a bonus may highlight obvious bottlenecks that are easy to optimise out. Of course, your dev/test environments might not exactly match production, so you may need to use experience to interpret how the results will change.

If you are running SSIS on the same machine as the databases then you should definitely make sure the SQL Server engine instances are set to never claim all the memory. Not only can memory starvation cause OOM errors in SSIS, long before that point it can cause significant performance issues as it spools buffers to disk when it could otherwise keep them in RAM. How much you need to reserve for SSIS and other tasks will vary greatly depending on your process, so again profiling is a good way to gauge this. It is often recommended that you run SSIS on a separate machine to make this easier to manage, though you may have network throughput and licensing issues to consider there.
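Capping the engine is just an instance-level setting. A minimal sketch - the 26GB figure here matches the suggestion at the end of this answer, but you should pick a value based on your own profiling:

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    -- Leave the remainder of the box for SSIS, SSRS and the OS
    EXEC sp_configure 'max server memory (MB)', 26624;   -- 26GB
    RECONFIGURE;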

Update

If, as per your comment, there isn't resource available to perform realistic benchmarks to gauge where performance drops off (and/or where OOM errors and related problems start to happen) when too little RAM is allocated, things get considerably more hand-wavy without intimate knowledge of the warehouse and ETL processes. A rule of thumb for the warehouse database itself: you want enough RAM to hold the entirety of the most commonly used indexes, plus some to allow for less commonly used data, and more again to allow for expected growth in the near/medium term.

Calculating this can be a faff - sp_spaceused won't break things down by index, so you'll have to query sys.allocation_units and friends directly yourself. There are a few examples out there to get you started; http://blog.sqlauthority.com/2010/05/09/sql-server-size-of-index-table-for-each-index-solution-2/ looks like the best of the first few that came up from a quick search.
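For completeness, something along these lines (a sketch in the same spirit as that post, not a drop-in replacement for it) will give you reserved space per index:

    -- Reserved MB per index, largest first
    SELECT
        OBJECT_NAME(i.object_id)           AS table_name,
        i.name                             AS index_name,
        SUM(au.total_pages) * 8 / 1024     AS reserved_mb
    FROM sys.indexes          AS i
    JOIN sys.partitions       AS p
        ON  p.object_id = i.object_id
        AND p.index_id  = i.index_id
    JOIN sys.allocation_units AS au
        ON (au.type IN (1, 3) AND au.container_id = p.hobt_id)
        OR (au.type = 2       AND au.container_id = p.partition_id)
    WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
    GROUP BY i.object_id, i.name
    ORDER BY reserved_mb DESC;

Cross-reference that against what your queries actually touch to decide what really needs to stay cached.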

On top of the needs for running the warehouse DB itself, remember to add on the RAM requirements for SSIS if it is to be running on the same machine and make sure SQL Server has RAM limits in place to ensure that this memory is actually available to SSIS.

From the overall data sizes you list, my gut suggests that 32GB would be the absolute minimum I'd recommend for just the database engine and SSIS, setting the SQL instance(s) to use at most 26GB of it. As you are also running SSRS and other services on the same machine, a sensible minimum with some future-proofing would be 64GB (two thirds of your current data should fit in that after the other services and reservations have their cut). Obviously, quoting my gut won't get you very far in discussions with your infrastructure people though...