Regardless of how good you live, eventually, you will be summarizing large amounts of data with a countless number of =SUMIFS(). Let’s assume you have tens of thousands of records that you aggregate based on criteria in 6 columns. If you do it in a single cell Excel is going to be ready in a heartbeat. However, you have to do it in 24 columns on 500 rows for 25 sheets. As a result, your spreadsheet is painfully slow and takes forever to calculate. So what can you do if =SUMIFS() slow down the spreadsheet you work on?
In a hurry? Don’t have the time to read long posts?
Here is an outline for the steps to follow:
- Change your source data structure to have (much) lower volume — export of million rows can be pivoted or delivered already aggregated into much more compact set. Best way to tackle the issue is to reduce the size of the database.
- Be obsessed with not having to calculate the =SUMIFS(). This includes:
- Setting proper rules when to go into the calculation;
- Getting as much figures as possible from cells & sheets where figures have already been crunched.
- Getting rid of anything non-essential in terms of aggregations that put extra strain on your CPU.
- Last but not least — if this is purely reporting exercise, try to move it to a BI system. Excel is great but it doesn’t mean other platforms do not exist — often significantly better at handling tons of data.
A few words before I start
I guess I have been avoiding the situation described in the beginning simply because (largely unconsciously) I have always been practising some of the advice I am giving next. OK, not always — only after the first time I created a dreadful report — and then concluded that one’s job is tremendously boring when stuck waiting for a calculation to finish. Sadly, I recently inherited one such monstrosity of a file and now I have to deal with it. On the bright side, I get to write a new blog post (long-delayed because of my new job).
Before I start, a handful of notes on the =SUMIFS():
- It is not only fast but smart as well — I don’t know precisely how it is coded but my experience (and that of many others) shows that if you have empty elements in the ranges, they won’t have too much impact on the speed of execution. Often people would advise you not to include the entire row/column in the function arguments and only stick to the actual range size you are summing. This is good advice but I would argue that
have performance similar enough.
This doesn’t mean I encourage the first type of addressing (quite the opposite) but when focusing on speed only, I wouldn’t mind it.
- In its most classic syntax, shown in the previous point, the function is volatile. Essentially it means that when Excel recalculates something, anywhere in your open workbooks (plural!), it will cause the =SUMIFS() to recalculate as well. This is important as it is an obvious drag on the performance of even small tasks. Probably, it happens to be one of the main reasons for your suffering.
- As of 2020 using this formula in most offices is still considered advanced. I will debate this another time but I am making the point to come back to it later.
So… what do we do?
The first question to ask yourself is if you really need to =SUMIFS()?
This is a legit question. Frequently people would take the entire data export and put in a sheet. I get it — it is convenient to be there when you need to investigate figures; it is quick and easy; and you have been working that way since dial-up was the way to connect to the Internet. However, that giant DB is static and constantly looping through it is inefficient.
Compress your data
Data can be structured in a way that is good enough to feed your spreadsheet while not being too taxing on the performance. In my case, I used Power Query to manipulate the data first. The same can be achieved with pivots or (most probably) with the software you use to make your extracts. As a rule of thumb, the more sophisticated your tool, the better. I used PQ because (apart from it being the tool I have) I can easily make different tables and append them to one another — something hardly achieved with just pivots. However, when years go I had this option, I would happily do the same in SAS while getting figures from the DWH.
Cut down on the number of rows
You want to get from 700000 rows of this (available in this archive)
to as fewer lines of sales aggregations as possible:
This pivoting can dramatically cut the amount of data. Further, if you are not certain what kind of column combinations you will have to use when summing — granularity is not lost.
Reduce the number of conditions
Having 6 criteria columns means that Excel will make the top-to-bottom trip half a dozen times before it arrives at the figure you need. This can be avoided by combining the criteria columns into 1. Such folding is possible only when granularity is not a problem (and even then there are ways around it). You can convert the pivot above into this form:
From there you can still you use =SUMIFS() or the good old INDEX/MATCH combination.
When I used the last approach on the report I was fighting, calculation time dropped from several minutes to a few seconds.
There is a serious drawback with this solution. You are probably not working alone. Therefore, you have to consider your peers when designing your spreadsheet. Most Excel users don’t know about the existence of Power Query and they are even less aware of the other software you might involve. Therefore, you might find yourself fighting not only Excel’s calculation engine but human inertia as well. This is a very different ball game.
Due to various factors, you might have to stick to =SUMIFS() and a large database. This doesn’t mean there is nothing you can do.
SUMIFS only when needed
I might sound like Captain Obvious but people would often just drag their formula and wait for the result. Perfectly fine as long as the large amount of underlying data is not calling for a more sensible approach:
- First, check if the formula cell really need be calculated. Putting =IF() before a CPU-heavy expression has the potential to prevent it from going off. As long as your logical tests are not equally intensive as the calculations they prevent, their presence is justified. As a general rule, I would not shy away from adding extra row/col if the flags in it help me to offload some work from the processor.
- Clean up your worksheet from all calculations that are not really needed. Sometimes we would add rows/columns simply because we consider them good to have. If they contribute to the slow down, it is better to trim down.
- Check if data can’t be obtained from figures already calculated. A subtotal, for example, can be the very light sum of its elements or an entirely new =SUMIFS(). Take the time to get rid of similar cases and make your file more readable at the same time.
- Keep in mind that somewhere else in your workbook, the data might be readily available. Your division level report can scan your source data or simply sum the sheets/figures for the subdivisions.
Move it all to BI
Unless your design is spectacularly bad, chances are that waiting for =SUMIFS() to finish means you had to push Excel beyond its limits. Those are easily reached once data starts to grow from ‘ordinary’ towards ‘big’. At one point, changing the platform might be a much better choice than struggling with a spreadsheet. This is true both in reporting and modelling. You might have reached the point where a BI solution, albeit not as flexible as Excel, would fare better.
Most of the BI tools:
- Are running on a servers way more powerful than your base station/laptop;
- Go hand in hand with data compression or at least better data architecture;
- Would have the entire job scheduled or ran at a single click;
- Can easily scale up to deal with whatever volume you throw at them;
- And even though they are not as easily maintained as a spreadsheet, this might actually discourage too frequent changes — something much appreciated if the management acts erratically.
Some people argue that they actually need to make changes and massage the data once they see the result. I would hardly believe it is humanly possible to deal manually with more data that Excel can stomach. This situation would suggest a more fundamental problem. Taking a step back to reconsider your way of doing the job might be the appropriate step.
Somewhere in the Microsoft documentation, I read that people tend to switch off the automatic calculation if it takes more than 3–10 seconds for Excel to crunch the numbers. This is easy to reach. Try hard not to be in that place.