Techniques to Get Past the Limitations – Part 3: Performance Tips and a Wish List
This is the third and final part of a guest post series by Richard Leeke from Equinox Ltd.
However, these articles only discussed one of the caveats raised in Robert’s original Site Catchment Analysis with Tableau 6. I can’t do anything about the second - missing functionality of triggering an action after the change of a parameter. But in this article I discuss the various techniques I used to optimize the performance of the Tableau workbook. Whilst the discussion focuses on the Site Catchment Analysis discussed in this series, there are several that are general hints which may apply to lots of your Tableau workbooks. I actually only worked out a couple of these while working through this analysis, but have already adopted them in other workbooks with great success.
So, today’s article wraps up my guest post series with a couple of tips to increase Tableau’s performance and (especially for the Tableau folk reading this blog) a little wish list for future versions of Tableau.
I have used a number of techniques to speed up the various versions of Robert’s workbook, including a few workarounds for “tuning opportunities” with Table Calculations, some of which I discovered in the course of exploring options with this workbook.
- Do as much work in the database as possible
As a general rule, the more of the work that can be done in the database, the faster the view will be, especially if that allows filters to be evaluated in the database, rather than having to pull large results sets back to Tableau for processing. Furthermore, doing the work in the original query against the primary data source is almost certain to be quicker than combining results via Data Blending. But of course Table Calculations and Data Blending allow you to do all sorts of analysis which can’t be done directly in the back-end database.
Whilst it is generally true that it is faster to do the work in the database, there will always be exceptions, and performance will definitely depend on the performance of the database. There will doubtless be lots of exceptions, where database performance means that you want to do as little work as possible there.
- Problems with TOTAL()
In some circumstances using TOTAL() results in one or more long delays on refreshing a sheet. I noticed this by inspecting the Tableau logs when trying to find out what was taking so long on one of my first attempts to add the summary totals to the “Covered” sheet. This is definitely a bug and should be fixed in an upcoming maintenance release. In the meantime, if a sheet starts going surprisingly slowly after adding a TOTAL() table calculation, try replacing the call of TOTAL() with WINDOW_SUM().
- Window function speed
But having said that, the next performance issue is that the WINDOW_XXX() functions do not currently scale well. I first noticed this during the beta, and worked out that the time to execute the window functions increases proportional to the square of the number of rows in the partition being analyzed. This makes these functions effectively useless when analyzing large volumes of data, so I had largely avoided using them since I came across the issue, thinking I would wait until the issue is fixed.
But I have been a little surprised how long it has taken for this issue to float to the top of the priority list, so when I found that I really needed them to get the analysis going properly for Robert’s workbook I decided to have another look at the issue.
I work in system performance engineering (which basically means ideally avoiding but more often finding and fixing system performance issues). Having the response time increase with the square of the amount of work to do is a very familiar issue. In fact this example of “square-law performance” and the process of working out what was happening and finding a workaround made an ideal little case-study for the Performance Engineering section of my company blog: Square-Law Performance. Sorry Tableau folk – but it’s a great example and it’s only one little blemish on a great release.
The WINDOW_XXX() functions perform various aggregates over a “window” onto the rows in each partition of the data being analyzed. The window may be the whole partition or it may be a different subset of the rows for each row in the partition – for example a sliding window of <N> rows either side of the current row. In defining the partitions, one of the properties which can be specified is the sort order, so the partition needs to be sorted to evaluate the function. It only needs to be sorted once per partition, because the sort order is invariant, even if the window moves as each row is evaluated – but currently the window is re-sorted for each row, which is the source of the slow-down.
For small partitions, or if the window is defined as a narrow range around the row being processed, this extra sorting is utterly negligible. But in a lot of cases, such as the uses in this analysis, the required window is the whole partition. For the first version of this analysis, a few calls to WINDOW_SUM() over the full 8,000 rows of location data resulted in the sheet taking three minutes to refresh. For large partitions this can make the functions unusable. The analysis I was doing when I first discovered this involved a partition of 120,000 rows, which took three and a half hours to evaluate (though I only worked that out after initially assuming that Tableau was in a loop and killing it a few times).
Luckily, for cases where the required window is the whole partition, there are a couple of workarounds to avoid the problem. I say “a couple” because there are two common cases.
Result only needed for one row in partition
A very common case, which applies in some of the cases in this workbook, is that all that is required is to evaluate an aggregate once for the whole partition. The field [Population within Radius (blended)] on sheet “Covered (blended)” is an example of this sort. All I need is to evaluate the total population within the selected area, and I only need it once, to display in the sheet header.
The basic window calculation that I need is this:
But that will be evaluated 8,000 times, including 8,000 expensive sorts of the whole partition.
At first glance, it appears that all that is needed to fix it is this:
IF (FIRST()==0) THEN
That just returns the result for the first row in the partition and NULL for all other rows, but unfortunately it turns out to take almost as long as the original. It took me quite a while to guess my way past this one, but eventually I worked out what must be happening. Tableau actually evaluates the expression in a couple of passes. First it gets the window ready. In this case, because I haven’t specified the optional parameters defining the range of the window, it defaults to the whole partition – which it duly sorts. Then it gets on with actually evaluating the expression, which means that for the first row it calculates the result and for all other rows it just ignores the nicely sorted rows.
Having guessed that that might be what it was doing, I tried another tack. For the first row, where we want the result, specify the entire partition. For all other rows, make the window just the single row being processed, like this:
IF (FIRST()==0) THEN
WINDOW_SUM(SUM([Population]), 0, IIF(FIRST()==0, LAST(), 0))
Sure enough, that fixed it. My three and a half hour case came down to about five seconds, and the calculations in the “Covered (blended)” sheet are now too fast to worry about.
Result needed for all rows in partition
The other common case, which applies elsewhere in this workbook, is that the calculation is needed over the whole partition for all rows in the partition and the answer will be the same in each case.
In this case, rather than evaluating the result for every row, we can just evaluate the result for the first row in the partition and copy it for every other row, whilst using the same trick to minimize the cost of the sorts on the other rows which are not going to be used. That can be achieved like this:
0, IIF(FIRST()==0, LAST(), 0)))
I have raised both the TOTAL() and the WINDOW_XXX() issues with Tableau and expect that both are likely to be resolved in the near future. In the meantime, the above tricks provide adequate workarounds for all the cases I’ve encountered so far.
- Optimizing Data Extracts
One last speed tip: be sure to use the “Optimize” option “Data->Extract->Optimize” on the extract after adding table calculations. Tableau checks all changes in calculated fields and pre-calculates and stores any appropriate ones (which I think basically means most calculated dimensions as long as they don’t depend on parameters or data blending). It also deletes any old versions of calculated fields which are no longer current. In some circumstances “Optimize” can make a substantial performance improvement.
One thing to watch out for is that if you are using the technique I described in the second post of this series of making duplicate copies of a connection to a data extract, you need to ensure that all calculated fields exist in the connection you select for the “Optimize” or “Refresh” options, otherwise Tableau will remove fields which it thinks are unused. I just make a habit of defining all calculated fields in the primary data source and copying them as necessary to the other connections and then I only ever refresh or optimize the extract from the primary data source.
This one relatively simple workbook has provided an excellent learning exercise for me and a great way to illustrate many different aspects of the new version 6 features.
The resulting workbook now implements everything from the version 5 workbook, with Robert’s enhancements of the use of parameters to select the location of the center of interest and the radius, a roll-up hierarchy of locations and the ability to publish to Public. And in the case of the third tab, it is at least as quick as with version 5.
Some of the key messages I take out of this are:
- There are lots of ways to do anything – the first way you try may not always be the best, so if you run into roadblocks or the resulting analysis is too slow, don’t give up, try coming at it from a different direction. Hopefully this article provides some useful clues about alternative approaches.
- In general, I try to do as much as possible in the data source rather than in table calculations – this will generally give the best performance and avoid restrictions on subsequent analysis.
- There are currently a few quite significant performance issues with table calculations, which can make the table calculation approach much less viable. Hopefully these will be resolved soon; in the meantime, the tuning tricks I have described get around the problems for many common situations.
There is still room for a few more enhancements which offer the chance of making things easier or quicker or both. I have touched on several ideas that I have thought of above. The key ones are summarized below.
- Parameter Setting Actions
The ability to set the value of one or more parameters by visual selection from the results displayed on a sheet would have numerous uses. I have long wanted this to allow visual selections of ranges from a view with continuous axes, simply for the speed of navigation. Working through the performance issues with Robert’s workbook has also highlighted the performance benefit of being able to pass values that have been looked up in one sheet directly into the query for the next sheet.
- Lookup tables
Another feature which would offer an alternative approach to getting that speed benefit would be more direct support for lookup tables (to avoid the clunky delimited list embedded in parameters that I used here).
- On Parameter Change Actions
One of Robert’s original issues (which also applied in version 5.2) was that he could not automatically update the map view when a parameter changed. The ability to trigger an action on change of a parameter would have lots of uses.
- Other Data Blending Join options
Although I didn’t talk about this explicitly above, I will just briefly mention this one. There are often situations in which you want summary data from another data source, but for which a join based on equality does not work. Typically this might apply to a range of values. This would allow blending to be used to retrieve the total value across a date range, for example.
- Why not?
One of the things I struggle with most with table calculations (and in fact calculated fields in general) is working out why something is not allowed. Some form of further feedback, rather than just disallowing illegal actions, could be really helpful
Visibility of order of execution
I suspect that the Tableau folk have gone to an awful lot of trouble to try to make what happens where and the order of evaluation of calculations transparent to the user. Unfortunately, there are still situations where it is not entirely transparent, and it really is necessary to understand these things: what is client-side and what is server-side, when do filters get applied relative to calculations, etc. I think it would be better to expose some of these things in the interface, rather than leaving people to puzzle over them and work out what must be happening by a combination of trial and error and guesswork. This ties in closely with the previous point.
Working through the issues Robert described has been a really useful learning exercise for me and has provided lots of inspiration for my own analysis. It has also inspired me to finish off a viz I started to work on just after the Tableau conference showing earthquakes in New Zealand (New Zealand, which is where I live, suffered a 7.1 earthquake while I was at the conference and has just had another spate of aftershocks). I have borrowed some of Robert’s ideas and included the lookup table idea that made navigation of data for 100,000 earthquakes really quick: Explore your local Earthquakes.
Finally, just when I thought I had my head around everything in this analysis, Robert threw down a new challenge. He is working on another posting, repeating this analysis in Excel – and has built some very nice additional features into that version which he was sure would not be possible with Tableau. I was of course unable to resist the challenge. Although I did manage to get something going after a fashion, I had to use tricks which preclude publishing it to Public and it also takes many times as long to evaluate the analysis as it does in Excel. So it looks as if Excel is going to win this one hands-down (for now at any rate). I’ll describe that challenge, and a couple of possible extensions that would allow Tableau to compete, once Robert has posted his Excel version.
Richard, a big-time thank you very much for this fantastic guest post series. I always deluded myself into believing one of my (few) virtues would be stubbornness when it comes to solve complex problems. Compared to you, however, I am nothing else than a quitter. Kudos on your endurance, your deep knowledge of Tableau and your great writing style. I learned a ton from you. Many thanks!
I can hardly wait to see what you will come up with regarding the enhanced features I implemented in the Microsoft Excel version of the Site Catchment Analysis and how far you will push Tableau’s limits this time. Of course I would be more than happy to feature further guest posts from you here on Clearly and Simply.
The article describing the Microsoft Excel version of the Site Catchment Analysis will be next, as always including the Excel workbook for free download.