Techniques to Get Past the Limitations – Part 3: Performance Tips and a Wish List
This is the third and final part of a guest post series by Richard Leeke from Equinox Ltd.
The first two articles of this series focused on three different techniques for doing different levels of aggregation with Tableau: Table Calculations, Data Blending and Parameter Lookup Tables.
However, these articles only discussed one of the caveats raised in Robert’s original Site Catchment Analysis with Tableau 6. I can’t do anything about the second - the missing ability to trigger an action when a parameter changes. But in this article I discuss the various techniques I used to optimize the performance of the Tableau workbook. Whilst the discussion focuses on the Site Catchment Analysis covered in this series, several of the tips are general hints which may apply to lots of your Tableau workbooks. I actually only worked out a couple of them while working through this analysis, but have already adopted them in other workbooks with great success.
So, today’s article wraps up my guest post series with a couple of tips to increase Tableau’s performance and (especially for the Tableau folk reading this blog) a little wish list for future versions of Tableau.
Performance Tips
I have used a number of techniques to speed up the various versions of Robert’s workbook, including a few workarounds for “tuning opportunities” with Table Calculations, some of which I discovered in the course of exploring options with this workbook.
- Do as much work in the database as possible
As a general rule, the more of the work that can be done in the database, the faster the view will be, especially if that allows filters to be evaluated in the database, rather than having to pull large result sets back to Tableau for processing. Furthermore, doing the work in the original query against the primary data source is almost certain to be quicker than combining results via Data Blending. But of course Table Calculations and Data Blending allow you to do all sorts of analysis which can’t be done directly in the back-end database.
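To make that concrete (using made-up field names rather than the exact ones from Robert’s workbook): a plain row-level calculated field such as

IIF([Distance to Center] <= [Radius], [Population], 0)

is folded into the query Tableau sends to the data source, so only the aggregated result needs to come back. Expressing the same aggregation as a table calculation, along the lines of

WINDOW_SUM(SUM(IIF([Distance to Center] <= [Radius], [Population], 0)))

means a row has to come back for every location so that Tableau can do the summing itself.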
Whilst it is generally true that it is faster to do the work in the database, performance obviously depends on the database itself, and there will doubtless be exceptions where a slow database means you want to do as little work as possible there.
- Problems with TOTAL()
In some circumstances using TOTAL() results in one or more long delays when refreshing a sheet. I noticed this by inspecting the Tableau logs while trying to find out what was taking so long on one of my first attempts to add the summary totals to the “Covered” sheet. This is definitely a bug and should be fixed in an upcoming maintenance release. In the meantime, if a sheet starts going surprisingly slowly after adding a TOTAL() table calculation, try replacing the call to TOTAL() with WINDOW_SUM().
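For example, a grand total of population written as

TOTAL(SUM([Population]))

can simply be rewritten as

WINDOW_SUM(SUM([Population]))

With the window left at its default of the whole partition, the two return the same result for a sum. Just be aware that the two families of functions are not interchangeable for every aggregate: TOTAL(AVG([Population])) averages over the underlying rows, whereas WINDOW_AVG(AVG([Population])) averages the per-mark averages.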
- Window function speed
But having said that, the next performance issue is that the WINDOW_XXX() functions do not currently scale well. I first noticed this during the beta, and worked out that the time to execute the window functions increases in proportion to the square of the number of rows in the partition being analyzed. This makes these functions effectively useless when analyzing large volumes of data, so I had largely avoided using them since I came across the issue, thinking I would wait until it was fixed.
But I have been a little surprised how long it has taken for this issue to float to the top of the priority list, so when I found that I really needed them to get the analysis going properly for Robert’s workbook I decided to have another look at the issue.
I work in system performance engineering (which basically means ideally avoiding but more often finding and fixing system performance issues). Having the response time increase with the square of the amount of work to do is a very familiar issue. In fact this example of “square-law performance” and the process of working out what was happening and finding a workaround made an ideal little case-study for the Performance Engineering section of my company blog: Square-Law Performance. Sorry Tableau folk – but it’s a great example and it’s only one little blemish on a great release.
The WINDOW_XXX() functions perform various aggregates over a “window” onto the rows in each partition of the data being analyzed. The window may be the whole partition or it may be a different subset of the rows for each row in the partition – for example a sliding window of <N> rows either side of the current row. In defining the partitions, one of the properties which can be specified is the sort order, so the partition needs to be sorted to evaluate the function. It only needs to be sorted once per partition, because the sort order is invariant, even if the window moves as each row is evaluated – but currently the window is re-sorted for each row, which is the source of the slow-down.
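For anyone who hasn’t used the optional window arguments, a moving window looks something like this (a made-up example, not one from this workbook):

WINDOW_AVG(SUM([Population]), -2, 2)

which averages the current row and the two rows either side of it within the partition; leaving the arguments off makes the window the whole partition.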
For small partitions, or if the window is defined as a narrow range around the row being processed, this extra sorting is utterly negligible. But in a lot of cases, such as the uses in this analysis, the required window is the whole partition. For the first version of this analysis, a few calls to WINDOW_SUM() over the full 8,000 rows of location data resulted in the sheet taking three minutes to refresh. For large partitions this can make the functions unusable. The analysis I was doing when I first discovered this involved a partition of 120,000 rows, which took three and a half hours to evaluate (though I only worked that out after initially assuming that Tableau was in a loop and killing it a few times).
Luckily, for cases where the required window is the whole partition, there are a couple of workarounds to avoid the problem. I say “a couple” because there are two common cases.
Result only needed for one row in partition
A very common case, which applies in some of the cases in this workbook, is that all that is required is to evaluate an aggregate once for the whole partition. The field [Population within Radius (blended)] on sheet “Covered (blended)” is an example of this sort. All I need is to evaluate the total population within the selected area, and I only need it once, to display in the sheet header.
The basic window calculation that I need is this:
WINDOW_SUM(SUM([Population]))
But that will be evaluated 8,000 times, including 8,000 expensive sorts of the whole partition.
At first glance, it appears that all that is needed to fix it is this:
IF (FIRST()==0) THEN
    WINDOW_SUM(SUM([Population]))
END
That just returns the result for the first row in the partition and NULL for all other rows, but unfortunately it turns out to take almost as long as the original. It took me quite a while to guess my way past this one, but eventually I worked out what must be happening. Tableau actually evaluates the expression in a couple of passes. First it gets the window ready for every row in the partition. In this case, because I haven’t specified the optional parameters defining the range of the window, it defaults to the whole partition – which it duly sorts, once per row. Then it gets on with actually evaluating the expression, which means that for the first row it calculates the result and for all other rows it just ignores the nicely sorted rows.
Having guessed that that might be what it was doing, I tried another tack. For the first row, where we want the result, specify the entire partition. For all other rows, make the window just the single row being processed, like this:
IF (FIRST()==0) THEN
    WINDOW_SUM(SUM([Population]), 0, IIF(FIRST()==0, LAST(), 0))
END
Sure enough, that fixed it. My three and a half hour case came down to about five seconds, and the calculations in the “Covered (blended)” sheet are now too fast to worry about.
Result needed for all rows in partition
The other common case, which applies elsewhere in this workbook, is that the calculation is needed over the whole partition for all rows in the partition and the answer will be the same in each case.
In this case, rather than evaluating the result for every row, we can just evaluate the result for the first row in the partition and copy it for every other row, whilst using the same trick to minimize the cost of the sorts on the other rows which are not going to be used. That can be achieved like this:
PREVIOUS_VALUE(WINDOW_SUM([Covered_Population],
    0, IIF(FIRST()==0, LAST(), 0)))
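To show how the per-row copy of the total might then be used (again a made-up illustration, not a calculation from Robert’s workbook): suppose that same pattern is used to define a calculated field called [Total Population (partition)]:

PREVIOUS_VALUE(WINDOW_SUM(SUM([Population]), 0, IIF(FIRST()==0, LAST(), 0)))

Because every row now carries the partition total, each location’s share of the total is just an ordinary calculation referencing that field:

SUM([Population]) / [Total Population (partition)]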
I have raised both the TOTAL() and the WINDOW_XXX() issues with Tableau and expect that both are likely to be resolved in the near future. In the meantime, the above tricks provide adequate workarounds for all the cases I’ve encountered so far.
- Optimizing Data Extracts
One last speed tip: be sure to use the “Optimize” option (“Data->Extract->Optimize”) on the extract after adding or changing calculated fields. Tableau checks for changes to calculated fields and pre-calculates and stores any appropriate ones (which I think basically means most calculated dimensions, as long as they don’t depend on parameters or data blending). It also deletes any old versions of calculated fields which are no longer current. In some circumstances “Optimize” can make a substantial performance improvement.
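As a rough guide to what does and doesn’t benefit (field names made up for illustration): a calculated field that depends only on columns in the extract, such as

[City] + ", " + [Region]

can be pre-calculated and stored in the extract, whereas a field that references a parameter, such as

IIF([Distance to Center] <= [Radius], [Population], 0)

has to be evaluated at query time, because its value changes whenever the parameter does, so it gains nothing from the Optimize step.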
One thing to watch out for: if you are using the technique of making duplicate copies of a connection to a data extract, which I described in the second post of this series, you need to ensure that all calculated fields exist in the connection you select for the “Optimize” or “Refresh” options, otherwise Tableau will remove fields which it thinks are unused. I just make a habit of defining all calculated fields in the primary data source, copying them as necessary to the other connections, and only ever refreshing or optimizing the extract from the primary data source.
Summary
This one relatively simple workbook has provided an excellent learning exercise for me and a great way to illustrate many different aspects of the new version 6 features.
The resulting workbook now implements everything from the version 5 workbook, plus Robert’s enhancements: the use of parameters to select the location of the center of interest and the radius, a roll-up hierarchy of locations and the ability to publish to Tableau Public. And in the case of the third tab, it is at least as quick as with version 5.
Some of the key messages I take out of this are:
- There are lots of ways to do anything – the first way you try may not always be the best, so if you run into roadblocks or the resulting analysis is too slow, don’t give up, try coming at it from a different direction. Hopefully this article provides some useful clues about alternative approaches.
- In general, I try to do as much as possible in the data source rather than in table calculations – this gives the best performance and avoids restrictions on subsequent analysis.
- There are currently a few quite significant performance issues with table calculations, which can make the table calculation approach much less viable. Hopefully these will be resolved soon; in the meantime, the tuning tricks I have described get around the problems for many common situations.
Wish List
There is still room for a few more enhancements which offer the chance of making things easier, quicker or both. I have touched on several ideas above; the key ones are summarized below.
- Parameter Setting Actions
The ability to set the value of one or more parameters by visual selection from the results displayed on a sheet would have numerous uses. I have long wanted this to allow visual selections of ranges from a view with continuous axes, simply for the speed of navigation. Working through the performance issues with Robert’s workbook has also highlighted the performance benefit of being able to pass values that have been looked up in one sheet directly into the query for the next sheet.
- Lookup tables
Another feature which would offer an alternative approach to getting that speed benefit would be more direct support for lookup tables (to avoid the clunky delimited list embedded in parameters that I used here).
- On Parameter Change Actions
One of Robert’s original issues (which also applied in version 5.2) was that he could not automatically update the map view when a parameter changed. The ability to trigger an action on change of a parameter would have lots of uses.
- Other Data Blending Join options
Although I didn’t talk about this explicitly above, I will just briefly mention it. There are often situations in which you want summary data from another data source, but a join based on equality does not work - typically because the match needs to be against a range of values. Supporting other join conditions would allow blending to be used to retrieve the total value across a date range, for example.
- Why not?
One of the things I struggle with most with table calculations (and in fact with calculated fields in general) is working out why something is not allowed. Some form of further feedback, rather than just disallowing illegal actions, could be really helpful.
- Visibility of order of execution
I suspect that the Tableau folk have gone to an awful lot of trouble to hide from the user what happens where and the order in which calculations are evaluated. Unfortunately, there are still situations where the abstraction leaks and it really is necessary to understand these things: what is done client-side and what is done server-side, when filters get applied relative to calculations, and so on. I think it would be better to expose some of these things in the interface, rather than leaving people to puzzle over them and work out what must be happening by a combination of trial and error and guesswork. This ties in closely with the previous point.
Parting Comments
Working through the issues Robert described has been a really useful learning exercise for me and has provided lots of inspiration for my own analysis. It has also inspired me to finish off a viz I started to work on just after the Tableau conference showing earthquakes in New Zealand (New Zealand, which is where I live, suffered a 7.1 earthquake while I was at the conference and has just had another spate of aftershocks). I have borrowed some of Robert’s ideas and included the lookup table idea that made navigation of data for 100,000 earthquakes really quick: Explore your local Earthquakes.
What’s next?
Finally, just when I thought I had my head around everything in this analysis, Robert threw down a new challenge. He is working on another posting, repeating this analysis in Excel – and has built some very nice additional features into that version which he was sure would not be possible with Tableau. I was of course unable to resist the challenge. Although I did manage to get something going after a fashion, I had to use tricks which preclude publishing it to Public and it also takes many times as long to evaluate the analysis as it does in Excel. So it looks as if Excel is going to win this one hands-down (for now at any rate). I’ll describe that challenge, and a couple of possible extensions that would allow Tableau to compete, once Robert has posted his Excel version.
Robert’s note:
Richard, a big-time thank you for this fantastic guest post series. I always deluded myself into believing that one of my (few) virtues is stubbornness when it comes to solving complex problems. Compared to you, however, I am nothing but a quitter. Kudos on your endurance, your deep knowledge of Tableau and your great writing style. I learned a ton from you. Many thanks!
I can hardly wait to see what you will come up with regarding the enhanced features I implemented in the Microsoft Excel version of the Site Catchment Analysis and how far you will push Tableau’s limits this time. Of course I would be more than happy to feature further guest posts from you here on Clearly and Simply.
The article describing the Microsoft Excel version of the Site Catchment Analysis will be next, as always including the Excel workbook for free download.
Stay tuned.