Techniques to Get Past the Limitations – Part 2: Data Blending and Parameters for Lookups
This is the second part of a guest post series by Richard Leeke from Equinox Ltd.
The first part of this series described one approach to overcoming the restrictions in Robert’s Site Catchment Analysis with Tableau 6, in this case using table calculations. If you look at the roadblocks Robert has described in his approach, table calculations are probably the first idea that comes to your mind if you are trying to overcome the issues. But whilst that approach achieves the analytical outcome Robert was seeking, it does not get the workbook back to the performance Robert had achieved with version 5.2.
However, there are other options, too: you could either use data blending or even a nifty little trick to let parameters do the required lookups. Each of those techniques makes the workbook quicker, as I explain below. Each of the three approaches comes with its own pros and cons, but all of them are well worth having a closer look.
Today’s article describes those 2 alternative approaches and again provides the Tableau workbook on Tableau Public and for free download.
Using Data Blending for Lookups
The next approach I used was to completely avoid table calculations for finding the latitude and longitude of the center, by using Data Blending. This is shown in tab “Dashboard (data blending)” in the workbook at the end of this article.
In order to use data blending to lookup the coordinates of the center, I needed to join the primary data source back to itself. That isn’t directly supported by data blending, but a useful trick is to create a second copy of the original data source, and join to that. The second copy of the data source just references the original data extract, so this does not create a duplicate of the actual data, it just allows a second row (in this case the center) to be referenced from every row in the primary data source and thus combined in the results.
These are the steps I used to achieve this.
- Create a calculated field [Center (blending)] in the primary data source, set to a copy of the [Center] parameter, as follows:
[Center (blending)] = [Center]
- Create a copy of the original data source – I called it “locations germany (center)”
- Go to Data->Relationships... and define the join condition between the original data source and the new copy. We want each row of the primary data source to join to the single center row of the secondary. So remove all of the automatically generated relationships and add a single relationship as follows:
(Primary): [Center (blending)] = (Secondary): PLZ5_str
- I then created a new copy of the calculated field [Distance], referring to the latitude and longitude values from the secondary data source (i.e. the coordinates of the center). See [Distance (blended)].
- Finally I created “(blended)” copies of all other calculated fields which depend on distance and new copies of the worksheets, replacing all references to calculated fields with the new blended versions.
In fact I changed one more thing. The calculation of the population as a percentage of the total population of Germany ([Population % within Radius (blended)]) requires the total population of Germany. This could have used the same calculation as in the previous version, but as I was using data blending this time, I chose to calculate the total population of Germany using blending, too. I achieved that like this:
- Create another copy of the original data source – I called it “locations germany (all)”
- Adjust the join condition for this secondary data source so that each row of the primary references all rows of the secondary, as follows:
(Primary): [Center (blending)] = (Secondary): [Center (blending)]
- Define the calculation for [Population % within Radius (blended)] as follows:
[Population within Radius (blended)] / SUM([locations germany (all)].[Population])
This approach turns out to be quite a lot faster than the original, but it still requires all rows in the primary data source to be retrieved before using data blending to lookup the other required values and then evaluating calculated fields and filter conditions in Tableau (although in this case I think I’ve spotted that the filters are applied before table calculations). This is perfectly fast enough with these data volumes, but will again become less manageable as data volumes increase.
Using a Parameter as a Lookup Table
What I really wanted to achieve was some way of allowing all of the calculation and filtering to happen in the data engine. After some head-scratching, I came up with a slightly ugly trick to allow the latitude and longitude of the center to be looked up directly by embedding the values in the list of valid values for the center parameter. The results of this lookup were thus available to the original queries used to populate each of the views in the dashboard, which means that all of the calculation was able to be done in the data engine. This is the fastest approach and will scale much better to large data sets. Interestingly, the speed difference is even more noticeable with the workbook published on Public than it is when displaying it locally in Desktop. This approach is in sheet “Dashboard (param)” in the workbook at the end of this article.
Another way to achieve the same effect of passing the center coordinates to the original query would be to allow the user to look up the required latitude and longitude via a separate view and then copy the values into parameters to pass to the main views. This would be a very clean solution if it was possible to populate parameters for latitude and longitude automatically from the results of a lookup on another sheet. This is not possible in Tableau at the moment, but a “parameter setting action” would be a really useful visual navigation feature with lots of uses. I have been lobbying for this for a while – hopefully it will float to the top of the priority list soon; it would really unleash a lot of the new version 6 capabilities in my view.
So for now, the ways I can see to make the center latitude and longitude available to the original query are with the ugly parameter trick I will explain below, or for the user to manually copy the values returned from a lookup into a pair of parameters.
Both of these approaches achieve the same result in terms of the remainder of the analysis. As the latitude and longitude are available to the query that is sent to the data engine, various calculations, and most significantly the filtering, is able to be performed at data engine speed, with only the required result set returned to Tableau. This only makes a few seconds difference in the case of Robert’s 8,000 locations, but with 100,000 rows the difference is huge and with even bigger data sets the client-side approach would eventually not work at all. So this approach is the only way I can see of achieving this sort of analysis on really large data sets.
Embedding a lookup table in a parameter’s validation list or copying results of a lookup to parameters may seem clunky, but we have to remember that with version 5.2 there were no parameters, so the 5.2 version of the workbook had to hard-code the location of the center in a calculated field. The 5.2 version also couldn’t be published to Pubic, since that has never supported RAWSQL.
To embed the latitude and longitude in the [Center] parameter I made the valid values of the parameter a string containing the postcode, latitude and longitude values, delimited with a suitable character, and set the “Display As“ value to display only the actual postcode. I just built the lookup table in Excel with a couple of expressions to pad and concatenate the appropriate fields, and then copied the two columns via the clipboard with the “Paste From Clipboard” option in the Edit Parameter control, as shown below. (I initially tried building the lookup table in another Tableau sheet, but as there is no way to format real numbers when converting to string, the latitude and longitude values came out with very long strings of trailing 9s, so it was tidier to resort to Excel.)
I then defined three calculated fields to pick out the postcode, latitude and longitude from the chosen parameter.
[Center (param)] = LEFT([Center_Lookup], FIND([Center_Lookup], "|") - 1)
[Center_Latitude] = FLOAT(MID([Center_Lookup], FIND([Center_Lookup], "|") + 1,
FIND([Center_Lookup], "|", FIND([Center_Lookup], "|") + 1) - 1))
[Center_Longitude] = FLOAT(MID([Center_Lookup], FIND([Center_Lookup], "|",
FIND([Center_Lookup], "|") + 1) + 1))
Having selected a center postcode, the values of latitude and longitude for that postcode are immediately available in other calculated fields, which means they can be passed as part of the original query to the database – in this case the data engine. This allows the distance calculation and the filter limiting the analysis to locations within the selected radius to occur in the data engine, which means that only a small number of rows need to be returned.
I was then able to redefine all of the other calculated fields in a way that would be evaluated in the data engine, with the exception of the percentage of Germany, which required the use of a single data blending lookup to find the total population of Germany. The resulting view is significantly faster than the other versions.
Displaying Details at Different Levels of Aggregation
The original version 5.2 workbook included summary views of population at various levels of aggregation. In version 6, Robert replaced these with the use of the new version 6 hierarchy feature, allowing roll-up and down the hierarchy within the one “Covered” sheet. This works well and with a couple of slight tweaks can give all of the functionality of the version 5.2 workbook in a much more compact format.
The “slight tweaks” were that I adjusted the order of fields in the location hierarchy to match the fields on the rows shelf of the “Covered” sheet. This means that roll-up and down of the hierarchy works better – now if you roll-up and then down again it returns to the original configuration. Part of the issue here is that the relationship between the location fields is not a true hierarchy.
I also replaced the mark labels on the bars with reference lines. This allows the totals to be displayed correctly as the location hierarchy is expanded and contracted.
Hover the mouse over the top row of the location hierarchy table and little “+” and “-“ widgets will appear, allowing the hierarchy to be expanded and contracted.
The Workbook on Tableau Public
In case you haven’t read the first part of the series, here is the workbook on Tableau Public again. Please refer to the tabs “Dashboard (data blending)” and “Dashboard (param)” to see the outcome of the techniques described above.
The third and final part of my guest series here will include some tips to speed up the performance of the workbook, together with a summary of a few ideas for possible future enhancements to make the things I’ve discussed in this series even easier.
Stay tuned.