How to use R, a Tableau connection to R and Calculated Fields to store and load the underlying data of a defined Tableau View in a Database or Text File (part 1 of 2)
My good friend Michael Martin is kind enough to contribute another guest post to Clearly and Simply.
Michael is the Managing Partner at Business Information Arts, Inc, Tableau Partner, Tableau Certified Consultant, Tableau Instructor, leader of the Toronto Tableau User Group and – as Interworks calls him for very good reason – a Tableau Maestro.
In his article, Michael will show us how to store the underlying data of Tableau Views in a database or a text file using R and Tableau Calculated Fields. Michael's post will come in 2 installments.
NOTE: It may be necessary to zoom in on some of the graphics within the body of the text below for better legibility. If you are running 32 bit Microsoft Access under 64 bit Windows, and wish to output the data in a Tableau view to 32 bit Microsoft Access, please read the Appendix to this post before proceeding. Supporting files and a README file are available for download at the end of the second part of the post. Please look at the README file first.
The author would like to thank Leonid Koyfman for his suggestions.
Here is part one:
In the Bar Graph View below (based on the Superstore dataset that comes with Tableau), yearly sales are summed and colored by profit and sized by discount:
Reference lines show average sales within each region and customer segment by year for all categories.
How can we “capture” the data in this view and save it in a database for further analysis and re-use?
Tableau does let you view underlying data and copy it to the clipboard, export it to a text file, or even a Microsoft Access database, but these are manual operations. What if you could write a Tableau Calculated Field to export the underlying data in a Tableau View to any Database Server that supports the ODBC protocol?
You can, using R, and an R language library called RODBC.
Tableau opened the door to using R language libraries within Calculated Fields in version 8.1. This article walks you through downloading and installing R on a desktop computer, configuring Tableau to use your R installation, and writing the Calculated Fields required to export data from a Tableau view to a database such as SQL Server or Microsoft Access.
Download and Install R
The first step is to download the R platform from:
https://www.r-project.org/about.html
or there are mirror download sites at:
https://cran.r-project.org/mirrors.html
As of this writing, the most current version of R is version 3.2.5. The examples in this article should work with versions 3.0 and higher. The computer this writer uses runs R 3.2.4.
Download the R distribution, which should contain a 64 and 32 bit implementation of R. If you are running 64 bit Windows, you can install the 32 and 64 bit versions of R.
Note: If you are running 64 bit Windows and 32 bit Microsoft Office and want to write data to an MS Access database, make sure to install the 64 and 32 bit R implementations. If you are running 32 bit Windows, your only option is to download the 32 bit R distribution.
In your Windows Program Groups (assuming 64 bit Windows), you’ll see:
which confirms that both 32 and 64 bit R were installed. R comes with a simple editor called Rgui, that you can use to write R scripts. As you can see, this writer is using R version 3.2.4.
I suggest you head over to www.rstudio.com and download an alternative development environment named R Studio.
It’s a free download, easy to install, and easy to use.
Download, Install and Load R Packages Rserve and RODBC
You can extend the functionality of your base R installation by downloading and installing packages. R has a very dedicated and productive developer community, and most packages from this community are well documented. To run the examples in this post, you will need to install two R packages that are not included in your base R installation: Rserve and RODBC.
Start up RStudio and invoke the Install Packages… command from the Tools menu.
An Install Packages dialog will appear. Enter Rserve and RODBC (separated by a comma) in the packages field as shown below and click on Install.
R Studio will download and install these two packages within your R installation. You should then see something that looks like:
By the way, both of these packages should be available to the Rgui Editor, should you choose to use it.
The next step is to load the RODB and Rserve packages. Within RStudio, by default on the right hand side of the screen, you will see a list of available packages:
Locate the RODBC and Rserve packages and tick the box to the left of each of the package names. RStudio loads the RODBC and Rserve packages. The R Scripts you will write as you work through this post also load RODBC, but pre-loading RODBC in RStudio ensures this package will be available when referenced from Tableau.
As Rserve is a binary .exe file, we also need to start it from within RStudio so that Tableau can access R packages within Calculated Fields.
Type Rserve() and press ENTER at the > prompt as shown above.
Configure Tableau to use your R Installation
To use Tableau (through version 9.2) with R, select Help -> Settings and Performance - > Manage R Connection as shown below
and enter localhost (or 127.0.0.1) in the Server field and 6311 as the Port.
In Tableau 9.3 and above, Select Help -> Settings and performance -> Manage External Service connection as shown below.
Enter 127.0.0.1 or localhost in the Server field and 6311 as the Port.
Then click on the Test Connection Button, and you should then see:
You are now ready to use R with Tableau.
NOTE: The supplied tutorial Tableau Workbook (a download link is at the end of part 2 of this post) was made in Tableau 9.3, and cannot be opened with earlier versions of Tableau Desktop. You can however use earlier versions of Tableau (starting with version 8.1) to re-create the examples shown in parts 1 and 2 of this post.
Set Up your Database and an ODBC Data Source
You then need to export the Excel Superstore Orders data that comes with Tableau to a database. The examples in this article use a SQL Server Database, but you could export the data to Microsoft Access, Oracle, or any database server that supports ODBC.
NOTE: Microsoft SQL Server and Microsoft Access versions of the Superstore data set provided in Excel by Tableau that the author used to create the tutorial workbook are provided for download to use when re-creating the examples discussed in both parts of this post. Download links for the data and other resources are at the end of the second part of this post.
If your version of Access is 32 bit, but you’re running 64 bit Windows, see the appendix at the end of this article, which describes how to configure a 32 bit ODBC connection under 64 bit Windows before proceeding. Make sure that you export the data in the Orders sheet in Excel to your database. This should result in there being a table named Orders in the database.
The next step is to configure an ODBC Data Source (in Windows 7 or 8 via Control Panel - > Administrative Tools -> ODBC (in OS X, use ODBC Manager). To configure an ODBC data source in Windows 10 check out this link which explains how to run the ODBC Connection Manager within Windows 10:
5 Ways to Open ODBC Data Source Administrator in Windows 10
Again, if you are running 32 bit Access under 64 bit Windows, please read the appendix at the end of this post before proceeding.
Below are screens for creating an ODBC Data Source (Windows) for Data in SQL Server for a database named Tableau_Superstore_Orders. This database contains the Superstore Sales Data (in Excel) that ships with Tableau.
Open the ODBC Administrator Tool, select the System DSN Tab and then Add to add a new Data Source as shown below:
For SQL Server, select either the 10.0 (SQL Server 2008) or 11.0 (SQL Server 2012) driver as shown below, and then click on Finish.
Another dialog will then appear:
Give the ODBC Data Source a name and indicate the SQL Server Instance name, and then click on Next.
Another dialog appears:
Indicate how the login to the database will be authenticated. By default, Integrated Windows authentication has been selected. If you administrate the machine SQL Server is installed on, this option should work. Otherwise, enter credentials for a SQL Server user who is authorized to log into the database. Then click on Next.
Change the default database to Tableau_Superstore_Orders in the next Dialog that appears as shown below, and then click on Next.
Another dialog will open. Accept the default entries and click on Finish.
A dialog appears that lets you test the ODBC connection you just configured. If there are no issues, you will see a message like what you see below:
Closing the dialog boxes above will bring you back to the main screen of the ODBC Administrator. Click on OK to close the ODBC Administrator.
Create a new Tableau Workbook and Connect to the Database
We can open Tableau. The author used Tableau 9.3 when writing this post. Create a new file and make a connection to the Tableau Superstore data you exported to a database. The author exported the data to SQL Server. The Excel Superstore sales data is in a sheet named Orders, so there should be a table named Orders in the database you exported the Excel data to. If you download and open the tutorial Tableau workbook mentioned earlier (in Tableau 9.3 format), you could open that workbook as well, and use it for reference as you build your new workbook based on the content of this post.
Create Calculated Fields in Tableau
After connecting to the Superstore Data, define a new Calculated Field named OrdersSQLString.
The calculation is:
'SELECT * FROM Orders'
Note the single quotation marks surrounding the text of the SELECT statement.
Now define another calculation that connects to the Superstore Data in SQL Server and writes out the results from a SELECT query to a text file. The calculation will be named RunOrdersQuery.
SCRIPT_STR("
library(RODBC)
crs$odbc <- odbcConnect('Tableau_Superstore_Orders', believeNRows=TRUE)
crs$odbc.tables <- sqlTables(crs$odbc)$TABLE_NAME
crs$dataset <- sqlQuery(crs$odbc,.arg1)
write.table(crs$dataset, file = 'C:\\Data\\Clearly_And_Simply\\RODBC_Article\\OrdersQuery.csv', row.names = FALSE, quote = FALSE, sep = '|')
close(crs$odbc)
'Query Results from refreshed Data Connection'"
,[OrdersSQLString]
)
NOTE: It has been reported that including the '$' character within the name of an RODBC object name (for example crs$odbc, crs$odbc.tables, crs$dataset) in a calculation could cause a runtime error on some computer systems and display 'Object not found' error. Leonid Koyfman suggests replacing the '$' character with an underscore ( _ ) in the affected calculation.
Also note that the calculation above contains a reference to a hard coded path to a file on disk. You can alter this path in the calculation to a location of your choice on your system. You will need to create the required directories and sub-directories (i.e. folders) on the path to the destination file on your system before running the calculation. The RunOrdersQuery calculation will not create them.
Calls to R functionality within Calculated Fields have to follow certain conventions. The calculation must start with the word SCRIPT_ and then specify a return value, in this case a String (STR), followed by an open parenthesis and a quotation mark. The string return value of the calculation is the informational message “Query Results from refreshed Data Connection”.
The calculation then proceeds to connect to the Tableau_Superstore_Orders ODBC data source you created and runs a query of the value contained in the incoming parameter .arg1.
The write.table statement within the calculation writes out the results of the SQL command to a text file to file name indicated in the command. You can change the file path and name as desired. Note that you need two backslashes to indicate drive and directory names when constructing the file path. Note the | (Pipe) delimiter which prevents scrambled data in the output text file if any data returned by the query contains commas in a data field.
The calculation then closes the ODBC connection and displays the informational message, which - as mentioned earlier - is the string return value for the calculation. Immediately following the informational message is a closing quotation mark that offsets the quotation mark that follows the open parenthesis in the first line of the calculation.
The last line of the calculation contains a comma, and then the value of .arg1 – the value of the OrdersSQLString calculation you defined earlier.
The Calculated Field ends with a closed parenthesis.
Disable Auto-Updating
For our Calculated Fields to work as expected, we need to disable Auto-Updating of worksheets.
Why do you have to update the view (and other views with similar queries) manually?
We do the manual update to control when the Table Calculation that executes an SQL command actually runs. This is similar to re-calculating an Excel worksheet manually. If we left auto updates on, every change made to the view (adding or removing fields, formatting, filtering, etc.) would cause any calculation running SQL commands to run, which could cause all sorts of problems in your database. In this example, the text file would simply be re-written to disk. Turning auto update off solves this problem.
From the Worksheet menu, select Auto updates, and untick Auto update worksheet.
Create a Tableau View
After you have written these two Calculated Fields, set up a Tableau view that looks like:
Note that Tableau considers all calculations that use R to be Table Calculations.
Now run a manual update of the worksheet by clicking on the Run update icon or tab the F9 function key.
Running the update executes the RunOrdersQuery Calculation. The view should then look like:
We’re seeing the sum of sales from our connection to SQL Server, which is no surprise. We see the values of the Orders SQL String Calculation and the return value of the RunOrdersQuery Calculation above the bar.
Create a New Data Connection
We now want to configure a new data connection to the text file that the RunOrdersQuery calculation should have created.
Create a new sheet, configure a Text File connection, and point to the folder specified in the RunOrdersQuery calculation.
After making the connection, and back on the Tableau worksheet, set up a view that looks like:
If you see what appears above, the “proof of concept” we have worked through worked expected. You’ve connected to a database using ODBC within a Calculated Field, and re-directed the output from a SQL Query to a table in that database to a text file.
If you navigate to the folder on your system where the text file is, and open the text file in an editor, it will look like:
Parameters and Calculated Fields for User defined Conditions
We’ll now define another “proof of concept” calculation that includes a WHERE clause. Start by creating two parameters: Category Condition and Regional Condition.
If you haven’t used parameters in Tableau before, check out:
Tableau Knowledge Base: Create Parameters
Create a new sheet and place the two parameters on the sheet as shown below:
Create a Calculated Field called Category WHERE
This calculations returns the value of the currently selected category in the Category Condition Parameter
Create another Calculated Field named Regional WHERE
This Calculation returns the value of the currently selected region in the Regional Condition Parameter.
Next: a Calculated Field named Full SQL Statement
We can now write another calculation named RunFilteredOrdersQuery, which will take the calculation Full SQL Statement as an argument.
SCRIPT_STR("
library(RODBC)
crs$odbc <- odbcConnect('Tableau_Superstore_Orders', believeNRows=TRUE)
crs$odbc.tables <- sqlTables(crs$odbc)$TABLE_NAME
crs$dataset <- sqlQuery(crs$odbc,.arg1)
write.table(crs$dataset,file=
'C:\\Data\\Clearly_And_Simply\\RODBC_Article\\FilteredOrdersQuery.csv', row.names = FALSE, quote = FALSE, sep = '|')
close(crs$odbc)
'Query Results from refreshed Data Connection'"
,[Full SQL Statement]
)
This calculation works just like the RunOrdersQuery Calculation. It establishes an ODBC connection to the SQL Server Databases, and runs the SQL Query contained in the Full SQL Statement Calculation. Remember to specify the correct path to the output file in your version of the calculation.
The Full SQL Statement Calculation uses parameters to gather input for a WHERE clause. The parameters are currently set to the Office Supplies Category in the South.
Additional Tableau Worksheets
In a new sheet, with a connection to the Orders table in your back end database, disable automatic updates for this sheet as shown previously, and set up a view that looks like:
Now run the RunFilteredOrdersQuery Calculation by manually updating the worksheet.
You’ll then see:
Again, the sales numbers are coming from SQL Server.
Create a new sheet and make a Data Connection to the text file that the RunFilteredOrdersQuery calculation should have created.
When we go back to the Tableau worksheet, we see that sales are constrained to the Office Supply Category in the south
If you were to go back to sheet 3, change the parameter values, manually update the worksheet (which re-runs the RunFilteredOrdersQuery Calculation), then come back to sheet 4 and refresh the FilteredOrdersQuery Data Source, the Bar Graph in sheet 4 will reflect the new filtering.
Appendix: 64 bit Windows and 32 bit Microsoft Access
If you are running 64 bit Windows and have 32 bit Microsoft Access, this appendix describes the steps to follow to configure your environment to run 32 bit R and create a 32 bit ODBC Data Source.
Go to https://cran.r-project.org/bin/windows/base/.
Download the R platform installer that contains 32 and 64 Bit versions of R. When running the installer you downloaded, 32 and 64 bit R will be installed by default – multiple versions of R can be installed on a single PC. If you have not already done so, download and install RStudio as described earlier in this post. If you are running 32 bit Windows, you are also running 32 bit Tableau, and only the 32 bit version of R will be installed on your system.
Next: create a 32 Bit ODBC Data Source using the 32 bit version of the Windows ODBC Connection Manager.
Using Windows Explorer, navigate to C:\Windows\SysWOW64\odbcad32.exe (under Windows 7 and Windows 8). If you are using Windows 10, check out
5 Ways to Open ODBC Data Source Administrator in Windows 10
Double click on odbcad32.exe to load it. Select one of the available Microsoft Access database drivers to configure your Data Source. Create an ODBC data source as described earlier in this post (a User or System Datasource). Creating the data source includes a step where you navigate to your MS Access database file with Windows Explorer and select it with your mouse.
Next: Open R Studio – tell RStudio that you want to use the 32 Bit version of R. Select Global Options from the Tools menu as shown below and select the 32 Bit R Installation as your default R implementation to use.
Close and restart RStudio to save your changes.
Next: Install the Rserve and RODBC R Packages within your 32 Bit R Installation. From the Tools menu, select Install Packages and type in Rserve, a comma and then RODBC in the Packages field. Tick the “Install Dependencies” option - then click on Install.
As described earlier in this post, configure Tableau to use your 32 bit R installation.
Load the RODBC package and then load and start Rserve as described earlier in this post.
Use the 32 bit MS Access ODBC Data Source name you defined in your Wrapper Calculations that run SQL statements.
If you are running 64 bit Windows, and want to run 64 bit R later, follow the instructions you followed to use 32 bit R in reverse. Select Global Options from the Tools menu as shown below and select the 64 Bit R installation as your default R implementation to use. Close and restart RStudio to save the change.
To recap:
- Download an R platform installer that contains 32 and 64 bit R versions
- Configure your RStudio installation to use 32 Bit R
- Within RStudio, install the Rserve and RODBC Packages
- From the RStudio Console, load the RODBC package and start Rserve
- Create a 32 bit ODBC Data source using C:\Windows\SysWOW64\odbcad32.exe
- Configure your Tableau to R connection as described earlier in this post.
- Reference the name of the 32 bit ODBC data source you created in your Calculated Fields that reference the RODBC package and execute SQL statements.
Remember that SQL statements to be executed must be written in the dialect of SQL that your back end database expects, in this case Microsoft Access.
[End of Appendix]
Part 2 – Coming soon
So far, we have only run SELECT queries. The second part of the article will show how to run INSERT, DELETE, UPDATE or INSERT INTO (append data) queries using the same approach.
Stay tuned.