Approach, algorithm, VBA code and performance optimization of the Word Cloud with Excel implementation
The previous post Word Clouds with Microsoft Excel introduced another workaround to create a word or tag cloud in a standalone Microsoft Excel workbook.
The article briefly describes the approach, the user-defined settings, the result and the drawbacks of the workaround. It also provides 3 different versions of the workbook for free download.
Today’s follow-up article will have a closer look into the machine room: some more details on the approach, the data structure, the algorithm, the implementation in VBA and a dirty little trick to optimize the performance of the code for inserting and formatting freeform shapes on a worksheet using VBA.
As a little bonus, today’s post also includes a tiny Excel tool to easily split continuous texts into a list of words and their count, i.e. the data structure needed for the word cloud workaround. Of course the tool is provided for free download.
As already mentioned in the previous post, there are other Excel word cloud workarounds available already, either based on the cell grid (one big cell or a cell range) or on linked images. Those workarounds have their limitations, though, especially regarding flexibility and suitability for larger data sets (i.e. many words).
Another option are freeform shapes, They are independent from the underlying cell grid, can be relocated, resized and formatted as needed and – as far as I know - there is no limitation like a maximum number of shapes on a worksheet. That’s why using freeform shapes seemed to be the most flexible approach to create a word cloud.
The idea is simple: a VBA procedure loops through the entire data word by word, inserts a freeform textbox for each word, adds the word, resizes the font size according to the frequency of the word and relocates the textbox inside a given frame (a predefined rectangle shape).
The challenges of this approach were the following
- Implement an efficient algorithm to find the correct position of a textbox in the cloud and make sure there are no overlaps with other textboxes
- Optimize performance in order to make the implementation usable on interactive Excel dashboards in production
Before looking at the algorithm, the VBA code and the performance optimization, let’s have a look at the required data structure first.
The Data
The word cloud implementation in Microsoft Excel provided in the previous post expects the source data to be a two column table: a list of unique words in the first column and their counts (frequencies) in the second column.
If you have your text in a one-column list with all the words (a non unique entry list), you can easily create the required data structure with a standard Pivot Table: just drag the words to the rows and the count of the words to the values area of the Pivot Table.
However, if you only have a continuous text, e.g. a Word or PDF document or a text snippet copied from a website, you have to split the text into single words first. I created a little tool doing this with a few mouse clicks:
Download Split Text Tool (Microsoft Excel 2007-2013, 115K)
How to use the tool:
- Copy the text into a simple text file (.txt)
- Open the tool, go to the worksheet “Words” and click on the command button at top right
- Browse to the text file in the following Open File dialogue window, click on “Open” and wait until the VBA code is finished
- Go to the worksheet “Unique Words and Count”. Copy the data from this Pivot Table and paste it into the Word Cloud tool provided in the previous post
I tested this tool with a few examples and it seems to work well, even for very long texts, e.g. the King James Bible with more than 780,000 words. It took a while, but it worked. However, please be advised that the tool doesn’t include any logic or intelligence. For instance, it doesn’t split “don’t” into “do” and “not” or “wanna” into “want” and “to”, etc.
Anyway, it was sufficient for my purposes and I thought some of you might find it useful, too.
The Algorithm
The basic approach of the algorithm is starting with the most frequent word and locating it at the center of the cloud frame. Then, it loops through the entire list of words sorted descending by word count and positions the word as close to the center of the frame as possible. This way all words will be arranged concentrically around the center. The higher the frequency of the word (i.e. the larger the textbox), the closer it will be to the center.
Here are the main steps of the algorithm:
Step 1: Delete the existing word cloud
Step 2: Calculate the distances of each pixel in the cloud frame from the frame center and sort the resulting array ascending by distance
Step 3: Get the source data (word list and word counts) from the worksheet and sort it descending by count
Step 4: Calculate the decrement of the font size for one word count based on the defined minimum and maximum font size
Step 5: Loop through all words in the list (sorted descending, see step 3) and
- add the shape
- rotate the shape, if required
- calculate the font size of this word (according to the decrement, see step 4)
- add the text and change the font size
- get the size of the current shape
- find the next possible position of the shape as close to the center as possible (using the distance matrix, see step 2)
- if a valid position is found, reposition the shape, else delete the shape
- if required, add a tooltip
- update the allocation matrix storing the pixels in the cloud frame which are already taken by a texbox
Step 6: If required, group all shapes and the cloud frame
The Implementation in VBA
The VBA implementation is pretty lean, only 2 modules with 428 lines of code in total:
- The module modSort is a generic helper module, containing the Quick Sort function provided by Chip Person here: Chip Pearson's Quick Sort. It is part of the project, but it can be used in any other project, too
- The module modCloud contains all functions and subs to create the word cloud
I commented the entire code as good as possible, including a description of each constant, variable and code section. Hence, I will not explain every step of the code here. If you are interested in understanding how the implementation works, please have a look at the code and the comments. If you have a question, please feel free to leave me a message in the comment section below.
Here are just some basics to give you a first overview:
- The function WordCloud is the heart of the solution, i.e. this function creates the word cloud. Except for rngWordList (the range with the data) all parameters of this function are optional, i.e. you can, but do not have to provide them. The function returns the number of words in the word cloud
- The subs DetectNextWordPosition and UpdateAllocationMatrix and the function ShapeFits are helper routines for the algorithm and called inside of the function WordCloud
- The sub CreateWordCloud simply calls the function WordCloud and passes through the parameters defined on the worksheet, like the minimum and maximum font size, etc.
- If you do not specify the optional parameters, the code will either use the defaults defined in the parameter declaration (e.g. screen updating is false) or the constants defined at the top of this function (e.g. font type is Calibri). If you want to set other defaults, simply change the parameter declaration or the constants
- Please also note the global variables at the top of the module, like the allocation matrix (to store the positions already taken by a word) or the distance matrix (all distances between the coordinates and the center)
Performance Optimization – General Remarks
If you need the word cloud as a one-off solution only, e.g. to paste it into a PowerPoint presentation or Word document, performance isn’t really important. However, if you want to use the word cloud on an interactive Excel dashboard in production, the performance of the algorithm is crucial for the user experience and acceptance.
Many people think VBA would be slow per se, but this is not the case. Agreed, VBA is slower than a compiled programming language like C# or C++, but you can still produce amazingly fast algorithms with VBA.
The most performance crucial parts of VBA applications in Microsoft Office are interactions between VBA and the application itself. The fewer times the code has to go through the “barrier” between VBA and e.g. Excel, the faster the code will run. In most use cases, those interactions are reading data from and writing data back to a cell range on a worksheet. The best approach for those usual suspects is reading the data from a range into a VBA array, performing all calculations and manipulations inside VBA and finally writing the results from the VBA array back to a range on the worksheet in one go.
Creating a word cloud with freeform shapes isn’t one of those usual suspects, though. In order to position the textbox correctly in the cloud, you have to know its size. And here is the challenge: you only know the size after you inserted the textbox, assigned the word and set the font size according to the word count in the data. To put it another way, you have to go through the VBA-Excel-barrier word by word and this leads to a performance issue, at least if you more than only a few words.
Performance Optimization – The Trick
Truth be told, there is not much you can do about going through the barrier, inserting the text and changing the font size word by word.
However, there are a lot of other format settings which are not word specific, like line and fill visibility, margins, text anchors, word wrap, auto size, font name and font color. All these settings are the same for all textboxes in the cloud. Doing all this formatting shape by shape would considerably increase the number of times the code has to go through the barrier and thereby definitely hit the overall performance.
That’s why I used a dirty little trick to minimize the formatting statements: the idea is to work with the default shape properties. An Excel workbook has a default for all freeform shapes. Every newly inserted shape will automatically get the default settings. You can change this default manually (right click on a shape and select “Set As Default…”) and - of course - you can also do this by VBA.
The trick is to temporarily set the default properties of a textbox shape to all the formatting which are not word specific (see above). This way, every textbox inserted by VBA will already have the desired formatting, except for the font size.
Here is the step-by-step of this trick:
- At the very beginning, the code inserts a temporary dummy shape to store the existing defaults
- Then it inserts another dummy shape and formats it with all the settings which are the same for all words in the cloud
- Next step is making this second dummy shape the default textbox using the .SetShapesDefaultProperties method
- Afterwards the algorithm runs and every textbox added by the code will already have the desired formatting
- Finally, after the algorithm is finished, the code sets back the default shape to the settings of the temporary shape created in step 1 and deletes the temporary shapes
It may sound a bit clunky, but it definitely pays off in terms of performance. I did only a few comparisons, but usually it cuts down runtime by 50%. A word cloud with 500 words needs ca. 2 seconds with this trick, but more than 4 seconds if you do all the formatting word by word. This definitely justifies the extra coding, I think.
By the way, I already used this dirty little trick in the Marimekko Charts I published last year: Marimekko Charts in Microsoft Excel and Pivot-like Marimekko Charts in Excel.
What’s Next?
I already mentioned in the previous post that Word Clouds – although they definitely have their place in the world of infographics - are very limited for serious data analyses and visualizations.
It may sound self-contradictory, but I want to stay with Word Clouds for one more post. The next article will describe how to create Word Clouds in Tableau (which is much easier than with Excel), provide a few tips and tricks on working with Word Clouds and especially demonstrate why Word Clouds are a poor data analysis tool compared to other chart types which are doing a much better job analyzing and visualizing text.
Stay tuned.