By Yan Naung Oak, Phandeeyar
This is an abridged version of a module for an extractives data training online course that will be featured on School of Data’s website.
Extractives data comes from a wide array of sources. It’s the job of the person working with this data to extract, combine, analyse and communicate it. That’s where data visualisation comes in. If you have a basic grasp of working with data, you are probably familiar with basic chart types that are used in data visualisation.
There are many online resources that list taxonomies of data visualisation types and guide you in choosing the appropriate tool and type of data for each type of visualisation. Datavizcatalogue.com and Datavizproject.com are great examples of taxonomy sites. Dataviz.tools is a very useful site that catalogues the different tools available for data visualisation. The Financial Times’ Visual Vocabulary chart provides a handy guide for matching data types to visualisation types.
When we are working with data from the extractives sector, we often find that there is a need to communicate “flows”. According to FT’s visual vocabulary chart, sometimes we need to “Show the reader volumes or intensity of movement between two or more states or conditions. These might be logical sequences or geographical locations”. Examples of flows in extractives data include:
In this module, we will create visualisations of the first of those examples, and we will use data about the global uranium industry to demonstrate how to prepare data and create these visualisations.
We will be using a free-to-use online tool called RAW. RAW is designed to create highly customizable, static (i.e. not interactive) visualisations, and it makes it easy to create interesting visualisation types that many off-the-shelf tools do not provide.
RAW is an especially useful tool for designers because it lets you export visualisations as SVG files, which can be further edited using vector graphics software such as Illustrator. You can check out some of the beautiful visualisations created using RAW in the gallery section of their website.
For developers, RAW offers a lot of customizability too. It is completely open source, and is built on top of D3.js, one of the most popular web data visualisation libraries. If you’re ambitious and know a bit of D3, you can even add your own new chart types to RAW.
If you’re sold on how awesome RAW is, great! Let’s get started.
Sankey Diagram of Revenue Flows
The first kind of chart we will make is a sankey diagram that shows the revenue flows to government from the extractives sector as presented in an EITI report. Specifically, we will look at the revenue flows reported in Kazakhstan’s 2014 EITI report. “Wait, I thought we were going to focus on data from the uranium industry? What’s that got to do with Kazakhstan?” I hear you ask. Stay tuned, you will see why in the next section.
You can access the PDF file of the report from the global EITI website here, or from the Google Drive in the data section of the EITI website. We are specifically interested in this table on page 61 of the report:
We can use Tabula to extract the table from the PDF. After a bit of manual cleaning, we can get a table in Excel or Google Sheets that looks like this:
The first 3 rows of data come directly from the PDF table’s first five rows of data (excluding the share of total as % row). The last row, “Non-Extractive Receipts”, is calculated with a simple formula: the “Tax Receipts Total” row minus the sum of the “Oil and Gas Receipts” and “Mining Receipts” rows.
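The formula for the last row can be sketched in a few lines of Python. The figures and the column names below are made-up placeholders for illustration, not values from the EITI report; only the row names come from the table described above.

```python
# Placeholder figures for illustration only; NOT taken from the EITI report.
# The column names "Budget A" and "Budget B" are hypothetical stand-ins.
tax_receipts_total = {"Budget A": 1000, "Budget B": 400}
oil_and_gas_receipts = {"Budget A": 450, "Budget B": 100}
mining_receipts = {"Budget A": 150, "Budget B": 50}

# Non-extractive receipts = total minus the two extractive rows, column by column.
non_extractive_receipts = {
    column: tax_receipts_total[column]
    - (oil_and_gas_receipts[column] + mining_receipts[column])
    for column in tax_receipts_total
}

print(non_extractive_receipts)  # {'Budget A': 400, 'Budget B': 250}
```

In a spreadsheet, the same calculation is a single subtraction formula copied across the columns.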
Let’s look at what the column names mean:
We always have to remember the context behind the data we are trying to visualise. Especially in the extractives sector, the data is very complicated and tied up with the policies of the individual country and/or company. In Kazakhstan’s case, the total receipts are broken down into the state budget and the national fund. The state budget is then broken down into the republican budget and the local budget. In addition to that, there is also a special tax that companies in the oil sector have to pay, which is not included in the state budget but is included in the national fund. Writing it all out in text makes it sound quite confusing, and that’s why we’re visualising it in the first place.
RAW is incredibly easy to use, but the most difficult step is making sure the data is in the correct shape for the chart that you are trying to make. Notice the color coded cells in Table 2 above? Those are the figures that we want to chart using RAW. But why are we ignoring the first two columns and the first row? It’s because the figures in those columns and that row are just sums of the other figures, and RAW will automatically sum up the figures for you. In general, you want to give RAW only the most disaggregated data.
Now that we know which figures we want to use, we still have to reshape them into a format that works for RAW’s sankey diagrams. Sankey diagrams have a series of stages, with the flows diverging or converging at each stage. Hence, we have to reshape the data so that it will look like Table 3 below. The color codings on the cells with the numbers are the same as in Table 2, so you know where each of the numbers goes.
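To see why the aggregated columns and row can be left out, here is a small Python sketch that rebuilds the totals from disaggregated figures, which is roughly what a sankey-style chart does when it sizes each node. The numbers are invented placeholders, and the code only illustrates the idea; it is not how RAW itself is implemented.

```python
from collections import defaultdict

# Invented placeholder flows: (sector, destination, amount).
flows = [
    ("Oil and Gas", "Republican Budget", 30),
    ("Oil and Gas", "National Fund", 60),
    ("Mining", "Republican Budget", 20),
    ("Mining", "Local Budget", 10),
]

# Summing the disaggregated flows recovers every subtotal and the grand total.
sector_totals = defaultdict(int)
destination_totals = defaultdict(int)
for sector, destination, amount in flows:
    sector_totals[sector] += amount
    destination_totals[destination] += amount

grand_total = sum(amount for _, _, amount in flows)
print(dict(sector_totals), dict(destination_totals), grand_total)
```

If the subtotals were pasted in as extra rows, they would be counted a second time on top of the figures they summarise.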
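As a rough illustration of this reshaping step, the Python sketch below turns a small wide table into one row per flow, with a column for each stage. The figures are invented placeholders, and the table layout, the `STAGE_OF` mapping and the `to_flow_rows` helper are assumptions made for this example, loosely based on the budget structure described above rather than the exact layout of Table 3.

```python
# Invented placeholder figures; NOT taken from the EITI report.
# Assumed layout: one row per sector, one column per final destination.
wide_table = {
    "Oil and Gas Receipts": {"Republican Budget": 30, "Local Budget": 10, "National Fund": 60},
    "Mining Receipts": {"Republican Budget": 20, "Local Budget": 10, "National Fund": 0},
}

# Assumed intermediate stage for each final destination.
STAGE_OF = {
    "Republican Budget": "State Budget",
    "Local Budget": "State Budget",
    "National Fund": "National Fund",
}

def to_flow_rows(table):
    """Return one (step 1, step 2, step 3, amount) tuple per non-zero cell."""
    rows = []
    for sector, cells in table.items():
        for destination, amount in cells.items():
            if amount:  # skip empty flows
                rows.append((sector, STAGE_OF[destination], destination, amount))
    return rows

flow_rows = to_flow_rows(wide_table)

# Tab-separated text, the same shape as a range copied from a spreadsheet.
header = ("Step 1", "Step 2", "Step 3", "Amount")
print("\n".join("\t".join(str(cell) for cell in row) for row in [header, *flow_rows]))
```

For a table this small, doing the same rearrangement by hand in a spreadsheet is just as reasonable; a script mainly pays off when the table is large or updated regularly.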
As you can see, we have divided the categories into different steps to show how each item is broken down into subcategories. Once you have this table of data prepared, we can go over to the RAW web app (apps.rawgraphs.io) to start visualising.
In the first screen that you see, just copy and paste the data directly from your spreadsheet. Make sure to change the format of the numbers in your table so that they don’t contain any thousand separator commas (i.e. we want 1000000, not 1,000,000).
If the data is accepted by RAW, the bar below the text box will turn green with a little thumbs up icon, and it will tell you how many rows of data have been loaded. In the top right corner, you can change the view of the data to a table view to see the data more clearly.
Scroll down. Once the data is loaded, RAW will let you choose the type of chart you want. The sankey diagrams that we want are called Alluvial Diagrams in RAW. There’s a subtle difference between the two, but the terms are often used interchangeably; you can refer to the dataviz project’s pages on sankey and alluvial diagrams to see the difference. Click on Alluvial Diagram in the list of charts.
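If the figures reach you as text rather than as spreadsheet cells, the separators can also be stripped with a short script. This is a minimal Python sketch; the function name is hypothetical, and it assumes whole numbers that use a comma as the thousand separator.

```python
def strip_thousands_separators(text):
    """Convert a string such as '1,000,000' to the integer 1000000."""
    # Assumes commas are only ever used as thousand separators.
    return int(text.replace(",", ""))

print(strip_thousands_separators("1,000,000"))  # prints 1000000
```

Data that uses a comma as the decimal mark would need different handling, so check the number format of your source first.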
Next, we have to choose which columns from the data we want to visualise. Since we have pre-prepared the data to fit the sankey diagram format on RAW, this step is quite simple. Drag the column names into the boxes as shown below.
After that, you’re basically done! Scroll down further to see what the chart looks like.
The chart updates live depending on which columns you drag into the boxes in the “Map Your Dimensions” section, so you can play around to see what kind of changes your choices make to the chart. For example, if you don’t include anything in the “Size” box, RAW will just assume that all the flows are the same size, as seen below:
There are some limited options for changing colors and dimensions on the left, but for real customization, RAW itself is not the best tool. It is best used in conjunction with a vector graphics editor like Illustrator to really polish up your charts. RAW is designed to make this kind of export to graphics editing software easy. Scroll down to the Download section to see how.
If you are satisfied with your chart and want to use it as an image, choose “image (png)” from the dropdown, give your file a name, click download, and you’re done! However, there are two other formats that you can get the chart in. Select “vector graphics (svg)” to get it in a format that can be edited further in vector graphics software. If you want to embed the chart in a web page, you can copy and paste the code in the “Embed SVG Code” box into your HTML. There is an additional option to download the chart’s data model in JSON format, but that option is for more advanced users and we won’t cover it in this tutorial.
That’s it! Making charts in RAW is super quick and simple. No need to register for accounts, everything is completely free (not “freemium”), and it’s all done on a simple web app on a single page.
Bump Chart of Uranium Production by Country
Next, let’s try another useful chart type that takes data in a different shape from the sankey chart.
In this section, we will visualise how uranium production has changed over time by country. The dataset we will use is from the World Nuclear Association’s page on World Uranium Mining Production. We want this table:
Since the data is already neatly in a table, we don’t even have to import it into a spreadsheet first. Simply drag your mouse cursor over the HTML table to select all the rows except the last 3 (as shown in the screenshot above), and then copy and paste into RAW.
Next, we use a function that is built into RAW to get the data into the correct shape for this chart. What we want to do is stack the data, so that all the production values are in one column and the year labels are all in one column as well. To do that, click on the text that says “Click here to stack it” and choose “Country” as the dimension. Your data should change from looking like this:
See what happened? We stacked the columns for each year into just two columns.
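For readers who want to see the stacking operation spelled out, here is a minimal Python sketch of the same wide-to-long reshaping. The figures are invented placeholders, not World Nuclear Association data, and the `stack` helper and the output column names `column` and `value` are assumptions for this example; RAW’s own output may be labelled differently.

```python
# Invented placeholder figures; NOT World Nuclear Association data.
wide_rows = [
    {"Country": "Country A", "2010": 100, "2011": 120},
    {"Country": "Country B", "2010": 90, "2011": 80},
]

def stack(rows, dimension):
    """Turn every non-dimension column into its own (dimension, column, value) row."""
    stacked = []
    for row in rows:
        for column, value in row.items():
            if column != dimension:
                stacked.append(
                    {dimension: row[dimension], "column": column, "value": value}
                )
    return stacked

stacked_rows = stack(wide_rows, "Country")
for row in stacked_rows:
    print(row)
```

Two countries with two year columns each become four rows, one per country and year.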
Since the data is ready and in good shape, we can now scroll down, choose the “Bump Chart” type, and map the dimensions as follows:
And that’s it! Here’s our beautiful Bump Chart:
By default it is organised as a ranking from largest to smallest, which is exactly the way we wanted to visualise it, but feel free to play around with the “Sort By” options. Now you can see why we focused on Kazakhstan in the first section of the tutorial. That’s the power of a bump chart.
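The ranking behind a chart like this can be sketched in Python as follows. The figures are invented placeholders, and the `rank_by_year` helper is a hypothetical illustration of ordering countries from largest to smallest within each year; it is not RAW’s actual code.

```python
from collections import defaultdict

# Invented placeholder figures: (country, year, production).
stacked = [
    ("Country A", "2010", 100), ("Country B", "2010", 90),
    ("Country A", "2011", 70), ("Country B", "2011", 80),
]

def rank_by_year(records):
    """For each year, list the countries from largest to smallest production."""
    by_year = defaultdict(list)
    for country, year, production in records:
        by_year[year].append((country, production))
    return {
        year: [
            country
            for country, _ in sorted(entries, key=lambda entry: entry[1], reverse=True)
        ]
        for year, entries in by_year.items()
    }

rankings = rank_by_year(stacked)
print(rankings)
```

In this made-up example, Country B overtakes Country A in 2011, which is the kind of change in position a bump chart makes visible.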