In our efforts to empower citizen groups around the globe to hold their governments accountable, we here at PWYP-US aim to provide you with every tool you might need in the fight for transparency. And while there are many ways to use data to investigate extractive industry payment disclosures (with tools like QGIS), often the biggest hurdle is simply extracting the data from its location on the web into a usable format. This short guide covers one tool that can simplify that process for web-based tables: the Scraper extension for Google Chrome. A web scraper automatically collects information from the internet, letting users easily transfer that information into a format in which they can analyze or manipulate it. This is much faster than copying data by hand and reduces the possibility of error.
The Scraper extension is free and can be found here in the Chrome web store. It is easily installed by clicking “Add to Chrome” in the top right corner and following the short instructions.
Once Scraper is successfully installed, it can be found in the top right corner of your browser near the Menu button.
How and When to Use Scraper
Using Scraper is truly simple, but it is not always clear where it is best employed. Many online data tables can simply be copied and pasted directly into a spreadsheet program and used from there. Scraper comes in when the web page's formatting gets in the way of a simple copy-and-paste operation.
For the rest of the demonstration, we will use tables from the 2015 Form 20-F filed by Total S.A. with the US Securities and Exchange Commission. It can be found here, with the pertinent tables beginning on page S-18.
The “Report on the payments made to governments” contains information on Total’s payments to governments in all countries of operation. This includes payment streams such as royalties, taxes, rents and fees. The data is categorized into payments by project and payments to various levels of government in a country. Once it is in a machine-readable format (such as an Excel file) we can compare it to other data we may have, be it from USEITI, government data, or other countries’ EITI reports.
To use Scraper, first identify a table from which you want to extract data. We will use table 3.1: “Reporting by country and type of payment” on page S-19.
Once you have your table identified, use your cursor to highlight part of the table, being sure to select two or more full rows.
Then, leaving the cells highlighted, we have two options:
Option 1: Go to the Scraper icon in the top right corner of the browser and select “Scrape similar…” from the dropdown menu
Option 2: Right click and select the “Scrape similar…” option from the cursor dropdown menu
From there, Scraper will open a new window containing the scraped data. Scraper will automatically locate the bounds of the table and find the column names from only the few rows we highlighted in the previous step.
We can now view the data in the right panel and check that it has been scraped properly. If we decide it looks good enough to export, we simply click “Copy to clipboard”.
Then go to our preferred spreadsheet program and paste with Ctrl + V.
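As an aside, if you are comfortable with a little code, the same kind of table can often be pulled straight from the page with the Python library pandas. Below is a minimal sketch, assuming a filing URL (the address shown is only a placeholder, not the real location of Total's 20-F); the rest of this walkthrough sticks with the copy-and-paste route into a spreadsheet.

```python
# Minimal sketch: pull HTML tables from a page with pandas.read_html.
# The URL is a placeholder -- substitute the actual address of the filing.
import pandas as pd

url = "https://www.sec.gov/path/to/total-20-f.htm"  # placeholder, not the real link

# read_html returns one DataFrame per <table> element found on the page
tables = pd.read_html(url)
print("Found", len(tables), "tables")

# Inspect the list, pick the payments table, and save it for later cleaning
payments = tables[0]  # the index depends on the page layout
payments.to_csv("total_payments_by_country.csv", index=False)
```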
Now our data is in Microsoft Excel. But before we start any analysis, we need to perform some data checking and cleaning to make sure the data is 100% correct and usable. We can tell at first glance that there are a lot of empty columns and that the column headers are not lined up with the values.
First, we need to move the column headers into the right positions. By cross-checking with the original HTML table on the SEC website, we can simply drag and drop the headers in the Excel spreadsheet into their correct locations.
For example, we see that “License bonus” in the Excel spreadsheet has ended up over the “License fees” column.
Now, before we delete the empty columns, we need to make sure they are really empty. In doing this, we will find that one cell has been split incorrectly.
Checking back against the original table, we see that (1) is a footnote indicating that (599) is a negative payment. We can fix the inconsistency by clearing both cells and replacing “(599” with -599. Now we can delete the empty columns as well as the first row.
Our spreadsheet should look like this:
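For anyone following the programmatic route from the earlier sketch, the same cleanup can also be expressed in pandas. This is only an illustration of the steps above; the file name and the specific value fix are assumptions based on this particular table.

```python
# Hypothetical cleanup mirroring the manual steps: drop empty columns and
# repair the split negative value "(599" (file name carried over from the
# earlier sketch).
import pandas as pd

df = pd.read_csv("total_payments_by_country.csv")

# Drop columns that contain no values at all
df = df.dropna(axis="columns", how="all")

# Replace the broken "(599" entry with the negative number it represents
df = df.replace({"(599": -599})
```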
Finally, we need to check that the data matches correctly. By using the spreadsheet’s =sum() function, we can compare the imported “TotalofPayments” in column I to a total we calculate ourselves using the imported values in columns B through H.
This is done by selecting the empty cell in column J and entering =sum(B2:H2).
This gives us a total, calculated from the numbers in the spreadsheet, that should equal the imported number under “TotalofPayments”. We can check this visually or set up an equality check in the next cell by entering =J2=I2 (meaning: is J2 equal to I2?).
The result is “TRUE”, which means that the data in that row imported correctly. To check the remaining rows, we can easily expand the =sum function and the equality function to the other cells in columns J and K.
First, select both cells:
Then click and hold the lower right-hand corner of cell K2, drag down all the way to the last row of our data, and release.
This action copies the formulas in J2 and K2 to all the cells below while updating the row references to match. Now it is easy to see that all of the cells in column K are TRUE and that our data has been imported 100% correctly.
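The same sanity check can be written in code as well. Here is a rough pandas equivalent of the =sum and equality formulas above, assuming the payment values sit in spreadsheet columns B through H (positions 1 through 7 of the DataFrame) and the imported total in column I (position 8):

```python
# Recompute each row's total from the payment columns and compare it to the
# imported total, mirroring the =sum(B2:H2) and =J2=I2 checks above.
check_total = df.iloc[:, 1:8].sum(axis=1)   # columns B through H
imported_total = df.iloc[:, 8]              # column I, "TotalofPayments"

matches = check_total == imported_total
print(matches.all())  # True means every row imported correctly
```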
Tools like Scraper can turn data collection from an arduous chore into a simple task, greatly increasing efficiency and accuracy and leaving us more time to examine the numbers. Stay tuned for a walkthrough of a similar program, Tabula, to learn how to scrape tables from PDF reports.
Tommy Morrison is a Research Assistant at PWYP-US. Follow us on Twitter @pwypusa.