With new disclosure laws in effect in the European Union, Norway, Canada, and now the United States, there has never been a better time to be on the front lines of the fight for transparency and accountability in the extractive industries. Project-level payments are being disclosed by oil and gas companies like BP, Statoil, and Shell. However, as any seasoned data extractor will know, this information is often released in PDF format, making it difficult to transfer the data tables into a spreadsheet application where they can be put to use. Companies listed and registered in the UK are required to disclose this information in XML format (with company reports available for download as .csv files), but not all countries require this, which can make data extraction seem an arduous task.
Tabula is a powerful and extremely useful open-source web application for extracting data locked in tables in PDF documents. Similar to the Google Scraper application we introduced in an earlier post, Tabula can expedite the process of getting information into a usable format when copying and pasting is not an option.
NOTE: Tabula will only work on text-based PDF documents, including scans that have been through optical character recognition (OCR), not purely image-based documents. Put simply, a PDF report with a text layer makes the content searchable and interpretable by software. Even if you don’t know which type your document is, Tabula will prevent you from uploading the wrong kind.
Tabula can be downloaded at Tabula.technology and works on Windows and Mac. Follow the instructions on the page, and make sure you have a version of Java installed.
Once you have Tabula installed, double-click the "tabula" application file in the Tabula folder, and a page will open in your web browser.
NOTE: The command prompt will open and run for a few seconds before the page opens in the browser.
The first step is to import the PDF from which you want to extract data into Tabula. For the remainder of this post, we will be using the BHP Billiton Economic contribution and payments to governments Report 2015, which the company released voluntarily in September 2015. Some company reports include this information in .csv files, but so far BHP Billiton has only provided PDF reports.
If we try a simple copy and paste operation into a spreadsheet application with any of the tables in the document, we will find that all the information is imported into a single column or a single cell.
Download the report from the company site linked above, then use the “Browse” button on the Tabula page to find the saved PDF file. Select it and click “Import”.
The file may take a minute to upload, depending on its size.
Once the file is uploaded, you will be taken to a page with a viewer showing the PDF report.
Before we get started extracting the data, it is important to note some reports will pose additional challenges. If we have a look at the data in the report, we will notice the tables are spread over two pages, as if to be read side by side in a book. This would make manually copying the data a real chore, but luckily with Tabula it will take just a few extra steps to reconcile. The image below shows an example of how BHP is presenting the table with columns extending across two facing pages.
From here, we have two options for extracting the data in the tables. The first option is to select the “Autodetect Tables” button and let Tabula do the work of finding the tables in the PDF document. This option may work in some circumstances with small, orderly documents, but with the large number of tables in this payments to governments report, Tabula will be slow to find the tables and may not be as precise as we want. For these reasons, we will use the second option: finding and highlighting the tables ourselves with the cursor.
To do this, just scroll down in the viewer window to the table you would like to extract. The first one in the BHP Billiton report is table 7.1 on pages 14 and 15.
To select the data, place your cursor at the top left corner of the data table, then click and drag to the bottom right corner of the table. If the data table is longer than is visible in the viewer, use your scroll wheel or arrow keys to scroll down while holding the mouse button.
It is a good practice to select only the column headers and the data below them; selecting table titles or additional parts of the page may cause your extracted data to become jumbled.
Once you select the table portion on page 14, scroll down to page 15 and repeat the process for the remainder of table 7.1.
Once we have a selection highlighted, select “Preview & Export Extracted Data” from the toolbar.
After loading for a moment, we should see a preview of the extracted data in an organized table.
At this point, it is good to have your PDF report open in another window so we can check a few values to be sure the data has been extracted properly.
In this case, we will see that the table has been extracted correctly, except that the column titles are split across two rows. This problem is easily taken care of later.
If your data does not appear to have extracted correctly (e.g. multiple values in one cell), try to reselect the table(s) by clicking the “Revise selection(s)” button on the left toolbar and repeating the above steps.
Now, using the toolbar we can select the format of the exported data (CSV, TSV, JSON), and export with the “Export” button. The exported file will be downloaded and can be opened via your “Downloads” folder.
Exporting as a CSV will produce a file that can be opened with Microsoft Excel, while the other file types require you to choose the application you want to use.
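If you would rather inspect the export outside a spreadsheet, the CSV can also be read programmatically. Here is a minimal sketch using Python’s standard csv module, with an inline sample standing in for a real Tabula export (the column names and figures are illustrative, not taken from the report):

```python
import csv
import io

# Inline stand-in for a Tabula CSV export; in practice you would
# open the downloaded file instead. Figures are illustrative.
sample = io.StringIO(
    "Country,Royalties,Fees,Total payments to governments\n"
    "Australia,1200,40,1240\n"
    "Chile,150,10,160\n"
)

rows = list(csv.reader(sample))
headers, data = rows[0], rows[1:]
```

With a real export, you would replace the io.StringIO object with something like open("your-export.csv", newline=""), using whatever filename your download received.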
Additionally, we can copy to the clipboard with the “Copy to Clipboard” button and paste the tabular data directly into a spreadsheet application. This method will format the pasted data correctly into rows and columns in Excel, unlike copying and pasting directly from the PDF document.
With our data now in Microsoft Excel, we can perform one last round of checks to verify and then analyze the data in Excel, or in another data analysis/visualization software.
As noted in other Extract-A-Fact posts, extracting and exploring data will often require a few steps of cleaning and organizing before the dataset can be most useful.
In this instance, we must first fix the column headers.
Don’t forget to change the column headers for the data extracted from page 15 (the second half of the table) at the bottom of our Excel data.
Once that is taken care of, we need to bring the bottom half of our table in line with the top half. To do this, highlight the section starting with “Royalties” (this is the first column header from page 15) all the way to the final data point in the bottom right. Copy this section and paste it into the first cell next to “Total Taxes Paid” in column E, then delete the bottom section.
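For readers comfortable with a little scripting, this cut-and-paste step can also be sketched in Python. The rows below are made up for illustration, not taken from the report; they just mimic how the two page-halves of table 7.1 come out of Tabula stacked one above the other:

```python
# The two halves of the table extract as two stacked blocks:
# rows from page 14 first, then the matching rows from page 15.
# Figures are illustrative, not from the actual report.
page14 = [
    ["Country", "Taxes paid", "Total taxes paid"],
    ["Australia", "3000", "3000"],
    ["Chile", "800", "800"],
]
page15 = [
    ["Royalties", "Fees", "Total payments to governments"],
    ["1200", "40", "4240"],
    ["150", "10", "960"],
]

# Paste each page-15 row to the right of its page-14 counterpart
merged = [left + right for left, right in zip(page14, page15)]
```

This relies on the two halves having their rows in the same order, which is exactly what the side-by-side layout in the report guarantees.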
You should have a single data table with columns A through M, ending at “Taxes collected (2)”.
Now we can use the strategy outlined in the Google Scraper post to verify that the data in each row is correct. We do this by summing the values in each row except the “Total” column and comparing that sum to the value in the “Total” column.
We need to be careful with this report, though, because there are multiple “Total” columns across each row.
We will use the “Total Payments to Governments” value to check our data, using the =SUM() function. You may want to move the other Total columns (E and K) to the end of the table, or delete them before you start any analysis, so you don’t count any values twice.
After summing the corresponding cells, use an equality formula, e.g. =(N4=L4), in the adjacent column to easily see whether the values match.
The TRUE result assures us the data has been extracted correctly. Drag both these equations down the length of the data table to ensure that every row is correct.
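The same row-sum verification can be expressed outside Excel. Here is a small Python sketch of the idea, using made-up numbers rather than figures from the report:

```python
# Each row lists its payment components and a reported grand total.
# Figures are made up for illustration.
rows = [
    {"country": "Australia", "components": [3000, 1200, 40], "total": 4240},
    {"country": "Chile",     "components": [800, 150, 10],   "total": 960},
]

# A row passes the check if its components sum to the reported total
checks = {row["country"]: sum(row["components"]) == row["total"] for row in rows}
```

Any False value in checks points to a row worth re-extracting or inspecting by hand, just like a FALSE result from the Excel equality formula.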
NOTE: Be sure to remove any superscripts from the data (e.g. 7,800*) so that Excel recognizes the cell as a number rather than text. Also, check that all negative values are imported properly; they are normally indicated with parentheses in the company reports [e.g. (12,500)].
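If you are cleaning many cells, these fixes can be scripted as well. Below is a hypothetical helper (the name clean_number is ours, not part of Tabula or any library) that strips thousands separators and trailing footnote marks, and treats a parenthesized value as negative:

```python
def clean_number(cell: str) -> float:
    """Convert a report cell like '7,800*' or '(12,500)' to a number.

    Hypothetical helper, not part of Tabula: strips thousands
    separators and trailing footnote marks, and treats a value
    wrapped in parentheses as negative. Superscript footnote
    digits can look like part of the number once extracted, so
    remove those by hand before running this.
    """
    text = cell.strip()
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    text = text.replace(",", "").rstrip("*†")  # drop separators and footnote marks
    return -float(text) if negative else float(text)
```

Running the cleaned values through a check like this before analysis saves you from Excel silently treating a column of figures as text.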
Congratulations, you have successfully extracted payment data from a PDF table!
Tabula is also capable of extracting multiple selections at once. Just go to the PDF viewer and, using the cursor, highlight as many tables as you want before clicking “Preview & Export Extracted Data”.
Doing this will require some extra diligence when checking the data, but can be an amazing time saver when extracting data from multiple tables.
Tabula is a powerful tool that can greatly increase the efficiency and accuracy of any data extraction project involving PDF documents. We hope you will take advantage of this resource (and our other trainings) in your fight for extractives accountability.
Let us know if you have any questions or need help troubleshooting in the comments section below.
Click here for the archives to see our full list of posts.