This tutorial is the first in a series that will help you learn how to analyse data about the extractives industry using Python.
We start off with a few FAQs before diving into the tutorial.
What are Python, R, Jupyter Notebooks, and Pandas?
Python and R are the most widely used programming languages for data science.
Python and R are just the languages; to actually use them, you need a program on your computer where you can write, test, and run code. That program is called a development environment. For Python, the most popular development environment is Jupyter Notebooks (which runs right in a web browser), and for R, the most popular development environment is RStudio.
On top of these languages, there are libraries that let you do specialised things. In Python, for example, Pandas is the most widely used library for data analysis.
Why are they useful?
How are they better than Excel, Tableau, etc?
What is the difference between R and Python?
Python is a general-purpose programming language, which means that in addition to doing data science, you can write almost anything in Python, from web apps to online games. People with a background in programming usually find Python easier to learn than R.
R is a programming language made especially for data science. As a result, you can’t build all kinds of things like games and apps with it, but it is excellent if you want to work on data. People coming from a data analysis or statistics background usually find R easier to learn than Python.
In terms of working with data, whether wrangling, analysing, or visualising, both Python and R work great for pretty much anything you can think of. The two languages are very different in style and syntax, but their functionality is very similar.
What do I need to get started?
For this tutorial, we will just stick to Python, and we will just run it using an online version of Jupyter Notebooks that does not require installing anything on your computer. You can get started by going here: https://try.jupyter.org/
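To give you a feel for what Pandas looks like before you dive in, here is a minimal sketch you could paste into a notebook cell. The file and column names (extractives_payments.csv, country, payment_usd) are hypothetical placeholders, not a real dataset from this series.

import pandas as pd

# Load a (hypothetical) CSV of project-level payments into a DataFrame,
# Pandas' core tabular data structure
df = pd.read_csv("extractives_payments.csv")

# Peek at the first five rows
print(df.head())

# Total payments by country, largest first
print(df.groupby("country")["payment_usd"].sum().sort_values(ascending=False))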
To learn more, head over to GitHub for the full tutorial.
In our first video training session, we presented a walkthrough of how to organize USEITI data for use in the open source mapping software QGIS. Fortunately, that dataset included geographic identifiers called Federal Information Processing Standard (FIPS) county codes: five-digit codes identifying counties and county equivalents throughout the United States. However, not every dataset will include a geographic identifier alongside data attributed to a location. Google Refine is a powerful and versatile tool that allows users to clean, manipulate, and transform their data. In this post we will walk through the process of using Google Refine to add geographic coordinates to a dataset.
Step 1 - Download and Install Google Refine
Navigate to the OpenRefine download page, and download Google Refine 2.5 for your operating system. Google Refine operates as a hybrid desktop and web application. When you run Google Refine, a browser window should open automatically and present you with the Google Refine web interface. Despite operating within a web browser window, Google Refine does not require an active internet connection to work. As long as the Google Refine application is running, you can navigate to http://127.0.0.1:3333/ to access the web interface.
Before we move to the next step, take a moment to download the following .csv file. This dataset was downloaded from ResourceProjects.org, and was reduced to only include 2015 projects carried out by Tullow Oil. Google Refine is a powerful piece of software, however, it can quickly get bogged down with very large sets of data. This file was limited to one company for the purposes of this tutorial.
Step 2 - Upload your dataset to Google Refine
To get started, click ‘Create Project’. You will be presented with a number of options for data inputs. We will create a new project using data from ‘This Computer’. Select the file downloaded in the step above, and click ‘Next’ to start the process of uploading the dataset.
Step 3 - Add a new column to fetch location information
With the dataset uploaded, Google Refine will present a preview of the entries. Review the data and headers to make sure everything appears as it should. At the bottom of the window, check that the ‘Parse next’ box is ticked so that the first row is parsed as column headers.
Click the ‘create project’ button in the upper right corner to proceed to the main working space of Google Refine. As noted above, we will be adding in additional geographic information to this dataset. To do so, click the triangle in the ‘Paid to’ column and navigate to ‘Edit column’ > ‘Add column by fetching URLs…’
A window will pop up as shown below. Name the column and enter the following text into the ‘Expression’ box. (Click here to learn more about General Refine Expression Language.)
Click ‘OK’ and the expression will produce a column containing what is essentially the output of a search of the Google Maps application programming interface (API) for each term in the ‘Paid to’ column. This operation will typically take several minutes to complete, depending on the size of the dataset. While you wait for the process to complete, you can experiment to get a better sense of how this function works. Enter the expression we just used, leaving off the last portion, into the address bar of another browser window:
Fill in the name of any location around the world after the “=” and you will see a page with all the relevant location information for that location. This should give you a better sense of what is happening under the hood with the fetching URLs function in Google Refine.
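If you are curious, here is a rough Python sketch of what Refine’s URL-fetching step amounts to: build a Google Maps geocoding URL from each cell value and store whatever the API sends back. This is only an illustration, not part of the Refine workflow, and note that current versions of the Geocoding API require an API key (this tutorial predates that requirement), so the unauthenticated call below may be rejected.

import requests

BASE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode(place_name):
    # Fetch the raw geocoding JSON for one place name, as Refine does per cell.
    # NOTE: newer versions of the Google Maps Geocoding API require a 'key'
    # parameter; this unauthenticated sketch mirrors the original tutorial.
    response = requests.get(BASE_URL, params={"address": place_name})
    return response.json()

# The long block of text and numbers that fills the new column
print(geocode("Ghana"))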
Step 4 - Add another column to parse the information from the previous step
Once the process has completed, you will see a column filled with a long string of text and numbers.
To clean this up we will add another column parsing that data. Click on the triangle in the new column you created in Step 3 containing all the Google Maps information, and select ‘Edit column’ > ‘Add column based on this column…’ Write in a title for this new column and enter the following text into the ‘Expression’ box:
Click ‘OK’ and the new column will populate with a neat set of latitude and longitude coordinates, separated by a comma, derived from the data in the column we produced in Step 3.
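In Python terms, this parsing step boils down to drilling into the fetched JSON for the coordinates, something like the sketch below (reusing the hypothetical geocode() helper from the earlier snippet; in Refine itself this is done with a GREL expression).

# Pull the coordinates out of the JSON fetched in Step 3
data = geocode("Ghana")
location = data["results"][0]["geometry"]["location"]
print("{}, {}".format(location["lat"], location["lng"]))  # roughly "7.95, -1.02"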
Step 5 - Export your project
The final step is to click ‘Export’ in the upper right corner of the Google Refine window. Select ‘Comma-separated value’ or ‘Excel’ from the dropdown list of file types.
You can then open the exported file in a desktop application to delete the column containing the unparsed location information while leaving the second column we created that includes the latitude and longitude coordinates. Google Refine is ideal for refining, cleaning and adding to a dataset, but operations like deleting rows and columns should be done in programs like Excel.
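(That said, if you are comfortable with the Python workflow from the start of this series, the same cleanup can be sketched in a few lines of Pandas; the file and column names below are hypothetical stand-ins for whatever you named yours.)

import pandas as pd

# Load the exported file, drop the raw API response column, and re-save
df = pd.read_csv("tullow_2015_geocoded.csv")
df = df.drop(columns=["google_maps_response"])
df.to_csv("tullow_2015_clean.csv", index=False)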
While this post demonstrates how latitude and longitude coordinates can be derived from a country name, the exact same process can be carried out for any other location. If instead of country names the dataset contained the names of cities or provinces, the same steps can be used to obtain the latitude and longitude coordinates. Location information can help you to create persuasive maps and other visualizations of your data. To learn more about what can be done with extractives data and mapping, navigate to the training section of Extract-A-Fact.
Always on the lookout for interesting data, I was excited when I recently came across a comprehensive trove of data on offshore production in the Gulf of Mexico from the Bureau of Ocean Energy Management (BOEM). The datasets at data.boem.gov include the geographic boundaries of offshore leases, lease ownership records, lease attributes, and oil and gas production volumes.
However, because the data is spread across four datasets, it is downright complicated to find out which leases a company owns, what the lease attributes are, and how much oil and gas has been produced from each lease. With that in mind, I have created an interactive map application that combines the geographic data, the ownership data, and the production data into one easy-to-use tool.
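Conceptually, the combining step is a set of table joins on a shared lease identifier. Here is a sketch of that kind of join in Python/Pandas; the file names and column names are hypothetical placeholders, not the actual BOEM schema.

import pandas as pd

# Hypothetical extracts of the BOEM datasets, keyed by lease number
leases = pd.read_csv("lease_attributes.csv")
owners = pd.read_csv("lease_ownership.csv")
production = pd.read_csv("lease_production.csv")

# Total production per lease, then join everything onto the lease table
totals = production.groupby("lease_number")[["oil_bbl", "gas_mcf"]].sum().reset_index()
combined = (
    leases
    .merge(owners, on="lease_number", how="left")
    .merge(totals, on="lease_number", how="left")
)
print(combined.head())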
See the full-sized map here.
This map allows the user to view each lease’s boundaries, attributes, current ownership, and production history in one place.
Take a look at the map, hosted on our Github page at the following link: https://pwypusa.github.io/pages/gulf_explorer.html
Click around and let us know what you find out, here or on Twitter @PWYPUSA!
The Aleph search tool, built and maintained by OpenOil, is a vast database of public documents filed by oil, gas, and mining companies in some of the biggest legal jurisdictions in the world. Aleph gives us access to millions of documents in one place that we can search by content, but finding the exact information we want can seem intimidating. The folks at OpenOil have provided some good resources for getting started with Aleph; find them here and here. In this blog post, I will make navigating this system even easier by providing a step-by-step guide to having the information you are most interested in delivered automatically from Aleph straight to your inbox.
At PWYP-US, we are particularly interested in (and excited about) the payments to governments reports that oil, gas, and mining companies listed on EU stock exchanges have recently disclosed. Major companies like BP, Shell, and BHP Billiton have all published reports for 2015. As more and more reports come online, there is a possibility that some may slip through the cracks. Luckily, Aleph has a search tool that can let you know when any new filings become available.
The first step is to navigate to aleph.openoil.net and register for an account.
Once that is done, on the home page click on “Alerts” in the upper right-hand corner. You should see this screen:
Click “Add” to set up a new alert.
You now have two fields to populate, “Query” and “Label”.
The “Query” field is what Aleph will use to search the database. Aleph has some advanced searching capabilities, which you can read about in depth here. We are going to use the composite querying function.
Setting up an alert for new payments to governments reports
If you want to find documents related to payments to governments reports, searching “payments to governments” is a little too imprecise. We need to get more specific.
The payments to governments reports are mandated by law, and as such, have a specific form name in each legal jurisdiction.
So, let’s construct a query that will cover all of these:
"Article L. 225-102-3" OR "DTR 4.3A" OR "Section 1504" OR "13q-1"
Add that to “Query” under your new alert and give it a name under “Label”. Select whether you want to be updated daily or weekly, then hit submit.
And there you have it!
Aleph will now automatically update you via email about any new documents that fit the conditions you outlined in your query. Currently, the London Stock Exchange is the only jurisdiction from which Aleph pulls filings that require payments to governments reports. However, the first mandatory disclosure reports will be released in Canada next year, and in the United States beginning in 2018. Including Section 1504 and the French article number may mean you get notifications that don’t contain actual mandatory disclosure reports, but making your search somewhat broad ensures that if a company filing, an Extractive Industries Transparency Initiative report, or a contract mentions any of the search terms, you will be notified.
*Note: Running this query we noted that using "13q-1" occasionally returned results that were not relevant. However, these should be minimal, and including "13q-1" ensures a more robust query than only using "Section 1504".
Tommy Morrison is a Research Assistant at PWYP-US, follow us on Twitter @pwypusa