This tutorial is the first in a series that will help you learn how to analyse data about the extractives industry using Python.
We start off with a few FAQs before diving into the tutorial.
What are Python, R, Jupyter Notebooks, and Pandas?
Python and R are the most widely used programming languages for data science.
Python and R are just the languages; you also need a program on your computer where you can write, test, and run code in them. That program is called a Development Environment. For Python, the most popular Development Environment is Jupyter Notebooks (which runs right in a web browser), and for R it is RStudio.
On top of these languages, there are libraries that let you do specialised things. In Python, for example, Pandas is the most widely used library for data analysis.
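To give you a quick taste before the tutorial proper, here is a minimal Pandas sketch; the company names and payment figures below are made up purely for illustration.

import pandas as pd

# A tiny, made-up table of payments to government.
payments = pd.DataFrame({
    "company": ["Alpha Mining", "Beta Oil", "Alpha Mining"],
    "payment_usd": [1200000, 850000, 430000],
})

# Total payments per company, largest first.
print(payments.groupby("company")["payment_usd"].sum().sort_values(ascending=False))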
Why are they useful?
How are they better than Excel, Tableau, etc?
What is the difference between R and Python?
Python is a general purpose programming language, which means that in addition to doing data science, you can write anything in Python ranging from web apps to online games. People with a background in programming usually find Python to be easier to learn than R.
R is a programming language made especially for data science. As a result, you can’t build all kinds of things like games and apps with it, but it is awesome if you want to work on data. People coming from a data analysis or statistics background usually find R to be easier to learn than Python.
When it comes to working with data, whether wrangling, analysing, or visualising it, both Python and R work great for pretty much anything you can think of. The two languages are very different in terms of style and syntax, but their functionality is very similar.
What do I need to get started?
For this tutorial, we will stick to Python and run it using an online version of Jupyter Notebooks that does not require installing anything on your computer. You can get started by going here: https://try.jupyter.org/
To learn more, head over to GitHub for the full tutorial.
A change in government often brings significant shifts in policy. Major initiatives taken up by a previous administration can be slowed or reversed, and information that was once publicly available may be taken down or censored. The White House webpage provides some clear examples of this phenomenon. Following the inauguration this past January, the press reported that the Trump Administration’s White House homepage underwent some changes, such as striking references to climate change and removing a Spanish language option. Fortunately, if a user wants to view the content from the White House homepage of President Barack Obama, it is still possible to do so by navigating to https://obamawhitehouse.archives.gov.
Citizens can also use the Internet Archive’s “Wayback Machine” to access www.WhiteHouse.gov and see its content for any given day going back several years. These archive solutions are helpful for viewing web content, but the files hosted on those pages can still be lost: documents can become inaccessible as other content is changed.
Since 2010, the Publish What You Pay coalition, academics, industry, investors and other actors have submitted hundreds of comment letters to the Securities and Exchange Commission (SEC) to influence the agency’s Section 1504 rulemaking. Every comment submitted to the SEC is available on the regulatory agency’s website. The comments are available as pdf files on four separate comment records: 2010, 2010-2012, 2013-2015, and 2015-2016. Because of the current wave of government self-censorship, we wanted to make sure we could preserve the evidence in the Section 1504 record. This post will provide the steps to download all linked documents, such as pdf files, from a website. The SEC comment record will be used as an example, but the same steps can be used to download and preserve files hosted on any site.
As with other data scraping and organizing processes, the steps described in this post could be carried out manually. For example, scraping data from a company pdf report can be done by hand, with a user entering data line by line into a spreadsheet, but that is a time-consuming process. As we described previously on Extract-A-Fact, there are tools to help speed up data scraping. To automate the downloading of all linked files on a website, we will use the Google Chrome extension Chrono Download Manager - see the tutorial below.
Step 1 - Install the Chrome extension
Navigate to the Chrome web store page for the Chrono Download Manager and click the ‘Add to Chrome’ button in the upper right. A notice will pop up and you can safely click ‘Add extension’ to confirm installation. When the installation completes you should find a new icon in the upper right corner of your Chrome browser.
Step 2 - Download linked files
Before proceeding, we recommend setting a dedicated folder for downloads: navigate to chrome://settings in your Chrome browser and choose a specific downloads folder. See the image below for an example.
Next, navigate to the page with the files you intend to download. In this case we will use the most recent 1504 comment record. Once on the page, click the Chrono Download Manager icon in the upper right. Select the ‘Document’ tab in the window that pops up.
The ‘Document’ window presents a list of all the links on the page that are interpreted as documents. In this case, we are only concerned with downloading the pdf files. To narrow the selection, click the ‘pdf’ check box as shown below.
Once you’ve selected all the relevant documents you can click ‘Start all’ in the lower right of the window to download the files into the folder you selected in the Chrome browser settings.
Optional Step 3 - Categorize the downloaded files
If you follow the steps above you will be able to successfully download all of the files from a webpage, which will simply be listed by their filename (e.g. s72515-1.pdf). To help organize the files, you can have Chrono Download Manager automatically attach the descriptive text corresponding to each file. Click the first document highlighted in green (see image above), scroll down to the last pdf and press shift+left mouse button on the last highlighted pdf. With all of the pdf files checkmarked and selected, click the ‘Task Properties’ tab as shown below.
Click the text box next to ‘Naming Mask’ and select ‘*text*.*ext*’ then click ‘Start All’ to download all of the files. You’ll find that the downloaded files will now appear in the folder with a descriptive title (e.g. Jana L. Morgan, Director, Publish What You Pay – United States) rather than the numbered file name.
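If you prefer a scripted alternative to the Chrome extension, the same job can be done in a few lines of Python (the language from our first tutorial). This is only a rough sketch: it assumes the requests and beautifulsoup4 packages are installed, and the page address shown is a hypothetical example that you should replace with the comment record page you want to archive.

import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Hypothetical example address: replace with the page whose linked pdf files you want to save.
PAGE_URL = "https://www.sec.gov/comments/s7-25-15/s72515.shtml"
DOWNLOAD_DIR = "sec_1504_comments"

os.makedirs(DOWNLOAD_DIR, exist_ok=True)
soup = BeautifulSoup(requests.get(PAGE_URL).text, "html.parser")

# Find every link that points to a pdf and save it into the download folder.
for link in soup.find_all("a", href=True):
    href = link["href"]
    if href.lower().endswith(".pdf"):
        file_url = urljoin(PAGE_URL, href)
        filename = os.path.join(DOWNLOAD_DIR, os.path.basename(href))
        with open(filename, "wb") as f:
            f.write(requests.get(file_url).content)
        print("Saved", filename)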
In our first video training session, we presented a walkthrough of how to organize USEITI data for use in the open source mapping software QGIS. Fortunately, that dataset included geographic identifiers called Federal Information Processing Standard (FIPS) county codes, five-digit codes identifying counties and county equivalents throughout the United States. However, not every dataset will include a geographic identifier alongside data attributed to a location. Google Refine is a powerful and versatile tool that allows users to clean, manipulate, and transform their data. In this post we will walk through the process of using Google Refine to add geographic coordinates to a dataset.
Step 1 - Download and Install Google Refine
Navigate to the OpenRefine download page, and download Google Refine 2.5 for your operating system. Google Refine operates as a hybrid desktop and web application. When you run Google Refine, a browser window should open automatically and present you with the Google Refine web interface. Despite operating within a web browser window, Google Refine does not require an active internet connection to work. As long as the Google Refine application is running, you can navigate to http://127.0.0.1:3333/ to access the web interface.
Before we move to the next step, take a moment to download the following .csv file. This dataset was downloaded from ResourceProjects.org, and was reduced to only include 2015 projects carried out by Tullow Oil. Google Refine is a powerful piece of software, however, it can quickly get bogged down with very large sets of data. This file was limited to one company for the purposes of this tutorial.
Step 2 - Upload your dataset to Google Refine
To get started, click ‘Create Project’. You will be presented with a number of options for data inputs. We will create a new project using data from ‘this computer.’ Select the file downloaded in the step above, and click ‘Next’ to start uploading the dataset.
Step 3 - Add a new column to fetch location information
With the dataset uploaded, Google Refine will present a preview of the entries. Review the data and headers to make sure everything appears as it should. At the bottom of the window, check that the ‘Parse next’ box is ticked so that the first row is parsed as column headers.
Click the ‘Create Project’ button in the upper right corner to proceed to the main working space of Google Refine. As noted above, we will be adding geographic information to this dataset. To do so, click the triangle in the ‘Paid to’ column and navigate to ‘Edit column’ > ‘Add column by fetching URLs…’
A window will pop up as shown below. Name the column and enter the following text into the ‘Expression’ box. (Click here to learn more about the General Refine Expression Language.)
Click ‘OK’ and the expression will produce a column containing what is essentially the output of a query to the Google Maps application programming interface (API) for each term in the ‘Paid to’ column. This operation will typically take several minutes to complete, depending on the size of the dataset. While you wait for the process to finish, you can experiment to get a better sense of how this function works. Enter the expression we just used, leaving off the last portion, into the address bar of another browser window:
Fill in the name of any location around the world after the “=” and you will see a page with all the relevant location information. This should give you a better sense of what is happening under the hood with the fetching URLs function in Google Refine.
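If it helps to see the same idea in code, here is a rough Python equivalent of what the fetching URLs step does for each row. The geocoding endpoint shown is our assumption about the service behind the expression, and it may require an API key today.

import requests

# Our assumption: the expression queries the Google Maps Geocoding API for each 'Paid to' value.
GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def fetch_location_json(place_name):
    # Returns the raw JSON text that ends up in the new column for this row.
    response = requests.get(GEOCODE_URL, params={"address": place_name})
    return response.text

print(fetch_location_json("Ghana"))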
Step 4 - Add another column to parse the information from the previous step
Once the process has completed, you will see a column filled with a long string of text and numbers.
To clean this up we will add another column that parses that data. Click on the triangle in the new column you created in Step 3 containing all the Google Maps information, and select ‘Edit column’ > ‘Add column based on this column…’ Write in a title for this new column and enter the following text into the ‘Expression’ box:
Click ‘OK’ and the new column will populate with a neat set of latitude and longitude coordinates, separated by a comma, derived from the data in the column we produced in Step 3.
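For readers who want to see what that parsing step amounts to, here is a small Python sketch; the sample JSON below follows the Google Maps Geocoding API layout and uses made-up coordinates.

import json

# Made-up sample of the JSON fetched in Step 3 for one row.
raw = '{"status": "OK", "results": [{"geometry": {"location": {"lat": -1.2833, "lng": 36.8167}}}]}'

data = json.loads(raw)
if data.get("results"):
    location = data["results"][0]["geometry"]["location"]
    # A neat "latitude, longitude" pair, ready for mapping software.
    print(f'{location["lat"]}, {location["lng"]}')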
Step 5 - Export your project
The final step is to click ‘Export’ in the upper right corner of the Google Refine window. Select ‘Comma-separated value’ or ‘Excel’ from the dropdown list of file types.
You can then open the exported file in a desktop application to delete the column containing the unparsed location information while leaving the second column we created that includes the latitude and longitude coordinates. Google Refine is ideal for refining, cleaning and adding to a dataset, but operations like deleting rows and columns should be done in programs like Excel.
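If you would rather stay in Python than open Excel, a few lines of Pandas can drop the unparsed column just as easily; the file and column names below are hypothetical, so adjust them to match your own export.

import pandas as pd

# Hypothetical names: use the file you exported and the column you want to remove.
df = pd.read_csv("tullow_oil_2015_with_coordinates.csv")
df = df.drop(columns=["location_json"])  # the column holding the unparsed Google Maps output
df.to_csv("tullow_oil_2015_clean.csv", index=False)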
While this post demonstrates how latitude and longitude coordinates can be derived from a country name, the exact same process can be carried out for any other location. If instead of country names the dataset contained the names of cities or provinces, the same steps can be used to obtain the latitude and longitude coordinates. Location information can help you to create persuasive maps and other visualizations of your data. To learn more about what can be done with extractives data and mapping, navigate to the training section of Extract-A-Fact.
Within the EITI process, the issue of beneficial ownership has gained momentum. After a successful pilot phase to which 11 countries voluntarily signed up, the 2016 EITI standard now requires all 51 implementing countries to ensure that companies disclose their beneficial owners. These are early days, however, and so far reporting on beneficial ownership is showing significant gaps, as well as a high degree of variance in the information that is disclosed.
So what can you do to find out about the owners of a company if the information is not yet available in an EITI report? And even if it is, how can you verify that the information provided is actually correct? Financial regulators in many countries already require companies to disclose information on their shareholders and subsidiaries, and our corporate filings database Aleph can help you find it. Here is how.
1) Who is controlling Kansanshi Mining PLC?
Let’s have a look at Zambia’s 2014 EITI report. On p.15, we find a chart with the top 5 payments to government by operating company. Combined, they account for 70% of all payments to the government of Zambia. Of special interest is Kansanshi Mining Plc, which alone accounts for 32.86% of the payments. So what is the parent company of Kansanshi Mining Plc? Since we are dealing with the 2014 report, we will try to find a filing from 2014.
The search leads us to First Quantum’s annual information form for 2014, filed to SEDAR, the Canadian securities regulators’ filing system. On page 5, the report includes a hierarchical table of its subsidiaries. Here, we learn that First Quantum has an 80% interest in Kansanshi Mining Plc. Other top players from the EITI report are also included in the list: Kalumbila Mines Ltd and First Quantum Mining and Operations Ltd both belong to First Quantum, which has a 100% interest in each. But to whom does the other 20% of the Kansanshi operation belong? Further down, on p.15, the project has its own section. First Quantum states that the other 20% of Kansanshi is owned by a subsidiary of Zambia’s state-owned ZCCM.
In this case, we have been able to confirm the information provided by the ZEITI report, which includes a list of beneficial ownership structures on p.128.
2) Mopani Copper Mines Plc
Things become more interesting, however, if we look at another company in the Zambia EITI report: Mopani Copper Mines Plc, the third largest contributor to government payments. Trying the exact same search terms in Aleph leads to a miss, so we adjust them:
The new search leads us to reports from mainly two companies: Glencore Plc and Katanga Mining Ltd. Since we are looking for the ultimate owner, Glencore is the more likely candidate, because it is a multinational enterprise. A 2014 report is included, with a listing of ownership structures. On p. 186, we read that Glencore has a 73.1% interest in Mopani Copper Mines Plc. This time, the information provided by the Glencore report does not match ZEITI’s information. According to the latter, Mopani Copper Mines Plc is owned 73.1% by First Quantum, 16.9% by Glencore Xstrata and 10% by a ZCCM subsidiary. Strikingly, we have the exact same figure for the majority owner, namely 73.1%. It looks as if the ZEITI report confused the two companies.
To summarise: using our Aleph database helped validate the figures stated in the EITI report, and in the second case it even helped identify mistakes. In other cases, the database can help fill in missing information about stakeholders. However, it is also important to note that the availability of information depends on the respective financial regulations. Keeping this in mind, the Aleph database proves to be a powerful tool to complement existing research and to support access to publicly available data on intercompany ownership structures.
Of course, all of this is Aleph working through networks of corporate affiliation, so it does not yet lead to the ultimate beneficial owners, who will always be natural persons. We will follow that up in a separate post.