Webscraping isotope data with rvest
Hydrologists use isotope signatures of precipitation water as a tracer to follow water flow pathways in the ground and in rivers. The International Atomic Energy Agency (IAEA) has set up a Global Network of Isotopes in Precipitation (GNIP). We will have a look into this data set in this tutorial series about working with isotope data.
- Theoretical Background
- How is the server set up?
- What is rvest?
- Master the login form
- Extract station list
- Convert to tidy data.frame
- Extract links
- Prepare download
- Download data
- Read csv files
- The short version
- Conclusion
- What can you do next?
The first thing we have to do is to download the data from their website. Of course we want to automate this! We will use a webscraping package called rvest to get this done.
Goals:
- get all the download links for a certain region
- download all the `.csv` files
Challenges:
- master the login page form
- maintain a constant session from the login to the end of the last download
- implement file download which is not explicitly implemented in the rvest package
- do everything in a tidy dataframe
In this post we will give you a short theoretical background as a motivation about why we want to work with this data. Then we will show you how to download the data with rvest step by step.
In follow-up tutorials linked to this tutorial we will show you:
- how to derive the Local Meteoric Water Line (LMWL) for all stations with linear regression
- how to show the variation of parameters on an interactive map
- compute correlations between the isotope signatures and parameters like longitude, latitude, elevation and climatic variables
Theoretical Background
Water contains not only the usual \(H\) and \(O\) atoms. It also contains small amounts of the slightly heavier isotopes \(^2H\) and \(^{18}O\). Isotopes are atoms whose nucleus contains one or more additional neutrons compared to the most abundant variety of the atom.
Because of their slightly heavier mass (compare the table below), they behave differently from the abundant \(H\) and \(O\) isotopes. This shows mainly when the water evaporates or condenses: the heavier isotopes are slightly less likely to evaporate and a bit more likely to condense before the lighter isotopes. Hence water that stems from evaporation (like clouds generally do) is depleted in the heavy isotopes (compared to the water remaining in the ocean). When clouds rain off, the water remaining in the cloud gets even more depleted of the heavy isotopes, since these condense and rain down first.
isotope | name | neutrons | natural abundance* | molecular weight | stability
---|---|---|---|---|---
\(^1H\) | Protium | 0 | \(99.988\, \%\) | \(1.0078\, u\) | stable
\(^2H\) | Deuterium | 1 | \(0.012\, \%\) | \(2.0141\, u\) | stable
\(^3H\) | Tritium | 2 | \(10^{−15}\, \%\) | \(3.0160\, u\) | decays
\(^{16}O\) | | 8 | \(99.76\,\%\) | \(15.994\, u\) | stable
\(^{17}O\) | | 9 | \(0.04\,\%\) | \(16.999\, u\) | stable
\(^{18}O\) | | 10 | \(0.20\,\%\) | \(17.999\, u\) | stable
* abundance varies slightly in different water bodies (ocean, groundwater, river, cloud)
Source: https://de.wikipedia.org/wiki/Wasserstoff#Protium, http://www.wolframalpha.com/input/?i=16O+isotope
Due to this fractionation of the isotopes, researchers can measure a difference in the isotope “fingerprint” of rainwater and water on the ground. The atmospheric fingerprint varies a lot, depending on the temperature and humidity conditions when the water evaporated and during the time it traveled through the atmosphere.
Water on the ground is a more stable mix of isotopes, while every rain event brings in a new pulse of water with a different isotope signature. Hence the \(H\) and \(O\) isotopes can be utilized as a natural tracer to visualize the flow pathways the water takes after raining down to the ground. Following the water with the isotope signature of the last rainfall shows researchers where the precipitation water ended up. Measuring a mixed signature between the ground water and the precipitation water means that the formerly stored water and the fresh water have mixed. Since isotopes don’t chemically react, decay or get bound, they are also a very useful tracer. In the end they still are water and behave like water in most ways.
The International Atomic Energy Agency (IAEA) has set up a Global Network of Isotopes in Precipitation (GNIP). They maintain a database where measurements from different institutions from all over the world are gathered together and made available for everyone to use. The database access system is called WISER (Water Isotope System for Data Analysis, Visualization, and Electronic Retrieval). You have to register at websso.iaea.org, but then it is free to use for everyone (they just ask for your name, that's it).
They also provide a similar database for isotope signatures in rivers, the Global Network of Isotopes in Rivers (GNIR).
How is the server set up?
Generally the database can be accessed with the following link: https://nucleus.iaea.org/wiser/gnip.php?ll_latlon=&ur_latlon=&country=&wmo_region=&date_start=1953&date_end=2016&iso_o18=on&iso_h2=on&result_start=0&result_end=1000&action=Search
This page is only accessible for registered users, hence if you are not logged in you get redirected to the login page. If you haven't done so yet, register to create an account.
If you visit the page and fiddle around with the form to select the region, the isotopes and other options, you can see how the url changes with these options. It follows the usual query syntax of an HTTP GET request:
- You have `country=` for selecting the country (for example `country=Germany`)
- `wmo_region=` which you can give a number (like `wmo_region=6` for the European WMO region)
- with `date_start=` and `date_end=` you can define the date range. Just give it a year
- `iso_o18=on`, `iso_h2=on` and `iso_h3=on` define which isotopes have to have been measured at the stations you want to select
- `result_start=0` and `result_end=1000` define how many entries are shown on one page. You have the option to set `result_end` to 10, 20, 30 or 1000 (the 1000 corresponds to the “All” button on the page). Setting it to 1000 comes in handy to download all available files (although if you select the whole world you end up with even more than a thousand stations).
What we are looking for are the \(^{18}O\) and \(^2H\) isotopes. So we set `iso_o18=on&iso_h2=on` and leave `iso_h3` out. We ask for stations of the whole world by setting `wmo_region=` and leaving it without any number. At the end we have to add `action=Search` to initiate the search for stations. The url above is already set up with these options.
What is rvest?
rvest is a package for webscraping, that is, extracting content from web pages. It is not only about downloading files from the internet, but about crawling through web pages, finding certain links inside a page and following them to the next page and so on. You can select elements of the page by their DOM (their html structure), their id or their class selectors. Then you can not only extract the text of the page, but also html attributes like the `href=` of a link or the `src=` of an image (which contain the url of the link and the image respectively).
Let’s go ahead and install rvest:
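```r
# install rvest from CRAN (only needed once)
install.packages("rvest")
```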
After installing we have to load it:
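```r
library(rvest)
```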
Master the login form
To be able to access the data or even display the station list, we have to be logged in. After logging in we have to maintain this login information when going to different pages. In rvest this is done by establishing a session and passing the session info to the next commands.
If you try to surf to the above mentioned link without being logged in, you find yourself being redirected to the login page. Note how the url of that page contains a redirect back to the original link. So after you log in you get redirected to the page you originally requested.
With rvest we will establish a session with the link to the login page:
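A sketch of how this could look (the variable name `nucl` is our choice; in rvest ≥ 1.0 `html_session()` is called `session()`):

```r
url <- "https://nucleus.iaea.org/wiser/gnip.php?ll_latlon=&ur_latlon=&country=&wmo_region=&date_start=1953&date_end=2016&iso_o18=on&iso_h2=on&result_start=0&result_end=1000&action=Search"

# start a session; since we are not logged in yet, we end up on the login page
nucl <- html_session(url)
```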
We then have to fill in the login form.
First we extract the form:
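Assuming the login page contains just the one login form, we can take the first form on the page:

```r
login_form <- html_form(nucl)[[1]]
```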
Let’s have a look:
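```r
# print the parsed form; the exact output depends on the page
login_form
```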
You see one input field for `'User'` and one for `'PASSWORD'`.
We then fill these out:
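A sketch with `set_values()` (in rvest ≥ 1.0 this is `html_form_set()`), taking the field names `User` and `PASSWORD` from the form printed above:

```r
login_form <- set_values(login_form,
                         User = username,
                         PASSWORD = password)
```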
Of course you first have to store a variable with username and the password:
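For example (replace these with your own WISER credentials):

```r
username <- "your_username"
password <- "your_password"
```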
Next we do something we learned from this stackoverflow answer to prevent a (cryptic) error when submitting the form:
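We can't reproduce the exact fix from that answer here. One workaround that is often suggested when `submit_form()` fails because the parsed form lacks a proper submit button is to add a dummy one, something like:

```r
# purely an assumption about which workaround is meant here: add a dummy
# submit button so that submit_form() finds one
fake_submit <- structure(
  list(name = "submit", type = "submit", value = "Login",
       checked = NULL, disabled = NULL, readonly = NULL, required = FALSE),
  class = "input"
)
login_form$fields[["submit"]] <- fake_submit
```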
Now we are ready to submit the form:
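`nucl2` is the name used in the rest of this post (with rvest ≥ 1.0 the function is `session_submit()`):

```r
nucl2 <- submit_form(nucl, login_form)
```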
We save a new session with the return of the form submission. The return is a redirect to the original page (the page with the station list). We now are logged in with a cookie stored in our session and are redirected to the page we originally intended to view.
Extract station list
With the session stored under `nucl2` we now can extract the data from the table displayed on the page.
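One way to do this (`stations` as the name of the resulting table is our choice):

```r
# parse all tables on the page and keep the one holding the station list
stations <- nucl2 %>%
  html_nodes("table") %>%
  html_table()
stations <- stations[[3]]
```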
We use `[[3]]` to extract the third table.
Convert to tidy data.frame
We want to work with a tidy data.frame (also called a `tibble`). A `data.frame` stores observations (the rows) in variables (the columns). Each column can have a different data type (like integer, character, logical, etc.). Have a look at `help(typeof)` to learn more about object types. What the rows of a column are traditionally not allowed to contain are other object classes like `list` or `matrix`. Consult `help(class)` to learn the difference between an object class and the object type.
Therefore traditionally everyone tended to introduce a new variable (for example `data <- list()`) to store data that doesn't fit into the data.frame. The problem is, there is no direct link between the data.frame (let's say it is called `stations`) and the data stored in the list called `data`. If we change one of the objects (for example we drop some rows of the data.frame), the other object doesn't change. So later on we cannot link the rows of the table to the elements of the list.
With a tibble we can store all the data we will later download in a column of the station list. This way we maintain a constant link between the data and the station list.
Let's convert the station data.frame to a tibble. The package `tibble` is automatically loaded when loading the package `dplyr` (which contains additional useful functions like `mutate` for working with tidy data).
Convert the station list:
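Something like this, assuming the table is stored in `stations`:

```r
library(dplyr)

stations <- as_tibble(stations)
```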
If we display the tibble you can see the difference from a data.frame:
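```r
# printing a tibble shows the dimensions and the column types in the header
stations
```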
With the `mutate()` function we can introduce new columns. Here we replace the column `WMO Code`:
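A sketch of how this could look; that the codes were parsed as numbers and that they are seven digits wide are both assumptions:

```r
stations <- stations %>%
  mutate(`WMO Code` = formatC(`WMO Code`, width = 7, format = "d", flag = "0"))
```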
With this we add back the leading 0s of the WMO Code that were stripped in the table read process.
Extract links
We now extract all links from the page:
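For example (`links` is our variable name):

```r
links <- nucl2 %>% html_nodes(css = "a")
```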
With the part `css="a"` we extract all html `<a></a>` elements (the syntax for inserting links into html pages).
The following shows the links in one row of the station list:
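Purely as an illustration, you could inspect a few of the extracted link nodes (which indices belong to one station row depends on the page):

```r
links[1:6]
```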
Of course we only want the links referencing a csv file. All csv links have “csv” in the link text. So we make a logical vector indicating which of the links contain the keyword “csv”:
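For example with `grepl()` on the link texts:

```r
is_csv_link <- grepl("csv", html_text(links))
```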
Now we extract the `href` attribute of all links linking to a csv file (note how we use `is_csv_link` here):
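```r
# keep only the csv links and pull out their href attribute
csv_links <- links[is_csv_link] %>% html_attr("href")
```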
We add the extracted links to the station list with `mutate`:
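Assuming there is exactly one csv link per station row, this could look like:

```r
stations <- stations %>%
  mutate(csv_links = csv_links)
```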
Prepare download
First we create a download folder:
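For example (the folder name `downloads` is our choice):

```r
dir.create("downloads", showWarnings = FALSE)
```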
For the download we need a destination file name for every csv file. We will use the station WMO code for this:
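A sketch; the column name `file_name` is our choice:

```r
stations <- stations %>%
  mutate(file_name = file.path("downloads", paste0(`WMO Code`, ".csv")))
```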
Download data
We now are ready to download the data. We use a loop over all the `csv_links`.
Because we can't pass the session info to `download.file()` we will use a different approach to download the files:
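A sketch of such a loop (with rvest ≥ 1.0, `jump_to()` is called `session_jump_to()`):

```r
for (i in seq_along(stations$csv_links)) {
  # follow the csv link within the logged-in session
  csv_page <- jump_to(nucl2, stations$csv_links[i])
  # write the raw response body to the destination file
  writeBin(httr::content(csv_page$response, as = "raw"),
           stations$file_name[i])
  # be polite to the server
  Sys.sleep(1)
}
```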
With the command `jump_to()` we maintain the session while following the links. We then write the content of the loaded link to a file with `writeBin()` (the idea came from this stackoverflow answer).
We set `Sys.sleep(1)` to prevent sending too many requests to the server in a short time. Some servers deny service when we send requests in sequence too fast.
Read csv files
Finally we want to add the station data to our tidy data.frame. For reading we use the package `readr`, which reads csv files a lot faster than base R and reads them directly into a tibble. An advantage of the latter is, for example, that column names are preserved in the format they have in the csv file. We also need the function `map` of the `purrr` package. `map` is an apply function that can be executed on the elements of a vector. Install these packages with `install.packages("readr")` and `install.packages("purrr")`.
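A sketch, assuming the downloaded files parse with the default settings of `read_csv()`:

```r
library(readr)
library(purrr)

# read every downloaded file into a list-column called `data`
stations <- stations %>%
  mutate(data = map(file_name, read_csv))
```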
To check what we have done just now we display the data for the first station:
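```r
# the first element of the list-column holds the data of the first station
stations$data[[1]]
```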
Finally, have a look at the “tidy data.frame” we ended up with:
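```r
# the tibble now carries the download link, the file path and the data itself
stations
```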
The short version
Using the pipe operator `%>%` very often, you can get rid of a lot of the temporary variables and compact the code quite a lot:
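Below is a sketch of how the compacted version could look, using the same assumptions (variable names, folder name, form field names) as above; the form-submission workaround and the zero-padding of the WMO codes are left out for brevity:

```r
library(rvest)
library(dplyr)
library(purrr)
library(readr)

username <- "your_username"
password <- "your_password"
url <- "https://nucleus.iaea.org/wiser/gnip.php?ll_latlon=&ur_latlon=&country=&wmo_region=&date_start=1953&date_end=2016&iso_o18=on&iso_h2=on&result_start=0&result_end=1000&action=Search"

# log in and keep the session
nucl <- html_session(url)
login_form <- html_form(nucl)[[1]] %>%
  set_values(User = username, PASSWORD = password)
nucl2 <- submit_form(nucl, login_form)

# collect the csv download links
links <- nucl2 %>% html_nodes("a")
csv_links <- links[grepl("csv", html_text(links))] %>% html_attr("href")

# build the tidy station table with links and destination file names
dir.create("downloads", showWarnings = FALSE)
stations <- (nucl2 %>% html_nodes("table") %>% html_table())[[3]] %>%
  as_tibble() %>%
  mutate(csv_links = csv_links,
         file_name = file.path("downloads", paste0(`WMO Code`, ".csv")))

# download within the session and read everything into a list-column
for (i in seq_along(stations$csv_links)) {
  csv_page <- jump_to(nucl2, stations$csv_links[i])
  writeBin(httr::content(csv_page$response, as = "raw"), stations$file_name[i])
  Sys.sleep(1)
}
stations <- stations %>% mutate(data = map(file_name, read_csv))
```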
Conclusion
We have now downloaded all the data and added it to our tidy data.frame.
In follow-up tutorials we will show you how to:
- derive the Local Meteoric Water Line (LMWL) for all stations with linear regression
- show the variation of parameters on an interactive map
- compute correlations between the isotope signatures and parameters like longitude, latitude, elevation and climatic variables
What can you do next?
Try to download data from the Global Network of Isotopes in Rivers (GNIR).