Seminar at King’s College London

On 12 February 2016 I gave a seminar on “why and how open data and open APIs can improve research” for the Environmental Dynamics research group seminar series at the Department of Geography of King’s College London (slides available here). The audience was made up of students, researchers and academics, all interested and engaged, which made the experience rather enjoyable for me. More information on this seminar series is on the King’s geocomputation blog.

During my seminar I showed how to assemble data requests using the National River Flow Archive’s RESTful APIs, and how to parse and convert the server responses using the R language and the rnrfa package. Here are links to the demos and to a web mapping/reporting application I built using the rnrfa package as a backend tool. If you are interested, the source code of the web app is available as an Rmd file on GitHub.
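As a flavour of what was shown in the demos, here is a minimal sketch of a typical rnrfa workflow. I am assuming the function names catalogue() and gdf() of a recent rnrfa release and an arbitrarily chosen station, so treat it as illustrative rather than a verbatim excerpt from the demo code:

    library(rnrfa)

    # Get the catalogue of NRFA gauging stations (metadata returned by the API)
    stations <- catalogue()

    # Download the gauged daily flow for the first station in the catalogue
    # (the name of the id column may differ between package versions)
    flow <- gdf(id = stations$id[1])
    plot(flow)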

I mentioned that the NRFA APIs are experimental and that periodic updates may temporarily break the package. An API update was deployed on 18 February 2016; the package was updated accordingly and version 0.4.3 is now available on CRAN and GitHub.


Generate a Latin Hypercube of parameters for a given FUSE model

In the previous post I showed how to retrieve information on the model building options and the parameters used by a given FUSE model. In this post I’ll show how to sample, for a given model, 100 parameter sets using the Latin Hypercube Sampling method.

Each FUSE model uses a different subset of parameters; therefore, in order to sample uniformly, we first need to remove the unused parameters and then sample the remaining ones. Here is how I achieve this.

Install/load the fuse package:

    install.packages("fuse", repos="http://R-Forge.R-project.org")
    library(fuse)

Now load devtools and source the gist below (it contains a function called GenerateFUSEParameters):

    require(devtools)
    devtools::source_gist("https://gist.github.com/cvitolo/9574a0cef0bf4f2cd28b")

Choose one of FUSE’s models and get model/parameters info using the FUSEinfo function described in the previous post:

    mid <- 60 # This is TOPMODEL
    x <- FUSEinfo(mid)

A Latin Hypercube of 100 parameter sets for the above model (mid = 60) can now be generated as follows:

    parameters <- GenerateFUSEParameters(NumberOfRuns = 100, 
                                         params2remove = names(x)[which(x==FALSE)])
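As background, this is roughly what a Latin Hypercube step looks like in plain R using the lhs package. I have not checked whether GenerateFUSEParameters works exactly this way internally, and the two parameter names and ranges below are purely illustrative:

    library(lhs)

    # Illustrative ranges for two hypothetical parameters
    ranges <- data.frame(min = c(0.001, 50), max = c(1000, 5000),
                         row.names = c("param_a", "param_b"))

    # Draw 100 points in the unit hypercube (one column per parameter),
    # then rescale each column to its own range
    unit_cube <- randomLHS(n = 100, k = nrow(ranges))
    sampled   <- sweep(unit_cube, 2, ranges$max - ranges$min, "*")
    sampled   <- sweep(sampled, 2, ranges$min, "+")
    colnames(sampled) <- rownames(ranges)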

High Performance Computing Service – Part 3: Transfer your files

At this stage, all the files are ready.

Now let’s organize the file stores by adding a few subdirectories, then transfer the files from my desktop computer to the HPC server.

1. ON THE HPC: I have organized the stores as follows:

 mkdir $HOME/R              #for the custom R library and package tarballs
 mkdir $HOME/scripts        #for my R scripts
 mkdir $WORK/inputs 
 mkdir $WORK/outputs 
 mkdir $WORK/outputs/Rdata 
 mkdir $WORK/outputs/Rout

2. FROM MY DESKTOP: transfer the files created in Part 2 to the HPC server (the remote paths are single-quoted so that $HOME and $WORK are expanded on the HPC rather than on my desktop).

 scp topmodel.tar.gz login.cx1.hpc.ic.ac.uk:'$HOME/R'        #package tarballs to install on the HPC
 scp fuse_1.0.tar.gz login.cx1.hpc.ic.ac.uk:'$HOME/R'
 scp myroutine*.R login.cx1.hpc.ic.ac.uk:'$HOME/scripts'     #the R routines prepared in Part 2
 scp data_C*.rda login.cx1.hpc.ic.ac.uk:'$WORK/inputs'       #the input files prepared in Part 2
 scp batch.sh login.cx1.hpc.ic.ac.uk:'$HOME'                 #the batch script that sets up the runs

3. ON THE HPC: INSTALL PACKAGES AND ANY OTHER APPLICATIONS THAT MAY BE NEEDED

 module load R/2.15 intel-suite/2012
 R CMD INSTALL $HOME/R/topmodel.tar.gz                       #install the packages transferred above
 R CMD INSTALL $HOME/R/fuse_1.0.tar.gz                       #(alternatively, install them from CRAN)

4. ON THE HPC: RUN THE BATCH FILE

 qsub batch.sh

For examples of the files mentioned above, see Part 2.

For information on using the HPC service see http://www.hpc.ic.ac.uk

High Performance Computing Service – Part 2: Get your files ready

To use the HPC service I need:

  1. input files (e.g. data.rda)
  2. some routines (e.g. myroutine1.R, myroutine2.R, myroutine3.R)
  3. a batch script

The input files are:

  • data_C1.rda
  • data_C2.rda
  • data_C3.rda

Each of the above contains the following objects:

  • Classes of Topographic Index (topidxclasses),
  • Delay function (delay),
  • Precipitation time series (rain),
  • Evapotranspiration time series (ET0),
  • Observed streamflow discharge time series (Qobs).

For more info on the above input files, read the help files of the topmodel package (?topmodel).
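In case it is useful, this is roughly how one of those .rda files can be put together and read back. The object names follow the bullet list above; how you actually build them (e.g. with topmodel’s topidx() and make.classes() plus your own time series) depends on your catchment data, so treat this as a sketch:

# Hypothetical example: bundle the five objects needed for one catchment
# into a single .rda file (topidxclasses, delay, rain, ET0 and Qobs
# are assumed to have been created beforehand)
save(topidxclasses, delay, rain, ET0, Qobs, file = "data_C1.rda")

# The objects can later be restored, e.g. inside myroutine1.R, with:
load("data_C1.rda")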

Let’s prepare the R routines:

# myroutine1.R

# This routine runs a hydrological model called "topmodel" 10,000 times using parameter sets
# randomly sampled from uniform distributions, using data from catchment "C1".
# The result is a vector containing Nash-Sutcliffe efficiencies.

library(Hmisc)
library(topmodel)
# $WORK is not expanded by R, so it is read from the environment explicitly
load(file.path(Sys.getenv("WORK"), "inputs", "data_C1.rda"))

runs<-10000
vch   <- 1000
dt    <- 1
Srmax <- runif(runs, min=0, max=2)
td    <- runif(runs, min=0, max=3)
k0    <- runif(runs, min=0, max=0.01)
CD    <- runif(runs, min=0, max=5)
qs0   <- runif(runs, min=3e-5, max=5e-5)
lnTe  <- runif(runs, min=-4, max=0)
m     <- runif(runs, min=0, max=0.2)
Sr0   <- runif(runs, min=0, max=0.02)
vr    <- runif(runs, min=100, max=1500)

param.set<-cbind(qs0,lnTe,m,Sr0,Srmax,td,vch,vr,k0,CD,dt)
## returns an array of (runs) Nash Sutcliffe efficiencies; one for each parameter set:
eff1<-topmodel(param.set, topidxclasses, delay, rain, ET0, Qobs = Qobs)

# myroutine2.R

# This routine runs a hydrological model called "topmodel" 10,000 times using parameter sets
# randomly sampled from uniform distributions, using data from catchment "C2".
# The result is a vector containing Nash-Sutcliffe efficiencies.

library(Hmisc)
library(topmodel)
# $WORK is not expanded by R, so it is read from the environment explicitly
load(file.path(Sys.getenv("WORK"), "inputs", "data_C2.rda"))

runs<-10000
vch   <- 1000
dt    <- 1
Srmax <- runif(runs, min=0, max=2)
td    <- runif(runs, min=0, max=3)
k0    <- runif(runs, min=0, max=0.01)
CD    <- runif(runs, min=0, max=5)
qs0   <- runif(runs, min=3e-5, max=5e-5)
lnTe  <- runif(runs, min=-4, max=0)
m     <- runif(runs, min=0, max=0.2)
Sr0   <- runif(runs, min=0, max=0.02)
vr    <- runif(runs, min=100, max=1500)

param.set<-cbind(qs0,lnTe,m,Sr0,Srmax,td,vch,vr,k0,CD,dt)
## returns an array of (runs) Nash Sutcliffe efficiencies; one for each parameter set:
eff2<-topmodel(param.set, topidxclasses, delay, rain, ET0, Qobs = Qobs)

# myroutine3.R

# This routine runs a hydrological model called "topmodel" 10,000 times using parameter sets
# randomly sampled from uniform distributions, using data from catchment "C3".
# The result is a vector containing Nash-Sutcliffe efficiencies.

library(Hmisc)
library(topmodel)
# $WORK is not expanded by R, so it is read from the environment explicitly
load(file.path(Sys.getenv("WORK"), "inputs", "data_C3.rda"))

runs<-10000
vch   <- 1000
dt    <- 1
Srmax <- runif(runs, min=0, max=2)
td    <- runif(runs, min=0, max=3)
k0    <- runif(runs, min=0, max=0.01)
CD    <- runif(runs, min=0, max=5)
qs0   <- runif(runs, min=3e-5, max=5e-5)
lnTe  <- runif(runs, min=-4, max=0)
m     <- runif(runs, min=0, max=0.2)
Sr0   <- runif(runs, min=0, max=0.02)
vr    <- runif(runs, min=100, max=1500)

param.set<-cbind(qs0,lnTe,m,Sr0,Srmax,td,vch,vr,k0,CD,dt)
## returns an array of (runs) Nash Sutcliffe efficiencies; one for each parameter set:
eff3<-topmodel(param.set, topidxclasses, delay, rain, ET0, Qobs = Qobs)
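Note that the routines above only keep the efficiency vectors in memory. To make sure they end up in the $WORK/outputs/Rdata store created in Part 3, each routine can finish with an explicit save(); a sketch for myroutine1.R (the output file name is just an example):

# Write the Nash-Sutcliffe efficiencies to the Rdata output store,
# so they are kept after the batch job has finished
save(eff1, file = file.path(Sys.getenv("WORK"), "outputs", "Rdata", "eff_C1.rda"))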

Let’s prepare the batch file:

The batch file contains the following important information:

  • the maximum expected running time (walltime, on the second line)
  • the number of CPUs to use for each job (ncpus, on the third line)
  • the memory to allocate to the job (mem, on the third line)
  • the modules to load (I’ll use the module that provides R version 2.15)
  • the list of routines to run and the files where any printed messages (e.g. errors) are written

# batch.sh

#!/bin/sh  
#PBS -l walltime=50:00:00 
#PBS -l select=1:ncpus=1:mem=600mb

module load R/2.15 intel-suite/2012
R CMD BATCH --slave $HOME/scripts/myroutine1.R $WORK/outputs/Rout/C1.Rout
R CMD BATCH --slave $HOME/scripts/myroutine2.R $WORK/outputs/Rout/C2.Rout
R CMD BATCH --slave $HOME/scripts/myroutine3.R $WORK/outputs/Rout/C3.Rout

Set up and test PyWPS using “curl”

Install the curl command-line tool to test the service from the terminal:

sudo aptitude install curl

Here are some example requests (run from the terminal):

 curl "http://localhost/pywps/pywps.cgi?service=wps&version=1.0.0&request=getcapabilities"
 curl "http://localhost/pywps/pywps.cgi?service=wps&version=1.0.0&request=describeprocess
                                       &identifier=dummyprocess"
 curl "http://localhost/pywps/pywps.cgi?service=wps&version=1.0.0&request=execute
                                       &identifier=dummyprocess&datainputs=\[input1=10;input2=10\]"

SSH without password

Tired of typing a password every time you access a remote machine? Use a key!

Let’s say you are user1 and are using machineA.

You want to access machineB as user2.

Generate a key pair:

user1@machineA:~> ssh-keygen -t rsa

Create the .ssh directory on machineB (if it does not already exist), then append your public key to user2’s authorized_keys:

user1@machineA:~> ssh user2@machineB mkdir -p .ssh
                  enter user2@machineB's password:
user1@machineA:~> cat .ssh/id_rsa.pub | ssh user2@machineB 'cat >> .ssh/authorized_keys'

If the SSH port is not the standard 22 and you use an IP address rather than a hostname, the last command becomes:

user1@machineA:~> cat .ssh/id_rsa.pub | ssh -p 0000 user2@ip.address 'cat >> .ssh/authorized_keys'
                  enter user2@machineB's password:            (one last time, I promise!)

Now test that you can log in without being asked for a password:

user1@machineA:~> ssh user2@machineB hostname
                  or
user1@machineA:~> ssh -p 0000 user2@ip.address

More info here

High Performance Computing Service – Part 1: Intro

Imperial College London, like many other universities, provides an excellent High Performance Computing (HPC) service for its staff and students.

It’s like a private cloud with thousands of processors, which allows you to run highly demanding computational jobs. The HPC service is particularly suitable for code that can be parallelised. Many modules and libraries are already installed, and you can use your own routines if they are written in a common programming/scripting language; for instance, I used my R code.

Here is a short tutorial on how to use the HPC service.

I first contacted the HPC Service Manager to activate an account, with login credentials matching my college ones (in the examples below my username is “user”).

Once the account is active, the system automatically creates a key pair for easy remote access (for more info read my previous post here) and the following file stores become available:

  • $HOME = /home/user (10GB, intended for storing binaries, source code and modest amounts of data)
  • $WORK = /work/user (at least 150GB, intended for staging files between jobs and for long-term data storage)
  • $TMPDIR = /tmp (to be used only for temporary results)

Always refer to them using the corresponding environment variables.
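Since my jobs are written in R, it is worth noting that R does not expand $WORK or $HOME inside a path string. Here is a small sketch of how I build paths from the environment variables instead (the file name is just an example):

 # Read the HPC environment variables rather than hard-coding paths
 work_dir <- Sys.getenv("WORK")
 home_dir <- Sys.getenv("HOME")

 # e.g. load an input file stored under $WORK (hypothetical file name)
 load(file.path(work_dir, "inputs", "data_C1.rda"))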

Next steps:

  • prepare the scripts I’ll use to run my computations (more info in Part 2);
  • copy the necessary input files, scripts and R packages (if not available online) to the server (see Part 3).