class: center, middle, inverse, title-slide

# Using R on Migale
## A short trip through purrr, future and furrr
### Mahendra Mariadassou
### INRAE - MaIAGE, Migale Team
### 2020-03-12 (updated: 2020-03-14)

---
class: inverse, middle, center

# Welcome to the [Migale platform!](https://migale.inra.fr/)

---
background-image: url(img/migale.png)
background-size: contain

---
class: middle

# Prerequisites:

- ## An account<sup>1</sup>: request one using the [form](https://migale.inra.fr/ask-account)
- ## Working knowledge of Unix<sup>2</sup>

.footnote[
[1] Requires an academic e-mail address

[2] See [here](http://genome.jouy.inra.fr/~orue/tuto-unix.html) for a simple intro
]

---
class: center, middle

# Architecture

## Migale is the **front node**

--

## which connects you to the [computer farm](https://migale.inra.fr/cluster)

--

### Tens of **nodes** (~ machines) representing hundreds of **cores** (~ CPUs)

---
class: inverse, center, middle

# Level 1: RStudio server

---
class: middle, center

# Connect to the Migale [RStudio server](https://rstudio.migale.inrae.fr/)

### Available at [https://rstudio.migale.inrae.fr/](https://rstudio.migale.inrae.fr/)

--

### Super easy ☺️ but runs on a Virtual Machine 😞

---
class: inverse, center, middle

# Level 2: bash mode

---

# Prepare your R script

```r
create  <- function(mu = 0) { rnorm(100, mean = mu) }
analyze <- function(x) { mean(x) }
results <- numeric(100)
for (i in 1:100) {
  results[i] <- analyze(create())
}
saveRDS(results, "results.rds")
```

```bash
cat scripts/my_script_1.R
```

```
## create  <- function(mu = 0) { rnorm(100, mean = mu) }
## analyze <- function(x) { mean(x) }
## results <- numeric(100)
## for (i in 1:100) {
##   results[i] <- analyze(create())
## }
## saveRDS(results, "results.rds")
```

---

# Running your script

.pull-left[
1. Transfer scripts and data to migale<sup>1</sup>
2. Connect to migale
3. Run the script
4. Work on the results
]

.pull-right[
```bash
scp -r . mmariadasso@migale:SOTR/
```

```bash
ssh mmariadasso@migale
```

```bash
Rscript my_script_1.R
```
]

--

To make life easier, add

```bash
Host migale
  User mmariadasso
  HostName migale.jouy.inra.fr
```

to your `~/.ssh/config` file.

--

### Quite easy ☺️ but uses only the front node 😞

.footnote[[1] You may need to expand `migale` to `migale.jouy.inra.fr`]

---
class: inverse, center, middle

# Level 3: Using SGE

---

# About SGE

**S**un **G**rid **E**ngine is a *job scheduler*: you submit many jobs on the front node and SGE dispatches them to the computer farm.

A longer introduction to SGE can be found [here](https://migale.inra.fr/sge), but here is a simple example:

```bash
RSCRIPT=/usr/local/public/R/bin/Rscript
RJOB="my_script_1.R"
qsub -S $RSCRIPT -q short.q -cwd -V -M me@mail.com -m bea $RJOB
```

`RSCRIPT` and `RJOB` are *environment variables* and are expanded in the final call. Here we need to specify `-S $RSCRIPT` to make sure that the instructions in `my_script_1.R` are executed by R.

---

# SGE options

Let's unpack the options:

- `-cwd` runs the job in the current working directory
- `-V` passes all environment variables to the job
- `-N <jobname>` names the job; this is the name you will see when you check the status of your jobs with `qstat`
- `-b y` allows the command to be a binary file instead of a script

Other useful options, combined in the sketch below, are:

- `-q <queue>` sets the queue; see [here](https://migale.inra.fr/cluster) to choose a queue (short / long / bigmem / etc.) adapted to your needs
- `-pe thread <n_slots>` specifies the parallel environment: `thread` runs a parallel job using shared memory with `<n_slots>` cores
- `-R y` reserves resources for the job as soon as they become free
- `-o <output_logfile>` names the output log file
- `-e <error_logfile>` names the error log file
- `-m bea` sends an email when the job **b**egins, **e**nds or **a**borts
- `-M <emailaddress>` sets the email address notifications are sent to
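
Putting several of these options together gives a call like the following minimal sketch (the job name, log file names and email address are placeholders to adapt):

```bash
RSCRIPT=/usr/local/public/R/bin/Rscript
qsub -S $RSCRIPT -q short.q -cwd -V \
     -N my_job -o my_job.out -e my_job.err \
     -M me@mail.com -m bea \
     my_script_1.R
```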

---

# Leveraging the computer farm (I)

We're still only using one node at a time!

### Decompose your script and pass arguments

```bash
cat scripts/my_script_2.R
```

```r
## Arguments
args <- commandArgs(trailingOnly = TRUE)
id   <- as.integer(args[1])

## Computations
create  <- function(mu = 0) { rnorm(100, mean = mu) }
analyze <- function(x) { mean(x) }
result  <- analyze(create())

## Results
saveRDS(object = result, file = paste0("result_", id, ".rds"))
```

---

# Leveraging the computer farm (II)

### Use `qsub` repeatedly

```bash
RSCRIPT="/usr/local/public/R/bin/Rscript"
RJOB="my_script_2.R"
QSUB="qsub -S $RSCRIPT -q short.q -cwd -V -M me@mail.com -m bea"
seq 1 100 | xargs -I {} $QSUB $RJOB {}
```

This is equivalent to

```bash
$QSUB $RJOB 1
...
$QSUB $RJOB 100
```

### Aggregate all the results at the end

```r
results <- numeric(100)
for (i in 1:100) results[i] <- readRDS(paste0("result_", i, ".rds"))
```

---

# Monitoring your jobs

Use `qstat` on Migale to monitor the state of your jobs: `qw` (waiting), `Eqw` (error), `t` (transferring), `r` (running).
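
For example (a minimal sketch, assuming your login is `mmariadasso`; the job id passed to `qdel` is a placeholder taken from the `qstat` output):

```bash
qstat -u mmariadasso   # list your jobs and their states
qdel 123456            # cancel job 123456 if something went wrong
```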

# Some pros and cons

➕ Quite easy if you want to parallelize loops (simulations)

➕ Uses many machines (not just many cores on a single machine)

➖ Back and forth between `R` and `bash`

➖ Not perfect for numerical experiments (machines are not perfectly similar)

---
class: inverse, center, middle

# Level 4: Using `future`

---

# The `future` and `future.batchtools` packages

1. `future` allows you to call SGE directly from R and removes the back and forth
2. `future` is quite general and can handle many back-ends
3. You need to specify the back-end with `plan`. Here are some examples:

```r
library(future)
library(future.batchtools)
plan(sequential)     ## R as usual
plan(multiprocess)   ## Use many cores on the same machine
plan(batchtools_sge) ## Use SGE via the future.batchtools package
```

But first you need to set up a **configuration file**.

---

```bash
cat ~/.batchtools.sge.tmpl ## on Migale
```

.smaller[
```bash
#!/bin/bash

## The name of the job
#$ -N <%= job.name %>

## Combining output/error messages into one file
#$ -j y

## Giving the name of the output log file
#$ -o <%= log.file %>

## Execute the script in the working directory
#$ -cwd

## Use environment variables
#$ -V

## Use multithreading
#$ -pe threads <%= resources$threads %>

## Use correct queue
#$ -q <%= resources$queue %>

## Export value of the DEBUGME environment variable to the worker
export DEBUGME=<%= Sys.getenv("DEBUGME") %>

<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>

Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
exit 0
```
]

---

# Time to make a plan

```r
library(future.batchtools)
plan(batchtools_sge,
     workers = 10,                        ## maximum number of parallel jobs
                                          ## (unlimited by default)
     template = "~/.batchtools.sge.tmpl", ## SGE template; not needed here
                                          ## since it is read by default
     resources = list(                    ## parameters set on the fly
       queue   = "short.q",               ## queue to use
       threads = 1                        ## number of cores per job
     )
)
```

---

# Working with future

A simple **drop-in** replacement in most scripts:

- replace `vector` and/or `list` with `listenv`
- replace `<-` with `%<-%`
- transform the `listenv` back to a `list` and/or `vector`

--

```r
library(listenv)
## Set up a listenv (a special kind of list)
results <- listenv()
create  <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
for (i in 1:10) {
  results[[i]] %<-% analyze(create())
}
results <- unlist(as.list(results)) ## blocks until all
                                    ## computations are done
```

---
class: inverse, middle, center

# Level 5: Using `furrr`

## `furrr` = `purrr` + `future`

---

## `purrr`: Iteration made easy

The `map()` and `map_*()` family of functions elegantly replaces loops. Writing our previous example with `purrr` would give:

```r
library(purrr); library(dplyr)
create  <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
results <- tibble(
  i      = 1:10,
  mu     = rep(0, length(i)),
  result = map_dbl(mu, ~ analyze(create(.x)))
)
results
```

```
## # A tibble: 10 x 3
##        i    mu   result
##    <int> <dbl>    <dbl>
##  1     1     0 -0.0247
##  2     2     0  0.0197
##  3     3     0  0.0222
##  4     4     0 -0.0295
##  5     5     0 -0.0391
##  6     6     0 -0.0323
##  7     7     0  0.0139
##  8     8     0 -0.0327
##  9     9     0  0.0485
## 10    10     0 -0.00247
```

---

# `furrr`: when `future` meets `purrr`

1. `furrr` provides `future_map_*()` functions as drop-in alternatives to the `map_*()` functions.
2. You just need to have a `plan` (as with `future`).

```r
library(furrr)
library(purrr)
library(dplyr)
plan(multiprocess) ## Or plan(batchtools_sge)
```

```
## Warning: [ONE-TIME WARNING] Forked processing ('multicore') is disabled
## in future (>= 1.13.0) when running R from RStudio, because it is
## considered unstable. Because of this, plan("multicore") will fall
## back to plan("sequential"), and plan("multiprocess") will fall back to
## plan("multisession") - not plan("multicore") as in the past. For more details,
## how to control forked processing or not, and how to silence this warning in
## future R sessions, see ?future::supportsMulticore
```

```r
create  <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
results <- tibble(
  i      = 1:10,
  mu     = rep(0, length(i)),
  result = future_map_dbl(mu, ~ analyze(create(.x)))
)
results
```

```
## # A tibble: 10 x 3
##        i    mu   result
##    <int> <dbl>    <dbl>
##  1     1     0  0.0234
##  2     2     0 -0.0120
##  3     3     0 -0.0287
##  4     4     0 -0.00791
##  5     5     0  0.0134
##  6     6     0  0.0167
##  7     7     0  0.0132
##  8     8     0  0.00816
##  9     9     0  0.0147
## 10    10     0 -0.0165
```

---

# Going further

You can set up nested back-ends that spawn multiple jobs, each of which uses multiple cores (a usage sketch is given in the appendix at the end of the deck):

```r
plan(list(
  tweak(batchtools_sge, resources = list(queue = "short.q", threads = 10)),
  tweak(multiprocess, workers = 10)
))
```

Note the `workers` option for `multiprocess`:

- this is **good practice**;
- always specify the number of workers manually when going multiprocess, otherwise R will use all available cores and other people will hate you 😡

---
class: inverse
background-image: url(https://media.giphy.com/media/lD76yTC5zxZPG/giphy.gif)
background-size: contain
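
---

# Appendix: using a nested plan

A minimal sketch of how the nested plan from the "Going further" slide can be used, assuming the SGE template from earlier is in place. The outer `future_map()` dispatches one SGE job per value of `mu`, and the inner `future_map_dbl()` runs replicates on up to 10 cores within each job; the values of `mu` and the number of replicates are illustrative.

```r
library(furrr)
library(future.batchtools)

plan(list(
  tweak(batchtools_sge, resources = list(queue = "short.q", threads = 10)),
  tweak(multiprocess, workers = 10)
))

create  <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }

## Outer level: one SGE job per value of mu
results <- future_map(c(0, 1, 2), function(mu) {
  ## Inner level: 10 replicates in parallel within the job
  future_map_dbl(1:10, ~ analyze(create(mu)))
})
```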