class: center, middle, inverse, title-slide

# Using R on Migale
## A short trip through purrr, future and furrr
### Mahendra Mariadassou
### INRAE - MaIAGE, Migale Team
### 2020-03-12 (updated: 2020-03-14)

---
class: inverse, middle, center

# Welcome to the [Migale platform!](https://migale.inra.fr/)

---
background-image: url(img/migale.png)
background-size: contain

---
class: middle

# Prerequisites:

- ## An account<sup>1</sup>: request one using the [form](https://migale.inra.fr/ask-account)
- ## Working knowledge of Unix<sup>2</sup>

.footnote[
[1] Requires an academic e-mail address

[2] See [here](http://genome.jouy.inra.fr/~orue/tuto-unix.html) for a simple intro
]

---
class: center, middle

# Architecture

## Migale is the **front node**

--

## which connects you to the [computer farm](https://migale.inra.fr/cluster)

--

### Tens of **nodes** (~ machines) representing hundreds of **cores** (~ CPUs)

---
class: inverse, center, middle

# Level 1: RStudio server

---
class: middle, center

# Connect to the Migale [RStudio server](https://rstudio.migale.inrae.fr/)

### Available at [https://rstudio.migale.inrae.fr/](https://rstudio.migale.inrae.fr/)

--

### Super easy ☺️ but runs on a Virtual Machine 😞

---
class: inverse, center, middle

# Level 2: bash mode

---

# Prepare your R script

```r
create  <- function(mu = 0) { rnorm(100, mean = mu) }
analyze <- function(x) { mean(x) }
results <- numeric(100)
for (i in 1:100) {
  results[i] <- analyze(create())
}
saveRDS(results, "results.rds")
```

```bash
cat scripts/my_script_1.R
```

```
## create  <- function(mu = 0) { rnorm(100, mean = mu) }
## analyze <- function(x) { mean(x) }
## results <- numeric(100)
## for (i in 1:100) {
##   results[i] <- analyze(create())
## }
## saveRDS(results, "results.rds")
```

---

# Running your script

.pull-left[
1. Transfer scripts and data to migale<sup>1</sup>
2. Connect to migale
3. Run the script
4. Work on the results
]

.pull-right[
```bash
scp -r . mmariadasso@migale:SOTR/
```

```bash
ssh mmariadasso@migale
```

```bash
Rscript my_script_1.R
```
]

--

To make life easier, add

```bash
Host migale
  User mmariadasso
  HostName migale.jouy.inra.fr
```

to your `~/.ssh/config` file.

--

### Quite easy ☺️ but uses only the front node 😞

.footnote[[1] You may need to expand `migale` to `migale.jouy.inra.fr`]

---
class: inverse, center, middle

# Level 3: Using SGE

---

# About SGE

**S**un **G**rid **E**ngine is a *job scheduler*: you submit many jobs on the front node and SGE dispatches them to the computer farm.

A longer introduction to SGE can be found [here](https://migale.inra.fr/sge), but here is a simple example:

```bash
RSCRIPT=/usr/local/public/R/bin/Rscript
RJOB="my_script_1.R"
qsub -S $RSCRIPT -q short.q -cwd -V -M me@mail.com -m bea $RJOB
```

`RSCRIPT` and `RJOB` are *environment variables* and are expanded in the final call. Here we need to specify `-S $RSCRIPT` to make sure that the instructions in `my_script_1.R` are executed by R.

---

# SGE options

Let's unpack the options:

- `-cwd` runs the job in the current working directory
- `-V` passes all environment variables to the job
- `-N <jobname>` names the job; this is the name you will see when you check the status of your jobs with `qstat`
- `-b y` allows the command to be a binary file instead of a script

Other useful options, combined in the sketch below, are:

- `-q <queue>` sets the queue; see [here](https://migale.inra.fr/cluster) to choose a queue (short / long / bigmem / etc.) adapted to your needs
- `-pe thread <n_slots>` specifies the parallel environment: `thread` runs a parallel job using shared memory with `<n_slots>` cores
- `-R y` reserves resources for the job as soon as they become free
- `-o <output_logfile>` names the output log file
- `-e <error_logfile>` names the error log file
- `-m bea` sends an email when the job **b**egins, **e**nds or **a**borts
- `-M <emailaddress>` sets the email address notifications are sent to
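
Putting several of these options together gives a call like the following minimal sketch (the job name, log file names and email address are placeholders to adapt):

```bash
RSCRIPT=/usr/local/public/R/bin/Rscript
qsub -S $RSCRIPT -q short.q -cwd -V \
     -N my_job -o my_job.out -e my_job.err \
     -M me@mail.com -m bea \
     my_script_1.R
```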

---

# Leveraging the computer farm (I)

We're still only using one node at a time!

### Decompose your script and pass arguments

```bash
cat scripts/my_script_2.R
```

```r
## Arguments
args <- commandArgs(trailingOnly = TRUE)
id   <- as.integer(args[1])

## Computations
create  <- function(mu = 0) { rnorm(100, mean = mu) }
analyze <- function(x) { mean(x) }
result  <- analyze(create())

## Results
saveRDS(object = result, file = paste0("result_", id, ".rds"))
```

---

# Leveraging the computer farm (II)

### Use `qsub` repeatedly

```bash
RSCRIPT="/usr/local/public/R/bin/Rscript"
RJOB="my_script_2.R"
QSUB="qsub -S $RSCRIPT -q short.q -cwd -V -M me@mail.com -m bea"
seq 1 100 | xargs -I {} $QSUB $RJOB {}
```

This is equivalent to

```bash
$QSUB $RJOB 1
...
$QSUB $RJOB 100
```

### Aggregate all the results at the end

```r
results <- numeric(100)
for (i in 1:100) results[i] <- readRDS(paste0("result_", i, ".rds"))
```

---

# Monitoring your jobs

Use `qstat` on Migale to monitor the state of your jobs: `qw` (waiting), `Eqw` (error), `t` (transferring), `r` (running).
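
For example (a minimal sketch, assuming your login is `mmariadasso`; the job id passed to `qdel` is a placeholder taken from the `qstat` output):

```bash
qstat -u mmariadasso   # list your jobs and their states
qdel 123456            # cancel job 123456 if something went wrong
```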

# Some pros and cons

➕ Quite easy if you want to parallelize loops (simulations)

➕ Uses many machines (not just many cores on a single machine)

➖ Back and forth between `R` and `bash`

➖ Not perfect for numerical experiments (machines are not perfectly similar)

---
class: inverse, center, middle

# Level 4: Using `future`

---

# The `future` and `future.batchtools` packages

1. `future` allows you to call SGE directly from R and removes the back and forth
2. `future` is quite general and can handle many back-ends
3. You need to specify the back-end with `plan`. Here are some examples:

```r
library(future)
library(future.batchtools)
plan(sequential)     ## R as usual
plan(multiprocess)   ## Use many cores on the same machine
plan(batchtools_sge) ## Use SGE via the future.batchtools package
```

But first you need to set up a **configuration file**.

---

```bash
cat ~/.batchtools.sge.tmpl ## on Migale
```

.smaller[
```bash
#!/bin/bash

## The name of the job
#$ -N <%= job.name %>

## Combining output/error messages into one file
#$ -j y

## Giving the name of the output log file
#$ -o <%= log.file %>

## Execute the script in the working directory
#$ -cwd

## Use environment variables
#$ -V

## Use multithreading
#$ -pe threads <%= resources$threads %>

## Use correct queue
#$ -q <%= resources$queue %>

## Export value of the DEBUGME environment variable to the worker
export DEBUGME=<%= Sys.getenv("DEBUGME") %>

<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>

Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
exit 0
```
]

---

# Time to make a plan

```r
library(future.batchtools)
plan(batchtools_sge,
     workers = 10,                        ## maximum number of parallel jobs
                                          ## (unlimited by default)
     template = "~/.batchtools.sge.tmpl", ## SGE template; not needed here
                                          ## since it is read by default
     resources = list(                    ## parameters set on the fly
       queue   = "short.q",               ## queue to use
       threads = 1                        ## number of cores per job
     )
)
```

---

# Working with future

A simple **drop-in** replacement in most scripts:

- replace `vector` and/or `list` with `listenv`
- replace `<-` with `%<-%`
- transform the `listenv` back to a `list` and/or `vector`

--

```r
library(listenv)
## Set up a listenv (a special kind of list)
results <- listenv()
create  <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
for (i in 1:10) {
  results[[i]] %<-% analyze(create())
}
results <- unlist(as.list(results)) ## blocks until all
                                    ## computations are done
```

---
class: inverse, middle, center

# Level 5: Using `furrr`

## `furrr` = `purrr` + `future`

---

## `purrr`: Iteration made easy

The `map()` and `map_*()` family of functions elegantly replaces loops. Writing our previous example with `purrr` would give:

```r
library(purrr); library(dplyr)
create  <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
results <- tibble(
  i      = 1:10,
  mu     = rep(0, length(i)),
  result = map_dbl(mu, ~ analyze(create(.x)))
)
results
```

```
## # A tibble: 10 x 3
##        i    mu   result
##    <int> <dbl>    <dbl>
##  1     1     0 -0.0247
##  2     2     0  0.0197
##  3     3     0  0.0222
##  4     4     0 -0.0295
##  5     5     0 -0.0391
##  6     6     0 -0.0323
##  7     7     0  0.0139
##  8     8     0 -0.0327
##  9     9     0  0.0485
## 10    10     0 -0.00247
```

---

# `furrr`: when `future` meets `purrr`

1. `furrr` provides `future_map_*()` functions as drop-in alternatives to the `map_*()` functions.
2. You just need to have a `plan` (as with `future`).

```r
library(furrr)
library(purrr)
library(dplyr)
plan(multiprocess) ## Or plan(batchtools_sge)
```

```
## Warning: [ONE-TIME WARNING] Forked processing ('multicore') is disabled
## in future (>= 1.13.0) when running R from RStudio, because it is
## considered unstable. Because of this, plan("multicore") will fall
## back to plan("sequential"), and plan("multiprocess") will fall back to
## plan("multisession") - not plan("multicore") as in the past. For more details,
## how to control forked processing or not, and how to silence this warning in
## future R sessions, see ?future::supportsMulticore
```

```r
create  <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
results <- tibble(
  i      = 1:10,
  mu     = rep(0, length(i)),
  result = future_map_dbl(mu, ~ analyze(create(.x)))
)
results
```

```
## # A tibble: 10 x 3
##        i    mu   result
##    <int> <dbl>    <dbl>
##  1     1     0  0.0234
##  2     2     0 -0.0120
##  3     3     0 -0.0287
##  4     4     0 -0.00791
##  5     5     0  0.0134
##  6     6     0  0.0167
##  7     7     0  0.0132
##  8     8     0  0.00816
##  9     9     0  0.0147
## 10    10     0 -0.0165
```

---

# Going further

You can set up nested back-ends that spawn multiple jobs, each of which uses multiple cores (a usage sketch is given in the appendix at the end of the deck):

```r
plan(list(
  tweak(batchtools_sge, resources = list(queue = "short.q", threads = 10)),
  tweak(multiprocess, workers = 10)
))
```

Note the `workers` option for `multiprocess`:

- this is **good practice**;
- always specify the number of workers manually when going multiprocess, otherwise R will use all available cores and other people will hate you 😡

---
class: inverse
background-image: url(https://media.giphy.com/media/lD76yTC5zxZPG/giphy.gif)
background-size: contain
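
---

# Appendix: using a nested plan

A minimal sketch of how the nested plan from the "Going further" slide can be used, assuming the SGE template from earlier is in place. The outer `future_map()` dispatches one SGE job per value of `mu`, and the inner `future_map_dbl()` runs replicates on up to 10 cores within each job; the values of `mu` and the number of replicates are illustrative.

```r
library(furrr)
library(future.batchtools)

plan(list(
  tweak(batchtools_sge, resources = list(queue = "short.q", threads = 10)),
  tweak(multiprocess, workers = 10)
))

create  <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }

## Outer level: one SGE job per value of mu
results <- future_map(c(0, 1, 2), function(mu) {
  ## Inner level: 10 replicates in parallel within the job
  future_map_dbl(1:10, ~ analyze(create(mu)))
})
```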