The Spatial Generalized Linear Model (SGLM)

lapply(c("tidyverse", "mvtnorm", "rstan"), library, character.only = TRUE)
## ── Attaching packages ────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0       ✔ purrr   0.2.5  
## ✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.2       ✔ stringr 1.3.1  
## ✔ readr   1.3.1       ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## Loading required package: StanHeaders
## rstan (Version 2.18.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## 
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
## 
##     extract
## [[1]]
##  [1] "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [6] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [11] "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [16] "base"     
## 
## [[2]]
##  [1] "mvtnorm"   "forcats"   "stringr"   "dplyr"     "purrr"    
##  [6] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [11] "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [16] "methods"   "base"     
## 
## [[3]]
##  [1] "rstan"       "StanHeaders" "mvtnorm"     "forcats"     "stringr"    
##  [6] "dplyr"       "purrr"       "readr"       "tidyr"       "tibble"     
## [11] "ggplot2"     "tidyverse"   "stats"       "graphics"    "grDevices"  
## [16] "utils"       "datasets"    "methods"     "base"
### clean-up
rm(list = ls())
### set paths
setwd(WorkDir <- getwd())
DataDir <- paste(WorkDir, "data", sep = "/")
OutDir <- paste(WorkDir, "output", sep = "/")

options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

### useful functions
lower <- function(x, alpha = 0.8) { coda::HPDinterval(coda::as.mcmc(x), prob = alpha)[1] }
upper <- function(x, alpha = 0.8) { coda::HPDinterval(coda::as.mcmc(x), prob = alpha)[2] }
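
For example, applied to draws from a standard normal, these helpers return the bounds of the 80% HPD interval, roughly \(-1.28\) and \(1.28\) (a quick check, not part of the original analysis):

x <- rnorm(1e4)
c(lower(x), upper(x)) ### 80% HPD interval; roughly -1.28 and 1.28 for a standard normal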

And there was space…


sakura <- read.table(file = paste(DataDir, "sakura2019.txt", sep = "/"), header = TRUE,
                     ### first two columns read as character, the remaining 16 as numeric
                     colClasses = c("numeric", "character")[c(2, 2, rep(1, 16))]
                     )
sakura <- subset(sakura, !is.na(Flowering))

sakura %>% 
  ggplot(aes(x = X, y = Y, fill = Flowering)) + 
  geom_point(size = 3, shape = 21) +
  xlab("Easting") + ylab("Northing") +
  theme_bw() +
  ggtitle("2019 Flowering Predictions")

We have previously seen how to work on the structure of the linear predictor (the systematic part) of a GLM. Today we will see how to also work on the structure of the residuals, in particular by accounting for the spatial dimension that may be present in the data.

A first SGLM: ordinary kriging

Ordinary kriging is a hierarchical model in which we assume the existence of a latent variable \(\upsilon\) defined over a spatial domain \(\mathcal{S}\):

\(\upsilon(.) \sim \mathcal{GP}(m(.), c(.,~.))\)

with mean

\[\mathbb{E}[\upsilon(s)] = m(s)\]

and a Matérn covariance function of order \(\frac{3}{2}\)

\[c(s_i,~s_j) = \sigma_{\mathrm{sill}}^2 \left(1 + \dfrac{|s_i - s_j|\sqrt{3}}{\rho}\right) e^{-\dfrac{|s_i - s_j|\sqrt{3}}{\rho}}\]

matern_cov <- function(d, sill = 1, range = 1) {
  sill * sill * (1.0 + abs(d) * sqrt(3.0) / range) * exp(-abs(d) * sqrt(3.0) / range)
}
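
A quick sanity check of this function (the values of sill and range below are chosen only for illustration): the covariance equals \(\sigma_{\mathrm{sill}}^2\) at distance 0 and decays towards 0 as the distance grows.

matern_cov(d = c(0, 1, 5, 10), sill = 2, range = 5) ### starts at 2^2 = 4 and decreases with distance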

D <- as.matrix(dist(sakura[, c("X", "Y")])/10000)
dim(D)
## [1] 48 48

In practice, this distance matrix will let us compute the spatial covariance between any two locations of the domain:

\[\upsilon\left(\begin{bmatrix}s_1\\ \vdots\\ s_n\end{bmatrix}\right) \sim \mathcal{MVN}\left(\begin{bmatrix}m(s_1)\\ \vdots\\ m(s_n)\end{bmatrix},~\begin{bmatrix}c(s_1,~s_1) & \dots & c(s_1,~s_n)\\ \vdots & \ddots & \vdots \\ c(s_n,~s_1) & \dots & c(s_n,~s_n)\end{bmatrix}\right)\]

This covariance matrix depends on \(\rho\), on \(\sigma_{\mathrm{sill}}\) and on the distance between location \(s_i\) and location \(s_j\). The matrix is of finite size, but once we have estimates of \(\rho\) and \(\sigma_{\mathrm{sill}}\), we will be able to predict the value of \(\upsilon\) at any point of the domain \(\mathcal{S}\).
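
To make this concrete, here is a minimal sketch that builds the covariance matrix from the distance matrix D with matern_cov() and simulates a single realization of \(\upsilon\) at the 48 observed locations with mvtnorm::rmvnorm() (the values sill = 5 and range = 50 are arbitrary, chosen only for illustration):

### sketch: one Gaussian-process realization at the observed locations
set.seed(666)
Sigma <- matern_cov(D, sill = 5, range = 50) ### 48 x 48 covariance matrix (illustrative values)
upsilon_sim <- as.vector(mvtnorm::rmvnorm(n = 1, mean = rep(0, nrow(D)), sigma = Sigma))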

For ordinary kriging, the mean of the Gaussian process is assumed constant at every point in space:

\[\forall s \in \mathcal{S}, ~m(s) = \mu\]

In \(\texttt{Stan}\), this gives

stan_sglm1 <- '
 data {
  int<lower = 1> n_obs;
  vector[n_obs] FLOWERING;
  matrix[n_obs, n_obs] D; // distance matrix (km)
  real prior_location_intercept;
  real<lower = 0.0> prior_scale_intercept;
 }
 
 parameters {
  real unscaled_intercept;
  simplex[2] pi;
  real<lower = 0.0> sigma_tot;
  real<lower = 0.0> rho;
  vector[n_obs] upsilon;
 }

 transformed parameters {
  real intercept;
  cov_matrix[n_obs] Omega;
  vector[n_obs] linear_predictor;
  real sill;
  real nugget;
  intercept = prior_location_intercept + unscaled_intercept * prior_scale_intercept;
  nugget = sigma_tot * sqrt(pi[1]);
  sill = sigma_tot * sqrt(pi[2]);
  for(i in 1:n_obs) {
   for(j in 1:n_obs) {
    Omega[i, j] = (1 + D[i, j] * sqrt(3) / rho) * exp(-D[i, j] * sqrt(3) / rho);
   }
  }
  linear_predictor = rep_vector(intercept, n_obs) + sill * upsilon;
 }

 model {
  rho ~ gamma(2.1535, 0.0292); // prior range between 10 and 200
  sigma_tot ~ student_t(3, 0.0, 1.0);
  unscaled_intercept ~ normal(0.0, 1.0);
  upsilon ~ multi_normal(rep_vector(0.0, n_obs), Omega);
  FLOWERING ~ normal(linear_predictor, nugget);
 }

 generated quantities {
  vector[n_obs] log_lik;
  for(i in 1:n_obs) {
   log_lik[i] = normal_lpdf(FLOWERING[i]| linear_predictor[i], nugget);
  }
 }
'
## Compilation
sglm1 <- stan_model(model_code = stan_sglm1,
                    model_name = "Mon premier GLM spatial avec Stan"
                    )
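
As an aside, the comment in the model block states that the gamma(2.1535, 0.0292) prior puts the range \(\rho\) roughly between 10 and 200; a quick check of that claim in R (a sketch, not part of the original analysis):

### quantiles of the gamma(2.1535, 0.0292) prior on rho
qgamma(c(0.05, 0.95), shape = 2.1535, rate = 0.0292)
### prior probability mass between 10 and 200
diff(pgamma(c(10, 200), shape = 2.1535, rate = 0.0292))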

We will restrict ourselves to the complete data for the upcoming analyses.

n_chains <- 4
n_iter <- 2000
n_warm <- 500
n_thin <- 5

fit_1 <- sampling(object = sglm1,
                  data = list(n_obs = nrow(sakura),
                              FLOWERING = sakura$Flowering,
                              D = round(D, 6),
                              prior_location_intercept = 100,
                              prior_scale_intercept = 10
                              ), 
                  pars = c("intercept", "rho", "sill", "nugget", "upsilon", "log_lik"),
                  chains = n_chains, 
                  iter = n_iter, 
                  warmup = n_warm, 
                  thin = n_thin
                  )
## Warning: There were 91 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: There were 11 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: There were 1 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
get_elapsed_time(fit_1)
##          warmup   sample
## chain:1 28.5499  67.2348
## chain:2 28.7889  62.7591
## chain:3 31.4908 121.3140
## chain:4 24.1072  65.2568
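
The warnings above suggest increasing adapt_delta and max_treedepth; one way to do so is through the control argument of sampling() (a sketch, at the cost of longer run times; the values 0.95 and 12 are illustrative, not from the original analysis):

### sketch: rerun with stricter sampler settings (illustrative values)
fit_1b <- sampling(object = sglm1,
                   data = list(n_obs = nrow(sakura),
                               FLOWERING = sakura$Flowering,
                               D = round(D, 6),
                               prior_location_intercept = 100,
                               prior_scale_intercept = 10
                               ),
                   pars = c("intercept", "rho", "sill", "nugget", "upsilon", "log_lik"),
                   chains = n_chains,
                   iter = n_iter,
                   warmup = n_warm,
                   thin = n_thin,
                   control = list(adapt_delta = 0.95, max_treedepth = 12)
                   )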

Exercise: change the prior on \(\rho\) to a uniform distribution between \(0\) and \(250\). What do you notice?

The first thing to do is to check the convergence of the parameters:

stan_rhat(fit_1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
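
Beyond the histogram of \(\hat{R}\) values, the numerical summaries can be inspected directly (a sketch):

print(fit_1, pars = c("intercept", "rho", "sill", "nugget"), probs = c(0.1, 0.5, 0.9))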

Look at the posterior distribution:

pairs(x = fit_1, 
      pars = c("intercept", "nugget", "sill", "rho", "lp__")
      )
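
Finally, the lower() and upper() helpers defined at the beginning can be used to summarize the marginal posteriors with 80% HPD intervals (a sketch, assuming the as.data.frame method that rstan provides for stanfit objects):

### 80% HPD intervals for the main parameters, using the helpers defined above
as.data.frame(fit_1, pars = c("intercept", "rho", "sill", "nugget")) %>%
  gather(key = "parameter", value = "draw") %>%
  group_by(parameter) %>%
  summarize(mean = mean(draw),
            lower = lower(draw),
            upper = upper(draw)
            )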