The Spatial Generalized Linear Model (SGLM)

lapply(c("tidyverse", "mvtnorm", "rstan"), library, character.only = TRUE)
## ── Attaching packages ────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0       ✔ purrr   0.2.5  
## ✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.2       ✔ stringr 1.3.1  
## ✔ readr   1.3.1       ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## Loading required package: StanHeaders
## rstan (Version 2.18.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## 
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
## 
##     extract
## [[1]]
##  [1] "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [6] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [11] "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [16] "base"     
## 
## [[2]]
##  [1] "mvtnorm"   "forcats"   "stringr"   "dplyr"     "purrr"    
##  [6] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [11] "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [16] "methods"   "base"     
## 
## [[3]]
##  [1] "rstan"       "StanHeaders" "mvtnorm"     "forcats"     "stringr"    
##  [6] "dplyr"       "purrr"       "readr"       "tidyr"       "tibble"     
## [11] "ggplot2"     "tidyverse"   "stats"       "graphics"    "grDevices"  
## [16] "utils"       "datasets"    "methods"     "base"
### clean-up
rm(list = ls())
### set paths
setwd(WorkDir <- getwd())
DataDir <- paste(WorkDir, "data", sep = "/")
OutDir <- paste(WorkDir, "output", sep = "/")

options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

### useful functions
lower <- function(x, alpha = 0.8) { coda::HPDinterval(coda::as.mcmc(x), prob = alpha)[1] }
upper <- function(x, alpha = 0.8) { coda::HPDinterval(coda::as.mcmc(x), prob = alpha)[2] }
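
A quick illustration of what these helpers return (a minimal check, assuming the coda package is installed): the bounds of the 80% highest posterior density interval of a sample.

x <- rnorm(1e4)
c(lower(x), upper(x))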

And there was space…


sakura <- read.table(file = paste(DataDir, "sakura2019.txt", sep = "/"), header = TRUE,
                     colClasses = c("numeric", "character")[c(2, 2, rep(1, 16))]
                     )
sakura <- subset(sakura, !is.na(Flowering))

sakura %>% 
  ggplot(aes(x = X, y = Y, fill = Flowering)) + 
  geom_point(size = 3, shape = 21) +
  xlab("Easting") + ylab("Northing") +
  theme_bw() +
  ggtitle("2019 Flowering Predictions")

We previously saw how to work on the structure of the linear predictor (the systematic part) of a GLM. Today we will also work on the structure of the residuals, in particular by accounting for the spatial dimension that may be present in the data.

A first SGLM: ordinary kriging

Ordinary kriging is a hierarchical model in which we assume the existence of a latent variable \(\upsilon\) defined over a spatial domain \(\mathcal{S}\):

\(\upsilon(.) \sim \mathcal{GP}(m(.), c(.,~.))\)

with mean

\[\mathbb{E}[\upsilon(s)] = m(s)\]

and a Matérn covariance function of order \(\frac{3}{2}\)

\[c(s_i,~s_j) = \sigma_{\mathrm{sill}}^2 \left(1 + \dfrac{|s_i - s_j|\sqrt{3}}{\rho}\right) \exp\left(-\dfrac{|s_i - s_j|\sqrt{3}}{\rho}\right)\]

### Matérn 3/2 covariance function (variance = sill^2)
matern_cov <- function(d, sill = 1, range = 1) {
  sill * sill * (1.0 + abs(d) * sqrt(3.0) / range) * exp(-abs(d) * sqrt(3.0) / range)
}
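
A quick sanity check of this function (with sill = 1, the covariance should be exactly 1 at distance 0 and decay towards 0 as the distance grows):

matern_cov(d = c(0, 1, 5, 10), sill = 1, range = 5)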

### pairwise distances between the observed locations (rescaled)
D <- as.matrix(dist(sakura[, c("X", "Y")])/10000)
dim(D)
## [1] 48 48
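
A quick look at the scale of these rescaled distances, which is the scale on which the prior for \(\rho\) will operate (a minimal check):

summary(D[upper.tri(D)])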

In practice, this distance matrix lets us compute the spatial covariance between any two locations in the domain:

\[\upsilon\left(\begin{bmatrix}s_1\\ \dots\\s_n\end{bmatrix}\right) \sim \mathcal{MVN}\left(\begin{bmatrix}m(s_1)\\ \dots\\m(s_n)\end{bmatrix},~\begin{bmatrix}c(s_1,~s_1) & \dots & c(s_1,~s_n)\\ & \dots & \\ c(s_n,~s_1) & \dots & c(s_n,~s_n)\end{bmatrix}\right)\]

This matrix depends on \(\rho\), on \(\sigma_{\mathrm{sill}}\), and on the distance between locations \(s_i\) and \(s_j\). It is of finite size, but once we have estimates of \(\rho\) and \(\sigma_{\mathrm{sill}}\) we will be able to predict the value of \(\upsilon\) at any point of the domain \(\mathcal{S}\).

For ordinary kriging, the mean of the Gaussian process is assumed to be constant everywhere in space:

\[\forall s \in \mathcal{S}, ~m(s) = \mu\]

So in \(\texttt{Stan}\), this gives:

stan_sglm1 <- '
 data {
  int<lower = 1> n_obs;
  vector[n_obs] FLOWERING;
  matrix[n_obs, n_obs] D; // distance matrix (km)
  real prior_location_intercept;
  real<lower = 0.0> prior_scale_intercept;
 }
 
 parameters {
  real unscaled_intercept;
  simplex[2] pi;
  real<lower = 0.0> sigma_tot;
  real<lower = 0.0> rho;
  vector[n_obs] upsilon;
 }

 transformed parameters {
  real intercept;
  cov_matrix[n_obs] Omega;
  vector[n_obs] linear_predictor;
  real sill;
  real nugget;
  intercept = prior_location_intercept + unscaled_intercept * prior_scale_intercept;
  nugget = sigma_tot * sqrt(pi[1]);
  sill = sigma_tot * sqrt(pi[2]);
  for(i in 1:n_obs) {
   for(j in 1:n_obs) {
    Omega[i, j] = (1 + D[i, j] * sqrt(3) / rho) * exp(-D[i, j] * sqrt(3) / rho);
   }
  }
  linear_predictor = rep_vector(intercept, n_obs) + sill * upsilon;
 }

 model {
  rho ~ gamma(2.1535, 0.0292); // prior range between 10 and 200
  sigma_tot ~ student_t(3, 0.0, 1.0);
  unscaled_intercept ~ normal(0.0, 1.0);
  upsilon ~ multi_normal(rep_vector(0.0, n_obs), Omega);
  FLOWERING ~ normal(linear_predictor, nugget);
 }

 generated quantities {
  vector[n_obs] log_lik;
  for(i in 1:n_obs) {
   log_lik[i] = normal_lpdf(FLOWERING[i]| linear_predictor[i], nugget);
  }
 }
'
## Compilation
sglm1 <- stan_model(model_code = stan_sglm1,
                    model_name = "Mon premier GLM spatial avec Stan"
                    )

We restrict ourselves to the complete cases (no missing flowering date) for the analyses that follow.

n_chains <- 4
n_iter <- 2000
n_warm <- 500
n_thin <- 5

fit_1 <- sampling(object = sglm1,
                  data = list(n_obs = nrow(sakura),
                              FLOWERING = sakura$Flowering,
                              D = round(D, 6),
                              prior_location_intercept = 100,
                              prior_scale_intercept = 10
                              ), 
                  pars = c("intercept", "rho", "sill", "nugget", "upsilon", "log_lik"),
                  chains = n_chains, 
                  iter = n_iter, 
                  warmup = n_warm, 
                  thin = n_thin
                  )
## Warning: There were 91 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: There were 11 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: There were 1 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
get_elapsed_time(fit_1)
##          warmup   sample
## chain:1 28.5499  67.2348
## chain:2 28.7889  62.7591
## chain:3 31.4908 121.3140
## chain:4 24.1072  65.2568

Exercise: change the prior on \(\rho\) to a uniform distribution between \(0\) and \(250\). What do you notice?
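
Before doing so, it can help to check what the current \(\texttt{gamma(2.1535, 0.0292)}\) prior actually implies for \(\rho\); a minimal check (the central 95% interval should come out close to the 10–200 range mentioned in the Stan comment):

qgamma(c(0.025, 0.975), shape = 2.1535, rate = 0.0292)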

The first thing to do is to check that the parameters have converged:

stan_rhat(fit_1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Look at the joint posterior distribution:

pairs(x = fit_1, 
      pars = c("intercept", "nugget", "sill", "rho", "lp__")
      )

Plot the correlogram (posterior correlation as a function of distance):

rho_draws <- rstan::extract(fit_1, "rho")$rho
correlo <- do.call('cbind', purrr::map(.x = 0:250,
             .f = function(d) { matern_cov(d = d, sill = 1, range = rho_draws) })
             )
ggplot(data = data.frame(distance = 0:250,
                         correlation = apply(correlo, 2, mean),
                         lower = apply(correlo, 2, lower),
                         upper = apply(correlo, 2, upper)
                         ),
       aes(x = distance, y = correlation)
       ) +
  geom_ribbon(aes(x = distance, ymin = lower, ymax = upper), fill = "steelblue", alpha = 0.1) +
  geom_line(colour = "midnightblue") +
  theme_bw()

Finally, we can take a quick look at the estimates:

print(fit_1, digits = 3)
## Inference for Stan model: Mon premier GLM spatial avec Stan.
## 4 chains, each with iter=2000; warmup=500; thin=5; 
## post-warmup draws per chain=300, total post-warmup draws=1200.
## 
##               mean se_mean     sd    2.5%    25%    50%     75%   97.5%
## intercept   98.305   0.517  7.010  84.710 94.025 98.197 102.947 112.048
## rho         71.695   4.282 19.695  40.464 57.606 68.588  83.416 118.356
## sill        12.521   0.367  3.082   7.973 10.287 12.066  14.376  19.992
## nugget       2.788   0.022  0.380   2.120  2.536  2.760   3.015   3.616
## upsilon[1]  -1.348   0.058  0.650  -2.630 -1.794 -1.344  -0.899  -0.205
## upsilon[2]   0.698   0.040  0.619  -0.431  0.287  0.682   1.101   1.959
## upsilon[3]   1.302   0.042  0.671   0.123  0.822  1.283   1.727   2.722
## upsilon[4]  -1.113   0.052  0.635  -2.362 -1.534 -1.122  -0.665   0.021
## upsilon[5]  -1.635   0.066  0.697  -3.000 -2.114 -1.607  -1.138  -0.437
## upsilon[6]  -1.017   0.050  0.619  -2.284 -1.434 -1.023  -0.588   0.124
## upsilon[7]  -1.657   0.058  0.680  -2.980 -2.119 -1.647  -1.143  -0.458
## upsilon[8]  -0.401   0.045  0.602  -1.545 -0.810 -0.399  -0.007   0.768
## upsilon[9]  -1.268   0.056  0.641  -2.551 -1.711 -1.265  -0.828  -0.132
## upsilon[10] -0.963   0.054  0.633  -2.181 -1.387 -0.965  -0.544   0.201
## upsilon[11] -1.594   0.066  0.690  -2.964 -2.060 -1.567  -1.096  -0.385
## upsilon[12]  2.089   0.050  0.744   0.758  1.549  2.068   2.605   3.676
## upsilon[13]  2.006   0.052  0.748   0.670  1.471  1.977   2.467   3.544
## upsilon[14]  1.807   0.049  0.725   0.517  1.289  1.782   2.266   3.332
## upsilon[15] -1.336   0.054  0.651  -2.603 -1.793 -1.332  -0.855  -0.186
## upsilon[16] -0.985   0.053  0.629  -2.217 -1.408 -0.988  -0.558   0.168
## upsilon[17] -0.802   0.046  0.604  -2.039 -1.198 -0.807  -0.391   0.321
## upsilon[18]  0.656   0.039  0.618  -0.456  0.245  0.650   1.070   1.885
## upsilon[19] -1.454   0.056  0.667  -2.774 -1.925 -1.441  -0.969  -0.293
## upsilon[20] -1.284   0.046  0.636  -2.673 -1.719 -1.242  -0.813  -0.173
## upsilon[21] -1.331   0.062  0.653  -2.607 -1.769 -1.324  -0.882  -0.141
## upsilon[22] -1.597   0.065  0.692  -2.983 -2.061 -1.577  -1.110  -0.441
## upsilon[23] -1.620   0.058  0.679  -2.923 -2.091 -1.600  -1.105  -0.427
## upsilon[24] -1.295   0.053  0.643  -2.592 -1.741 -1.280  -0.818  -0.168
## upsilon[25] -1.379   0.056  0.654  -2.700 -1.828 -1.369  -0.891  -0.242
## upsilon[26] -0.146   0.043  0.590  -1.267 -0.550 -0.146   0.229   1.026
## upsilon[27] -1.427   0.049  0.656  -2.765 -1.867 -1.384  -0.941  -0.275
## upsilon[28] -0.729   0.046  0.615  -1.953 -1.135 -0.721  -0.332   0.398
## upsilon[29] -1.572   0.054  0.662  -2.870 -2.011 -1.553  -1.069  -0.429
## upsilon[30] -1.339   0.054  0.646  -2.636 -1.773 -1.322  -0.859  -0.223
## upsilon[31] -0.222   0.044  0.598  -1.393 -0.623 -0.219   0.168   0.943
## upsilon[32] -1.645   0.060  0.691  -2.985 -2.112 -1.631  -1.112  -0.450
## upsilon[33] -1.439   0.056  0.665  -2.752 -1.901 -1.437  -0.963  -0.280
## upsilon[34] -1.332   0.054  0.648  -2.609 -1.778 -1.325  -0.854  -0.195
## upsilon[35] -1.655   0.057  0.677  -2.951 -2.119 -1.645  -1.144  -0.485
## upsilon[36] -1.104   0.058  0.643  -2.366 -1.544 -1.107  -0.674   0.064
## upsilon[37] -1.265   0.054  0.639  -2.523 -1.696 -1.244  -0.798  -0.158
## upsilon[38] -1.414   0.062  0.671  -2.734 -1.868 -1.407  -0.945  -0.253
## upsilon[39] -1.439   0.064  0.668  -2.760 -1.900 -1.423  -0.974  -0.260
## upsilon[40] -0.938   0.053  0.629  -2.129 -1.349 -0.936  -0.511   0.242
## upsilon[41] -1.425   0.056  0.661  -2.728 -1.882 -1.413  -0.929  -0.301
## upsilon[42] -1.272   0.061  0.649  -2.533 -1.714 -1.264  -0.833  -0.100
## upsilon[43] -1.316   0.056  0.659  -2.632 -1.760 -1.306  -0.859  -0.109
## upsilon[44] -0.708   0.045  0.603  -1.956 -1.088 -0.706  -0.311   0.424
## upsilon[45] -1.380   0.055  0.654  -2.704 -1.822 -1.370  -0.895  -0.238
## upsilon[46] -0.119   0.043  0.593  -1.233 -0.539 -0.110   0.265   1.052
## upsilon[47] -1.627   0.059  0.685  -3.021 -2.086 -1.614  -1.111  -0.398
## upsilon[48] -1.261   0.061  0.655  -2.541 -1.696 -1.256  -0.827  -0.086
## log_lik[1]  -2.721   0.015  0.523  -4.000 -2.994 -2.630  -2.361  -1.928
## log_lik[2]  -2.144   0.009  0.303  -2.980 -2.251 -2.082  -1.957  -1.742
## log_lik[3]  -2.297   0.014  0.480  -3.672 -2.449 -2.170  -1.999  -1.746
## log_lik[4]  -2.741   0.023  0.776  -4.844 -3.049 -2.544  -2.191  -1.824
## log_lik[5]  -2.108   0.008  0.260  -2.746 -2.224 -2.063  -1.940  -1.736
## log_lik[6]  -2.095   0.008  0.250  -2.714 -2.200 -2.053  -1.937  -1.730
## log_lik[7]  -2.235   0.012  0.366  -3.183 -2.371 -2.146  -2.000  -1.778
## log_lik[8]  -2.463   0.015  0.473  -3.572 -2.710 -2.382  -2.115  -1.835
## log_lik[9]  -3.189   0.020  0.628  -4.672 -3.552 -3.100  -2.753  -2.177
## log_lik[10] -2.076   0.013  0.231  -2.651 -2.171 -2.042  -1.929  -1.718
## log_lik[11] -2.087   0.007  0.241  -2.720 -2.196 -2.049  -1.931  -1.726
## log_lik[12] -2.330   0.026  0.523  -3.706 -2.529 -2.156  -1.983  -1.741
## log_lik[13] -3.156   0.030  0.919  -5.337 -3.604 -2.958  -2.469  -1.957
## log_lik[14] -2.115   0.009  0.286  -2.839 -2.214 -2.055  -1.941  -1.744
## log_lik[15] -2.035   0.007  0.181  -2.464 -2.128 -2.021  -1.913  -1.712
## log_lik[16] -2.076   0.011  0.230  -2.626 -2.176 -2.046  -1.929  -1.727
## log_lik[17] -2.080   0.007  0.246  -2.711 -2.197 -2.043  -1.928  -1.716
## log_lik[18] -2.141   0.009  0.325  -3.049 -2.233 -2.072  -1.943  -1.733
## log_lik[19] -2.301   0.015  0.362  -3.231 -2.462 -2.230  -2.044  -1.811
## log_lik[20] -2.783   0.061  0.891  -5.081 -3.147 -2.505  -2.141  -1.817
## log_lik[21] -2.081   0.012  0.245  -2.673 -2.191 -2.038  -1.920  -1.738
## log_lik[22] -2.833   0.022  0.731  -4.667 -3.181 -2.675  -2.304  -1.871
## log_lik[23] -2.321   0.013  0.391  -3.356 -2.514 -2.232  -2.059  -1.785
## log_lik[24] -2.258   0.009  0.300  -3.011 -2.390 -2.208  -2.054  -1.831
## log_lik[25] -2.119   0.008  0.250  -2.751 -2.224 -2.081  -1.957  -1.737
## log_lik[26] -2.494   0.018  0.546  -3.921 -2.764 -2.367  -2.090  -1.835
## log_lik[27] -2.175   0.015  0.358  -3.093 -2.284 -2.079  -1.957  -1.730
## log_lik[28] -5.730   0.064  1.598  -9.226 -6.705 -5.578  -4.515  -3.195
## log_lik[29] -2.139   0.009  0.298  -2.908 -2.256 -2.077  -1.950  -1.730
## log_lik[30] -2.032   0.007  0.184  -2.434 -2.135 -2.016  -1.916  -1.712
## log_lik[31] -2.327   0.016  0.515  -3.777 -2.468 -2.183  -2.007  -1.795
## log_lik[32] -2.186   0.011  0.333  -3.001 -2.316 -2.111  -1.973  -1.751
## log_lik[33] -2.074   0.011  0.229  -2.633 -2.175 -2.042  -1.924  -1.722
## log_lik[34] -2.161   0.009  0.253  -2.752 -2.300 -2.122  -1.996  -1.755
## log_lik[35] -2.215   0.011  0.339  -3.039 -2.369 -2.132  -1.988  -1.747
## log_lik[36] -2.107   0.021  0.262  -2.835 -2.213 -2.064  -1.943  -1.718
## log_lik[37] -2.964   0.020  0.584  -4.377 -3.261 -2.854  -2.561  -2.126
## log_lik[38] -2.188   0.011  0.366  -3.174 -2.301 -2.102  -1.960  -1.747
## log_lik[39] -2.176   0.011  0.369  -3.127 -2.264 -2.101  -1.956  -1.763
## log_lik[40] -2.033   0.010  0.195  -2.508 -2.133 -2.013  -1.911  -1.706
## log_lik[41] -2.519   0.018  0.494  -3.801 -2.768 -2.432  -2.159  -1.820
## log_lik[42] -2.693   0.029  0.546  -3.879 -3.005 -2.597  -2.293  -1.894
## log_lik[43] -2.559   0.017  0.583  -4.074 -2.823 -2.436  -2.146  -1.816
## log_lik[44] -2.146   0.011  0.313  -3.028 -2.252 -2.071  -1.958  -1.744
## log_lik[45] -2.111   0.008  0.300  -2.756 -2.199 -2.063  -1.947  -1.767
## log_lik[46] -3.250   0.032  0.822  -5.322 -3.646 -3.091  -2.671  -2.148
## log_lik[47] -2.715   0.019  0.650  -4.402 -3.021 -2.585  -2.253  -1.863
## log_lik[48] -2.399   0.025  0.444  -3.505 -2.632 -2.289  -2.073  -1.814
## lp__        13.820   3.155 15.885 -17.657  2.841 13.624  24.956  43.698
##             n_eff  Rhat
## intercept     184 1.007
## rho            21 1.127
## sill           71 1.070
## nugget        293 1.018
## upsilon[1]    126 1.037
## upsilon[2]    245 1.002
## upsilon[3]    253 1.005
## upsilon[4]    149 1.026
## upsilon[5]    111 1.043
## upsilon[6]    153 1.026
## upsilon[7]    137 1.037
## upsilon[8]    182 1.013
## upsilon[9]    130 1.035
## upsilon[10]   138 1.026
## upsilon[11]   110 1.041
## upsilon[12]   224 1.007
## upsilon[13]   209 1.011
## upsilon[14]   221 1.010
## upsilon[15]   148 1.032
## upsilon[16]   139 1.027
## upsilon[17]   174 1.020
## upsilon[18]   250 1.001
## upsilon[19]   140 1.037
## upsilon[20]   194 1.020
## upsilon[21]   112 1.038
## upsilon[22]   115 1.042
## upsilon[23]   136 1.038
## upsilon[24]   149 1.031
## upsilon[25]   135 1.035
## upsilon[26]   192 1.008
## upsilon[27]   180 1.027
## upsilon[28]   176 1.015
## upsilon[29]   149 1.036
## upsilon[30]   146 1.032
## upsilon[31]   186 1.008
## upsilon[32]   134 1.039
## upsilon[33]   139 1.037
## upsilon[34]   146 1.032
## upsilon[35]   140 1.039
## upsilon[36]   123 1.032
## upsilon[37]   142 1.033
## upsilon[38]   116 1.039
## upsilon[39]   108 1.038
## upsilon[40]   140 1.026
## upsilon[41]   141 1.035
## upsilon[42]   114 1.037
## upsilon[43]   139 1.034
## upsilon[44]   182 1.016
## upsilon[45]   143 1.033
## upsilon[46]   193 1.007
## upsilon[47]   135 1.035
## upsilon[48]   115 1.035
## log_lik[1]   1159 1.004
## log_lik[2]   1125 1.001
## log_lik[3]   1220 1.001
## log_lik[4]   1158 1.003
## log_lik[5]   1193 1.004
## log_lik[6]   1106 0.998
## log_lik[7]    992 0.998
## log_lik[8]   1062 0.999
## log_lik[9]    952 1.004
## log_lik[10]   304 1.012
## log_lik[11]  1224 0.999
## log_lik[12]   417 1.007
## log_lik[13]   966 1.001
## log_lik[14]  1048 1.002
## log_lik[15]   718 1.004
## log_lik[16]   447 1.012
## log_lik[17]  1077 0.999
## log_lik[18]  1186 1.000
## log_lik[19]   601 1.010
## log_lik[20]   212 1.020
## log_lik[21]   387 1.012
## log_lik[22]  1114 1.004
## log_lik[23]   913 1.001
## log_lik[24]  1110 1.005
## log_lik[25]  1100 1.002
## log_lik[26]   963 0.998
## log_lik[27]   572 1.004
## log_lik[28]   631 1.002
## log_lik[29]  1073 1.001
## log_lik[30]   689 1.007
## log_lik[31]  1099 1.003
## log_lik[32]   839 1.002
## log_lik[33]   472 1.006
## log_lik[34]   823 1.004
## log_lik[35]   929 0.999
## log_lik[36]   153 1.023
## log_lik[37]   884 1.008
## log_lik[38]  1123 1.002
## log_lik[39]  1177 1.001
## log_lik[40]   346 1.011
## log_lik[41]   757 1.005
## log_lik[42]   351 1.010
## log_lik[43]  1117 0.998
## log_lik[44]   816 1.002
## log_lik[45]  1318 1.001
## log_lik[46]   641 1.000
## log_lik[47]  1195 1.001
## log_lik[48]   312 1.012
## lp__           25 1.122
## 
## Samples were drawn using NUTS(diag_e) at Thu Apr 18 23:47:12 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

And now, a bit of algebra!

We want to predict flowering dates over the whole of Japan.

japan <- read.table(file = paste(OutDir, "Loc2Predict.txt", sep = "/"), header = TRUE,
                    colClasses = "numeric"
                    )

head(japan)
##          X       Y
## 1  -995000 -685000
## 2  -995000 -675000
## 3  -985000 -665000
## 4  -975000 -665000
## 5 -1005000 -655000
## 6  -995000 -655000
ggplot(data = japan,
       aes(x = X, y = Y)
       ) +
  geom_point(size = 2, alpha = 0.3, color = "red") +
  xlab("easting") + ylab ("northing") +
  theme_bw()

To do this, we perform ordinary kriging, i.e. a prediction that only requires the distances between the locations where we want predictions and the locations of the observed data. We start by computing the distance matrix between these points:

D_xnew_xobs <- as.matrix(fields::rdist(japan[, c("X", "Y")], sakura[, c("X", "Y")])/10000)
dim(D_xnew_xobs)
## [1] 4630   48

We then compute the covariances \(\hat{C}(s_{\mathrm{new}}, s_{\mathrm{obs}})\) from the posterior draws:

Cov_xnew_xobs <- purrr::map(.x = rstan::extract(fit_1, "rho")$rho,
                            .f = function(range) {
                              matern_cov(d = D_xnew_xobs, sill = 1, range = range)
                              }
                            )

We also need the inverse of the covariance matrix of the data, \(\hat{C}^{-1}(s_{\mathrm{obs}}, s_{\mathrm{obs}})\):

inv_cov_xobs <- purrr::map(.x = rstan::extract(fit_1, "rho")$rho,
                    .f = function(range) {
                      solve(matern_cov(d = D, sill = 1, range = range))
                      }
                    )

Finally, we need the estimated values \(\hat{\upsilon}(s_{\mathrm{obs}})\) of the latent variable at the data locations, along with the posterior draws of the other parameters:

upsilon <- rstan::extract(fit_1, "upsilon")$upsilon
mu <- rstan::extract(fit_1, "intercept")$intercept
sill <- rstan::extract(fit_1, "sill")$sill
nugget <- rstan::extract(fit_1, "nugget")$nugget

The kriging prediction at the \(n\) new locations \(s_{\mathrm{new}}\) is:

\[\tilde{y}(s_{\mathrm{new}}) \sim \mathcal{N} \left(\mathbf{1}_n\hat{\mu} + \hat{\sigma}_{\mathrm{sill}}\, \hat{C}(s_{\mathrm{new}},~s_{\mathrm{obs}})\, \hat{C}^{-1}(s_{\mathrm{obs}},~s_{\mathrm{obs}})\, \hat{\upsilon},~ \hat{\sigma}_{\mathrm{nugget}}^2\, \mathcal{I}_n \right)\]

### one kriging prediction (with nugget noise) per posterior draw
pred <- lapply(1:1000, function(iter) {
  drop(rep(mu[iter], nrow(japan)) + sill[iter] * Cov_xnew_xobs[[iter]] %*% inv_cov_xobs[[iter]] %*% upsilon[iter, ] + nugget[iter] * rnorm(nrow(japan)))
})
pred <- do.call('rbind', pred)

japan$pred <- round(apply(pred, 2, mean), 0)
japan$lower <- floor(apply(pred, 2, lower))
japan$upper <- ceiling(apply(pred, 2, upper))

head(japan)
##          X       Y pred lower upper
## 1  -995000 -685000   89    82    96
## 2  -995000 -675000   89    82    96
## 3  -985000 -665000   89    82    95
## 4  -975000 -665000   89    82    96
## 5 -1005000 -655000   89    83    97
## 6  -995000 -655000   88    81    95
ggplot(data = japan,
       aes(x = X, y = Y, fill = pred)
       ) +
  geom_tile() +
  xlab("easting") + ylab ("northing") +
  scale_fill_gradientn(colours = viridisLite::viridis(256)) +
  theme_bw()

Universal kriging

Universal kriging is jargon for also including covariates in the systematic part of the model. We have covariates here that we could perfectly well account for in the analysis.

input <- c("Temp_nov", "Temp_dec", "Temp_janv", "Temp_feb", "Temp_mar_09", 
           "Sunshine_nov", "Sunshine_dec", "Sunshine_janv", "Sunshine_feb", 
           "Lat")
print(input)
##  [1] "Temp_nov"      "Temp_dec"      "Temp_janv"     "Temp_feb"     
##  [5] "Temp_mar_09"   "Sunshine_nov"  "Sunshine_dec"  "Sunshine_janv"
##  [9] "Sunshine_feb"  "Lat"

But we have a very small dataset, \(N = 48\), and quite a few candidate covariates, \(p = 10\)!

Bear in mind that the base model already includes three parameters (questions: which ones? Is the number of parameters estimated by \(\texttt{loo::waic()}\) close to this figure?).
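
One way to check this, a minimal sketch assuming the loo package is installed (p_waic in the output is the estimated effective number of parameters):

loo::waic(loo::extract_log_lik(fit_1, parameter_name = "log_lik"))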

That gives a data-to-parameter ratio of \(\frac{48}{13} \approx 3.7\), i.e. fewer than four observations to inform each parameter on average… which really is not much. It can still be attempted, because the prior also carries information; however, since the dataset is small, the influence of the prior will be substantial.

Independent priors

We assume that the \(\beta_{j\in[1:10]}\) are independent, and we give each one its own prior:

\(\forall j, \beta_j \sim \mathcal{N}(0.0, 5.0)\)

stan_sglm2 <- '
 data {
  int<lower = 1> n_obs;
  int<lower = 1> n_cov;
  vector[n_obs] FLOWERING;
  matrix[n_obs, n_obs] D; // distance matrix (km)
  matrix[n_obs, n_cov] X;
  real prior_location_intercept;
  real<lower = 0.0> prior_scale_intercept;
  vector<lower = 0.0>[n_cov] prior_scale_beta;
 }
 
 parameters {
  real unscaled_intercept;
  simplex[2] pi;
  real<lower = 0.0> sigma_tot;
  real<lower = 0.0> rho;
  vector[n_obs] upsilon;
  vector[n_cov] unscaled_beta;
 }

 transformed parameters {
  real intercept;
  cov_matrix[n_obs] Omega;
  vector[n_obs] linear_predictor;
  vector[n_cov] beta;
  real sill;
  real nugget;
  intercept = prior_location_intercept + unscaled_intercept * prior_scale_intercept;
  nugget = sigma_tot * sqrt(pi[1]);
  beta = unscaled_beta .* prior_scale_beta;
  sill = sigma_tot * sqrt(pi[2]);
  for(i in 1:n_obs) {
   for(j in 1:n_obs) {
    Omega[i, j] = (1 + D[i, j] * sqrt(3) / rho) * exp(-D[i, j] * sqrt(3) / rho);
   }
  }
  linear_predictor = rep_vector(intercept, n_obs) + X * beta + sill * upsilon;
 }

 model {
  unscaled_beta ~ normal(0.0, 1.0);
  rho ~ gamma(2.1535, 0.0292); // prior range between 10 and 200
  sigma_tot ~ student_t(3, 0.0, 1.0);
  unscaled_intercept ~ normal(0.0, 1.0);
  upsilon ~ multi_normal(rep_vector(0.0, n_obs), Omega);
  FLOWERING ~ normal(linear_predictor, nugget);
 }

 generated quantities {
  vector[n_obs] log_lik;
  for(i in 1:n_obs) {
   log_lik[i] = normal_lpdf(FLOWERING[i]| linear_predictor[i], nugget);
  }
 }
'
## Compilation
sglm2 <- stan_model(model_code = stan_sglm2,
                    model_name = "Mon second GLM spatial avec Stan"
                    )

Fit this model to the data:

fit_2 <- sampling(object = sglm2,
                  data = list(n_obs = nrow(sakura),
                              n_cov = length(input),
                              FLOWERING = sakura$Flowering,
                              D = round(D, 6),
                              X = as.matrix(sapply(input, function(j) { scale(sakura[, j]) })),
                              prior_location_intercept = 100,
                              prior_scale_intercept = 10,
                              prior_scale_beta = rep(5.0, length(input))
                              ), 
                  pars = c("intercept", "beta", "rho", "sill", "nugget", "upsilon", "log_lik"),
                  chains = n_chains, 
                  iter = n_iter, 
                  warmup = n_warm, 
                  thin = n_thin
                  )
## Warning: There were 53 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: There were 18 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
get_elapsed_time(fit_2)
##          warmup  sample
## chain:1 61.4037 139.703
## chain:2 54.5812 117.366
## chain:3 50.0852 136.801
## chain:4 51.6148 268.579

Check convergence of the parameters:

stan_rhat(fit_2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We store the posterior means:

independent <- apply(rstan::extract(fit_2, "beta")$beta, 2, mean)
independent
##  [1]  0.4566274  4.6443686 -1.9296857  0.4704134 -5.4100900  2.1873820
##  [7]  1.3360764 -2.8451736 -1.2884024  6.3601248

Questions:

  • How do you judge the quality of your estimates \(\hat{\beta_j}\) here? (one way to inspect them is sketched after this list)
  • Which covariates are the most important in your view?
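
One way to inspect them, a minimal sketch reusing the lower()/upper() helpers defined at the top (posterior means with 80% HPD intervals for each \(\hat{\beta}_j\)):

beta_draws <- rstan::extract(fit_2, "beta")$beta
data.frame(covariate = input,
           mean = apply(beta_draws, 2, mean),
           lower = apply(beta_draws, 2, lower),
           upper = apply(beta_draws, 2, upper)
           )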

Exchangeable priors (ridge regression)

We can also decide to do something a little odd here: we assume the \(\beta_j\) are not independent but exchangeable. In other words, we are in a situation where we have little knowledge of, or intuition about, which variable is a priori more important than the others, and we assume that their effects come from a common distribution:

\(\forall j, \beta_j|\sigma_{\beta} \sim \mathcal{N}(0.0, \sigma_{\beta})\)

stan_sglm3 <- '
 data {
  int<lower = 1> n_obs;
  int<lower = 1> n_cov;
  vector[n_obs] FLOWERING;
  matrix[n_obs, n_obs] D; // distance matrix (km)
  matrix[n_obs, n_cov] X;
  real prior_location_intercept;
  real<lower = 0.0> prior_scale_intercept;
 }
 
 parameters {
  real unscaled_intercept;
  simplex[2] pi;
  real<lower = 0.0> sigma_tot;
  real<lower = 0.0> rho;
  vector[n_obs] upsilon;
  vector[n_cov] unscaled_beta;
  real<lower = 0.0> sigma_beta;
 }

 transformed parameters {
  real intercept;
  cov_matrix[n_obs] Omega;
  vector[n_obs] linear_predictor;
  vector[n_cov] beta;
  real sill;
  real nugget;
  intercept = prior_location_intercept + unscaled_intercept * prior_scale_intercept;
  nugget = sigma_tot * sqrt(pi[1]);
  beta = unscaled_beta * sigma_beta;
  sill = sigma_tot * sqrt(pi[2]);
  for(i in 1:n_obs) {
   for(j in 1:n_obs) {
    Omega[i, j] = (1 + D[i, j] * sqrt(3) / rho) * exp(-D[i, j] * sqrt(3) / rho);
   }
  }
  linear_predictor = rep_vector(intercept, n_obs) + X * beta + sill * upsilon;
 }

 model {
  sigma_beta ~ student_t(3.0, 0.0, 1.0);
  unscaled_beta ~ normal(0.0, 1.0);
  rho ~ gamma(2.1535, 0.0292); // prior range between 10 and 200
  sigma_tot ~ student_t(3, 0.0, 1.0);
  unscaled_intercept ~ normal(0.0, 1.0);
  upsilon ~ multi_normal(rep_vector(0.0, n_obs), Omega);
  FLOWERING ~ normal(linear_predictor, nugget);
 }

 generated quantities {
  vector[n_obs] log_lik;
  for(i in 1:n_obs) {
   log_lik[i] = normal_lpdf(FLOWERING[i]| linear_predictor[i], nugget);
  }
 }
'
## Compilation
sglm3 <- stan_model(model_code = stan_sglm3,
                    model_name = "Mon troisieme GLM spatial avec Stan"
                    )

Fit this model to the data:

fit_3 <- sampling(object = sglm3,
                  data = list(n_obs = nrow(sakura),
                              n_cov = length(input),
                              FLOWERING = sakura$Flowering,
                              D = round(D, 6),
                              X = as.matrix(sapply(input, function(j) { scale(sakura[, j]) })),
                              prior_location_intercept = 100,
                              prior_scale_intercept = 10
                              ), 
                  pars = c("intercept", "beta", "rho", "sill", "nugget", "sigma_beta", "upsilon", "log_lik"),
                  chains = n_chains, 
                  iter = n_iter, 
                  warmup = n_warm, 
                  thin = n_thin
                  )
## Warning: There were 109 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: There were 44 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
get_elapsed_time(fit_3)
##          warmup  sample
## chain:1 56.4686 255.141
## chain:2 69.0486 229.472
## chain:3 46.2034 107.644
## chain:4 56.6731 131.153

Check convergence of the parameters:

stan_rhat(fit_3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Extract the posterior mean estimates:

ridge <- apply(rstan::extract(fit_3, "beta")$beta, 2, mean)

Compare these estimates with the previous ones (a quick side-by-side is sketched below): what do you notice?
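
A quick side-by-side of the two sets of posterior means (a minimal sketch; both vectors were stored above):

data.frame(covariate = input,
           independent = round(independent, 2),
           ridge = round(ridge, 2)
           )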

Variable selection with the horseshoe prior

On the third day, the (parameter) space shrinks!

The problem with ridge regression is that it does not perform variable selection: none of the \(\hat{\beta}_j\) will be estimated to be exactly \(0\). One way to enable this selection is, once again, through the prior; here we will use a particular shrinkage prior, the horseshoe!

\[\forall j, \beta_j|\sigma_{\mathrm{global}}, \sigma_{\mathrm{local}_j} \sim \mathcal{N} (0.0, \sigma_{\mathrm{global}} \times \sigma_{\mathrm{local}_j})\]

In this formula, the global parameter \(\sigma_{\mathrm{global}}\) exerts a force that a priori pulls the value of \(\hat{\beta}_j\) towards \(0\). The \(j^{\mathrm{th}}\) local parameter \(\sigma_{\mathrm{local}_j}\) makes it possible to escape this gravitational pull towards \(0\), and hence to estimate a non-zero value of \(\hat{\beta}_j\) if there is a strong signal in the data.

It remains to choose priors for these parameters \(\sigma_{\mathrm{global}}\) and \(\sigma_{\mathrm{local}_j}\).

\[\forall j, \sigma_{\mathrm{local}_j} \sim \mathcal{C}^{+}(0.0, 1.0)\]

\[\sigma_{\mathrm{global}} \sim \mathcal{C}^{+}(0.0, 1.0)\]

\(\mathcal{C}^{+}(0.0, 1.0)\) is the standard half-Cauchy distribution. What does it look like?

hist(abs(sn::rst(1e6, xi = 0.0, omega = 1.0, nu = 1.0)))

What is going on here?

The Cauchy distribution is very interesting because it is completely atypical: for instance, the central limit theorem does not apply to a random variable that follows a Cauchy distribution. As an example, suppose I want to estimate the empirical mean of a random variable; to do so I take \(1\), then \(2\), then \(3\), up to \(1000\) measurements, and compute the mean each time.

### running empirical mean of n draws from a Student-t with ddl degrees of freedom
### (ddl = Inf gives the normal, ddl = 1 the Cauchy)
clt <- function(n = 1000, ddl = Inf) {
  samp <- sn::rst(n, xi = 0.0, omega = 1.0, nu = ddl)
  df <- data.frame(x = 1:n,
                   y = sapply(1:n, function(i) { mean(samp[1:i]) })
                   )
  g <- ggplot(data = df,
              aes(x = x, y = y)
              ) +
    geom_line() +
    xlab("Sample Size") + ylab("Empirical mean") +
    theme_bw()
  return(g)
} 

### normal
clt()

### cauchy
clt(ddl = 1)

The empirical mean of a Cauchy variable does not converge to the value of the location parameter…

Why is this useful?

The Cauchy distribution rules out no value a priori: its mean is undefined and its variance is infinite…

… and for the horseshoe prior this is particularly interesting, because it leads to a prior on the shrinkage factor \(\kappa_j\) that is a \(\mathcal{Be}(0.5, 0.5)\) distribution.

data.frame(x = seq(0, 1, length.out = 1e3), 
           y = dbeta(seq(0, 1, length.out = 1e3), 0.5, 0.5)
           ) %>% 
  ggplot(aes(x = x, y = y)) +
  geom_line() +
  xlab(bquote(kappa)) + ylab("pdf") +
  theme_bw()
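
A quick simulation check of this result (a sketch under the simplifying assumptions \(\sigma_{\mathrm{global}} = 1\) and \(\kappa_j = 1/(1 + \sigma_{\mathrm{local}_j}^2)\)): the histogram below should match the \(\mathcal{Be}(0.5, 0.5)\) density plotted above.

local_scale <- abs(rcauchy(1e5))  # half-Cauchy local scales
kappa <- 1 / (1 + local_scale^2)  # implied shrinkage factors
hist(kappa, breaks = 50, freq = FALSE, main = "", xlab = expression(kappa))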

The shrinkage factor \(\kappa_j\) enters the formula below:

\[\beta_j^{\mathrm{Horseshoe}} = (1 - \kappa_j) \times \beta_j^{\mathrm{Independent~Normal}}\]

So, a priori, the horseshoe prior favours either shrinking a coefficient all the way to \(0\) or not shrinking it at all.

stan_sglm4 <- '
 data {
  int<lower = 1> n_obs;
  int<lower = 1> n_cov;
  vector[n_obs] FLOWERING;
  matrix[n_obs, n_obs] D; // distance matrix (km)
  matrix[n_obs, n_cov] X;
  real prior_location_intercept;
  real<lower = 0.0> prior_scale_intercept;
 }
 
 parameters {
  real unscaled_intercept;
  simplex[2] pi;
  real<lower = 0.0> sigma_tot;
  real<lower = 0.0> rho;
  vector[n_obs] upsilon;
  vector<lower = 0.0>[n_cov] local;
  vector[n_cov] unscaled_beta;
  real<lower = 0.0> global;
 }

 transformed parameters {
  real intercept;
  cov_matrix[n_obs] Omega;
  vector[n_obs] linear_predictor;
  vector[n_cov] beta;
  vector[n_cov] shrinkage;
  real sill;
  real nugget;
  intercept = prior_location_intercept + unscaled_intercept * prior_scale_intercept;
  nugget = sigma_tot * sqrt(pi[1]);
  beta = unscaled_beta .* local * global * nugget;
  sill = sigma_tot * sqrt(pi[2]);
  for(i in 1:n_obs) {
   for(j in 1:n_obs) {
    Omega[i, j] = (1 + D[i, j] * sqrt(3) / rho) * exp(-D[i, j] * sqrt(3) / rho);
   }
  }
  linear_predictor = rep_vector(intercept, n_obs) + X * beta + sill * upsilon;
  for(j in 1:n_cov) {
   shrinkage[j] = 1 / (1 + n_obs * square(global * local[j]));
  }
 }

 model {
  global ~ cauchy(0.0, 1.0);
  local ~ cauchy(0.0, 1.0);
  unscaled_beta ~ normal(0.0, 1.0);
  rho ~ gamma(2.1535, 0.0292); // prior range between 10 and 200
  sigma_tot ~ student_t(3, 0.0, 1.0);
  unscaled_intercept ~ normal(0.0, 1.0);
  upsilon ~ multi_normal(rep_vector(0.0, n_obs), Omega);
  FLOWERING ~ normal(linear_predictor, nugget);
 }

 generated quantities {
  vector[n_obs] log_lik;
  for(i in 1:n_obs) {
   log_lik[i] = normal_lpdf(FLOWERING[i]| linear_predictor[i], nugget);
  }
 }
'
## Compilation
sglm4 <- stan_model(model_code = stan_sglm4,
                    model_name = "Mon quatrieme GLM spatial avec Stan"
                    )

Fit this model to the data:

fit_4 <- sampling(object = sglm4,
                  data = list(n_obs = nrow(sakura),
                              n_cov = length(input),
                              FLOWERING = sakura$Flowering,
                              D = round(D, 6),
                              X = as.matrix(sapply(input, function(j) { scale(sakura[, j]) })),
                              prior_location_intercept = 100,
                              prior_scale_intercept = 10
                              ), 
                  pars = c("intercept", "beta", "rho", "sill", "nugget", "upsilon", "shrinkage", "local", "global", "log_lik"),
                  chains = n_chains, 
                  iter = n_iter, 
                  warmup = n_warm, 
                  thin = n_thin
                  )
## Warning: There were 77 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: There were 23 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
get_elapsed_time(fit_4)
##          warmup  sample
## chain:1 70.1823 144.596
## chain:2 69.4118 267.600
## chain:3 60.8959 167.051
## chain:4 76.4072 275.466

Check convergence of the parameters:

stan_rhat(fit_4)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Look at the shrinkage factors \(\hat{\kappa}_j\):

apply(rstan::extract(fit_4, "shrinkage")$shrinkage, 2, mean)
##  [1] 0.3268729 0.3108892 0.3210409 0.3269257 0.1612299 0.3267736 0.3589905
##  [8] 0.2801298 0.3339094 0.1875859

… and the estimates:

horseshoe <- apply(rstan::extract(fit_4, "beta")$beta, 2, mean)

Compare these new estimates with the previous ones.

With a slightly more complex prior: the ponyshoe (regularized horseshoe)

This prior, proposed by Piironen and Vehtari (2017), gives better control over the number of covariates we expect a priori to include in the model. Here we have \(48\) observations, \(10\) candidate covariates, and we would like to keep only \(1\) of them to improve our predictions. The residual standard deviation is around \(2.8\) (see our first model).

prior_m_eff <- function(sigma = 1, n, p, p0, n_sim = 1e5) {
  # prior for global scale
  tau0 = p0/(p - p0) * sigma/sqrt(n)
  tau = abs(sn::rst(n_sim, xi = 0, omega = tau0, nu = 1))
  # local scales
  lambda = replicate(p, abs(sn::rst(n_sim, xi = 0, omega = 1, nu = 1)))
  # shrinkage factors
  One_minus_kappa = apply(lambda, 2, function(x) { 1 - 1 / (1 + n * (tau * x / sigma)^2) } )
  # m_eff
  m_eff <- apply(One_minus_kappa, 1, sum)
  return(as.data.frame(m_eff))
}

prior_m_eff(sigma = 2.8, n = 48, p = 10, p0 = 1) %>% 
  ggplot(aes(x = m_eff)) + 
  geom_histogram() +
  scale_x_continuous(name = "Nb of covariates",
                     breaks = seq(0, 10, 1)) +
  theme_bw() +
  labs(subtitle = "A priori nb of active inputs")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We also need a value for the prior scale of the parameter \(\sigma_{\mathrm{global}}\) (Piironen & Vehtari 2017). This value depends on the size of our dataset, the number of covariates (\(p\)), and the a priori model complexity (\(p_0\)):

prior_global <- function(n = 48, p = 10, p0 = 1) {
  return(p0/((p - p0) * sqrt(n)))
}
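### illustrative check: prior scale for sigma_global with the defaults n = 48, p = 10, p0 = 1
prior_global()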
stan_sglm5 <- '
 data {
  int<lower = 1> n_obs;
  int<lower = 1> n_cov;
  vector[n_obs] FLOWERING;
  matrix[n_obs, n_obs] D; // distance matrix (km)
  matrix[n_obs, n_cov] X;
  real prior_location_intercept;
  real<lower = 0.0> prior_scale_intercept;
  real<lower = 0> prior_scale_global;
  real<lower = 0> slab_scale; // regularization scale for ponyshoe
 }
 
 parameters {
  real unscaled_intercept;
  simplex[2] pi;
  real<lower = 0.0> sigma_tot;
  real<lower = 0.0> rho;
  vector[n_obs] upsilon;
  vector<lower = 0.0>[n_cov] local;
  vector[n_cov] unscaled_beta;
  real<lower = 0.0> global;
 }

 transformed parameters {
  real intercept;
  cov_matrix[n_obs] Omega;
  vector[n_obs] linear_predictor;
  vector[n_cov] regularized_local;
  vector[n_cov] beta;
  vector[n_cov] shrinkage;
  real sill;
  real nugget;
  real b;
  intercept = prior_location_intercept + unscaled_intercept * prior_scale_intercept;
  nugget = sigma_tot * sqrt(pi[1]);
  regularized_local = slab_scale * local ./ sqrt(rep_vector(square(slab_scale), n_cov) + square(local) * square(prior_scale_global * global) * square(nugget));
  beta = unscaled_beta .* regularized_local * prior_scale_global * global * nugget;
  sill = sigma_tot * sqrt(pi[2]);
  for(i in 1:n_obs) {
   for(j in 1:n_obs) {
    Omega[i, j] = (1 + D[i, j] * sqrt(3) / rho) * exp(-D[i, j] * sqrt(3) / rho);
   }
  }
  linear_predictor = rep_vector(intercept, n_obs) + X * beta + sill * upsilon;
  b = 1 / (1 + n_obs * square(slab_scale / nugget));
  for(j in 1:n_cov) {
   shrinkage[j] = b + (1 - b) / (1 + n_obs * square(prior_scale_global * global * local[j]));
  }
 }

 model {
  global ~ cauchy(0.0, 1.0);
  local ~ cauchy(0.0, 1.0);
  unscaled_beta ~ normal(0.0, 1.0);
  rho ~ gamma(2.1535, 0.0292); // prior range between 10 and 200
  sigma_tot ~ student_t(3, 0.0, 1.0);
  unscaled_intercept ~ normal(0.0, 1.0);
  upsilon ~ multi_normal(rep_vector(0.0, n_obs), Omega);
  FLOWERING ~ normal(linear_predictor, nugget);
 }

 generated quantities {
  vector[n_obs] log_lik;
  for(i in 1:n_obs) {
   log_lik[i] = normal_lpdf(FLOWERING[i]| linear_predictor[i], nugget);
  }
 }
'
## Compilation
sglm5 <- stan_model(model_code = stan_sglm5,
                    model_name = "Mon cinquieme GLM spatial avec Stan"
                    )

Finally, fit this model to the data:

fit_5 <- sampling(object = sglm5,
                  data = list(n_obs = nrow(sakura),
                              n_cov = length(input),
                              FLOWERING = sakura$Flowering,
                              D = round(D, 6),
                              X = as.matrix(sapply(input, function(j) { scale(sakura[, j]) })),
                              prior_location_intercept = 100,
                              prior_scale_intercept = 10,
                              slab_scale = 1,
                              prior_scale_global = prior_global()
                              ), 
                  pars = c("intercept", "beta", "rho", "sill", "nugget", "upsilon", "shrinkage", "local", "regularized_local", "global", "log_lik"),
                  chains = n_chains, 
                  iter = n_iter, 
                  warmup = n_warm, 
                  thin = n_thin
                  )
## Warning: There were 190 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: There were 113 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
get_elapsed_time(fit_5)
##          warmup   sample
## chain:1 81.0951 393.6270
## chain:2 45.6950 115.4500
## chain:3 51.2990 148.6790
## chain:4 59.9738  49.5472

Check convergence of the parameters:

stan_rhat(fit_5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The shrinkage factors \(\hat{\kappa}_j\) (compare them with the previous estimates; a quick side-by-side with the horseshoe is sketched below):

apply(rstan::extract(fit_5, "shrinkage")$shrinkage, 2, mean)
##  [1] 0.6490488 0.6410082 0.6385399 0.6373223 0.5617698 0.6278468 0.6460894
##  [8] 0.5753431 0.5695688 0.6454387
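
One quick way to put them side by side with the plain horseshoe (a minimal sketch, assuming fit_4 is still in memory):

rbind(horseshoe = apply(rstan::extract(fit_4, "shrinkage")$shrinkage, 2, mean),
      ponyshoe = apply(rstan::extract(fit_5, "shrinkage")$shrinkage, 2, mean)
      )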

and, finally, the posterior mean estimates:

pony <- apply(rstan::extract(fit_5, "beta")$beta, 2, mean)

Compare the different estimates. Which covariate turns out to be the most important for predicting the sakura flowering date?

Exercise: plot the estimates of the \(\beta_j\) obtained under the different priors.
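
A possible starting point (a minimal sketch using the four vectors of posterior means stored above):

estimates <- data.frame(covariate = rep(input, 4),
                        prior = rep(c("independent", "ridge", "horseshoe", "ponyshoe"), each = length(input)),
                        beta = c(independent, ridge, horseshoe, pony)
                        )
ggplot(data = estimates,
       aes(x = covariate, y = beta, colour = prior)
       ) +
  geom_point(position = position_dodge(width = 0.5)) +
  coord_flip() +
  xlab("") + ylab(bquote(hat(beta))) +
  theme_bw()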