We explore the tidymodels framework, which lets us work with different models in a unified workflow. tidymodels is a meta-package like the tidyverse. Here we will focus only on rsample, parsnip and yardstick:

library(tidymodels)
library(tidyverse)

The example dataset is an extract from Kaggle: https://www.kaggle.com/uciml/forest-cover-type-dataset

forest_data <- read_csv("data/covtype_small.csv")

The dataset has 24757 observations and 55 variables. We want to predict Cover_Type from all the other variables.

We keep only part of the data so that we have a two-class prediction problem. This leaves some variables constant over all observations, so we remove them.

## remove columns with only 0s
forest_data <- 
  forest_data %>% 
  select_if(.predicate = ~n_distinct(.) > 1) 

## set type as factor
forest_data <- 
  forest_data %>% 
  mutate(Cover_Type = as.factor(Cover_Type))

The dataset now has 24757 observations and 50 variables.

First split

We split the dataset into training and test sets with rsample::initial_split, using a proportion of 3/4 and stratifying on Cover_Type.
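The splitting code is not shown in this extract; a minimal sketch, assuming the object names forest_split, forest_train and forest_test, could be:

## 3/4 training split, stratified on the outcome
forest_split <- initial_split(forest_data, prop = 3/4, strata = Cover_Type)
forest_train <- training(forest_split)
forest_test  <- testing(forest_split)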

Logistic regression

Model estimate

First we fit a logistic regression on the whole training dataset. The parsnip framework consists in first defining the type of model (here logistic_reg), then the engine (the underlying package which actually estimates the model) with set_engine, and finally estimating the model on the data with fit.
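A minimal sketch of this fit, assuming the object name log_reg_fit and the glm engine (the glm.fit warnings below come from this estimation):

## logistic regression via stats::glm on the training set
log_reg_fit <- 
  logistic_reg() %>% 
  set_engine("glm") %>% 
  fit(Cover_Type ~ ., data = forest_train)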

## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Model performance

yardstick is a package to compute metrics on model performance. By default, on a classification problem, yardstick::metrics returns accuracy and the kappa statistic.
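For instance, reusing the fitted model and test set sketched above (the object name log_reg_pred is ours), the class predictions and default metrics could be obtained with:

## class predictions on the test set
log_reg_pred <- 
  forest_test %>% 
  select(Cover_Type) %>% 
  bind_cols(predict(log_reg_fit, new_data = forest_test))

## default classification metrics
log_reg_pred %>% 
  metrics(truth = Cover_Type, estimate = .pred_class)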

.metric    .estimator   .estimate
accuracy   binary       0.7771853
kap        binary       0.5401555

We also have access to other metrics through specific functions such as yardstick::spec for specificity, yardstick::precision, yardstick::recall… These metrics can be combined and estimated together with yardstick::metric_set.
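A sketch of such a combined metric set, reusing the prediction tibble assumed above:

## combine several class metrics into one function
multi_metrics <- metric_set(accuracy, bal_accuracy, sens, spec, precision, recall, ppv, npv)

log_reg_pred %>% 
  multi_metrics(truth = Cover_Type, estimate = .pred_class)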

.metric        .estimator   .estimate
accuracy       binary       0.7771853
bal_accuracy   binary       0.7678956
sens           binary       0.7066667
spec           binary       0.8291246
precision      binary       0.7528409
recall         binary       0.7066667
ppv            binary       0.7528409
npv            binary       0.7932886

yardstick::roc_auc computes the area under the ROC curve, and yardstick::roc_curve returns the data needed to plot the ROC curve.
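A sketch of both, assuming the probability column is named .pred_1 (the actual name depends on the factor levels of Cover_Type) and the object name log_reg_prob:

## class probabilities on the test set
log_reg_prob <- 
  forest_test %>% 
  select(Cover_Type) %>% 
  bind_cols(predict(log_reg_fit, new_data = forest_test, type = "prob"))

## area under the ROC curve
log_reg_prob %>% 
  roc_auc(truth = Cover_Type, .pred_1)

## ROC curve data and quick plot
log_reg_prob %>% 
  roc_curve(truth = Cover_Type, .pred_1) %>% 
  autoplot()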

.metric   .estimator   .estimate
roc_auc   binary       0.8519712

Cross-validation

In this part we estimate a logistic regression again, but on 10-fold cross-validation samples.

rsample::vfold_cv creates the sampling scheme. Here we create 5 repeats of a 10-fold CV.
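A minimal sketch of this resampling scheme, assuming the object name forest_cv:

## 5 repeats of 10-fold cross-validation on the training set
forest_cv <- vfold_cv(forest_train, v = 10, repeats = 5)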

Next we define some functions to fit the model and to get the class predictions and the class probabilities.
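These helpers are not shown in the extract; a sketch, with hypothetical names fit_model, get_preds and get_probs, could be:

## fit a logistic regression on the analysis set of a split
fit_model <- function(split) {
  logistic_reg() %>% 
    set_engine("glm") %>% 
    fit(Cover_Type ~ ., data = analysis(split))
}

## class predictions on the assessment set
get_preds <- function(split, model) {
  assessment(split) %>% 
    select(Cover_Type) %>% 
    bind_cols(predict(model, new_data = assessment(split)))
}

## class probabilities on the assessment set
get_probs <- function(split, model) {
  assessment(split) %>% 
    select(Cover_Type) %>% 
    bind_cols(predict(model, new_data = assessment(split), type = "prob"))
}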

We apply these functions to the CV data.
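With the hypothetical helpers above, this could be done with purrr, storing the results as list columns (the name forest_cv_fitted is ours):

forest_cv_fitted <- 
  forest_cv %>% 
  mutate(
    model = map(splits, fit_model),
    preds = map2(splits, model, get_preds),
    probs = map2(splits, model, get_probs)
  )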

We can compute the model performance for each fold:
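The original code is not shown; a sketch, assuming the forest_cv_fitted object above (with id for the repeat and id2 for the fold), could be:

## default metrics per repeat and fold
forest_cv_fitted %>% 
  mutate(perf = map(preds, ~ metrics(.x, truth = Cover_Type, estimate = .pred_class))) %>% 
  select(id, id2, perf) %>% 
  unnest(perf)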

## Warning: Unquoting language objects with `!!!` is deprecated as of rlang 0.4.0.
## Please use `!!` instead.
## 
##   # Bad:
##   dplyr::select(data, !!!enquo(x))
## 
##   # Good:
##   dplyr::select(data, !!enquo(x))    # Unquote single quosure
##   dplyr::select(data, !!!enquos(x))  # Splice list of quosures
## 
## This warning is displayed once per session.
Model performance on each fold

ROC curve on each fold:
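A sketch of how these curves could be drawn, assuming the probs, id and id2 columns introduced above and the .pred_1 probability column:

## one ROC curve per fold, facetted by repeat
forest_cv_fitted %>% 
  mutate(roc = map(probs, ~ roc_curve(.x, truth = Cover_Type, .pred_1))) %>% 
  select(id, id2, roc) %>% 
  unnest(roc) %>% 
  ggplot(aes(x = 1 - specificity, y = sensitivity, colour = id2)) +
  geom_path() +
  geom_abline(lty = 3) +
  facet_wrap(~ id)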

ROC curves on each fold