Titanic data set: building a classification machine learning task (predicting survival)
Data Cleaning:
-
library(mlr3verse)
library(mlr3fselect)

# Fix the random seed for reproducibility
set.seed(7832)

# Reduce logging noise from mlr3 and bbotk during resampling/optimization
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")

library(mlr3data)

data("titanic", package = "mlr3data")

# Impute missing ages with the median age
titanic$age[is.na(titanic$age)] <- median(titanic$age, na.rm = TRUE)

# Impute missing embarkation port with the most frequent value ("S")
titanic$embarked[is.na(titanic$embarked)] <- "S"

# Drop identifier-like columns that carry no predictive signal
titanic$ticket <- NULL
titanic$name <- NULL
titanic$cabin <- NULL

# Keep only rows with a known outcome (the labelled portion of the data)
titanic <- titanic[!is.na(titanic$survived), ]
Establish the machine learning task:
# Classification task: predict survival, with "yes" as the positive class
task <- as_task_classif(titanic, target = "survived", positive = "yes")
Choose a model:
-
library(mlr3learners)

# Logistic regression learner
learner <- lrn("classif.log_reg")

# To evaluate the predictive performance, we choose a 3-fold
# cross-validation and the classification error as the measure.
resampling <- rsmp("cv", folds = 3)
measure <- msr("classif.ce")

# Instantiate the folds once so every candidate feature subset is
# evaluated on identical train/test splits
resampling$instantiate(task)
-
The above can be considered a global setup shared by the different approaches in the subsequent feature selection process: the task, learner, resampling, and measure stay the same, while the terminator — the criterion deciding when to stop — differs between algorithms.
In summary, FSelectInstanceSingleCrit$new defines the feature selection problem by bundling the machine learning task, resampling strategy, evaluation measure, and termination criterion. To view the available feature selection methods:
# Print the dictionary of all registered feature selection algorithms
mlr_fselectors
The following is an example of Sequential Forward Selection to demonstrate this process.
-
library(mlr3verse)
library(mlr3fselect)

# Fix the random seed for reproducibility
set.seed(7832)

# Reduce logging noise during optimization
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")

library(mlr3data)

data("titanic", package = "mlr3data")

# Data cleaning: impute age (median) and embarked (mode "S"),
# drop identifier-like columns, keep only labelled rows
titanic$age[is.na(titanic$age)] <- median(titanic$age, na.rm = TRUE)
titanic$embarked[is.na(titanic$embarked)] <- "S"
titanic$ticket <- NULL
titanic$name <- NULL
titanic$cabin <- NULL
titanic <- titanic[!is.na(titanic$survived), ]

# Classification task: predict survival, "yes" is the positive class
task <- as_task_classif(titanic, target = "survived", positive = "yes")

library(mlr3learners)

# Logistic regression learner
learner <- lrn("classif.log_reg")

# 3-fold cross-validation with classification error as the measure
resampling <- rsmp("cv", folds = 3)
measure <- msr("classif.ce")

# Instantiate once so every candidate feature subset is evaluated on
# the same folds
resampling$instantiate(task)

# Stop when performance has not improved for 5 iterations
terminator <- trm("stagnation", iters = 5)

# Bundle task, learner, resampling, measure and terminator into a
# single-criterion feature selection instance
instance <- FSelectInstanceSingleCrit$new(
  task = task,
  learner = learner,
  resampling = resampling,
  measure = measure,
  terminator = terminator
)

# Sequential forward selection: start empty, greedily add the feature
# that improves the measure most
fselector <- fs("sequential")
fselector$optimize(instance)

# Inspect the best feature set found at each step of the search
fselector$optimization_path(instance)
Nested resampling for feature selection
The graphic above illustrates nested resampling for feature selection with 3-fold cross-validation in the outer loop and 4-fold cross-validation in the inner loop.
The repeated evaluation of the model might leak information about the test sets into the model and thus lead to over-fitting and over-optimistic performance results. Nested resampling uses an outer and an inner resampling to separate the feature selection from the performance estimation of the model.
The above is the principle of step-by-step demonstration, in practice, the code can be simplified:
-
# Nested resampling on Palmer Penguins data set:
# random search over feature subsets in the inner loop, unbiased
# performance estimation in the outer loop
rr <- fselect_nested(
  fselector = fs("random_search"),
  task = tsk("penguins"),
  learner = lrn("classif.rpart"),
  inner_resampling = rsmp("holdout"),
  outer_resampling = rsmp("cv", folds = 2),
  measure = msr("classif.ce"),
  term_evals = 4
)

# Performance scores estimated on the outer resampling
rr$score()

# Unbiased performance of the final model trained on the full data set
rr$aggregate()