Permutation Test for Prediction Performance Differences
Source: R/perm_prediction_difference.R
This function compares model prediction performance between two datasets (e.g., different ancestries) using a reference training set. It trains a classification model using tidymodels, evaluates it on test and inference datasets, and assesses the statistical significance of the performance difference via permutation testing.
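The permutation step described above can be sketched as follows. This is a minimal illustration of the general scheme, not the package's actual implementation; `fit` and the `score()` helper (returning ROC AUC) are hypothetical names.

```r
# Sketch of a two-sided permutation test for a performance difference.
# Assumes a model `fit` trained on the reference set R, and a
# hypothetical helper score(fit, data, meta) returning ROC AUC.
set.seed(42)
obs_diff <- score(fit, X, MX) - score(fit, Y, MY)

B      <- 1000
pooled <- rbind(X, Y)
pooledM <- rbind(MX, MY)
n_x    <- nrow(X)

null_diffs <- replicate(B, {
  idx    <- sample(nrow(pooled))        # shuffle group membership
  perm_x <- idx[seq_len(n_x)]
  perm_y <- idx[-seq_len(n_x)]
  score(fit, pooled[perm_x, , drop = FALSE], pooledM[perm_x, , drop = FALSE]) -
    score(fit, pooled[perm_y, , drop = FALSE], pooledM[perm_y, , drop = FALSE])
})

# Permutation p-value with the standard +1 correction
p_val <- (sum(abs(null_diffs) >= abs(obs_diff)) + 1) / (B + 1)
```

Because the model is fixed (trained once on the reference set), only the test/inference group labels are permuted, so the null distribution reflects the hypothesis that the two groups are exchangeable with respect to prediction performance.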
Arguments
- X
Matrix of predictors for the test group (ancestry 1); samples x features
- Y
Matrix of predictors for the inference group (ancestry 2); must have the same columns (features) as `X`
- R
Matrix of predictors for the reference training group; must have the same columns (features) as `X`
- MX
Data frame of metadata for `X`; must include the outcome column named by `g_col`
- MY
Data frame of metadata for `Y`; must include the outcome column named by `g_col`
- MR
Data frame of metadata for `R`; must include the outcome column named by `g_col`
- g_col
Character. Name of the column in `MX`, `MY`, and `MR` containing the binary outcome (must have exactly two levels).
- method
Character. Which model to use: `"glmnet"` (default) or `"rf"` (random forest).
- metric
Character. Performance metric to optimize and test. Currently only `"roc_auc"` is supported.
- cv_folds
Integer. Number of cross-validation folds (default = 5).
- tune_len
Integer. Number of levels per hyperparameter in grid search (default = 10).
- max_iter
Integer. Currently unused; reserved for future functionality (default = 1000).
- B
Integer. Number of permutations to run (default = 1000).
- seed
Optional integer seed for reproducibility.
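A hedged usage sketch, tying the arguments together. The data objects and the `"disease_status"` column name below are illustrative placeholders, not part of the package.

```r
# Hypothetical call: expr_* are predictor matrices (samples x features),
# meta_* are metadata data frames containing the "disease_status" column.
res <- perm_prediction_difference(
  X  = expr_ancestry1,  Y  = expr_ancestry2,  R  = expr_reference,
  MX = meta_ancestry1,  MY = meta_ancestry2,  MR = meta_reference,
  g_col    = "disease_status",
  method   = "glmnet",
  metric   = "roc_auc",
  cv_folds = 5,
  tune_len = 10,
  B        = 1000,
  seed     = 123
)
```

With `B = 1000` permutations, the smallest attainable p-value is about 1/(B + 1), so increase `B` if finer resolution is needed.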