This function compares model prediction performance between two datasets (e.g., different ancestries) using a reference training set. It trains a classification model with tidymodels on the reference set, evaluates it on the test and inference datasets, and assesses the statistical significance of the performance difference via permutation testing.

Usage

perm_prediction_difference(
  X,
  Y,
  R,
  MX,
  MY,
  MR,
  g_col,
  method = c("glmnet", "rf"),
  metric = c("roc_auc"),
  cv_folds = 5,
  tune_len = 10,
  max_iter = 1000,
  B = 1000,
  seed = NULL
)
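
The sketch below shows one way the function might be called on small simulated data. The object names, the outcome column name "condition", and the reduced number of permutations are illustrative only; the simulated outcome is random noise, so the example is purely structural.

set.seed(1)
n <- 40   # samples per group
p <- 25   # shared features

feat <- paste0("f", seq_len(p))
R_train <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, feat))  # reference training group
X_test  <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, feat))  # test group (ancestry 1)
Y_infer <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, feat))  # inference group (ancestry 2)

# Metadata with a binary outcome column (exactly 2 levels), as required by g_col
meta_train <- data.frame(condition = factor(rep(c("case", "control"), each = n / 2)))
meta_test  <- data.frame(condition = factor(rep(c("case", "control"), each = n / 2)))
meta_infer <- data.frame(condition = factor(rep(c("case", "control"), each = n / 2)))

res <- perm_prediction_difference(
  X = X_test, Y = Y_infer, R = R_train,
  MX = meta_test, MY = meta_infer, MR = meta_train,
  g_col = "condition",
  method = "glmnet",
  metric = "roc_auc",
  cv_folds = 5,
  B = 100,   # fewer permutations than the default of 1000, to keep the sketch quick
  seed = 42
)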

Arguments

X

Matrix of predictors for the test group (ancestry 1); samples x features

Y

Matrix of predictors for the inference group (ancestry 2)

R

Matrix of predictors for the reference training group

MX

Data frame of metadata for `X`; must include the outcome column named in `g_col`

MY

Data frame of metadata for `Y`

MR

Data frame of metadata for `R`

g_col

Character. Name of the column in `MX`, `MY`, `MR` containing the binary outcome (must have exactly 2 levels).

method

Character. Which model to use: `"glmnet"` (default) or `"rf"` (random forest).

metric

Character. Performance metric to optimize and test; currently only `"roc_auc"` is supported.

cv_folds

Integer. Number of cross-validation folds (default = 5).

tune_len

Integer. Number of levels per hyperparameter in grid search (default = 10).

max_iter

Integer. Currently unused; reserved for future functionality (default = 1000).

B

Integer. Number of permutations to run (default = 1000).

seed

Optional integer seed for reproducibility.

Value

A list containing:

summary_stats

A data frame with the observed test statistic, per-group performance metrics, and the permutation p-value

T_null

Null distribution of the test statistic from permutations

B_used

Number of successful permutations
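
A rough sketch of working with the returned list, assuming `res` holds the result of the call sketched under Usage and that `T_null` is a numeric vector. The p-value formula in the final comment is the standard permutation construction and may differ in detail from the internal computation.

res$summary_stats   # observed statistic, per-group metrics, and p-value
res$B_used          # how many of the B permutations completed successfully

# Visualise the permutation null distribution of the test statistic
hist(res$T_null,
     main = "Permutation null distribution",
     xlab = "Test statistic under permuted group labels")

# A two-sided permutation p-value of this kind is typically computed as
#   p = (1 + sum(abs(T_null) >= abs(T_observed))) / (1 + B_used)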