Split Expression and Metadata into Train, Test, and Inference Sets — split_stratified_ancestry

Performs stratified sampling of an overrepresented group (X) so that its distribution across one or more grouping variables (e.g., ancestry) matches that of an underrepresented group (Y).

Usage

split_stratified_ancestry_sets(
  X,
  Y,
  MX,
  MY,
  g_col = NULL,
  id_col = NULL,
  seed = NULL
)

Arguments

X: A numeric matrix (samples × features) for the overrepresented group. Row names must be sample IDs.
Y: A numeric matrix (samples × features) for the underrepresented group. Row names must be sample IDs.
MX: A data frame of metadata for `X`, with one row per row of `X`. Must align with `X` either by row order or `id_col`.
MY: A data frame of metadata for `Y`, in the same format as `MX` and with one row per row of `Y`.
g_col: Optional character vector of column names in metadata to stratify by (e.g., ancestry, sex). If `NULL`, all samples are grouped together.
id_col: Optional column name in `MX` and `MY` that holds sample IDs. If `NULL`, metadata is matched to expression matrices by rownames.
seed: Optional numeric seed for reproducibility of sampling.
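
The alignment requirements on `X`/`MX` (and likewise `Y`/`MY`) can be checked before calling the function. The following is a minimal sketch in base R on toy data; `ids_aligned()` is a hypothetical helper, not part of the package, and the column names `sample_id` and `ancestry` are assumptions made for illustration. It applies a strict check (same IDs, same order), which may be stricter than what the function itself requires.

```r
# Toy inputs; all object and column names here are illustrative.
set.seed(1)
X <- matrix(rnorm(6 * 3), nrow = 6,
            dimnames = list(paste0("x", 1:6), paste0("gene", 1:3)))
MX <- data.frame(sample_id = paste0("x", 1:6),
                 ancestry  = c("EUR", "EUR", "EUR", "AFR", "AFR", "EAS"),
                 stringsAsFactors = FALSE)

# Hypothetical helper (not part of the package): TRUE when the metadata
# IDs equal the matrix row names, in the same order.
ids_aligned <- function(mat, meta, id_col = NULL) {
  ids <- if (is.null(id_col)) rownames(meta) else as.character(meta[[id_col]])
  !is.null(rownames(mat)) &&
    nrow(mat) == nrow(meta) &&
    identical(rownames(mat), ids)
}

# With an ID column, the column values are compared to rownames(X).
aligned_by_col <- ids_aligned(X, MX, id_col = "sample_id")

# Without an ID column, the metadata row names must carry the sample IDs.
# A fresh data frame has row names "1", "2", ..., so this check fails
# until the row names are set explicitly.
aligned_default <- ids_aligned(X, MX)
rownames(MX) <- MX$sample_id
aligned_after_fix <- ids_aligned(X, MX)
```

If a check like this fails, reordering the metadata with `MX[match(rownames(X), MX$sample_id), ]` is one common way to restore the alignment.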

Value

A list with the following elements:

train: A list with matrix, metadata, and sample IDs for the training set.
test: A list with matrix, metadata, and sample IDs for the testing set.
inference: A list with matrix, metadata, and sample IDs for the inference set (group Y).
strata_info: A summary list of usable, missing, and insufficient strata.

Details

Sample IDs must be provided as row names in both expression matrices. If `id_col` is specified, the values in that metadata column must match those row names exactly.
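
To make the idea of matching strata concrete, the sketch below draws, for each stratum present in the underrepresented group, the same number of samples from the overrepresented group, capped at what is available. This is an illustration in base R under stated assumptions, not the package's implementation: `match_strata()` is a hypothetical helper, the toy metadata is invented, and how the function assigns the remaining samples to train and test is not assumed here.

```r
# Toy metadata; the values are invented for illustration.
MX <- data.frame(ancestry = c(rep("EUR", 6), rep("AFR", 3), "EAS"),
                 stringsAsFactors = FALSE)
MY <- data.frame(ancestry = c("AFR", "AFR", "EAS", "EAS", "SAS"),
                 stringsAsFactors = FALSE)

# Hypothetical helper (not part of the package): returns row indices of MX
# whose per-stratum counts follow those of MY as closely as MX allows.
match_strata <- function(MX, MY, g_col, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)

  # Build one key per sample by pasting the grouping columns together.
  make_key <- function(meta) {
    do.call(paste, c(unname(as.list(meta[g_col])), sep = "|"))
  }
  key_x <- make_key(MX)
  key_y <- make_key(MY)

  picked <- lapply(unique(key_y), function(k) {
    pool <- which(key_x == k)                 # candidates in X for stratum k
    n <- min(length(pool), sum(key_y == k))   # cap at what X can supply
    if (n == 0) return(integer(0))            # stratum absent from X
    # Index via sample.int() so a pool of length 1 is handled correctly.
    pool[sample.int(length(pool), n)]
  })
  sort(as.integer(unlist(picked)))
}

idx <- match_strata(MX, MY, g_col = "ancestry", seed = 42)
matched_counts <- table(MX$ancestry[idx])
```

In this toy case `Y` has two AFR, two EAS, and one SAS sample, while `X` has three AFR, one EAS, and no SAS samples, so the draw contains two AFR samples and one EAS sample. Strata that `X` cannot fill completely, or at all, are the kind of situation a summary such as `strata_info` would be expected to report.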