Split Expression and Metadata into Train, Test, and Inference Sets
Source: R/split_stratified_ancestry_sets.R
split_stratified_ancestry_sets.Rd
Performs stratified sampling of an overrepresented group (`X`) to match the distribution of an underrepresented group (`Y`) across one or more grouping variables (e.g., ancestry), and returns the result as train, test, and inference sets.
Arguments
- X
A numeric matrix (samples × features) for the overrepresented group. Row names must be sample IDs.
- Y
A numeric matrix (samples × features) for the underrepresented group. Row names must be sample IDs.
- MX
A data frame of metadata for `X`, with the same number of rows. Must align with `X` either by row order or `id_col`.
- MY
A data frame of metadata for `Y`, with the same number of rows. Must align with `Y` either by row order or `id_col`.
- g_col
Optional character vector of column names in metadata to stratify by (e.g., ancestry, sex). If `NULL`, all samples are grouped together.
- id_col
Optional column name in `MX` and `MY` that holds sample IDs. If `NULL`, metadata is matched to the expression matrices by row names.
- seed
Optional numeric seed for reproducibility of sampling.
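A minimal sketch of inputs that satisfy the argument descriptions above may help. The toy data, the `sample_id` and `sex` columns, and the `inputs_ok` check are illustrative assumptions. The function name `split_stratified_ancestry_sets` is inferred from the source file name, so the call is guarded and only runs if a function of that name exists in the session.

```r
set.seed(1)

# Toy expression matrices: samples in rows, features in columns,
# with sample IDs as row names.
X <- matrix(rnorm(12 * 3), nrow = 12,
            dimnames = list(paste0("x", 1:12), paste0("gene", 1:3)))
Y <- matrix(rnorm(4 * 3), nrow = 4,
            dimnames = list(paste0("y", 1:4), paste0("gene", 1:3)))

# Metadata with one row per sample. `sample_id` and `sex` are made-up columns.
MX <- data.frame(sample_id = rownames(X),
                 sex = rep(c("F", "M"), times = c(8, 4)),
                 stringsAsFactors = FALSE)
MY <- data.frame(sample_id = rownames(Y),
                 sex = c("F", "M", "M", "M"),
                 stringsAsFactors = FALSE)

# Checks implied by the argument descriptions: numeric matrices,
# matching row counts, and IDs aligned with the matrix row names.
inputs_ok <- is.numeric(X) && is.numeric(Y) &&
  nrow(MX) == nrow(X) && nrow(MY) == nrow(Y) &&
  identical(MX$sample_id, rownames(X)) &&
  identical(MY$sample_id, rownames(Y))

# Guarded call: the function name is an assumption based on the file name.
if (exists("split_stratified_ancestry_sets")) {
  res <- split_stratified_ancestry_sets(
    X = X, Y = Y, MX = MX, MY = MY,
    g_col = "sex", id_col = "sample_id", seed = 42
  )
}
```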
Value
A list with the following elements:
- train
A list with matrix, metadata, and sample IDs for the training set.
- test
A list with matrix, metadata, and sample IDs for the testing set.
- inference
A list with matrix, metadata, and sample IDs for the inference set (group Y).
- strata_info
A summary list of usable, missing, and insufficient strata.
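To illustrate how strata might be classified and sampled, here is a base R sketch. It is not the package's implementation. The helper `sketch_split` is hypothetical, and so are its rules: a stratum counts as "usable" when `X` has at least as many samples as `Y` in it, "insufficient" when `X` has some but fewer, and "missing" when `X` has none. The test set takes exactly as many `X` samples per usable stratum as `Y` has, and the remaining `X` samples form the training set.

```r
# Hypothetical helper, not the package function.
# gx, gy: stratum labels for the samples in X and Y.
sketch_split <- function(gx, gy, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  gx <- as.character(gx)
  gy <- as.character(gy)

  strata <- sort(unique(gy))
  # Samples per stratum in Y (needed) and in X (available).
  need <- setNames(as.integer(table(factor(gy, levels = strata))), strata)
  have <- setNames(as.integer(table(factor(gx, levels = strata))), strata)

  # Assumed classification rules (see the lead-in).
  missing      <- strata[have == 0]
  insufficient <- strata[have > 0 & have < need]
  usable       <- strata[have >= need]

  # For each usable stratum, draw as many X samples as Y has.
  test_idx <- as.integer(unlist(lapply(usable, function(s) {
    pool <- which(gx == s)
    pool[sample.int(length(pool), need[[s]])]
  })))
  train_idx <- setdiff(seq_along(gx), test_idx)

  list(
    train = train_idx,
    test = test_idx,
    strata_info = list(usable = usable,
                       missing = missing,
                       insufficient = insufficient)
  )
}

# Made-up ancestry labels for 14 samples in X and 5 samples in Y.
gx <- rep(c("EUR", "AFR", "EAS"), times = c(10, 3, 1))
gy <- rep(c("AFR", "EAS", "SAS"), times = c(2, 2, 1))
out <- sketch_split(gx, gy, seed = 42)
str(out$strata_info)
```

With these labels, only `AFR` has enough `X` samples to match `Y`, so the test indices all point at `AFR` samples, while `EAS` is reported as insufficient and `SAS` as missing.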