
Performs stratified sampling of an overrepresented group (X) to match the distribution of an underrepresented group (Y) based on one or more grouping variables (e.g., ancestry).
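The idea of matching one group's stratum distribution to another's can be sketched in base R as below. This is an illustration only, not the implementation of `split_stratified_ancestry_sets()`: the labels, the draw size `n_draw`, and the proportional-rounding rule are all assumptions made for the example.

```r
# Illustrative sketch only; NOT the package's implementation.
set.seed(1)

# Hypothetical stratum labels for a large group (x) and a small group (y).
x_strata <- rep(c("A", "B", "C"), times = c(60, 30, 10))
y_strata <- rep(c("A", "B", "C"), times = c(2, 5, 3))

n_draw <- 20  # hypothetical number of samples to draw from x

# Target count per stratum, proportional to y's stratum distribution.
y_prop <- prop.table(table(y_strata))
target <- setNames(as.integer(round(n_draw * y_prop)), names(y_prop))

# Within each stratum of x, draw the target number of indices
# without replacement.
picked <- unlist(lapply(names(target), function(s) {
  idx <- which(x_strata == s)
  idx[sample.int(length(idx), target[[s]])]
}))

# Stratum counts among the drawn samples follow y's proportions.
table(x_strata[picked])
```

A stratum that is present in `y` but has too few (or no) members in `x` would make the draw fail in this sketch; the function reports such cases through `strata_info` instead.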

Usage

split_stratified_ancestry_sets(
  X,
  Y,
  MX,
  MY,
  g_col = NULL,
  id_col = NULL,
  seed = NULL
)

Arguments

X

A numeric matrix (samples × features) for the overrepresented group. Row names must be sample IDs.

Y

A numeric matrix (samples × features) for the underrepresented group. Row names must be sample IDs.

MX

A data frame of metadata for `X`, with the same number of rows as `X`. Rows must align with `X` either by row order or through `id_col`.

MY

A data frame of metadata for `Y`, in the same format as `MX` and with the same number of rows as `Y`.

g_col

Optional character vector of column names in metadata to stratify by (e.g., ancestry, sex). If `NULL`, all samples are grouped together.

id_col

Optional name of the column in `MX` and `MY` that holds sample IDs. If `NULL`, metadata is matched to the expression matrices by row names.

seed

Optional numeric seed for reproducibility of sampling.

Value

A list with the following elements:

train

A list with matrix, metadata, and sample IDs for the training set.

test

A list with matrix, metadata, and sample IDs for the testing set.

inference

A list with matrix, metadata, and sample IDs for the inference set (group Y).

strata_info

A summary list of usable, missing, and insufficient strata.

Details

Sample IDs must be provided as row names in both expression matrices. If `id_col` is specified, the values in that metadata column must match those row names exactly.
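
Examples

A minimal sketch of inputs that satisfy the requirements above, using toy data. The column names `sample_id` and `ancestry`, the ancestry labels, and the data sizes are assumptions made for illustration. The ID check uses `identical()`, which is the strictest reading of "match exactly" (same values in the same order). The call itself is guarded so that the snippet also runs when the package is not loaded.

```r
set.seed(42)

# Hypothetical toy data: 12 overrepresented and 6 underrepresented
# samples, 4 features each. Row names carry the sample IDs.
X <- matrix(rnorm(12 * 4), nrow = 12,
            dimnames = list(paste0("x", 1:12), paste0("f", 1:4)))
Y <- matrix(rnorm(6 * 4), nrow = 6,
            dimnames = list(paste0("y", 1:6), paste0("f", 1:4)))

# Metadata with a hypothetical ID column and stratification column.
MX <- data.frame(sample_id = rownames(X),
                 ancestry  = rep(c("EUR", "AFR", "EAS"), each = 4),
                 stringsAsFactors = FALSE)
MY <- data.frame(sample_id = rownames(Y),
                 ancestry  = rep(c("EUR", "AFR", "EAS"), each = 2),
                 stringsAsFactors = FALSE)

# Check the documented requirements before calling the function.
stopifnot(
  is.numeric(X), is.numeric(Y),
  nrow(MX) == nrow(X), nrow(MY) == nrow(Y),
  identical(MX$sample_id, rownames(X)),
  identical(MY$sample_id, rownames(Y))
)

# Run the split only if the function is available in the session.
if (exists("split_stratified_ancestry_sets")) {
  res <- split_stratified_ancestry_sets(
    X, Y, MX, MY,
    g_col  = "ancestry",
    id_col = "sample_id",
    seed   = 42
  )
  str(res$strata_info)
}
```

With such small strata, some may be reported as insufficient in `strata_info`; larger inputs are likely needed for a meaningful split.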