Subsamples an overrepresented cohort (X, e.g. EUR) so that its distribution over a grouping variable (e.g. condition) matches that of an underrepresented cohort (Y, e.g. AFR).

Usage

split_stratified_ancestry_sets(
  X,
  Y,
  MX,
  MY,
  g_col,
  a_col,
  seed = NULL,
  verbose = TRUE
)

Arguments

X

Numeric matrix or data.frame of features for cohort X; rows are samples and must align with MX.

Y

Numeric matrix or data.frame of features for cohort Y; rows are samples and must align with MY.

MX

Data.frame with metadata for X.

MY

Data.frame with metadata for Y.

g_col

Name of the metadata column holding the stratification label.

a_col

Name of the metadata column holding the ancestry label.

seed

Optional numeric seed for reproducibility of sampling.

verbose

Logical, whether to print messages.
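
The sketch below shows one way the inputs might be prepared. All sample IDs, feature values, and labels are invented for illustration, and it assumes that "align" means the rows of each feature matrix appear in the same order (and carry the same rownames) as the matching metadata. The call itself follows the signature in Usage and only runs if the function is available in the session.

```r
# Toy data only: IDs, feature values, and labels are made up.
set.seed(1)
n_x <- 12
n_y <- 5
X <- matrix(rnorm(n_x * 3), nrow = n_x,
            dimnames = list(paste0("x", seq_len(n_x)), paste0("f", 1:3)))
Y <- matrix(rnorm(n_y * 3), nrow = n_y,
            dimnames = list(paste0("y", seq_len(n_y)), paste0("f", 1:3)))
MX <- data.frame(condition = rep(c("case", "control"), each = 6),
                 ancestry = "EUR", row.names = rownames(X))
MY <- data.frame(condition = c("case", "case", "control", "control", "control"),
                 ancestry = "AFR", row.names = rownames(Y))

# Assumed alignment check: same samples, same order.
stopifnot(identical(rownames(X), rownames(MX)),
          identical(rownames(Y), rownames(MY)))

# Call shape taken from the Usage section; skipped if the function is not loaded.
if (exists("split_stratified_ancestry_sets")) {
  res <- split_stratified_ancestry_sets(X, Y, MX, MY,
                                        g_col = "condition", a_col = "ancestry",
                                        seed = 42, verbose = FALSE)
}
```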

Value

A list with the following elements (R, X, and Y are matrices with rownames):

R

Reference set: remaining X after subsampling

X

Subset: samples drawn from X to match Y

Y

Inference set: full Y, untouched

strata_info

List of usable, missing, and insufficient strata
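
To make the split concrete, here is a minimal base-R sketch of one plausible reading of the procedure. It is not the package's implementation. The helper name `sketch_stratified_split` is hypothetical, and the sketch assumes that a stratum is "usable" when X holds at least as many samples as Y, "insufficient" when X holds fewer, and "missing" when X holds none, and that only usable strata are sampled.

```r
# Hypothetical helper for illustration; not the package's implementation.
sketch_stratified_split <- function(meta_x, meta_y, g_col, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  picked <- integer(0)
  status <- character(0)
  for (s in unique(as.character(meta_y[[g_col]]))) {
    n_needed <- sum(meta_y[[g_col]] == s)   # size of this stratum in Y
    idx <- which(meta_x[[g_col]] == s)      # candidate rows in X
    if (length(idx) == 0) {
      status[s] <- "missing"
    } else if (length(idx) < n_needed) {
      status[s] <- "insufficient"
    } else {
      status[s] <- "usable"
      # Index through sample.int: sample(idx, n) misbehaves when idx has length 1.
      picked <- c(picked, idx[sample.int(length(idx), n_needed)])
    }
  }
  list(subset = sort(picked),                                  # rows of X drawn to match Y
       reference = setdiff(seq_len(nrow(meta_x)), picked),     # remaining rows of X
       strata_info = split(names(status), status))
}

# Toy metadata, invented for illustration.
meta_x <- data.frame(condition = c(rep("case", 6), rep("control", 2)))
meta_y <- data.frame(condition = c("case", "case", "control",
                                   "control", "control", "other"))
out <- sketch_stratified_split(meta_x, meta_y, g_col = "condition", seed = 1)
str(out$strata_info)
```

Here "case" has enough X samples and is subsampled, "control" has too few, and "other" does not occur in X at all. The row indices in `subset` and `reference` would then be used to slice X into the X and R elements described above.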