Simulate imbalanced ancestry sampling across two cohorts — sim_imbalanced

Subsample two data matrices (X, Y) and their metadata (MX, MY) to mimic an ancestry-imbalanced cohort with controllable between- and within-ancestry group splits.

Usage

sim_imbalanced_ancestry(
  X,
  Y,
  MX,
  MY,
  g_col,
  a_col,
  majority = c("X", "Y"),
  total_samples = 100,
  between_ratio = 1,
  within_major_ratio = 1,
  within_minor_ratio = 1,
  seed = NULL,
  replace = FALSE,
  verbose = FALSE
)

Arguments

X: Numeric matrix or data.frame of features for cohort X; rows are samples, must carry rownames with sample IDs, and must align with `MX`.
Y: Numeric matrix or data.frame of features for cohort Y; rows are samples and must align with `MY`.
MX: Data.frame with metadata for `X`.
MY: Data.frame with metadata for `Y`.
g_col: Name of the metadata column with the two-level group split used inside each ancestry (e.g., case/control).
a_col: Name of the metadata column holding the ancestry label.
majority: Character, which block is the majority ancestry; one of `c("X","Y")`.
total_samples: Integer, total number of samples drawn across both ancestries.
between_ratio: Positive number, majority:minority ancestry ratio. For example `2` means 2:1 between ancestries.
within_major_ratio: Positive number, ratio for the two `g_col` levels inside the majority ancestry (`level1:level2 = r:1`).
within_minor_ratio: Positive number, ratio for the two `g_col` levels inside the minority ancestry (`level1:level2 = r:1`).
seed: Optional integer; if supplied, sets the RNG seed.
replace: Logical; sample with replacement within each block.
verbose: Logical; if `TRUE`, print per-block split messages.

Value

A list with four elements:

X: Subsampled feature matrix in the first return slot (the majority ancestry).
Y: Subsampled feature matrix in the second return slot (the minority ancestry).
MX: Subset of metadata aligned to `X`.
MY: Subset of metadata aligned to `Y`.

Which ancestry lands in `X` vs `Y` depends on `majority`: the majority ancestry is always returned first.

Details

`X` and `MX` must have the same number of rows, as must `Y` and `MY`. The ancestry of each block is read from `a_col` in `MX` and `MY`; each block must contain exactly one distinct non-NA ancestry label, and the two labels must differ.
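
The input requirements in the paragraph above can be sketched as a small check. This is an illustration only: `check_block` is a hypothetical helper written for this page, not the package's internal validation, and the toy data are made up.

```r
# Hypothetical helper: validate one feature/metadata block and
# return its single ancestry label.
check_block <- function(feat, meta, a_col) {
  # Features and metadata must describe the same samples
  stopifnot(nrow(feat) == nrow(meta))
  labs <- unique(meta[[a_col]])
  labs <- labs[!is.na(labs)]
  if (length(labs) != 1L) {
    stop("`", a_col, "` must hold exactly one non-NA label per block")
  }
  as.character(labs)
}

# Toy blocks (assumed data, for illustration)
X  <- matrix(0, nrow = 4, ncol = 2)
MX <- data.frame(anc = rep("EUR", 4))
Y  <- matrix(0, nrow = 3, ncol = 2)
MY <- data.frame(anc = rep("AFR", 3))

anc_x <- check_block(X, MX, "anc")
anc_y <- check_block(Y, MY, "anc")
stopifnot(anc_x != anc_y)  # the two blocks must carry different ancestries
```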

The within-ancestry split uses `g_col`, which must have exactly two non-NA labels in each block. The two levels are ordered alphabetically, so `level1` is the label that sorts first, and a ratio `r` means `level1:level2 = r:1`.
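
As a rough illustration of how an `r:1` ratio turns into per-level counts, the sketch below orders the two labels alphabetically and splits a block target. `split_counts` is a hypothetical helper, and the rounding rule shown (round `level1`, give the remainder to `level2`) is an assumption rather than the function's documented behaviour.

```r
# Order the two group labels alphabetically, ignoring NA
grp <- c("B", "A", NA, "B", "A", "A")
lev <- sort(unique(grp[!is.na(grp)]))  # level1 = "A", level2 = "B"

# Hypothetical helper: split n samples as level1:level2 = r:1.
# Rounding level1 and assigning the remainder to level2 is an assumption.
split_counts <- function(n, r) {
  n1 <- round(n * r / (r + 1))
  c(level1 = n1, level2 = n - n1)
}

# 120 samples at a 2:1 between-ancestry ratio: 80 majority, 40 minority
between <- split_counts(120, 2)

# 80 majority samples at a 3:1 within-ancestry ratio: 60 of "A", 20 of "B"
within_major <- split_counts(80, 3)
```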

Targets are rounded to whole sample counts. If `replace = FALSE` and a block has fewer rows than its target, the function draws all available rows and emits a warning. Per-block split messages are printed only when `verbose = TRUE`.
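
The shortfall behaviour can be pictured with a minimal sketch. `draw_rows` is a hypothetical helper and its warning text is invented for illustration; it shows the general pattern of capping the draw at the number of available rows, not the package's actual code.

```r
# Hypothetical helper: draw `target` row indices from `idx`.
# Without replacement, cap the draw at the available rows and warn.
draw_rows <- function(idx, target, replace = FALSE) {
  if (!replace && target > length(idx)) {
    warning("requested ", target, " rows but only ", length(idx), " available")
    target <- length(idx)
  }
  idx[sample.int(length(idx), target, replace = replace)]
}

set.seed(1)
# Only 5 rows exist, so asking for 8 without replacement returns all 5
picked <- suppressWarnings(draw_rows(1:5, 8))
length(picked)  # 5

# With replacement the full target can be met
picked_rep <- draw_rows(1:5, 8, replace = TRUE)
length(picked_rep)  # 8
```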

Examples

set.seed(1)
X  <- matrix(rnorm(1000), nrow = 100)
Y  <- matrix(rnorm( 800), nrow =  80)
# The input check requires rownames holding sample IDs
rownames(X) <- paste0("x", seq_len(nrow(X)))
rownames(Y) <- paste0("y", seq_len(nrow(Y)))
# Metadata rows share the same IDs (assumed to be accepted by the function)
MX <- data.frame(anc = "EUR",
                 grp = sample(c("A","B"), 100, TRUE),
                 row.names = rownames(X))
MY <- data.frame(anc = "AFR",
                 grp = sample(c("A","B"),  80, TRUE),
                 row.names = rownames(Y))
out <- sim_imbalanced_ancestry(
  X, Y, MX, MY,
  g_col = "grp",
  a_col = "anc",
  majority = "X",
  total_samples = 120,
  between_ratio = 2,
  within_major_ratio = 3,
  within_minor_ratio = 1,
  seed = 123,
  verbose = TRUE
)
# Number of rows in each returned element
lapply(out, nrow)