Simulate imbalanced ancestry sampling across two cohorts
Source: R/sim_imbalanced_ancestry.R
sim_imbalanced_ancestry.Rd
Subsample two data matrices (X, Y) and their metadata (MX, MY) to mimic an ancestry-imbalanced cohort with controllable between- and within-ancestry group splits.
Usage
sim_imbalanced_ancestry(
  X,
  Y,
  MX,
  MY,
  g_col,
  a_col,
  majority = c("X", "Y"),
  total_samples = 100,
  between_ratio = 1,
  within_major_ratio = 1,
  within_minor_ratio = 1,
  seed = NULL,
  replace = FALSE,
  verbose = FALSE
)
Arguments
- X
Numeric matrix or data.frame of features for cohort X; rows are samples and must align with `MX`.
- Y
Numeric matrix or data.frame of features for cohort Y; rows are samples and must align with `MY`.
- MX
Data.frame with metadata for `X`.
- MY
Data.frame with metadata for `Y`.
- g_col
Name of the metadata column with the two-level group split used inside each ancestry (e.g., case/control).
- a_col
Name of the metadata column holding the ancestry label.
- majority
Character; which input block holds the majority ancestry, either `"X"` or `"Y"`.
- total_samples
Integer, total number of samples drawn across both ancestries.
- between_ratio
Positive number giving the majority:minority ancestry ratio. For example, `2` means two majority samples for every minority sample (2:1).
- within_major_ratio
Positive number, ratio for the two `g_col` levels inside the majority ancestry (`level1:level2 = r:1`).
- within_minor_ratio
Positive number, ratio for the two `g_col` levels inside the minority ancestry (`level1:level2 = r:1`).
- seed
Optional integer; if supplied, sets the RNG seed.
- replace
Logical; if `TRUE`, sample with replacement within each block.
- verbose
Logical; if `TRUE`, print per-block split messages.
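The three ratio arguments interact with `total_samples`, so a small worked example may help. The sketch below is not the package's code: `split_by_ratio()` is a hypothetical helper, and it assumes each target is obtained as `round(n * r / (r + 1))` with the remainder assigned to the other part. The documentation only states that targets are rounded, so the exact rule may differ.

```r
# Hypothetical helper (not part of the package): split n into two parts at ratio r:1.
# Assumption: the first part is round(n * r / (r + 1)); the second gets the rest.
split_by_ratio <- function(n, r) {
  stopifnot(n >= 0, r > 0)
  first <- round(n * r / (r + 1))
  c(first = first, second = n - first)
}

total_samples <- 120
between <- split_by_ratio(total_samples, 2)        # majority:minority = 2:1
major   <- split_by_ratio(between[["first"]], 3)   # level1:level2 = 3:1 in the majority
minor   <- split_by_ratio(between[["second"]], 1)  # level1:level2 = 1:1 in the minority

targets <- c(
  major_level1 = major[["first"]],
  major_level2 = major[["second"]],
  minor_level1 = minor[["first"]],
  minor_level2 = minor[["second"]]
)
print(targets)
```

Under these assumptions, 120 samples split into 80 majority and 40 minority samples, and then into 60/20 and 20/20 by group level.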
Value
A list with four elements:
- X
Subsampled feature matrix for the majority ancestry (first return slot).
- Y
Subsampled feature matrix for the minority ancestry (second return slot).
- MX
Subset of metadata aligned to `X`.
- MY
Subset of metadata aligned to `Y`.
Which ancestry lands in `X` vs `Y` depends on `majority`: the majority ancestry is always returned first.
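One way to picture this ordering rule is a simple swap of the input blocks. The sketch below is an illustration only: `order_by_majority()` is a hypothetical function, and it assumes that "returned first" means the majority block occupies the `X` and `MX` slots of the result.

```r
# Hypothetical sketch of the ordering rule; the package's internals may differ.
order_by_majority <- function(X, Y, MX, MY, majority = c("X", "Y")) {
  majority <- match.arg(majority)
  if (majority == "Y") {
    # Input Y is the majority ancestry, so it moves to the first slot.
    list(X = Y, Y = X, MX = MY, MY = MX)
  } else {
    list(X = X, Y = Y, MX = MX, MY = MY)
  }
}

X  <- matrix(0, nrow = 3, ncol = 2)
Y  <- matrix(1, nrow = 5, ncol = 2)
MX <- data.frame(anc = rep("EUR", 3))
MY <- data.frame(anc = rep("AFR", 5))

res <- order_by_majority(X, Y, MX, MY, majority = "Y")
res$MX$anc[1]  # ancestry label now held in the first slot
```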
Details
`X` and `MX` must have the same number of rows, as must `Y` and `MY`. The ancestry of each block is read from `a_col` in `MX` and `MY`; each must contain exactly one distinct non-NA label, and the two labels must differ.
The within-ancestry split uses `g_col`, which must have exactly two non-NA labels in each block. The two levels are ordered alphabetically, so `level1` is the label that sorts first, and a ratio `r` means `level1:level2 = r:1`.
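The constraints described so far can be checked before calling the function. The following is a minimal sketch under stated assumptions: `check_block()` is a hypothetical helper, not the package's `assert_input()`, and it uses `sort()` for the alphabetical ordering, which follows the current locale's collation.

```r
# Hypothetical pre-flight check mirroring the documented constraints.
check_block <- function(M, n_rows, g_col, a_col) {
  stopifnot(is.data.frame(M), nrow(M) == n_rows)
  non_na <- function(x) x[!is.na(x)]

  # Exactly one distinct ancestry label per block.
  anc <- unique(as.character(non_na(M[[a_col]])))
  if (length(anc) != 1L) stop("`", a_col, "` must hold exactly one non-NA label.")

  # Exactly two group labels; the one that sorts first is treated as level1.
  lev <- sort(unique(as.character(non_na(M[[g_col]]))))
  if (length(lev) != 2L) stop("`", g_col, "` must hold exactly two non-NA labels.")

  list(ancestry = anc, level1 = lev[1], level2 = lev[2])
}

MX <- data.frame(anc = "EUR", grp = c("B", "A", "B", "A"))
MY <- data.frame(anc = "AFR", grp = c("A", "B", "B", NA))

bx <- check_block(MX, n_rows = 4, g_col = "grp", a_col = "anc")
by <- check_block(MY, n_rows = 4, g_col = "grp", a_col = "anc")
stopifnot(bx$ancestry != by$ancestry)  # the two ancestry labels must differ
```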
Targets are rounded to whole samples. If `replace = FALSE` and a block has fewer rows than its target, the function draws what is available and emits a warning. Per-block split messages are printed only when `verbose = TRUE`.
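The shortfall behaviour can be illustrated with a small sampling helper. This is a sketch, not the package's implementation: `draw_rows()` is a hypothetical name, and the warning text is invented for illustration.

```r
# Hypothetical helper: draw `target` row indices from `idx`.
# Without replacement, the draw is capped at the rows available and a warning is raised.
draw_rows <- function(idx, target, replace = FALSE) {
  if (!replace && target > length(idx)) {
    warning("Requested ", target, " rows but only ", length(idx), " are available.")
    target <- length(idx)
  }
  # Sample positions rather than values, so a length-one `idx` is not
  # expanded to 1:idx as sample() would do.
  idx[sample.int(length(idx), size = target, replace = replace)]
}

set.seed(42)
pool  <- 1:10
short <- suppressWarnings(draw_rows(pool, 15))   # capped at the 10 available rows
full  <- draw_rows(pool, 15, replace = TRUE)     # 15 rows, repeats allowed
length(short)
length(full)
```

With `replace = TRUE` the target is always met, at the cost of duplicated rows in the subsample.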
Examples
set.seed(1)
X <- matrix(rnorm(1000), nrow = 100)
Y <- matrix(rnorm( 800), nrow = 80)
# The input check requires rownames on X that correspond to sample IDs.
# Setting matching IDs on Y, MX, and MY as well is an assumption; the ID
# scheme below is illustrative.
rownames(X) <- paste0("x", seq_len(nrow(X)))
rownames(Y) <- paste0("y", seq_len(nrow(Y)))
MX <- data.frame(anc = "EUR",
                 grp = sample(c("A","B"), 100, TRUE),
                 row.names = rownames(X))
MY <- data.frame(anc = "AFR",
                 grp = sample(c("A","B"), 80, TRUE),
                 row.names = rownames(Y))
out <- sim_imbalanced_ancestry(
  X, Y, MX, MY,
  g_col = "grp",
  a_col = "anc",
  majority = "X",
  total_samples = 120,
  between_ratio = 2,
  within_major_ratio = 3,
  within_minor_ratio = 1,
  seed = 123,
  verbose = TRUE
)
# Number of rows in each returned element (X, Y, MX, MY)
lapply(out, nrow)