Simulate RNA-seq Expression Data for Four Groups (Two Ancestries × Two Conditions)

Generate simulated RNA-seq count matrices for two ancestries, each with two conditions, based on parameters estimated from real data. This function calls sim_2group_expression twice (once per ancestry) and then calculates the true interaction effects between ancestries for each gene.

Usage

sim_4group_expression(
  estimates_X = NULL,
  estimates_Y = NULL,
  ancestry_X,
  ancestry_Y,
  n_samples_X,
  n_samples_Y,
  n_degs_X,
  n_degs_Y,
  log2FC_X,
  log2FC_Y,
  mean_method = c("mle", "map", "libnorm_mle", "libnorm_map"),
  disp_method = c("mle", "map"),
  seed = NULL
)

Arguments

estimates_X: Optional list of parameter estimates from estimate_params for ancestry X. If NULL, parameters are estimated from X.
estimates_Y: Optional list of parameter estimates from estimate_params for ancestry Y. If NULL, parameters are estimated from Y.
ancestry_X: Character scalar giving the ancestry label for X.
ancestry_Y: Character scalar giving the ancestry label for Y.
n_samples_X: Integer, number of samples to simulate per condition for ancestry X.
n_samples_Y: Integer, number of samples to simulate per condition for ancestry Y.
n_degs_X: Integer, number of differentially expressed genes to simulate in ancestry X.
n_degs_Y: Integer, number of differentially expressed genes to simulate in ancestry Y.
log2FC_X: Numeric, log2 fold-change magnitude for DEGs in ancestry X.
log2FC_Y: Numeric, log2 fold-change magnitude for DEGs in ancestry Y.
mean_method: Character string, method to use for mean estimates in both ancestries. One of "mle", "map", "libnorm_mle", "libnorm_map".
disp_method: Character string, method to use for dispersion estimates in both ancestries. One of "mle", "map".
seed: Optional integer random seed for reproducibility. The simulation for ancestry Y uses seed + 1000 so that the two ancestries do not receive the same DE gene set.
X: Numeric matrix or data frame of counts for the first ancestry (samples in rows, genes in columns).
Y: Numeric matrix or data frame of counts for the second ancestry (samples in rows, genes in columns).
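
The seed handling described for the seed argument can be illustrated with a minimal base R sketch. This is not the package's internal code: n_genes, n_degs and the use of sample() to pick DE genes are assumptions made only to show why offsetting the seed gives ancestry Y its own draw while keeping the run reproducible.

```r
# Illustration only: hypothetical values, not taken from the package.
n_genes <- 1000   # assumed number of genes
n_degs  <- 50     # assumed number of DEGs per ancestry
seed    <- 42

# Ancestry X draws its DE genes with the user-supplied seed.
set.seed(seed)
degs_X <- sample(n_genes, n_degs)

# Ancestry Y uses the offset seed, so it gets its own draw.
set.seed(seed + 1000)
degs_Y <- sample(n_genes, n_degs)

# Re-running with the same seed reproduces the same selection.
set.seed(seed)
degs_X_again <- sample(n_genes, n_degs)

identical(degs_X, degs_X_again)    # TRUE: same seed, same DE genes
length(intersect(degs_X, degs_Y))  # genes that happen to be DE in both draws
```

Note that a different seed only makes the two draws differ; some genes may still be selected in both ancestries by chance.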

Value

A list with the following elements:

X: Simulated count matrix (samples x genes) for ancestry X.
Y: Simulated count matrix (samples x genes) for ancestry Y.
MX: Sample metadata for ancestry X.
MY: Sample metadata for ancestry Y.
fX: Gene-level features (DE status, true log2FC) for ancestry X.
fY: Gene-level features (DE status, true log2FC) for ancestry Y.
pX: List of ggplot objects comparing means and dispersions for ancestry X.
pY: List of ggplot objects comparing means and dispersions for ancestry Y.
fI: Data frame of interaction effects, with DE status and true interaction log2FC.

Details

The interaction effect for each gene is defined as:

$$\mathrm{Interaction\ log2FC} = \mathrm{log2FC}_Y - \mathrm{log2FC}_X$$

where $\mathrm{log2FC}_X$ and $\mathrm{log2FC}_Y$ are the true log2 fold-changes from the simulated data for ancestry X and Y respectively.
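
The definition above can be sketched in base R on toy gene-level tables. The column names (gene, is_DE, true_log2FC) and the rule that a gene counts as an interaction DEG when its interaction log2FC is non-zero are assumptions for illustration; the actual layout of fX, fY and fI may differ.

```r
# Toy gene-level features for each ancestry (hypothetical column names).
fX <- data.frame(
  gene        = c("g1", "g2", "g3", "g4"),
  is_DE       = c(TRUE, FALSE, TRUE, FALSE),
  true_log2FC = c(1, 0, 1, 0)
)
fY <- data.frame(
  gene        = c("g1", "g2", "g3", "g4"),
  is_DE       = c(TRUE, TRUE, FALSE, FALSE),
  true_log2FC = c(1, 2, 0, 0)
)

# The subtraction is only meaningful if genes are in the same order.
stopifnot(identical(fX$gene, fY$gene))

# Interaction log2FC = log2FC_Y - log2FC_X, per gene.
fI <- data.frame(
  gene        = fX$gene,
  true_log2FC = fY$true_log2FC - fX$true_log2FC
)

# Assumed rule: a gene has an interaction effect if the difference is non-zero.
fI$is_DE <- fI$true_log2FC != 0

print(fI)
```

In this toy example g1 is DE in both ancestries with the same fold-change, so its interaction effect is zero, whereas g2 and g3 are DE in only one ancestry and therefore carry a non-zero interaction effect.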
