Calculates pairwise Jaccard indices between iterations based on sample usage. Useful for assessing overlap and dependence in train/test/inference sets across iterations.

Usage

compute_jaccard_matrix(id_usage, role)

Arguments

id_usage

A data.frame with columns ids, role, and iteration, such as the output from track_sample_ids.

role

A character string specifying which sample role to evaluate ("test", "train", or "inference").

Value

A square matrix of Jaccard indices with dimensions iterations × iterations.

Details

The Jaccard index is computed as: $$J(i, j) = \frac{|A_i \cap A_j|}{|A_i \cup A_j|}$$ where \(A_i\) and \(A_j\) are the sets of sample IDs used in the specified role for iterations \(i\) and \(j\).
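
The formula can be illustrated with a minimal base R sketch. This is not the package's implementation: `jaccard_sketch()` is a hypothetical helper written only for illustration, and the toy `id_usage` data frame is made up, assuming only the three columns described under Arguments.

```r
# Toy input with the documented columns: ids, role, iteration.
# These values are invented for illustration.
id_usage <- data.frame(
  ids       = c("s1", "s2", "s3", "s2", "s3", "s4", "s5", "s6"),
  role      = "test",
  iteration = c(1, 1, 1, 2, 2, 2, 3, 3)
)

# Hypothetical helper (not part of the package): pairwise Jaccard
# indices between iterations for one sample role.
jaccard_sketch <- function(id_usage, role) {
  # Keep only the rows for the requested role.
  used <- id_usage[id_usage$role == role, ]
  # One set of unique sample IDs per iteration.
  sets <- lapply(split(used$ids, used$iteration), unique)
  n <- length(sets)
  out <- matrix(NA_real_, n, n, dimnames = list(names(sets), names(sets)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      shared <- length(intersect(sets[[i]], sets[[j]]))  # |A_i ∩ A_j|
      total  <- length(union(sets[[i]], sets[[j]]))      # |A_i ∪ A_j|
      out[i, j] <- shared / total
    }
  }
  out
}

jaccard_toy <- jaccard_sketch(id_usage, role = "test")
```

In this toy data, iterations 1 and 2 share two of four distinct IDs, so their index is 0.5; iterations 1 and 3 share none, so their index is 0. The diagonal is 1 because each set is compared with itself, and the matrix is symmetric.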

Examples

# Assume `id_usage` is created using multiple splits and track_sample_ids()
jaccard_test <- compute_jaccard_matrix(id_usage, role = "test")
jaccard_train <- compute_jaccard_matrix(id_usage, role = "train")