This vignette demonstrates how to simulate realistic single-cell
sequencing reads using the cancerSimCraft package. The
simulation process models the complexities of cancer evolution by
incorporating:
By the end of this tutorial, you will have simulated FASTQ files that mimic real sequencing data from individual cancer cells, with known ground truth mutations and structural variations.
Load the necessary libraries and the pre-processed data.
# Load required libraries
library(parallel)
library(tictoc)
library(Biostrings)
library(cancerSimCraft)
# Load pre-processed data
load(file = "./tutorials_data/advanced_8_dynamics_simulation.RData")
To mimic a real single-cell sequencing experiment, we randomly sample
a subset of cells from the simulated population for further analysis. In
practice, only a fraction of cells are typically sequenced due to cost
and technical limitations. Here, we sample 100
cells from our simulation to represent this subset.
param2_dynamics_sim[[1]] is generated during the dynamics
analysis in the advanced_8_dynamics_simulation
# Sample 100 cells from the simulated population
sampled_cell_info <- sample_cells(param2_dynamics_sim[[1]]$cell_info, num_samples = 100)
sampled_cell_idx <- sampled_cell_info$cell_index
The sample_cells function randomly selects living cells
from the simulation data. The returned data frame contains full lineage
information for each sampled cell, including:
Retrieve mutations for the sampled cells using
the get_mutations_sc function.
# Retrieve mutations for the sampled cells
all_sampled_sc_mutations <- get_mutations_sc(
cell_info = param2_dynamics_sim[[1]]$cell_info,
mutation_info = param2_dynamics_sim[[1]]$mutation_info,
sampled_cell_idx = sampled_cell_idx
)
# View the mutation table
head(all_sampled_sc_mutations$sampled_mutation_table)
# View mutations for the first sampled cell
all_sampled_sc_mutations$sampled_sc_mutations[[1]]
The get_mutations_sc function:
To simulate the specific nucleotide changes for each mutation, we use
the sim_snv_nt_sc function. This function introduces single
nucleotide variants (SNVs) into the genome based on the sampled mutation
table and a specified nucleotide transition model. Here, we use
the Kimura 2-Parameter (K2P) substitution model, which
is commonly used in evolutionary biology to model nucleotide
substitutions.
# Define the Kimura 2-Parameter (K2P) transition matrix
alpha <- 1.21
beta <- 1
K2P_transition_matrix <- generate_K2P_matrix(alpha, beta)
# Simulate SNVs for the sampled cells
sc_mut_table_with_nt <- sim_snv_nt_sc(
genome_sequence = genome_with_snp$sim_genome,
seg_list = sim_updates$all_node_segments,
mutation_table = all_sampled_sc_mutations$sampled_mutation_table,
recurrent_mutation_tracker = all_sampled_sc_mutations$recurrent_mutation_tracker,
nt_transition_matrix = K2P_transition_matrix
)
The resulting mutation table now contains:
Note on genome_with_snp$sim_genome
The genome_with_snp$sim_genome object is a simulated
genome that was created in
the advanced_5_backbone_genome_with_snp vignette. In that
vignette, only chromosomes 1, 2, and 3
(chr1, chr2, and chr3) were
generated for demonstration purposes.
For a complete run of this vignette, we recommend rerunning
the advanced_5_backbone_genome_with_snp vignette with the
following modification:
chr_indices = 1:22 to generate a whole genome
(chromosomes 1 through 22).Important: Generating a whole genome sets of the lineage tree requires at least 100 GB of memory. Ensure your system has sufficient resources before proceeding.
Before simulating individual cells, we need to create founder genomes for each clone in the lineage tree. These founder genomes represent the starting point for each clone and incorporate structural changes (e.g., copy number variations) and mutations that occurred during the evolution of the lineage tree. By synthesizing founder genomes first, we can efficiently simulate individual cells without recalculating shared evolutionary changes.
The synth_sc_tree_founder_genomes function is used to
generate these founder genomes. It traverses the lineage tree in
depth-first order, starting from the root node, and synthesizes genomes
for each child node.
# Synthesize founder genomes
sc_founder_genomes <- synth_sc_tree_founder_genomes(
tree = chisel_tree,
root_genome = genome_with_snp$sim_genome,
seg_list = sim_updates$all_node_segments,
mutation_info = sc_mut_table_with_nt,
cell_info = param2_dynamics_sim[[1]]$cell_info
)
To simulate sequencing reads for the sampled cells, we use
the simulate_sc_dynamic_reads_for_batches function. This
function processes cells in batches to efficiently manage memory usage
and computational resources. Before running this step, ensure that the
ART (Huang et al., Bioinformatics 2012) is installed on your system.
# Define paths and parameters
art_path <- "/path_to_art/art_bin_MountRainier/art_illumina"
depth <- 0.01 # Sequencing depth
readLen <- 150 # Read length
batch_size <- 10 # Number of cells to process per batch
n_cores <- 15 # Number of cores for parallel processing
output_dir <- paste0("/path/to/fq/output") # Output directory for FASTQ files
fa_dir <- paste0("/path/to/fa") # Directory for temporary FASTA files
# Create directories if they don't exist
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(fa_dir, recursive = TRUE, showWarnings = FALSE)
# Simulate sequencing reads
mutation_checks <- simulate_sc_dynamic_reads_for_batches(
sampled_cell_idx = sampled_cell_idx,
dynamics_ob = param2_dynamics_sim[[1]],
sc_founder_genomes = sc_founder_genomes,
all_sampled_sc_mutations = all_sampled_sc_mutations,
sc_mut_table_with_nt = sc_mut_table_with_nt,
sim_updates = sim_updates,
fa_dir = fa_dir,
output_dir = output_dir,
art_path = art_path,
n_cores = n_cores,
depth = depth,
readLen = readLen,
otherArgs = "--noALN", # Additional arguments for ART
keep_genome_files = FALSE, # Delete genome files after simulation
batch_size = batch_size
)
The simulate_sc_dynamic_reads_for_batches function:
The resulting FASTQ files can be used for downstream analyses such as variant calling, phylogenetic reconstruction, or other single-cell sequencing applications.
This vignette has demonstrated how to use the
cancerSimCraft package to simulate realistic single-cell
sequencing data that captures the complexity of cancer evolution. The
simulation process maintains biological realism by:
This approach enables researchers to generate ground-truth datasets for benchmarking and validating single-cell analysis methods, or to explore hypothetical evolutionary scenarios in silico.