Overview

This vignette demonstrates how to simulate realistic single-cell sequencing reads using the cancerSimCraft package. The simulation process models the complexities of cancer evolution by incorporating:

By the end of this tutorial, you will have simulated FASTQ files that mimic real sequencing data from individual cancer cells, with known ground truth mutations and structural variations.

Load Required Libraries and Data

Load the necessary libraries and the pre-processed data.

# Load required libraries
library(parallel)
library(tictoc)
library(Biostrings)
library(cancerSimCraft)

# Load pre-processed data
load(file = "./tutorials_data/advanced_8_dynamics_simulation.RData")

Sample Cells for Simulation

To mimic a real single-cell sequencing experiment, we randomly sample a subset of cells from the simulated population for further analysis. In practice, only a fraction of cells are typically sequenced due to cost and technical limitations. Here, we sample 100 cells from our simulation to represent this subset. param2_dynamics_sim[[1]] is generated during the dynamics analysis in the advanced_8_dynamics_simulation

# Sample 100 cells from the simulated population
sampled_cell_info <- sample_cells(param2_dynamics_sim[[1]]$cell_info, num_samples = 100)
sampled_cell_idx <- sampled_cell_info$cell_index

The sample_cells function randomly selects living cells from the simulation data. The returned data frame contains full lineage information for each sampled cell, including:

  • Clone type
  • Parent cell
  • Birth time
  • Clone index
  • Other lineage details

Retrieve Mutations for Sampled Cells

Retrieve mutations for the sampled cells using the get_mutations_sc function.

# Retrieve mutations for the sampled cells
all_sampled_sc_mutations <- get_mutations_sc(
  cell_info = param2_dynamics_sim[[1]]$cell_info,
  mutation_info = param2_dynamics_sim[[1]]$mutation_info,
  sampled_cell_idx = sampled_cell_idx
)

# View the mutation table
head(all_sampled_sc_mutations$sampled_mutation_table)

# View mutations for the first sampled cell
all_sampled_sc_mutations$sampled_sc_mutations[[1]]

The get_mutations_sc function:

  1. Retrieves all mutations for each sampled cell, including those in the cell’s ancestral lineage
  2. Identifies and tracks recurrent mutations at the same genomic location
  3. Returns both a combined mutation table and cell-specific mutation lists

Simulate Nucleotide Changes for Sampled Cells

To simulate the specific nucleotide changes for each mutation, we use the sim_snv_nt_sc function. This function introduces single nucleotide variants (SNVs) into the genome based on the sampled mutation table and a specified nucleotide transition model. Here, we use the Kimura 2-Parameter (K2P) substitution model, which is commonly used in evolutionary biology to model nucleotide substitutions.

# Define the Kimura 2-Parameter (K2P) transition matrix
alpha <- 1.21
beta <- 1
K2P_transition_matrix <- generate_K2P_matrix(alpha, beta)

# Simulate SNVs for the sampled cells
sc_mut_table_with_nt <- sim_snv_nt_sc(
  genome_sequence = genome_with_snp$sim_genome,
  seg_list = sim_updates$all_node_segments,
  mutation_table = all_sampled_sc_mutations$sampled_mutation_table,
  recurrent_mutation_tracker = all_sampled_sc_mutations$recurrent_mutation_tracker,
  nt_transition_matrix = K2P_transition_matrix
)

The resulting mutation table now contains:

  • Original nucleotide at each mutation site
  • Alternative (mutated) nucleotide
  • Segment ID and reference position information
  • All original mutation metadata

Note on genome_with_snp$sim_genome

The genome_with_snp$sim_genome object is a simulated genome that was created in the advanced_5_backbone_genome_with_snp vignette. In that vignette, only chromosomes 1, 2, and 3 (chr1chr2, and chr3) were generated for demonstration purposes.

For a complete run of this vignette, we recommend rerunning the advanced_5_backbone_genome_with_snp vignette with the following modification:

  • Set chr_indices = 1:22 to generate a whole genome (chromosomes 1 through 22).

Important: Generating a whole genome sets of the lineage tree requires at least 100 GB of memory. Ensure your system has sufficient resources before proceeding.

Synthesize Founder Genomes

Before simulating individual cells, we need to create founder genomes for each clone in the lineage tree. These founder genomes represent the starting point for each clone and incorporate structural changes (e.g., copy number variations) and mutations that occurred during the evolution of the lineage tree. By synthesizing founder genomes first, we can efficiently simulate individual cells without recalculating shared evolutionary changes.

The synth_sc_tree_founder_genomes function is used to generate these founder genomes. It traverses the lineage tree in depth-first order, starting from the root node, and synthesizes genomes for each child node.

# Synthesize founder genomes
sc_founder_genomes <- synth_sc_tree_founder_genomes(
  tree = chisel_tree, 
  root_genome = genome_with_snp$sim_genome, 
  seg_list = sim_updates$all_node_segments, 
  mutation_info = sc_mut_table_with_nt, 
  cell_info = param2_dynamics_sim[[1]]$cell_info
)

Simulate Sequencing Reads

To simulate sequencing reads for the sampled cells, we use the simulate_sc_dynamic_reads_for_batches function. This function processes cells in batches to efficiently manage memory usage and computational resources. Before running this step, ensure that the ART (Huang et al., Bioinformatics 2012) is installed on your system.

# Define paths and parameters
art_path <- "/path_to_art/art_bin_MountRainier/art_illumina" 
depth <- 0.01  # Sequencing depth
readLen <- 150  # Read length
batch_size <- 10  # Number of cells to process per batch
n_cores <- 15  # Number of cores for parallel processing
output_dir <- paste0("/path/to/fq/output")  # Output directory for FASTQ files
fa_dir <- paste0("/path/to/fa")  # Directory for temporary FASTA files

# Create directories if they don't exist
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(fa_dir, recursive = TRUE, showWarnings = FALSE)

# Simulate sequencing reads
mutation_checks <- simulate_sc_dynamic_reads_for_batches(
  sampled_cell_idx = sampled_cell_idx,
  dynamics_ob = param2_dynamics_sim[[1]],
  sc_founder_genomes = sc_founder_genomes,
  all_sampled_sc_mutations = all_sampled_sc_mutations,
  sc_mut_table_with_nt = sc_mut_table_with_nt,
  sim_updates = sim_updates,
  fa_dir = fa_dir,
  output_dir = output_dir,
  art_path = art_path,
  n_cores = n_cores,
  depth = depth,
  readLen = readLen,
  otherArgs = "--noALN",  # Additional arguments for ART
  keep_genome_files = FALSE,  # Delete genome files after simulation
  batch_size = batch_size
)

The simulate_sc_dynamic_reads_for_batches function:

  1. Processes cells in batches to manage memory efficiently
  2. For each cell:
    • Synthesizes a cell-specific genome by:
      • Starting with the appropriate clone founder genome
      • Adding cell-specific mutations
    • Verifies that mutations were correctly introduced
    • Writes the genome to a FASTA file
  3. Simulates sequencing reads using the ART simulator in parallel
  4. Returns validation information for quality control

The resulting FASTQ files can be used for downstream analyses such as variant calling, phylogenetic reconstruction, or other single-cell sequencing applications.

Summary

This vignette has demonstrated how to use the cancerSimCraft package to simulate realistic single-cell sequencing data that captures the complexity of cancer evolution. The simulation process maintains biological realism by:

  1. Preserving clonal structure and lineage relationships
  2. Accurately modeling the accumulation of mutations during evolution
  3. Correctly handling recurrent mutations and structural variants
  4. Generating realistic sequencing reads that reflect these genomic features

This approach enables researchers to generate ground-truth datasets for benchmarking and validating single-cell analysis methods, or to explore hypothetical evolutionary scenarios in silico.