Simulating Single-Cell Genome and Sequencing Reads (Dynamics Incorporated)

Overview

This vignette demonstrates how to simulate realistic single-cell sequencing reads using the cancerSimCraft package. The simulation process models the complexities of cancer evolution by incorporating:

Cell lineage dynamics and clonal evolution
Single nucleotide variants (SNVs)
Structural variants (SVs)
Recurrent mutations
Sequencing errors and biases

By the end of this tutorial, you will have simulated FASTQ files that mimic real sequencing data from individual cancer cells, with known ground truth mutations and structural variations.

Load Required Libraries and Data

Load the necessary libraries and the pre-processed data.

# Load required libraries
library(parallel)
library(tictoc)
library(Biostrings)
library(cancerSimCraft)

# Load pre-processed data
load(file = "./tutorials_data/advanced_8_dynamics_simulation.RData")

Sample Cells for Simulation

To mimic a real single-cell sequencing experiment, we randomly sample a subset of cells from the simulated population for further analysis. In practice, only a fraction of cells are typically sequenced due to cost and technical limitations. Here, we sample 100 cells from our simulation to represent this subset. param2_dynamics_sim[[1]] is generated during the dynamics analysis in the advanced_8_dynamics_simulation

# Sample 100 cells from the simulated population
sampled_cell_info <- sample_cells(param2_dynamics_sim[[1]]$cell_info, num_samples = 100)
sampled_cell_idx <- sampled_cell_info$cell_index

The sample_cells function randomly selects living cells from the simulation data. The returned data frame contains full lineage information for each sampled cell, including:

Clone type
Parent cell
Birth time
Clone index
Other lineage details

Retrieve Mutations for Sampled Cells

Retrieve mutations for the sampled cells using the get_mutations_sc function.

# Retrieve mutations for the sampled cells
all_sampled_sc_mutations <- get_mutations_sc(
  cell_info = param2_dynamics_sim[[1]]$cell_info,
  mutation_info = param2_dynamics_sim[[1]]$mutation_info,
  sampled_cell_idx = sampled_cell_idx
)

# View the mutation table
head(all_sampled_sc_mutations$sampled_mutation_table)

# View mutations for the first sampled cell
all_sampled_sc_mutations$sampled_sc_mutations[[1]]

The get_mutations_sc function:

Retrieves all mutations for each sampled cell, including those in the cell’s ancestral lineage
Identifies and tracks recurrent mutations at the same genomic location
Returns both a combined mutation table and cell-specific mutation lists

Simulate Nucleotide Changes for Sampled Cells

To simulate the specific nucleotide changes for each mutation, we use the sim_snv_nt_sc function. This function introduces single nucleotide variants (SNVs) into the genome based on the sampled mutation table and a specified nucleotide transition model. Here, we use the Kimura 2-Parameter (K2P) substitution model, which is commonly used in evolutionary biology to model nucleotide substitutions.

# Define the Kimura 2-Parameter (K2P) transition matrix
alpha <- 1.21
beta <- 1
K2P_transition_matrix <- generate_K2P_matrix(alpha, beta)

# Simulate SNVs for the sampled cells
sc_mut_table_with_nt <- sim_snv_nt_sc(
  genome_sequence = genome_with_snp$sim_genome,
  seg_list = sim_updates$all_node_segments,
  mutation_table = all_sampled_sc_mutations$sampled_mutation_table,
  recurrent_mutation_tracker = all_sampled_sc_mutations$recurrent_mutation_tracker,
  nt_transition_matrix = K2P_transition_matrix
)

The resulting mutation table now contains:

Original nucleotide at each mutation site
Alternative (mutated) nucleotide
Segment ID and reference position information
All original mutation metadata

Note on genome_with_snp$sim_genome

The genome_with_snp$sim_genome object is a simulated genome that was created in the advanced_5_backbone_genome_with_snp vignette. In that vignette, only chromosomes 1, 2, and 3 (chr1, chr2, and chr3) were generated for demonstration purposes.

For a complete run of this vignette, we recommend rerunning the advanced_5_backbone_genome_with_snp vignette with the following modification:

Set chr_indices = 1:22 to generate a whole genome (chromosomes 1 through 22).

Important: Generating a whole genome sets of the lineage tree requires at least 100 GB of memory. Ensure your system has sufficient resources before proceeding.

Synthesize Founder Genomes

Before simulating individual cells, we need to create founder genomes for each clone in the lineage tree. These founder genomes represent the starting point for each clone and incorporate structural changes (e.g., copy number variations) and mutations that occurred during the evolution of the lineage tree. By synthesizing founder genomes first, we can efficiently simulate individual cells without recalculating shared evolutionary changes.

The synth_sc_tree_founder_genomes function is used to generate these founder genomes. It traverses the lineage tree in depth-first order, starting from the root node, and synthesizes genomes for each child node.

# Synthesize founder genomes
sc_founder_genomes <- synth_sc_tree_founder_genomes(
  tree = chisel_tree, 
  root_genome = genome_with_snp$sim_genome, 
  seg_list = sim_updates$all_node_segments, 
  mutation_info = sc_mut_table_with_nt, 
  cell_info = param2_dynamics_sim[[1]]$cell_info
)

Simulate Sequencing Reads

To simulate sequencing reads for the sampled cells, we use the simulate_sc_dynamic_reads_for_batches function. This function processes cells in batches to efficiently manage memory usage and computational resources. Before running this step, ensure that the ART (Huang et al., Bioinformatics 2012) is installed on your system.

# Define paths and parameters
art_path <- "/path_to_art/art_bin_MountRainier/art_illumina" 
depth <- 0.01  # Sequencing depth
readLen <- 150  # Read length
batch_size <- 10  # Number of cells to process per batch
n_cores <- 15  # Number of cores for parallel processing
output_dir <- paste0("/path/to/fq/output")  # Output directory for FASTQ files
fa_dir <- paste0("/path/to/fa")  # Directory for temporary FASTA files

# Create directories if they don't exist
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(fa_dir, recursive = TRUE, showWarnings = FALSE)

# Simulate sequencing reads
mutation_checks <- simulate_sc_dynamic_reads_for_batches(
  sampled_cell_idx = sampled_cell_idx,
  dynamics_ob = param2_dynamics_sim[[1]],
  sc_founder_genomes = sc_founder_genomes,
  all_sampled_sc_mutations = all_sampled_sc_mutations,
  sc_mut_table_with_nt = sc_mut_table_with_nt,
  sim_updates = sim_updates,
  fa_dir = fa_dir,
  output_dir = output_dir,
  art_path = art_path,
  n_cores = n_cores,
  depth = depth,
  readLen = readLen,
  otherArgs = "--noALN",  # Additional arguments for ART
  keep_genome_files = FALSE,  # Delete genome files after simulation
  batch_size = batch_size
)

The simulate_sc_dynamic_reads_for_batches function:

Processes cells in batches to manage memory efficiently
For each cell:
- Synthesizes a cell-specific genome by:
  - Starting with the appropriate clone founder genome
  - Adding cell-specific mutations
- Verifies that mutations were correctly introduced
- Writes the genome to a FASTA file
Simulates sequencing reads using the ART simulator in parallel
Returns validation information for quality control

The resulting FASTQ files can be used for downstream analyses such as variant calling, phylogenetic reconstruction, or other single-cell sequencing applications.