Introduction

This vignette demonstrates how to simulate single-cell sequencing data at clonal resolution using the merge_fasta_files and simulate_read_sc functions. The workflow merges the paternal and maternal FASTA files for each clone and then simulates single-cell sequencing reads with the ART read simulator (Huang et al., Bioinformatics 2012).

Loading Libraries and Data

Load the required libraries and the simulation settings generated in advanced_4_simulation_setting.

# Load libraries
library(parallel)
library(tictoc)  # provides tic() and toc(), used for timing below
library(cancerSimCraft)

# Load the simulation settings
load("./tutorials_data/advanced_4_simulation_setting.RData")

Merging Paternal and Maternal FASTA Files

For each clone, we merge the paternal and maternal FASTA files into a single file using the merge_fasta_files function.

# Merge FASTA files for each clone
for (clone in clone_names) {
  merge_fasta_files(
    paternal_fa = paste0("./large_tutorials_data/clone_", clone, "_paternal.fa"),
    maternal_fa = paste0("./large_tutorials_data/clone_", clone, "_maternal.fa"),
    output_fa = paste0("./large_tutorials_data/clone_", clone, "_merged.fa"),
    tmp_dir = "./large_tutorials_data/"
  )
}
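
Before moving on, it can help to confirm that every clone has a non-empty merged FASTA file. The sketch below is only an illustration: check_merged_fasta is a hypothetical helper (not part of cancerSimCraft), and it assumes the clone_<name>_merged.fa naming used in the loop above.

```r
# Hypothetical helper (not part of cancerSimCraft): report, per clone,
# whether the merged FASTA file exists and is non-empty.
check_merged_fasta <- function(clone_names, data_dir = "./large_tutorials_data") {
  # Assumes the file naming pattern used in the merging loop
  merged_paths <- file.path(data_dir, paste0("clone_", clone_names, "_merged.fa"))
  file_sizes <- file.size(merged_paths)  # NA when a file does not exist
  is_ok <- !is.na(file_sizes) & file_sizes > 0
  names(is_ok) <- clone_names
  is_ok
}

# Example usage with the vignette's clone_names:
# merged_ok <- check_merged_fasta(clone_names)
# if (!all(merged_ok)) stop("Missing merged FASTA for: ",
#                           paste(names(merged_ok)[!merged_ok], collapse = ", "))
```

Returning a named logical vector rather than stopping immediately makes it easy to see which clones need to be re-merged.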

Simulating Single-Cell Sequencing Reads

Next, we simulate single-cell sequencing reads using the simulate_read_sc function. We specify the sequencing depth, read length, and other parameters.

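# Path to the art_illumina binary; adjust to your local ART installation
art_path <- "~/postdoc_project/cancerSimCraft/tutorials/softwares/art_bin_MountRainier/art_illumina"
depth <- 0.01   # sequencing depth
readLen <- 150  # read length in bp

# Number of single-cell FASTQ files to simulate from each clone
sim_clone_num <- c(5, 6, 3, 2, 5, 3, 2)
names(sim_clone_num) <- clone_names

# Number of CPU cores used for the simulation
n_cores <- 2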

print("sim_clone_num:")
print(sim_clone_num)

# Make sure the output directory exists before FASTQ files are written
dir.create("./large_tutorials_data/sc_reads_clonal", recursive = TRUE, showWarnings = FALSE)

tic(paste0("Simulation of ", sum(sim_clone_num), " Single Cell Fastq Files with ", n_cores, " Cores"))
for(clone in clone_names){
  tic(paste0("Simulation of ", sim_clone_num[clone], " Fastq Files from Clone ", clone))
  simulate_read_sc(fasta_input = paste0("./large_tutorials_data/clone_", clone, "_merged.fa"),
                   output_prefixes = paste0("./large_tutorials_data/sc_reads_clonal/clone_", clone, "_", seq_len(sim_clone_num[[clone]])),
                   depth = depth,
                   readLen = readLen,
                   artPath = art_path,
                   paired = FALSE,
                   numCores = n_cores,
                   otherArgs = "--noALN")
  toc()
}
toc()

print("All simulations finished!")
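
As a rough sanity check on the depth setting, the expected number of reads per cell can be estimated from the standard fold-coverage relation, coverage = reads × read length / genome length. The sketch below assumes that depth is interpreted as fold coverage over the merged FASTA (ART's --fcov option uses this definition, but whether simulate_read_sc passes depth through unchanged is an assumption). expected_reads is a hypothetical helper, and the genome length in the usage comment is an illustrative value, not one taken from this tutorial.

```r
# Hypothetical helper: expected number of single-end reads for a given
# fold coverage, from coverage = n_reads * read_len / genome_len.
expected_reads <- function(depth, genome_len, read_len) {
  round(depth * genome_len / read_len)
}

# Example usage, assuming a merged (paternal + maternal) FASTA of about
# 6.2e9 bases; the true length depends on the clone's simulated genome:
# expected_reads(depth = 0.01, genome_len = 6.2e9, read_len = 150)
```

Under these assumptions, a depth of 0.01 with 150 bp reads corresponds to roughly 4 × 10^5 reads per cell, which gives a quick way to check the size of the resulting FASTQ files.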

After generating the FASTQ files, users can process the simulated reads using standard sequence processing tools such as bowtie2 (Langmead and Salzberg, Nature Methods 2012) or bwa (Li and Durbin, Bioinformatics 2009) for alignment, followed by samtools (Danecek et al., GigaScience 2021) for SAM/BAM file manipulation. Once the reads are aligned, downstream analysis tools can be applied for copy number (CN) analysis, variant calling, or other genomic analyses.
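
The alignment step can be driven from R by assembling a shell command and running it with system(). The sketch below is one possible setup, not the tutorial's prescribed pipeline: build_align_cmd is a hypothetical helper, it assumes bwa and samtools are on the PATH, that the reference has already been indexed with bwa index, and that each simulated cell's reads are in a single-end FASTQ file.

```r
# Hypothetical helper: build a "bwa mem | samtools sort" command line
# for one single-end FASTQ file. The command is only constructed here;
# nothing is executed until it is passed to system().
build_align_cmd <- function(fastq, reference, bam_out, threads = 2) {
  threads <- as.integer(threads)
  sprintf("bwa mem -t %d %s %s | samtools sort -@ %d -o %s -",
          threads, shQuote(reference), shQuote(fastq),
          threads, shQuote(bam_out))
}

# Example usage (file names are placeholders):
# cmd <- build_align_cmd("cell_1.fq", "reference.fa", "cell_1.sorted.bam")
# system(cmd)
# system(paste("samtools index", shQuote("cell_1.sorted.bam")))
```

Building the command as a string first makes it easy to print and inspect before running it across all simulated cells.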