This vignette demonstrates how to simulate single-cell sequencing
data at clonal resolution using
the merge_fasta_files and simulate_read_sc functions.
The workflow merges the paternal and maternal FASTA files for each
clone and then simulates single-cell sequencing reads with the ART
read simulator (Huang et al., Bioinformatics 2012).
Load the required libraries and the simulation settings generated in
advanced_4_simulation_setting.
# Load libraries
library(parallel)
library(tictoc)  # provides tic() and toc(), used for timing below
library(cancerSimCraft)
# Load the simulation settings
load("./tutorials_data/advanced_4_simulation_setting.RData")
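Because load() restores objects silently, it can help to confirm that the objects used later (such as clone_names) are actually present. The following is a minimal sketch, not part of cancerSimCraft: load_settings is a hypothetical helper, and the temporary file and clone names "A" and "B" are made up for the demonstration.

```r
# Hypothetical helper: load an .RData file into its own environment and
# stop early if any object needed by later steps is absent.
load_settings <- function(path, required = "clone_names") {
  env <- new.env()
  loaded <- load(path, envir = env)  # load() returns the names it restored
  missing_objs <- setdiff(required, loaded)
  if (length(missing_objs) > 0) {
    stop("Missing objects in ", path, ": ",
         paste(missing_objs, collapse = ", "))
  }
  env
}

# Self-contained demonstration with a temporary file (illustrative data only)
demo_file <- tempfile(fileext = ".RData")
demo_env <- new.env()
assign("clone_names", c("A", "B"), envir = demo_env)
save(list = "clone_names", file = demo_file, envir = demo_env)

settings <- load_settings(demo_file)
print(settings$clone_names)
```

With the real settings file, the same helper could be called on "./tutorials_data/advanced_4_simulation_setting.RData" before continuing.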
For each clone, we merge the paternal and maternal FASTA files into a
single file using the merge_fasta_files function.
# Merge FASTA files for each clone
for (clone in clone_names) {
  merge_fasta_files(
    paternal_fa = paste0("./large_tutorials_data/clone_", clone, "_paternal.fa"),
    maternal_fa = paste0("./large_tutorials_data/clone_", clone, "_maternal.fa"),
    output_fa = paste0("./large_tutorials_data/clone_", clone, "_merged.fa"),
    tmp_dir = "./large_tutorials_data/"
  )
}
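Since the merge step reads two large FASTA files per clone, a quick check that every input exists can save a failed run. This is a sketch under stated assumptions: missing_clone_fastas is a hypothetical helper (not a cancerSimCraft function), it only mirrors the file-name pattern used in the loop above, and the demonstration uses a temporary directory with made-up clones "A" and "B".

```r
# Hypothetical helper: return the paternal/maternal FASTA paths that do not
# exist for the given clones, following the clone_<name>_<parent>.fa pattern.
missing_clone_fastas <- function(clones, data_dir = "./large_tutorials_data") {
  paths <- c(
    file.path(data_dir, paste0("clone_", clones, "_paternal.fa")),
    file.path(data_dir, paste0("clone_", clones, "_maternal.fa"))
  )
  paths[!file.exists(paths)]
}

# Demonstration: clone B deliberately lacks its maternal FASTA file
demo_dir <- tempfile("fasta_demo_")
dir.create(demo_dir)
writeLines(c(">chr1", "ACGT"), file.path(demo_dir, "clone_A_paternal.fa"))
writeLines(c(">chr1", "ACGT"), file.path(demo_dir, "clone_A_maternal.fa"))
writeLines(c(">chr1", "ACGT"), file.path(demo_dir, "clone_B_paternal.fa"))

missing <- missing_clone_fastas(c("A", "B"), data_dir = demo_dir)
print(basename(missing))
```

In the tutorial itself, calling missing_clone_fastas(clone_names) before the merge loop should return an empty character vector when all inputs are in place.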
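Next, we simulate single-cell sequencing reads using
the simulate_read_sc function. We specify the sequencing
depth (depth), the read length in base pairs (readLen), the number of
cells to simulate per clone (sim_clone_num), and the number of cores
(n_cores). Reads are single-end (paired = FALSE), and the --noALN
option tells ART not to write its alignment (ALN) output files.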
# Path to the ART executable (adjust to your installation)
art_path <- "~/postdoc_project/cancerSimCraft/tutorials/softwares/art_bin_MountRainier/art_illumina"
depth <- 0.01    # sequencing depth
readLen <- 150   # read length in bp
# Number of cells to simulate for each clone
sim_clone_num <- c(5, 6, 3, 2, 5, 3, 2)
names(sim_clone_num) <- clone_names
n_cores <- 2
print("sim_clone_num:")
print(sim_clone_num)
# Time the whole simulation, and each clone separately
tic(paste0("Simulation of ", sum(sim_clone_num), " Single Cell Fastq Files with ", n_cores, " Cores"))
for (clone in clone_names) {
  tic(paste0("Simulation of ", sim_clone_num[clone], " Fastq Files from Clone ", clone))
  simulate_read_sc(
    fasta_input = paste0("./large_tutorials_data/clone_", clone, "_merged.fa"),
    # One output prefix per simulated cell of this clone
    output_prefixes = paste0("large_tutorials_data/sc_reads_clonal/clone_", clone, "_", 1:sim_clone_num[clone]),
    depth = depth,
    readLen = readLen,
    artPath = art_path,
    paired = FALSE,
    numCores = n_cores,
    otherArgs = "--noALN"
  )
  toc()
}
toc()
print("All simulation finished!")
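To get a feel for the data volume these settings imply, the expected number of reads can be estimated as coverage × genome length / read length. The sketch below is illustrative only: expected_read_count is a hypothetical helper, the merged genome size of 6e9 bp is an assumed round figure (the actual size depends on the simulated clone genomes), and it assumes depth is interpreted as fold coverage of the merged FASTA.

```r
# Hypothetical helper: approximate number of reads needed to reach a given
# fold coverage of a genome with fixed-length reads.
expected_read_count <- function(genome_bp, depth, read_len) {
  round(genome_bp * depth / read_len)
}

# Assumption for illustration: merged paternal + maternal genome of 6e9 bp
merged_genome_bp <- 6e9
reads_per_cell <- expected_read_count(merged_genome_bp, depth = 0.01, read_len = 150)

# Cells per clone as set in the tutorial (26 cells in total)
sim_clone_num_demo <- c(5, 6, 3, 2, 5, 3, 2)
total_reads <- reads_per_cell * sum(sim_clone_num_demo)

print(reads_per_cell)  # 400000 reads per cell under these assumptions
print(total_reads)     # 10400000 reads across all 26 cells
```

Raising depth or the number of cells scales the read count, and hence runtime and disk usage, roughly linearly.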
After generating the FASTQ files, users can process the simulated
reads using standard sequence processing tools such
as bowtie2 (Langmead and Salzberg, Nature Methods 2012)
or bwa (Li and Durbin, Bioinformatics 2009) for alignment,
followed by samtools (Danecek et al., GigaScience 2021) for
SAM/BAM file manipulation. Once the reads are aligned, downstream
analysis tools can be applied for copy number (CN) analysis, variant
calling, or other genomic analyses.
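As one possible way to script the alignment step from R, the sketch below assembles bowtie2 and samtools command lines for a single-end FASTQ file. Everything specific here is an assumption for illustration: build_alignment_cmds is a hypothetical helper, and the FASTQ name, index prefix, and output prefix are placeholders that must be replaced with real paths. Only the standard bowtie2 options -p, -x, -U, and -S and the samtools sort and index subcommands are used.

```r
# Hypothetical helper: build the commands to align one single-end FASTQ file
# with bowtie2, then sort and index the result with samtools.
build_alignment_cmds <- function(fastq, index_prefix, out_prefix, threads = 2) {
  sam <- paste0(out_prefix, ".sam")
  bam <- paste0(out_prefix, ".sorted.bam")
  list(
    align = c("bowtie2", "-p", threads, "-x", index_prefix, "-U", fastq, "-S", sam),
    sort  = c("samtools", "sort", "-o", bam, sam),
    index = c("samtools", "index", bam)
  )
}

# Placeholder paths (assumptions, not files produced by this tutorial)
cmds <- build_alignment_cmds(
  fastq = "sc_reads_clonal/clone_A_1.fq",
  index_prefix = "ref/genome_index",
  out_prefix = "aln/clone_A_1"
)

# Print the commands for inspection
for (cmd in cmds) writeLines(paste(cmd, collapse = " "))

# Set to TRUE only when bowtie2, samtools, and the input files are available
run_cmds <- FALSE
if (run_cmds) {
  for (cmd in cmds) system2(cmd[1], args = shQuote(cmd[-1]))
}
```

The -U option matches the single-end reads produced with paired = FALSE; paired-end data would use bowtie2's -1 and -2 options instead.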