This vignette demonstrates how to generate a synthetic genome with single nucleotide polymorphisms (SNPs) using phased VCF files and a reference genome. The process involves reading VCF files, converting them into a list of SNPs, checking compatibility with the reference genome, inserting SNPs into the genome, and verifying the results.
Phased VCF files record each SNP's position, reference allele, and alternate allele, together with the haplotype that carries each allele. Here, previously saved VCF tables are loaded and each one is converted into a SNP list.
# Load libraries
library(Biostrings)
library(cancerSimCraft)
# Load phased VCF files
TN28N_vcf_list <- readRDS(file = "./tutorials_data/TN28N_vcf_list.rds")
# Define the sample name
SAMPLE_NAME <- "TN28N"
# Convert VCF files into a list of SNPs
TN28N_snp_list <- lapply(TN28N_vcf_list, function(vcf_table) {
  vcf_to_snp_list(vcf_table, sample_name = SAMPLE_NAME)
})
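To make the role of phasing concrete, the sketch below splits a phased genotype string such as `0|1` into its two haplotype alleles using base R only. The toy table, its column names, and the helper `split_phased_gt()` are hypothetical and are not part of cancerSimCraft or the format that `vcf_to_snp_list()` expects. Which side of the `|` is maternal or paternal depends on how the VCF was phased, so the neutral labels `hap1` and `hap2` are used.

```r
# Toy phased VCF records (hypothetical columns, not the cancerSimCraft format)
toy_vcf <- data.frame(
  pos = c(101L, 250L, 999L),
  ref = c("A", "G", "T"),
  alt = c("C", "T", "G"),
  gt  = c("0|1", "1|0", "1|1"),
  stringsAsFactors = FALSE
)

# Split "a|b" into two integer allele indices (0 = reference, 1 = alternate)
split_phased_gt <- function(gt) {
  parts <- strsplit(gt, "|", fixed = TRUE)
  data.frame(
    hap1 = as.integer(vapply(parts, `[`, character(1), 1)),
    hap2 = as.integer(vapply(parts, `[`, character(1), 2))
  )
}

hap_idx <- split_phased_gt(toy_vcf$gt)

# Translate allele indices into the nucleotide carried by each haplotype
toy_vcf$hap1_allele <- ifelse(hap_idx$hap1 == 1L, toy_vcf$alt, toy_vcf$ref)
toy_vcf$hap2_allele <- ifelse(hap_idx$hap2 == 1L, toy_vcf$alt, toy_vcf$ref)
toy_vcf
```

A heterozygous site (`0|1` or `1|0`) places the alternate allele on only one haplotype, while a homozygous alternate site (`1|1`) places it on both.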
The reference genome is loaded from gzipped FASTA files, one file per chromosome. This example uses chromosomes 1 to 3 of the UCSC hg19 assembly.
# Define chromosome indices
chr_indices <- 1:3
# Define file paths for the reference genome
file_paths <- paste0("./large_tutorials_data/ucsc_hg19_chr/chr", chr_indices, ".fa.gz")
# Load the reference genome
ref_genome <- readDNAStringSet(filepath = file_paths, format = "fasta")
Before introducing SNPs into the genome, it is important to verify that the reference nucleotides in the SNP list match the corresponding positions in the reference genome.
# Check compatibility between the reference genome and SNP list
check_ref_snp_match(
  seg_names = paste0("chr", chr_indices),
  snp_list = TN28N_snp_list,
  ref_genome = ref_genome
)
## [1] "seq_ref and snp_ref matches in chr1!"
## [1] "seq_ref and snp_ref matches in chr2!"
## [1] "seq_ref and snp_ref matches in chr3!"
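The idea behind this check can be illustrated on a toy sequence with base R. This is a minimal sketch under stated assumptions, not the implementation of `check_ref_snp_match()`: the sequence, the SNP table, and its column names are hypothetical, and positions are assumed to be 1-based as in VCF.

```r
# Toy reference sequence and SNP table (hypothetical, 1-based positions)
toy_ref <- "ACGTACGTAC"
toy_snps <- data.frame(
  pos = c(2L, 5L, 8L),
  ref = c("C", "A", "T"),
  stringsAsFactors = FALSE
)

# Extract the base found in the sequence at each SNP position
seq_ref <- substring(toy_ref, toy_snps$pos, toy_snps$pos)

# Collect positions where the SNP table disagrees with the sequence
mismatch <- toy_snps$pos[seq_ref != toy_snps$ref]

if (length(mismatch) == 0) {
  message("All reference alleles match the toy sequence.")
} else {
  message("Mismatch at positions: ", paste(mismatch, collapse = ", "))
}
```

In practice, a mismatch at this stage often means that the VCF and the FASTA files come from different genome builds or use different coordinate conventions, so it is worth resolving before any SNP is inserted.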
The SNPs are then inserted into the reference genome, producing a synthetic genome with separate maternal and paternal haplotypes.
# Insert SNPs into the reference genome
genome_with_snp <- insert_snps_to_genome(
  seg_names = paste0("chr", chr_indices),
  snp_list = TN28N_snp_list,
  ref_genome = ref_genome
)
## [1] "Insert snp to the chr1 of the maternal genome."
## [1] "Insert snp to the chr1 of the paternal genome."
## [1] "Insert snp to the chr2 of the maternal genome."
## [1] "Insert snp to the chr2 of the paternal genome."
## [1] "Insert snp to the chr3 of the maternal genome."
## [1] "Insert snp to the chr3 of the paternal genome."
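Conceptually, each haplotype is a copy of the reference in which only the SNPs assigned to that haplotype are substituted. The base R sketch below shows this on a toy sequence. The helper `apply_snps()`, the table layout, and the `hap1`/`hap2` flags are hypothetical illustrations and do not reflect how `insert_snps_to_genome()` is implemented.

```r
# Toy reference and phased SNPs (hypothetical layout, 1-based positions)
toy_ref <- "ACGTACGTAC"
toy_snps <- data.frame(
  pos  = c(2L, 5L, 8L),
  alt  = c("T", "G", "A"),
  hap1 = c(TRUE, FALSE, TRUE),   # SNP carried by haplotype 1?
  hap2 = c(FALSE, TRUE, TRUE),   # SNP carried by haplotype 2?
  stringsAsFactors = FALSE
)

# Substitute single bases at the given positions and return the new sequence
apply_snps <- function(seq, pos, alt) {
  bases <- strsplit(seq, "", fixed = TRUE)[[1]]
  bases[pos] <- alt
  paste(bases, collapse = "")
}

# Build each haplotype from the same reference, using only its own SNPs
hap1_seq <- apply_snps(toy_ref,
                       toy_snps$pos[toy_snps$hap1],
                       toy_snps$alt[toy_snps$hap1])
hap2_seq <- apply_snps(toy_ref,
                       toy_snps$pos[toy_snps$hap2],
                       toy_snps$alt[toy_snps$hap2])

hap1_seq
hap2_seq
```

Because SNPs are single-base substitutions, both haplotypes keep the length of the reference, so downstream coordinates remain valid.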
After inserting SNPs, it is important to verify that the alternate alleles in the SNP list match the corresponding positions in the synthetic genome.
# Verify SNP insertion
check_alt_snp_match(
  seg_names = paste0("chr", chr_indices),
  sim_genome = genome_with_snp$sim_genome,
  snp_list = TN28N_snp_list
)
## [1] "seq_alt and snp_alt matches in maternal chr1!"
## [1] "seq_alt and snp_alt matches in paternal chr1!"
## [1] "seq_alt and snp_alt matches in maternal chr2!"
## [1] "seq_alt and snp_alt matches in paternal chr2!"
## [1] "seq_alt and snp_alt matches in maternal chr3!"
## [1] "seq_alt and snp_alt matches in paternal chr3!"
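A complementary sanity check, sketched below on toy data with base R, is to confirm not only that the alternate alleles are present at the SNP positions but also that no other position differs from the reference. All names and values here are hypothetical, and this additional check is an illustration rather than something `check_alt_snp_match()` is known to perform.

```r
# Toy reference and one toy haplotype carrying SNPs at positions 2 and 8
toy_ref <- "ACGTACGTAC"
toy_hap <- "ATGTACGAAC"
snp_pos <- c(2L, 8L)
snp_alt <- c("T", "A")

ref_bases <- strsplit(toy_ref, "", fixed = TRUE)[[1]]
hap_bases <- strsplit(toy_hap, "", fixed = TRUE)[[1]]

# 1) The alternate alleles are present at the SNP positions
alt_ok <- all(hap_bases[snp_pos] == snp_alt)

# 2) The haplotype differs from the reference only at the SNP positions
diff_pos <- which(ref_bases != hap_bases)
no_extra_changes <- setequal(diff_pos, snp_pos)

alt_ok
no_extra_changes
```

The second condition holds only when every listed alternate allele differs from its reference base, which is the usual case for a SNP record.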
# Save complete environment for full reproducibility
save.image(file = "./large_tutorials_data/advanced_5_backbone_genome_with_snp.RData")