Introduction

This vignette demonstrates how to build a synthetic genome carrying single nucleotide polymorphisms (SNPs) from phased VCF files and a reference genome. The workflow reads the VCF data, converts it into a list of SNPs, checks that the list is compatible with the reference genome, inserts the SNPs into the genome, and verifies the result.

Load Phased VCF Files

Phased VCF files record SNP positions, reference alleles, and alternate alleles, and assign each allele to a haplotype. Here the VCF tables are read from an RDS file and converted into SNP lists.

# Load libraries
library(Biostrings)
library(cancerSimCraft)

# Load phased VCF files
TN28N_vcf_list <- readRDS(file = "./tutorials_data/TN28N_vcf_list.rds")

# Define the sample name
SAMPLE_NAME <- "TN28N"

# Convert VCF files into a list of SNPs
TN28N_snp_list <- lapply(TN28N_vcf_list, function(vcf_table) {
  vcf_to_snp_list(vcf_table, sample_name = SAMPLE_NAME)
})
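
To illustrate what a phased record carries, the sketch below splits phased genotype strings such as `0|1` into per-haplotype alleles using base R. The table layout, the column names, and the convention that the first allele is maternal are assumptions made for illustration; they are not taken from `vcf_to_snp_list()`, whose internals are not shown in this vignette.

```r
# Toy phased VCF-like table (hypothetical columns, not the cancerSimCraft format)
toy_vcf <- data.frame(
  pos = c(101L, 250L, 777L),
  ref = c("A", "G", "T"),
  alt = c("C", "T", "G"),
  gt  = c("0|1", "1|0", "1|1"),
  stringsAsFactors = FALSE
)

# Split each "a|b" genotype into two allele indices (0 = reference, 1 = alternate)
gt_parts <- do.call(rbind, strsplit(toy_vcf$gt, "|", fixed = TRUE))

# Assumption: first allele = maternal haplotype, second allele = paternal haplotype
toy_vcf$maternal <- ifelse(gt_parts[, 1] == "1", toy_vcf$alt, toy_vcf$ref)
toy_vcf$paternal <- ifelse(gt_parts[, 2] == "1", toy_vcf$alt, toy_vcf$ref)

print(toy_vcf[, c("pos", "ref", "alt", "gt", "maternal", "paternal")])
```

The `|` separator is what marks a genotype as phased; an unphased genotype written with `/` would not say which haplotype carries the alternate allele.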

Load the Reference Genome

The reference genome is loaded from gzip-compressed FASTA files, one per chromosome (hg19 chromosomes 1 to 3 in this example).

# Define chromosome indices
chr_indices <- 1:3

# Define file paths for the reference genome
file_paths <- paste0("./large_tutorials_data/ucsc_hg19_chr/chr", chr_indices, ".fa.gz")

# Load the reference genome
ref_genome <- readDNAStringSet(filepath = file_paths, format = "fasta")
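
Reading a FASTA file from a path that does not exist will typically stop with an error, so it can help to check the paths first. The following base R sketch rebuilds the same paths and reports any that are missing; the check itself is an optional addition, not part of the cancerSimCraft workflow.

```r
# Rebuild the per-chromosome paths used in this tutorial
chr_indices <- 1:3
file_paths <- paste0("./large_tutorials_data/ucsc_hg19_chr/chr", chr_indices, ".fa.gz")

# Collect any paths that are not present on disk
missing_paths <- file_paths[!file.exists(file_paths)]

if (length(missing_paths) > 0) {
  message("Missing FASTA files:\n", paste(missing_paths, collapse = "\n"))
} else {
  message("All ", length(file_paths), " FASTA files found.")
}
```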

Check Reference Genome and SNP List Compatibility

Before introducing SNPs into the genome, it is important to verify that the reference nucleotides in the SNP list match the corresponding positions in the reference genome.

# Check compatibility between the reference genome and SNP list
check_ref_snp_match(
  seg_names = paste0('chr', chr_indices),
  snp_list = TN28N_snp_list,
  ref_genome = ref_genome
)
## [1] "seq_ref and snp_ref matches in chr1!"
## [1] "seq_ref and snp_ref matches in chr2!"
## [1] "seq_ref and snp_ref matches in chr3!"
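
The idea behind this check can be sketched on toy data in base R: extract the base at each SNP position and compare it with the reference allele recorded for that SNP. The sequence, the table, and its column names are hypothetical, and this is not the implementation of `check_ref_snp_match()`.

```r
# Toy reference sequence and SNP table (hypothetical data)
toy_ref <- "ACGTACGTAC"
toy_snps <- data.frame(
  pos = c(2L, 5L, 9L),
  ref = c("C", "A", "A"),
  alt = c("T", "G", "C"),
  stringsAsFactors = FALSE
)

# Extract the sequence base at each SNP position (1-based coordinates)
seq_ref <- substring(toy_ref, toy_snps$pos, toy_snps$pos)

# Positions where the SNP table disagrees with the sequence
mismatch <- toy_snps$pos[seq_ref != toy_snps$ref]

if (length(mismatch) == 0) {
  message("All reference alleles match the sequence.")
} else {
  message("Mismatched positions: ", paste(mismatch, collapse = ", "))
}
```

In practice, mismatches at this stage often point to a genome build that differs from the one used for variant calling, or to an off-by-one coordinate convention.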

Insert SNPs into the Genome

The SNPs are then inserted into the reference genome, producing a synthetic genome with separate maternal and paternal haplotypes.

# Insert SNPs into the reference genome
genome_with_snp <- insert_snps_to_genome(
  seg_names = paste0('chr', chr_indices),
  snp_list = TN28N_snp_list,
  ref_genome = ref_genome
)
## [1] "Insert snp to the chr1 of the maternal genome."
## [1] "Insert snp to the chr1 of the paternal genome."
## [1] "Insert snp to the chr2 of the maternal genome."
## [1] "Insert snp to the chr2 of the paternal genome."
## [1] "Insert snp to the chr3 of the maternal genome."
## [1] "Insert snp to the chr3 of the paternal genome."
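
As a minimal sketch of what haplotype-specific insertion means, the base R example below substitutes alternate alleles into two copies of a toy sequence. The helper `insert_snps()` and the toy tables are hypothetical and only illustrate the concept; `insert_snps_to_genome()` may work differently internally.

```r
# Hypothetical helper: replace the bases at `pos` with the alleles in `alt`
insert_snps <- function(seq, pos, alt) {
  bases <- strsplit(seq, "")[[1]]   # split the sequence into single bases
  bases[pos] <- alt                 # substitute alternate alleles (1-based positions)
  paste(bases, collapse = "")
}

toy_ref <- "ACGTACGTAC"

# Toy SNPs assigned to each haplotype (hypothetical data)
maternal_snps <- data.frame(pos = c(2L, 9L), alt = c("T", "C"), stringsAsFactors = FALSE)
paternal_snps <- data.frame(pos = 5L, alt = "G", stringsAsFactors = FALSE)

# Each haplotype starts from the same reference and receives its own SNPs
toy_genome <- list(
  maternal = insert_snps(toy_ref, maternal_snps$pos, maternal_snps$alt),
  paternal = insert_snps(toy_ref, paternal_snps$pos, paternal_snps$alt)
)

print(toy_genome)
```

Because SNPs are single-base substitutions, both haplotypes keep the length of the reference sequence, so coordinates remain comparable across them.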

Verify SNP Insertion

After inserting SNPs, it is important to verify that the alternate alleles in the SNP list match the corresponding positions in the synthetic genome.

# Verify SNP insertion
check_alt_snp_match(
  seg_names = paste0('chr', chr_indices),
  sim_genome = genome_with_snp$sim_genome,
  snp_list = TN28N_snp_list
)
## [1] "seq_alt and snp_alt matches in maternal chr1!"
## [1] "seq_alt and snp_alt matches in paternal chr1!"
## [1] "seq_alt and snp_alt matches in maternal chr2!"
## [1] "seq_alt and snp_alt matches in paternal chr2!"
## [1] "seq_alt and snp_alt matches in maternal chr3!"
## [1] "seq_alt and snp_alt matches in paternal chr3!"
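
The verification above mirrors the earlier reference check, except that the comparison is against the alternate allele in the modified sequence. A toy base R sketch, using hypothetical data and not the implementation of `check_alt_snp_match()`:

```r
# Toy haplotype sequence after SNP insertion, and the SNPs it should carry
toy_sim <- "ATGTACGTCC"
toy_snps <- data.frame(
  pos = c(2L, 9L),
  alt = c("T", "C"),
  stringsAsFactors = FALSE
)

# Extract the base now present at each SNP position (1-based coordinates)
seq_alt <- substring(toy_sim, toy_snps$pos, toy_snps$pos)

# TRUE only if every position carries its expected alternate allele
all_match <- all(seq_alt == toy_snps$alt)
message("All alternate alleles present: ", all_match)
```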

# Save the complete environment for full reproducibility
save.image(file = "./large_tutorials_data/advanced_5_backbone_genome_with_snp.RData")