This tutorial demonstrates how to map genomic regions originally defined in the hg38 reference genome to the hg19 reference genome and derive a simulation setting based on hg19. This is particularly useful when working with genomic data from different reference genomes, such as copy number variations (CNVs) or other genomic features.
Input Data:
sorted_regions: Genomic regions defined in hg38
(advanced_2_pattern_extraction).
hg38_cytoband and hg19_cytoband:
Cytoband information for hg38 and hg19, respectively.
Workflow:
Map genomic regions from hg38 to hg19 using cytobands.
Derive simulation regions based on hg19 coordinates.
Output:
A mapping of genomic regions to hg19 cytobands.
A simulation setup based on hg19 coordinates.
First, load the genomic regions (sorted_regions) and
cytoband information for hg38 and hg19.
# Load preprocessed data from previous steps
load("./tutorials_data/advanced_2_pattern_extraction.RData")
library(cancerSimCraft)
# load hg19 cytoband info
hg19_cytoband <- read.table(file = "./tutorials_data/ucsc_cytoband_hg19.tsv", sep = "\t", header = TRUE)
hg38_cytoband <- read.table(file = "./tutorials_data/ucsc_cytoband_hg38.tsv", sep = "\t", header = TRUE)
Map each genomic region in sorted_regions to its
corresponding cytobands in hg38 and hg19.
# Initialize a list to store the mapping
cytoband_mapping <- list()
# Loop through each genomic region
for (i in 1:nrow(sorted_regions)) {
# Extract region information
quant_start <- sorted_regions[i, 'START']
quant_end <- sorted_regions[i, 'END']
quant_chrom <- sorted_regions[i, 'CHR']
# Find overlapping cytobands in hg38
cytoband_mapping[['hg38']][[i]] <- hg38_cytoband[which(
(hg38_cytoband$chrom == quant_chrom) &
(hg38_cytoband$chromStart < quant_end) &
(hg38_cytoband$chromEnd > quant_start)
), ]
# Find corresponding cytobands in hg19
cytoband_mapping[['hg19']][[i]] <- hg19_cytoband[which(
(hg19_cytoband$chrom == quant_chrom) &
(hg19_cytoband$name %in% cytoband_mapping[['hg38']][[i]]$name)
), ]
}
# Assign region IDs to the mapping
names(cytoband_mapping[['hg38']]) <- names(cytoband_mapping[['hg19']]) <- sorted_regions$region_id
After mapping regions to cytobands, we create genomic segments for simulation. This involves two main steps:
First, we create cytoband annotations for each cluster using
create_cluster_cytoband_anno():
This function:
Takes clustered regions and their cytoband mappings as input
Creates annotations for both maternal and paternal haplotypes
Organizes annotations by chromosome and cluster
Combines cytoband information for all regions in each cluster
cytoband_anno <- create_cluster_cytoband_anno(region_clusters = modified_region_clusters,
cytoband_mapping = cytoband_mapping$hg19)
# Remove the first sub region in chr10 and chr6, as it is almost covering the whole p-arm
cytoband_anno$paternal$chr10 <- cytoband_anno$paternal$chr10[2]
cytoband_anno$maternal$chr6 <- cytoband_anno$maternal$chr6[c(2,3)]
Then convert cytoband annotations into genomic segments
using sub_seg_from_cytoband_anno():
This function:
Creates segments spanning each cluster’s cytobands
Generates unique identifiers for each segment
Initializes metadata fields for downstream analysis
Returns segments organized by haplotype
# Convert cytoband annotations into genomic segments
sub_seg_list <- sub_seg_from_cytoband_anno(cytoband_anno)
The resulting sub_seg_list contains:
Segment coordinates (start/end positions)
Haplotype information (maternal/paternal)
Reference coordinates
Copy number and segment tracking fields
Source information for segment evolution
# Save complete environment for full reproducibility
save.image(file = "./tutorials_data/advanced_3_genomic_region_mapping.RData")