Mapping Genomic Regions from hg38 to hg19 for Simulation Setup

This tutorial demonstrates how to map genomic regions originally defined in the hg38 reference genome to the hg19 reference genome and derive a simulation setting based on hg19. This is particularly useful when working with genomic data from different reference genomes, such as copy number variations (CNVs) or other genomic features.

Overview

  1. Input Data:

    • sorted_regions: Genomic regions defined in hg38 (advanced_2_pattern_extraction).

    • hg38_cytoband and hg19_cytoband: Cytoband information for hg38 and hg19, respectively.

  2. Workflow:

    • Map genomic regions from hg38 to hg19 using cytobands.

    • Derive simulation regions based on hg19 coordinates.

  3. Output:

    • A mapping of genomic regions to hg19 cytobands.

    • A simulation setup based on hg19 coordinates.

Step 1: Load Required Data

First, load the genomic regions (sorted_regions) and cytoband information for hg38 and hg19.

# Load preprocessed data from previous steps
load("./tutorials_data/advanced_2_pattern_extraction.RData")
library(cancerSimCraft)
# load hg19 cytoband info 
hg19_cytoband <- read.table(file = "./tutorials_data/ucsc_cytoband_hg19.tsv", sep = "\t", header = TRUE)
hg38_cytoband <- read.table(file = "./tutorials_data/ucsc_cytoband_hg38.tsv", sep = "\t", header = TRUE)

Step 2: Map Genomic Regions to Cytobands

Map each genomic region in sorted_regions to its corresponding cytobands in hg38 and hg19.

# Initialize a list to store the mapping
cytoband_mapping <- list() 
# Loop through each genomic region
for (i in 1:nrow(sorted_regions)) {
  # Extract region information
  quant_start <- sorted_regions[i, 'START']
  quant_end <- sorted_regions[i, 'END']
  quant_chrom <- sorted_regions[i, 'CHR']
  
  # Find overlapping cytobands in hg38
  cytoband_mapping[['hg38']][[i]] <- hg38_cytoband[which(
    (hg38_cytoband$chrom == quant_chrom) & 
      (hg38_cytoband$chromStart < quant_end) & 
      (hg38_cytoband$chromEnd > quant_start)
  ), ]
  
  # Find corresponding cytobands in hg19
  cytoband_mapping[['hg19']][[i]] <- hg19_cytoband[which(
    (hg19_cytoband$chrom == quant_chrom) & 
      (hg19_cytoband$name %in% cytoband_mapping[['hg38']][[i]]$name)
  ), ]
  
}

# Assign region IDs to the mapping
names(cytoband_mapping[['hg38']]) <- names(cytoband_mapping[['hg19']]) <- sorted_regions$region_id

Step 3: Creating Genomic Segments from Cytoband Annotations

After mapping regions to cytobands, we create genomic segments for simulation. This involves two main steps:

3.1 Create Cluster Annotations

First, we create cytoband annotations for each cluster using create_cluster_cytoband_anno():

This function:

  • Takes clustered regions and their cytoband mappings as input

  • Creates annotations for both maternal and paternal haplotypes

  • Organizes annotations by chromosome and cluster

  • Combines cytoband information for all regions in each cluster

cytoband_anno <- create_cluster_cytoband_anno(region_clusters = modified_region_clusters,
                                              cytoband_mapping = cytoband_mapping$hg19)
# Remove the first sub region in chr10 and chr6, as it is almost covering the whole p-arm  
cytoband_anno$paternal$chr10 <- cytoband_anno$paternal$chr10[2]
cytoband_anno$maternal$chr6 <- cytoband_anno$maternal$chr6[c(2,3)]

3.2 Create Genomic Segments

Then convert cytoband annotations into genomic segments using sub_seg_from_cytoband_anno():

This function:

  • Creates segments spanning each cluster’s cytobands

  • Generates unique identifiers for each segment

  • Initializes metadata fields for downstream analysis

  • Returns segments organized by haplotype

# Convert cytoband annotations into genomic segments 
sub_seg_list <- sub_seg_from_cytoband_anno(cytoband_anno)

The resulting sub_seg_list contains:

  • Segment coordinates (start/end positions)

  • Haplotype information (maternal/paternal)

  • Reference coordinates

  • Copy number and segment tracking fields

  • Source information for segment evolution

# Save complete environment for full reproducibility
save.image(file = "./tutorials_data/advanced_3_genomic_region_mapping.RData")