Introduction

This tutorial provides a step-by-step guide on how to extract copy number patterns from genomic data processed using CHISEL. The goal is to identify genomic segments that share the same copy number events and prepare boundary data for configuring simulation setups. By the end of this tutorial, you will be able to transform CHISEL-processed data into a format suitable for downstream simulations.

Load Required Libraries and Data

library(pheatmap)
library(pals)
library(cancerSimCraft)

# Load preprocessed data from previous steps
load("./tutorials_data/advanced_1_chisel_data_preprocessing.RData")

Define Target Chromosomes

In this step, we define the chromosomes of interest for each haplotype based on the copy number heatmap generated during the preprocessing step (advanced_1_chisel_data_preprocessing). These chromosomes exhibit distinctive haplotype-specific copy number patterns, making them suitable for simulation setups.

# Define chromosomes of interest for each haplotype
target_chr <- list(
  maternal = c("chr6", "chr11", "chr22"),
  paternal = c("chr10", "chr16", "chr17", "chr20")
)

This list specifies the chromosomes that will be used for further analysis and simulation configuration.

Calculate Copy Number Correlations

Copy number data alone does not provide explicit segment boundary information (i.e., the exact genomic coordinates where copy number states change). To identify these boundaries for simulation configuration, we perform correlation analysis. For each target chromosome, we calculate the Spearman correlation between genomic bins based on their copy number profiles across cells. The resulting correlation coefficients indicate the similarity in copy number patterns between genomic bins, enabling us to define regions affected by the same copy number events.

# Initialize list to store correlation matrices
cn_cor <- list()

# Calculate correlations for each haplotype and chromosome
for(haplotype in c("maternal", "paternal")) {
  for(chr in target_chr[[haplotype]]) {
    # Extract copy number matrix for specific chromosome
    cn_mat <- haplotype_cn[[haplotype]][, grep(pattern = chr, 
                                              colnames(haplotype_cn[[haplotype]]))]
    # Calculate the correlation matrix using Spearman's method
    cn_cor[[haplotype]][[chr]] <- cor(cn_mat, method = "spearman")
  }
}

This code computes correlation matrices for each chromosome of interest, separately for maternal and paternal haplotypes. These matrices will be used to identify regions with similar copy number patterns in subsequent steps.

Visualize Correlation Patterns

To better understand the correlation patterns and identify regions of shared copy number states, we create heatmaps for each chromosome. These heatmaps visually represent the correlation matrices, highlighting regions of high correlation that likely correspond to genomic segments affected by the same copy number events.

# Generate correlation heatmaps
for(haplotype in c("maternal", "paternal")) {
  for(chr in target_chr[[haplotype]]) {
    pheatmap(cn_cor[[haplotype]][[chr]], 
             cluster_rows = FALSE, 
             cluster_cols = FALSE,
             main = paste0("Spearman Cor of ", haplotype, " ", chr, " CN Profile"),
             display_numbers = TRUE,  
             fontsize_number = 6,     
             color = colorRampPalette(c("blue", "white", "red"))(50),
             border_color = "NA")    
  }
}

Identify Region Clusters

To identify groups of genomic regions with highly correlated copy number profiles, we employ a dynamic window-based clustering approach. The identify_region_clusters function implements the following algorithm:

  1. Sliding Window Initialization: Start with a minimum cluster size of 2 bins.

  2. Average Correlation Calculation: Compute the average correlation within each sliding window.

  3. Dynamic Window Extension: Extend the window dynamically when the average correlation exceeds a predefined threshold.

As a result of this process, adjacent regions with high correlation are merged into clusters, representing genomic segments with similar copy number patterns.

# Initialize list to store cluster information
region_clusters <- list()

# Identify clusters for each haplotype and chromosome
for(haplotype in c("maternal", "paternal")) {
  for(chr in target_chr[[haplotype]]) {
    region_clusters[[haplotype]][[chr]] <- identify_region_clusters(
      cor_matrix = cn_cor[[haplotype]][[chr]],
      min_cluster_size = 2,
      threshold = 0.85
    )
  }
}

Key Parameters:

  • cor_matrix: The correlation matrix for the specific chromosome and haplotype.

  • min_cluster_size: The minimum number of bins required to form a cluster (default: 2).

  • threshold: The correlation threshold for identifying highly correlated regions (default: 0.85).

Output:

  • region_clusters: A nested list containing clusters of genomic regions for each haplotype and chromosome. Each cluster consists of regions with highly correlated copy number profiles.

Purpose:

The identify_region_clusters function identifies groups of cells with highly correlated copy number profiles (correlation > 0.85) and a minimum cluster size of 2 cells. These clusters represent common copy number patterns that can be used to define simulation parameters and study the impact of copy number variation.

Understanding Region Clusters: An Example from Chromosome 6

The region clusters identified on maternal chromosome 6 represent groups of genomic bins with highly correlated copy number patterns. Each cluster corresponds to a contiguous genomic region where bins exhibit similar copy number profiles across cells.

Let’s examine the structure of region clusters by looking at the first cluster on maternal chromosome 6:

# Example of the first cluster
region_clusters$maternal$chr6[[1]]
##  chr6_5000000_10000000 chr6_10000000_15000000 
##                      2                      3

This cluster contains two adjacent genomic bins:

  1. First bin: "chr6_5000000_10000000" represents the region from 5Mb to 10Mb on chromosome 6.

  2. Second bin: "chr6_10000000_15000000" represents the region from 10Mb to 15Mb on chromosome 6.

The numbers below (e.g., 2 and 3) are the bin indices in the original data matrix. These bins were clustered together because they exhibit highly correlated copy number patterns across cells, suggesting they likely underwent similar copy number alterations during tumor evolution.

Manual Refinement of Region Clusters

After identifying initial region clusters, we refine them based on known copy number events from the CHISEL paper (Zaccaria and Raphael, Nat Biotechnol 2021) and visual inspection of the data. To facilitate focused examination of specific genomic regions, we use the subset_interested_profiles function to extract and visualize copy number patterns in regions of interest.

# Initialize list for manually refined clusters
modified_region_clusters <- list()

Maternal Chromosome 6

According to the CHISEL paper:

  • chr6q LOH occurred in clone I → MRCA.

  • chr6p deletion occurred in MRCA → clone II.

Key Steps:

  1. Subset Regions of Interest: The subset_interested_profiles function extracts copy number profiles for specific genomic regions.

  2. Visual Inspection: The plot_truth_heatmap function generates a heatmap to visualize copy number patterns in the selected regions.

  3. Refinement: Only clusters validated by known events and visual inspection are retained in the modified_region_clusters list.

Purpose:

This manual refinement step ensures that the identified clusters are biologically meaningful and consistent with known copy number events, improving the accuracy of downstream analyses and simulations.

# Examine specific regions of interest
haplotype = "maternal"
chr = "chr6"
# Define regions of interest based on prior clustering results
# The indices (e.g., 4:15, 17:19, 31, 32) correspond to bins identified in the region_clusters list
# These regions were selected because they show copy number patterns consistent with known events
region_indices <- c(4:15, 17:19, 31, 32)  # Regions showing these events

# Visualize focused regions
plot_truth_heatmap(
    seg_dat = subset_interested_profiles(haplotype_cn, haplotype, chr, region_indices),
    title = paste(haplotype, chr, "focus", paste(region_indices, collapse = " ")),
    chr_bar = chr_bar,
    integer_col = TRUE,
    max_int = max_int,
    clone_identity_vector = clone_identity_vector,
    clone_color_anno = clone_color_anno
)

# Keep only validated clusters
modified_region_clusters$maternal$chr6 <- region_clusters$maternal$chr6[c(2, 3, 6)]

Paternal Chromosome 10

According to the CHISEL paper:

  • chr10q LOH occurred in clone I → MRCA.

  • chr10p deletion occurred in MRCA → clone II.

haplotype = "paternal"
chr = "chr10"
region_indices <- c(1:7, 17:24)  # Regions corresponding to these events

plot_truth_heatmap(
    seg_dat = subset_interested_profiles(haplotype_cn, haplotype, chr, region_indices),
    title = paste(haplotype, chr, "focus", paste(region_indices, collapse = " ")),
    chr_bar = chr_bar,
    integer_col = TRUE,
    max_int = max_int,
    clone_identity_vector = clone_identity_vector,
    clone_color_anno = clone_color_anno
)

modified_region_clusters$paternal$chr10 <- region_clusters$paternal$chr10[c(1,3)]

Maternal Chromosome 11

According to the CHISEL paper:

  • chr11q LOH occurred in clone I → MRCA.
haplotype = "maternal"
chr = "chr11"
region_indices <- c(24:27)

plot_truth_heatmap(
    seg_dat = subset_interested_profiles(haplotype_cn, haplotype, chr, region_indices),
    title = paste(haplotype, chr, "focus", paste(region_indices, collapse = " ")),
    chr_bar = chr_bar,
    integer_col = TRUE,
    max_int = max_int,
    clone_identity_vector = clone_identity_vector,
    clone_color_anno = clone_color_anno
)

modified_region_clusters$maternal$chr11 <- region_clusters$maternal$chr11[2]

Paternal Chromosome 16

Based on the CHISEL paper:

  • chr16p duplication occurred in clone I → MRCA.
haplotype = "paternal"
chr = "chr16"
region_indices <- c(1:7)
plot_truth_heatmap(seg_dat = subset_interested_profiles(haplotype_cn, haplotype, chr, region_indices),
                   title = paste(haplotype, chr, "focus", paste(region_indices, collapse = " ")),
                   chr_bar = chr_bar, integer_col = TRUE, max_int = max_int, 
                   clone_identity_vector = clone_identity_vector, 
                   clone_color_anno = clone_color_anno)

# Keep first cluster but remove 7th window bin
modified_region_clusters$paternal$chr16 <- region_clusters$paternal$chr16[1]
modified_region_clusters$paternal$chr16[[1]] <- modified_region_clusters$paternal$chr16[[1]][-7]

Paternal Chromosome 17

Based on the CHISEL paper:

  • chr17p LOH occurred in clone I → MRCA.
haplotype = "paternal"
chr = "chr17"
region_indices <- c(1:3, 6,7)
plot_truth_heatmap(seg_dat = subset_interested_profiles(haplotype_cn, haplotype, chr, region_indices),
                   title = paste(haplotype, chr, "focus", paste(region_indices, collapse = " ")),
                   chr_bar = chr_bar, integer_col = TRUE, max_int = max_int, 
                   clone_identity_vector = clone_identity_vector, 
                   clone_color_anno = clone_color_anno)

modified_region_clusters$paternal$chr17 <- region_clusters$paternal$chr17[c(1)]

Maternal Chromosome 22

Based on the CHISEL paper:

  • Focal LOH occurred in clone I → MRCA.
haplotype = "maternal"
chr = "chr22"
region_indices = 5
plot_truth_heatmap(seg_dat = subset_interested_profiles(haplotype_cn, haplotype, chr, region_indices),
                   title = paste(haplotype, chr, "focus", paste(region_indices, collapse = " ")),
                   chr_bar = chr_bar, integer_col = TRUE, max_int = max_int, 
                   clone_identity_vector = clone_identity_vector, 
                   clone_color_anno = clone_color_anno)

modified_region_clusters$maternal$chr22 <- list(5)
names(modified_region_clusters$maternal$chr22[[1]]) <- colnames(cn_cor$maternal$chr22)[5]
# Save complete environment for full reproducibility
save.image(file = "./tutorials_data/advanced_2_pattern_extraction.RData")