This tutorial provides a step-by-step guide on how to extract copy number patterns from genomic data processed using CHISEL. The goal is to identify genomic segments that share the same copy number events and prepare boundary data for configuring simulation setups. By the end of this tutorial, you will be able to transform CHISEL-processed data into a format suitable for downstream simulations.
library(pheatmap)
library(pals)
library(cancerSimCraft)
# Load preprocessed data from previous steps
load("./tutorials_data/advanced_1_chisel_data_preprocessing.RData")
In this step, we define the chromosomes of interest for each
haplotype based on the copy number heatmap generated during the
preprocessing step (advanced_1_chisel_data_preprocessing).
These chromosomes exhibit distinctive haplotype-specific copy number
patterns, making them suitable for simulation setups.
# Define chromosomes of interest for each haplotype
target_chr <- list(
maternal = c("chr6", "chr11", "chr22"),
paternal = c("chr10", "chr16", "chr17", "chr20")
)
This list specifies the chromosomes that will be used for further analysis and simulation configuration.
Copy number data alone does not provide explicit segment boundary information (i.e., the exact genomic coordinates where copy number states change). To identify these boundaries for simulation configuration, we perform correlation analysis. For each target chromosome, we calculate the Spearman correlation between genomic bins based on their copy number profiles across cells. The resulting correlation coefficients indicate the similarity in copy number patterns between genomic bins, enabling us to define regions affected by the same copy number events.
# Initialize list to store correlation matrices
cn_cor <- list()
# Calculate correlations for each haplotype and chromosome
for(haplotype in c("maternal", "paternal")) {
for(chr in target_chr[[haplotype]]) {
# Extract copy number matrix for specific chromosome
cn_mat <- haplotype_cn[[haplotype]][, grep(pattern = chr,
colnames(haplotype_cn[[haplotype]]))]
# Calculate the correlation matrix using Spearman's method
cn_cor[[haplotype]][[chr]] <- cor(cn_mat, method = "spearman")
}
}
This code computes correlation matrices for each chromosome of interest, separately for maternal and paternal haplotypes. These matrices will be used to identify regions with similar copy number patterns in subsequent steps.
To better understand the correlation patterns and identify regions of shared copy number states, we create heatmaps for each chromosome. These heatmaps visually represent the correlation matrices, highlighting regions of high correlation that likely correspond to genomic segments affected by the same copy number events.
# Generate correlation heatmaps
for(haplotype in c("maternal", "paternal")) {
for(chr in target_chr[[haplotype]]) {
pheatmap(cn_cor[[haplotype]][[chr]],
cluster_rows = FALSE,
cluster_cols = FALSE,
main = paste0("Spearman Cor of ", haplotype, " ", chr, " CN Profile"),
display_numbers = TRUE,
fontsize_number = 6,
color = colorRampPalette(c("blue", "white", "red"))(50),
border_color = "NA")
}
}
To identify groups of genomic regions with highly correlated copy
number profiles, we employ a dynamic window-based clustering
approach. The identify_region_clusters function
implements the following algorithm:
Sliding Window Initialization: Start with a minimum cluster size of 2 bins.
Average Correlation Calculation: Compute the average correlation within each sliding window.
Dynamic Window Extension: Extend the window dynamically when the average correlation exceeds a predefined threshold.
As a result of this process, adjacent regions with high correlation are merged into clusters, representing genomic segments with similar copy number patterns.
# Initialize list to store cluster information
region_clusters <- list()
# Identify clusters for each haplotype and chromosome
for(haplotype in c("maternal", "paternal")) {
for(chr in target_chr[[haplotype]]) {
region_clusters[[haplotype]][[chr]] <- identify_region_clusters(
cor_matrix = cn_cor[[haplotype]][[chr]],
min_cluster_size = 2,
threshold = 0.85
)
}
}
cor_matrix: The correlation matrix
for the specific chromosome and haplotype.
min_cluster_size: The minimum
number of bins required to form a cluster (default: 2).
threshold: The correlation
threshold for identifying highly correlated regions (default:
0.85).
region_clusters: A nested list
containing clusters of genomic regions for each haplotype and
chromosome. Each cluster consists of regions with highly correlated copy
number profiles.The identify_region_clusters function identifies groups
of cells with highly correlated copy number profiles (correlation >
0.85) and a minimum cluster size of 2 cells. These clusters represent
common copy number patterns that can be used to define simulation
parameters and study the impact of copy number variation.
The region clusters identified on maternal chromosome 6 represent groups of genomic bins with highly correlated copy number patterns. Each cluster corresponds to a contiguous genomic region where bins exhibit similar copy number profiles across cells.
Let’s examine the structure of region clusters by looking at the first cluster on maternal chromosome 6:
# Example of the first cluster
region_clusters$maternal$chr6[[1]]
## chr6_5000000_10000000 chr6_10000000_15000000
## 2 3
This cluster contains two adjacent genomic bins:
First
bin:Â "chr6_5000000_10000000"Â represents the region
from 5Mb to 10Mb on chromosome 6.
Second
bin:Â "chr6_10000000_15000000"Â represents the
region from 10Mb to 15Mb on chromosome 6.
The numbers below (e.g., 2 and 3) are
the bin indices in the original data matrix. These bins
were clustered together because they exhibit highly correlated copy
number patterns across cells, suggesting they likely underwent similar
copy number alterations during tumor evolution.
After identifying initial region clusters, we refine them based
on known copy number events from the CHISEL paper
(Zaccaria and Raphael, Nat Biotechnol 2021) and visual
inspection of the data. To facilitate focused examination of
specific genomic regions, we use
the subset_interested_profiles function to extract and
visualize copy number patterns in regions of interest.
# Initialize list for manually refined clusters
modified_region_clusters <- list()
According to the CHISEL paper:
chr6q LOH occurred in clone I → MRCA.
chr6p deletion occurred in MRCA → clone II.
Subset Regions of Interest:
The subset_interested_profiles function extracts copy
number profiles for specific genomic regions.
Visual Inspection:
The plot_truth_heatmap function generates a heatmap to
visualize copy number patterns in the selected regions.
Refinement: Only clusters validated by known
events and visual inspection are retained in
the modified_region_clusters list.
This manual refinement step ensures that the identified clusters are biologically meaningful and consistent with known copy number events, improving the accuracy of downstream analyses and simulations.
# Examine specific regions of interest
haplotype = "maternal"
chr = "chr6"
# Define regions of interest based on prior clustering results
# The indices (e.g., 4:15, 17:19, 31, 32) correspond to bins identified in the region_clusters list
# These regions were selected because they show copy number patterns consistent with known events
region_indices <- c(4:15, 17:19, 31, 32) # Regions showing these events
# Visualize focused regions
plot_truth_heatmap(
seg_dat = subset_interested_profiles(haplotype_cn, haplotype, chr, region_indices),
title = paste(haplotype, chr, "focus", paste(region_indices, collapse = " ")),
chr_bar = chr_bar,
integer_col = TRUE,
max_int = max_int,
clone_identity_vector = clone_identity_vector,
clone_color_anno = clone_color_anno
)
# Keep only validated clusters
modified_region_clusters$maternal$chr6 <- region_clusters$maternal$chr6[c(2, 3, 6)]
According to the CHISEL paper:
chr10q LOH occurred in clone I → MRCA.
chr10p deletion occurred in MRCA → clone II.
haplotype = "paternal"
chr = "chr10"
region_indices <- c(1:7, 17:24) # Regions corresponding to these events
plot_truth_heatmap(
seg_dat = subset_interested_profiles(haplotype_cn, haplotype, chr, region_indices),
title = paste(haplotype, chr, "focus", paste(region_indices, collapse = " ")),
chr_bar = chr_bar,
integer_col = TRUE,
max_int = max_int,
clone_identity_vector = clone_identity_vector,
clone_color_anno = clone_color_anno
)
modified_region_clusters$paternal$chr10 <- region_clusters$paternal$chr10[c(1,3)]
According to the CHISEL paper:
haplotype = "maternal"
chr = "chr11"
region_indices <- c(24:27)
plot_truth_heatmap(
seg_dat = subset_interested_profiles(haplotype_cn, haplotype, chr, region_indices),
title = paste(haplotype, chr, "focus", paste(region_indices, collapse = " ")),
chr_bar = chr_bar,
integer_col = TRUE,
max_int = max_int,
clone_identity_vector = clone_identity_vector,
clone_color_anno = clone_color_anno
)
modified_region_clusters$maternal$chr11 <- region_clusters$maternal$chr11[2]
Based on the CHISEL paper:
haplotype = "paternal"
chr = "chr16"
region_indices <- c(1:7)
plot_truth_heatmap(seg_dat = subset_interested_profiles(haplotype_cn, haplotype, chr, region_indices),
title = paste(haplotype, chr, "focus", paste(region_indices, collapse = " ")),
chr_bar = chr_bar, integer_col = TRUE, max_int = max_int,
clone_identity_vector = clone_identity_vector,
clone_color_anno = clone_color_anno)
# Keep first cluster but remove 7th window bin
modified_region_clusters$paternal$chr16 <- region_clusters$paternal$chr16[1]
modified_region_clusters$paternal$chr16[[1]] <- modified_region_clusters$paternal$chr16[[1]][-7]
Based on the CHISEL paper:
haplotype = "paternal"
chr = "chr17"
region_indices <- c(1:3, 6,7)
plot_truth_heatmap(seg_dat = subset_interested_profiles(haplotype_cn, haplotype, chr, region_indices),
title = paste(haplotype, chr, "focus", paste(region_indices, collapse = " ")),
chr_bar = chr_bar, integer_col = TRUE, max_int = max_int,
clone_identity_vector = clone_identity_vector,
clone_color_anno = clone_color_anno)
modified_region_clusters$paternal$chr17 <- region_clusters$paternal$chr17[c(1)]
Based on the CHISEL paper:
haplotype = "maternal"
chr = "chr22"
region_indices = 5
plot_truth_heatmap(seg_dat = subset_interested_profiles(haplotype_cn, haplotype, chr, region_indices),
title = paste(haplotype, chr, "focus", paste(region_indices, collapse = " ")),
chr_bar = chr_bar, integer_col = TRUE, max_int = max_int,
clone_identity_vector = clone_identity_vector,
clone_color_anno = clone_color_anno)
modified_region_clusters$maternal$chr22 <- list(5)
names(modified_region_clusters$maternal$chr22[[1]]) <- colnames(cn_cor$maternal$chr22)[5]
# Save complete environment for full reproducibility
save.image(file = "./tutorials_data/advanced_2_pattern_extraction.RData")