Preprocessing CHISEL’s output

Introduction

This tutorial demonstrates how to preprocess single-cell copy number profiles from CHISEL (Copy-number Haplotype Inference in Single-cell by Evolutionary Links) (https://github.com/raphael-group/chisel/) for downstream analysis such as copy number pattern extraction, simulation configuration, and ground truth generation. We’ll cover data loading, organization, and conversion of copy number states into analysis-ready matrices. The example data is from the CHISEL publication (Zaccaria and Raphael, Nat Biotechnol 2021).

Data Description

We use example data from the CHISEL paper’s data repository (https://github.com/raphael-group/chisel-data/). The data files have been renamed for clarity:

patientS0_sectionE_calls.tsv: Contains copy number states and haplotype-specific information for each genomic segment across cells (originally from CHISEL’s calls.tsv)
patientS0_sectionE_mapping.tsv: Contains cell-to-cluster and clone mapping information (originally from CHISEL’s mapping.tsv)

The mapping data (patientS0_sectionE_mapping.tsv) contains three columns:

CELL: Cell barcode identifier (e.g., “AAACCTGCACCAAAGG”)
CLUSTER: Numerical cluster ID assigned by CHISEL (e.g., “199”)
CLONE: Clone assignment with format “Clone[number]” or “None” for unassigned cells (e.g., “Clone199”)

The copy number data (patientS0_sectionE_calls.tsv) contains the following columns:

CHR: Chromosome
START: Segment start position
END: Segment end position
CELL: Cell barcode identifier
NORM_COUNT: Normalized read count
COUNT: Raw read count
RDR: Read depth ratio
A_COUNT, B_COUNT: Read counts for each haplotype
BAF: B-allele frequency
CLUSTER: Copy number cluster ID
CN_STATE: Haplotype-specific copy number state (format: maternal|paternal)

library(dplyr)
library(tidyr)
library(pals)
library(cancerSimCraft)

# Read copy number and cluster data
chisel_cn_dat <- read.table("./tutorials_data/patientS0_sectionE_calls.tsv", sep = "\t", header = TRUE)
chisel_cluster_dat <- read.table("./tutorials_data/patientS0_sectionE_mapping.tsv", sep = "\t", header = TRUE)
rownames(chisel_cluster_dat) <- chisel_cluster_dat[,1]

head(chisel_cn_dat)

head(chisel_cluster_dat)

Data Preprocessing

After loading the raw data, we need to preprocess it to create a format suitable for copy number pattern analysis. The preprocessing steps include merging datasets, filtering cells, and reshaping the data structure.

# Merge two datasets by the cell index
chisel_combine_dat <- merge(chisel_cn_dat, chisel_cluster_dat, by = "CELL", all.x = TRUE)

# Filter out cells without clone assignments or marked as "None"
# This step ensures we only analyze cells with clear lineage assignments
chisel_combine_dat <- chisel_combine_dat %>% filter(!is.na(CLONE) & CLONE != "None") 

# Create a unique identifier for each chromosomal region
# This combines chromosome name and coordinates into a single identifier
chisel_combine_dat$region_id <- paste(chisel_combine_dat$CHR, chisel_combine_dat$START, chisel_combine_dat$END, sep="_")

# Reshape the data to create a matrix-like structure
# Each row represents a cell
# Each column represents a genomic region
# Values are haplotype-specific copy number states
cn_state_df <- chisel_combine_dat %>%
  dplyr::select(CELL, region_id, CN_STATE) %>%
  tidyr::spread(key = region_id, value = CN_STATE)

# Set cell barcodes as row names and remove the CELL column
rownames(cn_state_df) <- cn_state_df[, 1]
cn_state_df <- cn_state_df[, -1]

cn_state_df[1:6, 1:6]

The resulting cn_state_df is a matrix where:

Rows represent individual cells identified by their barcodes (e.g., “AAACCTGCACCAAAGG”)
Columns represent unique genomic regions (e.g., “chr1_0_5000000”)
Each cell contains a haplotype-specific copy number state in the format “maternal|paternal” for example:
- “1|2”: one copy from maternal chromosome, two copies from paternal
- “2|1”: two copies from maternal chromosome, one copy from paternal
- “1|1”: one copy each from maternal and paternal chromosomes (normal diploid)

After creating the basic matrix structure, we need to organize the genomic regions and cells in a meaningful order to facilitate pattern analysis.

Organize Genomic Regions

First, we parse and sort the genomic regions to ensure they’re in chromosome order:

# Splitting the region_id into components
regions <- data.frame(region_id = colnames(cn_state_df), sep_column = colnames(cn_state_df))  # Excluding the first column (CELL)
regions <- regions %>%
  separate(sep_column, into = c("CHR", "START", "END"), sep = "_") %>%
  mutate(START = as.numeric(START), END = as.numeric(END)) %>%
  mutate(
    CHR_idx = as.numeric(gsub("chr", "", CHR))  # Extract numeric part of CHR
  )

# Sort regions by chromosome number and start position
sorted_regions <- regions %>%
  arrange(CHR_idx, START)

# Reorder matrix columns based on sorted genomic regions
cn_state_df <- cn_state_df[, sorted_regions$region_id]

Organize Cells by Clone

Next, we organize cells based on their clone assignments to group related cells together:

# Reorder rows (cells) based on clone assignments
cell_clone_map <- chisel_cluster_dat[rownames(cn_state_df),] %>% arrange(CLONE)
cn_state_df <- cn_state_df[rownames(cell_clone_map), ]

The final preprocessed data structure has:

Columns ordered by chromosomal position (from chr1 to chrX/Y)
Rows (cells) ordered by clone assignment
This organization makes it easier to:
1. Visualize chromosome-wide patterns
2. Identify clone-specific copy number alterations
3. Detect regional copy number changes
4. Compare patterns across different chromosomes

Clone Symbol Mapping

To maintain consistency with the published CHISEL paper’s clone annotations (Zaccaria and Raphael, Nat Biotechnol 2021), we map the clone names in the copy number data to their corresponding published Roman numerals:

# Create a mapping from chisel's output clone names to paper clone symbols
clone_symbols_map <- c("Clone5" = "I", "Clone63" = "II", "Clone156" = "III", 
                      "Clone172" = "IV", "Clone199" = "V", "Clone241" = "VI")

# Add paper clone symbols to cell mapping data
cell_clone_map <- cell_clone_map %>%
  mutate(PAPER_CLONE = clone_symbols_map[as.character(CLONE)])

Separate Haplotype-specific Copy Numbers

The copy number states in our data are stored as “maternal|paternal” strings. We’ll separate these into distinct matrices:

# Function to split haplotype string into numeric values
split_haplotypes <- function(cn_state) {
  haplotypes <- strsplit(as.character(cn_state), split = "\\|")
  sapply(haplotypes, function(x) as.numeric(x))
}

# Create separate matrices for maternal and paternal copy numbers
haplotype_cn_arrays <- apply(cn_state_df, c(1:2), split_haplotypes)

# Organize into a list of matrices
haplotype_cn <- list() 
haplotype_cn[['maternal']] <- haplotype_cn_arrays[1,,]
haplotype_cn[['paternal']] <- haplotype_cn_arrays[2,,]

# Calculate total copy number
total_cn <- haplotype_cn[['maternal']] + haplotype_cn[['paternal']]

# Set matrix dimensions names
rownames(haplotype_cn[['maternal']]) <- rownames(haplotype_cn[['paternal']]) <- 
  rownames(total_cn) <- rownames(cn_state_df)
colnames(haplotype_cn[['maternal']]) <- colnames(haplotype_cn[['paternal']]) <- 
  colnames(total_cn) <- colnames(cn_state_df)

Prepare Visualization Elements

# Create chromosome position data for visualization
window_data <- sorted_regions[, c('CHR', 'START', 'END')]
chr_bar <- draw_chr_bar(window_data)

# Set up clone color scheme
goodColor <- c("pink", "gold", "darkseagreen", "firebrick4", "lightskyblue", 
               "orange1", "lightyellow", "royalblue", "moccasin","violet", 
               "olivedrab3", "blueviolet", "yellow2", "darkgreen", "royalblue4", 
               "lightsalmon3", "mediumaquamarine", "coral1", "darkolivegreen1", 
               "cyan4", "tan4", "grey80")

# Prepare clone annotation
clone_identity_vector <- cell_clone_map[rownames(cn_state_df), 'PAPER_CLONE'] 
clone_names <- unique(clone_identity_vector)
clone_color_anno <- goodColor[1:length(clone_names)]
names(clone_color_anno) <- clone_names

Generate Copy Number Heatmaps

# Determine maximum copy number for consistent scale
max_int <- max(sapply(haplotype_cn, max))

# Generate heatmaps for each type of copy number
for(haplotype in c("maternal", "paternal")){
  plot_truth_heatmap(seg_dat = haplotype_cn[[haplotype]], 
                     title = paste0("CHISEL ", haplotype, " copy number"),
                     chr_bar = chr_bar, 
                     integer_col = TRUE, 
                     max_int = max_int, 
                     clone_identity_vector = clone_identity_vector, 
                     clone_color_anno = clone_color_anno)
}

# Generate total copy number heatmap
plot_truth_heatmap(seg_dat = total_cn, 
                   title = "CHISEL total copy number",
                   chr_bar = chr_bar, 
                   integer_col = TRUE, 
                   max_int = max_int, 
                   clone_identity_vector = clone_identity_vector, 
                   clone_color_anno = clone_color_anno)

# Save complete environment for full reproducibility
save.image(file = "./tutorials_data/advanced_1_chisel_data_preprocessing.RData")