This vignette demonstrates how to perform a basic manual simulation of clonal evolution using the cancerSimCraft package. We’ll walk through creating a simple phylogenetic tree structure and simulating genomic alterations including copy number variations (CNVs), whole genome duplication (WGD), and mutations (SNVs) across different clones.
The simulation process includes:
This example uses a simplified genome with three chromosomes to illustrate the basic functionality of the package. The concepts demonstrated here can be extended to more complex scenarios with full-genome simulations.
For this tutorial, we use a simplified genome structure consisting of three chromosomes. Each chromosome is divided into three distinct regions: p-arm, centromere, and q-arm. This simplified structure allows us to demonstrate the basic functionality while maintaining clarity.
Our mini genome simulates a simplified chromosome structure with both maternal and paternal copies (diploid). Each copy contains three chromosomes:
Each chromosome is divided into three regions: p-arm, centromere, and q-arm. This simplified structure allows us to demonstrate key genomic alterations including:
While much smaller than a real human genome, this mini genome provides an ideal framework for understanding the basic principles of clonal-level simulation.
data("genome")
print(genome)
## $maternal
## DNAStringSet object of length 3:
## width seq names
## [1] 100 AAAACTCCATGTGTAACTCCGGA...AAGTACAATTAGGATATTCATCC chr1
## [2] 200 CTACACTGTATATGCCGAACGTT...GAGTCGTCGGAAAGCAGTTGAAT chr2
## [3] 300 ATCTAACGAACCCTGCGCAAGGA...ATGACGAGCACCAGACCCGAGAG chr3
##
## $paternal
## DNAStringSet object of length 3:
## width seq names
## [1] 100 AAAACTCCATGTGTAACTCCGGA...AAGTACAATTAGGATATTCATCC chr1
## [2] 200 CTACACTGTATATGCCGAACGTT...GAGTCGTCGGAAAGCAGTTGAAT chr2
## [3] 300 ATCTAACGACCCCTGCGCAAGGA...ATGACGAGCACCAGACCCGAGAG chr3
We create a simple phylogenetic tree using igraph.
The tree structure represents evolutionary relationships of different tumor subclones.
library(igraph)
# Create tree using graph_from_literal:
# - The -+ syntax indicates directed edges (parent -> child)
# - D is the root node
# - C1 through C5 are descendant nodes
tree <- igraph::graph_from_literal(D-+C1-+C2-+C3, C1-+C4-+C5)
# Convert the tree to an edge list table format
tree_table <- data.frame(as_edgelist(tree))
colnames(tree_table) <- c("parent", "child")
Print the tree object
# The tree object contains all graph information including:
# - Nodes (vertices)
# - Edges (connections between nodes)
# - Directionality of relationships
tree
## IGRAPH 636e9e2 DN-- 6 5 --
## + attr: name (v/c)
## + edges from 636e9e2 (vertex names):
## [1] D ->C1 C1->C2 C1->C4 C2->C3 C4->C5
Print the tree table
# The tree_table shows parent-child relationships:
# - Each row represents one edge in the tree
# - 'parent' column shows the source node
# - 'child' column shows the destination node
tree_table
Plot the tree
plot(tree,
vertex.size = 30, # Larger nodes
vertex.label.dist = 0, # Label position (0 = centered)
vertex.label.color = "black", # Label color
vertex.color = "lightblue", # Node color
vertex.frame.color = "darkblue") # Node border color
The virtual genome structure is as follows:
# Define chromosome names
chr_names <- c("chr1", "chr2", "chr3")
# Create a list containing lengths for both maternal and paternal chromosomes
initial_chr_lengths <- list("maternal" = setNames(c(100, 200, 300), chr_names),
"paternal" = setNames(c(100, 200, 300), chr_names))
print(initial_chr_lengths)
## $maternal
## chr1 chr2 chr3
## 100 200 300
##
## $paternal
## chr1 chr2 chr3
## 100 200 300
Initialize a segment list for both maternal and paternal chromosomes. Each chromosome is divided into three regions:p, q, and centromere regions
# Create segment lists for both haplotypes
initial_seg_list <- list(
# Generate maternal segment table
"maternal" = initiate_seg_table(chr_lengths = initial_chr_lengths,
chr_names = chr_names,
centromere_length = 10,
haplotype = "maternal"),
# Generate paternal segment table
"paternal" = initiate_seg_table(chr_lengths = initial_chr_lengths,
chr_names = chr_names,
centromere_length = 10,
haplotype = "paternal"))
This section demonstrates how to manually generate genomic events along the branches of our predefined phylogenetic tree. Events include copy number variations (CNVs) and whole-genome doubling (WGD). Each event describes changes between a parent clone and its child clone.
The events occur in the following pattern:
From clone D to C1: Multiple CNVs and a whole-genome doubling event
From C1 to C2: CNVs in chr3q and chr2p
From C2 to C3: CNV in chr1p
From C1 to C4 and C4 to C5: CNVs affecting different regions
# Create a list of events using create_event function
# Each event specifies:
# - haplotype: maternal or paternal chromosome
# - parent and child: cells involved
# - region_name: affected chromosome region
# - CN_change: magnitude of copy number change
# - copy_index: which copy is affected
# - event_type: CNV or WGD
events <- list(
create_event(haplotype = "paternal", parent = "D", child = "C1", region_name = "chr1p", CN_change = 2, copy_index = 1, event_type = "CNV"),
create_event(haplotype = "paternal", parent = "D", child = "C1", region_name = "chr2q", CN_change = 1, copy_index = 1, event_type = "CNV"),
create_event(haplotype = "maternal", parent = "D", child = "C1", region_name = "chr3p", CN_change = 3, copy_index = 1, event_type = "CNV"),
create_event(haplotype = "paternal", parent = "D", child = "C1", region_name = "wgd", CN_change = 1, copy_index = 1, event_type = "WGD"),
create_event(haplotype = "paternal", parent = "C1", child = "C2", region_name = "chr3q", CN_change = 1, copy_index = 1, event_type = "CNV"),
create_event(haplotype = "maternal", parent = "C1", child = "C2", region_name = "chr2p", CN_change = 1, copy_index = 1, event_type = "CNV"),
create_event(haplotype = "paternal", parent = "C2", child = "C3", region_name = "chr1p", CN_change = 1, copy_index = 2, event_type = "CNV"),
create_event(haplotype = "paternal", parent = "C1", child = "C4", region_name = "chr1p", CN_change = 1, copy_index = 2, event_type = "CNV"),
create_event(haplotype = "maternal", parent = "C4", child = "C5", region_name = "chr3p", CN_change = 2, copy_index = 4, event_type = "CNV")
)
# Combine all events into a single dataframe
event_table <- do.call("rbind", events)
event_table
The update_sim_from_event_table function is a core
component for updating the simulation state based on the specified
genomic events. It processes events in a depth-first search order along
the phylogenetic tree, tracking changes in chromosome segments and
lengths at each node. This function:
Traverses the tree in depth-first order
Updates chromosome segments based on events (CNVs and WGD)
Tracks changes in both segment structure and chromosome lengths
Maintains the evolutionary history of genomic changes
Here we demonstrate updating the simulation state using our previously defined tree structure and event table:
# Update simulation state using:
# - tree: Our phylogenetic tree structure
# - event_table: List of genomic events
# - initial_seg_list: Initial segment information
# - initial_chr_lengths: Original chromosome lengths
sim_updates <- update_sim_from_event_table(tree,
event_table = event_table,
initial_chr_arm_seg_list = initial_seg_list,
initial_chr_lengths = initial_chr_lengths)
The all_node_events output is structured as a list
where:
Each list element represents one edge in the tree (e.g., ‘D_C1’ represents the edge from node D to node C1)
For each edge, we get a data frame containing all events that occurred along that edge:
haplotype: Indicates whether the event affects
paternal or maternal chromosomes
parent, child: Source and target nodes
defining the tree edge
region_name: The affected genomic region (chromosome
arm or ‘wgd’ for whole genome duplication)
CN_change: Copy number change magnitude
copy_index: Index for tracking multiple events on
the same region
event_type: Type of event (CNV or WGD)
This list structure makes it easy to trace events along specific lineages in the evolutionary tree.
sim_updates$all_node_events$D_C1
The all_node_segments output is structured as a nested
list where:
Each list element represents one node/clone in the tree (e.g., ‘D’)
For each node, segments are organized by haplotype ($maternal and $paternal)
Each haplotype contains a data frame with detailed segment information:
haplotype: Maternal or paternal chromosome
copy
chrom: Chromosome identifier
ref_start, ref_end: Reference
coordinates
ori_start, ori_end: Original
coordinates
start, end: Current
coordinates
region_name: Chromosome arm or centromere (e.g.,
chr1p, chr1c, chr1q)
copy_index: Index for tracking multiple copies of
the same region
seg_id: Unique segment identifier
CN_change: Copy number changes relative to initial
state
seg_source_edge: Edge where segment
originated
seg_source_event: Event that created this
segment
In this example for node D:
Each chromosome is divided into p-arm, centromere (c), and q-arm regions
All segments show base state (CN_change = 0) from before_root
This list structure allows tracking of chromosome segment states for each clone, maintaining separate records for maternal and paternal copies.
# Maternal segment of clone D
sim_updates$all_node_segments$D$maternal
# Paternal segment of clone D
sim_updates$all_node_segments$D$paternal
The all_node_chr_lengths output is structured as a
nested list where:
Each list element represents one node/clone in the tree (e.g., ‘D’)
For each node, chromosome lengths are organized by haplotype ($maternal and $paternal)
Each haplotype contains a named numeric vector where:
Names indicate chromosomes (chr1, chr2, chr3)
Values represent the total length of each chromosome
This list structure enables tracking of chromosome size changes throughout the evolutionary process, which is particularly important when events like CNVs or WGDs alter chromosome lengths.
sim_updates$all_node_chr_lengths$D
## $maternal
## chr1 chr2 chr3
## 100 200 300
##
## $paternal
## chr1 chr2 chr3
## 100 200 300
After simulating chromosome structural changes, we can add single nucleotide mutations to each node in our phylogenetic tree. The number of mutations can be specified for each clone:
# Specify number of mutations for each edge in the tree
mutation_number <- c(10, 3, 2, 5, 6) # Number of mutations for each edge
names(mutation_number) <- names(sim_updates$all_node_events) # Assign edge names
# Simulate mutation positions using:
# - tree: Phylogenetic tree structure
# - chr_lengths: Updated chromosome lengths after CNV/WGD events
# - mutation_number: Number of mutations per edge
clonal_mut_locs <- sim_clonal_mutation_pos(
tree = tree,
chr_lengths = sim_updates$all_node_chr_lengths,
mutation_number = mutation_number
)
print(clonal_mut_locs[1:5, ])
## clone edge_name haplotype chrom pos
## 1 C1 D_C1 paternal chr3 497
## 2 C1 D_C1 paternal chr3 549
## 3 C1 D_C1 paternal chr3 339
## 4 C1 D_C1 paternal chr1 307
## 5 C1 D_C1 paternal chr2 166
Define a transition probability matrix that specifies the likelihood of one nucleotide mutating to another. Each row represents a starting nucleotide (A, C, G, T, or N), and each column represents the probability of changing to another nucleotide.
# Create nucleotide transition probability matrix
# Rows: original nucleotide
# Columns: mutated nucleotide
nt_transition_matrix <- matrix(c(
# A C G T N
0, 0.1, 0.7, 0.2, 0, # A
0.1, 0, 0.2, 0.7, 0, # C
0.7, 0.2, 0, 0.1, 0, # G
0.2, 0.7, 0.1, 0, 0, # T
0, 0, 0, 0, 1 # N
), nrow = 5, byrow = TRUE)
# Assign row and column names
rownames(nt_transition_matrix) <- colnames(nt_transition_matrix) <- c("A", "C", "G", "T", "N")
# Print the transition matrix
print(nt_transition_matrix)
## A C G T N
## A 0.0 0.1 0.7 0.2 0
## C 0.1 0.0 0.2 0.7 0
## G 0.7 0.2 0.0 0.1 0
## T 0.2 0.7 0.1 0.0 0
## N 0.0 0.0 0.0 0.0 1
After determining mutation positions, we can simulate the specific nucleotide changes at each position using our transition probability matrix. This function applies mutations to the genome sequence based on our previously defined parameters:
# Simulate nucleotide changes for each mutation position using:
# - genome_sequence: Original genome sequences
# - seg_list: Updated segment information after structural changes
# - mutation_info: Previously simulated mutation positions
# - nt_transition_matrix: Nucleotide transition probabilities
# - tree: Phylogenetic tree for ancestor tracking
clonal_mut_nt <- sim_clonal_mutation_nt(
genome_sequence = genome,
seg_list = sim_updates$all_node_segments,
mutation_info = clonal_mut_locs,
nt_transition_matrix = nt_transition_matrix,
tree = tree
)
# Display the resulting mutations
print(clonal_mut_nt[1:5,])
## clone edge_name haplotype chrom pos seg_id ref_pos original_nt
## 1 C1 D_C1 paternal chr3 497 chr3q_2 197 A
## 2 C1 D_C1 paternal chr3 549 chr3q_2 249 C
## 3 C1 D_C1 paternal chr3 339 chr3p_2 39 T
## 4 C1 D_C1 paternal chr1 307 chr1p_5 19 C
## 5 C1 D_C1 paternal chr2 166 chr2q_1 166 G
## alternative_nt
## 1 G
## 2 A
## 3 C
## 4 T
## 5 A
To generate the genome sequence for a specific clone, we can use
the synth_clone_genome function. In this example, we’ll
reconstruct the genome of clone C1, which descended from clone D:
# Specify the target clone and its nearest sequenced ancestor
target_clone = "C1" # Clone we want to reconstruct
nearest_clone = "D" # Immediate ancestor with known genome
# Synthesize genome for clone C1 by:
# - Starting with ancestor's genome (D)
# - Applying accumulated mutations
# - Incorporating structural changes
clone_genome <- synth_clone_genome(
target_clone = target_clone, # Clone to reconstruct (C1)
nearest_genome = genome, # Original genome sequence
nearest_clone = nearest_clone, # Ancestor (D)
tree = tree, # Phylogenetic relationships
seg_list = sim_updates$all_node_segments, # Segment information
mut_table = clonal_mut_nt # Mutation data
)
We can generate genome sequences for the entire phylogenetic tree in
one step. The synth_tree_genomes function automates this
process for all clones:
# Generate genome sequences for all clones in the phylogenetic tree
tree_genomes <- synth_tree_genomes(
tree = tree, # Phylogenetic tree structure
root_genome = genome, # Original starting genome
seg_list = sim_updates$all_node_segments, # Segment information for all clones
mut_table = clonal_mut_nt # All mutations across the tree
)
## [1] "Synthesize of C1 genome is finished!"
## [1] "Synthesize of C2 genome is finished!"
## [1] "No new segments formed for haplotype maternal in the edge C2_C3"
## [1] "Synthesize of C3 genome is finished!"
## [1] "No new segments formed for haplotype maternal in the edge C1_C4"
## [1] "Synthesize of C4 genome is finished!"
## [1] "No new segments formed for haplotype paternal in the edge C4_C5"
## [1] "Synthesize of C5 genome is finished!"
After generating genomes for all clones, it’s important to verify that the chromosome lengths match our expectations. This validation step ensures the accuracy of our simulation:
# Get all clone names from the phylogenetic tree
clone_names <- V(tree)$name
# Check chromosome lengths for each clone
all_chr_length_check <- list()
for(clone in clone_names){
# Compare actual vs expected chromosome lengths for each clone:
# - clone_genome: Synthesized genome sequence
# - expected_chr_lengths: Lengths after structural changes
all_chr_length_check[[clone]] <- check_genome_chr_length(
clone_genome = tree_genomes[[clone]],
expected_chr_lengths = sim_updates$all_node_chr_lengths[[clone]]
)
}
# Print the chr length check result
print(all_chr_length_check$D$maternal$chr1)
## [1] "chr1 in maternal matches the expected length of 100"
After generating clone genomes, we perform a sanity check to ensure all mutations have been correctly incorporated. This validation process verifies that mutations are properly inherited and implemented in each clone:
# Get mutations specific to each clone from the mutation table
clone_mutations <- acquire_clone_mutations(
mutation_table = clonal_mut_nt,
tree = tree
)
# Check mutations for each clone (excluding the root clone)
all_mutation_check <- list()
for(clone in clone_names[-1]){
# Verify mutations in each clone's genome:
# - clone_genome: Synthesized genome sequence
# - clone_mutation_table: Expected mutations
# - clone_seg_list: Current segment structure
all_mutation_check[[clone]] <- check_genome_mutations(
clone_genome = tree_genomes[[clone]],
clone_mutation_table = clone_mutations[[clone]],
clone_seg_list = sim_updates$all_node_segments[[clone]]
)
}
# Print the mutation check result
print(all_mutation_check$C1$maternal$chr2)
## NULL
After generating clone genomes, we verify that genome segments marked as “loss” events are correctly handled. This validation ensures that chromosomal deletions are properly represented in each clone:
# Check loss segments for each clone (excluding the root clone)
all_loss_segment_check <- list()
for(clone in clone_names[-1]){
# Verify loss events in each clone:
# - clone_genome: Synthesized genome sequence
# - clone_segments: Current segment information
all_loss_segment_check[[clone]] <- check_loss_segments(
clone_genome = tree_genomes[[clone]],
clone_segments = sim_updates$all_node_segments[[clone]]
)
}
## No loss segments found in maternal haplotype.
## No loss segments found in paternal haplotype.
## No loss segments found in maternal haplotype.
## No loss segments found in paternal haplotype.
## No loss segments found in maternal haplotype.
## No loss segments found in paternal haplotype.
## No loss segments found in maternal haplotype.
## No loss segments found in paternal haplotype.
## No loss segments found in maternal haplotype.
## No loss segments found in paternal haplotype.
After generating and validating all clone genomes, we save them as FASTA files for future use or analysis. Each clone’s maternal and paternal genomes are saved separately:
output_dir <- "./output/"
# Save each clone's genome sequences to FASTA files
for(clone in clone_names){
for(haplotype in c("maternal", "paternal")){
writing_genome <- tree_genomes[[clone]][[haplotype]]
# Write to FASTA file with naming pattern:
# "clone_[clone name]_[haplotype].fa"
writeXStringSet(writing_genome,
paste0(output_dir, "clone_",
clone, "_", haplotype, ".fa")
)
# Clean up to manage memory
rm(writing_genome)
}
}
Note about Read Simulation: The generated FASTA files can be used as reference sequences for read simulation using tools like ART, DWGSIM, or other sequencing simulators. However, in this tutorial, we skip the read simulation step as our example genomes are too small to generate meaningful sequencing reads. For realistic read simulation, please refer to our whole genome simulation tutorial, which demonstrates the process with full-size human genomes.