
XploR is an R package specifically developed for large-scale (≥5 Mb) copy number analysis in clinical genomics testing using whole exome sequencing (WES) data. It provides accurate copy number calling, as well as robust estimation of tumor purity and ploidy. XploR supports flexible rerun options based on chromosome region, tumor purity, or diploid coverage, and includes integrated ISCN annotation and visualization. These capabilities make XploR a powerful solution for clinical and research applications in genomic copy number analysis.
🚧 Project Status
Note: XploR is actively under development. Some features may evolve or be refined.
We greatly appreciate bug reports, feature suggestions, and user feedback.
Please open an issue if you encounter any problems.
Contents
- Features
- Installation
- Test Run
- Prepare input files
- Prepare reference files
- Algorithm
- Output
- Model selection and rerun
- Full function and parameter list
Features
- BAF and coverage denoise, smoothing, binning, and quality control
- Exome-wide copy number segmentation and allelic imbalance detection
- Purity and ploidy estimation with model selection
- Rerun based on chromosome region, purity or diploid coverage
- Cytoband and gene annotation of CNV segments
- Visualization
Installation
Install the latest version from GitHub using devtools:
install.packages("devtools")
devtools::install_github("sj-cmpb-se/XploR")
Quick Test Run
All files needed for a test run are placed in the inst/extdata folder, and RunExamplePipeline() uses them for the test run. Panel of normals generation is not included in the test run; for details on building a panel of normals, see Prepare reference files.
library(XploR)
RunExamplePipeline( out_dir = "/path_to_output_dir" )
Running this function is equivalent to running the following steps separately:
1. Run segmentation based on allelic imbalance information. The example uses the “cbs” segmentation method.
RunAIsegmentation(
seg = seg,
cov = cov,
ai = ai,
gender = gender,
out_dir = out_dir,
prefix = prefix,
ai_pon = ai_pon,
aitype = "dragen"
)
Parameters for RunAIsegmentation
| Parameter | Type | Description | Example Value |
|---|---|---|---|
| seg | character | Path to the GATK segment file. | "sample.seg" |
| cov | character | Path to the GATK denoised coverage count file. | "sample.counts" |
| ai | character | Path to the BAF file or allelic count file. | "sample.baf" |
| ai_pon | character | Path to PON Rdata. AI panel of normals generated by PONAIprocess. | "PON_AI.Rdata" |
| gender | character | Sample gender ("female" or "male"), passed to ReadAI(). | "female" |
| out_dir | character | Output directory path. | "results/" |
| prefix | character | Output file prefix. | "Sample1" |
| mergeai | numeric | MAF difference threshold for merging segments under “merge” segmentation mode (default: 0.15). | 0.15 |
| mergecov | numeric | CNV difference threshold for merging segments (default: 0.2). | 0.2 |
| snpmin | numeric | Minimum SNPs for MAF segmentation under “merge” segmentation mode (default: 7). | 7 |
| minsnpcov | numeric | Minimum coverage of SNPs to be included (default: 20). | 20 |
| maxgap | numeric | Maximum gap size inside a bin; if exceeded, start a new bin (default: 1,000,000). | 1000000 |
| snpnum | integer | SNP number in each bin (default: 30). | 30 |
| maxbinsize | numeric | Maximum bin size (default: 5,000,000). | 5000000 |
| minbinsize | numeric | Minimum bin size (default: 500,000). | 500000 |
| minsnpcallaicutoff | numeric | Minimum SNPs for reliable CNLOH/GAINLOH (default: 10). | 10 |
| mergecovminsize | numeric | Minimum size for GATK segment merge (default: 500,000). | 500000 |
| segmethod | character | Segmentation method: "merge" for stepwise merging, "cbs" for CBS segmentation. | "cbs" |
| cbssmooth | character | If using CBS, "yes" to apply smoothing before segmentation, "no" to skip smoothing. | "yes" |
| aitype | character | Type of allelic imbalance data: "gatk", "other", or "dragen" (see below for requirements). | "dragen" |
Note on aitype column requirements:
- If "gatk" or "other": input must include columns CONTIG, POSITION, ALT_COUNT, REF_COUNT, REF_NUCLEOTIDE, and ALT_NUCLEOTIDE.
- If "dragen": input must include columns contig, start, stop, refAllele, allele1, allele2, allele1Count, allele2Count, allele1AF, and allele2AF.
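The column requirements above can be checked before running the segmentation step. The sketch below is a minimal base R illustration; `required_ai_columns` and `missing_ai_columns()` are hypothetical names introduced here and are not part of XploR.

```r
# Required columns per aitype, copied from the note above.
gatk_like <- c("CONTIG", "POSITION", "ALT_COUNT", "REF_COUNT",
               "REF_NUCLEOTIDE", "ALT_NUCLEOTIDE")
required_ai_columns <- list(
  gatk   = gatk_like,
  other  = gatk_like,
  dragen = c("contig", "start", "stop", "refAllele", "allele1", "allele2",
             "allele1Count", "allele2Count", "allele1AF", "allele2AF")
)

# Hypothetical helper (not an XploR function): returns the names of the
# required columns that are absent from the data frame.
missing_ai_columns <- function(df, aitype) {
  required <- required_ai_columns[[aitype]]
  if (is.null(required)) stop("Unknown aitype: ", aitype)
  setdiff(required, colnames(df))
}

# Toy one-row table in GATK layout.
example_gatk <- data.frame(
  CONTIG = "1", POSITION = 12345L, ALT_COUNT = 10L, REF_COUNT = 12L,
  REF_NUCLEOTIDE = "A", ALT_NUCLEOTIDE = "G"
)

missing_ai_columns(example_gatk, "gatk")    # nothing missing
missing_ai_columns(example_gatk, "dragen")  # every DRAGEN column is missing
```

Column matching is case-sensitive in this sketch, which is why a GATK-style table fails the "dragen" check.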
2. Run model likelihood calculation and selection.
RunModelLikelihood(
seg = paste0(out_dir,"/",prefix,"_GATK_AI_segment.tsv"),
out_dir = out_dir,
prefix = prefix,
gender = gender,
modelminprobes = 20,
modelminAIsize = 5000000,
minsf = 0.4,
callcov = 0.3,
thread = 6)
Parameters for RunModelLikelihood
| Parameter | Type | Description | Example Value |
|---|---|---|---|
| seg | character | Path to the combined segment file (e.g., output from the segmentation step above) | "results/Sample1_GATK_AI_segment.tsv" |
| out_dir | character | Output directory for results | "results/" |
| prefix | character | Prefix for output files | "Sample1" |
| gender | character | Sample gender ("male" or "female") | "female" |
| modelminprobes | integer | Minimum number of probes/SNPs per segment to include in modeling | 20 |
| modelminAIsize | numeric | Minimum segment size (bp) to include in modeling | 5000000 |
| minsf | numeric | Minimum scale factor to consider in model selection | 0.4 |
| callcov | numeric | Subclonal events calling cutoff based on total copy number | 0.3 |
| thread | integer | Number of CPU threads to use for parallel processing | 6 |
| callcovcutoff | numeric | (Optional) Threshold for calling without modeling. | 0.3 |
| callaicutoff | numeric | (Optional) Threshold for calling without modeling. | 0.3 |
| minsnpcallaicutoff | integer | (Optional) Minimum SNPs to call AI segment | 10 |
Notes:
- Parameters marked as (Optional) can be omitted and have defaults.
- For a full description of all arguments and advanced options, see the function reference or ?RunModelLikelihood in R.
3. Annotate segments.
AnnotateSegments(
input = paste0(out_dir,"/",prefix,"_final_calls.tsv"),
out_dir = out_dir,
prefix = prefix,
cytoband = cytoband,
whitelist_edge = whitelist_edge,
gene = gene)
Parameters for AnnotateSegments
| Parameter | Type | Description | Example Value |
|---|---|---|---|
| input | character | Path to XploR CNV calling output. | "results/Sample1_final_calls.tsv" |
| out_dir | character | Output directory for results | "results/" |
| prefix | character | Prefix for output files | "Sample1" |
| cytoband | character | Path to cytoband annotation file (TSV). See Prepare reference files for details. | "data/cytoBand.txt" |
| whitelist_edge | character | Path to the detectable edge file for each chromosome. See Prepare reference files for details. | "data/whitelist.txt" |
| gene | character | Path to gene annotation file. See Prepare reference files for details. | "data/gene_anno.txt" |
4. Generate the CNV plot.
RunPlotCNV(
seg = paste0(out_dir,"/",prefix,"_CNV_annotation.tsv"),
cr =cr,
ballele = ai,
ai_binsize = 100000,
cov_binsize = 100000,
whitelist = whitelist_bed,
gender = gender,
out_dir = out_dir,
prefix = prefix,
aitype = "dragen"
)
Parameters for RunPlotCNV
| Parameter | Type | Description | Example Value |
|---|---|---|---|
| seg | character | Path to final annotated call file. | "results/Sample1_CNV_annotation.tsv" |
| cr | character | Path to the GATK denoised copy ratio file with extension .denoisedCR.tsv | "data/sample.denoisedCR.tsv" |
| ballele | character | Path to the B-allele file (from DRAGEN, GATK, or other source). See aitype for required columns. | "data/sample.tumor.baf.gz" |
| ai_binsize | numeric | Bin size for AI plot (default: 100,000) | 100000 |
| cov_binsize | numeric | Bin size for coverage plot (default: 100,000) | 100000 |
| whitelist | character | Path to whitelist file for regions to include | "data/whitelist.txt" |
| gender | character | Sample gender ("male" or "female") | "female" |
| out_dir | character | Output directory for plot | "results/" |
| prefix | character | Sample ID or output prefix | "Sample1" |
| aitype | character | Type of allelic imbalance data: "gatk", "dragen", or "other". | "dragen" |
5. Generate the AI segment quality file.
BafQC(
annofile = paste0(out_dir,"/",prefix,"_CNV_annotation.tsv"),
out_dir = out_dir,
prefix = prefix)
Parameters for BafQC
| Parameter | Type | Description | Example Value |
|---|---|---|---|
| annofile | character | Path to the CNV annotation file (e.g., *_CNV_annotation.tsv) | "results/Sample1_CNV_annotation.tsv" |
| out_dir | character | Output directory for the QC summary file | "results/" |
| prefix | character | Prefix for the QC output file | "Sample1" |
Prepare input files
- Run GATK in tumor-only mode with default parameters. Below is a summary of the GATK tumor-only mode command used in our pipeline. Please see the GATK website for details. The files used by XploR are sample.counts, sample.called.seg, sample.allelic_counts, and sample.denoisedCR.tsv.
- The allelic count file can also be generated by other software such as DRAGEN or samtools.
Supported allelic count file formats
| aitype parameter value | Software | Minimum columns | File extension |
|---|---|---|---|
| dragen | Illumina DRAGEN | contig, start, refAllele, allele2, allele1Count, allele2Count | "sample..tumor.ballele.counts.gz" |
| gatk | GATK | CONTIG, POSITION, ALT_COUNT, REF_COUNT, REF_NUCLEOTIDE, ALT_NUCLEOTIDE | "sample.allelic_counts" |
| other | Other (e.g. samtools) | CONTIG, POSITION, ALT_COUNT, REF_COUNT, REF_NUCLEOTIDE, ALT_NUCLEOTIDE | "" |
Prepare Reference Files
Panel of normal reference
A Panel of Normals (PON) is required and should be generated using GATK, DRAGEN, or any other software capable of producing allelic count files.
Note: Male and female PON files need to be generated separately.
A. Whitelist, Blacklist, and Detectable Boundary Files
These files are generated from the PON HDF5 file (from GATK), a cytoband file, and gender information. They are essential for downstream processing and include:
- Blacklist BED: Regions to exclude
- Whitelist BED: Regions to include
- Detectable Edge File: Defines detectable boundaries
These files are created based on the GATK Panel of Normals.
See the function documentation in R: ?PonProcess or help("PonProcess", package = "XploR").
Example usage:
PonProcess(
pon_file = pon_hdh5_file,
blacklist_bed = output_blacklist_bed,
whitelist_bed = output_whitelist_bed,
cytoband = cytoband,
detectable_edge = output_detectable_edge,
gender = gender
)
B. Panel of Normals Based on Allelic Count Files
The ai_pon_file should be a text file listing the paths to normal allelic count files generated by GATK, DRAGEN, or other software.
You can process these files to generate the PON reference for allelic imbalance using:
PONAIprocess(
ai_pon_file = ai_pon_file,
aitype = "gatk",
minsnpcov = 20,
output = "/Pathtoresults",
prefix = "PONAI",
maxgap = 2000000,
maxbinsize = 5000000,
minbinsize = 500000,
snpnum = 30,
gender = "female"
)
Parameters for PONAIprocess
| Parameter | Type | Description | Example Value |
|---|---|---|---|
| ai_pon_file | character | Path to a text file listing PoN AI file paths (one per line) | "pon_ai_file_list.txt" |
| aitype | character | Type of AI input file ("gatk", "dragen", or "other"), passed to ReadPonAI() | "gatk" |
| minsnpcov | integer | Minimum SNP coverage to include a site in the AI calculation | 20 |
| maxgap | numeric | Maximum allowed gap between SNPs within a bin (in base pairs) | 1000000 |
| maxbinsize | numeric | Maximum allowed bin size (in base pairs) | 5000000 |
| minbinsize | numeric | Minimum allowed bin size (in base pairs) | 500000 |
| snpnum | integer | Target number of SNPs per bin | 30 |
| output | character | Output directory for the processed PoN AI Rdata file | "results/" |
| prefix | character | Prefix for the output file | "PON" |
Gene annotation reference:
Gene annotation can be obtained from various sources (e.g., Ensembl, UCSC, Gencode, RefSeq). An example file is included with the package:
gene <- system.file("extdata", "RefSeqCurated.genePred.gene_region.txt", package = "XploR")
head(read.table(gene, header = TRUE, sep = "\t"))
Cytoband annotation reference:
Cytoband annotation files are typically downloaded from UCSC. An example file is included:
cytoband <- system.file("extdata", "hg19_cytoBand.dat", package = "XploR")
head(read.table(cytoband, header = TRUE, sep = "\t"))
Algorithm
Binning Strategy for allelic count Data
The BinMaf function implements a flexible binning strategy for minor allele frequency (MAF) data, supporting both tumor samples and panel of normal (PoN) samples. Binning uses a fixed target number of SNPs per bin, with additional criteria to handle genomic gaps and bin size limits. Within each bin, Gaussian mixture modeling (GMM) is applied to identify clusters in the MAF distribution.
Key features:
- Flexible binning: Bins are created for each chromosome by grouping consecutive SNPs until one of the following conditions is met:
  - The number of SNPs in the bin reaches a specified target (e.g., 20 SNPs per bin).
  - The genomic span of the bin exceeds a maximum bin size (e.g., 2,000,000 bp).
  - The gap between consecutive SNPs exceeds a maximum allowed gap (e.g., 1,000,000 bp).
  - The bin size is at least the specified minimum size (e.g., 500,000 bp).
This strategy ensures that bins are of consistent size and SNP content, while avoiding the inclusion of widely separated SNPs in the same bin, and is robust for both tumor and normal samples.
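The binning rules above can be sketched as a greedy pass over sorted SNP positions on one chromosome. This is an illustration only, not the BinMaf source: `bin_snps()` is a hypothetical helper, and treating the minimum bin size as a condition that must hold before a bin is closed on SNP count is one possible reading of the rules.

```r
# Illustrative greedy binning for a single chromosome (hypothetical helper).
# A new bin starts when the gap to the previous SNP exceeds maxgap, when
# adding the SNP would stretch the bin beyond maxbinsize, or when the bin
# already holds snpnum SNPs and spans at least minbinsize.
bin_snps <- function(pos, snpnum = 30, maxgap = 1e6,
                     maxbinsize = 5e6, minbinsize = 5e5) {
  pos <- sort(pos)
  bin_id <- integer(length(pos))
  current <- 1L
  bin_start <- pos[1]
  n_in_bin <- 0L
  for (i in seq_along(pos)) {
    if (i > 1) {
      gap <- pos[i] - pos[i - 1]
      span_before <- pos[i - 1] - bin_start
      span_with_snp <- pos[i] - bin_start
      full <- n_in_bin >= snpnum && span_before >= minbinsize
      if (gap > maxgap || span_with_snp > maxbinsize || full) {
        current <- current + 1L
        bin_start <- pos[i]
        n_in_bin <- 0L
      }
    }
    bin_id[i] <- current
    n_in_bin <- n_in_bin + 1L
  }
  bin_id
}

# Toy data: 40 SNPs spaced 50 kb apart, then a large gap, then 10 more SNPs.
pos <- c(seq(1e5, by = 5e4, length.out = 40),
         1e7 + seq(0, by = 5e4, length.out = 10))
bins <- bin_snps(pos)
table(bins)
```

In this toy input the first bin closes on the SNP target, and the large gap forces a new bin regardless of how many SNPs the preceding bin holds.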
Segmentation of MAF track
In addition to CBS (Circular Binary Segmentation), our pipeline supports a “merge” mode for segmentation based on minor allele frequency (MAF) values. While CBS is the default and recommended strategy, “merge” mode offers a step-wise, rule-based approach to combine adjacent MAF segments.
Step-wise Merging Strategy:
- The process begins with a minimal SNP count, which is incrementally increased until it reaches the final SNP count defined by snpmin.
- At each round, the following steps are performed:
  - Remove Small Segments: Discard all segments smaller than the current minimum SNP count.
  - Merge Adjacent Segments: Evaluate and merge adjacent segments if the difference in segment MAF values is less than or equal to the mergeai parameter (user-specified MAF difference threshold).
Note:
- CBS segmentation remains the default and is generally recommended for most workflows.
- The “merge” mode is useful for certain applications where a rule-based, stepwise merging of MAF segments is preferred.
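The step-wise merging described above can be sketched in base R as follows. The helpers `merge_adjacent()` and `stepwise_merge()`, the starting SNP count of 2, and the SNP-weighted mean used for the merged MAF are assumptions made for illustration; they are not taken from the XploR source.

```r
# Merge neighbouring segments whose MAF differs by at most mergeai.
# seg: data.frame ordered along one chromosome with columns
# start, end, maf, nsnp (hypothetical layout).
merge_adjacent <- function(seg, mergeai = 0.15) {
  merged <- seg[1, ]
  for (i in seq_len(nrow(seg))[-1]) {
    last <- nrow(merged)
    if (abs(seg$maf[i] - merged$maf[last]) <= mergeai) {
      total <- merged$nsnp[last] + seg$nsnp[i]
      # Assumption: merged MAF is the SNP-weighted mean of the two segments.
      merged$maf[last] <- (merged$maf[last] * merged$nsnp[last] +
                             seg$maf[i] * seg$nsnp[i]) / total
      merged$end[last] <- seg$end[i]
      merged$nsnp[last] <- total
    } else {
      merged <- rbind(merged, seg[i, ])
    }
  }
  merged
}

# Raise the minimum SNP count round by round (start_min <= snpmin assumed),
# dropping small segments and then merging neighbours in each round.
stepwise_merge <- function(seg, snpmin = 7, mergeai = 0.15, start_min = 2) {
  for (min_snps in seq(start_min, snpmin)) {
    seg <- seg[seg$nsnp >= min_snps, , drop = FALSE]
    if (nrow(seg) > 1) seg <- merge_adjacent(seg, mergeai)
  }
  seg
}

seg <- data.frame(start = c(1, 101, 201, 301),
                  end   = c(100, 200, 300, 400),
                  maf   = c(0.48, 0.45, 0.20, 0.22),
                  nsnp  = c(10, 3, 12, 9))
merged_seg <- stepwise_merge(seg)
merged_seg
```

Here the first two and the last two toy segments collapse into one segment each, because only the middle boundary has a MAF difference above 0.15.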
MAF Bias Correction Using Panel of Normal (PoN) Allelic Counts
To address systematic MAF bias, XploR incorporates a locus-specific reference built from a panel of normal (PoN) allelic count files. For each tumor segment, the per-SNP BAF values within the segment are extracted, and the segment is overlapped with the PoN to obtain the corresponding distribution of normal MAFs. The combined information is then used to evaluate whether the segment behaves like a balanced diploid region or reflects true allelic imbalance.
The tumor BAF distribution is assessed using a mixture-model and density-based framework, which distinguishes single-peak, near-0.5 distributions from segments showing clear deviation or multiple peaks. Only segments classified as balanced undergo correction: the segment mean MAF is transformed onto the logit scale, centered using the median MAF from the PoN at that locus, and transformed back, with values capped at 0.5. Segments demonstrating allelic imbalance retain their original MAF so that true biological signal is preserved.
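A minimal sketch of the logit-scale correction is shown below. It assumes that "centering" means subtracting the logit of the PoN median MAF, so that a tumor MAF equal to the PoN median maps to 0.5; `correct_balanced_maf()` is a hypothetical helper and the balanced/imbalanced classification is taken as given.

```r
# Hypothetical helper: logit-scale centering of segment MAF against the PoN.
# Only segments flagged as balanced are corrected; others keep their MAF.
correct_balanced_maf <- function(seg_maf, pon_median_maf, balanced) {
  shifted <- plogis(qlogis(seg_maf) - qlogis(pon_median_maf))
  corrected <- pmin(shifted, 0.5)  # cap at 0.5
  ifelse(balanced, corrected, seg_maf)
}

seg_maf        <- c(0.46, 0.30, 0.49)
pon_median_maf <- c(0.47, 0.47, 0.47)
balanced       <- c(TRUE, FALSE, TRUE)

corrected_maf <- correct_balanced_maf(seg_maf, pon_median_maf, balanced)
corrected_maf
```

The first segment moves toward 0.5, the second is left untouched because it is not balanced, and the third would exceed 0.5 after centering and is therefore capped.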
Purity and diploid coverage scale factor estimation
Estimate a Beta–Binomial Over-Dispersion Parameter from a Panel of Normals (PoN)
To accurately model over-dispersion in minor allele frequency (MAF) data, we estimate a beta-binomial dispersion parameter ($\theta$) using a panel of normal (PoN) samples. This allows us to account for extra-binomial variation and improves the likelihood calculation for each segment.
For each bin $b$ and depth stratum:
- Observed variance across normals: $s_b^2 = \mathrm{Var}_i(m_{i,b})$
- Binomial expectation: $v_b = \bar{m}_b (1 - \bar{m}_b) / n_b$
- Representative depth in that (bin, stratum): $n_b = \mathrm{median}_i(d_{i,b})$
- Moment estimator of $\theta$ per (bin, stratum): $\hat{\theta}_b$, obtained by comparing $s_b^2$ with $v_b$ at depth $n_b$
Within each depth stratum, we take a robust center (median) of $\hat{\theta}_b$ to obtain $\theta$ for that stratum.
where:
- $m_{i,b}$: MAF of sample $i$ in bin $b$
- $\bar{m}_b$: Reference (PoN) mean MAF for bin $b$
- $d_{i,b}$: Median depth for sample $i$ in bin $b$
- $s_b^2$: Observed variance of MAF across samples in bin $b$
- $v_b$: Expected binomial variance for bin $b$
- $n_b$: Representative (median) depth in bin $b$
- $\hat{\theta}_b$: Estimated over-dispersion for bin $b$
- $\theta$: Final over-dispersion parameter for the depth stratum
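One common method-of-moments form for a beta-binomial over-dispersion parameter is sketched below. It assumes the intraclass-correlation parameterization, in which the variance of an observed allele fraction at depth n is p(1 - p)/n × [1 + (n - 1)θ]; the exact parameterization used by XploR may differ. `estimate_theta_bin()` is a hypothetical helper, and truncating negative estimates at zero is also an assumption.

```r
# Hypothetical helper: moment estimate of theta for one (bin, stratum),
# assuming Var(MAF) = p * (1 - p) / n * (1 + (n - 1) * theta).
# Requires a representative depth n > 1.
estimate_theta_bin <- function(maf, depth) {
  n  <- median(depth)         # representative depth
  p  <- mean(maf)             # PoN mean MAF for the bin
  s2 <- var(maf)              # observed variance across normals
  v  <- p * (1 - p) / n       # binomial expectation
  max(0, (s2 / v - 1) / (n - 1))
}

depth <- c(80, 100, 120, 90, 110)          # toy per-sample depths
maf_a <- c(0.44, 0.46, 0.48, 0.45, 0.47)   # tight bin: no extra variance
maf_b <- c(0.30, 0.50, 0.40, 0.45, 0.35)   # noisy bin: over-dispersed

theta_bins <- c(estimate_theta_bin(maf_a, depth),
                estimate_theta_bin(maf_b, depth))

# Robust centre across bins within the depth stratum.
theta_stratum <- median(theta_bins)
```

With only two toy bins the median is simply their midpoint; with many bins per stratum it damps the influence of outlier bins.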
Prior assignment based on parsimony principle
Priors are assigned to each potential copy number combination based on the principle of parsimony, which favors simpler (biologically less complex) allele configurations. The biological difficulty level reflects the number of steps required to reach a given allele combination from the baseline diploid state (1,1), where each step represents either a gain or loss of one allele.
- Baseline State: The diploid state (1,1) is assigned a difficulty level of 1.
- Single-Step Changes: Combinations reachable from (1,1) in one step (single gain or loss) are assigned a difficulty level of 2.
- Two-Step Changes: Combinations requiring two steps (e.g., loss then gain, or two sequential gains) receive a difficulty level of 3.
- Three-, Four-, Five-Step Changes: Difficulty levels increase with the number of steps, up to 6 for the most complex.
- Special Considerations:
  - Whole Chromosome Duplication: Combinations like (2,2) are assigned higher difficulty due to their rarity.
  - Sequential Gains: Combinations with sequential gains of the same chromosome (e.g., (3,1)) are considered less difficult than those involving loss followed by gain (e.g., (2,0)).
The prior for each allele combination is calculated using an exponential decay function controlled by a decay rate parameter λ:
prior = exp(-λ × Bio_diff)
where Bio_diff is the assigned biological difficulty score for the copy number configuration.
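The exponential decay prior can be written directly in R. In this sketch `parsimony_prior()` is a hypothetical helper, and the decay rate of 0.5 is an arbitrary illustrative value rather than an XploR default.

```r
# prior = exp(-lambda * Bio_diff), as described above.
parsimony_prior <- function(bio_diff, lambda) exp(-lambda * bio_diff)

bio_diff <- 1:6                                   # difficulty levels 1 to 6
priors <- parsimony_prior(bio_diff, lambda = 0.5) # lambda chosen for illustration
data.frame(Bio_diff = bio_diff, prior = priors)
```

A larger λ penalizes complex allele configurations more strongly, while λ = 0 gives every configuration the same prior.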
Tumor Copy Number Estimation
For each genomic segment, the model computes a range of possible tumor-specific copy numbers (CN_tumor) that could result from the observed data under different cancer cell fractions (ccf):
- Tumor Copy Number Formula:
$$CN_{tumor} = \frac{CN_{obs}/\mu - 2\,(1 - \rho \cdot ccf)}{\rho \cdot ccf}$$
where:
- $CN_{obs}$: Observed segment copy number
- $\mu$: Diploid coverage scale factor
- $\rho$: Tumor purity
- $ccf$: Cancer cell fraction
Combination Generation:
For each potential tumor CN, all feasible major and minor allele combinations are generated and filtered based on biological plausibility.
CCF Value Calculation:
For non-diploid segments, CCF is calculated; for diploid segments, it is set as NA.
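As a worked check, the sketch below assumes a two-population mixing model in which a fraction ρ × ccf of cells carries the tumor copy number and all remaining cells are diploid, with the observed copy number first divided by the scale factor μ. `tumor_cn()` is a hypothetical helper, and the autosomal baseline of 2 is an assumption. The inputs are the example values from the sample_final_calls.tsv table in the Output section (gatk_SM_raw, mu, rho).

```r
# Hypothetical helper: invert the mixing model
#   cn_obs / mu = cn_tumor * (rho * ccf) + 2 * (1 - rho * ccf)
tumor_cn <- function(cn_obs, mu, rho, ccf = 1) {
  altered <- rho * ccf                 # fraction of cells carrying the event
  (cn_obs / mu - 2 * (1 - altered)) / altered
}

# Example row: gatk_SM_raw = -0.094372, mu = 0.9, rho = 0.938.
cn_obs <- 2 * 2^(-0.094372)            # raw log2 ratio converted to copy number
cn_corrected <- tumor_cn(cn_obs, mu = 0.9, rho = 0.938)
cn_corrected                           # approximately 2.087

# The scale-corrected log2 ratio for the same row.
log2(cn_obs / 0.9 / 2)                 # approximately 0.0576
```

Under these assumptions the two results agree with the CNF_correct and Segment_Mean example values for that row. Lower ccf values push the implied tumor copy number further from 2 for the same observed coverage.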
Likelihood Calculation
- MAF Likelihood:
For each combination, the B allele frequency likelihood is computed using the beta distribution, parameterized by $\alpha$ and $\beta$, derived from the expected MAF:
$$\alpha = MAF_{exp} \cdot K + \epsilon, \qquad \beta = (1 - MAF_{exp}) \cdot K + \epsilon$$
$$L_{MAF} = \mathrm{Beta}(MAF_{obs};\ \alpha, \beta)$$
where:
- $K$: Beta-binomial precision parameter, estimated for each segment based on local read depth and the over-dispersion parameter (see below).
- $\epsilon$: Small positive value for numerical stability
Estimation of $K$:
For each segment, $K$ is calculated from:
- $d$: Median SNP depth for the segment
- $\theta$: Beta-binomial over-dispersion parameter, estimated from the panel of normals (PoN) for the corresponding depth stratum
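A minimal sketch of the beta-density likelihood is given below, assuming the shape parameters are the expected MAF and its complement scaled by the precision K plus a small stabilizing constant. `maf_likelihood()` is a hypothetical helper, and the values of K and the constant are illustrative only.

```r
# Hypothetical helper: beta density of the observed MAF given the expected
# MAF of an allele configuration and a precision parameter K.
maf_likelihood <- function(maf_obs, maf_exp, K, eps = 1e-6) {
  alpha <- maf_exp * K + eps
  beta  <- (1 - maf_exp) * K + eps
  dbeta(maf_obs, alpha, beta)
}

# Expected MAF of 1/3 (e.g., a pure 2:1 configuration), precision K = 80.
lik_near <- maf_likelihood(0.33, maf_exp = 1 / 3, K = 80)
lik_far  <- maf_likelihood(0.45, maf_exp = 1 / 3, K = 80)
c(near = lik_near, far = lik_far)
```

A larger K concentrates the density around the expected MAF, so high-depth, low-dispersion segments discriminate more sharply between configurations.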
Posterior Likelihood:
The posterior likelihood for each combination incorporates both the BAF likelihood and the prior, weighted by a factor $\gamma$:
$$\text{Posterior} = L_{MAF} \times \text{prior}^{\gamma}$$
where:
- $L_{MAF}$: Likelihood of the observed minor allele frequency under the current model
- $\text{prior}$: Prior probability assigned to the allele combination based on biological plausibility
- $\gamma$: Weighting factor controlling the influence of the prior in the posterior calculation
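The combination of likelihood and prior can be sketched on the log scale, assuming the weighting factor multiplies the log-prior (which matches the maf_ll and weighted_prior columns described in the Output section). `posterior_likelihood()` is a hypothetical helper and the input values are illustrative.

```r
# Hypothetical helper: posterior = exp(maf_ll + gamma * log(prior)),
# i.e. likelihood times prior raised to the power gamma.
posterior_likelihood <- function(maf_ll, prior, gamma = 1) {
  weighted_prior <- gamma * log(prior)
  exp(maf_ll + weighted_prior)
}

post_full <- posterior_likelihood(maf_ll = -0.56, prior = 0.12, gamma = 1)
post_flat <- posterior_likelihood(maf_ll = -0.56, prior = 0.12, gamma = 0)
c(with_prior = post_full, without_prior = post_flat)
```

Setting the weighting factor to 0 removes the prior entirely, so the posterior reduces to the MAF likelihood alone.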
Assigning Calls for Each Segment
The SelectCallpersegment() function refines and selects the most likely allele combinations for each genomic segment, handling both clonal and subclonal events and incorporating coverage differences and prior knowledge.
- Initialization and Preprocessing:
  - Replace any zero BAF likelihoods with a minimum likelihood value to ensure all terms contribute meaningfully.
  - Compute expected coverage for each model and the difference (cov_diff) from observed coverage.
- Selection of Top Likelihood Models:
  - For each segment, identify the top two models with the highest MAF likelihoods.
  - If a subclonal event is likely (e.g., the second-ranked model has ccf > 0.3 and comparable likelihood), select it as a subclonal event.
  - If both major and minor copy numbers are equal, selection is based on cov_diff.
- Handling Models with minor = 0:
  - For segments with minor = 0, where MAF likelihood is unreliable, select the model with the smallest cov_diff.
  - Ensure consistency in selection for both minor = 0 and minor ≠ 0 cases.
- Post-Processing:
  - Calculate and store the log-transformed likelihood value (log_MAF_likelihood) for each selected model.
Output
Segmentation output
sample_GATK_AI_segment.tsv (Generated by ?RunAIsegmentation() function)
| Column | Type | Description | Example_value |
|---|---|---|---|
| Sample | character | Sample identifier | Sample1 |
| Chromosome | character | Chromosome name | 1 |
| Start | integer | Start position (base pair) | 123456 |
| End | integer | End position (base pair) | 234567 |
| Num_Probes | integer | Number of probes/SNPs in the segment | 25 |
| Segment_Mean | numeric | Segment mean (log2 ratio) from CNV analysis | 0.42 |
| gatk_SM_raw | numeric | Raw segment mean from GATK | 0.38 |
| gatk_count | integer | Number of counts in GATK segment | 30 |
| gatk_baselinecov | numeric | The GATK baseline is an intermediate value calculated using gatk_SM_raw and gatk_count. | 100.5 |
| gatk_gender | character | Gender as reported by GATK | female |
| pipeline_gender | character | Gender as used in pipeline | female |
| MAF | numeric | Minor allele frequency for the segment | 0.21 |
| MAF_Probes | integer | Number of probes used to calculate MAF | 18 |
| MAF_gmm_G | integer | Number of GMM clusters in MAF distribution | 2 |
| MAF_gmm_weight | numeric | Mixture weight of the main GMM cluster | 0.85 |
| size | integer | Segment size in base pairs | 111111 |
| balance_tag | character | Balance test result (balanced or imblalanced) | balanced |
| BreakpointSource | character | Source of breakpoint (GATK or Postprocess) | GATK |
| FILTER | character | Quality tag for the segment (PASS or FAILED) | PASS |
| depth | numeric | Median read depth of the segment | 80 |
| depth_bin | numeric | Read depth group for assigning theta | 1 |
| theta | numeric | Beta–Binomial Over-Dispersion estimated based on depth | 0 |
| K | numeric | Beta-binomial precision parameter estimated based on depth and theta | 80 |
Raw likelihood results under each configuration
sample_likelihood_raw.tsv (Generated by ?RunModelLikelihood() function)
| Column | Type | Description | Example_value |
|---|---|---|---|
| major | integer | Major allele copy number | 2 |
| minor | integer | Minor allele copy number | 1 |
| CN | integer | Total copy number (major + minor) | 3 |
| ccf | numeric | Cancer cell fraction | 0.85 |
| Bio_diff | integer | Biological difficulty score for the allele combination | 3 |
| prior | numeric | Prior probability for the allele combination | 0.12 |
| expected_maf | numeric | Expected minor allele frequency for this configuration | 0.21 |
| maf_ll | numeric | Log-likelihood for the observed MAF under this configuration | -0.56 |
| weighted_prior | numeric | Weighted log-prior (prior × gamma) | -2.13 |
| exp_maf_ll | numeric | Exponentiated MAF log-likelihood | 0.57 |
| exp_prior | numeric | Exponentiated weighted prior | 0.11 |
| MAF_likelihood | numeric | Posterior likelihood for this configuration | 0.065 |
| Segcov | numeric | Pseudo Segment coverage | 280 |
| MAF | numeric | Observed minor allele frequency | 0.19 |
| mu | numeric | Diploid coverage scale factor | 1.0 |
| rho | numeric | Tumor purity (fraction between 0 and 1) | 0.7 |
| index | character | Segment index or identifier | "12" |
| Tag | character | Segment inclusion/exclusion tag for summarizing total likelihood for a model (e.g., "Include", "Exclude") | "Include" |
| ccf_MAF | numeric | Cancer cell fraction estimated from MAF and allele configuration only | 0.81 |
Allelic combination result with maximum likelihood under each configuration
sample_top_likelihood_calls.tsv (Generated by ?SelectCallpersegment() function)
The format is similar to sample_likelihood_raw.tsv, with the best allelic combination selected for each segment under each diploid coverage scale factor and tumor purity configuration.
Likelihood for each combination of diploid coverage scale factor and tumor purity
sample_Models_likelihood.tsv ( Generated by ?SelectFinalModel() function )
| Column | Type | Description | Example_value |
|---|---|---|---|
| mu | numeric | Diploid coverage scale factor (model parameter) | 1.0 |
| rho | numeric | Tumor purity (model parameter, fraction between 0 and 1) | 0.7 |
| total_log_likelihood_before_refine | numeric | Total log-likelihood for the model before refinement | -1234.5 |
| segments_n | integer | Number of segments included in the model | 27 |
| Likelihood_penalty_rows | integer | Number of segments penalized due to failed likelihood calculation | 2 |
| total_log_likelihood_after_refine | numeric | Total log-likelihood for the model after refinement | -1220.2 |
| diploid_n | integer | Number of diploid segments in the model | 15 |
| diploid_distance_to_integer | numeric | Mean distance to integer copy number for diploid segments | 0.04 |
| nondiploid_n | integer | Number of non-diploid segments in the model | 12 |
| nondiploid_distance_to_integer | numeric | Mean distance to integer copy number for non-diploid segments | 0.11 |
| total_distance_to_integer | numeric | Sum of diploid and non-diploid mean distances to integer copy number | 0.15 |
| ploidy | numeric | Mean copy number (ploidy) across all segments | 2.4 |
| Tier1 | character | Model tier label (e.g., "Tier1_Models", "Final_model_MAF") | "Tier1_Models" |
| total_likelihood_cluster | integer | Rank based on total likelihood ( lower is better ) | 1 |
| diploid_distance_cluster | integer | Rank based on diploid distance to integer copy number ( lower is better ) | 1 |
| nondiploid_distance_cluster | integer | Rank based on non-diploid distance to integer copy number (lower is better) | 1 |
| total_likelihood_cluster_mean | numeric | Mean total log-likelihood for the level | -1200.0 |
| diploid_distance_cluster_mean | numeric | Mean diploid distance to integer for the level | 0.03 |
| nondiploid_distance_cluster_mean | numeric | Mean non-diploid distance to integer for the level | 0.10 |
Final output of CNV calling
sample_final_calls.tsv (Generated by ?RunModelLikelihood() function)
| Column | Type | Description | Example_value |
|---|---|---|---|
| Chromosome | character | Chromosome name | 1 |
| Start | integer | Start position (base pair) | 3301463 |
| End | integer | End position (base pair) | 247784114 |
| size | integer | Segment size (bp) | 244367069 |
| Num_Probes | integer | Number of probes from GATK segment file | 222 |
| Call | character | Copy number call (e.g., REF, GAIN, LOSS, GAINLOH, CNLOH) | REF |
| ccf_COV | numeric | Cancer cell fraction estimated from coverage | 1 |
| ccf_MAF | numeric | Cancer cell fraction estimated from MAF | 0 |
| ccf_final | numeric | Final cancer cell fraction after refinement | 1 |
| Segment_Mean | numeric | Final Segment mean (log2 ratio) | 0.057631093 |
| CNF_correct | numeric | Purity corrected copy number estimate from coverage | 2.086898584 |
| major | integer | Major allele copy number | 1 |
| minor | integer | Minor allele copy number | 1 |
| CN | integer | Total copy number (major + minor) | 2 |
| MAF | numeric | Observed minor allele frequency | 0.5 |
| MAF_correct | numeric | Purity corrected minor allele frequency | 0.5 |
| expected_maf | numeric | Expected minor allele frequency for this configuration | 0.5 |
| expected_cov | numeric | Expected pseudo coverage for this segment | 90 |
| MAF_Probes | integer | Number of probes used for MAF calculation | 1110 |
| MAF_gmm_G | integer | Number of GMM clusters in MAF distribution | 5 |
| MAF_gmm_weight | numeric | Mixture weight of the main GMM cluster | 0.667871528 |
| balance_tag | character | Balance test result (balanced or imblalanced) | balanced |
| BreakpointSource | character | Source of breakpoint (GATK or Postprocess) | GATK |
| FILTER | character | Quality tag for the segment (PASS or FAILED) | PASS |
| maf_ll | numeric | Log-likelihood for the observed MAF | 2.625299941 |
| MAF_likelihood | numeric | Posterior likelihood for this configuration | 8.891628731 |
| mu | numeric | Diploid coverage scale factor | 0.9 |
| rho | numeric | Tumor purity (fraction between 0 and 1) | 0.938 |
| index | character | Segment index or identifier | 1 |
| gatk_SM_raw | numeric | Raw segment mean from GATK | -0.094372 |
| gatk_count | integer | Number of counts in GATK segment | 361 |
| gatk_baselinecov | numeric | The GATK baseline is an intermediate value calculated using gatk_SM_raw and gatk_count. | 385.4038109 |
| gatk_gender | character | Gender as reported by GATK | female |
| pipeline_gender | character | Gender as used in pipeline | female |
| CN_mix | character | Indicator for copy number mixture (No or CN_Mix) | No |
| Model_source | character | Source of model selection (Coverage, Coverage + MAF, Diploid) | Coverage + MAF |
Model selection plots
Likelihood dot plot:
The plot displays the likelihood ranking for all combinations of diploid coverage scale factor and tumor purity. The vertical dashed line indicates the likelihood cutoff used to define Tier 1 models.
Model plot:
The model plot displays the likelihood values of different models, which are calculated based on potential combinations of diploid coverage scale factor and tumor purity. In the plot, red indicates higher likelihood, while blue signifies lower likelihood. The light blue dot indicates the final model selected by XploR.
Tier1 Models Overall:
This plot shows copy number calls for each combination of diploid coverage scale factor and tumor purity. Red indicates gain, blue indicates loss, and white indicates no change. Each configuration is labeled on the y-axis. By evaluating coverage and allelic imbalance patterns in this overview, you can identify the reasonable range of diploid coverage scale factors and tumor purity values. This helps guide reruns with optimized parameter ranges if needed.
Tier1 Models Zoom in:
A zoomed-in view that makes the y-axis configurations more visible for detailed inspection.
QC Summary Table
sample_PASS_STAT_chr.txt ( Generated by ?BafQC() function )
| Column | Type | Description | Example_value |
|---|---|---|---|
| chrom | character | Chromosome name (e.g., 1, 2, …, X, Y) | 1 |
| FILTER | character | Segment filter status | PASS |
| Total_segment_count | integer | Total number of segments on the chromosome | 25 |
| PASS_Seg_Count | integer | Number of segments with PASS filter status | 20 |
| PASS_Seg_Percent | numeric | Percentage of segments with PASS status (0–1) | 0.80 |
| Total_segment_size | integer | Total size (bp) of all segments on the chromosome | 249250621 |
| PASS_Seg_Size | integer | Total size (bp) of PASS segments on the chromosome | 199400497 |
| PASS_Seg_Size_Percent | numeric | Percentage of total segment size that is PASS (0–1) | 0.80 |
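The per-chromosome statistics in this table can be reproduced from any segment table that has chromosome, size, and FILTER columns. The sketch below uses toy data and base R only; it illustrates the column definitions and is not the BafQC implementation.

```r
# Toy segment table with the columns needed for the summary.
seg <- data.frame(
  Chromosome = c("1", "1", "1", "2"),
  size       = c(100, 300, 600, 500),
  FILTER     = c("PASS", "FAILED", "PASS", "PASS"),
  stringsAsFactors = FALSE
)

# One summary row per chromosome.
qc <- do.call(rbind, lapply(split(seg, seg$Chromosome), function(x) {
  pass <- x$FILTER == "PASS"
  data.frame(
    chrom                 = x$Chromosome[1],
    Total_segment_count   = nrow(x),
    PASS_Seg_Count        = sum(pass),
    PASS_Seg_Percent      = mean(pass),
    Total_segment_size    = sum(x$size),
    PASS_Seg_Size         = sum(x$size[pass]),
    PASS_Seg_Size_Percent = sum(x$size[pass]) / sum(x$size)
  )
}))
qc
```

Comparing the count-based and size-based percentages shows whether failed segments are small fragments or cover a substantial part of a chromosome.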
Annotation file
sample_CNV_annotation.tsv ( Generated by ?AnnotateSegments() function, only unique columns are listed ).
ISCN calculation rules:
1. All segments are reported with start and end cytoband in ISCN format. However, certain considerations are made for the position of the centromere:
   a. In metacentric chromosomes, if a segment crosses the centromere and the gaps between the segment and the telomere on both sides are less than 5 Mb, only the chromosome number is reported.
   b. In metacentric chromosomes, if a segment does not cross the centromere, and the gaps between the segment and the centromere and the telomere are both less than 5 Mb, the chromosome number followed by ‘p’ or ‘q’ is reported.
   c. In acrocentric chromosomes, if the segment fulfills rule ‘b’ above, only the chromosome number is reported.
| Column | Type | Description | Example_value |
|---|---|---|---|
| p_chromStart | integer | Detectable start position of p arm | 10 |
| p_chromEnd | integer | Detectable end position of p arm | 121535434 |
| p_first_name | character | Detectable name of first cytoband in p arm | p36.33 |
| p_last_name | character | Detectable name of last cytoband in p arm | p11.2 |
| q_chromStart | integer | Detectable start position of q arm | 121535435 |
| q_chromEnd | integer | Detectable end position of q arm | 247784114 |
| q_first_name | character | Detectable name of first cytoband in q arm | q11.1 |
| q_last_name | character | Detectable name of last cytoband in q arm | qter |
| p_gap_to_tel | integer | Gap from segment start to p arm telomere | 0 |
| p_gap_to_cen | integer | Gap from segment end to p arm centromere | 10000 |
| q_gap_to_tel | integer | Gap from segment end to q arm telomere | 0 |
| q_gap_to_cen | integer | Gap from segment start to q arm centromere | 10000 |
| ISCN | character | ISCN-style cytogenetic annotation | 1p36.33-p11.2 |
| Gene | character | Overlapping gene(s) in the segment | TP53 |
| Gene_count | integer | Number of overlapping genes | 1 |
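The ISCN centromere rules above can be sketched as a small decision function. This is an illustration under stated assumptions, not the code used by AnnotateSegments(): the function name `iscn_region()` and its arguments are hypothetical, chromosomes 13, 14, 15, 21 and 22 are treated as acrocentric, and every other chromosome is treated as metacentric.

```r
# Hypothetical helper, not part of XploR.
# tel_gaps: gap(s) in bp between the segment and the telomere(s); two values
#           (p and q side) when the segment crosses the centromere, else one.
# cen_gap:  gap in bp between the segment and the centromere (non-crossing only).
# arm:      "p" or "q" for a segment that stays on one arm.
iscn_region <- function(chrom, first_band, last_band, crosses_cen,
                        tel_gaps, cen_gap = NA_real_, arm = NA_character_,
                        max_gap = 5e6) {
  # Assumption: anything not in this set is handled as metacentric.
  acrocentric <- chrom %in% c("13", "14", "15", "21", "22")
  near_tel <- all(tel_gaps < max_gap)
  if (crosses_cen) {
    if (!acrocentric && near_tel) return(chrom)      # rule a: whole chromosome
  } else if (near_tel && cen_gap < max_gap) {
    if (acrocentric) return(chrom)                   # rule c: chromosome only
    return(paste0(chrom, arm))                       # rule b: whole arm
  }
  # Default: chromosome plus start and end cytobands.
  paste0(chrom, first_band, "-", last_band)
}

whole_chrom <- iscn_region("3", "p26.3", "q29", TRUE, tel_gaps = c(0, 1e6))
whole_arm   <- iscn_region("3", "p26.3", "p11.1", FALSE, tel_gaps = 0,
                           cen_gap = 10000, arm = "p")
acro        <- iscn_region("13", "q11", "q34", FALSE, tel_gaps = 0,
                           cen_gap = 10000, arm = "q")
partial     <- iscn_region("3", "p26.3", "p21.1", FALSE, tel_gaps = 0,
                           cen_gap = 4e7, arm = "p")
print(c(whole_chrom, whole_arm, acro, partial))
```

The last call falls through to the default band-to-band form because the segment ends far from the centromere.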
CNV plot
sample_CNV_plot.png ( Generated by ?RunPlotCNV() function).
The CNV plot shows a genome-wide summary in four tracks: copy number (top track), B-allele frequency (BAF, second track), tumor fraction (ccf, third track), and segment quality (bottom track). The copy number (CN), on the y-axis, is a linear count of the number of copies of each chromosome in the tumor cells, taking tumor purity and tumor fraction into account. Each chromosome is plotted as a set of dots that collectively show the estimated sequence coverage for the chromosome, and as a narrow turquoise line that shows the final CN call for the chromosome. The BAF track shows the variant allele fraction of SNPs across the genome, using the same coloring as the copy number track. When the copy number of a chromosome changes, the BAF track for the affected chromosome splits because of the imbalance in chromosome counts. Because the variance of B-allele frequencies is high, the split may be difficult to discern. To assist with interpreting the BAF track, a turquoise line is drawn at the median level to show the imbalance.
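To illustrate why the BAF track splits, the sketch below computes the expected BAF bands of a heterozygous germline SNP from tumor purity, total tumor copy number, and the number of copies of the minor allele, assuming the normal cells are diploid. This is the standard textbook relation, shown only as an aid to reading the plot; it is not necessarily the exact model XploR uses, and the function name `expected_baf()` is hypothetical.

```r
# Hypothetical helper, not part of XploR.
# purity:   fraction of tumor cells in the sample (0-1)
# total_cn: total copy number in tumor cells
# minor_cn: copies of the less abundant allele in tumor cells
expected_baf <- function(purity, total_cn, minor_cn) {
  # Normal cells contribute two copies in total, one per allele.
  denom <- purity * total_cn + 2 * (1 - purity)
  lower <- (purity * minor_cn + (1 - purity)) / denom
  c(lower = lower, upper = 1 - lower)
}

balanced <- expected_baf(0.8, total_cn = 2, minor_cn = 1)  # both bands at 0.5
gain     <- expected_baf(0.8, total_cn = 3, minor_cn = 1)  # bands near 0.36 and 0.64
loss     <- expected_baf(0.8, total_cn = 1, minor_cn = 0)  # bands near 0.17 and 0.83
print(rbind(balanced, gain, loss))
```

The lower the purity, the closer the two bands move toward 0.5, which is one reason the split can be hard to see in low-purity samples.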
Model selection and rerun
XploR allows users to rerun copy number variant (CNV) calling with custom purity and scale factor (size factor) ranges, or by specifying a diploid region for normalization. The RerunCNV() function supports both “model” and “region” rerun modes and outputs refined CNV calls. In “region” mode, the diploid coverage region is specified using three parameters: chromosome, start, and end.
- If only chromosome is provided, the entire chromosome is used.
- If chromosome and start are provided, the region from start to the end of the chromosome is used.
- If chromosome and end are provided, the region from the start of the chromosome to end is used.
- If all three parameters are provided, the region between start and end on the selected chromosome is used.
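The region rules above can be sketched as a small resolver. This is only an illustration of the described behavior, not XploR's internal code: the helper name `resolve_diploid_region()`, the `chrom_length` argument, and the use of 1 as the first position are assumptions.

```r
# Hypothetical helper, not part of XploR.
# Fills in a missing start or end so that the region always has both bounds.
resolve_diploid_region <- function(chromosome, start = NULL, end = NULL,
                                   chrom_length) {
  if (is.null(start)) start <- 1           # no start: begin at chromosome start
  if (is.null(end)) end <- chrom_length    # no end: run to chromosome end
  stopifnot(start >= 1, end <= chrom_length, start < end)
  list(chromosome = chromosome, start = start, end = end)
}

chr3_len <- 198e6  # approximate chromosome 3 length, for illustration only

whole     <- resolve_diploid_region("3", chrom_length = chr3_len)
from_1mb  <- resolve_diploid_region("3", start = 1e6, chrom_length = chr3_len)
up_to_50  <- resolve_diploid_region("3", end = 5e7, chrom_length = chr3_len)
bounded   <- resolve_diploid_region("3", start = 1e6, end = 5e7,
                                    chrom_length = chr3_len)
str(bounded)
```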
# Rerun using specific purity and scale factor ranges
RerunCNV(
seg = "results/Sample1_GATK_AI_segment.tsv",
input = "results/Sample1_top_likelihood_calls.tsv",
models = "results/Sample1_Models_likelihood.tsv",
call = "results/Sample1_final_call.tsv",
gender = "female",
dicovsf = "0.95:1.05",
purity = "0.6:0.8",
mode = "model",
out_file = "results/Sample1_final_call_refined.tsv"
)
# Rerun using a user-defined diploid region
RerunCNV(
seg = "results/Sample1_GATK_AI_segment.tsv",
input = "results/Sample1_top_likelihood_calls.tsv",
models = "results/Sample1_Models_likelihood.tsv",
call = "results/Sample1_final_call.tsv",
gender = "male",
chromosome = "3",
start = 1000000,
end = 50000000,
mode = "region",
out_file = "results/Sample1_final_call_refined.tsv"
)
Parameters for RerunCNV
| Parameter | Type | Description | Example Value |
|---|---|---|---|
| seg | character | Path to AI segment file generated by RunAIsegmentation. | "results/Sample1_GATK_AI_segment.tsv" |
| input | character | Path to top likelihood row file generated by RunModelLikelihood. | "results/Sample1_top_likelihood_calls.tsv" |
| models | character | Path to model likelihood file generated by RunModelLikelihood. | "results/Sample1_Models_likelihood.tsv" |
| call | character | Path to final call file generated by RunModelLikelihood. | "results/Sample1_final_call.tsv" |
| gender | character | Sample gender, either "male" or "female". | "female" |
| dicovsf | character | Desired scale factor range for normalization (must be in "min:max" format). Required if mode = "model". | "0.95:1.05" |
| purity | character | Desired purity range for model selection (must be in "min:max" format). Required if mode = "model". | "0.6:0.8" |
| callcov | numeric | Subclonal event calling cutoff (no models, coverage-based). Default is 0.3. | 0.3 |
| chromosome | character | Chromosome for user-defined diploid region (required if mode = "region"). | "3" |
| start | integer | Start position for diploid region (optional; used in "region" mode). | 1000000 |
| end | integer | End position for diploid region (optional; used in "region" mode). | 50000000 |
| mode | character | Rerun mode: either "model" (custom purity/scale factor) or "region" (user-defined diploid region). | "model" |
| out_file | character | Output file path for refined CNV calls. | "results/Sample1_final_call_refined.tsv" |
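Because dicovsf and purity are passed as "min:max" strings, it can help to check them before starting a rerun. The snippet below is a minimal sketch of such a check; `parse_range()` is a hypothetical helper, not an XploR function, and RerunCNV may validate its inputs differently.

```r
# Hypothetical helper, not part of XploR.
# Splits a "min:max" string into two numbers and rejects malformed input.
parse_range <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)[[1]]
  vals <- suppressWarnings(as.numeric(parts))
  if (length(vals) != 2 || anyNA(vals) || vals[1] > vals[2]) {
    stop("Range must be in 'min:max' format with min <= max: ", x)
  }
  c(min = vals[1], max = vals[2])
}

purity_range  <- parse_range("0.6:0.8")
dicovsf_range <- parse_range("0.95:1.05")
print(purity_range)

# A value without a colon is rejected rather than silently accepted.
bad <- try(parse_range("0.8"), silent = TRUE)
```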
Note:
After rerunning, the resulting file can be used in downstream steps, such as plotting and annotation, just like files generated before the rerun.
Full function and parameter list
For a complete list of all functions and their parameters, please visit the XploR function reference.
Each function page includes detailed parameter descriptions, usage examples, and links to related documentation.