Large-Scale Allele-Specific CNV Detection Using Coverage and Allelic Imbalance in WES data • XploR

XploR is an R package specifically developed for large-scale (≥5 Mb) copy number analysis in clinical genomics testing using whole exome sequencing (WES) data. It provides accurate copy number calling, as well as robust estimation of tumor purity and ploidy. XploR supports flexible rerun options based on chromosome region, tumor purity, or diploid coverage, and includes integrated ISCN annotation and visualization. These capabilities make XploR a powerful solution for clinical and research applications in genomic copy number analysis.

🚧 Project Status

Note: XploR is actively under development. Some features may evolve or be refined.
We greatly appreciate bug reports, feature suggestions, and user feedback.
Please open an issue if you encounter any problems.

Features
Installation
Test Run
Prepare input files
Prepare reference files
Algorithm
Output
Model selection and rerun
Full function and parameter list

Features

BAF and coverage denoising, smoothing, binning, and quality control
Exome-wide copy number segmentation and allelic imbalance detection
Purity and ploidy estimation with model selection
Rerun based on chromosome region, purity or diploid coverage
Cytoband and gene annotation of CNV segments
Visualization

Installation

Install the latest version from GitHub using devtools:

install.packages("devtools")
devtools::install_github("sj-cmpb-se/XploR")

Quick Test run

All files needed for a test run are placed in the inst/extdata folder, and RunExamplePipeline() uses these files. Panel of normals generation is not included in the test run; for details on building a panel of normals, see Prepare reference files.

library(XploR)
RunExamplePipeline( out_dir = "/path_to_output_dir" )

Running this function is equivalent to running the following steps separately:

1. Run segmentation based on allelic imbalance information. The example uses the “cbs” segmentation method.

RunAIsegmentation(
    seg = seg,
    cov = cov,
    ai = ai,
    gender = gender,
    out_dir = out_dir,
    prefix = prefix,
    ai_pon = ai_pon,
    aitype = "dragen"
  )

Parameters for `RunAIsegmentation`

Parameter	Type	Description	Example Value
`seg`	character	Path to the GATK segment file.	`"sample.seg"`
`cov`	character	Path to the GATK denoised coverage count file.	`"sample.counts"`
`ai`	character	Path to the BAF file or allelic count file.	`"sample.baf"`
`ai_pon`	character	Path to PON Rdata. AI panel of normals generated by `PONAIprocess`.	`"PON_AI.Rdata"`
`gender`	character	Sample gender (`"female"` or `"male"`), passed to `ReadAI()`.	`"female"`
`out_dir`	character	Output directory path.	`"results/"`
`prefix`	character	Output file prefix.	`"Sample1"`
`mergeai`	numeric	MAF difference threshold for merging segments under “merge” segmentation mode (default: 0.15).	`0.15`
`mergecov`	numeric	CNV difference threshold for merging segments (default: 0.2).	`0.2`
`snpmin`	numeric	Minimum SNPs for MAF segmentation under “merge” segmentation mode (default: 7).	`7`
`minsnpcov`	numeric	Minimum coverage of SNPs to be included (default: 20).	`20`
`maxgap`	numeric	Maximum gap size inside a bin; if exceeded, start a new bin (default: 1,000,000).	`1000000`
`snpnum`	integer	SNP number in each bin (default: 30).	`30`
`maxbinsize`	numeric	Maximum bin size (default: 5,000,000).	`5000000`
`minbinsize`	numeric	Minimum bin size (default: 500,000).	`500000`
`minsnpcallaicutoff`	numeric	Minimum SNPs for reliable CNLOH/GAINLOH (default: 10).	`10`
`mergecovminsize`	numeric	Minimum size for GATK segment merge (default: 500,000).	`500000`
`segmethod`	character	Segmentation method: `"merge"` for stepwise merging, `"cbs"` for CBS segmentation.	`"cbs"`
`cbssmooth`	character	If using CBS, `"yes"` to apply smoothing before segmentation, `"no"` to skip smoothing.	`"yes"`
`aitype`	character	Type of allelic imbalance data: `"gatk"`, `"other"`, or `"dragen"` (see below for requirements).	`"dragen"`

Note on aitype column requirements:
- If "gatk" or "other": input must include columns CONTIG, POSITION, ALT_COUNT, REF_COUNT, REF_NUCLEOTIDE, and ALT_NUCLEOTIDE.
- If "dragen": input must include columns contig, start, stop, refAllele, allele1, allele2, allele1Count, allele2Count, allele1AF, and allele2AF.
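As a quick pre-flight check, the column requirements above can be verified before calling RunAIsegmentation(). The sketch below is only an illustration: `required_ai_columns` and `missing_ai_columns()` are hypothetical names, not part of XploR, and the column lists are copied from the note above.

```r
# Required columns per aitype, copied from the note above.
gatk_like <- c("CONTIG", "POSITION", "ALT_COUNT", "REF_COUNT",
               "REF_NUCLEOTIDE", "ALT_NUCLEOTIDE")
required_ai_columns <- list(
  gatk   = gatk_like,
  other  = gatk_like,
  dragen = c("contig", "start", "stop", "refAllele", "allele1", "allele2",
             "allele1Count", "allele2Count", "allele1AF", "allele2AF")
)

# Hypothetical helper (not an XploR function): return the required
# columns that are absent from a data frame for the given aitype.
missing_ai_columns <- function(df, aitype) {
  aitype <- match.arg(aitype, names(required_ai_columns))
  setdiff(required_ai_columns[[aitype]], colnames(df))
}

# Toy one-row table in the "gatk" layout.
example_ai <- data.frame(
  CONTIG = "1", POSITION = 12345L, ALT_COUNT = 18L, REF_COUNT = 22L,
  REF_NUCLEOTIDE = "A", ALT_NUCLEOTIDE = "G"
)

missing_gatk   <- missing_ai_columns(example_ai, "gatk")
missing_dragen <- missing_ai_columns(example_ai, "dragen")
```

An empty result means the file layout matches the chosen aitype; a non-empty result lists the columns to add or rename.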

2. Run model likelihood calculation and selection.

RunModelLikelihood(
    seg = paste0(out_dir,"/",prefix,"_GATK_AI_segment.tsv"),
    out_dir = out_dir,
    prefix = prefix,
    gender = gender,
    modelminprobes = 20,
    modelminAIsize = 5000000,
    minsf = 0.4,
    callcov = 0.3,
    thread = 6)

Parameters for `RunModelLikelihood`

Parameter	Type	Description	Example Value
`seg`	character	Path to the combined segment file (e.g., output from the segmentation step above)	`"results/Sample1_GATK_AI_segment.tsv"`
`out_dir`	character	Output directory for results	`"results/"`
`prefix`	character	Prefix for output files	`"Sample1"`
`gender`	character	Sample gender (`"male"` or `"female"`)	`"female"`
`modelminprobes`	integer	Minimum number of probes/SNPs per segment to include in modeling	`20`
`modelminAIsize`	numeric	Minimum segment size (bp) to include in modeling	`5000000`
`minsf`	numeric	Minimum scale factor to consider in model selection	`0.4`
`callcov`	numeric	Subclonal events calling cutoff based on total copy number	`0.3`
`thread`	integer	Number of CPU threads to use for parallel processing	`6`
`callcovcutoff`	numeric	(Optional) Threshold for calling without modeling.	`0.3`
`callaicutoff`	numeric	(Optional) Threshold for calling without modeling.	`0.3`
`minsnpcallaicutoff`	integer	(Optional) Minimum SNPs to call AI segment	`10`

Notes:
- Parameters marked as (Optional) can be omitted and have defaults.
- For a full description of all arguments and advanced options, see the function reference or ?RunModelLikelihood in R.

3. Annotate segments.

AnnotateSegments(
    input = paste0(out_dir,"/",prefix,"_final_calls.tsv"),
    out_dir = out_dir,
    prefix = prefix,
    cytoband = cytoband,
    whitelist_edge = whitelist_edge,
    gene = gene)

Parameters for `AnnotateSegments`

Parameter	Type	Description	Example Value
`input`	character	Path to XploR CNV calling output.	`"results/Sample1_final_calls.tsv"`
`out_dir`	character	Output directory for results	`"results/"`
`prefix`	character	Prefix for output files	`"Sample1"`
`cytoband`	character	Path to cytoband annotation file (TSV). See Prepare input for detail.	`"data/cytoBand.txt"`
`whitelist_edge`	character	Path to the detectable edge file for each chromosome. See Prepare input for detail.	`"data/whitelist.txt"`
`gene`	character	Path to gene annotation file. See Prepare input for detail.	`"data/gene_anno.txt"`

4. Generating CNV plot

RunPlotCNV(
    seg = paste0(out_dir,"/",prefix,"_CNV_annotation.tsv"),
    cr = cr,
    ballele = ai,
    ai_binsize = 100000,
    cov_binsize = 100000,
    whitelist = whitelist_bed,
    gender = gender,
    out_dir = out_dir,
    prefix = prefix,
    aitype = "dragen"
  )

Parameters for `RunPlotCNV`

Parameter	Type	Description	Example Value
`seg`	character	Path to final annotated call file.	`"results/Sample1_CNV_annotation.tsv"`
`cr`	character	Path to the GATK denoised copy ratio file with extension `.denoisedCR.tsv`	`"data/sample.denoisedCR.tsv"`
`ballele`	character	Path to the B-allele file (from DRAGEN, GATK, or other source). See `aitype` for required columns.	`"data/sample.tumor.baf.gz"`
`ai_binsize`	numeric	Bin size for AI plot (default: 100,000)	`100000`
`cov_binsize`	numeric	Bin size for coverage plot (default: 100,000)	`100000`
`whitelist`	character	Path to whitelist file for regions to include	`"data/whitelist.txt"`
`gender`	character	Sample gender (`"male"` or `"female"`)	`"female"`
`out_dir`	character	Output directory for plot	`"results/"`
`prefix`	character	Sample ID or output prefix	`"Sample1"`
`aitype`	character	Type of allelic imbalance data: `"gatk"`, `"dragen"`, or `"other"`.	`"dragen"`

5. Generating AI segment quality file.

BafQC(
    annofile = paste0(out_dir,"/",prefix,"_CNV_annotation.tsv"),
    out_dir = out_dir,
    prefix = prefix)

Parameters for `BafQC`

Parameter	Type	Description	Example Value
`annofile`	character	Path to the CNV annotation file (e.g., *_CNV_annotation.tsv)	`"results/Sample1_CNV_annotation.tsv"`
`out_dir`	character	Output directory for the QC summary file	`"results/"`
`prefix`	character	Prefix for the QC output file	`"Sample1"`

Prepare input files

Run GATK in tumor-only mode with default parameters. Below is a summary of the GATK tumor-only mode command used in our pipeline. Please see the GATK website for details. XploR uses the following files: sample.counts, sample.called.seg, sample.allelic_counts, and sample.denoisedCR.tsv.
The allelic count file can also be generated by other software such as DRAGEN or samtools.

Supported allelic count file formats

aitype parameter value	Software	Minimum columns	File extension
`dragen`	Illumina DRAGEN	contig, start, refAllele, allele2, allele1Count, allele2Count	`"sample..tumor.ballele.counts.gz"`
`gatk`	GATK	CONTIG, POSITION, ALT_COUNT, REF_COUNT, REF_NUCLEOTIDE, ALT_NUCLEOTIDE	`"sample.allelic_counts"`
`other`	Other (e.g. samtools)	CONTIG, POSITION, ALT_COUNT, REF_COUNT, REF_NUCLEOTIDE, ALT_NUCLEOTIDE	`""`
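To illustrate what the allelic count columns carry, the sketch below derives per-SNP BAF and MAF from a toy table in the "gatk"/"other" layout and applies a minimum-coverage filter like `minsnpcov`. It uses the conventional definitions (BAF = ALT / (REF + ALT), MAF = min(BAF, 1 - BAF)); how XploR computes these internally is not shown in this README, so treat it as an assumption.

```r
# Toy allelic counts in the "gatk"/"other" column layout.
counts <- data.frame(
  CONTIG    = c("1", "1", "1"),
  POSITION  = c(1000L, 2000L, 3000L),
  REF_COUNT = c(30L, 5L, 12L),
  ALT_COUNT = c(20L, 4L, 28L)
)

minsnpcov <- 20  # same meaning as the minsnpcov parameter: minimum SNP depth

# Total depth per SNP; drop SNPs below the coverage threshold.
counts$DEPTH <- counts$REF_COUNT + counts$ALT_COUNT
counts <- counts[counts$DEPTH >= minsnpcov, ]

# Conventional definitions (assumed, not taken from the XploR source).
counts$BAF <- counts$ALT_COUNT / counts$DEPTH
counts$MAF <- pmin(counts$BAF, 1 - counts$BAF)
```

In this toy table the second SNP (depth 9) is filtered out, and the remaining two SNPs have MAF values of 0.4 and 0.3.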

Prepare Reference Files

Panel of normal reference

A Panel of Normals (PON) is required and should be generated using GATK, DRAGEN, or any other software capable of producing allelic count files.

Note: Male and female PON files need to be generated separately.

A. Whitelist, Blacklist, and Detectable Boundary Files

These files are generated from the PON HDF5 file (from GATK), a cytoband file, and gender information. They are essential for downstream processing and include:

Blacklist BED: Regions to exclude
Whitelist BED: Regions to include
Detectable Edge File: Defines detectable boundaries

These files are created based on the GATK Panel of Normals.
See the function documentation in R: ?PonProcess or help("PonProcess", package = "XploR").

Example usage:

PonProcess(
  pon_file = pon_hdh5_file,
  blacklist_bed = output_blacklist_bed,
  whitelist_bed = output_whitelist_bed,
  cytoband = cytoband,
  detectable_edge = output_detectable_edge,
  gender = gender
)

B. Panel of Normals Based on Allelic Count Files

The ai_pon_file should be a text file listing the paths to normal allelic count files generated by GATK, DRAGEN, or other software.

You can process these files to generate the PON reference for allelic imbalance using:

PONAIprocess(
  ai_pon_file = ai_pon_file,
  aitype = "gatk",
  minsnpcov = 20,
  output = "/Pathtoresults",
  prefix = "PONAI",
  maxgap = 2000000,
  maxbinsize = 5000000,
  minbinsize = 500000,
  snpnum = 30,
  gender = "female"
)

Parameters for `PONAIprocess`

Parameter	Type	Description	Example Value
`ai_pon_file`	character	Path to a text file listing PoN AI file paths (one per line)	`"pon_ai_file_list.txt"`
`aitype`	character	Type of AI input file (`"gatk"`, `"dragen"`, or `"other"`), passed to `ReadPonAI()`	`"gatk"`
`minsnpcov`	integer	Minimum SNP coverage to include a site in the AI calculation	`20`
`maxgap`	numeric	Maximum allowed gap between SNPs within a bin (in base pairs)	`1000000`
`maxbinsize`	numeric	Maximum allowed bin size (in base pairs)	`5000000`
`minbinsize`	numeric	Minimum allowed bin size (in base pairs)	`500000`
`snpnum`	integer	Target number of SNPs per bin	`30`
`output`	character	Output directory for the processed PoN AI Rdata file	`"results/"`
`prefix`	character	Prefix for the output file	`"PON"`

Gene annotation reference:

Gene annotation can be obtained from various sources (e.g., Ensembl, UCSC, Gencode, RefSeq). An example file is included with the package:

gene <- system.file("extdata", "RefSeqCurated.genePred.gene_region.txt", package = "XploR")
head(read.table(gene, header = TRUE, sep = "\t"))

Cytoband annotation reference:

Cytoband annotation files are typically downloaded from UCSC. An example file is included:

cytoband <- system.file("extdata", "hg19_cytoBand.dat", package = "XploR")
head(read.table(cytoband, header = TRUE, sep = "\t"))

Algorithm

Binning Strategy for allelic count Data

The BinMaf function implements a flexible binning strategy for minor allele frequency (MAF) data, supporting both tumor samples and panel of normals (PoN) samples. Binning uses a fixed number of SNPs per bin, with additional criteria to handle genomic gaps and bin size limits. Within each bin, Gaussian mixture modeling (GMM) is applied to identify clusters in the MAF distribution.

Key features:

Flexible binning:
Bins are created for each chromosome by grouping consecutive SNPs until one of the following conditions is met:
- The number of SNPs in the bin reaches a specified target (e.g., 20 SNPs per bin).
- The genomic span of the bin exceeds a maximum bin size (e.g., 2,000,000 bp).
- The gap between consecutive SNPs exceeds a maximum allowed gap (e.g., 1,000,000 bp).
- The bin size is at least the specified minimum size (e.g., 500,000 bp).

This strategy ensures that bins are of consistent size and SNP content, while avoiding the inclusion of widely separated SNPs in the same bin, and is robust for both tumor and normal samples.
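The binning rules above can be sketched for a single chromosome as follows. This is not the BinMaf implementation: `bin_snps()` is a hypothetical helper, and it assumes one particular reading of the rules, namely that a bin closes when it has reached the target SNP count and spans at least the minimum size, or earlier if the gap or maximum-size limit is hit.

```r
# Hypothetical sketch (not BinMaf): assign a bin id to each SNP position
# on one chromosome.
bin_snps <- function(pos, snpnum = 30, maxgap = 1e6,
                     maxbinsize = 5e6, minbinsize = 5e5) {
  pos <- sort(pos)
  bin_id <- integer(length(pos))
  current <- 1L
  bin_start <- pos[1]
  n_in_bin <- 0L
  for (i in seq_along(pos)) {
    if (i > 1) {
      gap_too_large  <- (pos[i] - pos[i - 1]) > maxgap
      span_too_large <- (pos[i] - bin_start) > maxbinsize
      # Assumption: a "full" bin needs both enough SNPs and the minimum span.
      full <- n_in_bin >= snpnum && (pos[i - 1] - bin_start) >= minbinsize
      if (gap_too_large || span_too_large || full) {
        current <- current + 1L   # start a new bin at this SNP
        bin_start <- pos[i]
        n_in_bin <- 0L
      }
    }
    bin_id[i] <- current
    n_in_bin <- n_in_bin + 1L
  }
  bin_id
}

# Six SNPs 100 kb apart, then two SNPs after a 4.5 Mb gap.
pos <- c(0, 1e5, 2e5, 3e5, 4e5, 5e5, 5e6, 5.1e6)
bins <- bin_snps(pos, snpnum = 3, maxgap = 1e6,
                 maxbinsize = 2e6, minbinsize = 2e5)
```

With these toy settings the first six SNPs fall into two bins of three SNPs each, and the large gap forces the last two SNPs into a third bin.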

Segmentation of MAF track

In addition to CBS (Circular Binary Segmentation), our pipeline supports a “merge” mode for segmentation based on minor allele frequency (MAF) values. While CBS is the default and recommended strategy, “merge” mode offers a step-wise, rule-based approach to combine adjacent MAF segments.

Step-wise Merging Strategy:

The process begins with a minimal SNP count, which is incrementally increased until it reaches the final SNP count defined by “snpmin”.
At each round, the following steps are performed:
1. Remove Small Segments: Discard all segments smaller than the current minimum SNP count.
2. Merge Adjacent Segments: Evaluate and merge adjacent segments if the difference in segment MAF values is less than or equal to the mergeai parameter (user-specified MAF difference threshold).

Note:
- CBS segmentation remains the default and is generally recommended for most workflows.
- The “merge” mode is useful for certain applications where a rule-based, stepwise merging of MAF segments is preferred.
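The step-wise merging strategy described above can be sketched as follows for segments on one chromosome. The helper names, the starting SNP count of 2, and the SNP-weighted mean used for the merged MAF are assumptions made for illustration; they are not taken from the XploR source.

```r
# One round: drop segments below min_snps, then merge adjacent segments
# whose MAF differs by at most mergeai. Expects columns start, end,
# n_snps, maf, sorted by position.
merge_round <- function(seg, min_snps, mergeai = 0.15) {
  seg <- seg[seg$n_snps >= min_snps, , drop = FALSE]
  if (nrow(seg) < 2) return(seg)
  out <- seg[1, , drop = FALSE]
  for (i in 2:nrow(seg)) {
    last <- nrow(out)
    if (abs(seg$maf[i] - out$maf[last]) <= mergeai) {
      total <- out$n_snps[last] + seg$n_snps[i]
      # Assumption: merged MAF is the SNP-weighted mean of the two segments.
      out$maf[last] <- (out$maf[last] * out$n_snps[last] +
                          seg$maf[i] * seg$n_snps[i]) / total
      out$n_snps[last] <- total
      out$end[last] <- seg$end[i]
    } else {
      out <- rbind(out, seg[i, , drop = FALSE])
    }
  }
  out
}

# Repeat the round while raising the minimum SNP count up to snpmin.
stepwise_merge <- function(seg, snpmin = 7, mergeai = 0.15, start_snps = 2) {
  stopifnot(start_snps <= snpmin)
  for (m in start_snps:snpmin) seg <- merge_round(seg, m, mergeai)
  seg
}

seg <- data.frame(
  start  = c(1, 101, 201, 301),
  end    = c(100, 200, 300, 400),
  n_snps = c(10, 1, 12, 9),
  maf    = c(0.48, 0.10, 0.45, 0.20)
)
merged <- stepwise_merge(seg)
```

In this toy input the one-SNP segment is discarded in the first round, the two segments near 0.5 merge, and the segment at 0.20 stays separate because its MAF differs by more than `mergeai`.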

MAF Bias Correction Using Panel of Normal (PoN) Allelic Counts

To address systematic MAF bias, XploR incorporates a locus-specific reference built from a panel of normal (PoN) allelic count files. For each tumor segment, the per-SNP BAF values within the segment are extracted, and the segment is overlapped with the PoN to obtain the corresponding distribution of normal MAFs. The combined information is then used to evaluate whether the segment behaves like a balanced diploid region or reflects true allelic imbalance.

The tumor BAF distribution is assessed using a mixture-model and density-based framework, which distinguishes single-peak, near-0.5 distributions from segments showing clear deviation or multiple peaks. Only segments classified as balanced undergo correction: the segment mean MAF is transformed onto the logit scale, centered using the median MAF from the PoN at that locus, and transformed back, with values capped at 0.5. Segments demonstrating allelic imbalance retain their original MAF so that true biological signal is preserved.
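A minimal sketch of the logit-scale correction described above, assuming that "centering" means subtracting the logit of the PoN median MAF so that a segment matching the PoN median maps to 0.5. `correct_balanced_maf()` is a hypothetical name, and the balance classification is taken as given rather than computed here.

```r
# Hypothetical sketch of the PoN-based MAF correction.
# seg_maf:        segment mean MAF in the tumor
# pon_median_maf: median MAF of the PoN at the same locus
# balanced:       TRUE if the segment was classified as balanced
correct_balanced_maf <- function(seg_maf, pon_median_maf, balanced = TRUE) {
  # qlogis() is the logit, plogis() its inverse; cap the result at 0.5.
  corrected <- pmin(plogis(qlogis(seg_maf) - qlogis(pon_median_maf)), 0.5)
  # Imbalanced segments keep their original MAF.
  ifelse(balanced, corrected, seg_maf)
}

same_as_pon <- correct_balanced_maf(0.47, 0.47)          # maps to 0.5
slightly_low <- correct_balanced_maf(0.46, 0.47)         # pulled toward 0.5
above_pon   <- correct_balanced_maf(0.48, 0.47)          # capped at 0.5
imbalanced  <- correct_balanced_maf(0.20, 0.47, FALSE)   # unchanged
```

The design intent is that a systematic downward bias seen in normals is removed from balanced segments only, while segments with real allelic imbalance are left untouched.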

Purity and diploid coverage scale factor estimation

Estimate a Beta–Binomial Over-Dispersion Parameter $\theta$ from a Panel of Normals (PoN)

To accurately model over-dispersion in minor allele frequency (MAF) data, we estimate a beta-binomial dispersion parameter ( $\theta$ ) using a panel of normal (PoN) samples. This allows us to account for extra-binomial variation and improves the likelihood calculation for each segment.

For each bin $b$ and depth stratum:

Observed variance across normals:

$v_b = \mathrm{Var}_s(p_{sb})$

Binomial expectation:

$m_b = \mathbb{E}_s\left[\frac{p_{0b}(1-p_{0b})}{D_{sb}}\right] \approx p_{0b}(1-p_{0b}) \cdot \mathrm{mean}_s\left(\frac{1}{D_{sb}}\right)$

Representative depth in that (bin, stratum):

$\tilde d_b = \mathrm{median}_s(D_{sb})$

Moment estimator of $\theta$ per (bin, stratum):

$\widehat{\theta}_{b} = \max\left(\frac{v_b/m_b - 1}{\tilde d_b - 1},\ 0\right)$

Within each depth stratum, we take a robust center (median) of $\widehat{\theta}_b$ to obtain $\theta$ for that stratum.

where:

$p_{sb}$ : MAF of sample $s$ in bin $b$
$p_{0b}$ : Reference (PoN) mean MAF for bin $b$
$D_{sb}$ : Median depth for sample $s$ in bin $b$
$v_b$ : Observed variance of MAF across samples in bin $b$
$m_b$ : Expected binomial variance for bin $b$
$\tilde d_b$ : Representative (median) depth in bin $b$
$\widehat{\theta}_b$ : Estimated over-dispersion for bin $b$
$\theta$ : Final over-dispersion parameter for the depth stratum
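The moment estimator above can be written directly in R. The sketch below handles one (bin, stratum) pair and then takes the median across bins; `estimate_theta_bin()` is a hypothetical name and the input values are toy numbers, not PoN data.

```r
# Moment estimator of theta for one (bin, stratum), following the
# formulas above.
# p_sb: MAF of each normal sample in the bin
# D_sb: median depth of each normal sample in the bin
estimate_theta_bin <- function(p_sb, D_sb) {
  p0  <- mean(p_sb)                        # p_0b: PoN mean MAF
  v_b <- var(p_sb)                         # observed variance across normals
  m_b <- p0 * (1 - p0) * mean(1 / D_sb)    # binomial expectation
  d_b <- median(D_sb)                      # representative depth
  max((v_b / m_b - 1) / (d_b - 1), 0)
}

# Bin with more spread than binomial sampling predicts at depth 50.
theta_over <- estimate_theta_bin(c(0.40, 0.50, 0.45, 0.35, 0.55), rep(50, 5))

# Bin with less spread than the binomial expectation: truncated to 0.
theta_none <- estimate_theta_bin(c(0.450, 0.451, 0.449), rep(50, 3))

# Stratum-level theta: robust center (median) across bins.
theta_stratum <- median(c(theta_over, theta_none))
```

For the first bin, $v_b = 0.00625$ and $m_b = 0.00495$, so $\widehat{\theta}_b = (0.00625/0.00495 - 1)/49 \approx 0.0054$.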

Prior assignment based on parsimony principle

Priors are assigned to each potential copy number combination based on the principle of parsimony, which favors simpler (biologically less complex) allele configurations. The biological difficulty level reflects the number of steps required to reach a given allele combination from the baseline diploid state (1,1), where each step represents either a gain or loss of one allele.

Baseline State: The diploid state (1,1) is assigned a difficulty level of 1.
Single-Step Changes: Combinations reachable from (1,1) in one step (single gain or loss) are assigned a difficulty level of 2.
Two-Step Changes: Combinations requiring two steps (e.g., loss then gain, or two sequential gains) receive a difficulty level of 3.
Three-, Four-, Five-Step Changes: Difficulty levels increase with the number of steps, up to 6 for the most complex.
Special Considerations:
- Whole Chromosome Duplication: Combinations like (2,2) are assigned higher difficulty due to their rarity.
- Sequential Gains: Combinations with sequential gains of the same chromosome (e.g., (3,1)) are considered less difficult than those involving loss followed by gain (e.g., (2,0)).

The prior for each allele combination is calculated using an exponential decay function controlled by a decay rate parameter λ:

prior = exp(-λ × Bio_diff)

where Bio_diff is the assigned biological difficulty score for the copy number configuration.
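As a small worked example of the exponential decay prior, the sketch below evaluates exp(-λ × Bio_diff) for a few configurations whose difficulty levels are stated above. The value λ = 0.5 is an arbitrary choice for illustration; the value used by XploR is not given in this README.

```r
lambda <- 0.5  # illustrative decay rate, not the XploR default

# Difficulty levels taken from the rules above:
# (1,1) baseline = 1; one-step changes (2,1) and (1,0) = 2;
# two sequential gains (3,1) = 3.
bio_diff <- c("(1,1)" = 1, "(2,1)" = 2, "(1,0)" = 2, "(3,1)" = 3)

prior <- exp(-lambda * bio_diff)

# Ratio between consecutive difficulty levels is constant: exp(-lambda).
step_ratio <- unname(prior["(2,1)"] / prior["(1,1)"])
```

Each additional difficulty level multiplies the prior by the same factor exp(-λ), so a larger λ penalizes complex configurations more strongly.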

Tumor Copy Number Estimation

For each genomic segment, the model computes a range of possible tumor-specific copy numbers (CN_tumor) that could result from observed data under different cancer cell fractions (ccf):

Tumor Copy Number Formula:

$CN_{tumor} = \frac{C_i \times 2 / (\mu \times 100) - (1 - \rho) \times 2 - \rho \times (1 - ccf) \times 2}{\rho \times ccf}$

where:

$C_i$ : Observed segment copy number
$\mu$ : Diploid coverage scale factor
$\rho$ : Tumor purity
$ccf$ : Cancer cell fraction

Combination Generation:
For each potential tumor CN, all feasible major and minor allele combinations are generated and filtered based on biological plausibility.

CCF Value Calculation:
For non-diploid segments, CCF is calculated; for diploid segments, it is set as NA.
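The tumor copy number formula above translates directly into R. In the sketch below, `tumor_cn()` is a hypothetical helper, and the inputs assume that $C_i$ is on the scale implied by the formula, where a value of $\mu \times 100$ corresponds to two copies.

```r
# Tumor-specific copy number from the formula above.
# C_i: observed segment value (mu * 100 corresponds to two copies)
# mu:  diploid coverage scale factor
# rho: tumor purity
# ccf: cancer cell fraction
tumor_cn <- function(C_i, mu, rho, ccf) {
  (C_i * 2 / (mu * 100) - (1 - rho) * 2 - rho * (1 - ccf) * 2) / (rho * ccf)
}

# Clonal single-copy gain at 50% purity: observed value 125 gives CN = 3.
cn_clonal <- tumor_cn(C_i = 125, mu = 1, rho = 0.5, ccf = 1)

# The same gain present in half of the tumor cells: observed value 112.5.
cn_subclonal <- tumor_cn(C_i = 112.5, mu = 1, rho = 0.5, ccf = 0.5)

# Range of candidate tumor CNs for one segment across a ccf grid.
cn_by_ccf <- tumor_cn(C_i = 112.5, mu = 1, rho = 0.5, ccf = seq(0.2, 1, 0.2))
```

Scanning a ccf grid like this shows why one observed value is compatible with several (CN, ccf) pairs, which is what the likelihood step then has to resolve.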

Likelihood Calculation

MAF Likelihood:
For each combination, the B allele frequency likelihood is computed using the beta distribution, parameterized by $\alpha$ and $\beta$ , derived from the expected MAF:

$\mathrm{Beta}(\alpha, \beta)$

where:

$\alpha = K \times \mathrm{BAF} + \epsilon$
$\beta = K \times (1 - \mathrm{BAF}) + \epsilon$
$K$ : Beta-binomial precision parameter, estimated for each segment based on local read depth and the over-dispersion parameter (see below).
$\epsilon$ : Small positive value for numerical stability

Estimation of $K$ :

For each segment, $K$ is calculated as:

$K = \frac{\text{depth}}{1 + (\text{depth} - 1) \cdot \theta} - 1$

where:

$\text{depth}$ : Median SNP depth for the segment
$\theta$ : Beta-binomial over-dispersion parameter, estimated from the panel of normals (PoN) for the corresponding depth stratum

Posterior Likelihood:
The posterior likelihood for each combination incorporates both the BAF likelihood and the prior, weighted by a factor $\gamma$ :

$\text{Posterior Likelihood} = \text{MAF Likelihood} \times (\text{Prior})^\gamma$

where:

$\text{MAF Likelihood}$ : Likelihood of the observed minor allele frequency under the current model
$\text{Prior}$ : Prior probability assigned to the allele combination based on biological plausibility
$\gamma$ : Weighting factor controlling the influence of the prior in the posterior calculation
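The three formulas in this section can be chained in a few lines of R. This is a sketch under stated assumptions: the function names are hypothetical, the likelihood is taken to be the Beta density evaluated at the observed MAF, and the values of ε, γ, and the priors are illustrative rather than XploR defaults.

```r
# K from median SNP depth and over-dispersion theta (formula above).
beta_precision <- function(depth, theta) {
  depth / (1 + (depth - 1) * theta) - 1
}

# Beta density of the observed MAF given the expected MAF of a
# configuration (assumed reading of "MAF Likelihood").
maf_likelihood <- function(obs_maf, expected_maf, K, eps = 1e-6) {
  dbeta(obs_maf,
        shape1 = K * expected_maf + eps,
        shape2 = K * (1 - expected_maf) + eps)
}

# Posterior likelihood = MAF likelihood x prior^gamma.
posterior_likelihood <- function(lik, prior, gamma = 1) {
  lik * prior^gamma
}

K <- beta_precision(depth = 80, theta = 0)          # no over-dispersion
lik_imbalanced <- maf_likelihood(0.33, 1 / 3, K)    # expected MAF 1/3
lik_balanced   <- maf_likelihood(0.33, 0.5, K)      # expected MAF 0.5

post_imbalanced <- posterior_likelihood(lik_imbalanced, prior = exp(-1))
post_balanced   <- posterior_likelihood(lik_balanced, prior = exp(-0.5))
```

A larger θ lowers K, which widens the Beta distribution and makes the likelihood less able to discriminate between neighboring configurations; in that regime the prior carries more weight.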

Assigning Calls for Each Segment

The SelectCallpersegment() function refines and selects the most likely allele combinations for each genomic segment, handling both clonal and subclonal events, and incorporates coverage differences and prior knowledge.

Initialization and Preprocessing:
- Replace any zero BAF likelihoods with a minimum likelihood value to ensure all terms contribute meaningfully.
- Compute expected coverage for each model and the difference (cov_diff) from observed coverage.

Selection of Top Likelihood Models:
- For each segment, identify the top two models with the highest MAF likelihoods.
- If a subclonal event is likely (e.g., the second-ranked model has ccf > 0.3 and comparable likelihood), select it as a subclonal event.
- If both major and minor copy numbers are equal, selection is based on cov_diff.

Handling Models with minor = 0:
- For segments with minor = 0, where MAF likelihood is unreliable, select the model with the smallest cov_diff.
- Ensure consistency in selection for both minor = 0 and minor ≠ 0 cases.

Post-Processing:
- Calculate and store the log-transformed likelihood value (log_MAF_likelihood) for each selected model.
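The coverage-based part of this selection can be sketched as below. This is not the SelectCallpersegment() code: the expected coverage is obtained by inverting the tumor copy number formula from the Tumor Copy Number Estimation section, cov_diff is assumed to be an absolute difference, and `expected_cov()` is a hypothetical helper.

```r
# Expected segment value for a candidate total copy number, obtained by
# solving the tumor copy number formula for C_i (assumption).
expected_cov <- function(cn, mu, rho, ccf) {
  mu * 100 / 2 * ((1 - rho) * 2 + rho * (1 - ccf) * 2 + rho * ccf * cn)
}

# Candidate models for one segment at mu = 1, rho = 0.5, ccf = 1.
models <- data.frame(major = c(1, 2, 3), minor = c(1, 1, 1))
models$CN <- models$major + models$minor
models$expected_cov <- expected_cov(models$CN, mu = 1, rho = 0.5, ccf = 1)

observed_cov <- 125
models$cov_diff <- abs(models$expected_cov - observed_cov)  # assumed absolute

# Coverage-only choice, as used when the MAF likelihood is unreliable.
best <- models[which.min(models$cov_diff), ]
```

Here the expected values are 100, 125, and 150 for CN 2, 3, and 4, so the CN = 3 model has the smallest cov_diff.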

Output

Segmentation output

sample_GATK_AI_segment.tsv ( Generated by ?RunAIsegmentation() function )

Column	Type	Description	Example_value
Sample	character	Sample identifier	`Sample1`
Chromosome	character	Chromosome name	`1`
Start	integer	Start position (base pair)	`123456`
End	integer	End position (base pair)	`234567`
Num_Probes	integer	Number of probes/SNPs in the segment	`25`
Segment_Mean	numeric	Segment mean (log2 ratio) from CNV analysis	`0.42`
gatk_SM_raw	numeric	Raw segment mean from GATK	`0.38`
gatk_count	integer	Number of counts in GATK segment	`30`
gatk_baselinecov	numeric	The GATK baseline is an intermediate value calculated using gatk_SM_raw and gatk_count.	`100.5`
gatk_gender	character	Gender as reported by GATK	`female`
pipeline_gender	character	Gender as used in pipeline	`female`
MAF	numeric	Minor allele frequency for the segment	`0.21`
MAF_Probes	integer	Number of probes used to calculate MAF	`18`
MAF_gmm_G	integer	Number of GMM clusters in MAF distribution	`2`
MAF_gmm_weight	numeric	Mixture weight of the main GMM cluster	`0.85`
size	integer	Segment size in base pairs	`111111`
balance_tag	character	Balance test result ( `balanced` or `imbalanced` )	`balanced`
BreakpointSource	character	Source of breakpoint (`GATK` or `Postprocess`)	`GATK`
FILTER	character	Quality tag for the segment (`PASS` or `FAILED`)	`PASS`
depth	numeric	Median read depth of the segment	`80`
depth_bin	numeric	Read depth group for assigning theta	`1`
theta	numeric	Beta–Binomial Over-Dispersion estimated based on depth	`0`
K	numeric	Beta-binomial precision parameter estimated based on depth and theta	`80`

Raw likelihood results under each configuration

sample_likelihood_raw.tsv (Generated by ?RunModelLikelihood() function)

Column	Type	Description	Example_value
major	integer	Major allele copy number	`2`
minor	integer	Minor allele copy number	`1`
CN	integer	Total copy number (major + minor)	`3`
ccf	numeric	Cancer cell fraction	`0.85`
Bio_diff	integer	Biological difficulty score for the allele combination	`3`
prior	numeric	Prior probability for the allele combination	`0.12`
expected_maf	numeric	Expected minor allele frequency for this configuration	`0.21`
maf_ll	numeric	Log-likelihood for the observed MAF under this configuration	`-0.56`
weighted_prior	numeric	Weighted log-prior (log(prior) × gamma)	`-2.13`
exp_maf_ll	numeric	Exponentiated MAF log-likelihood	`0.57`
exp_prior	numeric	Exponentiated weighted prior	`0.11`
MAF_likelihood	numeric	Posterior likelihood for this configuration	`0.065`
Segcov	numeric	Pseudo Segment coverage	`280`
MAF	numeric	Observed minor allele frequency	`0.19`
mu	numeric	Diploid coverage scale factor	`1.0`
rho	numeric	Tumor purity (fraction between 0 and 1)	`0.7`
index	character	Segment index or identifier	`"12"`
Tag	character	Segment inclusion/exclusion tag for summarizing total likelihood for a model (e.g., `"Include"`, `"Exclude"`)	`"Include"`
ccf_MAF	numeric	Cancer cell fraction estimated from MAF and allele configuration only	`0.81`

Allelic combination result with maximum likelihood under each configuration

sample_top_likelihood_calls.tsv ( Generated by ?SelectCallpersegment() function ). The format is similar to sample_likelihood_raw.tsv, with the best allelic combination selected for each segment under each diploid coverage scale factor and tumor purity configuration.

Likelihood for each combination of diploid coverage scale factor and tumor purity

sample_Models_likelihood.tsv ( Generated by ?SelectFinalModel() function )

Column	Type	Description	Example_value
mu	numeric	Diploid coverage scale factor (model parameter)	`1.0`
rho	numeric	Tumor purity (model parameter, fraction between 0 and 1)	`0.7`
total_log_likelihood_before_refine	numeric	Total log-likelihood for the model before refinement	`-1234.5`
segments_n	integer	Number of segments included in the model	`27`
Likelihood_penalty_rows	integer	Number of segments penalized due to failed likelihood calculation	`2`
total_log_likelihood_after_refine	numeric	Total log-likelihood for the model after refinement	`-1220.2`
diploid_n	integer	Number of diploid segments in the model	`15`
diploid_distance_to_integer	numeric	Mean distance to integer copy number for diploid segments	`0.04`
nondiploid_n	integer	Number of non-diploid segments in the model	`12`
nondiploid_distance_to_integer	numeric	Mean distance to integer copy number for non-diploid segments	`0.11`
total_distance_to_integer	numeric	Sum of diploid and non-diploid mean distances to integer copy number	`0.15`
ploidy	numeric	Mean copy number (ploidy) across all segments	`2.4`
Tier1	character	Model tier label (e.g., `"Tier1_Models"`, `"Final_model_MAF"`)	`"Tier1_Models"`
total_likelihood_cluster	integer	Rank based on total likelihood ( lower is better )	`1`
diploid_distance_cluster	integer	Rank based on diploid distance to integer copy number ( lower is better )	`1`
nondiploid_distance_cluster	integer	Rank based on non-diploid distance to integer copy number (lower is better)	`1`
total_likelihood_cluster_mean	numeric	Mean total log-likelihood for the level	`-1200.0`
diploid_distance_cluster_mean	numeric	Mean diploid distance to integer for the level	`0.03`
nondiploid_distance_cluster_mean	numeric	Mean non-diploid distance to integer for the level	`0.10`

Final output of CNV calling

sample_final_calls.tsv (Generated by ?RunModelLikelihood() function)

Column	Type	Description	Example_value
Chromosome	character	Chromosome name	`1`
Start	integer	Start position (base pair)	`3301463`
End	integer	End position (base pair)	`247784114`
size	integer	Segment size (bp)	`244367069`
Num_Probes	integer	Number of probes from GATK segment file	`222`
Call	character	Copy number call (e.g., `REF`, `GAIN`, `LOSS`,`GAINLOH`,`CNLOH`)	`REF`
ccf_COV	numeric	Cancer cell fraction estimated from coverage	`1`
ccf_MAF	numeric	Cancer cell fraction estimated from MAF	`0`
ccf_final	numeric	Final cancer cell fraction after refinement	`1`
Segment_Mean	numeric	Final Segment mean (log2 ratio)	`0.057631093`
CNF_correct	numeric	Purity corrected copy number estimate from coverage	`2.086898584`
major	integer	Major allele copy number	`1`
minor	integer	Minor allele copy number	`1`
CN	integer	Total copy number (major + minor)	`2`
MAF	numeric	Observed minor allele frequency	`0.5`
MAF_correct	numeric	Purity corrected minor allele frequency	`0.5`
expected_maf	numeric	Expected minor allele frequency for this configuration	`0.5`
expected_cov	numeric	Expected pseudo coverage for this segment	`90`
MAF_Probes	integer	Number of probes used for MAF calculation	`1110`
MAF_gmm_G	integer	Number of GMM clusters in MAF distribution	`5`
MAF_gmm_weight	numeric	Mixture weight of the main GMM cluster	`0.667871528`
balance_tag	character	Balance test result ( `balanced` or `imbalanced` )	`balanced`
BreakpointSource	character	Source of breakpoint (`GATK` or `Postprocess`)	`GATK`
FILTER	character	Quality tag for the segment (`PASS` or `FAILED`)	`PASS`
maf_ll	numeric	Log-likelihood for the observed MAF	`2.625299941`
MAF_likelihood	numeric	Posterior likelihood for this configuration	`8.891628731`
mu	numeric	Diploid coverage scale factor	`0.9`
rho	numeric	Tumor purity (fraction between 0 and 1)	`0.938`
index	character	Segment index or identifier	`1`
gatk_SM_raw	numeric	Raw segment mean from GATK	`-0.094372`
gatk_count	integer	Number of counts in GATK segment	`361`
gatk_baselinecov	numeric	The GATK baseline is an intermediate value calculated using gatk_SM_raw and gatk_count.	`385.4038109`
gatk_gender	character	Gender as reported by GATK	`female`
pipeline_gender	character	Gender as used in pipeline	`female`
CN_mix	character	Indicator for copy number mixture (`No` or `CN_Mix`)	`No`
Model_source	character	Source of model selection (`Coverage`, `Coverage + MAF`, `Diploid` )	`Coverage + MAF`

Model selection plots

Likelihood dot plot: The plot displays the likelihood ranking for all combinations of diploid coverage scale factor and tumor purity. The vertical dashed line indicates the likelihood cutoff used to define Tier 1 models.

Model plot: The model plot displays the likelihood values of different models, which are calculated based on potential combinations of diploid coverage scale factor and tumor purity. In the plot, red indicates higher likelihood, while blue signifies lower likelihood. The light blue dot indicates the final model selected by XploR.

Tier1 Models Overall: This plot shows copy number calls for each combination of diploid coverage scale factor and tumor purity. Red indicates gain, blue indicates loss, and white indicates no change. Each configuration is labeled on the y-axis. By evaluating coverage and allelic imbalance patterns in this overview, you can identify the reasonable range of diploid coverage scale factors and tumor purity values. This helps guide reruns with optimized parameter ranges if needed.

Tier1 Models Zoom in: A zoomed-in view that makes the y-axis configurations more visible for detailed inspection.

QC Summary Table

sample_PASS_STAT_chr.txt ( Generated by ?BafQC() function )

Column	Type	Description	Example_value
chrom	character	Chromosome name (e.g., `1`, `2`, …, `X`, `Y`)	`1`
FILTER	character	Segment filter status	`PASS`
Total_segment_count	integer	Total number of segments on the chromosome	`25`
PASS_Seg_Count	integer	Number of segments with `PASS` filter status	`20`
PASS_Seg_Percent	numeric	Percentage of segments with `PASS` status (0–1)	`0.80`
Total_segment_size	integer	Total size (bp) of all segments on the chromosome	`249250621`
PASS_Seg_Size	integer	Total size (bp) of `PASS` segments on the chromosome	`199400497`
PASS_Seg_Size_Percent	numeric	Percentage of total segment size that is `PASS` (0–1)	`0.80`
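The per-chromosome summary columns above can be reproduced from any table that has Chromosome, Start, End, and FILTER columns. The sketch below is an illustration on toy data, not the BafQC() implementation; `summarise_chr()` is a hypothetical helper.

```r
# Toy annotated calls: two segments on chromosome 1, one on chromosome 2.
calls <- data.frame(
  Chromosome = c("1", "1", "2"),
  Start      = c(0, 80, 0),
  End        = c(80, 100, 50),
  FILTER     = c("PASS", "FAILED", "PASS")
)
calls$size <- calls$End - calls$Start

# Summarise one chromosome into the QC columns listed above.
summarise_chr <- function(d) {
  pass <- d$FILTER == "PASS"
  data.frame(
    chrom                 = d$Chromosome[1],
    Total_segment_count   = nrow(d),
    PASS_Seg_Count        = sum(pass),
    PASS_Seg_Percent      = mean(pass),
    Total_segment_size    = sum(d$size),
    PASS_Seg_Size         = sum(d$size[pass]),
    PASS_Seg_Size_Percent = sum(d$size[pass]) / sum(d$size)
  )
}

qc <- do.call(rbind, lapply(split(calls, calls$Chromosome), summarise_chr))
```

Comparing PASS_Seg_Percent with PASS_Seg_Size_Percent is useful because a chromosome can have many small failed segments while most of its length still passes.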

Annotation file

sample_CNV_annotation.tsv ( Generated by ?AnnotateSegments() function, only unique columns are listed ).

ISCN calculation rules:
1. All segments are reported with start and end cytobands in ISCN format. However, certain considerations are made for the position of the centromere:
   a. In metacentric chromosomes, if a segment crosses the centromere and the gaps between the segment and the telomere on both sides are less than 5 Mb, only the chromosome number is reported.
   b. In metacentric chromosomes, if a segment does not cross the centromere, and the gaps between the segment and the centromere and the telomere are both less than 5 Mb, the chromosome number followed by ‘p’ or ‘q’ is reported.
   c. In acrocentric chromosomes, if the segment fulfills rule ‘b’ above, only the chromosome number is reported.

Column	Type	Description	Example_value
p_chromStart	integer	Detectable start position of p arm	`10`
p_chromEnd	integer	Detectable end position of p arm	`121535434`
p_first_name	character	Detectable name of first cytoband in p arm	`p36.33`
p_last_name	character	Detectable name of last cytoband in p arm	`p11.2`
q_chromStart	integer	Detectable start position of q arm	`121535435`
q_chromEnd	integer	Detectable end position of q arm	`247784114`
q_first_name	character	Detectable name of first cytoband in q arm	`q11.1`
q_last_name	character	Detectable name of last cytoband in q arm	`qter`
p_gap_to_tel	integer	Gap from segment start to p arm telomere	`0`
p_gap_to_cen	integer	Gap from segment end to p arm centromere	`10000`
q_gap_to_tel	integer	Gap from segment end to q arm telomere	`0`
q_gap_to_cen	integer	Gap from segment start to q arm centromere	`10000`
ISCN	character	ISCN-style cytogenetic annotation	`1p36.33-p11.2`
Gene	character	Overlapping gene(s) in the segment	`TP53`
Gene_count	integer	Number of overlapping genes	`1`

CNV plot

sample_CNV_plot.png (generated by the `RunPlotCNV()` function).

CNV plot: The CNV plot shows a genome-wide summary in four tracks: copy number (top track), B-allele frequency (BAF, second track), tumor fraction (ccf, third track), and segment quality (bottom track). The copy number (CN), on the y-axis, is a linear count of the number of copies of each chromosome in the tumor cells, taking tumor purity and tumor fraction into account. Each chromosome is plotted as a set of dots that collectively show the estimated sequence coverage, and as a narrow turquoise line that shows the final CN call. The BAF track shows the variant allele fraction of SNPs across the genome, using the same coloration as the copy number track. When the copy number of a chromosome changes, its BAF values split into two bands because the two alleles are no longer present in equal numbers. The variance of B-allele frequencies is quite high, so the split may be difficult to discern. To assist with interpretation, a turquoise line is drawn at the median level to show the imbalance.
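
To give a feel for how far the BAF bands are expected to separate, the sketch below applies the standard tumor/normal mixture formula for a heterozygous SNP. It is a generic textbook relationship, not necessarily the exact model XploR uses, and the function name `expected_baf` and its arguments are hypothetical.

```r
# Expected BAF of a germline heterozygous SNP in a tumor/normal mixture.
# total_cn: total tumor copy number of the segment
# minor_cn: copies of the less abundant allele in tumor cells
# purity:   fraction of tumor cells in the sample (normal cells assumed diploid)
expected_baf <- function(total_cn, minor_cn, purity) {
  major_cn <- total_cn - minor_cn
  denom <- purity * total_cn + 2 * (1 - purity)
  c(lower = (purity * minor_cn + (1 - purity)) / denom,
    upper = (purity * major_cn + (1 - purity)) / denom)
}

expected_baf(total_cn = 2, minor_cn = 1, purity = 0.7)  # balanced diploid
expected_baf(total_cn = 3, minor_cn = 1, purity = 0.7)  # single-copy gain
expected_baf(total_cn = 1, minor_cn = 0, purity = 0.7)  # single-copy loss
```

The two bands are symmetric around 0.5 and move closer together as purity decreases, which is one reason the split can be hard to see in low-purity samples.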

Model selection and rerun

XploR allows users to rerun copy number variant (CNV) calling with custom purity and diploid coverage scale factor ranges, or by specifying a diploid region for normalization. The `RerunCNV` function supports both “model” and “region” rerun modes and outputs refined CNV calls.

In “region” mode, the diploid coverage region is specified using three parameters: chromosome, start, and end. If only chromosome is provided, the entire chromosome is used. If chromosome and start are provided, the region from start to the end of the chromosome is used. If chromosome and end are provided, the region from the start of the chromosome to end is used. If all three parameters are specified, the region between start and end on the selected chromosome is used.
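
The way the three region parameters combine can be sketched as follows. The helper `resolve_diploid_region` and its `chrom_length` argument are hypothetical and exist only to illustrate the rules described above; `RerunCNV` handles this internally, and the chromosome length used below is an illustrative round number rather than a real assembly length.

```r
# Hypothetical helper showing how chromosome/start/end define the region.
resolve_diploid_region <- function(chromosome, chrom_length,
                                   start = NULL, end = NULL) {
  region_start <- if (is.null(start)) 1 else start          # default: chromosome start
  region_end   <- if (is.null(end)) chrom_length else end   # default: chromosome end
  stopifnot(region_start >= 1,
            region_end <= chrom_length,
            region_start <= region_end)
  list(chromosome = chromosome, start = region_start, end = region_end)
}

len3 <- 2e8  # illustrative length only
resolve_diploid_region("3", len3)                            # whole chromosome
resolve_diploid_region("3", len3, start = 1e6)               # start to chromosome end
resolve_diploid_region("3", len3, end = 5e7)                 # chromosome start to end
resolve_diploid_region("3", len3, start = 1e6, end = 5e7)    # explicit interval
```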

# Rerun using specific purity and scale factor ranges
RerunCNV(
  seg = "results/Sample1_GATK_AI_segment.tsv",
  input = "results/Sample1_top_likelihood_calls.tsv",
  models = "results/Sample1_Models_likelihood.tsv",
  call = "results/Sample1_final_call.tsv",
  gender = "female",
  dicovsf = "0.95:1.05",
  purity = "0.6:0.8",
  mode = "model",
  out_file = "results/Sample1_final_call_refined.tsv"
)

# Rerun using a user-defined diploid region
RerunCNV(
  seg = "results/Sample1_GATK_AI_segment.tsv",
  input = "results/Sample1_top_likelihood_calls.tsv",
  models = "results/Sample1_Models_likelihood.tsv",
  call = "results/Sample1_final_call.tsv",
  gender = "male",
  chromosome = "3",
  start = 1000000,
  end = 50000000,
  mode = "region",
  out_file = "results/Sample1_final_call_refined.tsv"
)

Parameters for `RerunCNV`

Parameter	Type	Description	Example Value
`seg`	character	Path to AI segment file generated by `RunAIsegmentation`.	`"results/Sample1_GATK_AI_segment.tsv"`
`input`	character	Path to top likelihood row file generated by `RunModelLikelihood`.	`"results/Sample1_top_likelihood_calls.tsv"`
`models`	character	Path to model likelihood file generated by `RunModelLikelihood`.	`"results/Sample1_Models_likelihood.tsv"`
`call`	character	Path to final call file generated by `RunModelLikelihood`.	`"results/Sample1_final_call.tsv"`
`gender`	character	Sample gender, either `"male"` or `"female"`.	`"female"`
`dicovsf`	character	Desired scale factor range for normalization (must be in `"min:max"` format). Required if `mode = "model"`	`"0.95:1.05"`
`purity`	character	Desired purity range for model selection (must be in `"min:max"` format). Required if `mode = "model"`	`"0.6:0.8"`
`callcov`	numeric	Subclonal event calling cutoff (no models, coverage-based). Default is `0.3`.	`0.3`
`chromosome`	character	Chromosome for user-defined diploid region (required if `mode = "region"`).	`"3"`
`start`	integer	Start position for diploid region (optional; used in `"region"` mode).	`1000000`
`end`	integer	End position for diploid region (optional; used in `"region"` mode).	`50000000`
`mode`	character	Rerun mode: either `"model"` (custom purity/scale factor) or `"region"` (user-defined diploid region).	`"model"`
`out_file`	character	Output file path for refined CNV calls.	`"results/Sample1_final_call_refined.tsv"`
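
Because `dicovsf` and `purity` are passed as `"min:max"` strings, it can be useful to check them before launching a rerun. The helper below is a minimal sketch of such a check; `parse_range` is a hypothetical name and is not part of XploR, which may validate these arguments differently.

```r
# Hypothetical validator for "min:max" range strings such as "0.6:0.8".
parse_range <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)[[1]]
  if (length(parts) != 2) {
    stop("expected a range in 'min:max' format, got: ", x)
  }
  rng <- suppressWarnings(as.numeric(parts))  # non-numeric parts become NA
  if (anyNA(rng) || rng[1] > rng[2]) {
    stop("invalid range: ", x)
  }
  setNames(rng, c("min", "max"))
}

parse_range("0.95:1.05")
parse_range("0.6:0.8")
```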

Note:
After rerunning, the resulting file can be used in downstream steps, such as plotting and annotation, just like files generated prior to the rerun.

Full function and parameter list

For a complete list of all functions and their parameters, please visit the XploR function reference.

Each function page includes detailed parameter descriptions, usage examples, and links to related documentation.
