Repeat Expansion Detection with ExpansionHunter
Short tandem repeats (STRs) are regions of the genome consisting of repetitions of short DNA segments called repeat units. STRs can expand to lengths beyond the normal range and cause mutations called repeat expansions. Repeat expansions are responsible for many diseases, including Fragile X syndrome, amyotrophic lateral sclerosis, and Huntington's disease.
DRAGEN includes a repeat expansion detection method called ExpansionHunter. ExpansionHunter performs sequence-graph based realignment of reads that originate inside and around each target repeat. ExpansionHunter then genotypes the length of the repeat in each allele based on these graph alignments.
ExpansionHunter is designed for PCR-free whole genome samples. Repeats are genotyped only if the coverage at the locus is at least 10x. ExpansionHunter cannot be run on multiple FASTQ files that are assigned to different library IDs in the fastq_list.csv file.
More information and analysis are available in the following ExpansionHunter papers:
• Dolzhenko et al., Detection of long repeat expansions from PCR-free whole-genome sequence data, 2017
• Dolzhenko et al., ExpansionHunter: A sequence-graph based tool to analyze variation in short tandem repeat regions, 2019

To enable DRAGEN repeat expansion detection, the following command line options are required.
• --repeat-genotype-enable=true
• --repeat-genotype-specs=<path to specification file>
You can use the --sample-sex option to specify the sex of the sample.
The following command line options are optional.
• --repeat-genotype-region-extension-length=<length of region around repeat to examine> (default 1000 bp)
• --repeat-genotype-min-baseq=<minimum base quality for high confidence bases> (default 20)
For more information on the specification file specified by the --repeat-genotype-specs option, see Repeat Expansion Specification Files.
The main output of repeat expansion detection is a VCF file that contains the variants found via this analysis.
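As a minimal sketch, the options above can be assembled programmatically before being appended to the rest of a DRAGEN command line. The helper name repeat_genotype_args is hypothetical. The accepted values for --sample-sex are not listed in this section, so the value is passed through unchanged. The remaining DRAGEN arguments (reference, inputs, output) are deliberately omitted.

```python
from typing import List, Optional


def repeat_genotype_args(
    specs_path: str,
    sample_sex: Optional[str] = None,
    region_extension_length: int = 1000,
    min_baseq: int = 20,
) -> List[str]:
    """Return only the repeat-expansion options documented in this section."""
    args = [
        "--repeat-genotype-enable=true",
        f"--repeat-genotype-specs={specs_path}",
        # The two optional settings are written out with their documented defaults.
        f"--repeat-genotype-region-extension-length={region_extension_length}",
        f"--repeat-genotype-min-baseq={min_baseq}",
    ]
    if sample_sex is not None:
        # Passed through as given; valid values are not specified in this section.
        args.append(f"--sample-sex={sample_sex}")
    return args


if __name__ == "__main__":
    # Hypothetical catalog path, for illustration only.
    print(" ".join(repeat_genotype_args("/path/to/catalog.json")))
```

The returned list is meant to be concatenated with the rest of the command, for example before passing it to subprocess.run.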

The repeat-specification (also called variant catalog) JSON file defines the repeat regions for ExpansionHunter to analyze. Default repeat-specification files for some pathogenic repeats are in the /opt/edico/repeat-specs/ directory, based on the reference genome used with DRAGEN.
You can create specification files for new repeat regions by using one of the provided specification files as a template. See the ExpansionHunter documentation for details on the format.
ExpansionHunter requires a repeat-specification file, which you provide with --repeat-genotype-specs. If the option is not provided, DRAGEN attempts to autodetect the applicable catalog file from /opt/edico/repeat-specs/ based on the reference provided.
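When you adapt a provided file as a template, it can help to list the loci a catalog defines. The sketch below assumes the layout used by open-source ExpansionHunter variant catalogs: a JSON array of objects with LocusId, LocusStructure, and ReferenceRegion keys. This section does not confirm that layout, and the example entry is illustrative only, with placeholder coordinates.

```python
import json
from typing import Dict


def summarize_catalog(catalog_text: str) -> Dict[str, str]:
    """Map each locus ID to its locus structure.

    Assumes a JSON array of objects carrying "LocusId" and "LocusStructure"
    keys. This layout is an assumption, not something stated in this section.
    """
    entries = json.loads(catalog_text)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of locus entries")
    return {entry["LocusId"]: entry["LocusStructure"] for entry in entries}


# Illustrative entry only; not copied from a shipped catalog.
EXAMPLE = """
[
  {
    "LocusId": "EXAMPLE1",
    "LocusStructure": "(CAG)*",
    "ReferenceRegion": "chr1:1000-1060",
    "VariantType": "Repeat"
  }
]
"""

if __name__ == "__main__":
    for locus_id, structure in summarize_catalog(EXAMPLE).items():
        print(locus_id, structure)
```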

The current variant catalogs contain specifications for disease-causing repeats located in the AFF2, AR, ATN1, ATXN10, ATXN1, ATXN2, ATXN3, PHOX2B, ATXN7, ATXN8OS, C9ORF72, CACNA1A, CBL, CNBP, CSTB, DIP2B, DMPK, FMR1, FXN, HTT, JPH3, NOP56, PPP2R2B, TBP, TCF4, NIPA1, GLS, RFC1, and PABPN1 genes. The GRCh38/hg38 catalog also defines the NOTCH2NL repeat. This repeat is not included in the GRCh37/hg19 catalog because of alignment issues in this region.
ExpansionHunter can detect pathogenic expansions of the FXN, ATXN3, ATN1, AR, DMPK, HTT, FMR1, ATXN1, and C9ORF72 repeats with high accuracy (see the ExpansionHunter papers above). The pathogenicity status of some repeats might depend on the presence of sequence interruptions or motif changes that ExpansionHunter does not call. To visually inspect the relevant read alignments, you can use a third-party Repeat Expansion Viewer tool.


The results of repeat genotyping are output as a separate VCF file, which provides the length of each allele at each callable repeat defined in the repeat-specification catalog file. The name is <outputPrefix>.repeats.vcf (*.gz).
The VCF output file begins with the following core fields.
Field | Description |
---|---|
CHROM | Chromosome identifier |
POS | Position of the first base before the repeat region in the reference |
ID | Always . |
REF | The reference base at position POS |
ALT | List of repeat alleles in the format <STRn>, where n is the number of repeat units |
QUAL | Always . |
FILTER | LowDepth filter is applied when the overall locus depth is below 10x or the number of reads that span one or both breakends is below 5. |
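The LowDepth rule described for the FILTER field can be read as a simple predicate. The sketch below is a restatement of that rule, not DRAGEN's implementation. The function and parameter names are hypothetical, and the count of reads spanning the breakends is taken as an input because this section does not say which output field carries it.

```python
def is_low_depth(
    locus_depth: float,
    breakend_reads: int,
    min_depth: float = 10.0,
    min_breakend_reads: int = 5,
) -> bool:
    """Return True when a locus would be flagged LowDepth under the stated rule.

    locus_depth: overall depth at the locus (for example 37.46 for 37.46x).
    breakend_reads: number of reads spanning one or both breakends.
    """
    return locus_depth < min_depth or breakend_reads < min_breakend_reads


if __name__ == "__main__":
    print(is_low_depth(37.46, 20))  # sufficient depth and breakend support
    print(is_low_depth(8.0, 20))    # depth below 10x
    print(is_low_depth(37.46, 3))   # fewer than 5 breakend-spanning reads
```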
The INFO column contains the following fields.
Field | Description |
---|---|
END | Position of the last base of the repeat region in the reference |
REF | Number of repeat units spanned by the repeat in the reference |
RL | Reference length in bp |
VARID | Variant ID from the variant catalog |
RU | Repeat unit in the reference orientation |
REPID | Repeat ID from the variant catalog |
The FORMAT column defines the following per-sample fields.
Field | Description |
---|---|
GT | Genotype |
SO | Type of reads that support the allele. Values can be SPANNING, FLANKING, or INREPEAT. These values indicate whether the reads span, flank, or are fully contained in the repeat. |
REPCN | Number of repeat units spanned by the allele |
REPCI | Confidence interval for REPCN |
ADSP | Number of spanning reads consistent with the allele |
ADFL | Number of flanking reads consistent with the allele |
ADIR | Number of in-repeat reads consistent with the allele |
LC | Locus coverage |
For example, the following VCF entry describes the ATXN1 repeat in sample NA13537.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA13537
chr6 16327864 . G <STR33>,<STR58> . PASS END=16327954;REF=30;RL=90;RU=TGC;VARID=ATXN1;REPID=ATXN1 GT:SO:REPCN:REPCI:ADSP:ADFL:ADIR:LC 1/2:SPANNING/INREPEAT:33/58:33-33/52-71:4/0:69/83:0/4:37.459459
In this example, the first allele spans 33 repeat units while the second allele spans 58 repeat units. The repeat unit is TGC (RU INFO field), so the sequence of the first allele is TGC x 33 and the sequence of the second allele is TGC x 58. The repeat spans 30 repeat units in the reference (REF INFO field).
The length of the short allele was estimated from spanning reads (SPANNING), while the length of the expanded allele was estimated from in-repeat reads (INREPEAT). The confidence interval for the size of the expanded allele is (52,71). There are 4 spanning and 69 flanking reads consistent with the repeat allele of size 33. That is, 4 reads fully contain the repeat of size 33, and 69 flanking reads overlap at most 33 repeat units. There are 83 flanking and 4 in-repeat reads consistent with the repeat allele of size 58. The average coverage of this locus is 37.46x.
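The interpretation above can be reproduced with a short parser. The sketch below is a minimal, single-sample reader written for the ATXN1 example record. The function name parse_repeat_record and the output dictionary keys are hypothetical. It assumes every per-allele field is populated with "/"-separated values and does not handle missing values.

```python
from typing import Dict, List

RECORD = (
    "chr6 16327864 . G <STR33>,<STR58> . PASS "
    "END=16327954;REF=30;RL=90;RU=TGC;VARID=ATXN1;REPID=ATXN1 "
    "GT:SO:REPCN:REPCI:ADSP:ADFL:ADIR:LC "
    "1/2:SPANNING/INREPEAT:33/58:33-33/52-71:4/0:69/83:0/4:37.459459"
)


def parse_repeat_record(line: str) -> Dict[str, object]:
    """Parse one single-sample record from a repeats VCF."""
    # split() accepts tabs (as in a VCF file) and the spaces used in the example.
    chrom, pos, _id, _ref, _alt, _qual, flt, info, fmt, sample = line.split()
    info_fields = dict(item.split("=", 1) for item in info.split(";"))
    sample_fields = dict(zip(fmt.split(":"), sample.split(":")))
    unit = info_fields["RU"]

    # These FORMAT fields hold one "/"-separated value per allele.
    per_allele = ("SO", "REPCN", "REPCI", "ADSP", "ADFL", "ADIR")
    columns = [sample_fields[key].split("/") for key in per_allele]

    alleles: List[Dict[str, object]] = []
    for so, repcn, repci, adsp, adfl, adir in zip(*columns):
        low, high = (int(x) for x in repci.split("-"))
        alleles.append({
            "support": so,
            "units": int(repcn),
            "ci": (low, high),
            "length_bp": int(repcn) * len(unit),  # repeat units x unit length
            "spanning": int(adsp),
            "flanking": int(adfl),
            "inrepeat": int(adir),
        })

    return {
        "chrom": chrom,
        "pos": int(pos),
        "end": int(info_fields["END"]),
        "filter": flt,
        "unit": unit,
        "ref_units": int(info_fields["REF"]),
        "ref_length_bp": int(info_fields["RL"]),
        "coverage": float(sample_fields["LC"]),
        "alleles": alleles,
    }


if __name__ == "__main__":
    for allele in parse_repeat_record(RECORD)["alleles"]:
        print(allele)
```

For this record, the alleles are 33 x 3 = 99 bp and 58 x 3 = 174 bp long, and the reference fields are consistent with each other: END - POS = 16327954 - 16327864 = 90, which equals RL and REF x 3.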

The sequence-graph alignments of reads in the targeted repeat regions are output in a BAM file. You can use the specialized GraphAlignmentViewer tool, available on GitHub, to visualize the alignments. Programs like Integrative Genomics Viewer (IGV) are not designed for displaying graph-aligned reads and cannot visualize these BAMs.
The BAMs store graph alignments in custom XG tags using the format <LocusName>,<StartPosition>,<GraphCIGAR>.
• LocusName: A locus identifier that matches the corresponding entry in the repeat expansion specification file.
• StartPosition: The starting alignment position of a read on the first graph node.
• GraphCIGAR: The alignment of a read against the graph starting from that position. GraphCIGAR consists of a sequence of graph node identifiers and linear CIGARs describing the alignment of the read to each node.
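The three-part XG value can be split as sketched below. The GraphCIGAR syntax is an assumption: each node is taken to be written as a numeric node identifier followed by its linear CIGAR in square brackets, for example 0[30M]1[3M]. This section does not spell out the syntax, the example tag value is made up for illustration, and the names parse_xg and GraphAlignment are hypothetical.

```python
import re
from typing import List, NamedTuple, Tuple


class GraphAlignment(NamedTuple):
    locus: str
    start: int
    nodes: List[Tuple[int, str]]  # (node identifier, linear CIGAR on that node)


# Assumed GraphCIGAR element: <node id>[<linear CIGAR>]
_NODE = re.compile(r"(\d+)\[([^\]]*)\]")


def parse_xg(value: str) -> GraphAlignment:
    """Split an XG value of the form <LocusName>,<StartPosition>,<GraphCIGAR>."""
    locus, start, graph_cigar = value.split(",", 2)
    nodes = [(int(node), cigar) for node, cigar in _NODE.findall(graph_cigar)]
    # Reject strings that do not match the assumed syntax end to end.
    if "".join(f"{node}[{cigar}]" for node, cigar in nodes) != graph_cigar:
        raise ValueError(f"unexpected GraphCIGAR syntax: {graph_cigar!r}")
    return GraphAlignment(locus, int(start), nodes)


if __name__ == "__main__":
    # Made-up tag value, for illustration only.
    print(parse_xg("ATXN1,12,0[30M]1[3M]1[3M]2[40M]"))
```

A node identifier that repeats in consecutive elements, as node 1 does in the made-up value, would correspond to a read passing through the same graph node more than once.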
Quality scores in the BAM file are binary. High-scoring bases are assigned a score of 40, and low-scoring bases are assigned a score of 0.
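The binary scheme can be pictured as a simple threshold. The sketch below rests on two assumptions that this section does not state: that the cutoff is the --repeat-genotype-min-baseq value (default 20), and that bases at or above the cutoff count as high-scoring. The function name is hypothetical.

```python
from typing import Iterable, List


def binarize_qualities(quals: Iterable[int], min_baseq: int = 20) -> List[int]:
    """Map each base quality to 40 (high-scoring) or 0 (low-scoring).

    Assumption: a base is high-scoring when its quality is >= min_baseq.
    """
    return [40 if q >= min_baseq else 0 for q in quals]


if __name__ == "__main__":
    print(binarize_qualities([2, 19, 20, 37]))
```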