Data curation

We obtained white-listed data from the ICGC and TCGA PCAWG dataset. The term ‘white-listed’ refers to samples that passed quality control by the PCAWG consortium24. Data were accessed through the Cancer Genome Collaboratory. We used reads aligned to GRCh37 (BAM files), as described previously24. These data are available through the PCAWG data portal. A list of samples included in the analysis is available in Supplementary Table 2.

Identification of somatic rREs

We analysed tumour and matching normal samples for each cancer type independently. We executed EHdn (v0.9.0)16 with the following parameters: --min-anchor-mapq 50 --max-irr-mapq 40. To prioritize loci, we developed a workflow termed Tandem Repeat Locus Prioritization in Cancer (TROPIC). We included loci from chromosomes 1–22, X and Y for downstream analysis. We removed loci where >10% of Anchored in-repeat read (IRR) values were >40, which is the theoretical maximum value. The P value (from a non-parametric one-sided Wilcoxon rank-sum test) for each locus was used to calculate an FDR q value. Loci with FDR < 0.10 are reported. We selected loci where >5% of samples had an Anchored IRR quotient of >2.5. The results of our filtering are available in Supplementary Table 3. For a repeat expansion to be detected by EHdn, the TR was required to be larger than the sequencing read length. A somatic repeat expansion was defined as having FDR q < 0.05 in a comparison of the tumour and normal samples. We next calculated a preliminary estimate of the frequency of rREs in each cancer. To call repeat expansions in individual cancer samples, we analysed the distribution of tumour and normal Anchored IRR values and selected a conservative threshold for the Anchored IRR quotient ((tumour Anchored IRR – normal Anchored IRR)/(normal Anchored IRR + 1)) > 2.5 (Extended Data Fig. 4).
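The Anchored IRR quotient and the per-sample call described above can be sketched in a few lines of Python. This is a minimal illustration rather than the TROPIC implementation: the function names and the example values are hypothetical, and only the quotient formula and the 2.5 threshold come from the text.

```python
# Hypothetical helper names; only the formula and the threshold are from the text.
QUOTIENT_THRESHOLD = 2.5


def anchored_irr_quotient(tumour_irr, normal_irr):
    """(tumour Anchored IRR - normal Anchored IRR) / (normal Anchored IRR + 1)."""
    return (tumour_irr - normal_irr) / (normal_irr + 1.0)


def call_expansion(tumour_irr, normal_irr, threshold=QUOTIENT_THRESHOLD):
    """Call a repeat expansion in one tumour/normal pair."""
    return anchored_irr_quotient(tumour_irr, normal_irr) > threshold


def fraction_expanded(pairs, threshold=QUOTIENT_THRESHOLD):
    """Fraction of (tumour, normal) pairs at a locus that exceed the threshold."""
    calls = [call_expansion(t, n, threshold) for t, n in pairs]
    return sum(calls) / len(calls)


# Made-up (tumour, normal) Anchored IRR values for four samples at one locus.
example_pairs = [(8.0, 1.0), (2.0, 1.5), (0.0, 0.0), (12.0, 2.0)]
print(fraction_expanded(example_pairs))
```

The +1 in the denominator keeps the quotient defined when the normal sample has an Anchored IRR value of 0.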

Local read depth normalization

EHdn normalizes the number of Anchored IRRs for a given locus to the global read depth. To account for chromosomal amplifications and other forms of genetic variation that could alter local read depth, we performed the following normalization. For each rRE locus and sample in its corresponding cancer, samtools v1.13 was used with the parameter depth -r to find the read depth at each base pair within the locus and a 500-bp region encompassing the start and stop positions of the TR. We calculated the average read depth at each base pair and defined this as the local read depth. Finally, we calculated the local read depth-normalized Anchored IRR value specific to a sample and rRE combination by dividing the non-normalized Anchored IRR value from EHdn by the local read depth at the locus.
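The normalization step can be sketched as follows. The sketch assumes the three-column, tab-separated output of samtools depth (chromosome, position, depth); the function names and example numbers are hypothetical and the code is not the authors' pipeline.

```python
def parse_depth_output(text):
    """Extract per-base depths from samtools depth output (chrom, pos, depth)."""
    depths = []
    for line in text.splitlines():
        if not line.strip():
            continue
        _chrom, _pos, depth = line.split("\t")[:3]
        depths.append(int(depth))
    return depths


def local_read_depth(per_base_depths):
    """Average depth over all reported positions in the locus window."""
    if not per_base_depths:
        raise ValueError("no depth values reported for this window")
    return sum(per_base_depths) / len(per_base_depths)


def normalize_anchored_irr(raw_irr, per_base_depths):
    """Divide the non-normalized Anchored IRR value by the local read depth."""
    depth = local_read_depth(per_base_depths)
    if depth == 0:
        raise ValueError("local read depth is zero")
    return raw_irr / depth


# Made-up depth output for three positions, and a made-up raw Anchored IRR of 16.
example_depth_text = "4\t100\t30\n4\t101\t32\n4\t102\t34\n"
example_depths = parse_depth_output(example_depth_text)
print(normalize_anchored_irr(16.0, example_depths))
```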

Generation of CABOSEN cells

CABOSEN cells were generated from a cabozantinib-sensitive (CABOSEN) human papillary RCC xenograft tumour grown in Rag2–/–γC–/– mice, as described previously51. Tumour tissue was minced with a sterile blade, and the cell suspension was cultured in DMEM/F-12 medium (Corning) supplemented with 10% (vol/vol) Cosmic calf serum (ThermoFisher). Cells were expanded and cryopreserved in growth medium supplemented with 10% (vol/vol) DMSO, and cells from passage 8 were used for analysis.

Analysis of rREs by gel electrophoresis

We performed PCR with CloneAmp HiFi PCR Mix (Takara Biosciences) and added DMSO to a final concentration of 5–10% (vol/vol) as needed. A list of the primers used to analyse the loci is available in Supplementary Table 5. All cell lines tested negative for mycoplasma contamination with the MycoAlert Mycoplasma Detection kit (Lonza). Cell line identities were authenticated through STR profiling by the Genetic Resources Core Facility at Johns Hopkins University, with the exception of SNU-349 cells, which did not match the reported STR profile of SNU-349 cells or any other catalogued cell line but had a mutated VHL gene and expressed high levels of PAX8 and CA9, in line with a clear cell RCC origin.

Visualization of repeat expansions with ExpansionHunter and REViewer

To inspect the reads supporting a repeat expansion, we annotated the repeat as described on the GitHub page for ExpansionHunter. We then profiled the region with ExpansionHunter (v4.0.2) using the default settings15. The resulting reads were visualized with REViewer (v0.1.1) using the default settings. REViewer is publicly available. A repeat expansion was called when the repeat tract length for one allele of the tumour sample was greater than 100 bp and exceeded the repeat tract length of both normal alleles. A locus was considered validated if at least ten cancer genomes had a repeat expansion.

Validation of rREs in independent cohorts of samples

Twelve pairs of matching normal and tumour samples from patients with clear cell RCC were obtained with the patients’ informed consent ex vivo upon surgical tumour resection (Stanford institutional review board-approved protocols 26213 and 12597) and analysed. Eighteen and 15 pairs of matching normal and tumour samples for prostate and breast cancer, respectively, were obtained from the Tissue Procurement Shared Resource facility at the Stanford Cancer Institute and analysed. These samples were obtained with patients’ informed consent (Stanford institutional review board-approved protocols 11977 and 55606). Nucleic acid was isolated with either the Quick Microprep Plus kit (D7005) or the Zymo Quick Miniprep Plus kit (D7003) (Zymo Research). Gel electrophoresis was performed as described above. A locus was considered detected if a somatic repeat expansion was identified in at least one patient tumour sample compared with a matching normal sample.

Downsampling analysis

For the downsampling analysis, tumour genomes from RCC samples were downsampled from their mean (52×) sequencing depth to 40×, 30×, 20× and 10× depth with the samtools view command. EHdn was run, as described above, for each of the sequencing depths, and the Bonferroni-corrected P value was plotted for the rRE in UGT2B7 (GAAA, chr4:69929297–69930148).
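The text names only the samtools view command. One common way to downsample with it is the -s option, which keeps a given fraction of read pairs; whether this option was used here is an assumption. Under that assumption, the fractions for each target depth can be derived from the mean depth, as in this hypothetical sketch:

```python
def subsample_fraction(target_depth, mean_depth):
    """Fraction of reads to keep so that mean_depth is reduced to target_depth."""
    if not 0 < target_depth <= mean_depth:
        raise ValueError("target depth must be positive and not exceed the mean depth")
    return target_depth / mean_depth


MEAN_DEPTH = 52.0  # mean tumour sequencing depth stated in the text
TARGET_DEPTHS = (40, 30, 20, 10)

fractions = {depth: subsample_fraction(depth, MEAN_DEPTH) for depth in TARGET_DEPTHS}
for depth, fraction in fractions.items():
    print(f"{depth}x: keep {fraction:.3f} of reads")
```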

Benchmarking the local read depth normalization filter

We benchmarked the local read depth filter in silico by observing its behaviour with simulated reads. First, we created a reference genome containing artificially expanded repeats. We randomly selected ten TRs located on chromosome 1 that were shorter than the sequencing read length of 100 bp. We artificially expanded these TRs on chromosome 1 of GRCh37 with the BioPython Python package (v1.79). Next, we used wgsim (v0.3.1-r13) to simulate reads from the reference file with the command ‘wgsim -N 291269925 -1 100 -2 100 reference_file.fasta output.read1.fastq output.read2.fastq’. The number of reads (specified by the -N option) was calculated to achieve 30× coverage of chromosome 1. The resulting pair of files, hereafter referred to as the base fastq files, contained a copy number of 2 for all of the expansions.
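The idea of artificially expanding a TR in a reference sequence can be illustrated with plain Python strings. This is a sketch and not the authors' BioPython code: the function name, the 0-based half-open coordinates and the toy sequence are all assumptions made for illustration.

```python
def expand_tandem_repeat(sequence, start, end, motif, extra_copies):
    """Append extra motif copies to the repeat tract sequence[start:end].

    Coordinates are 0-based and half-open (an assumption of this sketch).
    """
    tract = sequence[start:end]
    copies, remainder = divmod(len(tract), len(motif))
    if remainder != 0 or tract != motif * copies:
        raise ValueError("the given interval is not a pure run of the motif")
    return sequence[:end] + motif * extra_copies + sequence[end:]


# Toy reference: a 12-bp GAAA tract flanked by unique sequence.
toy_reference = "ACGT" + "GAAA" * 3 + "TTGC"
expanded_reference = expand_tandem_repeat(toy_reference, 4, 16, "GAAA", 30)

# The tract is now 33 copies (132 bp), longer than a 100-bp read.
print(len(expanded_reference) - len(toy_reference))
```

Expanding the tract beyond the read length matters because, as stated earlier, EHdn detects a repeat only when it is larger than the sequencing read length.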

To simulate copy number amplification, the read simulation process was repeated using reference files that contained only the artificially expanded repeats and their surrounding 1,000-bp flanking regions. We created ten pairs of fastq files, each with an increasing copy number. We specified the copy number by multiplying the number of reads to generate (wgsim -N option) by the required number. To generate the final set of fastq files, we concatenated each pair of copy number-amplified fastq files with the base fastq files. The end result was eight pairs of fastq files that contained reads for chromosome 1 and copy number amplification varying from 2 to 10 of the expanded repeats.

The base fastq file with a copy number of 2, in addition to the eight copy number-amplified fastq files, was aligned to chromosome 1 of GRCh37 with bwa-mem (v0.6) with the default options. The resulting SAM files were converted to BAM format with samtools (v1.15) using the default options. We then ran the EHdn profile command (v0.9.0) with the minimum anchor mapping quality set to 50 and maximum IRR mapping quality set to 40. Finally, the Anchored IRR values were extracted by overlapping the STR coordinates with the de novo repeat expansion calls.

Short-read and long-read DNA sequencing

We sequenced the Caki-1 and 786-O cell lines with both short-read sequencing (60× sequencing coverage, 150-bp paired-end sequencing on a NovaSeq 6000 instrument) and long-read sequencing (50× sequencing coverage, PacBio HiFi sequencing on a Sequel IIe instrument). We aligned the long reads to GRCh37 with pbmm2 (v1.7.0), using the parameters --sort --min-concordance-perc 70.0 --min-length 50. We aligned the short reads to GRCh37 with Sentieon (v202112.01), an implementation of BWA-MEM, using the parameters -K 10000000 -M, and analysed the samples with EHdn, as described above. We included loci for which at least one sample had an Anchored IRR value of >0 for further analysis. Anchored IRR values >0 arise when the repeat length exceeds the sequencing read length. To benchmark EHdn against long-read sequencing data, we manually determined the TR length of a given locus in the long-read sequencing data. If the TR length in the long-read sequencing data exceeded the short-read sequencing read length of 150 bp, we considered that locus to have been confirmed.

The PacBio HiFi data were aligned to GRCh37 with pbmm2 (v1.7.0) and visualized at the UGT2B7 locus with Tandem Repeat Genotyper (v0.2.0).

Analysis of rRE loci

To determine whether rREs were associated with any human diseases, rREs were mapped to genes with GREAT (v4.0.4, default settings)52. The resulting genes were analysed with Enrichr using Jensen Diseases53. The output of this analysis is available in Supplementary Table 4. To determine whether repeat expansions were associated with MSI-high cancers, we obtained data from ref. 3. The percentage of MSI-high cancers was obtained for colon adenocarcinoma (COAD), stomach adenocarcinoma (STAD), kidney renal cell carcinoma (KIRC), ovarian serous cystadenocarcinoma (OV), prostate adenocarcinoma (PRAD), head and neck squamous cell carcinoma (HNSC), liver hepatocellular carcinoma (LIHC), bladder urothelial carcinoma (BLCA), glioblastoma multiforme (GBM), skin cutaneous melanoma (SKCM), thyroid carcinoma (THCA) and breast invasive carcinoma (BRCA) and compared with the number of repeat expansions and the percentage of patients with at least one repeat expansion in the corresponding cancer type from the PCAWG dataset. We also overlapped cancer genomes containing rREs with the microsatellite mutation rate (data available for all but 157 PCAWG genomes analysed in this study), which we term the STR mutation rate, and MSI calls from ref. 28. The association of rREs with STR mutation rate was assessed with the two-tailed Wilcoxon rank-sum test. The association of rREs with MSI calls was assessed by chi-squared test with Yates’ correction.

To determine whether rREs were associated with known mutational signatures, we downloaded mutational signatures from the ICGC Data Coordination Center (DCC). We performed multiple linear regression for each SBS and DBS signature to identify predictors of the number of rREs present in a sample. To choose the predictors, we performed best subset selection on DBS and SBS signatures and included age as a possible confounding factor. We used statsmodels (v0.12.2) in Python and, specifically, the ordinary least-squares model found in the statsmodels.api.OLS module to estimate the coefficients of the selected predictors in their corresponding multiple linear regression model54.

To determine whether repeat expansions were associated with a difference in cytotoxic activity, we calculated cytotoxic activity as previously described for four cancers that had matching RNA-seq and WGS data40. For each locus, we compared the cytolytic activity for patients with a repeat expansion to that for patients without a detected repeat expansion using a Welch’s t test (a two-tailed test) with correction for multiple-hypothesis testing (Benjamini–Hochberg FDR q < 0.05). rREs were annotated with genic elements using annotatr (v1.18.1)32.

To determine whether rREs were associated with regulatory elements, we downloaded cCREs33 and mapped them to GRCh37 with LiftOver (UCSC) (n = 950,091 after removing 174 outliers)55. We determined the distance between rREs and cCREs with the bedtools closest command (v2.27.1)56 and compared this distance to that for a simple repeats catalogue57. To compare the distance to ENCODE cCREs, a Welch’s t test was performed.

To determine whether prostate cancer rREs were associated with prostate cancer susceptibility loci35, we calculated the distance to three sets of loci using the ‘bedtools closest’ command. We calculated the distance between (1) rREs present in prostate cancer samples and prostate cancer susceptibility loci, (2) rREs not present in prostate cancer samples and prostate cancer susceptibility loci and (3) simple repeats and prostate cancer susceptibility loci. To compare the distances between these three associations, we performed a Welch’s t test with FDR correction (Benjamini–Hochberg).

To determine whether rREs were associated with replication timing, we downloaded Repli-seq replication timing data for seven cell lines from the ENCODE website (NCI-H460, T470, A549, Caki2, G401, LNCaP and SKNMC)58. We selected regions for which all cell lines had concordant signals for analysis (early or late replication designations in agreement for each cell line at a given locus). We determined whether there was a difference in the distribution of rREs across early- and late-replicating regions compared with the simple repeats catalogue by using bootstrapping (n = 10,000). We sampled 54 loci (the number of rREs present in a concordant replication region) from rREs and simple repeats. A Welch’s t test was performed on the bootstrapped samples to estimate a P value. We applied FDR correction (Benjamini–Hochberg) to the estimated P values. To determine whether rRE status in UGT2B7 was associated with survival outcome in patients with clear cell RCC (TCGA abbreviation, KIRC), we used Welch’s t-test quartile.

To identify motifs enriched and depleted in the rRE catalogue, we followed the same method as in the motifscan Python module (v1.3.0)59. We compared our rRE catalogue to the simple repeats catalogue (TRF) as a control. For each unique motif present, we built a contingency table specifying the count of rREs and simple repeats with and without the motif. Two one-tailed Fisher’s exact tests were applied to the table to test for significance in both directions, that is, enrichment and depletion. The ‘stats’ module in the Scipy Python package (v1.7.0) was used to conduct the significance test. Because multiple-hypothesis tests were performed, we applied FDR correction (Benjamini–Hochberg) for multiple-hypothesis testing to the P values, with a cut-off (FDR) of 0.01.
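The two one-tailed Fisher's exact tests on a motif contingency table can be illustrated as follows. The analysis above used the stats module of SciPy; this sketch instead reimplements the test with the standard library so that it is self-contained, and the function name and example counts are hypothetical. In SciPy, the equivalent calls would be scipy.stats.fisher_exact with alternative='greater' and alternative='less'.

```python
from math import comb


def fisher_one_tailed(rre_with, rre_without, ctrl_with, ctrl_without):
    """Return (p_enriched, p_depleted) for a 2x2 motif contingency table.

    Rows: rRE catalogue vs. simple repeats catalogue.
    Columns: repeats with the motif vs. repeats without it.
    """
    row1 = rre_with + rre_without      # total rREs
    row2 = ctrl_with + ctrl_without    # total simple repeats
    col1 = rre_with + ctrl_with        # total repeats carrying the motif
    total = comb(row1 + row2, col1)

    # Hypergeometric weights for every feasible count of motif-carrying rREs.
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    weights = {k: comb(row1, k) * comb(row2, col1 - k) for k in range(k_min, k_max + 1)}

    p_enriched = sum(w for k, w in weights.items() if k >= rre_with) / total
    p_depleted = sum(w for k, w in weights.items() if k <= rre_with) / total
    return p_enriched, p_depleted


# Made-up counts: 3 of 4 rREs and 1 of 4 control repeats carry the motif.
p_enriched, p_depleted = fisher_one_tailed(3, 1, 1, 3)
print(p_enriched, p_depleted)
```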

For the comparison of SNVs in COSMIC genes to rREs, we first divided the cancer genomes into two categories: an rRE cohort and a non-rRE cohort. The rRE cohort contained all genomes that had at least one rRE detected (n = 615), and the non-rRE cohort contained all genomes that had no rREs detected (n = 1,897). We then looked at the number of donors in the rRE cohort that had at least one mutation in a given gene (COSMIC tier 1 genes) i and the number of donors in the non-rRE cohort that had at least one mutation in a given gene i with a contingency table. We calculated the P value (Fisher’s exact test) for the significance of associating genes with either the rRE or non-rRE cohort. This P-value calculation was repeated for all COSMIC genes, using FDR at a significance level of 0.05 (Benjamini–Hochberg) to correct for multiple-hypothesis testing.
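Benjamini–Hochberg correction is applied to per-gene P values here and in several other analyses in this section. A minimal standard-library sketch of the procedure is given below; the function name and example P values are hypothetical and the code is not taken from the authors' analysis.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


# Made-up P values for four genes.
example_p = [0.01, 0.04, 0.03, 0.20]
example_q = benjamini_hochberg(example_p)
significant = [q < 0.05 for q in example_q]
print(example_q, significant)
```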

Estimation of expansions in the general population

To estimate the frequency of rREs in the general population, EHdn (v0.9.0) was run on 1000 Genomes Project samples60 (n = 2,504) (GRCh38) and Medical Genome Reference Bank61 samples (n = 4,010) (GRCh37 lifted over to GRCh38).

The genomic coordinates of the 160 rREs (GRCh37) were padded with 1,000 bp and translated to GRCh38 coordinates with UCSC LiftOver. Then, the rRE coordinates (GRCh38) were overlapped with loci from the population samples containing Anchored IRR calls. rREs that overlapped with matching motifs in the population samples were selected for further analysis. We next sought to identify expanded rREs in the population samples to quantify their prevalence. To do so, we converted their global-normalized Anchored IRR values to be comparable to ICGC values. This step was necessary because sequencing read lengths in the PCAWG dataset are generally 100 bp while the read lengths in the 1000 Genomes and Medical Genome Reference Bank datasets are 150 bp. Conversion followed the formula (Anchored IRR, 100 bp) = 0.5 + 1.5 × (Anchored IRR, 150 bp) (ref. 16). A sample in the population samples was counted as expanded if its Anchored IRR value was greater than the 99th percentile of Anchored IRR values in the normal samples from the PCAWG dataset, a threshold that is comparable to the threshold used to call expansions in tumour samples (Extended Data Fig. 4). In future rRE catalogues, for the rare instance where the estimated frequency of repeat expansions in the population samples is higher than expected, these data could be used to further filter rREs to improve the detection of cancer-specific repeat expansions.
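The read-length conversion and the percentile-based call can be sketched as follows. The conversion formula is the one stated above; the percentile method (linear interpolation between order statistics), the function names and the example values are assumptions of this sketch, because the text does not specify how the 99th percentile was computed.

```python
def convert_150bp_to_100bp(irr_150bp):
    """(Anchored IRR, 100 bp) = 0.5 + 1.5 x (Anchored IRR, 150 bp)."""
    return 0.5 + 1.5 * irr_150bp


def percentile(values, pct):
    """Percentile with linear interpolation (an assumed method)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def is_expanded(population_irr_150bp, pcawg_normal_irr_values):
    """Compare a converted population value with the PCAWG normal 99th percentile."""
    threshold = percentile(pcawg_normal_irr_values, 99)
    return convert_150bp_to_100bp(population_irr_150bp) > threshold


# Made-up Anchored IRR values standing in for PCAWG normal samples at one locus.
example_normals = [i / 100 for i in range(101)]
print(is_expanded(1.0, example_normals), is_expanded(0.0, example_normals))
```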

To compare the length of TRs in normal samples with and without a matching rRE in a tumour sample, donors in the Prost-AdenoCA and Kidney-RCC cohorts whose data are available for download through the Cancer Collaboratory were included (n = 253). We used ExpansionHunter (v5.0.0) with the default options to genotype prostate and kidney cancer rREs in the normal samples of the selected donors. When there were two alleles of an rRE in a sample, both alleles were included and treated as distinct data points. For each rRE, we tested whether the distribution of genotypes from donors who had an expansion in their tumour samples differed from that for donors who did not have an expansion. Student’s t test was used to compute P values with FDR correction (Benjamini–Hochberg) to adjust for multiple-hypothesis testing.

Association of rREs with gene expression

Matching RNA-seq and WGS data were available for Kidney–RCC, Ovary–AdenoCA, Panc–AdenoCA and Panc–Endocrine. RNA-seq data from these samples were obtained from the DCC, and values were converted to transcripts per million (TPM). Normalized gene expression (TPM) values were compared for samples with and without an rRE (Welch’s t test, with FDR correction). For isoform analysis, normalized gene expression counts were compared for samples with and without a repeat expansion using the DESeq2 (v1.32.0) package in R (v4.0.5). We used the DESeq function to calculate the log2-transformed fold change for three isoforms of the UGT2B7 gene (ENST00000305231.7, ENST00000508661.1 and ENST00000502942.1) and performed a Wald test with FDR correction using the Benjamini–Hochberg procedure (q-value threshold of q < 0.01).

Design, synthesis and characterization of Syn-TEFs and polyamides

Syn-TEFs and polyamides were designed to target a GAAA repeat (Syn-TEF3 and PA3) or a control GGAA repeat (Syn-TEF4 and PA4). Syn-TEF3, Syn-TEF4, PA3 and PA4 were synthesized and purified to a minimum of 95% compound purity by WuXi Apptec and used without further characterization. HPLC conditions for chemical characterization were as follows: flow rate of 1.0 ml min–1; solvent A: 0.1% (vol/vol) trifluoroacetic acid (TFA) in water; solvent B: 0.075% (vol/vol) TFA in acetonitrile; Gemini column: C18 5 μm 110A 150 × 4.6 mm. Full results of characterization can be found in Supplementary Fig. 2.

Treatment of RCC cell lines with Syn-TEFs

Caki-1, 786-O and Caki-2 cells were obtained from the American Type Culture Collection (ATCC) and grown in RPMI-1640 with l-glutamine (Gibco, 11875093), supplemented with 10% (vol/vol) FBS. A498 and ACHN cells were obtained from ATCC and grown in DMEM with glucose, l-glutamine and sodium pyruvate (Corning, 10-013-CV), supplemented with 10% (vol/vol) FBS. RCC-4 cells were obtained from A. Giacca (Stanford University) and grown in DMEM with glucose, l-glutamine and sodium pyruvate (Corning, 10-013-CV), supplemented with 10% (vol/vol) FBS. Cell line identities were confirmed by STR profiling (Genetic Resource Core Facility, Johns Hopkins University) and tested negative for mycoplasma. Cells were seeded in 96-well plates on day 0. On day 1, cells were treated with the indicated molecules. Molecules were dissolved in DMSO (vehicle) and added to cells (0.1% (vol/vol) final concentration of DMSO). On day 4 (72 h later), relative metabolic activity was measured as a proxy for relative cell density, using the Cell Counting Kit (CCK-8, Dojindo Molecular Technologies) according to the manufacturer’s instructions. Absorbance (450 nm) of cells treated with molecules was normalized to that for cells treated with DMSO (0.1% (vol/vol)) or with no treatment. Absorbance was measured with an Infinite M1000 microplate reader (Tecan).

For microscopy, Caki-1 and 786-O cells were plated on glass-bottom 96-well plates under standard culture conditions. One day after plating, medium containing no drug, 50 μM Syn-TEF3 or 50 μM Syn-TEF4 was added, and the cells were incubated for 72 h at 37 °C. As a control, wells that received no treatment were incubated with 70% (vol/vol) ethanol for 30 s before staining. Cells were then stained with propidium iodide, Calcein-AM and Hoechst 33342 from the Live-Dead Cell Viability Assay kit (Millipore Sigma, CBA415) according to the manufacturer’s instructions and immediately imaged at ×10 magnification with a 0.17-NA CFI60 objective on a Keyence BZ-X710 microscope. Eight fields were measured for each treatment condition, and the experiment was repeated two times. Quantification was conducted using FIJI software (release 20220330-1517). For statistical analyses, one-way ANOVA adjusted with Bonferroni correction for multiple comparisons was conducted with GraphPad Prism (v9.3.1).

Statistics and reproducibility

Data are represented as the mean ± s.e.m. unless stated otherwise. All experiments were reproduced at least twice unless stated otherwise. Box plots were prepared with matplotlib (v3.4 or v3.6) as follows unless stated otherwise: the box extends from the first quartile (Q1 or 25th percentile) to the third quartile (Q3 or 75th percentile) of the data, with a line at the median. The whiskers extend from the box by 1.5 times the interquartile range (IQR). The IQR is the difference between the values at Q3 and Q1. Outliers were not plotted to improve clarity. Details on how box plots were generated are available in the matplotlib documentation.
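The box plot quantities described above can be computed directly, as in the standard-library sketch below. It assumes linearly interpolated quartiles and whiskers that stop at the most extreme data point within 1.5 × IQR of the box, which is how matplotlib draws box plots by default; the function name and example data are hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, IQR, whisker ends and outliers for one box."""
    # method="inclusive" interpolates linearly between order statistics.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inliers = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inliers),
        "whisker_high": max(inliers),
        "outliers": outliers,  # computed here, but not drawn in the figures
    }


# Made-up data with a single extreme value.
example_summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(example_summary)
```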

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


