Mosaic chromosomal alterations in blood across ancestries using whole-genome sequencing
[ad_1]
Study population
We included 67,390 participants from 19 TOPMed studies: Genetics of Cardiometabolic Health in the Amish (n = 1,109) (ref. 32), Atherosclerosis Risk in Communities Study (n = 3,780) (ref. 33), Barbados Genetics Asthma Study (n = 980), Mount Sinai BioMe Biobank (n = 9,392) (ref. 34), Coronary Artery Risk Development in Young Adults (n = 3,293) (ref. 35), Cleveland Family Study (n = 1,281), CHS (n = 3,517) (ref. 36), COPDGene (n = 10,050) (ref. 37), Framingham Heart Study (n = 4,007) (ref. 38), Genetic Studies of Atherosclerosis Risk (n = 1,733) (ref. 39), Genetic Epidemiology Network of Arteriopathy (n = 1,157), Genetics of Lipid Lowering Drugs and Diet Network (n = 942), Hispanic Community Health Study—Study of Latinos (n = 3,857) (ref. 40), Hypertension Genetic Epidemiology Network (n = 1,865), Jackson Heart Study (n = 3,317) (ref. 41), MESA (n = 5,222) (ref. 42), Vanderbilt BioVU study of African Americans (n = 1,085), Women’s Genome Health Study (n = 108), and Women’s Health Initiative (WHI, n = 10,695) (ref. 43). The 67,390 TOPMed participants were categorized into discrete ancestry subgroups using the Harmonized Ancestry and Race/Ethnicity machine learning algorithm44, which uses genetically inferred ancestry to refine self-identified race/ethnicity and impute missing racial/ethnic values. The ancestry composition in this study was 57% European, 30% African, 11% Hispanic/Latino and 2% Asian (Supplementary Table 19). Further descriptions of the design of the participating TOPMed cohorts and the sampling of individuals within each cohort for TOPMed WGS are provided in Supplementary Note 3. All studies were approved by the appropriate institutional review boards (IRBs) and informed consent was obtained from all participants.
WGS data
WGS was performed as part of the National Heart, Lung and Blood Institute TOPMed program. The WGS was performed at an average depth of 38X by six sequencing centers (Broad Genomics, Northwest Genome Institute, Illumina, New York Genome Center, Baylor, and McDonnell Genome Institute) using Illumina X10 technology and DNA from blood. Here we report analyses from ‘Freeze 8’, for which reads were aligned to the Genome Reference Consortium human genome build 38 using a common pipeline across all centers. To perform variant quality control (QC), a support vector machine classifier was trained on known variant sites (positive labels) and Mendelian inconsistent variants (negative labels). Further variant filtering was done for variants with excess heterozygosity and Mendelian discordance. Sample QC measures included: concordance between annotated and inferred genetic sex, concordance between prior array genotype data and TOPMed WGS data, and pedigree checks. Additional details can be found in Taliun et al.26.
Detection of mCAs
Detection of mCAs was performed on the WGS-based genotype and read depth data. The mCA call set was generated using the MoChA v1.11 caller. This approach utilizes phased genotypes, coverage (log R ratio, LRR) and B allele frequency (BAF) at heterozygous sites for detection of mCAs. Input data at heterozygous markers came from a previous analysis of the TOPMed cohort as outlined in Taliun et al.26. However, not all variants were included in the analyses. First, heterozygous markers with a MAF less than 1% and those where the read depth of either allele was less than five were removed. Second, we removed markers within germline copy number variants previously generated in TOPMed. Third, when more than one marker was present in a 1,000 base pair genomic region, then only one marker was retained. The MoChA caller was run with the extra option ‘–LRR-weight 0.0–bdev-LRR-BAF 6.0’ to disable the LRR + BAF model. The resulting mCA calls were filtered by excluding (1) those that span less than 2,000 informative markers, that is heterozygous sites; (2) those with logarithm of the odds score less than 5; (3) those on chrX but with inferred sex ‘unknown’; (4) those with estimated relative coverage higher than 2.9; and (5) those with BAF deviation larger than 0.16 and relative coverage higher than 2.5. Steps 4 and 5 are used to exclude putative germline duplications. Classification of mCAs as lymphoid or myeloid was performed following criterion from Niroula et al.23.
Downsampling
The total number of heterozygous sites can affect the power for detection of mCAs as the mCA calling method relies on heterozygous sites for detecting imbalances in the parental haplotypes5,45. The AA, HA and EA groups had on average higher number of heterozygous sites compared with the EAS group (Supplementary Table 4); therefore, to adjust for this difference we downsampled heterozygous sites in the AA, HA and EA groups, and then used those data to generate mCA calls and reasses reported associations of mCAs with ancestry. The downsampling was conducted by matching the distribution of heterozygous sites for AA, HA and EA groups to that of the EAS group. This adjustment was done separately for females and males. For example, if a HA female sample had 925,935 heterozygous markers, which is equivalent to the 50th percentile for HA females, then heterozygous markers were removed at random across the genome until the sample had 749,959 markers, which is equal to the 50th percentile for EAS females.
Comparisons of mCAs across ancestries
Subsequent to downsampling, we investigated possible batch effects that may have influenced mCA detection rates across both autosomes and chrX. After adjusting for age, age2, sex and ancestry, a variable representing ‘study’ had no effect on autosomal mCA detection, although we did find a study effect for chrX mCA detection. Therefore, in all of our analyses comparing mCA detection across ancestries, we included age, age2, sex and study as covariates.
Estimation of genetic ancestry proportions
Ancestry was estimated in the TOPMed WGS data using RFMix46 with a three-way reference panel of 92 Europeans and 92 Africans from the 1,000 Genomes project47 and 92 Native American samples from Human Genome Diversity Project48. In TOPMed we only considered SNPs with MAF >0.05 to speed up the computation. RFMix was run separately for each chromosome. We estimated the proportion of African ancestry for all AA individuals at the chromosome level by averaging the local ancestry proportions of all loci within that chromosome. To determine whether estimated African ancestry on chrX was associated with the prevalence of an mCA on that same chromosome, we ran logistic regressions with mCA status as the response variable, age, age2, sex and study as covariates, and the four quartiles of estimated African ancestry as the main effect of interest, with 0–25% as the reference group. For autosomal mCAs (Supplementary Table 9) African ancestry was treated as a continuous variable.
Association analyses
We performed a WGS-based GWAS between germline variants observed in TOPMed and presence of an mCA, separately for autosomal (N = 67,518) and chrX loss (N = 39,585) mCAs. For each sample, we defined the phenotype as presence/absence of one or more autosomal mCAs and tested against all variants with MAC ≥ 5 that passed the quality filters. Samples with uncertain identity or poor quality were excluded from analysis. For chrX loss analyses we excluded samples with a chrX mCA that was a gain, CN-LOH or undetermined. Principal components and genetic relatedness estimates were calculated using PC-AiR49 and PC-Relate50, as described previously in Hu et al.51 QC replicates or duplicate samples were removed after selecting the sample with the highest average autosomal depth rate. All logistic regression analyses included age, age2, sex, study and genetic ancestry as covariates. The final sample set included five genetic ancestry categories consisting of AA, EA, EAS, HA and a group of 1,099 samples that were characterized as having “unknown” ancestry. To test for association of sex with specific mCA types, for example 20q loss, we first conducted a chi-squared test (R chisq.test, simulate.p.value = TRUE, B = 100,000). For mCA types with marginal significance (P < 0.1), we then conducted logistic regression to test for association using a Bonferroni correction to account for the 156 independent tests.
We performed genetic association tests in cis and in trans state using a generalized linear mixed model approach using the generalized linear mixed model association test method52 as implemented in the GENESIS software53. For each analysis, a null model assuming no association between the outcome and any variant was fit, adjusting for sex, age, study-sequencing phase and the first 11 principal components (PCs) to capture genetic ancestry. A fourth degree sparse empirical kinship matrix computed with PC-Relate was included as a random effect to account for genetic relatedness among participants. The residuals from this null model were then used to perform genome-wide score tests of genetic association.
For the trans association analyses, we defined cases as those with a detectable mCA and tested all genetic variants with MAC ≥20 and had less than 10% of samples with sequencing read depth <10 at that particular variant.
For the cis associations (that is, variants within the same genomic locus as the mCA), we identified eight genomic loci of interest, which included the ATM and MPL genes, as well as chromosome arms with recurrent autosomal mCAs (n > 100), which included 14q CN-LOH, 1q CN-LOH, 11q CN-LOH, 12p gain, 12q gain, 20q loss and 1q CN-LOH. Cases were defined as those with an mCA call spanning the chromosome arm or gene, while controls were defined as those without any mCA calls on the chromosome arm tested. Because the case-control ratio was highly unbalanced for these analyses, we matched cases to controls using study, sequencing phase, sex and age to obtain a 1:10 ratio before fitting the null model. We tested all variants that passed the quality filters and had MAF ≥0.01. We defined a significance threshold of P < 0.05/(effective number of variants), where the effective number of variants tested was calculated using simpleM54.
For the ATM and MPL gene analyses, we further filtered variants based on annotations. Variants were annotated using ANNOVAR (v2019-10-24) and selected for use on the basis of their presence in exons and/or potential involvement in splicing. In addition to canonical splice sites, we also tested variants ±6 bp from exon boundaries, as well as less-canonical splice sites identified by SPIDEX47,55. To specifically account for promoters, we identified promoters for these two genes using the Eukaryotic Promoter Database and included variants found in the promoter region. Similar to the cis-association analyses, we defined a significance threshold of P < 0.05/(effective number of variants), where the effective number of variants tested was calculated using simpleM. We also ran secondary analyses of these variants with a lower MAC threshold (MAC ≥5).
In addition to single variant testing, we conducted gene-based aggregate tests to assess the cumulative effect of rare variants on mCA presence. Variants were aggregated by gene using the GENCODE v29 gene model. We used two strategies for filtering variants. For both strategies, variants were first filtered to MAF <0.01 in the sample set being tested. The first strategy (coding_filter1) includes only high confidence predicted loss-of-function variants inferred using LOFTEE56 and missense variants filtered using MetaSV score >0 (ref. 57). The second strategy (coding_noncoding_filter1) includes all variants from the first strategy plus additional regulatory variants. Regulatory variants were included if they overlapped with enhancer(s) or promoters linked to a gene using GeneHancer58 or 5 Kb upstream of the transcription start site. Within these regions only those variants were retained that had Fathmm-XF score >0.5 or overlap with regions labeled as either ‘CTCF binding sites’ or ‘transcription factor binding sites’ as annotated by the Ensembl regulatory build annotation59. Results were filtered to only those aggregation units with a cumulative MAC ≥20. We defined the significance threshold as P < 0.05/(number of aggregation units tested).
The annotation-based variant filtering and gene-based aggregation was performed using TOPMed Freeze 8 Whole Genome Sequence Annotator (WGSA) Google BigQuery annotation database on the BiodataCatalyst powered by Seven Bridges platform60. The annotation database was built using variant annotations generated by WGSA version v0.8 (ref. 61) and formatted by WGSAParsr version 6.3.8 (ref. 62). The GENCODE v29 gene model-based variant consequences were obtained from Ensembl Variant Effect Predictor63 incorporated within WGSA.
Allelic shift analysis
Allelic shift analyses were conducted as outlined in Loh et al.6. Variants in the MPL and ATM genes with P < 0.05 in the cis-association analyses and with OR >1 were included in these analyses.
Co-occurence analysis
Co-occurence between CHIP status and mCA status was analyzed as in previous work28. First, CHIP ‘carriers’ (individuals with an observed acquired mutation) were assigned to different categories by the gene where the mutations were located. Carriers of mCAs were assigned to categories based on the location (chromosome, p-arm and q-arm) and the changes of copy numbers (gain, loss and CN-LOH) of the mCA. A CHIP carrier or an mCA carrier may have been assigned to different categories if that individual carried multiple CHIP mutations or multiple mCAs. In our analysis, we required there to be at least ten carriers in each category, leaving 30 CHIP categories and 66 mCA categories for comparison (Fig. 3b). P values for co-occurence of CHIP and mCA carrier status were obtained via the Wald test (note that 0.5 was added to all cells in the 2 × 2 table if zero value(s) exist). A Bonferroni correction was implemented to assess significance.
Analysis of hematologic malignancy data
Due to a paucity of cancer outcome data in all other cohorts, we restricted our analysis of hematologic malignancies to the WHI. We assigned the available hematologic cancer outcomes in the WHI cohort into categories: lymphoid and myeloid cancers. Patients diagnosed as chronic lymphocytic leukemia, non-Hodgkins lymphoma or multiple myeloma were assigned to the lymphoid group (n = 237). All other patients diagnosed with leukemia were assigned to the myeloid group (n = 53). We further excluded patients who were diagnosed as having any cancer before blood draw (that is, the time at which DNA for mCA calling was collected), which reduced the lymphoid cancer case number to 223 and the myeloid cancer case number to 52.
We ran separate logistic regression models to test the association between mCA carrier status and risk for lymphoid or myeloid cancer. Covariates in our model included CHIP carrier status, the interaction between CHIP carrier status and mCA carrier status, age, ancestry and smoking. Ancestry groups with fewer than five cases were not included in this analysis. To determine whether any associations with myeloid or lymphoid malignancies were implicating mCAs as biomarkers for early subclinical disease, we re-ran these logistic regressions excluding individuals with cytopenias or cytoses. Cytopenia and cytosis cases were defined on the basis of the blood cell counts collected at any of three WHI visits, using the definitions in Supplementary Table 20.
Analysis with inflammatory and blood cell traits
A description of the measurement and quality control of the blood cell and inflammation traits can be found in Stilp et al.64. Each trait was defined as follows: hematocrit is the percentage of volume of blood that is composed of red blood cells. Hemoglobin is the mass per volume (g dl−1) of hemoglobin in the blood. Mean corpuscular hemoglobin is the average mass in picograms (pg) of hemoglobin per red blood cell. Mean corpuscular hemoglobin concentration is the average mass concentration (g dl−1) of hemoglobin per red blood cell. Mean corpuscular volume is the average volume of red blood cells, measured in femtoliters (fl). Red blood cell count is the count of red blood cells in the blood, by number concentration in millions per microliter (µl). Red cell distribution width is the measurement of the ratio of variation in width to the mean width of the red blood cell volume distribution curve taken at ±1 coefficient of variation. Total white blood cell count (WBC), neutrophil (NEU), monocyte, lymphocyte (LYM), eosinophil, basophil (BASO) and platelet count are defined with respect to cell concentration in blood, measured in thousands per microliter (µl). Because of a typical large point mass at zero, we dichotomized the BASO phenotype at BASO >0. The proportion of neutrophils, monocytes, lymphocytes or eosinophils was calculated by dividing the respective WBC subtype count by the total measured WBC. Mean platelet volume was measured in femtoliters. CRP was measured in mg l−1. IL6 was measured in pg ml−1. BMI was calculated from standing height and weight and smoking was dichotomized as ever/never smoker.
We tested for the association between each of these traits and presence of autosomal mCAs, chrX mCAs and mCAs at either high (>3%) or low CF. We ran standard linear models treating each quantitative trait as the dependent variable and presence/absence of mCA as the main independent variable of interest, while adjusting for age, sex, ancestry group and TOPMed study phase. BMI, WBC, NEU, LYM, monocytes and eosinophils were log transformed. We ran logistic regressions to test the association with smoking and BASO, adjusting for the same set of covariates. R version 4.2.1 was used for all statistical analyses.
Statistics and reproducibility
No statistical methods were used to predetermine sample size. There were no interventions to which subjects were randomized. The mCA calls were filtered by excluding (1) those that span less than 2,000 informative markers, that is, heterozygous sites; (2) those with logarithm of the odds score less than 5; (3) those on chrX but with inferred sex ‘unknown’; (4) those with estimated relative coverage higher than 2.9; and (5) those with BAF deviation larger than 0.16 and relative coverage higher than 2.5. Steps 4 and 5 are used to exclude putative germline duplications. Step 3 was used to exclude potential sex mismatches and steps 1 and 2 were used to exclude low confidence mCAs. For the associations with hematologic malignancies, we excluded patients who were diagnosed as having any cancer before blood draw. To assess the possibility that the associations with hematologic malignancies were implicating mCAs as biomarkers for early subclinical disease, we re-ran the associations excluding individuals with cytopenias or cytoses as defined in Supplementary Table 20. Samples with uncertain identity or poor quality were excluded from germline association analyses. For chrX loss analyses we excluded samples with a chrX mCA that was a gain, CN-LOH or undetermined.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
[ad_2]
Source link