A call for caution: Uncovering biases in GWAS-by-proxy for Alzheimer’s disease research

The challenges of GWAS-by-proxy (GWAX)
Parental disease history is often used as a proxy phenotype in genetic association studies, an approach known as GWAS-by-proxy (GWAX), but is it reliable?
Nonrandom participation, together with survival and reporting biases, can skew genetic studies of complex diseases and produce misleading genetic correlations that affect risk prediction and cognitive research. Wu et al. (2024) and Goudey et al. (2024) discuss potential solutions, including machine learning (ML), for improving GWAX methodologies.
GWAX, a flawed shortcut
GWAX has gained popularity in complex genetic disease research, particularly for neurodegenerative diseases like Alzheimer’s disease (AD). Most of these association studies rely on samples from population biobanks, which provide extensive genotype and phenotype data on large cohorts. While this approach boosts sample sizes, new research reveals that GWAX results may be biased. The implications? Potentially misleading conclusions about genetic risk factors for Alzheimer’s and cognitive function.
The challenge with using biobank data is that most participants are middle-aged and haven’t yet developed late-onset diseases like Alzheimer’s. To get around this, Liu et al. (2017) proposed GWAX, which treats participants’ reports of their parents’ diagnoses, collected through a family health history survey, as a stand-in for parental disease status and pairs them with the participants’ own inherited genetics.
The problems with GWAX for Alzheimer’s disease
GWAX does not account for several key biases that can distort results:
- Survival bias: only individuals whose parents lived long enough to develop AD can report a parental diagnosis. This skews genetic associations, favoring genetic variants linked to longevity rather than AD itself.
- Reporting bias: participants may overreport or underreport their parents' AD cases, leading to inaccuracies in genetic associations.
- Nonrandom participation: socioeconomic factors, education levels, and family structures influence whether individuals participate in health surveys and how accurately they report family history.
The importance of study design for genetic research
Interpreting genetic risk isn’t just about finding associations; it’s about making sure those associations are real. For researchers and clinicians relying on genetic studies to guide Alzheimer’s diagnosis and treatment strategies, this serves as an important reminder: critical evaluation of study design is just as important as the genetic discoveries themselves.
The education-AD risk paradox
One of the most striking consequences of GWAX biases is the unexpected positive correlation between educational attainment and AD risk. Traditional AD GWAS shows that higher education is protective against AD. However, studies incorporating GWAX often report the opposite trend—that education increases AD risk. This discrepancy is a red flag, suggesting that the biases in proxy GWAS are artificially inflating associations.
Why does this happen? Education is strongly linked to longevity and health awareness. Parents who live longer are more likely to be diagnosed with AD and have children who are more engaged in reporting their health histories. This creates a spurious association between genetic markers for higher education and increased AD risk.
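The survival part of this mechanism can be illustrated with a toy simulation. The sketch below is not taken from Wu et al.; every number in it (mean ages, the effect of education on lifespan, the sample size) is an assumption chosen purely for illustration. AD onset age is drawn independently of the education score, so any association between education and an observed diagnosis arises only because longer-lived parents have more time to be diagnosed.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate_parents(n=20000, seed=0):
    """Simulate parents whose lifespan, but not AD onset, depends on education."""
    rng = random.Random(seed)
    education, diagnosed = [], []
    for _ in range(n):
        edu = rng.gauss(0, 1)                       # standardized education score (assumed)
        death_age = 78 + 4 * edu + rng.gauss(0, 8)  # assumption: education extends lifespan
        onset_age = 85 + rng.gauss(0, 8)            # independent of education by construction
        education.append(edu)
        # A diagnosis is only observable if onset happens before death.
        diagnosed.append(1.0 if onset_age < death_age else 0.0)
    return education, diagnosed


education, diagnosed = simulate_parents()
spurious_r = pearson(education, diagnosed)
print(f"education vs. observed AD diagnosis: r = {spurious_r:.3f}")
```

In this toy setup the correlation comes out positive even though education has no effect on AD onset, which is the pattern a survival-driven artifact would be expected to produce.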
Addressing the bias: approaches to improve GWAX
Exploring methods to mitigate GWAX biases in AD, Wu et al. identified promising strategies that include:
- Separating true genetic signals from bias-driven associations using GWAS-by-subtraction (GSUB), a statistical approach.
- Controlling for parental age and vital status to effectively reduce survival bias.
- Accounting for non-random survey participation using 14 variables to train a participation prediction model.
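One common way to use a participation prediction model is inverse probability weighting, where each participant is weighted by the inverse of their estimated probability of taking part. Whether Wu et al. applied their model in exactly this way is not stated here, so the sketch below should be read as a generic illustration. It uses simulated data, a single hypothetical covariate (an education score) instead of 14 variables, and the true participation probabilities in place of model estimates; all coefficients are assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


rng = random.Random(1)
population = []
for _ in range(50000):
    edu = rng.gauss(0, 1)  # hypothetical standardized education score
    # Assumption: education raises both the chance of reporting parental AD
    # and the chance of participating in the survey.
    reports_ad = 1 if rng.random() < sigmoid(-1.5 + 0.5 * edu) else 0
    p_participate = sigmoid(-0.5 + 1.0 * edu)
    participates = rng.random() < p_participate
    population.append((reports_ad, p_participate, participates))

# Target quantity: prevalence of reported parental AD in the full population.
true_prevalence = sum(y for y, _, _ in population) / len(population)

# Only participants are observed in practice.
sample = [(y, p) for y, p, took_part in population if took_part]

# Naive estimate ignores who chose to participate.
naive = sum(y for y, _ in sample) / len(sample)

# Weighted estimate: each participant counts 1 / P(participation).
weighted = sum(y / p for y, p in sample) / sum(1 / p for _, p in sample)

print(f"true={true_prevalence:.3f} naive={naive:.3f} weighted={weighted:.3f}")
```

In a real analysis the probabilities would come from a fitted model, so the correction can only be as good as that model and the variables it includes.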
Goudey et al. applied machine learning to generate proxy phenotypes for Alzheimer’s disease by treating the task as pseudo-labeling and incorporating features such as age, sex, education, and family history. Their model:
- Outperformed traditional family history-based approaches in predicting future AD conversion (concordance index: 0.7 vs. 0.5).
- Enabled GWAX to uncover a greater number of significant genetic associations in the UK Biobank.
- Generated polygenic risk scores (PRS) that more effectively distinguished true AD cases in independent validation datasets.
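For readers unfamiliar with the concordance index quoted above, the sketch below shows one standard way to compute it (Harrell's C) on made-up data. It is an illustration of the metric only, not the evaluation code used by Goudey et al., and the follow-up times, event flags, and risk scores are hypothetical. A value of 0.5 corresponds to a score that ranks people no better than chance, and 1.0 to perfect ranking.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk score.

    A pair (i, j) is comparable when person i converted to AD (event observed)
    strictly before person j's last follow-up time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier converter had the higher score
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # ties count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: years of follow-up, conversion flag, model risk score.
times = [2, 4, 6, 8]
events = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]

c_model = concordance_index(times, events, scores)               # 4 of 5 pairs -> 0.8
c_constant = concordance_index(times, events, [0.5] * 4)         # all ties -> 0.5
print(c_model, c_constant)
```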
Conclusion
While innovative approaches like GWAX hold great promise for advancing genetic discovery in complex neurodegenerative diseases, they introduce the risk of erroneous findings due to reliance on proxy-reported parental disease history and inferred phenotypes. ML-derived proxy phenotypes for AD and participation prediction modeling offer a valuable strategy to reduce these risks. As research methods continue to evolve, it will be essential to critically address underlying biases and validate findings to ensure accurate and reliable insights into the genetic influences on brain health.
References:
Wu, Y. et al. Pervasive biases in proxy genome-wide association studies based on parental history of Alzheimer's disease. Nat Genet. 56, 2696-2703 (2024).
Goudey, B. et al. Machine-learning derived proxy phenotypes for Alzheimer’s disease genome-wide association study by proxy. Alzheimer's Dement. 20, e093165 (2024).
Liu, J. Z. et al. Case-control association mapping by proxy using family history of disease. Nat Genet. 49, 325-331 (2017).
Takara Bio USA, Inc.
FOR RESEARCH USE ONLY. NOT FOR USE IN DIAGNOSTIC PROCEDURES. © 2025 Takara Bio Inc. All Rights Reserved. All trademarks are the property of Takara Bio Inc. or its affiliate(s) in the U.S. and/or other countries or their respective owners. Certain trademarks may not be registered in all jurisdictions. Additional product, intellectual property, and restricted use information is available at takarabio.com.