Over the last two years, genetics research has seen a surge of computational approaches that work directly on summary data from Genome-Wide Association Studies (GWAS). Working with summary data avoids the privacy concerns and logistics of sharing individual-level genotype data and copes with ever-increasing sample sizes. Recently, fine-mapping approaches for identifying causal variants have been extended to use GWAS summary data (CAVIAR, CAVIARBF, PAINTOR). These approaches share a reliance on computationally expensive exhaustive search, which restricts their use to only a few hundred variants. Moreover, although all of them require information about the Linkage Disequilibrium (LD) between variants, there has been no comprehensive evaluation of how LD estimated from reference genotype panels performs compared with LD computed from the original individual-level GWAS data.
We introduce FINEMAP, a software package that replaces the exhaustive search with an ultrafast stochastic search. We demonstrate that (1) FINEMAP opens up completely new opportunities by fine-mapping the HDL-C association of the LIPC locus with 20,000 variants in less than 90 seconds, whereas exhaustive search would require thousands of years, and that (2) by jointly modeling the whole locus, FINEMAP can identify more plausible variant combinations than standard conditional analysis. At the LIPC locus, we identify a 3-SNP configuration with 190-fold higher likelihood than the top configuration from conditional analysis. Our results suggest that a missense variant and a promoter polymorphism are likely to be causal, whereas the lead variant in single-SNP testing has less evidence than a regulatory variant correlated with it. With extensive simulations, we further show that (3) FINEMAP is as accurate as exhaustive search when the latter can be completed and (4) achieves even higher accuracy when the latter must be restricted for computational reasons.
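To give a feel for why exhaustive search does not scale to a locus of this size, the following minimal sketch counts the candidate causal configurations an exhaustive search would have to score. It assumes a cap on the number of causal variants per locus; the cap values and the helper name `count_configurations` are illustrative choices, not settings taken from FINEMAP or the other tools named above.

```python
from math import comb


def count_configurations(n_variants: int, max_causal: int) -> int:
    """Count non-empty variant subsets with at most `max_causal` members.

    Hypothetical helper for illustration: each subset is one candidate
    causal configuration that an exhaustive search would need to evaluate.
    """
    return sum(comb(n_variants, k) for k in range(1, max_causal + 1))


# Assumed caps on the number of causal variants (illustrative only).
for max_causal in (1, 2, 3, 5):
    n_configs = count_configurations(20_000, max_causal)
    print(f"up to {max_causal} causal variants: {n_configs:.3e} configurations")
```

With 20,000 variants and at most three causal variants, this already gives about 1.3 × 10^12 configurations, and the count grows by roughly four orders of magnitude for each additional causal variant allowed.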
We also report important practical results: a reference panel of 1,000 individuals from the target population is adequate for a GWAS cohort of up to 10,000 individuals, whereas smaller panels, such as those from the 1000 Genomes Project, should be avoided. We further show, both theoretically and empirically, that the size of the reference panel needs to scale with the GWAS sample size, which has important consequences for applying these methods in ongoing GWAS meta-analyses and large biobank studies. Our results are based on comprehensive simulations with UK Biobank data and Finnish cohorts over 100 GWAS regions from coronary artery disease, Crohn’s disease, lipids, schizophrenia and type 2 diabetes, and on Finnish data of the APOE locus, which we fine-map in detail, discovering a novel variant associated with LDL-C.
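As a rough illustration of why panel size matters, the sketch below treats LD as the pairwise correlation between variant genotypes and compares LD estimated from reference panels of different sizes with LD computed in a simulated cohort. Everything here is a toy assumption: the genotype simulator, the neighbor correlation of 0.8, the 20 variants and the panel sizes are hypothetical and are not the simulation design used in this work.

```python
import numpy as np


def simulate_genotypes(n_individuals, n_variants, rho, rng):
    """Toy 0/1/2 genotypes with correlation decaying between distant variants.

    Hypothetical simulator: two haplotypes are drawn by thresholding
    correlated Gaussians, so every allele has frequency about 0.5.
    """
    idx = np.arange(n_variants)
    cov = rho ** np.abs(np.subtract.outer(idx, idx))
    chol = np.linalg.cholesky(cov)
    hap1 = (rng.standard_normal((n_individuals, n_variants)) @ chol.T) > 0
    hap2 = (rng.standard_normal((n_individuals, n_variants)) @ chol.T) > 0
    return hap1.astype(float) + hap2.astype(float)


def ld_matrix(genotypes):
    """Pairwise Pearson correlation between variants (columns)."""
    return np.corrcoef(genotypes, rowvar=False)


rng = np.random.default_rng(0)

# LD in the simulated GWAS cohort itself (assumed size: 10,000 individuals).
ld_cohort = ld_matrix(simulate_genotypes(10_000, 20, 0.8, rng))

# LD estimated from independent reference panels of increasing size.
errors = {}
for panel_size in (100, 1_000, 5_000):
    panel = simulate_genotypes(panel_size, 20, 0.8, rng)
    errors[panel_size] = np.abs(ld_matrix(panel) - ld_cohort).max()
    print(f"panel of {panel_size}: max |LD difference| = {errors[panel_size]:.3f}")
```

In this toy setting, the largest discrepancy between panel LD and cohort LD shrinks as the panel grows, because the sampling error of a correlation estimate decreases roughly with the square root of the panel size.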
Christian Benner (joint work with Matti Pirinen)