If I were even willing to accept this study on its face, this is my “shorter” interpretation of its conclusion:

“We did a mega-analysis of 17 genome scan studies of genetic loci for schizophrenia. We found 7 genetic loci correlating to schizophrenia. Of those, 5 were NOT FOUND IN ANY OF THE ORIGINAL 17 STUDIES, NOR IN ANY PREVIOUS STUDY. The remaining 2 were found in SOME of the studies. Thus, any correlations other than those two found in any of the studies have been effectively negated.”
Thus, even if you accept the study, its conclusion points to the fact that almost all of the findings in the studies it analyzes were, in fact, statistical artifacts. Interesting. More interesting still, the authors did not seem to be aware of this fact, or didn't think it was pertinent.
Moreover, like many meta-analyses (or mega-analyses, as the case may be), the study itself uses some rather complicated statistical analysis that I think is already questionable; I discuss that briefly below the fold. It will cost you $32 to check out the full study for yourself. I paid it so you don't have to...
We examined the role of common genetic variation in schizophrenia in a genome-wide association study of substantial size:

…

The combined stage 1 and 2 analysis yielded genome-wide significant associations with schizophrenia for seven loci, five of which are new
So we already might wonder why we have five associations that have never appeared before in three decades of performing such studies. Only two associations have ever appeared before in a study, and those two have never been consistently replicated. Thus, it seems quite possible from the first sentence of the abstract that we are just dealing with randomly generated data and the associations are just what you might expect by chance.
Now, let’s move on to the study itself. The first sentence of the study is a good tipoff that we are headed into dubious statistical machinations:

“In stage 1, we conducted a mega-analysis combining genome-wide association study (GWAS) data from 17 separate studies”
So we know we are not working completely blindly. The authors have chosen 17 separate studies for which they already effectively know the outcome. They do not say why they chose these 17 studies or whether there were other studies that they chose not to include in the “mega-analysis.” Thus there is immediate potential for stacking the mega-analysis with studies that are more conducive to the desired results. (For the record, I am not accusing the authors of doing this deliberately. Quite the contrary, I believe such decisions are usually related to the unconscious bias of those performing the study.) Of course, you want to show that the different studies are consistent for such a mega-analysis, and they try to do so here:
We tested for association using logistic regression of imputed dosages with sample identifiers and three principal components as covariates to minimize inflation in significance testing caused by population stratification. The quantile-quantile plot (Supplementary Fig. 1) deviated from the null distribution with a population stratification inflation factor of λ = 1.23. However, λ1000, a metric that standardizes the degree of inflation by sample size, was only 1.02, similar to that observed in other GWAS meta-analyses [2,3]. This deviation persisted despite comprehensive quality control and inclusion of up to 20 principal components (Supplementary Fig. 1). Thus, we interpret this deviation as indicative of a large number of weakly associated SNPs consistent with polygenic inheritance.
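For readers unfamiliar with these two metrics, here is a minimal sketch of how λ and λ1000 are typically computed. The λ1000 formula follows the standard sample-size rescaling convention; the case/control counts below are hypothetical round numbers for illustration, not figures taken from the study:

```python
import numpy as np
from scipy.stats import chi2

def genomic_inflation(chisq_stats):
    """Genomic inflation factor lambda: the median observed chi-square
    statistic divided by the median of the null chi-square(1)
    distribution (about 0.4549)."""
    return np.median(chisq_stats) / chi2.ppf(0.5, df=1)

def lambda_1000(lam, n_cases, n_controls):
    """Rescale lambda to an equivalent study of 1,000 cases and 1,000
    controls, so inflation can be compared across studies of
    different sizes."""
    return 1 + (lam - 1) * (1 / n_cases + 1 / n_controls) / (2 / 1000)

# Null simulation: with no inflation, lambda should be close to 1.
rng = np.random.default_rng(0)
null_stats = rng.chisquare(df=1, size=100_000)
lam_null = genomic_inflation(null_stats)

# With tens of thousands of subjects (hypothetical counts), even a
# sizeable lambda rescales to nearly 1:
lam_rescaled = lambda_1000(1.23, n_cases=9_000, n_controls=12_000)
```

Note that λ1000 mechanically shrinks toward 1 as sample size grows, which is exactly how switching metrics makes a λ of 1.23 look like an unremarkable 1.02.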
We need to get at what they are claiming here regarding a quantile-quantile plot. Here is what a quantile-quantile plot is designed to measure. The q-q plot is used to answer the following questions:

- Do two data sets come from populations with a common distribution?
- Do two data sets have common location and scale?
- Do two data sets have similar distributional shapes?
- Do two data sets have similar tail behavior?
So, it appears right off the bat that the quantile-quantile plot indicated a problem: it does not show that the populations of these studies come from a common distribution.
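As a concrete illustration (a minimal sketch of the general technique, not the authors' code), a GWAS-style q-q plot compares the observed -log10 p-values against the quantiles expected under the null; a systematic drift of the points above the diagonal is the deviation that an inflation factor like λ summarizes:

```python
import numpy as np

def qq_points(pvalues):
    """Observed vs. expected -log10(p) quantiles for a GWAS q-q plot.
    Under the null, points lie on the diagonal; systematic upward drift
    indicates inflation (stratification, artifacts, or true signal)."""
    p = np.sort(np.asarray(pvalues))                    # ascending p-values
    n = len(p)
    expected = -np.log10((np.arange(1, n + 1) - 0.5) / n)
    observed = -np.log10(p)                             # both descending, aligned
    return expected, observed

# Uniform p-values (a well-behaved null): observed tracks expected.
rng = np.random.default_rng(1)
exp_q, obs_q = qq_points(rng.uniform(size=50_000))
median_gap = float(np.median(obs_q - exp_q))            # near zero under the null
```

Plotting `obs_q` against `exp_q` gives the familiar figure; the study's Supplementary Figure 1 is a plot of this kind, and its reported deviation means the observed quantiles sit above that diagonal.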
How is this shored up? Well, they change the metric and assume, circularly, that the deviation was due to weakly associated polygenic inheritance. Polygenic inheritance, weak or not, has never actually been demonstrated to be true. It is an assumption of many scientists, driven by the inability to find specific 1:1 correlations between genes and mental disorders. In fact, the whole reason for doing a mega-analysis such as this is to find these elusive, weakly associated SNPs, since no individual studies to date have been able to do so. Thus, they are effectively saying that the mega-analysis, designed to find weakly associated SNPs for polygenic inheritance, deviates from the null because of weakly associated polygenic inheritance. If you doubt that such “weakly associated polygenic inheritance” is factual, you can already rule out the validity of the study. Of course, that is too easy, so I will continue for those who need more convincing…
What we can see already is that, when they run into an inconvenient statistical truth, they can find a way to “re-examine” the data so that this truth goes away. This happens again in the next paragraph:
We also examined 298 ancestry-informative markers (AIMs) that reflect European-ancestry population substructure [5]. Unadjusted analyses showed greater inflation in the test statistics than we saw for all markers (AIMs λ = 2.26 compared to all markers λ = 1.56). After inclusion of principal components, the distributions of the test statistics did not differ between AIMs (λ = 1.18) and all markers (λ = 1.23), a result inconsistent with population stratification explaining the residual deviation seen in Supplementary Figure 1.
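To make the adjustment they describe concrete, here is a hedged simulation sketch, entirely my own construction and not the authors' method: two subpopulations differ in both allele frequency and disease rate, which inflates null test statistics until an ancestry covariate (a known indicator here, standing in for an estimated principal component) is added to the logistic regression:

```python
import numpy as np

def logistic_fit(X, y, iters=25):
    """Plain logistic regression via Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        H = (X * W[:, None]).T @ X              # Hessian
        beta += np.linalg.solve(H, X.T @ (y - p))
    return beta

def wald_chisq(X, y, j=1):
    """Wald chi-square statistic for coefficient j (the genotype term)."""
    beta = logistic_fit(X, y)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = p * (1.0 - p)
    cov = np.linalg.inv((X * W[:, None]).T @ X)
    return beta[j] ** 2 / cov[j, j]

rng = np.random.default_rng(2)
n = 2000
pop = rng.integers(0, 2, n)                  # ancestry indicator (crude "PC")
freq = np.where(pop == 1, 0.4, 0.2)          # allele frequency differs by ancestry
risk = np.where(pop == 1, 0.6, 0.4)          # so does disease rate
ones = np.ones(n)

stats_raw, stats_adj = [], []
for _ in range(200):                         # 200 SNPs with NO true effect
    g = rng.binomial(2, freq).astype(float)  # genotype tied only to ancestry
    y = rng.binomial(1, risk).astype(float)  # phenotype tied only to ancestry
    stats_raw.append(wald_chisq(np.column_stack([ones, g]), y))
    stats_adj.append(wald_chisq(np.column_stack([ones, g, pop]), y))

MEDIAN_CHI2_1 = 0.4549                       # median of chi-square(1)
lam_raw = np.median(stats_raw) / MEDIAN_CHI2_1
lam_adj = np.median(stats_adj) / MEDIAN_CHI2_1
```

With no true genotype-phenotype effects at all, the unadjusted λ comes out well above 1 purely from stratification, and adding the ancestry covariate pulls it back toward 1. That is the behavior the authors rely on when they argue that a residual λ of 1.23, surviving up to 20 principal components, must be something other than stratification.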
It appears that they needed to double down on the problem already noted: the generally accepted statistical analysis was not giving them the distribution they wanted. This is the kind of after-the-fact statistical machination available when you have no real hypothesis stated for your study. It’s “Let’s do a mega-analysis, see what happens, then find ways to interpret the data to meet our bias.”

Okay, so you are bored and you are saying, “Steve, why beat a dead horse? They are just trying to show that real-world distributions are not going to match up to statistical analysis, and adjustments need to be made (even if after the fact).” I could go on, but since the study really doesn't show much of anything, what is the point?