Yet Another Highly Unethical and Socially Irresponsible “Genes-Only” Study Fails to Show that Autism is 80% “Genetic”


A WebMD article gleefully reports the results of a large genetic-factors-only study: the newest, highest-ever estimate of the percent liability of autism risk that can be attributed to “genetics” is 80%, leaving the remaining 20% to environmental factors.

The article also claims that the study authors report this new, highest estimate to be “…roughly in line with those from prior, smaller studies on the issue, further bolstering their validity”.

Consistent Results From an Invalid Methodology Are Not “Valid”.  They Are Merely “Consistent”.

The “roughly in line with” is an appeal to consistency.  But liability threshold models differ methodologically from other approaches. Previous studies, one of which was conducted by the same group of researchers, produced heritability estimates ranging from 0 to 99%.  The average, until this group started using liability threshold models, was around 40% attribution to genetics. Their studies increased the average, but it still hovered around 50% liability.  Only the liability threshold models, used by this group, show results around 80% liability.  So their method is consistent with itself.  No surprise there. But that’s nowhere near “roughly in line” with all prior studies.


One of those studies is discussed in the article “Non-genetic factors play surprisingly large role in determining autism, says study by group”.

Why Autism is Not “Genetic”

The article skips over the fact that the newest study, like the prior studies, fails to actually measure the contribution of a single environmental factor.  While the article rails against “anti-vaxxers”, the study ignores the vaccination status of those involved.  The mantra that so many studies have never shown an association has to be tempered with a mature, responsible and realistic interpretation of how those studies were conducted: they were restricted to one vaccine (MMR).


Assumptions Without Measurement Lead to Assumptions as Conclusions

Their entire methodology is based on familial correlations. In the current study under consideration, no pesticide exposure levels, no medical exposures in utero, no smoking history, nothing environmental was measured.  And yet somehow the study authors pretend they can estimate the % liability from environmental factors.  How do they pretend to achieve such a feat?

The first problem is that they have not measured any interaction between genetics and environmental factors.  There is, in fact, established knowledge of special autism risk that involves the combined effect of specific genes and specific environmental factors.  Check out, for example, Bowers and Erickson (2014).

Their Liability Threshold Model Approach is Both Under- and Mis-Specified

You really have to understand population genetics a bit to get this next part, so I apologize to the lay public, but please take what understanding you can from this:

Their model (generically represented) is

ASD risk  =  “Genetics” + e

where e = measurement error, leaving whatever variation appears unexplained to be attributed to Environment.  That’s unusual, because the usual interpretation of such unexplained variation is “Error” or “Unknown Variation”.  In technical terms, their model is underspecified.  Environmental variation is not “Error” in a genetic model; it’s “Environmental Variation”.

If they HAD measured environmental factors, say, vaccination exposure, their model form would be

ASD risk = “Genetics” + “Environment” + e

but this model would still be underspecified.

The more fully specified model would be

ASD risk = “Genetics” + “Environment” + “(Genetics x Environment)” + e

And if the interaction term “(Genetics x Environment)” is more highly significant than “Genetics” or “Environment”, a reasonable interpretation would be that we cannot interpret genetics in a vacuum: the significance of many ASD risk alleles must be modified by environmental factors.  If during model selection G or E is significant, and then in the full model G x E is also significant, we attribute liability to both G and E working together.
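To make the specification point concrete, here is a minimal simulation sketch in Python. The effect sizes, the binary exposure E, and the polygenic score G are all invented for illustration; this is not the study’s model or data, only a demonstration of what an underspecified regression does with environmental and interaction variance.

```python
# Sketch with invented numbers: if liability truly depends on G, E, and G x E,
# a genes-only regression quietly folds the E and G x E variance into its error term.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 10_000
G = rng.normal(size=n)                      # hypothetical polygenic score
E = rng.binomial(1, 0.3, size=n)            # hypothetical binary exposure
liability = 0.4 * G + 0.4 * E + 0.6 * G * E + rng.normal(size=n)
df = pd.DataFrame({"liability": liability, "G": G, "E": E})

genes_only = smf.ols("liability ~ G", data=df).fit()            # underspecified
full_model = smf.ols("liability ~ G + E + G:E", data=df).fit()  # fully specified

print("Genes-only R^2:", round(genes_only.rsquared, 3))
print("Full model R^2:", round(full_model.rsquared, 3))
# The gap between the two R^2 values is variance that the genes-only model
# silently relegates to "error" rather than to environment or interaction.
```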

Instead of this standard approach to studying genetic and environmental contribution to phenotypic variation (ASD phenotype), they do something very odd.

In the Supplementary Material, they report that they made assumptions about environmental factors.  Non-specified “Shared Environmental” effects are ASSUMED to be 1.0 for siblings and 0 for cousins.  Yet families quite often stop vaccinating after an older sibling experiences seizures, so siblings do not, in fact, share identical exposures.  The study authors also EQUATE “Non-Shared Environmental Factors” with “residual errors”, which is patently absurd.  That’s “e”, which is unspecified variation (error), not designated environmental factors.

If I had conducted an analysis of environmental factors and their contribution to ASD using their methodology, I would be able to attribute any unexplained variation to “Genetics” after allowing “Environmental Factors” to consume most of the variation.  I might arbitrarily add in some assumptions, such as assuming that the risk from dominant alleles is 1.0 (which it is not, if the impact of those alleles is modified by environmental factors) and that all recessive risk alleles contribute zero risk, which would be, as described, arbitrary.  Their conclusions draw directly from their assumptions.

Evidence? What Evidence?

The WebMD article quotes the research team, “a team led by Sven Sandin, an epidemiological researcher at the Karolinska Institute in Stockholm, Sweden”, as saying that “the current study results provide the strongest evidence to our knowledge to date that the majority of risk for autism spectrum disorders is from genetic factors.”

Evidence?  What evidence? If you assume no contribution of environment, measure no environment, and conclude no contribution, there is no evidence.

There are over 850 genes that have been determined to contribute to ASD risk – and not one of them explains >1% of ASD risk individually.  Most of these are Common Variants – meaning they are ancient – as in, they pre-date both the ASD epidemic (and yes, there is an epidemic) and vaccination.  Here’s a figure from my book, which reviews all of the genetic and environmental studies published to mid-2016:

 

[Figure from the book: genetic and environmental ASD studies published through mid-2016]

This explains why ASD pedigrees look like humanity dipping its toes into a toxic soup:

[Figure: ASD pedigree]

The study also does not explain why >20% of children with ASD have higher copy number variation (de novo genetic variation) compared to the rest of the population; nor why people with ASD, and their mothers, have anti-brain protein antibodies; nor why people with ASD have strange misfolded proteins and lifelong microglial activation; nor why studies that replace the microbiome show a 50% reduction in the severity of autism traits, quite a feat for a diagnosis that is allegedly 80% “genetic”; and so on, and so on.

Then There is Phenomimicry

The study ignores the fact that environmental factors can impact genes, proteins and biological pathways in a manner that is identical to the effects of genetic variation. This is called Phenomimicry – a term so cool I wish I had invented it.  Examples of Phenomimicry are known in science relevant to ASD.

“Guess What? Being Human is Heritable”

It’s worth pointing out that thousands of human “traits” are heritable, including traits that contribute to sociality, language ability, intellect, and perhaps even a tendency toward repetitive motion.  That means that genetic studies must subtract the heritability of these traits in the non-ASD population from the estimate of their contribution to ASD heritability.


The WebMD article, and the research report itself, laud the study for involving over 2 million people from five countries.  This is not impressive, because the study falls into the category of “Science-Like Activities”.

No More YAHUGS

It is highly unethical – and socially irresponsible – for “Genes-Only” studies to be conducted that claim to rule out environmental factors.  Every “Yet Another Highly Unethical Genes-Only Study” – YAHUGS – should be replaced with fully and correctly specified models. That means measuring and studying both vaccination patterns and genetics.

WebMD article on archive.is

James Lyons-Weiler

Allison Park, PA

Note: A layman’s example will help.  Let’s say you want to understand thumb injuries among carpenters, and you specify a model

Risk of Injury = Hammer Size

You SHOULD also include Length of Nail, i.e.,

Risk of Injury = Hammer Size + Length of Nail

but it is socially unacceptable to conduct science on the Length of Nail.  So you leave it out.  You then model

Risk of Injury = Hammer Size + e

and incorrectly attribute variation in “Length of Nail” to “e”.

You SHOULD specify

Risk of Injury = Hammer Size + Length of Nail + (Hammer Size x Length of Nail) + e

But that pesky social pressure to ignore Length of Nail goes a long way.

So you don’t know “(Hammer Size x Length of Nail)” because you do not know Length of Nail.

So you attribute everything to “Hammer Size”, totally ignorant of any direct or interactive effect of “Length of Nail” and “Hammer Size”.

So you conclude “Hammer Size explains more than Length of Nails” when you should publish

“We Do Not Know the Effect of Length of Nails in Isolation nor with Interaction with Hammer Size”.
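Here is the same point in miniature, with invented numbers: if Hammer Size and Length of Nail are correlated and only Hammer Size is modeled, ordinary least squares hands Hammer Size the credit for the nails (a textbook case of omitted-variable bias).

```python
# Sketch of the thumb-injury example with made-up numbers: carpenters who use
# bigger hammers also tend to drive longer nails, so leaving Length of Nail out
# of the model inflates the apparent effect of Hammer Size.
import numpy as np

rng = np.random.default_rng(42)
n = 5_000
hammer = rng.normal(size=n)
nail = 0.8 * hammer + rng.normal(scale=0.6, size=n)       # nail length tracks hammer size
injury = 0.2 * hammer + 0.7 * nail + rng.normal(size=n)   # nails drive most of the risk

# Least-squares fit with Hammer Size only vs. with both predictors.
X_small = np.column_stack([np.ones(n), hammer])
X_full = np.column_stack([np.ones(n), hammer, nail])
beta_small, *_ = np.linalg.lstsq(X_small, injury, rcond=None)
beta_full, *_ = np.linalg.lstsq(X_full, injury, rcond=None)

print("Hammer coefficient, nail ignored :", round(beta_small[1], 2))  # roughly 0.2 + 0.7*0.8
print("Hammer coefficient, nail included:", round(beta_full[1], 2))   # roughly 0.2
```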


On Simplicity


OCCAM’S RAZOR is a rule that says that the simplest explanation is usually the right explanation.  Interestingly, careful analysis tells us that explanations that are almost, but not quite, the simplest are more likely to be correct than explanations preferred merely because they are the simplest.

The quest for parsimonious explanations can lead to a preference for elegant, “beautiful” equations in physics.  But when applied to the natural world, simplicity is a factor only to the extent that the conservation of energy and mass tends to restrict complexity from arising.

The sociology of science is an amorphous beast – actually, it is more like a menagerie of beasts – because different disciplines and domains have different rules of engagement, different norms and acceptable behaviors.

Draconian Simpletons

Let’s start with a theoretical domain of inquiry that insists that Occam’s Razor is a rule – that the simplest explanation must ALWAYS be preferred.  Their expected probability distribution of accuracy of models would look something like this:

[Figure: the Draconian Simpletons’ expected distribution of model accuracy]

Given that most natural systems are the product of the interplay between that which persists, that which exists, and that which is emerging, the actual degree of complexity will likely involve, from the perspective of the Draconian Simpletons, at least one “extra” parameter.  And allowing for the resources needed to adequately detect the usefulness of that “extra” parameter, some proportion of those extra parameters will likely be detected.

Too Many Parameters

Of one thing we can be certain: the Draconian Simpletons’ expectation that the simplest explanation will always be correct is itself incorrect.  Given the subjectivity of the actors (scientists) in the process, there can never be a proof that the simplest explanation will always be correct.  This proves (to my satisfaction, anyway) that the set of correct solutions for all problems includes solutions that are not the simplest.

At the other end, models with many extra parameters are not likely to be correct either, and the ability of scientists in a domain to have sufficient acumen to detect the extraneous parameters will not be perfect either.  So there is noise even in our ability to assess the accuracy of models, and I suspect that if we mapped that noise (uncertainty) over time as a domain of science progresses, there would be a “buzz” of uncertainty before many major breakthroughs.  We all use the term “breakthrough” and we think we know what it means, but right now it seems to mean breaking through the staid muck and mire and the conservative nay-saying traditions and traditionalists within a domain.

The Goldilocks Zone

In practice, empirical research will sometimes overestimate, and sometimes underestimate, the number of parameters needed to explain a natural phenomenon (due to the limitations of Science).  Given that the simplest explanation will not always be correct, the expected (real, blue) and realized (estimated, orange) distributions might look something like this:

[Figure: expected (blue) and realized (orange) distributions of model complexity]

If the blue distribution is shifted to the left, the field of study might do well to apply Occam’s Razor more frequently.  If the positions are reversed – as is nearly always the case in the early ontological stages of any scientific discipline – not only is it true that “more data are needed” but also that “new questions must be asked”.

Now, depending on a discipline’s ability to test hypotheses of merit – which is limited by available background knowledge, by technology, and by intellectual capital, among other factors – the actual distribution may still differ from reality.  The “other factors” include whether a study is sufficiently powered to detect the significance of just one more parameter.  Small sample sizes are notoriously likely to lead to model overfit: an incorrect model earning a better score by fitting the data very well, but fitting reality not very well.  There are many objective criteria for choosing among models with multiple parameters (a process called ‘model selection’); these include Mallows’ Cp, and comparison of the model’s performance on the training set(s) to its performance on independent test sets (accuracy, sensitivity, specificity, Area Under the ROC Curve).  We should favor models that tend to generalize, for the same reason we tend to prefer results of null hypothesis significance testing that tend to be reproducible.
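As a small, synthetic-data illustration of the train-versus-test comparison described above (not tied to any particular study), fitting increasingly flexible models to a small sample and scoring them on held-out data shows where extra parameters stop paying for themselves.

```python
# Sketch: how train-vs-test performance exposes over-parameterized models.
# Synthetic data only; the sample size and polynomial degrees are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                       # small sample on purpose
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=200) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for degree in (1, 2, 5):                            # increasingly complex models
    model = make_pipeline(PolynomialFeatures(degree), LogisticRegression(max_iter=5000))
    model.fit(X_tr, y_tr)
    auc_tr = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    auc_te = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"degree {degree}: train AUC {auc_tr:.2f}, test AUC {auc_te:.2f}")
# A widening train/test gap at higher degrees is the overfit described above;
# the "Goldilocks" model is the one that generalizes, not the one with the best training score.
```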

Which is Worse: Too Few (Underspecification) or Too Many (Overspecification)?

The further the scale of the instruments of measurement is from the actual size – and duration – of the things and events being studied, the worse the model fit will be.  Avoiding sources of bias – including human biases such as favored hypotheses – is essential.  In quantum physics, the speed of particles and the sensitivity of instruments to detect variation accurately may be conflated with the duality of particles existing both as a wave and as a solid.  In chemistry, too many parameters can lead to incorrect formulations and contamination.  In biology, the noise of small samples and the sheer complexity of biological systems compete with, and confound, ease of understanding and hypothesis testing.  In medicine, the answer often appears to be known beforehand, and thus model specification tends to fit a pre-determined agenda.  In such science-like activities, objective considerations such as these need not apply.

But when the aim of Science is understanding, it is the impact of model over- and under-specification on our ability to predict the future that will ultimately determine the utility of the models.  In some settings, for some models, an infinite number of equally good models can be specified that trade off the costs of errors in prediction among the categories of entities or events being predicted.  And the purist, statistically minded type who insists that models must be demonstrated to be consistent – that is, that they converge to one true model in the limit (with an infinite amount of data), or that parameter coefficients converge to the same precise values when the available data are re-analyzed with the same model-fitting procedures – is a little ODD about the realities of finite data: (a) we are likely not getting the “true model” no matter how well we fit a curve to the data, and (b) we know that some machine-learning algorithms, such as neural networks and genetic algorithms, can evolve solutions that we cannot interpret, exclusively for the lack of our ability to interpret them, not due to a limitation of the learning algorithms.

I realize that sounds to some like a declaration of war between machine intellect and human intellect, but it is not.  It is a failing of our academic endeavors that we cannot apply vestigial exercises such as Cartesian dissection to some beautifully accurate solutions rendered by artificial intelligences in this world, and the Jeff Bezoses who use them to multiply profit do not care about understanding how they work; they care that they work.

I have just such an algorithm.  It dispenses with the need for parameter estimates per se and uses instead whether the model parameter values fall within a specified range often enough to matter.  Re-run the algorithm, and different inputs become important while others fall out as unimportant.  It works beautifully for heterogeneous situations such as cancer, but even though I invented the approach (GA-optimized k-of-m), I cannot look at the decision rules that are output and say that any one of them is comprehensible beyond the paradigm of “tell me which features, when used together this way, are most important” plus “the prediction algorithm works or it does not work”.  And that’s ok, because the accuracy – and, more importantly, the generalizability of that accuracy – of my framework is far more important than the intellectual satisfaction of having conversed with an algorithm.  NB: ML processes like bagging and boosting are conceptually understandable as algorithms, but combining them with decision trees to produce Random Forests led to a process that was similarly unintelligible to Leo Breiman, the co-inventor (with Adele Cutler). As a black-box solution, it’s beautiful, and it’s a mistake for people who want to escape their own sense of depressed ego to label such approaches as “not in touch” with science.
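For the curious, here is a minimal sketch of the k-of-m idea as described above, with made-up features, ranges, and k; the real version optimizes those choices with a genetic algorithm, which is not shown here.

```python
# A minimal sketch of a "k-of-m" decision rule: instead of weighting fixed parameter
# estimates, count how many of m selected features fall inside their designated ranges
# and call the sample positive when that count reaches k. Features, ranges, and k below
# are hypothetical; the GA-optimized version tunes them.
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class KofMRule:
    ranges: Dict[str, Tuple[float, float]]  # feature -> (low, high): the m "marks"
    k: int                                   # how many marks must be hit

    def predict(self, sample: Dict[str, float]) -> bool:
        hits = sum(
            low <= sample.get(name, float("nan")) <= high  # missing features count as misses
            for name, (low, high) in self.ranges.items()
        )
        return hits >= self.k

# Hypothetical usage: three marker ranges, any two of which suffice.
rule = KofMRule(
    ranges={"marker_a": (0.5, 2.0), "marker_b": (10.0, 40.0), "marker_c": (0.0, 1.0)},
    k=2,
)
print(rule.predict({"marker_a": 1.1, "marker_b": 25.0, "marker_c": 3.0}))  # True: 2 of 3 in range
```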

Not content to wait another 10,000,000 years for organic evolution to catch up, Breiman and other machine learnists have embraced the totality of Science as a Way of Knowing by embracing its limits, and transcending them.  In my Five Ontological Stages of Science, I conclude that the best and truest test of knowledge about nature, the world and the Universe around us is to demonstrate our ability to predict what will happen when we perturb it, and this  is necessarily a pragmatic position, because society, our species, we as individuals tend to value things we can use.  My k of m GA is not simple, as it employs adaptive Darwinian evolution working on genetics modeled after evolving chromosomes, with mutations and crossing-over to boot. But it is beautiful to me nevertheless, and far superior to fixed-parameter models. It is its transcendence of fixed parameter values that is so beautiful to me, perhaps vaingloriously so, but only to the extent that I think “not bad for a biologist”: the fixed parameter values are sources of noise in predictions, whereas counting marks beyond thresholds appears to clean up the noise.

I’ll end this with a funny and true story.  I was walking along a street in downtown Boulder, CO after an Evolution conference sometime in the 1990s when I happened upon a bar in which a troupe of maximum likelihoodists had aggregated.  Among them was Nick Goldman, who had scooped me out of a paper by publishing first [1] that the DNA Chaos Game fractal patterns were merely a numerically necessary result of unevenness in the use of nucleotide bases in the genomes being studied.  His solution was to show that nucleotide, dinucleotide and trinucleotide frequencies could explain the odd patterns, which were fractal.  Others had postulated layering, or nesting of information, or weird long-range correlations.  My paper, which I submitted for publication ignorant of Nick’s, showed that nucleotide and dinucleotide frequencies alone were sufficient.  Both Nick’s findings and mine show that mathematical frequencies alone were sufficient to explain the patterns, and that the CGR algorithm necessarily created recognizably fractal patterns.  When I saw his publication, I sent him my manuscript with a note of congratulations.
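For readers who have never seen one, here is a minimal sketch of the standard Chaos Game Representation construction (the generic algorithm, not Nick’s analysis or mine): each base is assigned a corner of the unit square, and the point moves halfway toward the corner of each successive base. Skewed base and dinucleotide frequencies alone are enough to make the resulting point cloud look fractal.

```python
# Minimal sketch of the standard DNA Chaos Game Representation (CGR).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(sequence: str):
    """Return the CGR point visited for each base in the sequence."""
    x, y = 0.5, 0.5                     # start at the center of the unit square
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue                    # skip ambiguous bases such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0   # move halfway toward that base's corner
        points.append((x, y))
    return points

print(cgr_points("GATTACA")[:3])
```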

Entering the bar, I found Nick finishing up a game of billiards.  “Hey James,” he said.  “You up for a round?”  I said “Sure.”  “You break,” he said. My break didn’t sink any balls.  Nick was up, and he tried, but failed.

I proceeded then to clean the table.  Finishing my beer, I thanked him for the round.  “You know what your problem was there Nick?” I asked.

“No, what?” he said.

“You used too many parameters” I joked, and off I went, headed to my hotel room.

The second best part of the story is that I had never won a round of billiards before in my life.

The very best part of the story is that Nick Goldman did not know that fact.

“The aim of science is to seek the simplest explanation of complex facts. We are apt to fall into the error of thinking that the facts are simple because simplicity is the goal of our quest. The guiding motto in the life of every natural philosopher should be ‘Seek simplicity and distrust it.'”– Alfred North Whitehead


Citation

[1] Nick Goldman, 1993. Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences. Nucleic Acids Research 21:2487–2491.