*In any study of finite sample size, deceptive and selective internal bias affects all covariate parameter estimates – and the assessment of their significance – to some degree. When “adjustment” is conducted on accessory covariates, bias is introduced, leading to warping of the data. If anyone ‘adjusts for’ an independent variable without studying its interaction term with the main effect, one could end up inappropriately removing variation associated with the independent variables, thus leaving random variation to be “explained” by the main effect. Phased prediction model development using machine-learning-based Intelligent Methods Optimization (IMO) is superior to least-squares curve fitting and significance testing.*

**LEONARDO DA VINCI** believed that one could come to understand the functioning of the human machine if one studied the human body from all perspectives. Tracing the connections of parts could reveal their functional interrelationships, and reveal details of the whole from the study of the relations of the parts. He understood that one could not, for example, understand the workings of the hand without also understanding the anchoring and positions of the tendons in the forearm.

**RETROSPECTIVE STUDIES** are ripe for over-analysis, and while there is certainly a diversity of opinion stated in many disciplines, from education to psychology to biomedicine, there are times when certain common practices of multivariate data analysis hit my brain like nails on a chalkboard.

**ASSOCIATION STUDIES** are commonly done when randomized prospective clinical studies cannot be done, either due to lack of funds (tractability) or ethics. One particularly egregious approach within multivariate data analysis, such as regression-based analysis or ANCOVA, is the exploratory study of the effects of covariates as “adjustments”. Now, random sampling theory should, in most studies, randomize most variation for all variables roughly evenly between case/control or between outcome groups. Some percentage of the time, however, even at large N, we can expect a significant difference between two clinical groups reflected in a covariate. Let’s call this incidental confounding – a certain variable, say, body weight, just happens to show a significant difference between two groups defined by, say, incidence of lupus, even with reasonable sample sizes.
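To put a rough number on “some percentage of the time”, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the function name, the simulated body weights (mean 70, SD 12), the group sizes, and the use of a large-sample z-test in place of whatever test a given study would apply. Both groups are drawn from the same distribution, so any “significant” difference is incidental.

```python
import random
from statistics import NormalDist, mean, stdev

def covariate_imbalance_rate(n_per_group=100, n_studies=2000, alpha=0.05, seed=1):
    """Fraction of simulated studies in which a covariate that is unrelated
    to group membership still differs 'significantly' between the groups."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    hits = 0
    for _ in range(n_studies):
        # Hypothetical body weights; cases and controls share one distribution.
        cases = [rng.gauss(70, 12) for _ in range(n_per_group)]
        controls = [rng.gauss(70, 12) for _ in range(n_per_group)]
        se = (stdev(cases) ** 2 / n_per_group
              + stdev(controls) ** 2 / n_per_group) ** 0.5
        z = (mean(cases) - mean(controls)) / se
        if abs(z) > z_crit:
            hits += 1
    return hits / n_studies

rate = covariate_imbalance_rate()
print(f"Studies with an incidental 'significant' covariate: {rate:.1%}")
```

Under these assumptions the rate should land near the chosen alpha (about one study in twenty), and it does not shrink as N grows, because the test becomes more sensitive at the same pace that the chance imbalance gets smaller.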

So, when we mathematically remove the variation in body weight, the thinking goes, we can then study the main effect (let’s say, parental exposure to a chemical). Only then can we expect that the variation left after adjusting for body weight may be properly attributed to “parental exposure to chemical”… and that adjustments provide the way to finding a more realistic picture of the degree of association. Right?

Well, not quite. Let’s look at the following (remember, the specific scenario is that Body Weight *looks like* (via a prior look) it should be adjusted, but the interaction term is not explored):

**Body Weight x Chemical Exposure Interaction (Truth)**

| **True Condition** | **Independent** | **Not Independent** |
|---|---|---|
| *Body Weight Truly a Factor* | Remove important variable | Remove important variable |
| *Body Weight Truly Not a Factor* | Incorrectly extract variance | Incorrectly extract variance |

The perspective of “adjusting for” any variable assumes that any difference observed is confounding and is irrelevant to understanding the dependent variable. There have been tomes written on whether to adjust for “baseline difference” in numerous domains. From this table, it can be seen that whether or not interactions exist, and whether or not the variable is truly a factor, the effect of ‘adjusting’ for the variable is *never helpful toward increasing understanding*.

Certainly the terminology “adjusting for” implies that something went wrong with random sampling at baseline, or that there are biological effects of the variable to be accommodated. However, sometimes the “confounded” independent variable is functionally related to the dependent variable, in which case a difference in the dependent variable between the groups implies that the covariate, too, would differ between the groups. Either way, the statistical significance paradigm only goes so far – it does not necessarily tell us about cause, or function, nor does it automatically tell us how to translate the significance into actionable information in terms of using the measurements of significant variables from patients in the clinical setting.

**Modeling out Variables while Ignoring Interactions Can Lead to “Explanation” of Nonsense**

If anyone ‘adjusts for’ an independent variable without studying its interaction term with the main effect, one could end up inappropriately removing variation associated with the independent variables, as described, thus leaving random variation to be “explained” by the main effect. When the main effect is significant in a first analysis, and follow-up model selection procedures then involve “correcting for” or “adjusting for” new, additional variables, which may functionally interact with the main effect, it is completely inappropriate to shop for variables to adjust the data until the originally observed association disappears. By definition, this will lead to **model overfit** – analyzing the data to the desired level of significance – and it will mislead any downstream public health policy formulated. Experts call the overall approach “cooking the data”.
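A toy simulation makes the point concrete. This is a sketch under loudly stated assumptions, not an analysis of any real study: the data are simulated, body weight is reduced to a hypothetical light/heavy split, and the assumed truth is a crossover interaction in which exposure lowers the outcome in light subjects and raises it in heavy ones. Because the design is balanced, the average of the two within-stratum differences coincides with the exposure coefficient from an additive “adjusted for body weight” model.

```python
import random
from statistics import mean

def simulate(n_per_cell=500, seed=7):
    """Simulated (heavy, exposed, outcome) rows for a balanced 2 x 2 design."""
    rng = random.Random(seed)
    rows = []
    for heavy in (0, 1):
        # Assumed truth: the exposure effect flips sign with body weight.
        effect = 1.5 if heavy else -1.5
        for exposed in (0, 1):
            for _ in range(n_per_cell):
                y = 10 + 1.0 * heavy + effect * exposed + rng.gauss(0, 1)
                rows.append((heavy, exposed, y))
    return rows

def cell_mean(rows, heavy, exposed):
    return mean(y for h, e, y in rows if h == heavy and e == exposed)

def exposure_effects(rows):
    """Exposure effect within each weight stratum, the weight-adjusted
    average (additive model, balanced design), and the interaction contrast."""
    light = cell_mean(rows, 0, 1) - cell_mean(rows, 0, 0)
    heavy = cell_mean(rows, 1, 1) - cell_mean(rows, 1, 0)
    return {
        "light": light,
        "heavy": heavy,
        "adjusted": (light + heavy) / 2,
        "interaction": heavy - light,
    }

for name, value in exposure_effects(simulate()).items():
    print(f"{name:>11}: {value:+.2f}")
```

In this setup the “adjusted” exposure effect sits near zero while the stratum-specific effects are large and opposite in sign. A report that stops at “no longer significant after adjustment” would miss the only interesting thing in the data.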

**Kitchen Sink Statistics and Cherry Picking Results**

Another way of stretching the truth in these types of analyses involves using multiple different types of methods, looking at the results from each, and picking (and publishing only) the one that shows the degree of significance expected, or worse, needed, to justify a particular policy direction.
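The arithmetic behind this is simple enough to sketch. The function name is hypothetical, and the calculation assumes the analyses are statistically independent, which is an idealization: different methods run on the same data are correlated, so the real inflation is typically smaller than this figure, though still present.

```python
def chance_of_spurious_hit(n_analyses, alpha=0.05):
    """Probability that at least one of n independent analyses of pure noise
    comes out 'significant' at level alpha (independence is assumed)."""
    return 1 - (1 - alpha) ** n_analyses

for k in (1, 5, 10, 20):
    print(f"{k:>2} analyses: {chance_of_spurious_hit(k):.0%} chance of a spurious hit")
```

With ten independent looks at data containing no real effect, the chance that at least one look clears the 0.05 bar is roughly 40%.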

**How to Stop “Curve-Fitting Science” Over and Over… or, Why Machine Learning Based Prediction Modeling with a Focus on Intelligent Methods Optimization (IMO) is Vastly Superior to Curve-Fitting Multiple Regression, ANCOVAs, Hazard and Log Odds Ratios.**

In biomedical research, the holy grail of knowledge is prediction power. All of least-squares regression modeling – finding significance in main effects, covariates, and (woefully, only sometimes) their interaction terms – amounts to describing the variation in the data at hand. A real risk in this setting is overfit: the establishment of parameter values for prediction models, associated with specific variables, that fit the data at hand but may or may not generalize to new settings. All too often, results from initial or prior studies that use this type of analysis fail to be validated in follow-up, sometimes larger, studies. In terms of translation to public health policy or biomedical practice, we are then left either scratching our heads, or shrugging our shoulders and saying “oh well, I guess that initial finding was not true”. In reality, the curve-fitting modeling itself is biased toward overfit, and is not often well conducted anyway, so the findings may not be expected to generalize to validation studies that attempt to replicate the results.

In all of these scenarios, there is an alternative to the “standard” data analysis approach. The alternative comes from machine learning, in which modeling is done in an exploratory manner – using training sets that are not expected to render the last word on the models used, or on the optimized parameter values within the models. The primary focus of machine learning prediction modeling is also not the statistical significance of variables in their difference between groups – although that can play a role in terms of “feature selection”.

In Intelligent Methods Optimization, many methods for identifying potentially informative (predictive) features (variables) may be explored. Both classically univariate and multivariate feature selection and dimension reduction techniques may be explored using an initial Training Set (typically cases and controls, or cohort classes). For parameter optimization, the training set may be split iteratively by resampling patients from both groups (without replacement) into multiple instances of Learning and Test Splits. Say a data set has 1,000 cases and 1,000 controls – a typical study of this type would involve randomly selecting and setting aside about 300 (actually 333 is known to be optimal) from each of the two clinical groups, creating a (blinded) Test Validation Set (TVS). The remaining cases and controls (667 of each, if 333 are set aside) are considered the training set. Then, within the training set, multiple (say 30-40) instances of randomly selected splits (again ideally at 66%:33%) are generated, and various types of prediction models – logistic regression, Random Forests, genetic algorithms, and other types of models – are optimized using various combinations of features (variables) across a range of performance measures. Importantly, the feature selection step (if one is used) must be wrapped within the 66:33 splits: use of 100% of the data to determine significant differences is known to cause some features to appear to be significant when in fact they are not (false positives).
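The warning about wrapping feature selection inside the splits can be sketched with pure noise. Everything here is an assumption made for illustration: the function names, the tiny sample (20 cases, 20 controls, 500 noise features), and the deliberately crude one-feature threshold classifier standing in for the real models. Since no feature carries signal, honest accuracy should hover around 50%.

```python
import random

def avg(values):
    return sum(values) / len(values)

def make_noise_data(n_per_group=20, n_features=500, seed=0):
    """Simulated cases (label 1) and controls (label 0); features are pure noise."""
    rng = random.Random(seed)
    labels = [1] * n_per_group + [0] * n_per_group
    data = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in labels]
    return data, labels

def group_means(data, labels, idx, feature):
    cases = [data[i][feature] for i in idx if labels[i] == 1]
    controls = [data[i][feature] for i in idx if labels[i] == 0]
    return avg(cases), avg(controls)

def select_feature(data, labels, idx):
    """Univariate selection: the feature with the largest case/control mean gap."""
    def gap(feature):
        m1, m0 = group_means(data, labels, idx, feature)
        return abs(m1 - m0)
    return max(range(len(data[0])), key=gap)

def split_accuracy(data, labels, learn, test, feature):
    """Fit a midpoint threshold on the learning split; score it on the test split."""
    m1, m0 = group_means(data, labels, learn, feature)
    cut = (m1 + m0) / 2
    cases_high = m1 > m0
    correct = 0
    for i in test:
        predicted = 1 if (data[i][feature] > cut) == cases_high else 0
        correct += predicted == labels[i]
    return correct / len(test)

def compare_selection(n_splits=30, seed=0):
    """Mean test-split accuracy with leaky vs. wrapped feature selection."""
    data, labels = make_noise_data(seed=seed)
    rng = random.Random(seed + 1000)
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    leaky_feature = select_feature(data, labels, cases + controls)  # sees 100% of the data
    leaky, wrapped = [], []
    for _ in range(n_splits):
        rng.shuffle(cases)
        rng.shuffle(controls)
        k = round(len(cases) * 2 / 3)  # roughly a 66:33 split within each group
        learn = cases[:k] + controls[:k]
        test = cases[k:] + controls[k:]
        leaky.append(split_accuracy(data, labels, learn, test, leaky_feature))
        wrapped_feature = select_feature(data, labels, learn)  # learning split only
        wrapped.append(split_accuracy(data, labels, learn, test, wrapped_feature))
    return avg(leaky), avg(wrapped)

leaky_acc, wrapped_acc = compare_selection()
print(f"selection on all data: {leaky_acc:.2f}   selection inside splits: {wrapped_acc:.2f}")
```

The leaky version tends to report accuracy well above chance on data that contain nothing, because the test patients already helped choose the feature. The wrapped version stays close to a coin flip, which is the honest answer here.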

The result is usually something like a set of Receiver Operating Characteristic (ROC) curves, displaying sensitivity (SN) and specificity (SP) pairs at particular combinations of feature selection effort + model parameter optimization for a variety of prediction model algorithms. At this stage, the data are usually from a single study site, and may or may not generalize to new, unseen data of the same type. These results, therefore, are referred to as “Training Set results”. To estimate the generalizability of the performance measures of the best models/model parameter optimization/feature selection, the Test Validation Set may be unblinded at this point, and a second set of performance evaluation measures determined for the Test Set: these results are called the “Test Validation Set results”. Differences between the Training and Test set ROC curves may be informative: if the Training Set curves look good, but the TVS results appear weakly predictive, overtraining of the models on the training set may be suspected, and the results will not likely validate on new data, especially across sites.
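As a small worked example of the comparison just described, the sketch below computes SN and SP for one operating point and the drop from Training Set to TVS. The function names, the toy labels and calls, and the choice of balanced accuracy, (SN + SP) / 2, as the single summary are all assumptions for illustration; a real study would compare whole ROC curves.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (SN, SP) for binary labels where 1 = case and 0 = control."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def overtraining_gap(train_sn_sp, tvs_sn_sp):
    """Drop in balanced accuracy, (SN + SP) / 2, from Training Set to TVS."""
    return sum(train_sn_sp) / 2 - sum(tvs_sn_sp) / 2

# Toy data: the same label pattern is reused for brevity; in practice the
# Training Set and the TVS contain different patients.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
train_calls = [1, 1, 1, 1, 0, 0, 0, 1]
tvs_calls = [1, 0, 1, 0, 0, 1, 0, 1]

train = sensitivity_specificity(labels, train_calls)  # (1.0, 0.75)
tvs = sensitivity_specificity(labels, tvs_calls)      # (0.5, 0.5)
gap = overtraining_gap(train, tvs)                    # 0.375
print(f"Training SN/SP: {train}, TVS SN/SP: {tvs}, gap: {gap}")
```

A gap this large, with the TVS sitting at SN = SP = 50%, is the pattern that should raise suspicion of overtraining.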

This procedure can usefully and robustly inform when there is no prediction capacity in the data (TVS SN~SP~50%).

The superiority of this approach to the outmoded curve-fitting is clear – the next validation study will provide new test validation sets, not re-learning the univariate significance (or lack thereof) of individual variables. And the more sophisticated modeling methods can point to potential causal and functional relationships among variables – especially notable if there are strong interaction terms.

If Disease Epidemiologists and Health Policy Makers would put the emphasis on predictive utility and, specifically, on the generalizability of the performance evaluation measures of the models, we could have a much more reliable path to progress in understanding disease and establishing useful tools, such as decision rules, nomograms, and algorithms to inform on disease diagnosis, outcomes prediction, risk prediction, and accuracy of treatment options.

So, the next time you think about adjusting for a variable, think about how well da Vinci might have understood the mechanisms in the hand after blithely removing the forearm.

*Addendum:*

In terms of stating a model, understanding comes from interpreting the outcome as the sum of the two variables’ main effects plus their interaction term:

y = B1+B2+B1 x B2

Some studies tend to state in the text of the study that

y-B1 = B1+B2+e-B1

and never explore B1 x B2.

This is a clear example of “shamwizardry”, a term I coin in my book “Cures vs. Profit”.

An example would be

“Lupus is a function of body weight and chemical exposure, and body weight and exposure compounded the effect as the interaction term was significant”

vs

“After body weight is adjusted for, chemical exposure was no longer significant”.

(no mention of whether the interaction of body weight and chemical exposure was explored).
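For readers who prefer explicit coefficients, the contrast can be written out as follows. The symbols are assumptions of this sketch rather than the notation used above: $W$ is body weight, $E$ is chemical exposure, and the $\beta$ terms are regression coefficients.

```latex
% Full model: main effects plus the interaction term
y = \beta_0 + \beta_1 W + \beta_2 E + \beta_3 (W \times E) + \varepsilon

% Under the full model, the effect of exposure depends on body weight
\frac{\partial y}{\partial E} = \beta_2 + \beta_3 W

% "Adjusted" model with the interaction never explored (\beta_3 forced to 0)
y = \beta_0 + \beta_1 W + \beta_2 E + \varepsilon
```

The adjusted model forces the exposure effect to be the single constant $\beta_2$ at every body weight. If $\beta_3$ is truly nonzero, that constant averages over effects that differ across body weights, and it can sit near zero even when exposure matters a great deal in some patients.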

—

Books and Publications by James Lyons-Weiler
