June 21, 2005

The irrelevance of Occam’s razor in statistical modeling

Filed under: Causal inference and statistics, Uncategorized — @ 3:36 pm

“Pluralitas non est ponenda sine necessitate” (“plurality should not be posited without necessity”).

When the medieval philosopher William of Ockham first stated his principle of minimalism, he probably never imagined that it would be applied everywhere and last this long. In causal inference, it is one of the most often cited tenets for justifying parsimonious explanations. The argument is that unnecessary factors not only complicate matters but may also hide the truth. Parsimony is likewise a hallmark of statistical modeling, where simpler models are routinely favored over complicated ones.

The real world is complicated and interconnected (Karl Marx). In social science, the data at hand are often massive in both the number of observations and the number of characteristics recorded for each observation. One wishes to reduce this complexity to a few parameters with which one can predict reality. Statistical modeling is essentially a technique of dimension reduction. Through modeling, one can answer questions such as “can condom usage reduce the risk of STD and/or HIV infection?” or “does it increase the risk of excessive sex instead?” These are important questions, as 70% of high school girls are sexually active.

Unfortunately, statistical models are built for prediction, not for inference. This becomes obvious when one recalls that all goodness-of-fit checks are derived from the goal of prediction. With a large enough number of variables in the model, one can fit the data as well as a model that treats each observation as its own covariate. However, a model with hundreds of covariates only shows that the analyst is neither sophisticated nor intelligent.
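The point about fitting data with enough covariates can be illustrated with a small simulation. This is only a sketch under assumed settings (8 observations, 7 pure-noise covariates plus an intercept, a response unrelated to any of them); the helper `solve` and all variable names are made up for the illustration.

```python
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def draw(rng, n):
    """Assumed null process: intercept plus n - 1 noise covariates, unrelated response."""
    X = [[1.0] + [rng.gauss(0, 1) for _ in range(n - 1)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    return X, y

def sse(X, y, beta):
    """Sum of squared errors of the linear predictor X beta against y."""
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return sum((f - v) ** 2 for f, v in zip(fitted, y))

rng = random.Random(0)
n = 8
X, y = draw(rng, n)
beta = solve(X, y)                 # as many parameters as observations

in_sample_sse = sse(X, y, beta)    # essentially zero: a "perfect" fit to noise
X_new, y_new = draw(rng, n)        # fresh data from the same process
out_sample_sse = sse(X_new, y_new, beta)

print("in-sample SSE:    ", in_sample_sse)
print("out-of-sample SSE:", out_sample_sse)
```

The in-sample error vanishes even though the response has nothing to do with the covariates, while the same coefficients do badly on new data. A perfect fit, by itself, says nothing about whether the model reflects reality.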

Therefore, analysts select some models, exclude some covariates, and check and recheck the model fit until it looks good enough. The “principle of parsimony” derived from Occam’s razor is applied when assessing models: for the same quality of fit, the fewer the covariates, the better the model.

However, this practice is wrong. Parsimony is not an appropriate goal for statistical modeling. Instead, revealing reality is the ultimate and only goal of analysis. If the reality is complex, only a complex model can reflect it.

Take linear regression as an example. Many people have agonized over whether an insignificant covariate should be included in the model. Often the insignificant covariates are eventually excluded in deference to Occam’s razor. However, although one covariate may be insignificant on its own (given all the others in the model), several such covariates can be jointly significant. An insignificant variable may also strongly influence the estimated effects of other covariates. Omitting a relevant covariate may bias the estimated effects (beta coefficients) of the other covariates when it is correlated with them, and a larger sample size will not resolve this. Furthermore, a covariate may appear insignificant because of an inappropriate functional form. Changing the model specification to incorporate nonlinearity may improve the model significantly.
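The omitted-covariate bias described above, and its persistence in large samples, can be sketched with a simulation. The data-generating process here is an assumption chosen for illustration: y = 1.0·x1 + 1.0·x2 + noise, with x2 = 0.8·x1 + noise. Under these assumed values, regressing y on x1 alone should give a slope near 1 + 0.8 = 1.8 rather than 1.0, whatever the sample size. All function names are hypothetical.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def slope_short(y, x1):
    """OLS slope of y on x1 alone (intercept included implicitly)."""
    return cov(x1, y) / cov(x1, x1)

def slopes_long(y, x1, x2):
    """OLS slopes of y on x1 and x2, from the 2x2 normal equations."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def simulate(n, seed=0):
    """Assumed process: x2 is correlated with x1, and both affect y."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [0.8 * a + rng.gauss(0, 1) for a in x1]
    y = [1.0 * a + 1.0 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    return y, x1, x2

for n in (200, 20000):
    y, x1, x2 = simulate(n)
    b_short = slope_short(y, x1)          # x2 omitted
    b1, b2 = slopes_long(y, x1, x2)       # x2 included
    print(f"n={n}: x1 slope without x2 = {b_short:.2f}, with x2 = {b1:.2f}")
```

More data only makes the biased estimate more precise; it does not move it toward the true value of 1.0.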

Here is another catch. Including an irrelevant covariate tends to increase the standard errors of the beta coefficients of the other covariates, particularly when it is correlated with them, although it may decrease the residual error of the model. The conundrum cannot be resolved without prior knowledge of the problem. Analyses can only be grounded in social or biological theory. One should always include all important and relevant covariates in the model, regardless of their statistical significance.
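The cost of an irrelevant covariate can be sketched the same way. In the assumed setup below, y depends only on x1 (true slope 2.0), while x2 has no effect on y but is strongly correlated with x1. Repeating the experiment many times shows how much the x1 estimate varies with and without x2 in the model. The numbers and names are illustrative assumptions, not taken from any real study.

```python
import random
import statistics

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def b1_short(y, x1):
    """OLS slope for x1 when x1 is the only covariate."""
    return cov(x1, y) / cov(x1, x1)

def b1_long(y, x1, x2):
    """OLS slope for x1 when the irrelevant x2 is also in the model."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    return (s22 * cov(x1, y) - s12 * cov(x2, y)) / (s11 * s22 - s12 ** 2)

rng = random.Random(1)
short, long_ = [], []
for _ in range(1000):                       # 1000 simulated studies of 50 observations
    x1 = [rng.gauss(0, 1) for _ in range(50)]
    x2 = [0.9 * a + 0.3 * rng.gauss(0, 1) for a in x1]   # irrelevant but collinear
    y = [2.0 * a + rng.gauss(0, 1) for a in x1]          # x2 plays no role in y
    short.append(b1_short(y, x1))
    long_.append(b1_long(y, x1, x2))

sd_short = statistics.stdev(short)
sd_long = statistics.stdev(long_)
print(f"mean estimate: without x2 = {statistics.mean(short):.2f}, "
      f"with x2 = {statistics.mean(long_):.2f}")
print(f"spread (SD):   without x2 = {sd_short:.3f}, with x2 = {sd_long:.3f}")
```

Both versions are centered on the true slope, but the estimate from the larger model is far noisier. Together with the omitted-covariate case, this is the conundrum: neither dropping nor adding covariates is safe without knowing which ones are relevant.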

However, causal inference is more than including or excluding some covariates. In a sense, statistical models in causal inference are for testing theories, not for fitting data. The truth is hidden in the data, but the truth is by no means simple, despite the existence of possible simple relationships.

Given the complexity of reality and the demands of causal inference, more complex models may be preferred over simple ones. If a simple model fits the data better than a complex model, one possibility is that the complex model is misspecified, and another complex model should be assessed to capture more of the information in the data.

Certainly, complex models are not restricted to linear regressions. Structural equation models can directly model the interrelationships between covariates and responses. Dynamic models and mixed models can capture changes in longitudinal data. Multivariate analysis should be used more often than it currently is. People should also be familiar with nonlinear models, as nonlinear relationships are far more common than linear ones. Bayesian methods may be appealing in some situations. Nonparametric and statistical learning methods are indispensable in statistical analysis.

“Modeling in science remains, partly at least, an art” (McCullagh & Nelder, 1989). If modeling is an art, we have to admit that imprecision is inevitable. On the other hand, it is the imprecision that makes modeling an art. Hence Box’s claim that “all models are wrong, but some are useful” (G. E. P. Box, 1980). We are seeking those useful ones, but we must acknowledge that “eternal truth is not within our grasp.”

It is not uncommon for people to take the liberty of interpreting their data as if they indeed held the truth. The practice in econometrics, unfortunately, suggests that overconfidence is pervasive, as is evident in Steven Levitt’s “Freakonomics.” Because that book targets a general audience, some of its strong conclusions may have unwanted effects on people’s beliefs.

More to come…

