October 24, 2005

This, I called simulation

Filed under: Causal inference and statistics, Uncategorized — xlsyu @ 3:35 pm

Last week I was troubled by a treacherous simulation problem. The problem was set up as a slow-convergence example and was meant to demonstrate some techniques for improving convergence. But the wicked part was that the instructor had overdone the setup so that it seemed never to converge, partly because of a mistake in his own code.

In fact, after I fixed the instructor’s bug, the revised program still never converged (at least not in a manageable time). The simulation kept running 24/7 on my two laptops. My heart sank with the constant fan noise. This was not good.

(more…)

October 21, 2005

Drug dispute and eight Sins of drug trials

Filed under: Causal inference and statistics, Health, Uncategorized — xlsyu @ 1:15 pm

In yesterday’s JAMA (and picked up by today’s news media), cardiologists from the Cleveland Clinic, Ohio, called for revoking the approval of a new diabetes drug, Pargluva® (muraglitazar), a dual alpha/gamma PPAR agonist.

PPAR, the abbreviation for peroxisome proliferator-activated receptor, has been extensively examined in biomedicine. The PPAR genes form an upstream gene family that includes alpha, gamma, and delta. Through various pathways, PPAR can activate fat storage in adipose tissue, stimulate fat oxidation (burning fat) in muscles, and drive other fat-related activities. In short, PPAR genes are thrifty genes that facilitate fat utilization and storage, thus reducing free fatty acids in the blood, increasing insulin sensitivity, and improving the lipid profile. They have been fine-tuned by millions of years of evolution.

(more…)

August 26, 2005

today’s reading notes in econ monitor

Filed under: Causal inference and statistics, Uncategorized, social study — xlsyu @ 1:02 pm

James Hamilton in his blog is talking about the possibility of recession in the next year, 2006-2007. Fairly technical.

The main point is that statistical models are pointless for predicting the future (even the next year). When it comes to reality, all economic models have failed us so far. All models are wrong but some are useful. Macroeconomic models are useless.

(more…)

August 22, 2005

Thinking in Data (3)—expand our class system

Filed under: Causal inference and statistics, Uncategorized — xlsyu @ 6:37 pm

Boy, we’ve come a long way. My data object system is getting more and more complicated. I have also (stealthily) corrected many errors in my previous posts. If you haven’t reread them, let me recapitulate the status quo of the data object system.

We have several atomic object types: number, character, logical value, missing value, and raw value. Vectors composed of these object types are atomic vectors. They are the building blocks of our system. The type of a vector is determined by its components. To avoid confusion, we now call the object type “mode” and reserve “class” for broader classification. Every object therefore has four default attributes: class, mode, length, and name. It also has at least one method: the indexing method.

A list class is a versatile tool that can combine all kinds of objects into one object. A data.frame class is a table-like object type developed from the list class. Thus, the mode of the data.frame class is “list”, which indicates that a data.frame is indeed a list.

(more…)

Thinking in Data (2): organizing data and more objects

Filed under: Causal inference and statistics, Uncategorized — xlsyu @ 1:43 am

OK, let’s continue our expedition into data. In the previous section, I introduced the list class (object type) out of the need to combine different types of vectors, such as numbers and strings, into one object. However, that introduction was somewhat unfair to the list class. In fact, a list can consist of any objects, including other list objects. That is, it can be recursive. This is true in R/S, and I believe there are comparable object types in C++ and Java. After all, you need something versatile enough to accommodate the whole universe.

Now we have the almighty list class, and we have also said that the vector is the basis of all objects in our data object system, which means a list also belongs to the vector class. This is kind of confusing. All objects inside a vector must have the same class, but a list can include any objects. How could a list also be a vector? Please read on.

(more…)

August 19, 2005

Thinking in Data (1): what are data?

Filed under: Causal inference and statistics, Uncategorized — xlsyu @ 6:39 pm

During the last couple of days, I have been programming in R intensively. To be honest, although I have been following R since v0.9, R has never been a major tool for my daily activities. Most of my routine analyses are better done in SAS or Stata than in R.

It has been said that R (or S) is currently the best statistical language, to which I have little objection. But the learning curve of R is definitely steeper than that of Stata and other software. Then again, one could say that nothing is easy to learn if you want to do any serious work.

No matter which tool people are using in their data analysis, it appears to me that few of them bother to think about a fundamental question: what are data?

(more…)

June 22, 2005

Freakonomics, is everything answered?

Filed under: Book review, Causal inference and statistics, Uncategorized, social study — Administrator @ 12:52 am

Economics can be fun, rewarding, and surprising. Steven Levitt has striven to tell us this in the book “Freakonomics,” coauthored with Stephen J. Dubner of The New York Times. The book is indeed fascinating, full of interesting anecdotes and detective stories. How can you catch cheating among teachers? Where have all the criminals gone? What makes a perfect parent? Levitt has answered all these questions in a vivid, scientific, and empirical way.

The structure of the book is unconventional. Topics and ideas jump all around. The stories are not connected to one another. Furthermore, some chapters (e.g., the last chapter on naming kids) are too loose and some tables are redundant. Consequently, the authors claim that the book has no theme. Actually, it does promote one central dogma throughout the text: theories and conventional wisdom should be subjected to empirical tests. Let the data speak for themselves. We social scientists all know it. Now you morons should know it too.

(more…)

June 21, 2005

The irrelevance of Occam’s razor in statistical modeling

Filed under: Causal inference and statistics, Uncategorized — @ 3:36 pm

“Pluralitas non est ponenda sine necessitate”: “plurality should not be posited without necessity.”

When the medieval philosopher William of Ockham first stated his minimalist principle, he probably never thought that it could be applied everywhere and last forever. In causal inference, it is one of the most often cited tenets used to justify parsimonious explanations. One argues that unnecessary factors may not only complicate matters but also hide the truth. In particular, parsimony is a hallmark of statistical modeling. Simpler models are always favored over complicated ones.

The real world is complicated and interconnected (Karl Marx). In social science, the data at hand are often massive in both the number of observations and the number of characteristics per observation. One wishes to reduce the complexity to a few parameters with which one can predict reality. Statistical modeling is essentially a technique of dimension reduction. Through modeling, one can answer questions such as “can condom usage reduce the risk of STD and/or HIV infection” or “does it increase the risk of excessive sex instead?” These are important questions, as 70% of high school girls are sexually active.

Unfortunately, statistical models are for prediction, not for inference. This becomes obvious when one recalls that all model-fitting checks are derived for the purpose of prediction. With a large number of variables in the model, one can fit the data as well as if each observation were treated as a covariate. However, a model with hundreds of covariates only proves that the analyst is not sophisticated and intelligent.

Therefore, all analysts select some models, exclude some covariates, and check and recheck the model fit until it looks good enough. The “principle of parsimony” derived from “Occam’s razor” is applied when assessing models. Given the same model fit, the fewer the covariates, the better the model.

However, the above process is a wrong practice. Parsimony is not an appropriate goal for statistical modeling. Instead, to reveal reality is the ultimate and only goal of analysis. If the reality is complex, only a complex model can reflect it.

Take linear regression analysis as an example. Many people have agonized over whether an insignificant covariate should be included in the model. Often the insignificant covariates are eventually excluded in deference to “Occam’s razor.” However, although one covariate may be insignificant in the model (given all the others in the model), a collection of several covariates together can be significant. The insignificant variable may also have a large influence on the effects of other covariates. Omitting a relevant covariate may bias the estimated effects of other covariates (beta coefficients). This won’t be resolved by a large sample size. Furthermore, a covariate may be insignificant because of an inappropriate functional form. Changing the model specification to incorporate nonlinearity may improve the model significantly.
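The claim that omitted-variable bias does not shrink with sample size can be checked with a small simulation. This is only a sketch under made-up assumptions: the coefficients (1, 1, and 0.8), the normal noise, and the function names are hypothetical choices for illustration, not taken from any real dataset.

```python
import random

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x (with intercept)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def short_regression_slope(n, seed=0):
    """Regress y on x1 alone when the true model also contains x2."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    # Assumption: x2 is correlated with x1 through a coefficient of 0.8.
    x2 = [0.8 * a + rng.gauss(0, 1) for a in x1]
    # Assumed true model: y = 1*x1 + 1*x2 + noise.
    y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    return ols_slope(x1, y)

# The true coefficient of x1 is 1, but with x2 omitted the slope
# settles near 1 + 0.8 = 1.8 however large n becomes.
for n in (200, 20000):
    print(n, round(short_regression_slope(n), 3))
```

A larger sample only makes the estimate more precisely wrong: it converges to the biased value rather than to the true coefficient.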

Here is another catch. Including an irrelevant covariate will increase the standard errors of the beta coefficients of other covariates, although it may decrease the model’s residual error. The conundrum cannot be resolved without a priori knowledge of the problem. The only way to perform any analysis is to base it on social or biological theories. One should always include all important and relevant covariates in the model, no matter what significance they have.
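The trade-off can be illustrated with a toy simulation. This is a sketch with assumed numbers: the irrelevant covariate z is constructed to be strongly correlated with x1 (the size of the inflation depends on that correlation), and every name and coefficient below is hypothetical.

```python
import math
import random

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_one(x, y):
    """OLS of y on x with intercept: returns (slope, se_of_slope, rss)."""
    xc, yc = center(x), center(y)
    sxx = dot(xc, xc)
    b = dot(xc, yc) / sxx
    rss = sum((yy - b * xx) ** 2 for xx, yy in zip(xc, yc))
    s2 = rss / (len(x) - 2)
    return b, math.sqrt(s2 / sxx), rss

def fit_two(x1, x2, y):
    """OLS of y on x1 and x2 with intercept: returns (b1, se_of_b1, rss)."""
    a, c, yc = center(x1), center(x2), center(y)
    s11, s22, s12 = dot(a, a), dot(c, c), dot(a, c)
    s1y, s2y = dot(a, yc), dot(c, yc)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    rss = sum((yy - b1 * u - b2 * v) ** 2 for u, v, yy in zip(a, c, yc))
    s2 = rss / (len(y) - 3)
    # Var(b1) is s2 times the (1,1) element of the inverse of X'X.
    return b1, math.sqrt(s2 * s22 / det), rss

def compare(n=500, seed=0):
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    # Assumption: z has no effect on y but is highly correlated with x1.
    z = [0.9 * a + 0.3 * rng.gauss(0, 1) for a in x1]
    y = [a + rng.gauss(0, 1) for a in x1]
    _, se_one, rss_one = fit_one(x1, y)
    _, se_two, rss_two = fit_two(x1, z, y)
    return se_one, se_two, rss_one, rss_two

se_one, se_two, rss_one, rss_two = compare()
print("SE of b1 without z:", round(se_one, 4), " with z:", round(se_two, 4))
print("RSS without z:", round(rss_one, 2), " with z:", round(rss_two, 2))
```

The residual sum of squares can only go down when z is added, while the standard error of the x1 coefficient goes up, which is exactly why significance alone cannot settle what belongs in the model.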

However, causal inference is more than including or excluding some covariates. In a sense, statistical models in causal inference are for testing theories, not for fitting data. The truth is hidden in the data, but the truth is by no means simple, despite the existence of possible simple relationships.

Facing complexity and the demands of causal inference, more complex models may be preferred over simple ones. If a simple model fits the data better than a complex model, one possibility is that the complex model is misspecified, and another complex model should be assessed to capture more information in the data.

Certainly, complex models are not restricted to simple linear regressions. Structural equation models can directly model the interrelationships between covariates and responses. Dynamic models and mixed models can capture changes in longitudinal data. Multivariate analysis should be used more often than it currently is. People should also be familiar with nonlinear models, as nonlinear relationships are far more common than linear ones. Bayesian theories may be appealing in some situations. Nonparametric and statistical learning methods are indispensable in statistical analysis.

“Modeling in science remains, partly at least, an art.” (McCullagh & Nelder, 1989) Because it is an art, we have to admit that imprecision is inevitable. On the other hand, it is the imprecision that makes modeling an art. Therefore, Box claimed “all models are wrong, but some are useful.” (G. E. P. Box, 1980) We are seeking those useful ones, but we must acknowledge that “eternal truth is not within our grasp.”

It is not uncommon for people to take the liberty of interpreting their data as if they indeed held the truth. The practice in econometrics, unfortunately, suggests that overconfidence is pervasive, as is evident in Steven Levitt’s “Freakonomics.” Because that book targets the general population, some of its strong conclusions may have unwanted effects on people’s beliefs.

More to come…

June 2, 2005

Propensity score method and causal inference

Two days ago, an Ohio high school graduate shot his family members and friends to death on his graduation day. Two months ago, a Minnesota high school student killed several of his classmates. Why did these tragedies happen? We all want to know.

The May 27 issue of Science published a report suggesting that earlier exposure to firearm violence could cause later serious violent behavior. This is not very surprising, as other studies have already reached similar conclusions. However, the strong word “cause” that the authors from U of Michigan used in the title unnerves many people (and that is probably why it was accepted by Science, which seldom publishes social science reports).

Given recent lessons from hormone replacement therapy, all researchers are skeptical of any strong conclusions from observational studies. Then how could the authors claim the “causal effect” using an observational study?

The magic, as advertised in the paper, was the propensity score method.

The propensity score method was proposed by D. B. Rubin and his colleagues 20 years ago. It was underused for a long time but is becoming popular these days. Here is an outline of the method in the context of the firearm exposure study.

Over five years, three assessments were conducted, two years apart, among adolescents from 78 Chicago neighborhoods.

At assessment 1, demographic, socioeconomic, behavioral, psychological, and health-related factors, together with neighborhood characteristics, were assessed. These covariates were used to develop the propensity score.

At assessment 2, firearm exposure status was obtained for these adolescents. Stepwise logistic regressions with the covariates from the first assessment were employed to predict the probability of exposure. The estimated probability is the propensity score. Thus, hundreds of covariates were reduced to one variable: the propensity score. Participants were then grouped into 12 strata based on the propensity score. (That is too many; usually one creates only five groups.)

At assessment 3, the outcome (perpetrators) was defined as those who experienced serious firearm violence during the last 12 months. A stratified analysis by propensity score was then conducted to assess the magnitude of the association (such as an odds ratio). One can also use regression to adjust for residual confounding, e.g., small imbalances of covariates within propensity score strata.
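The three steps above can be sketched on simulated data. To be clear, this is not the study’s analysis: I assume a single made-up confounder c, an exposure with no true effect on the outcome, five strata, and a risk difference instead of an odds ratio. All function names and numbers are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, t, iters=10):
    """Newton-Raphson fit of P(t=1 | x) = sigmoid(b0 + b1*x)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ti in zip(x, t):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += ti - p
            g1 += (ti - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def risk_difference(rows):
    """Outcome risk among exposed minus risk among unexposed; rows are (t, y)."""
    exposed = [y for t, y in rows if t == 1]
    unexposed = [y for t, y in rows if t == 0]
    if not exposed or not unexposed:
        return None  # no overlap in this group
    return sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)

def simulate(n=20000, seed=1):
    """Assumed setup: c drives both exposure and outcome; exposure has no effect."""
    rng = random.Random(seed)
    c = [rng.gauss(0, 1) for _ in range(n)]
    t = [1 if rng.random() < sigmoid(ci) else 0 for ci in c]
    y = [1 if rng.random() < sigmoid(-1.0 + ci) else 0 for ci in c]
    return c, t, y

def stratified_estimate(c, t, y, n_strata=5):
    """Stratify on the estimated propensity score and average the risk differences."""
    b0, b1 = fit_logistic(c, t)
    scored = sorted(zip((sigmoid(b0 + b1 * ci) for ci in c), t, y))
    n = len(scored)
    total, used = 0.0, 0
    for k in range(n_strata):
        chunk = scored[k * n // n_strata:(k + 1) * n // n_strata]
        rd = risk_difference([(ti, yi) for _, ti, yi in chunk])
        if rd is not None:  # skip strata lacking exposed or unexposed members
            total += rd * len(chunk)
            used += len(chunk)
    return total / used

c, t, y = simulate()
print("crude risk difference:     ", round(risk_difference(list(zip(t, y))), 3))
print("stratified risk difference:", round(stratified_estimate(c, t, y), 3))
```

In this toy setup the crude comparison shows a sizable spurious “effect”, and stratifying on the estimated propensity score removes most, though not all, of it. What remains is the within-stratum imbalance that the regression adjustment mentioned above is meant to mop up.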

The propensity score method is intuitively appealing and has several advantages over model-based analysis, such as adjusting for all covariates in one regression.

First, the propensity score summarizes many confounding factors (confounders) which are related to both exposure and outcomes. By explicitly exploring the relationship between exposure and confounders, one may discover imbalance among these variables and rectify it.

Second, by conducting stratified analysis for outcomes, one does not assume any association (e.g., linear) between outcome and confounders (in particular the joint distribution of confounders).

Third, one doesn’t have to worry too much about how to adjust for hundreds of covariates in the outcome analysis. In traditional regression analysis, too many covariates make the dataset too sparse and may require a large number of outcomes. Using the propensity score, one essentially reduces the number of covariates. Note that in developing the propensity score, we usually have enough exposed participants to serve as “outcomes.”

Fourth, and most importantly, post-stratifying data on the propensity score is analogous to constructing a randomized experimental design within an observational study. By balancing the propensity score between exposed and unexposed groups, one essentially creates comparable groups similar to those in randomized trials. As randomized trials are more valuable than observational studies for causal inference, this feature is certainly desirable.

Now back to the firearm study. Did this study provide enough evidence to suggest a cause-and-effect link between firearm exposure and subsequent serious violence? There are many standards for assessing causal inference, but let’s examine a couple of essential criteria relevant to observational studies and to this study.

First, the risk factor must be associated with the outcome. In this study, the statistical significance of the results seems to support this. (Warning: a lack of statistical significance doesn’t mean a lack of causation.)

Second, the factor must occur before the outcome. This seems obvious but has often been overlooked. In this study, the exposure did occur before the outcome assessment. However, the exposure is not static in nature. That is, those unexposed at the second assessment can be exposed to firearms during the following years, and vice versa. Nonetheless, because those with the outcome were by definition exposed to firearms, exposure switches that occurred in the unexposed group were more likely to attenuate the association than to strengthen it (if we believe there is a positive association). Therefore, there is no need to worry about this criterion either.

Third, is it consistent with other studies? Yes, the results from this study were consistent with conclusions from many previous studies.

Fourth, are there any experimental studies that can confirm the results? Well, there is no way to conduct experimental studies on this kind of problem. The authors resorted to indirect means, such as the propensity score method, to construct an “experimental” study. However, the propensity score only balances known confounders. Unobserved factors are not accounted for by the method. Furthermore, although the paper included more than one hundred candidate variables in the propensity score model, the authors used stepwise logistic regression to select only 48 covariates (including quadratic terms) for the final model, which is questionable. Unfortunately, they didn’t provide a goodness-of-fit statistic for the final logistic regression. We don’t know how well the estimated propensity score reflects the true probability of exposure.

In addition, the propensity score method is most useful in large datasets that can provide sufficient exposed and unexposed observations within each propensity stratum (i.e., overlapping observations within propensity score). However, because 20% of participants dropped out of the study after the second assessment, there were only 210 exposed participants in the third assessment. Moreover, because this study used too many propensity strata (12), the sample size in each stratum was too small and severely unbalanced on exposure status.

The outcome analysis in this report is also questionable. It is possible that some covariates are not significantly related to the exposure in the above stepwise regressions but are significantly related to the outcome by themselves. Theoretically, covariates unrelated to exposure (i.e., orthogonal to it) should not affect the relationship between exposure and outcome. However, reality is far more complicated than statistical theory. Different combinations of variable sets may yield different answers.

Fifth, are there any biological, psychological, or social theories for the association? Well, sort of. For example, social learning theory (learning by observing) is a good candidate. However, given the complexity of the social phenomenon, the propensity score adjustment seems too parsimonious. In fact, because so many socioeconomic and psychological factors are related to violence, and because these relationships are naturally dynamic, the regression methods used to form the propensity score and to assess the outcome are inevitably inadequate.

Overall, although asserting a “causal” relationship between violence exposure and subsequent serious violent behavior somewhat overstretched the truth, the report did advance our knowledge of this arresting social phenomenon. Its methodology was better than that of many previous reports. Nevertheless, for any causal factor whose effects are intertwined with a myriad of others, it is never easy to reach a definite conclusion.

May 29, 2005

Some notes on social science research

We are what we repeatedly do. Excellence, then, is not an act but a habit.

– Aristotle

I recently stumbled upon a blog maintained by Michael Nielsen, an Australian physicist. In his blog, he recounted his experience in physics research and outlined some essential skills and principles for research (for your convenience, I also posted Nielsen’s essay in my blog). I would like to recommend this fascinating piece to anybody who is seriously considering research as a career. In addition, to remind myself and to extend his thesis to the field of social science, I composed the following notes.

Any researcher should possess two sets of essential research skills. The first set is professional skills such as “public speaking, writing technical prose, and networking.” This is particularly pertinent to social science researchers, for the whole field is built entirely on these skills. Unfortunately, many foreign students find these skills insurmountably difficult.

The second set is technical skills such as “finding and solving good research problems, and determining what constitutes a research result.” For social scientists, these skills mean understanding statistics and interpreting results correctly, reading papers critically, and generating important questions from seemingly trivial topics. Most students are well equipped with methodological skills after several years of PhD study. However, finding relevant research problems (especially important ones) requires lifetime learning through active involvement in research projects. In fact, naïve junior researchers (including myself) often get stuck on irrelevant questions and thus fail to recognize the important ones. The main characteristic of all keen researchers is the ability to find worthy topics to work on. See the forest, not the trees.

Mastering these skills and achieving significant success require some principles. However, Nielsen’s principles seem too morally lofty. Hence, I have supplemented his principles with my own thoughts to make them more social science oriented.

The fundamental principle is to integrate research into the rest of your life. My suggestion is that you should communicate with your family and win their support for your work. Make sure they understand that scientific research is not easy and demands your full commitment. Share every bit of career success with your family and be open to sharing family responsibilities if necessary. Unfortunately, faculty members have a much higher divorce rate than Americans in general.

Another important principle is to build good personal characteristics including “proactivity, vision, and discipline.” These are not much different from other good habits of successful people (see The 7 Habits of Highly Effective People by Stephen Covey). However, personal behaviors are extremely difficult to change. I am not sure how one can become proactive or disciplined after many years of sloppy living. The witty Benjamin Franklin’s successful life suggests that vision and perseverance are more important. With a vision in mind and by sticking to it, you are likely to develop proactive and disciplined attitudes.

The next series of principles is more technical and thus intertwined with the development of essential research skills. However, because of the uniqueness of social science, most of the following discussion reflects my own beliefs rather than Nielsen’s tenets.

Self-development is critical to successful research. With a vision in mind, you have to acquire specific knowledge and skills throughout your research career. All successful researchers specialize in some areas. However, during the process of self-development, you should keep your eyes open all the time. In social science, there is no earthshaking research that requires years of thinking. If a topic is out of fashion, the topic is dead. There is no need to revive it. Therefore, specializing a little more broadly helps. In fact, most social scientists seem to know everything relevant to their own research. Broad knowledge together with a focused area is the best strategy in social science.

The next principle is about critical research. Creativity is desirable in any field. However, as I said before, social research is not a science full of ingenious discoveries. Numerous studies merely provide evidence to support or refute some seemingly self-evident things. Creativity in social science shows up as different ways of presenting results, thoughtful interpretations, and important questions. In particular, problem-creators, those who can generate interesting questions, are better suited to social science research than problem-solvers. I agree that ideas are cheap, but I think good ideas are priceless. Most social scientists are good at writing grants about questions that are important but easy (or not so easy) to solve, although the final results are sometimes trivial.

Furthermore, for most social scientists, there is only a remote possibility of winning any grand prize such as the Nobel Prize. In many areas, there is no prize at all. Therefore, to reward yourself and keep yourself motivated, you should write as many papers as possible. A chilling fact I hate to state is that most papers are junk papers, even those written by Nobel Prize winners. Because of the constant change of fashion in social science, even if you do write a good paper, if it is not about a hot topic it looks like junk and indeed is junk. So don’t worry too much about wasting papers. Doing more means doing better. Keep this in your heart.

Here are some further notes about how to stay ahead in scientific knowledge. I agree with Nielsen that one should skim all relevant research reports but study only a few. Information is exploding faster than ever. For example, Nature, Science, JAMA, and NEJM are published weekly. However, in social science, most reports are trivial and only important in certain ways. A glance over their abstracts is enough. Instead, one should know who the major players are and which studies produce reports worth reading. Carefully studying ten papers on any specific topic will get you ready to start research on that topic.

In terms of reading papers, some people tend to focus on the results rather than the methods and discussion. By contrast, I think the methods and discussion are more important than the results, as most reports are stealthy in hiding the truth. Furthermore, by following the same methods, you can write your own papers using your own datasets to argue the same question differently or to unearth the hidden truth. Research by copying is the secret of many successful researchers (I learned this only after I changed bosses). In fact, scientific advancement is built on small steps. Big jumps are rare, and most people may not live long enough to witness them.

There are many more principles and tips in Nielsen’s post, such as how to find and tackle important and difficult problems. However, in social science, nothing is important, or everything is important if it is a hot topic. You just need to do it. However, to do it, you should develop the ability to express your ideas clearly (e.g., in writing and presenting) and maintain a good network. That’s the only key to success.

Doing research is like having sex. At first, it seems fascinating. But later you discover that it is not that mysterious and inspiring. Hence, you have to make it a habit in order to maintain your interest in it.

You may not agree with me. But please feel free to leave your comments.
