April 4, 2006

The secrets of p values

Filed under: Causal inference and statistics, Uncategorized — xlsyu @ 7:36 pm

The p value is a fascinating thing. Everybody loves it, and everybody hates it. When the p value is less than 0.05 (probably by sheer luck), everybody gets excited; when it is not, everybody starts mumbling the F word. For many number crunchers, it’s a way of life.

Here is an interesting graph I made in Excel. The x-axis is the z value from a standard normal distribution, and the y-axis is the p value. Based on this figure, one can reach several conclusions:

(more…)
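For readers who want to redraw the curve described above without Excel, here is a minimal Python sketch of the z-to-p mapping. It assumes a two-sided test, which the post does not state, and the function name `two_sided_p` is made up for this illustration.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p value for a z statistic from a standard normal distribution."""
    # For Z ~ N(0, 1): P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Tabulate a few points of the curve (x-axis: z, y-axis: p).
for z in (0.0, 0.5, 1.0, 1.645, 1.96, 2.576, 3.0):
    print(f"z = {z:5.3f}   p = {two_sided_p(z):.4f}")
```

A one-sided p value would simply be half of this for a z in the predicted direction.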

error-bar and sample size

Filed under: Causal inference and statistics, Uncategorized — xlsyu @ 1:44 pm

Liangzi’s question about the error bars in Lao Wei’s article

=====

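In that post I first assumed that the error bar stands for measurement error: when you observe a mean volume of X, the true value is some point around X. The error bar then shows the plausible range of points around X, that is, the range of the true value. I read the error bar this way mainly because everyone has said that volume is hard to measure precisely. In this sense, the error bar of course cannot cover negative values.

I did not rule out in that post the possibility that the error bar represents dispersion. But in that case my question is no longer about negative volume.
The dispersion is itself an important result of the experiment. However, with only 10 animals that are being sacrificed one after another, this dispersion no longer has any reproducible scientific value.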

=====

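When figures and tables are used to describe the distribution of data and the results, the common choices are:
1) mean +/- standard deviation: MEAN+/-SD
2) mean +/- standard error: MEAN+/-SE (SE=SD/SQRT(N))
3) mean with 95% confidence interval: MEAN+/-1.96*SE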
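The three kinds of error bar listed above can be sketched in a few lines of Python. The volumes below are made-up numbers for 10 animals, not data from the article under discussion, and the function name `error_bars` is only for illustration.

```python
import math
import statistics


def error_bars(values):
    """Return the mean and the three common error-bar intervals."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)   # sample SD (n - 1 in the denominator)
    se = sd / math.sqrt(n)          # SE = SD / SQRT(N)
    half_ci = 1.96 * se             # normal-approximation 95% half-width
    return {
        "mean": mean,
        "sd": (mean - sd, mean + sd),
        "se": (mean - se, mean + se),
        "ci95": (mean - half_ci, mean + half_ci),
    }


# Hypothetical volumes, purely illustrative.
volumes = [0.2, 0.5, 0.9, 1.4, 2.0, 2.7, 3.5, 4.4, 5.6, 8.8]
for name, bar in error_bars(volumes).items():
    print(name, bar)
```

The SD bar describes the spread of the sample, while the SE and confidence-interval bars describe uncertainty about the mean, so they shrink as N grows. The 1.96 multiplier is the large-sample normal value; with only 10 observations a t multiplier would give a somewhat wider interval.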

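I don’t know which of these Lao Wei himself used in the article, but I don’t think it is option 1, as someone suggested earlier. If the bars were the mean plus or minus one standard deviation, then, since this type of figure is meant to show the distribution of the existing sample, that is, values that actually exist, there should be no negative values, and the bars in the figure would not drop below 0.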

(more…)

March 24, 2006

Absence of evidence is not evidence of absence

When R. A. Fisher put down the phrase “p=0.05, or 1 in 20” in his famous book, Statistical Methods for Research Workers, very likely with a cup of tea beside his elbow, he might not have realized that this p value would become so magical that it is now as influential as e in mathematics and engineering.

However, Fisher himself didn’t stick to that number. To him, the p value sort of measures the evidence against a hypothesis. For example, the usual p value, the one tied to the type 1 error, is used as evidence against the null hypothesis. He was very liberal in interpreting p values: he sometimes treated p=0.08 as significant and sometimes did not. On the other hand, Pearson and Neyman thought that we should use a fixed cutpoint for statistical tests, probably for the sake of simplicity.

(more…)

March 21, 2006

Alcohol intake and body weight paradox

Filed under: Causal inference and statistics, Health, Uncategorized — xlsyu @ 2:10 pm

Once upon a time, for a Chinese man, having a protruding belly indicated a prestigious status: rich, mature, and sometimes ignorant. People felt kind of proud to have a big stomach and lightly attributed it to the beer they were enjoying.

Is it true that drinking beer causes their large waists, or perhaps their ignorance does it?

Despite numerous anecdotes, unfortunately, the fact is that we don’t know! Even the correlation between alcohol intake and body weight remains unclear to this day, let alone causation.

One should be cautious when reading and interpreting the current health literature. It is very common to see statements like “drinking alcohol is not related to body weight”, or “three cups of milk a day reduces your weight.” Lay readers tend to ignore the fact that most studies employ multivariate models to obtain the claimed odds ratio or relative risk. How the models are adjusted is very important. In alcohol studies, alcohol intake is often adjusted for total energy intake. Thus, the correct interpretation of a “null alcohol effect” should be “given the same amount of energy intake, there is no additional alcohol effect on body weight.”

(more…)

January 19, 2006

the replication of observational studies

In reporting the scandal about the Norwegian cancer research, the NYT had the following unnerving comments:

A special feature of epidemiological studies like Dr. Sudbo’s is that they involve large numbers of patients and are unlikely to be repeated by other laboratories. Replication is considered the most reliable test of scientific quality.

The full text is here

It is a severe accusation against a scientific field (and in fact against all social science fields) in which observational studies are common.

It seems to me that the author, Nicholas Wade, doesn’t understand the meaning of “replication,” at least in social science. He narrowly defined “replication” as “repeated by other laboratories.” We certainly won’t be able to replicate the study per se, and we may not be able to get the same numbers (e.g., odds ratios) from other studies. However, what is important in social science is that we can replicate the study findings, and that the findings are consistent in different populations and in different types of studies.

(more…)

January 17, 2006

More fun pieces

Filed under: Causal inference and statistics, Uncategorized, social study — xlsyu @ 1:20 pm

Today Andrew Gelman complained in his blog that a Nature columnist had mocked a paper published in Science. I’ve read that paper too. It is about the life course of a social network in a big university. To me, the merits of the paper are the innovative data source the authors employed (email) and the quantification of the social network. These are very important because social network research had been neglected, if not abandoned, over the past few years because of its inherent uncertainty and complexity, its weak association with important outcomes such as health, and its lack of theoretical models. In a sense, this paper revives social network research. Besides, this paper is fun to read.

(more…)

January 3, 2006

What does the skin color gene SLC24A5 mean?

Filed under: Causal inference and statistics, Uncategorized, social study — xlsyu @ 4:38 pm

A fascinating study led by Keith Cheng from Penn State U reported that a polymorphism in the gene SLC24A5 may shed light on the origin of light skin in Europeans. Almost all Europeans carry one allele, and almost all non-Europeans carry another; the two differ by only one amino acid in a melanin-related protein. About 25% to 30% of the genetic variation in skin color between blacks and whites can be attributed to this alanine-to-threonine change.

Some neo-racists hailed this discovery, claiming that biological differences between blacks and whites do exist and that race has deep roots in biology. Most biologists, however, cautioned that skin color is not race, and that the fact that one small genetic change causes such a big difference in skin color between blacks and whites shows how few differences there are between them.

So what do the findings from this skin color gene mean?

(more…)

November 14, 2005

Racialized or personalized medicine?

Filed under: Causal inference and statistics, Health, Uncategorized — xlsyu @ 1:35 am

An Icelandic company recently announced a cardiovascular drug that targets a gene variant common in people of European descent but uncommon in African Americans. The gene variant, or polymorphism, has been associated with a higher risk of heart attack among African Americans but not among people of European descent.

This gene variant is very interesting in that it is associated with a more active inflammatory response. On the one hand, it is protective because it fights against inflammation. On the other hand, the inflammation hypothesis in the etiology of cardiovascular disease suggests that chronic inflammation may cause plaque build-up in the arterial walls. Furthermore, acute inflammation at the plaque site may trigger rupture of the plaque, leading to a heart attack. The active form of the gene variant may not be good for your heart.

(more…)

November 4, 2005

working on large datasets

Filed under: Causal inference and statistics, Uncategorized — xlsyu @ 11:57 am

Here is a very enlightening email post about working on large datasets, written by William Gould, the CEO of StataCorp, the maker of Stata.

Basically, working on a dataset on the scale of 10 million observations may cause numerical and substantive problems:

1) Precision problem: when the number of observations is large, even a basic calculation such as a summation may be wrong. Each addition loses a tiny amount to the limits of machine precision, and those losses can add up to a big deviation in the final sum.
2) Matrix operations may therefore be inaccurate as well, because they are built on such sums.
3) For a large sample, the standard error of an estimate may not have its usual meaning. The standard error (=standard deviation/sqrt(nobs)) is very small. On the other hand, the data you are working on may be the whole population rather than a sample. So where is the uncertainty that hypothesis testing is supposed to address?
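The precision problem in point 1 is easy to reproduce. The following is only an illustrative sketch in Python (double precision), not anything taken from Gould’s email: it adds 0.1 ten million times with a plain running total and compares the result with `math.fsum`, which tracks the rounding error that each addition would otherwise discard.

```python
import math
from itertools import repeat

N = 10_000_000   # on the scale discussed above
X = 0.1          # a value with no exact binary representation

# Plain running total: every addition rounds, and the errors accumulate.
naive = 0.0
for _ in range(N):
    naive += X

# math.fsum keeps track of the lost low-order bits and rounds only once.
exact = math.fsum(repeat(X, N))

print("naive sum:", repr(naive))
print("fsum     :", repr(exact))
print("naive error:", abs(naive - 1_000_000.0))
print("fsum error :", abs(exact - 1_000_000.0))
```

The drift is small in double precision, but it grows with the number of observations and is far larger when values are stored or accumulated in single precision.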

Therefore, the solutions are:
1) Be cautious.
2) Use a manageable random subsample to test the model
3) Make sure the means and standard deviations are roughly equal across all the x and y variables. (But how about discrete variables?)

Anyway, I have been working on large datasets for a while. My data usually run to millions of observations. I am fully aware of the precision problem, as I am no believer in computers. I also fully agree with Bill about the issue of population representativeness in large datasets.

However, even inference from a national dataset, which one can argue is the whole population for that year, still has merit for hypothesis testing. We usually use the sampled years to extrapolate to, or predict, the future. The total population in this case is 10 or even 20 years of national records. The prediction makes sense, and the confidence interval for the prediction is appropriate.

On the other hand, the small standard error creates a trap for exploratory analysis. With large datasets, almost everything will be significant. A nonsignificant coefficient or comparison may be truly nonessential, but a significant one is not necessarily important. Unfortunately, this has been ignored, intentionally or unintentionally, in many surprising findings in top-notch journals. Examples include numerous false reports about the effects of vitamins and nutrients on health.
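To see how this trap works, here is a small sketch with assumed numbers: a fixed, practically negligible mean difference of 0.01 standard deviations, tested against zero with a simple z test at different sample sizes. The effect size and the function name `z_test` are invented for this illustration.

```python
import math


def z_test(effect: float, sd: float, n: int):
    """z statistic and two-sided p value for a mean `effect` tested against 0."""
    se = sd / math.sqrt(n)                        # standard error shrinks with n
    z = effect / se
    p = math.erfc(abs(z) / math.sqrt(2.0))        # two-sided normal p value
    return z, p


EFFECT = 0.01   # assumed: a tiny difference, in the same units as SD
SD = 1.0

for n in (100, 10_000, 1_000_000, 10_000_000):
    z, p = z_test(EFFECT, SD, n)
    print(f"n = {n:>10,}   z = {z:7.2f}   p = {p:.3g}")
```

The effect is the same in every row; only the sample size changes. With millions of observations the p value becomes vanishingly small even though the difference is as unimportant as before.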

Another thing that may occur in a large dataset is that you always have anomalies. You can exclude them by capping and flooring the variable values. However, there may be hidden patterns in these anomalies and missing values.

One nice feature of a large dataset, of course, is that all the large-sample theory, the asymptotic stuff, is satisfied. There is not much difference whether you do a frequentist analysis or a Bayesian analysis.

October 31, 2005

be professional

Filed under: Causal inference and statistics, Uncategorized, social study — xlsyu @ 9:31 pm

The Oct 28, 2005, Science News Focus reported a glass ceiling phenomenon for Asian scientists. Kuan-Teh Jeang from the NIH and neuroscientist Yi Rao from Northwestern University presented solid evidence suggesting that Asian scientists are discriminated against at the leadership level in the biological science community. The community responded to their comments quickly and positively. However, there was also criticism of Jeang and Rao’s conclusion. Among the critics, the refutations from Yu Xie and Pike were typical examples of bullshitting: trying to confuse readers instead of enlightening them.

Discrimination has two meanings: perceived discrimination and real discrimination. What Xie said was that because he (and some others) didn’t feel discriminated against, the discrimination didn’t exist. When confronted with statistics, he dismissed them by saying that “statistics” was sort of “nothing,” and not enough evidence “…to reach a conclusion one way or the other.” This is a familiar refutation strategy: dismiss the whole thing as a mess. Instead of carefully examining the evidence, or providing other evidence, he waved his hands. How could he comment like that? He is a sociologist and a statistician. Didn’t others tell him that “statistics” means “nothing” when he did his research on female scientists? How did he respond to that kind of question? Didn’t he know that an individual’s own experience can’t represent the population’s experience? Xie’s unprofessional comments definitely disappointed me.

Pike and others pointed out that cultural and language barriers might hinder the advancement of Asian scientists, which might be true. However, these comments were trying to divert attention from the problem, which is another refutation strategy: offering alternative interpretations that are impossible to verify.

There are two possibilities here: are these barriers making it difficult for Asian scientists to be promoted, or are Asian people innately unwilling to be leaders? It is unlikely to be the latter. It is more likely that, because of the barriers, Asian people find it difficult to communicate with others and lose interest in moving forward. It is also likely that others are rejecting Asian people because of those barriers. This is discrimination.

The discrimination does exist, and Yi Rao’s statistics are evidence. If Yu Xie and others want to refute Rao’s conclusion, why don’t they provide their own statistics? Be professional.
