Here is a very enlightening email post about working with large datasets, written by William Gould, the CEO of StataCorp, the maker of Stata.
Basically, working with datasets on the scale of 10 million observations may cause both numerical and substantive problems:
1) Precision problem: when the number of observations is large, even a basic calculation such as summation may be wrong. Each addition loses a tiny amount to the computer's finite floating-point precision, and over millions of additions these losses can accumulate into a large deviation in the final sum.
2) Matrix operations may therefore be inaccurate as well, since they are built on the same sums.
3) For a large sample, the standard error of an estimate may not have its usual meaning. The standard error (= standard deviation / sqrt(nobs)) is very small. On the other hand, the data may be the entire population you are working on, so the estimates are the population values. So where is the uncertainty for the hypothesis testing?
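To make the precision problem concrete, here is a minimal Python sketch (my own illustration, not taken from Bill's post). It adds 0.1 one million times with a plain running total, with Kahan's compensated summation, and with the standard library's `math.fsum`. The function names `naive_sum` and `kahan_sum` are hypothetical helpers, and the size is kept at one million so it runs quickly; raise it to 10 million to match the scale discussed above.

```python
import math

def naive_sum(values):
    """Plain running total: rounding error accumulates with every addition."""
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    """Kahan compensated summation: carries the lost low-order bits forward."""
    total = 0.0
    compensation = 0.0  # running estimate of the error lost so far
    for v in values:
        y = v - compensation
        t = total + y
        compensation = (t - total) - y  # what was rounded away in total + y
        total = t
    return total

values = [0.1] * 1_000_000
target = 100_000.0

print("naive error:", abs(naive_sum(values) - target))
print("kahan error:", abs(kahan_sum(values) - target))
print("fsum  error:", abs(math.fsum(values) - target))
```

The naive total drifts visibly away from 100,000, while the two compensated versions stay within rounding distance of it. The same drift feeds into any cross-product matrix built from such sums.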
Therefore, the solutions are:
1) Be cautious.
2) Use a manageable random subsample to test the model.
3) Make sure all the x and y variables have roughly equal means and standard deviations. (But what about discrete variables?)
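Solutions 2 and 3 can be sketched in a few lines of Python. This is only one way to do it: I am assuming that "roughly equal means and standard deviations" can be achieved by converting each continuous variable to z-scores, which is my reading and not necessarily what Bill had in mind. The helper names `random_subsample` and `standardize` are hypothetical, and the sketch does nothing for discrete variables.

```python
import random
import statistics

def random_subsample(rows, k, seed=0):
    """Draw k rows without replacement; a fixed seed keeps the draw reproducible."""
    rng = random.Random(seed)
    return rng.sample(rows, k)

def standardize(values):
    """Rescale a continuous variable to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mean) / sd for v in values]

# Toy data: a variable with a huge mean relative to its spread,
# the kind of scale that invites precision trouble in cross-products.
income = [1_000_000.0 + i for i in range(10_000)]

subsample = random_subsample(income, 500)
z = standardize(subsample)
print(len(subsample), round(statistics.fmean(z), 6), round(statistics.pstdev(z), 6))
```

Fitting the model on the subsample first gives a reference point: if the full-data estimates differ wildly from the subsample estimates, numerical trouble is one suspect.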
Anyway, I have been working on large datasets for a while. My data usually accumulate to millions of observations. I am fully aware of the precision problem, as I am no blind believer in computers. I also fully agree with Bill about the issue of population representativeness in large datasets.
However, even inference from a national dataset, which one could call the population for that year, still benefits from hypothesis testing. We usually use the sample years to extrapolate to, or predict, the future. The total population in this case is the 10 or even 20 years of national records. The prediction makes sense, and the confidence interval for the prediction is appropriate.
On the other hand, the small standard error creates a trap for exploratory analysis. With large datasets, almost everything will be statistically significant. A nonsignificant coefficient or comparison may be truly negligible, but a significant one is not necessarily important. Unfortunately, this has been intentionally or unintentionally ignored in many surprising findings in top-notch journals. Examples include numerous false reports about the effects of vitamins and nutrients on health.
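A small Python sketch shows the trap. It assumes a simple two-group comparison of means with equal group sizes and a known common standard deviation; the numbers are made up for illustration, and `two_sample_z` and `two_sided_p` are hypothetical helper names. The same trivial difference (1% of a standard deviation) is tested at two sample sizes.

```python
import math

def two_sample_z(diff, sd, n_per_group):
    """z statistic for a difference in means between two equal-sized groups."""
    se = sd * math.sqrt(2.0 / n_per_group)  # standard error shrinks with sqrt(n)
    return diff / se

def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

diff, sd = 0.01, 1.0  # illustrative values: a difference of 1% of one SD
for n in (100, 10_000_000):
    z = two_sample_z(diff, sd, n)
    p = two_sided_p(z)
    print(f"n per group = {n:>10,}  z = {z:6.2f}  p = {p:.3g}  effect size = {diff / sd}")
```

The effect size is identical in both runs; only the sample size changes the verdict. With 100 observations per group the difference is nowhere near significant, and with 10 million it is overwhelmingly so.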
Another thing that occurs in large datasets is that you always have anomalies. You can limit their influence by capping and flooring the variable values. However, there may be hidden patterns in these anomalies and missing values.
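Here is a minimal sketch of capping and flooring in Python that also keeps track of which observations were touched, so the anomalies and missing values can still be examined afterwards. The function name `cap_and_floor`, the bounds, and the use of `None` for missing values are my assumptions for illustration.

```python
def cap_and_floor(values, lower, upper):
    """Clip values into [lower, upper]; return clipped values and flagged indices.

    Missing values (None) are left as None but flagged, so they can be
    studied together with the out-of-range observations.
    """
    clipped = []
    flagged = []
    for i, v in enumerate(values):
        if v is None:
            clipped.append(None)
            flagged.append(i)
        elif v < lower:
            clipped.append(lower)
            flagged.append(i)
        elif v > upper:
            clipped.append(upper)
            flagged.append(i)
        else:
            clipped.append(v)
    return clipped, flagged

ages = [34, 51, None, 212, -3, 67]  # toy data with a missing and two impossible values
clipped, flagged = cap_and_floor(ages, lower=0, upper=110)
print(clipped)
print(flagged)
```

Tabulating the flagged rows against other variables is a cheap first check for whether the anomalies cluster by source, year, or subgroup.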
One nice feature of large datasets, of course, is that all the large-sample theory, the asymptotic stuff, is satisfied. There is not much difference whether you do a frequentist analysis or a Bayesian analysis.
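As a toy illustration of that last point, consider estimating a proportion. The sketch below compares the frequentist maximum likelihood estimate with the Bayesian posterior mean under a Beta prior. The Beta(5, 5) prior and the counts are assumptions I picked for the example, and the function names are hypothetical.

```python
def mle_proportion(successes, n):
    """Frequentist estimate: the sample proportion."""
    return successes / n

def posterior_mean(successes, n, prior_a=5.0, prior_b=5.0):
    """Posterior mean of a proportion under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + n)

for successes, n in ((3, 10), (3_000_000, 10_000_000)):
    mle = mle_proportion(successes, n)
    bayes = posterior_mean(successes, n)
    print(f"n = {n:>10,}  MLE = {mle:.7f}  posterior mean = {bayes:.7f}  gap = {abs(mle - bayes):.2e}")
```

With 10 observations the prior pulls the estimate noticeably toward 0.5; with 10 million the data swamp the prior and the two answers agree to about six decimal places.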