August 19, 2005

Thinking in Data (1): what are data?

Filed under: Causal inference and statistics, Uncategorized — xlsyu @ 6:39 pm

Over the last couple of days, I have been programming in R intensively. To be honest, although I have been following R since v0.9, R has never been a major tool for my daily work. Most of my routine analyses are better done in SAS or Stata than in R.

It has been said that R (or S) is currently the best statistical language, and I do not have much objection to that. But the learning curve of R is definitely steeper than that of Stata and other software. Then again, one could say that nothing is easy to learn if you want to do any serious work.

No matter which tool people are using in their data analysis, it appears to me that few of them bother to think about a fundamental question: what are data?

Most analysts have fumbled with flat text files in which numerous numbers and characters are lined up chaotically. Without appropriate guidelines, they are incomprehensible. Therefore, before I dig into the meaning of data, let me first divert the discussion to the representation of data.

The traditional representation of data, not surprisingly, is derived from database systems. In fact, when SAS was first developed on IBM machines in the 1960s, it used the IBM data system. The familiar terms in SAS such as “library”, “data file”, and “lib.data” are from the IBM MVS system. In SAS data, each observation is a record with a fixed or variable record length. Any data modification is done one record at a time, which is cumbersome but manageable. Records are the basic elements of the SAS data system. This is why SAS has a reputation for handling very large data sets: only one record needs to be in memory at a time, so even 999 variables in a record do not take much memory. Want to run a regression on 20 GB of data? No problem. Half an hour will be enough.

On the other hand, data can be laid out visually as a table: rows correspond to observations, sometimes identified by a variable called “ID”; columns correspond to variables, the questions or items you asked in your survey. This simple viewpoint is adopted by Stata. In Stata, data manipulation focuses on both rows (observations) and columns (variables). You delete an observation by deleting a row, and you drop a variable by dropping a column. Each data point is indexed by its row and column. To further improve performance, Stata loads all data points, the whole table, into the computer’s memory. Stata claims that it is the fastest statistical program available. Based on my experience, this may be true.
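The contrast between the two layouts can be sketched in a few lines of Python. This is not SAS or Stata code, and it does not describe how either package is implemented; the tiny data set and the function names are made up for illustration.

```python
import csv
import io

# Hypothetical survey extract; the IDs and values are invented for this sketch.
RAW = """id,age,income
1,34,52000
2,29,48000
3,41,61000
"""


def mean_income_record_at_a_time(text):
    """Stream the records one by one, keeping only a running total in memory."""
    total, count = 0.0, 0
    for record in csv.DictReader(io.StringIO(text)):
        total += float(record["income"])
        count += 1
    return total / count


def load_as_table(text):
    """Load everything at once into a table stored as a dict of columns."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {name: [row[name] for row in rows] for name in rows[0]}


mean_income = mean_income_record_at_a_time(RAW)

table = load_as_table(RAW)
del table["age"]                                          # drop a variable = drop a column
table = {name: col[1:] for name, col in table.items()}    # delete an observation = delete a row

print(mean_income)
print(table)
```

The first function never holds more than one record, which is the property that makes the record-based view scale to large files; the second holds the whole table, which is what makes row and column operations direct.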

Data representation in R (or S) is different, which brings us back to our discussion of the meaning of data.

Data can range from a single number, called a scalar, to complicated structures such as multidimensional dynamic graphs. Nevertheless, everything stored in a computer has to be either a number or a character string, and ultimately a sequence of 0s and 1s. Furthermore, graphs can also be represented by arrays of numbers such as xyz location, color, and intensity. To keep my discussion from rambling too far, I will examine in detail two common types of data: numbers and strings.

A number can be an integer, a real number (rational or irrational), or a complex number. In a computer, it is difficult to represent all these types finitely and precisely, as register size and memory size are finite. For example, the square root of three cannot be written down exactly with finitely many digits. Instead, a finite sequence of digits with a fixed or floating decimal point is good enough to approximate it. Therefore, in a computer a number can be represented either as an integer or as a sequence of integers. The precision can be single or double, according to convention.
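As a small illustration in Python (chosen only for the sketch; the same idea applies in R, SAS, or Stata), the square root of three can be stored only approximately. The round trip through `struct` is one way to see what a single precision value would hold, assuming the usual IEEE 754 formats.

```python
import math
import struct

root = math.sqrt(3)  # double precision (64-bit) approximation

# Pack into a 32-bit float and unpack again to get the single precision value.
single = struct.unpack("f", struct.pack("f", root))[0]

# The stored double is exactly a ratio of two integers, with a power of two
# as the denominator: a finite "sequence of integers" standing in for sqrt(3).
numerator, denominator = root.as_integer_ratio()

print(repr(root))                # the double precision approximation
print(repr(single))              # the coarser single precision approximation
print(numerator, denominator)
print(root * root == 3.0)        # squaring the approximation need not give exactly 3
```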

On the other hand, a string is much easier to deal with because it is finitely defined. That is, any string is a combination of elements from some character set such as ASCII. Using the character set, every component of a string (e.g., a letter of the alphabet) can be represented by an integer. Thus computation on strings is transformed into computation on numbers. The distinction between numbers and strings is mostly for practical reasons: we want to read a real phrase instead of a sequence of integers.
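A minimal sketch of this mapping in Python, using the built-in `ord` and `chr` functions; the example word is arbitrary.

```python
phrase = "data"

# Each character corresponds to an integer code (the ASCII code for these letters).
codes = [ord(ch) for ch in phrase]

# The mapping is reversible, so nothing is lost by working with the integers.
restored = "".join(chr(code) for code in codes)

# Computing on the integers is computing on the string: in ASCII, subtracting 32
# from a lowercase letter's code gives the matching uppercase letter.
shifted = "".join(chr(code - 32) for code in codes)

print(codes)
print(restored)
print(shifted)
```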

Anyway, thinking about each number individually leads us nowhere, because almost all data contain many numbers or strings. Grouping numbers based on their commonalities makes more sense.

A vector is a group of numbers that share the same characteristics, for example, measuring the same thing or different aspects of the same person. Numbers in a vector should have the same type: integer, fixed or floating point number, or string. Furthermore, since a vector has more than one number, we need to know how many of them are in the vector. This is the length of the vector. Therefore, any vector has at least two characteristics (or in computer terminology, attributes): type and length.

But wait, there is one more thing: how do we access each number inside a vector? We need an index, the subscript. Using a predefined vector, namely the sequence of natural numbers 1, 2, 3, …, we can easily access whichever element we want.

In the object oriented programming framework, we have defined several types of objects. A number is one type of object. It has one attribute—type. It is the basis for all other data objects. A vector is also a type of object which contains many basic objects—numbers or strings. It has two attributes: type and length.

Objects not only have attributes but also may have methods—the way to manipulate them. For a single number, there is no specific method associated with it. For a vector, we need at least one method—to index the vector through subscripts.

In R or S, the vector is the basis of all other data objects. This is for practical reasons. Data in reality are more like a group of vectors, either observations or variables. Individual numbers are too basic to be of any use. In fact, a number can be viewed as a vector of length 1.
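The vector described so far can be sketched as a toy Python class. This is an illustration of the idea only, not how R or S implements vectors; the class name `Vector` and the sample values are assumptions made for the sketch.

```python
class Vector:
    """A toy vector: same-typed elements plus 'type' and 'length' attributes."""

    def __init__(self, values):
        values = list(values)
        kinds = {type(v) for v in values}
        if len(kinds) > 1:
            raise TypeError("all elements of a vector must share one type")
        self.values = values
        self.type = kinds.pop() if kinds else None   # attribute 1: type
        self.length = len(values)                    # attribute 2: length

    def __getitem__(self, i):
        """The one required method: index with the subscripts 1, 2, 3, ..."""
        if not 1 <= i <= self.length:
            raise IndexError("subscript out of range")
        return self.values[i - 1]


ages = Vector([34, 29, 41])
scalar = Vector([3.14])   # a single number viewed as a vector of length 1

print(ages.type, ages.length, ages[1])
print(scalar.length, scalar[1])
```

Subscripts start at 1 here to follow the natural number sequence used in the text, which differs from Python's own 0-based convention.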

Individual vectors are not much help either. A typical survey data set will have thousands of observations and hundreds of variables (questionnaire answers). They are inherently associated with each other. The values of one answer for all participants can form a vector, since they measure the same thing and have the same type for all observations. Intuitively, combining all answer vectors forms a table. But the vectors in this table may not have the same type, since some answers are string values while others are numeric. A new type of object is needed.

The object type “list” does this job. It comprises vectors of different types. Superficially, it does not seem to be a direct extension of the vector object type (but it is, as I will explain later), in that the elements of a list are vectors of different types. Each vector can be viewed as an attribute of the list. Indexing is done by referring to these attributes. You can access individual numbers or strings through attributes and subscripts. It is a little cumbersome, but the advantage is that all related vectors are organized into one object.
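A rough Python analogue of such a list is a dictionary whose entries are vectors of different types. The variable names, values, and the helper `element` below are hypothetical, and the sketch is not meant to mirror R's own list implementation.

```python
# One object holding several related vectors, each with its own element type.
survey = {
    "id": [1, 2, 3],                         # integers
    "sex": ["F", "M", "F"],                  # strings
    "income": [52000.0, 48000.0, 61000.0],   # floating point numbers
}


def element(lst, name, i):
    """Access one value through an attribute name and a 1-based subscript."""
    return lst[name][i - 1]


print(element(survey, "sex", 2))
print(element(survey, "income", 3))
```

Access takes two steps, first the vector by name and then the element by subscript, which is the slightly cumbersome part mentioned above.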

The different types of objects such as number, character string, vector, and list are called “classes”, with the same meaning as in plant or animal classification. For example, the default class “vector” has the attributes type and length with empty values, and it has an indexing method. A realization of the vector class fills in the type and length attributes. We say the realization is an object belonging to the class “vector.” Also, to distinguish the realizations of any class from one another, we had better add an attribute called name to all realizations.
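The difference between a class and its realizations can be sketched in Python as follows. The class name `NamedVector`, the attribute name `elem_type` (standing in for “type”), and the sample data are assumptions made for this illustration.

```python
class NamedVector:
    """The class: a template whose attributes start out empty."""

    name = None
    elem_type = None
    length = 0

    def __init__(self, name, values):
        # A realization fills in the attributes that the class leaves empty.
        self.values = list(values)
        self.name = name
        self.elem_type = type(self.values[0]) if self.values else None
        self.length = len(self.values)

    def __getitem__(self, i):
        """Indexing method shared by every realization (1-based subscripts)."""
        if not 1 <= i <= self.length:
            raise IndexError("subscript out of range")
        return self.values[i - 1]


income = NamedVector("income", [52000.0, 48000.0, 61000.0])
age = NamedVector("age", [34, 29, 41])

print(NamedVector.name, NamedVector.elem_type, NamedVector.length)   # the empty template
print(income.name, income.elem_type, income.length)                  # one realization
print(isinstance(age, NamedVector))                                  # membership in the class
```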

With the object type “list” at hand, we are ready to do further manipulation and sophisticated modeling and prediction.

To be continued ……
=========

Postscript: Bluesea pointed out that data are not information. Instead, they are digitized information. For example, digitizing a picture generates millions of pixels, each containing information such as location, color, and intensity. The picture is the information, and the data represent that information numerically. Well, that kind of discussion could go on forever. Let me get things straight here: my purpose in writing this series is to think about how to handle data statistically in the object oriented (OO) framework. That is, if I were reinventing an OO language to deal with data, what should I do? Inevitably, I will draw a lot from R/S, as it is the most popular OO statistical language. But I will also refer to some non-OO languages such as SAS and Stata, and to some obscure OO languages such as XLISP-STAT. My construction will also differ from those existing systems. This is where my creativity comes in.
