August 22, 2005

Thinking in Data (2): organizing data and more objects

Filed under: Causal inference and statistics, Uncategorized — xlsyu @ 1:43 am

OK, let’s continue our expedition into data. In the previous section, I introduced the list class (object type) out of the need to combine vectors of different types, such as numbers and strings, into one object. However, that introduction is somewhat unfair to the list class. In fact, a list can consist of any objects, including other list objects. That is, it can be recursive. This is true in R/S, and I believe there are comparable object types in C++ and Java. After all, you need something versatile enough to accommodate the whole universe.
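To make the recursive idea concrete, here is a minimal R sketch. The object `person` and its component names are made up for illustration only.

```r
# A list may hold vectors of different types, and even other lists.
# (person, scores, address are hypothetical names.)
person <- list(
  name    = "Ann",                                 # character vector
  scores  = c(90, 85, 77),                         # numeric vector
  address = list(city = "Boston", zip = "02115")   # a list inside a list
)

length(person)           # counts first-level components only
person$address$city      # reach into the nested list
is.list(person$address)  # the nested component is itself a list
```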

Now we have the almighty list class and we have also said that vector is the basis of all objects in our data object system, which means list also belongs to the vector class. This is kind of confusing. All objects inside a vector must have the same class but list can include any objects. How could a list also be a vector? Please read on.

We have defined several classes: number (integer, real, complex), character (string), and vector, in a somewhat hierarchical order. Given a vector of numbers, the class of the vector is “vector.” The components of the vector are all numbers. This is consistent with our definition of vector class.

If we define list as a vector object, all components in a list will have the same type — vector! This is exactly what the definition of vector class requires. The length of a list is the number of first level objects, and the extracting method is the same as that of any regular vector. The definition is sort of circular but it works.
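A short R sketch of this point, using a made-up list `x`: R itself reports that a plain list is a vector, its length counts only first-level components, and the bracket operators work on it as they do on any vector.

```r
# x is a hypothetical list with three first-level components.
x <- list(1:3, "a", list(TRUE, 2.5))

is.vector(x)   # TRUE: a plain list counts as a vector in R
length(x)      # 3: the nested list counts as one component

x[2]           # single brackets return a sub-list of length 1
x[[2]]         # double brackets return the component itself, "a"

# The same single-bracket extraction on a regular numeric vector:
v <- c(10, 20, 30)
v[2]           # 20
```

The only wrinkle is that `[` keeps the list wrapper while `[[` unwraps the component.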

One note: because everything in our data system is based on vectors by definition, it is not informative to say that the class of a vector of numbers is vector. Instead, naming the class of a vector after its elements (number in the above example) makes more sense. For more complicated classes, we can use two indicators: one is the name of the specific class we create (referred to as “class”), and the other is the general object type the class actually belongs to (referred to as “mode” in R). For example, for a data.frame object (see below), the class is “data.frame” and the mode is “list.” In this way, you can trace back what the classes are built on.
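In R the two indicators can be inspected with `class()` and `mode()`. A small sketch with made-up objects `v` and `df`:

```r
# A plain numeric vector: class and mode agree.
v <- c(1.5, 2, 3)
class(v)   # "numeric"
mode(v)    # "numeric"

# A data.frame: the class names the specific type,
# the mode reveals the general type it is built on.
df <- data.frame(id = 1:3, name = c("a", "b", "c"))
class(df)  # "data.frame"
mode(df)   # "list"
```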

Anyway, list is great, but sometimes it is too good. Let’s not forget our current aim: to develop a way to represent data. We would like to scale down the list object a little bit to simplify our lives in the data business.

Remember that Stata uses a simple “table” concept and does everything just fine. Indeed, a table is enough to represent almost all data. A table may contain several different types of vectors: string, number, and others. So the table is a list. It is also like a rectangle. All row vectors must have the same length and all column vectors must also have the same length. The table is easily indexed by row and column, the same way a matrix is. But a table is not a matrix, as the definition of a matrix requires that all its elements be of the same type (discussed in detail later). Nevertheless, an object type like table would be useful in practice.

In R/S, the data table is called the “data.frame” class, although we use the read.table function to read data from a file. Let’s stick to the data.frame terminology because the word table will be used later as a function name to describe the data. What a mess!
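Here is a small sketch of a data.frame in R. The object `survey` and its columns are invented for illustration; `stringsAsFactors = FALSE` is passed explicitly so the name column stays character in any R version.

```r
# A hypothetical four-row survey table with three column types.
survey <- data.frame(
  id     = 1:4,
  name   = c("Ann", "Bob", "Cy", "Di"),
  smokes = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

dim(survey)             # 4 rows, 3 columns: the "rectangle"
survey[2, "name"]       # matrix-style indexing by row and column
survey$smokes           # list-style extraction of one column
is.list(survey)         # TRUE: a data.frame is built on a list
sapply(survey, class)   # each column keeps its own type

# Forcing the table into a matrix makes every element one type
# (character here), which is why a table is not a matrix.
m <- as.matrix(survey)
is.character(m)         # TRUE
```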

So far so good. We have a familiar way to store our data using a data.frame object. The manipulation of the data.frame object is also straightforward. We are ready to go on to the next step—look at the data themselves.

But wait, until now we have introduced only two broad types of data values: number and character. Reality is more complicated than that. There are other types of data values, and some types have special meanings.

The first one is logical values. In a survey, you ask participants “do you smoke?” and the answer is “yes/no.” (Sometimes answers can be “don’t know” or “refuse to answer,” but let’s put off this issue until the discussion of examining missing values.) They can be represented as “TRUE” or “FALSE” and coded as 1/0 values.
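A quick R sketch of the coding, with a made-up vector `smokes`: TRUE behaves as 1 and FALSE as 0 whenever arithmetic is applied.

```r
# Hypothetical yes/no answers from four participants.
smokes <- c(TRUE, FALSE, TRUE, TRUE)

class(smokes)        # "logical"
as.integer(smokes)   # 1 0 1 1
sum(smokes)          # 3: the number of smokers
mean(smokes)         # 0.75: the proportion of smokers
```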

The second one is missing values. Suppose you ask participants tons of questions and some participants skip a couple of them. You don’t know whether it is because they refuse to answer those questions or because they don’t know the answers. All you have are empty values for those questions. They are better represented by missing values. (But missing values may have patterns. We will explain later on how to explore and impute missing values.) For convenience, missing values are treated flexibly so that they fit in with any other object type (except for the raw type).
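In R the missing value is written `NA`. A minimal sketch with a made-up vector `age` shows how it blends into vectors of different types and how it propagates through a calculation:

```r
# Hypothetical ages; the second participant skipped the question.
age <- c(34, NA, 51)

is.na(age)                 # FALSE TRUE FALSE
mean(age)                  # NA: a missing value propagates
mean(age, na.rm = TRUE)    # 42.5: drop missing values first

# NA takes on the type of the vector it sits in.
class(c("a", NA))          # "character"
class(c(TRUE, NA))         # "logical"
```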

The third one is about two special types of character object. Sometimes you ask participants “how many cigarettes do you smoke every day?” The answer may range from “zero” to “ten packs a day.” Although you can treat the answers as continuous values (the usual numeric values), you may want to group them into “zero”, “less than half a pack”, “half a pack to two packs”, “two to four packs”, and “more than four packs,” which correspond to “non smokers”, “light smokers”, “moderate smokers”, “heavy smokers” and “extra-heavy smokers.” These are ordered categorical values. They are usually coded as 0,1,2,3,4, and require special attention in data analysis.
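One possible way to do this grouping in R is `cut()`. The sketch below is only an illustration: the vector `cigs`, the cut points, and the assumption of 20 cigarettes per pack are mine, not part of the survey described above. Note that R codes the groups starting from 1 rather than 0.

```r
# Hypothetical answers in cigarettes per day (assuming 20 per pack).
cigs <- c(0, 5, 10, 30, 60, 100)

# Intervals are right-closed: (-Inf,0], (0,9], (9,40], (40,80], (80,Inf]
groups <- cut(
  cigs,
  breaks = c(-Inf, 0, 9, 40, 80, Inf),
  labels = c("non smoker", "light smoker", "moderate smoker",
             "heavy smoker", "extra-heavy smoker"),
  ordered_result = TRUE
)

groups               # ordered categories, one per participant
as.integer(groups)   # underlying codes: 1 2 3 3 4 5
```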

Another special type of character object is the unordered categorical values. When people are asked to name the vegetables they usually eat, they may include all sorts of vegetables in their answers. We may want to group them in some way, such as by botanical classification: Brassica or cruciferous vegetables (flower vegetables), tuber and root vegetables (potatoes), green leaf vegetables, stalk vegetables (celery), fruit vegetables, pod and seed vegetables, onion vegetables, and so forth. Each vegetable type may have different nutritional values but there is no obvious order among them. (Sure, you can order them based on nutrients such as calcium content, but that is another question.) They are better represented by unordered categorical values and are usually also coded as 0,1,2,3…. But during data analysis they will be recoded as dummy values (a series of 0/1 values, which will be explained later).

In R/S, categorical values are called factors. Each factor has a special attribute called levels, which holds the unique character strings (the categories) that appear in the data. In R, ordered factors are handled by the “ordered” class, which extends the factor class.
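A short R sketch of both kinds of factor. The vectors `veg` and `dose` and their category labels are invented for illustration.

```r
# Unordered factor: hypothetical vegetable groups.
veg <- factor(c("root", "leaf", "root", "stalk"))
levels(veg)       # the unique categories: "leaf" "root" "stalk"
nlevels(veg)      # 3
as.integer(veg)   # underlying integer codes: 2 1 2 3

# Ordered factor: the level order is given explicitly.
dose <- factor(
  c("light", "heavy", "moderate"),
  levels  = c("light", "moderate", "heavy"),
  ordered = TRUE
)
class(dose)          # "ordered" "factor"
dose[1] < dose[2]    # TRUE: light comes before heavy
```

Comparisons such as `<` are meaningful only for the ordered factor; R returns NA with a warning if you try them on an unordered one.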

The last one is a vague object type. We use raw data class to represent anything that is unstructured. For example, a chunk of binary stuff such as a scanned figure is raw data.
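In R, raw vectors hold plain bytes. A tiny sketch with a made-up vector `bytes`:

```r
# Three hypothetical bytes, given as decimal values 0 to 255.
bytes <- as.raw(c(72, 105, 255))

bytes                   # printed in hexadecimal: 48 69 ff
class(bytes)            # "raw"
length(bytes)           # 3

# Bytes carry no meaning by themselves; here the first two
# happen to be the ASCII codes for "H" and "i".
rawToChar(bytes[1:2])   # "Hi"
```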

Number (integer, numeric/double precision, single precision, complex), character, logical value, missing value, factor and raw are atomic classes. They are at the bottom of the hierarchical class framework. Vectors consisting of these atomic objects are atomic vectors, and they are the building blocks of other data objects. In fact, atomic vectors are the objects on which all data analysis is based.
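R offers `is.atomic()` and `typeof()` to check this. The sketch below is only a rough illustration of the idea; note that in R a factor is stored as an integer vector with a levels attribute, so it reports as atomic too, while a list does not.

```r
# Storage types of a few atomic values.
typeof(1L)      # "integer"
typeof(1.5)     # "double"
typeof(1i)      # "complex"
typeof("a")     # "character"
typeof(TRUE)    # "logical"

# Atomic vectors versus lists.
is.atomic(c(1.5, 2))         # TRUE
is.atomic(factor("a"))       # TRUE: integer codes underneath
is.atomic(list(1, "a"))      # FALSE: a list is not atomic
```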

There we go. We have finally finished all the groundwork for defining and organizing data objects. Next we may do some real business on what information data can tell us. But to analyze any data, we must have functions. How do we define functions in the OO framework? Are functions also objects?

To be continued……
