August 22, 2005

Thinking in Data (3)—expand our class system

Filed under: Causal inference and statistics, Uncategorized — xlsyu @ 6:37 pm

Boy, we’ve come a long way. My data object system gets more and more complicated. I have also (stealthily) corrected many errors in my previous posts. If you haven’t reread them, let me recapitulate the status quo of the data object system.

We have several atomic object types: number, character, logical value, missing value, and raw value. Vectors composed of these types are atomic vectors. They are the building blocks of our system. The type of a vector is determined by its components. To avoid confusion, we now call the object type “mode”, and reserve “class” for a more extensive classification purpose. Therefore, every object has four default attributes: class, mode, length, and name. It also has at least one method: the indexing method.
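Since R is the language these posts lean on, here is a minimal sketch of the four default attributes and the indexing method. It assumes R's built-in `mode`, `class`, `length`, and `names` behave like the attributes described above; the vector `x` and its element names are made up for illustration.

```r
# A numeric atomic vector (values are illustrative only)
x <- c(1.5, 2, 3)

mode(x)     # "numeric": the type of the components
class(x)    # "numeric"
length(x)   # 3

# Names are optional; once set, they can be used for indexing
names(x) <- c("a", "b", "c")
x[2]        # index by position
x["b"]      # index by name

# The mode follows from the components
mode(c("non", "ex"))      # "character"
mode(c(TRUE, NA))         # "logical" (NA is the missing value)
mode(as.raw(c(1, 255)))   # "raw"
```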

A list class is a versatile tool which can combine all kinds of objects into one object. A data.frame class is a table-like object type which is developed from the list class. Thus, the mode of the data.frame class is “list”, which indicates that a data.frame is indeed a list.
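The relationship between the two classes can be checked directly in R. This is only a sketch; the objects `rec` and `df` and their contents are invented for the example.

```r
# A list can hold objects of different modes
rec <- list(id = 1L, name = "Ann", smoker = TRUE)
mode(rec)      # "list"

# A data.frame is a list of equal-length columns
df <- data.frame(id = 1:3, age = c(54, 61, 47))
class(df)      # "data.frame"
mode(df)       # "list"
is.list(df)    # TRUE
length(df)     # 2: as a list, its length is the number of columns
```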

Sure, the above framework is far from sufficient to be a real object-oriented system. It is not surprising that even a simple task such as representing data can be a messy business if we design a data system from scratch (and I haven’t mentioned anything about data storage, implementation, low-level programming, etc.). As I have already said, we will discover many holes in this framework, and we will mend them bit by bit.

Nevertheless, we have developed a way to represent data. For example, suppose we have conducted a survey on smoking and diet habits among lung cancer and non-cancer patients. The data will include patient id, age (years), smoking status (non, ex-, current smokers), level of smoking (zero, light, moderate, heavy, and extra-heavy), duration of smoking (years), vegetable intake (botanical groups), and the response, lung cancer status (yes/no). These data will include all atomic data types except the raw type. A data.frame object (called lung) can be created with no difficulty. The class of the object lung is data.frame, its mode is list, and its names are the seven variable names. It may be more convenient to redefine the length in the data.frame class as a pair of dimension lengths: the row length and the column length. In the lung data, the row length is the number of participants, and the column length is the number of variables. The indexing can be based on row and column location or, more naturally, on variable names.
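To make the lung object concrete, here is a small sketch in R. The four rows are fabricated purely for illustration (they are not survey results), and the column names are my own assumptions about how the seven variables might be labelled.

```r
# Hypothetical rows, invented for illustration only
lung <- data.frame(
  id             = 1:4,
  age            = c(63, 58, NA, 71),          # NA marks a missing value
  smoking_status = factor(c("current", "non", "ex", "current"),
                          levels = c("non", "ex", "current")),
  smoking_level  = factor(c("heavy", "zero", "light", "moderate"),
                          levels = c("zero", "light", "moderate",
                                     "heavy", "extra-heavy"),
                          ordered = TRUE),
  duration       = c(40, 0, 12, 35),
  vegetable      = c("cruciferous", "allium", "legume", "cruciferous"),
  cancer         = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE                      # keep vegetable as character
)

class(lung)   # "data.frame"
mode(lung)    # "list"
names(lung)   # the seven variable names
dim(lung)     # 4 7: row length, then column length

# Indexing by location, by variable name, or by a logical condition
lung[2, 2]
lung[2, "age"]
lung$cancer
lung[lung$cancer, c("id", "smoking_status")]
```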

With the data at hand, the first thing we may want to do is to make a simple table by cancer status and smoking status. The table addresses the question of whether lung cancer patients are more likely to be smokers than non-cancer patients are. So far, however, we only have the indexing method associated with the data.frame class. We would like to have a method, or function, to tabulate the data. But in our object-oriented system, we claim that everything is an object. Then a nontrivial question is: how can a function be an object?

The solution is simple but may sound illogical. We will define some basic language classes and mandate that all functions are derived from them.

The first basic language class is the symbol class (or name class in R). As we know, symbolic computation is convenient at an abstract level. For example, given the expression x+y, we can substitute any values for x and y to obtain correct results. The symbol class is at the bottom of our class system.

The next one, which I have already used, is the expression class. An expression can be as simple as 1+2 or x+y, but more likely it is a complicated one which includes other expressions and functions. That is, the expression class is a recursive class. Nevertheless, an expression can be evaluated and its result can be assigned to an object. For example, in x <- 1+2 (where <- means assign), x is 3 after evaluation.
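Symbols and expressions can be inspected in R itself. The following is a sketch using the base functions `as.name`, `quote`, `expression`, and `eval`; note that R reports an unevaluated x + y built with `quote` as class "call", while `expression` gives the class named above.

```r
# A symbol (R calls its class "name")
s <- as.name("x")
class(s)                        # "name"

# An unevaluated expression built from symbols
e <- quote(x + y)
class(e)                        # "call"
e[[1]]                          # the function symbol `+`
e[[2]]                          # the symbol x

# Substitute values for the symbols and evaluate
eval(e, list(x = 1, y = 2))     # 3

# Expressions are recursive: a component can itself be an expression
nested <- quote(x + y * 2)
class(nested[[3]])              # "call", the sub-expression y * 2

# An expression object, and assignment of an evaluated result
ex <- expression(x + y)
class(ex)                       # "expression"
x <- 1 + 2
x                               # 3
```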

A function class consists of one or more expressions arranged in a structured way. The expressions are enclosed in an expression block “{…}”, whose class “{” is derived from the expression class.

Any function is designed for a specific purpose. Thus the function class is self-enclosed (a closure). It communicates with the outside through its argument attribute. In addition, any function object is tied to an environment in which it can legitimately run.

Therefore, the attributes of a basic function class include its arguments, its environment, and its body, which is an expression block. Running a function means evaluating the expressions inside its body within its environment. Because the expression is a recursive class, the function class is also recursive. For example, a function tabulating two variables will invoke other functions such as counting and arithmetic functions.
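The three attributes of a function can be pulled apart in R with `formals`, `body`, and `environment`. This is a sketch only; the function `add_scaled` is a made-up example, not part of any package.

```r
# A hypothetical function with three arguments and a two-expression body
add_scaled <- function(x, y, scale = 1) {
  total <- x + y
  total * scale
}

class(add_scaled)            # "function"
typeof(add_scaled)           # "closure"

names(formals(add_scaled))   # the arguments: "x" "y" "scale"
body(add_scaled)             # the expression block
class(body(add_scaled))      # "{"
environment(add_scaled)      # the environment where it was defined

# Running the function evaluates its body with the supplied arguments
add_scaled(1, 2, scale = 10) # 30
```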

By defining some language classes (and there will be more), we are ready to develop a function object called table (or tabulate in much other software). Its arguments are two vectors, and its environment can be any place (because table is so generic). The expression body will include counting one vector based on the other vector, summing up the numbers, and printing them out. These expression statements will invoke many other functions as well.
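As a rough sketch of what such a body might look like, here is a hand-written cross-tabulation in R. The name `cross_tab` and the sample vectors are hypothetical, and this is not how R's built-in `table` is implemented; it only illustrates counting one vector within the levels of the other.

```r
# Hypothetical helper: count co-occurrences of the values of two vectors
cross_tab <- function(rows, cols) {
  stopifnot(length(rows) == length(cols))
  r_levels <- as.character(sort(unique(rows)))
  c_levels <- as.character(sort(unique(cols)))
  counts <- matrix(0L, nrow = length(r_levels), ncol = length(c_levels),
                   dimnames = list(r_levels, c_levels))
  for (i in seq_along(rows)) {
    # Skip pairs with a missing value (a simplifying assumption)
    if (is.na(rows[i]) || is.na(cols[i])) next
    r <- as.character(rows[i])
    k <- as.character(cols[i])
    counts[r, k] <- counts[r, k] + 1L
  }
  counts
}

# Made-up data for illustration
cancer  <- c("yes", "yes", "no", "no", "yes", "no")
smoking <- c("current", "ex", "non", "current", "current", "non")

tab <- cross_tab(cancer, smoking)
print(tab)

# The built-in function gives the same counts
table(cancer, smoking)
```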

Here is a new problem. Very often we want a summary of our data. But for a continuous variable (e.g., age), a summary should include the mean, median, standard deviation, etc., while for a categorical variable (e.g., smoking status), a summary may just be counts. We need separate, class-specific summary functions for the number class and the factor (character) class.

We already know that the methods associated with a class describe how to manipulate the objects derived from that class. Why don’t we associate a summary method with each class? Then, whenever we invoke a summary function, the system will first determine which class the summary is supposed to operate on, and accordingly invoke the class-specific method to do the job. This is the key to object-oriented programming.
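R's S3 system works roughly this way, so the idea can be sketched with a generic function and one method per class. The generic name `describe` is hypothetical (chosen so as not to overwrite R's own `summary`), and the data are made up.

```r
# Hypothetical generic: dispatches on the class of its first argument
describe <- function(x, ...) UseMethod("describe")

# Method for numbers: mean, median, standard deviation
describe.numeric <- function(x, ...) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE))
}

# Method for factors: counts per level
describe.factor <- function(x, ...) table(x)

# Fallback when no class-specific method exists
describe.default <- function(x, ...) {
  stop("no describe method for class ", class(x)[1])
}

# Made-up data for illustration
age    <- c(63, 58, 71, 49)
status <- factor(c("current", "non", "ex", "current"))

describe(age)      # uses describe.numeric
describe(status)   # uses describe.factor
```

The caller writes `describe(...)` in both cases; the system picks the method.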

Now let’s go back to square one. Suppose we derive a matrix class from the basic number vector class. By definition, the elements of a matrix should all be of the same type (here, the number type). A matrix is organized by row and column, essentially by rearranging the original vector, so it needs a dimension attribute to indicate the rows and columns. Further, the indexing method needs to be revised to accommodate this change, which effectively masks the indexing method of the parent class, the number vector. In addition, the summary function needs to be revised too, so that we get summary statistics by column.
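In R this derivation is quite literal: a matrix is a vector with a dimension attribute attached. A minimal sketch, with made-up values:

```r
# Start from a plain number vector
v <- c(2, 4, 6, 8, 10, 12)

# Adding a dimension attribute turns a copy of it into a 2 x 3 matrix
m <- v
dim(m) <- c(2, 3)
attributes(m)       # $dim: 2 3
is.matrix(m)        # TRUE

# Revised indexing by row and column (values fill column by column)
m[1, 2]             # 6

# The vector-style index of the parent still works
m[3]                # 6

# A by-column summary, here the column means
apply(m, 2, mean)   # 3 7 11
```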

But there is one special need. What if we want to know the grand mean of the matrix? If we have a mean method in the parent vector class, we can use it in the matrix class instead of writing a new one. That is, a child class such as the matrix class inherits methods from its parent class. This is a powerful idea. Since a matrix is just a vector rearranged by defining two dimensions, it is still stored as a one-dimensional vector and can be indexed like one. So the mean method of the vector class works perfectly well on a matrix.

Therefore, when we invoke a mean function on a matrix, the system first searches the matrix class to see if it has a method called mean. If not, the system goes upstream to see if its parent has a mean method. In our case, there is a mean method for the number vector class. Then that mean method is used to compute the mean of the matrix by treating the matrix as a one dimensional vector.
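The search described above can be sketched with S3 generics in R. The generics `grand_mean` and `overview` are hypothetical names invented for this example; the fallback relies on R treating a numeric matrix as also belonging to the numeric class for dispatch.

```r
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

# The built-in mean treats the matrix as a one-dimensional vector
mean(m)                      # 3.5

# Hypothetical generic with a method for the parent class only
grand_mean <- function(x, ...) UseMethod("grand_mean")
grand_mean.numeric <- function(x, ...) sum(x) / length(x)

# No grand_mean.matrix exists, so the numeric method is inherited
grand_mean(m)                # 3.5

# A second hypothetical generic where the child masks the parent
overview <- function(x, ...) UseMethod("overview")
overview.numeric <- function(x, ...) mean(x)
overview.matrix  <- function(x, ...) colMeans(x)

overview(c(1, 2, 3))         # 2: parent method
overview(m)                  # 1.5 3.5 5.5: matrix method found first
```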

In some sense, our OO data system has all the elements an OO system should have: classes, methods, attributes, inheritance, and method dispatch. Furthermore, everything in our system is an object. Our work is largely done.

But after so much trouble in defining the data system, we still haven’t seen anything we can use to dig into the data. Well, please be patient, we are almost there. We will talk about how to visually explore the data in the next post.

To be continued…..

4 Comments »

  1. Thanks for your efforts. It’s very interesting. I closely followed this thread. Will you continue it?

    Comment by Lin — August 28, 2005 @ 9:15 am

  2. I haven’t seen your previous posts on this topic. One obvious question is: have you tried Python? It would be a lot easier and more useful if you could write a package like this for Python rather than trying to develop it from the bottom up in something like C/C++. Python already has the most powerful data structures, OO models, and support packages ready for use.

    Comment by lovescience — September 4, 2005 @ 1:39 pm

  3. Python also has support for databases, GUIs, plotting etc. I try to do all my work in Python. It is open source and free. It would be nice to be able to access your package through Python.

    Comment by lovescience — September 4, 2005 @ 1:43 pm
