Monthly Archive for January 2011

 
 

Too Big to Be Accurate (1): Which Is the Most Powerful Calculator in the World?

Can you calculate the factorial of 171 (171!)? Just TRY! It is equal to 171*170*169*...*2*1.

1. Google calculator

As a Google fanatic, I first tried to find the answer with Google:

[Screenshot: Google171]

Whoops, nothing interesting came back! Typing “170!” instead does return an output:

[Screenshot: Google170]

Why does this happen in the calculator? After all, 171! is just 171*170!.

2. Excel

Next, switch to an Excel spreadsheet and use the function fact(*):

[Screenshots: Excel170, Excel171]

Oh, interesting. The same thing happens.
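
A plausible explanation for both results, assuming the two tools store numbers as IEEE 754 double-precision values (an assumption; the screenshots alone do not prove it), is that 170! is the last factorial below the largest finite double, roughly 1.8e308. The following is a minimal C++ sketch of that idea; the function names are hypothetical and chosen only for illustration.

```cpp
#include <cmath>
#include <limits>

// n! computed as a running product in double precision.
double factorial(int n) {
    double result = 1.0;
    for (int i = 2; i <= n; ++i) result *= i;
    return result;
}

// True while n! is still a finite double, i.e. not above
// std::numeric_limits<double>::max() (about 1.8e308 for IEEE 754).
bool factorial_fits_in_double(int n) {
    return std::isfinite(factorial(n));
}
```

On a platform with IEEE 754 doubles, factorial(170) is finite (about 7.26e306), while factorial(171) overflows to infinity, which matches the jump seen between 170! and 171!.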

3. SAS

Google and Excel may be only niche players in the calculator family. Why not try a programming language?



Feature Selection: Collections for Self Study

Recently I started learning the algorithms and applications of feature selection. The term “feature”, widely used in the machine learning and data mining literature, simply means “variable”. In practice, for example, a decision tree can be placed in front of a neural network model: the tree performs the variable selection, and the variables it picks become the inputs of the network.

Arizona State University maintains a feature selection repository that so far includes the original documentation, Matlab packages and user guides for the following popular algorithms:

BLogReg
CFS
Chi Square
FCBF
Fisher Score
Gini Index
Information Gain
Kruskal-Wallis
mRMR
Relief-F
SBMLR
T-test
SPEC
see http://featureselection.asu.edu/software.php

An R package, FSelector, is also useful for step-by-step study. The package covers:

Filters:
*cfs
*chi-squared
*consistency
*correlation
–linear.correlation
–rank.correlation
*entropy.based
–information.gain
–gain.ratio
–symmetrical.uncertainty
*OneR
*random.forest.importance
*relief-F

Wrappers:
*best.first.search
*exhaustive.search
*greedy.search
–backward.search
–forward.search
*hill.climbing.search

Decision Trees in SAS Enterprise Miner and SPSS Clementine

Decision trees are included in SAS Enterprise Miner (EM). Its counterpart is SPSS Clementine, which, strictly speaking, should be called IBM SPSS Modeler since IBM’s acquisition of SPSS.

Recently I read a paper comparing the decision tree and clustering techniques of SAS EM, SPSS Clementine and IBM Intelligent Miner:

Decision Tree Induction & Clustering Techniques in SAS Enterprise Miner, SPSS Clementine, and IBM Intelligent Miner – A Comparative Analysis by Abdullah M. Al Ghoson, Virginia Commonwealth University

The conclusion is not that surprising: SAS EM does better in performance, functionality and auxiliary task support, but worse in usability.

[Figure: SAS_VS_SPSS]

Here are a few comments on the decision tree implementations in SAS EM and SPSS Clementine, based on my own experience. Some advice for beginners is also included.

SPSS Clementine has four nodes that support four tree algorithms respectively: C5.0, Classification And Regression Trees (CART), Quick, Unbiased, Efficient Statistical Tree (QUEST) and Chi-squared Automatic Interaction Detector (CHAID). These are the best-known and most popular members of the decision tree family.

[Figure: SPSS_4_trees]

Note that CART(R) is a registered trademark of California Statistical Software, Inc., and is licensed exclusively to Salford Systems, San Diego, California, so SPSS Clementine uses the name C&R Tree instead.

In SAS EM, there is only one decision tree node:

[Figure: SAS_tree]

The algorithms behind this node are called the SAS tree algorithms; they incorporate and extend the four mentioned above. Just change the settings of the decision tree node and you get the tree you want.
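
The idea of one node whose settings decide which kind of tree is grown can be sketched as a single routine with a switchable splitting criterion. The C++ below is only an illustration under stated assumptions: the names Criterion and impurity are hypothetical, it is not how SAS EM is implemented, and it covers just two well-known measures, the Gini index (the usual CART criterion) and entropy (the basis of the C4.5/C5.0 family).

```cpp
#include <cmath>
#include <vector>

// Hypothetical setting, standing in for a "splitting criterion" option.
enum class Criterion { Gini, Entropy };

// Impurity of one tree node, given the number of cases in each class.
double impurity(const std::vector<int>& class_counts, Criterion criterion) {
    double total = 0.0;
    for (int c : class_counts) total += c;
    if (total == 0.0) return 0.0;  // empty node: treat as pure

    double result = (criterion == Criterion::Gini) ? 1.0 : 0.0;
    for (int c : class_counts) {
        if (c == 0) continue;             // 0 * log2(0) is taken as 0
        const double p = c / total;
        if (criterion == Criterion::Gini)
            result -= p * p;              // Gini: 1 - sum(p^2)
        else
            result -= p * std::log2(p);   // entropy: -sum(p * log2(p))
    }
    return result;
}
```

The same tree-growing code can call impurity with either setting, which is the flexibility described above; the price is that the user has to know which setting corresponds to which textbook algorithm.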

Clearly, the SAS tree algorithms beat the separate ones in SPSS Clementine in extensibility and flexibility. On the other hand, the complexity increases. A new SAS EM user may wonder which kind of tree he or she is training. An SPSS Clementine user just picks a node and says: OK, I am now training a CART or a CHAID tree. He or she can then communicate with others more smoothly.

Industry applications aside, I think this is the educational benefit of SPSS Clementine. Since almost every data mining book introduces decision trees as separate algorithms (such as ID3/C4.5/C5.0, CART, QUEST, CHAID, ...), beginners who use SPSS Clementine as an instructional tool can get a clear idea of the algorithms one by one. Once they fully understand the differences among the tree algorithms, they will train trees in SAS EM more comfortably.

What’s more, SPSS Clementine supplies rich supporting documentation for beginners and self-learners, such as the Tutorial, User Guide, Algorithms Guide and Node Reference. The official documentation for SAS EM 5.x and 6.x is relatively poor. True, there is good SAS Help and Documentation for SAS EM 4.3, including Getting Started with Enterprise Miner. But EM 4.3 is a traditional AF application, while EM 5.x and above are Java clients built into the SAS analysis platform (they are totally different!). For EM 5.x and above, only installation guides and a plain reference are available.

SAS Institute may have its own marketing strategy. While rich references are not available, the Institute DOES offer rich training programs in data mining and Enterprise Miner applications. Well, the big-budget purchasers of SAS EM can afford the training too.

SAS Data Step’s Built-in Loop: An Illustrated Example

Some new SAS programmers take SAS as the first programming language they have ever learned. Sometimes they are confused by the concept of the “data step’s built-in loop” even after reading the well-written The Little SAS Book: A Primer:

DATA steps also have an underlying structure, an implicit, built-in loop. You don’t tell SAS to execute this loop: SAS does it automatically. Memorize this:

DATA steps execute line by line and observation by observation.

Programmers can memorize the statement above and apply it well in their programming practice, yet still find it hard to get a vivid picture of the so-called implicit built-in loop. This post tries to make it easy.

The following shows an explicit loop in C++. Note that you do not need to know anything about C++ to get the idea. Suppose that a data file data.dat on the D drive holds three numbers:

1
2
3

The question is how to (read and) print out these numbers and their sum. The C++ approach follows (just focus on the loop):

#include <fstream>
#include <iostream>
using namespace std;

int main()
{
    int x;
    int sum = 0;                    // running total of the numbers read
    ifstream inFile;
    inFile.open("d:data.dat");

    // Explicit loop: the body runs once for each number successfully read,
    // whether or not the file ends with a newline.
    while (inFile >> x)
    {
        cout << x << endl;          // print the current number
        sum = sum + x;              // add it to the running total
    }

    inFile.close();
    cout << "sum=" << sum << endl;  // print the total after the last number
    return 0;
}

There is an explicit loop in this C++ code: while (inFile >> x). As long as another number can be read from the file, the loop keeps printing the numbers and accumulating the sum. The final output is

1
2
3
sum=6

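The following SAS code produces exactly the same output:

data _null_;
    infile "d:data.dat" end=eof;  /* eof becomes 1 on the last record */
    input x;                      /* read one number per iteration */
    sum+x;                        /* sum statement: total is retained */
    put x;                        /* write the current number to the log */
    if eof then put sum=;         /* after the last record, write the total */
run;

Note that the SAS code needs no explicit loop to reach the end of the file: the DATA step’s implicit, built-in loop runs the statements once for each record.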
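
For readers who want to see the implicit loop spelled out, the following C++ sketch separates the two roles: a helper plays the part of the DATA step machinery (looping over records and flagging the last one), and a lambda plays the part of the statements written between data and run. The names run_data_step and print_with_sum are hypothetical, and this is only an analogy under those assumptions, not a description of how SAS is implemented.

```cpp
#include <functional>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper standing in for the DATA step's implicit loop.
// It reads one value per iteration and calls `body` once per observation,
// passing a flag that is true only for the last record (like END=eof).
void run_data_step(std::istream& in,
                   const std::function<void(int x, bool eof)>& body) {
    int current;
    if (!(in >> current)) return;   // empty input: the body never runs
    int next;
    while (in >> next) {            // look ahead to detect the last record
        body(current, false);
        current = next;
    }
    body(current, true);
}

// The "program" itself contains no loop, just like the SAS code.
std::string print_with_sum(std::istream& in) {
    std::ostringstream out;
    int sum = 0;                    // kept across iterations, like sum+x;
    run_data_step(in, [&](int x, bool eof) {
        sum += x;
        out << x << '\n';
        if (eof) out << "sum=" << sum << '\n';
    });
    return out.str();
}
```

Passing a stream that holds 1, 2 and 3 to print_with_sum gives the same four lines of output as above. The loop still exists, but it lives in the helper rather than in the code the programmer writes, which is the point of the phrase “built-in loop”.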