Category Archive ‘data mining’

 
 

Weekend Clips: Data Scientist Episode II

1. What is A Data Scientist Anyway?

image

2. You Just Can’t Be Replaced by Yourselves!

image

3. We are all Data Scientists!

image

Weekend Clip: Data Scientist

Two tweets:

image

image

One blog post:

Statisticians aren’t the problem for data science. The real problem is too many posers

One job advertisement:

SAS Data Scientist ?(!)

A joke:

The biggest joke about data scientists is that the Google query “data scientist joke” returns nothing interesting.

Map and Reduce in MapReduce: a SAS Illustration

In my last post, I mentioned Hadoop, the open source implementation of Google’s MapReduce for parallelized processing of big data. Over this long National Holiday, I read the original Google paper, MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat, and learned that the terms “map” and “reduce” were basically borrowed from Lisp, an old functional language in which I have never written even a “hello world”. For Python users, the idea of map and reduce is also very straightforward, because the workhorse data structure in Python is the list, a sequence of values whose elements you can imagine as the nodes (clusters, chunk servers, …) of a distributed system.

MapReduce is a programming framework and is really language independent, so SAS users can also get the basic idea from their daily programming practice. Here is a simple illustration using a data step array (not an array in Proc FCMP or a matrix in IML). A data step array in SAS is fundamentally not a data structure but a convenient way of processing a group of variables; still, it can be used to mimic some list operations, as in Python and other languages with rich data structures (an editable version can be found here):

MapReduce

Following the code above, the programming task is to capitalize the string “Hadoop” (Line 2). The “master” method simply capitalizes the string as a whole (Line 8): a single master machine processes all the data.

Then we introduce the idea of “big data”: the string is too large for one master machine, so the “master” method fails. Now we distribute the task to thousands of low-cost machines (workers, slaves, chunk servers, …; in this case, a one-dimensional array of size 6, see Line 11), and each machine does part of the job (each array element capitalizes only a single letter, in sequence, see Lines 12-14). This distributing operation is called “map”. In a MapReduce system, a master machine is still needed to assign the map and reduce tasks.

How about “reduce”? A “reduce” operation is also called a “fold”. In Line 17, for example, it combines all the separate values into a single value, that is, it combines the results from multiple worker machines.
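For Python users, the same toy task can be sketched with a list standing in for the workers. This is only an analogy under the same assumptions as the SAS illustration (one letter per "worker", everything running on a single machine); the variable names are made up for this example and it is not how Hadoop itself is programmed.

```python
from functools import reduce

text = "Hadoop"

# "Master" method: one machine capitalizes the whole string at once.
master_result = text.upper()

# "Map": split the job so that each worker handles a single letter.
workers = list(text)                      # six "workers", one letter each
mapped = list(map(str.upper, workers))    # each worker capitalizes its own letter

# "Reduce" (fold): combine the partial results into a single value.
reduced = reduce(lambda acc, piece: acc + piece, mapped, "")

print(mapped)    # ['H', 'A', 'D', 'O', 'O', 'P']
print(reduced)   # HADOOP
```

Both routes give the same answer; the point is only that the map step can be spread over many workers, while the reduce step folds their outputs back together.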

Feature Selection: Collections for Self Study

Recently I started to learn the algorithms and applications of feature selection. The term “feature”, widely used in the machine learning and data mining literature, simply means “variable”. In practice, for example, a neural network model may take the variables chosen by a decision tree as its input; the tree then performs the function of variable selection.

Arizona State University maintains a feature selection repository, which so far includes the original documentation, Matlab packages and user guides for the following popular algorithms:

BLogReg
CFS
Chi Square
FCBF
Fisher Score
Gini Index
Information Gain
Kruskal-Wallis
mRMR
Relief-F
SBMLR
T-test
SPEC
see http://featureselection.asu.edu/software.php
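To give a flavor of what one of these filter criteria computes, here is a minimal Python sketch of Information Gain for categorical variables. It is a textbook-style illustration, not the code from the ASU repository; the function names and the toy data are assumptions made up for this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus their entropy after grouping by the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data (hypothetical): one feature separates the classes, the other does not.
labels = ["yes", "yes", "no", "no"]
features = {
    "perfect": ["a", "a", "b", "b"],
    "useless": ["a", "b", "a", "b"],
}

scores = {name: information_gain(col, labels) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # features ordered from most to least informative
```

A filter method simply ranks the variables by such a score and keeps the top ones, without training the final model.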

An R package, FSelector, is also useful for step-by-step study. This package covers:

Filters:
*cfs
*chi-squared
*consistency
*correlation
–linear.correlation
–rank.correlation
*entropy.based
–information.gain
–gain.ratio
–symmetrical.uncertainty
*OneR
*random.forest.importance
*relief-F

Wrappers:
*best.first.search
*exhaustive.search
*greedy.search
–backward.search
–forward.search
*hill.climbing.search
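The difference between filters and wrappers is easier to see in code. Below is a minimal Python sketch of a greedy forward search wrapper. It is not FSelector's implementation: the `forward_search` function, the `evaluate` callback and the `usefulness` numbers are all hypothetical stand-ins. In real use, `evaluate` would train a model on the candidate subset and return something like a cross-validated accuracy.

```python
def forward_search(features, evaluate):
    """Greedily add the feature that most improves evaluate(subset); stop when none does."""
    selected = []
    best_score = evaluate(selected)
    remaining = list(features)
    while remaining:
        # Score every one-feature extension of the current subset.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored)
        if score <= best_score:
            break  # no candidate improves the score
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

# Hypothetical evaluator: made-up usefulness per feature, minus a cost per feature used.
usefulness = {"age": 0.30, "income": 0.25, "zip": 0.02, "id": 0.0}

def evaluate(subset):
    return sum(usefulness[f] for f in subset) - 0.05 * len(subset)

selected, score = forward_search(usefulness, evaluate)
print(selected)  # ['age', 'income']
```

Backward search is the mirror image: start from all features and greedily drop the one whose removal helps most.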

Decision Trees in SAS Enterprise Miner and SPSS Clementine

Decision trees are included in SAS Enterprise Miner (EM). The counterpart is SPSS Clementine, which, to be precise, should be called IBM SPSS Modeler since IBM’s acquisition of SPSS.

Recently I read a paper comparing SAS EM, SPSS Clementine and IBM Intelligent Miner on their decision tree and clustering techniques:

Decision Tree Induction & Clustering Techniques in SAS Enterprise Miner, SPSS Clementine, and IBM Intelligent Miner – A Comparative Analysis by Abdullah M. Al Ghoson, Virginia Commonwealth University

The conclusion is not that surprising: SAS EM does better in performance, functionality and auxiliary task support, but worse in usability.

SAS_VS_SPSS

Here are a few comments on the decision tree implementations in SAS EM and SPSS Clementine, based on my own experience. Some advice for beginners is also included.

There are four nodes in SPSS Clementine, supporting four tree algorithms respectively: C5.0, Classification And Regression Trees (CART), Quick, Unbiased, Efficient Statistical Tree (QUEST) and Chi-squared Automatic Interaction Detector (CHAID), which are the most famous and popular members of the decision tree family.

SPSS_4_trees

Note that CART(R) is a registered trademark of California Statistical Software, Inc., and is licensed exclusively to Salford Systems, San Diego, California. So SPSS Clementine uses the name C&R Tree instead.

In SAS EM, there is only one decision tree node:

SAS_tree

The algorithms behind this node are called the SAS tree algorithms, which incorporate and extend the four mentioned above. Just by changing the settings in the decision tree node, you can get the tree you want.
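One way to picture “one node, many trees” is a single split-scoring routine whose criterion is just a setting. The Python sketch below is a rough illustration only: it is not SAS EM's implementation, and the criterion names, the `CRITERIA` table and the toy split are assumptions made for this example. Gini is the criterion usually associated with CART, entropy with the ID3/C4.5/C5.0 family, and the chi-square statistic with CHAID.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def impurity_reduction(left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two branches."""
    parent = left + right
    n = len(parent)
    children = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - children

def chi_square(left, right):
    """Pearson chi-square statistic of the branch-by-class contingency table."""
    parent = left + right
    n = len(parent)
    class_totals = Counter(parent)
    stat = 0.0
    for branch in (left, right):
        observed = Counter(branch)
        for cls, total in class_totals.items():
            expected = len(branch) * total / n
            stat += (observed[cls] - expected) ** 2 / expected
    return stat

# One routine, several "trees": the criterion is only a setting (illustrative names).
CRITERIA = {
    "gini": lambda l, r: impurity_reduction(l, r, gini),
    "entropy": lambda l, r: impurity_reduction(l, r, entropy),
    "chi-square": chi_square,
}

# A hypothetical candidate split of ten cases into two branches.
left = ["good"] * 4 + ["bad"]
right = ["bad"] * 4 + ["good"]

for name, score in CRITERIA.items():
    print(name, round(score(left, right), 4))
```

The candidate split is the same in all three cases; only the yardstick used to judge it changes, which is roughly what switching the criterion setting amounts to.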

Obviously, the SAS tree algorithms are superior to the separate ones in SPSS Clementine in extensibility and flexibility. But on the other hand, the complexity increases. A new user of SAS EM may wonder which tree he/she is actually training. An SPSS Clementine user just picks a node and says: OK, I am now training a CART or a CHAID. He/she can then communicate with others more smoothly.

Industry applications aside, I think this is the educational benefit of SPSS Clementine. Since almost every data mining book introduces decision trees algorithm by algorithm (such as ID3/C4.5/C5.0, CART, QUEST, CHAID, …), beginners using SPSS Clementine as an instructional tool can get a clear idea of the algorithms one by one. Once they fully understand the differences among the tree algorithms, they will train trees in SAS EM more comfortably.

What’s more, SPSS Clementine supplies rich supporting documentation for beginners and self-learners, such as the Tutorial, User Guide, Algorithms Guide and Node Reference. The official documentation for SAS EM 5.x and 6.x is relatively poor. Yes, there is good SAS Help and Documentation for SAS EM 4.3, including Getting Started with Enterprise Miner. But EM 4.3 is a traditional AF application, while EM 5.x and above are Java clients incorporated in the SAS analysis platform (they are totally different!). For EM 5.x and above, only installation guides and a plain reference are available.

SAS Institute may have its own marketing strategy. Although no rich references are available, the Institute DOES offer rich training programs in data mining and Enterprise Miner applications. Well, the big-budget purchasers of SAS EM can also afford the training.

Run data mining code following William Potts

FYI: SAS Enterprise Miner and SAS Text Miner Procedures: Reference for SAS 9.1.3, see:
 
 
This entry DOES exist on the SAS Support website, but it can’t be found through any search engine or documentation tree view. I recommend downloading these files immediately, since SAS hyperlinks tend to die quickly. ^-^
 
P.S. SAS Institute provides no support for the use of Enterprise Miner and Text Miner Procedures when they are invoked directly, outside of the Enterprise Miner graphical user interface.

Free Machine Learning Courses (Stanford) on YouTube

FYI:
 

SAS User Books and Data Mining Software Comparison: Quick Links

  1. SAS Books Catalog(Jan, 2009)
  2. Data Mining Software 2009: Successful Analyses at Affordable Prices (Nov. 2008, by mayato)

Industry Review: SAS and Teradata Partnership

SAS and Teradata Partnership: Press

  1. Leading Companies See Value in SAS and Teradata Partnership
  2. SAS and Teradata Unveil Advantage Program to Bring Powerful In-Database Solutions and Services to Customers
  3. SAS and Teradata Enter into Strategic Partnership


In the BI industry, pure players such as SAS, Teradata and MicroStrategy need to demonstrate their indispensable value against the megavendors: IBM (which acquired Cognos), SAP (which acquired Business Objects), Oracle (which acquired Hyperion) and Microsoft. Teradata is focused solely on the enterprise data warehouse. SAS, which dominates business analytics (e.g. advanced statistics and data mining), acts as a check and balance in the BI industry thanks to its privately held structure. The SAS and Teradata Advantage Program partnership covers a wide range of business lines, such as Analytics, AML (Anti-Money Laundering), Credit Risk, Enterprise Intelligence and Optimization Services. I think it’s an effective way for the two to learn from each other through mutual emulation and to counterbalance market concentration.

Data Mining in Stock Market

Data Mining in Stock Market? Is it crazy? Or is it just a hopeless try? Every mentor in mathematics and finance teaches us that the stock market is too chaotic and sentimental for mathematical models. Most of the gifted rocket scientists concentrate on the study of interest rates and fixed income securities. It sounds profitable to use mathematical and statistical models to predict stock prices, but there are few success stories.

I know I may be clinging to some academic doctrines, so I am interested in following any effort to forecast stock prices using data mining techniques. Some links from a popular data mining blog, Data Mining Research, are listed as follows: