Monthly Archive for September 2011

 
 

An Analytical Valley: Big Data and Data Scientists (and SAS Programmers)

Tom Davenport observed that Silicon Valley is becoming more analytical, since Valley companies such as Google, Facebook, eBay, and LinkedIn all have a strong presence in analytics. To these prominent companies I’d add Yahoo, even though it is past its peak: Yahoo is the largest sponsor of and contributor to Hadoop, an open source framework for distributed processing of so-called “big data”. A look at the outstanding data teams at Facebook or LinkedIn shows that Hadoop is also one of their most successful technical ingredients. These Valley companies are themselves huge consumers of big data and have strong incentives to develop analytical solutions beyond their high-technology product pipelines.

The analytical staff at LinkedIn have also done a lot to promote wide use of the term “data scientist”. They identify themselves as data scientists, and that’s really cool. More and more statisticians are now glad to accept this brand-new title. According to a survey at JSM (2011, Miami), more than 85% (164) of the statisticians there considered themselves “data scientists”.

McKinsey also released a report this May on big data and the huge shortage of qualified analytical talent. You know that when a management consulting firm starts talking about something technical, following the discussion is no longer just a matter of fashion. To embrace the challenge of big data, a person or a team needs a multidisciplinary background: basically, computer science and statistics (data mining and machine learning are just interdisciplinary subjects spanning the two). Here is an ambitious list on “How do I become a data scientist”:

http://www.quora.com/Educational-Resources/How-do-I-become-a-data-scientist

For these learning plans, just get the gist and don’t take them too seriously. Assess yourself and set your own priorities.

Notes for SAS Programmers

For SAS programmers, I read an exciting post saying that, besides High Performance Computing, SAS will also work with Hadoop by introducing some functionality in SAS/Access and SAS Data Integration Studio.

For SAS programmers with no IT background, it is not a good idea to jump straight into algorithms, data structures, and other hard-core computer science courses. Instead, I recommend taking full advantage of the SAS language and system itself to ease into the computer world gradually:

1. Learn and practice SAS Proc SQL, which largely follows the SQL-92 standard. SQL is the common language of the database world, and SAS Proc SQL can help you switch smoothly to Oracle SQL, Teradata SQL, MySQL, and other SQL implementations, although they differ in some non-critical details.

2. Dig into the operating-system-specific documentation of SAS; for SAS 9.3, that is SAS 9.3 Companion for Windows, SAS 9.3 Companion for UNIX Environments, or another companion, depending on the OS you work on. These are critically important documents, but unfortunately they are often missing from SAS programmers’ reading lists.

Such docs help SAS programmers deal with the machine and expose them to the wider computer world in a way a SAS programmer can understand. You can’t expect to become a computer expert from such docs, but at least you will be able to communicate fluently with internal IT staff.

3. Then you will have the confidence to play with computers and can switch to any other topic that interests you in the list above!
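To make point 1 a little more concrete, here is a minimal sketch of the kind of plain SELECT / GROUP BY / HAVING query that carries over between SQL implementations. It is written in Python against an in-memory SQLite database only so that it runs anywhere; the table name visits, its columns, and the three rows are made up for illustration. In SAS, the same SELECT statement would sit between proc sql; and quit;.

```python
import sqlite3

# Hypothetical toy table, invented for this example only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient_id INTEGER, site TEXT, score REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, "A", 2.0), (2, "A", 4.0), (3, "B", 5.0)],
)

# Only standard clauses are used, so the statement itself is portable.
query = """
SELECT site, COUNT(*) AS n, AVG(score) AS mean_score
FROM visits
GROUP BY site
HAVING COUNT(*) > 1
ORDER BY site
"""
result = conn.execute(query).fetchall()
print(result)  # [('A', 2, 3.0)]
```

Only site A has more than one row, so it is the only group that survives the HAVING clause.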

Four Errors in Fisher’s Iris Data in the SAS 9.2 SASHELP Library

In the previous post, I mentioned that Fisher’s Iris Data is officially embedded in the SASHELP library in SAS 9.2. Note that even in SAS 9.1.3 you can find this data in several places in demos from the user guides (just search for "Iris" in the "SAS Help and Documentation" that accompanies your SAS product), for example in SAS 9.1.3 IML.

The Iris dataset is so important and popular that researchers around the world use it as a benchmark to test and compare their algorithms, and also for teaching purposes. It is also by far the No. 1 dataset by popularity in the UCI Machine Learning Repository. Here are 4 errors in SASHELP.iris, listed for your consideration in case you find slight differences in output when following demos from outside SAS that use this data:

Error 1: Line 35, the PetalWidth of Setosa should be 2 mm, not 1 mm;

Error 2: Line 38, the SepalWidth of Setosa should be 36 mm, not 31 mm;

Error 3: Line 38, the PetalLength of Setosa should be 14 mm, not 15 mm;

Error 4: Line 119, the PetalLength of Virginica should be 69 mm, not 70 mm.
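The four corrections above can be written down as a small table and applied mechanically. The sketch below is in Python rather than SAS and assumes the data has already been exported to a list of dicts, one per observation in the original row order, with integer values in mm; the function name apply_corrections is my own.

```python
# The four errors listed above, as
# (1-based line number, column, erroneous value, corrected value), all in mm.
CORRECTIONS = [
    (35, "PetalWidth", 1, 2),
    (38, "SepalWidth", 31, 36),
    (38, "PetalLength", 15, 14),
    (119, "PetalLength", 70, 69),
]


def apply_corrections(rows, corrections=CORRECTIONS):
    """Return a corrected copy of rows (a list of dicts, one per observation)."""
    fixed = [dict(row) for row in rows]  # copy, so the input is left untouched
    for line_no, column, found, expected in corrections:
        row = fixed[line_no - 1]
        # Overwrite only if the erroneous value is still present, so applying
        # the function to an already-corrected copy changes nothing.
        if row[column] == found:
            row[column] = expected
    return fixed
```

Because each correction checks the current value first, the same function can be run safely on a version that contains only some of the errors.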

For errors 1-3, there is also an interesting story in the statistical literature. In 1936, Fisher the Great published his famous paper, The use of multiple measurements in taxonomic problems, with the Iris data attached (called the Fisher Version in this post). In the following years (until today), people cited this paper, and the Fisher Version of the Iris data was replicated and distributed worldwide; at some point a version with errors 1-3 above gained a dominant popularity (I don’t know the source of these errors). In the UCI Machine Learning Repository, the dataset iris.data is the one with these 3 errors (called the UCI Version here).

We can see that the copied UCI Version is, to some extent, even more popular than the original Fisher Version (SASHELP.iris also seems to have been copied from the UCI Version). The story goes on. In 1998, James Bezdek and other scholars found the three discrepancies between the Fisher Version and the UCI Version of Iris (and in some published papers using the same version of the data). You can read about it in Will the Real Iris Data Please Stand Up?

Bezdek then proposed using the original Fisher Version of Iris. The UCI Machine Learning Repository documented these three errors and added a new dataset called bezdekIris.data (the Bezdek Version), which is exactly the Fisher Version (iris.data was kept, I think because the so-called error version is by now valuable in its own right).
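If you have both files at hand, the discrepancies are easy to locate yourself. Here is a minimal Python sketch of a cell-by-cell comparison; the column labels and function names are my own, and the two example records are record 38 as it appears in the UCI Version and in the Bezdek Version (values in cm). To compare whole files, pass the text of iris.data and bezdekIris.data to read_rows instead.

```python
import csv
import io

# Column labels chosen for this sketch; the data files themselves have no header.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def read_rows(text):
    """Parse comma-separated iris records, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def diff_versions(rows_a, rows_b):
    """List (line number, column, value in a, value in b) for each differing cell."""
    differences = []
    for line_no, (a, b) in enumerate(zip(rows_a, rows_b), start=1):
        for column, value_a, value_b in zip(COLUMNS, a, b):
            # Values are compared as text, exactly as written in the files.
            if value_a != value_b:
                differences.append((line_no, column, value_a, value_b))
    return differences


uci_38 = read_rows("4.9,3.1,1.5,0.1,Iris-setosa\n")
bezdek_38 = read_rows("4.9,3.6,1.4,0.1,Iris-setosa\n")
print(diff_versions(uci_38, bezdek_38))
# [(1, 'sepal_width', '3.1', '3.6'), (1, 'petal_length', '1.5', '1.4')]
```

The line number in this tiny example is 1 because only a single record is compared; on the full files it would be the record's position in the file.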

Returning to error 4: I can’t figure out where it comes from, so I might as well call this the Iris SAS Version. Note that the unit in the SAS Version is millimeters (mm), while all the other versions use centimeters (cm).

Interestingly, I also checked the Iris data in SAS 9.1.3 IML mentioned earlier and, not surprisingly, it is exactly the Fisher Version (you can also find a correct one in a demo from SAS 9.2 IML Studio 3.2).

The following code generates several Iris versions:

iris_uci: UCI Version with both cm and mm as units

bezdekiris_uci: Bezdek Version (i.e. Fisher Version) with both cm and mm as units

iris_mm: UCI Version with mm as the unit and attributes like those of SASHELP.iris (the SAS Version)

bezdekiris_mm: Bezdek Version (i.e. Fisher Version) with mm as the unit and attributes like those of SASHELP.iris (the SAS Version)
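As a rough illustration of what the mm datasets involve (this is a Python sketch of the idea, not the SAS code itself), converting one cm record into a SASHELP.iris-style record could look like the following. The variable names PetalWidth, SepalWidth, and PetalLength are the ones used in the error list above; SepalLength, Species, and the species relabeling (dropping an "Iris-" prefix) are my assumptions about the target layout.

```python
def cm_to_mm(value_cm):
    """Convert a measurement in cm (one decimal place) to whole millimeters."""
    # Multiplying a decimal float by 10 can leave a tiny binary rounding
    # error, so the product is rounded to the nearest integer.
    return round(float(value_cm) * 10)


def to_mm_record(row):
    """Map [sepal_length, sepal_width, petal_length, petal_width, species] in cm
    to a dict in mm with SASHELP.iris-style attribute names (assumed layout)."""
    sepal_length, sepal_width, petal_length, petal_width, species = row
    return {
        # Assumed label convention: "Iris-setosa" becomes "Setosa".
        "Species": species.replace("Iris-", "").capitalize(),
        "SepalLength": cm_to_mm(sepal_length),
        "SepalWidth": cm_to_mm(sepal_width),
        "PetalLength": cm_to_mm(petal_length),
        "PetalWidth": cm_to_mm(petal_width),
    }
```

Applied to record 38 of the Bezdek Version, ["4.9", "3.6", "1.4", "0.1", "Iris-setosa"], this gives SepalWidth 36 and PetalLength 14, the corrected values from errors 2 and 3.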

