Category Archive ‘Industry Review‘

 
 

An Analytical Valley: Big Data and Data Scientists (and SAS Programmers)


Tom Davenport observed that Silicon Valley is becoming more analytical, since Valley companies such as Google, Facebook, eBay and LinkedIn all have a strong presence in analytics. Besides these prominent companies, I’d also like to add Yahoo to the list, although Yahoo is past its peak. Yahoo is the largest sponsor of and contributor to Hadoop, an open source framework for distributed processing of so-called “big data”. A look at the outstanding Facebook or LinkedIn data teams shows that Hadoop is also one of their most successful technical building blocks. These Valley companies are themselves huge consumers of big data and have strong incentives to develop analytical solutions beyond their high-technology product pipelines.

LinkedIn’s analytical staff have also done a lot to popularize the term “data scientist”. They identify themselves as data scientists, and that’s really cool. More and more statisticians are now glad to accept this brand-new title as well. According to a survey at JSM (2011, Miami), more than 85% (164) of the statisticians surveyed considered themselves “data scientists”.

McKinsey also released a report this May on big data and the huge shortage of qualified analytical talent. When a management consulting firm starts discussing something technical, following the concept is no longer merely a matter of fashion. To embrace the challenge of big data, a person or a team needs a multidisciplinary background: basically, computer science and statistics (data mining and machine learning are interdisciplinary subjects that draw on both). Here is an ambitious list on “How do I become a data scientist”:

http://www.quora.com/Educational-Resources/How-do-I-become-a-data-scientist

Take these learning plans as general guidance rather than too literally. Assess your own skills and set your own priorities.

Notes for SAS Programmers

For SAS programmers, I read an exciting post saying that, besides High Performance Computing, SAS will also work with Hadoop by introducing new functionality in SAS/ACCESS and SAS Data Integration Studio.

For SAS programmers with no IT background, it is not a good idea to jump straight into algorithms, data structures and other hard-core computer science courses. Instead, I recommend taking full advantage of the SAS language and system itself to enter the computing world gradually:

1. Learn and practice SAS Proc SQL, which largely follows the SQL-92 standard. SQL is the common language of the database world, and SAS Proc SQL can help you switch smoothly to Oracle SQL, Teradata SQL, MySQL and other SQL implementations, although they differ in some non-critical details.

2. Dig into the operating-system-specific documentation of SAS, for example, in SAS 9.3, the SAS 9.3 Companion for Windows, the SAS 9.3 Companion for UNIX Environments, or others depending on the OS you are working on. These documents are critically important but are unfortunately often missing from SAS programmers’ reading lists.

Such docs help SAS programmers deal with the machine and expose them to the wider computing world in a way that a SAS programmer can understand. You can’t expect to become a computer expert from such docs, but at least you will be able to communicate fluently with internal IT staff.

3. By then you will have the confidence to work with computers and can move on to any other topic in the list above that interests you!
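
To illustrate the first step above, here is a minimal sketch of the kind of standard SELECT / GROUP BY / HAVING query that carries over between SQL implementations. It is written in Python with the built-in sqlite3 module standing in for a database engine, only because this post contains no SAS code of its own; the table `sales` and its values are hypothetical and invented for illustration. In SAS, a query of the same shape would sit between `proc sql;` and `quit;`.

```python
import sqlite3

# In-memory database; the table name and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("East", 100.0), ("East", 250.0), ("West", 75.0)],
)

# A plain aggregate query using only long-standing standard clauses.
query = """
    SELECT region, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM sales
    GROUP BY region
    HAVING SUM(amount) > 50
    ORDER BY region
"""
rows = conn.execute(query).fetchall()
for region, n_orders, total in rows:
    print(region, n_orders, total)

conn.close()
```

Vendor-specific details (date functions, string handling, limits on returned rows) are where the implementations tend to diverge, so queries restricted to clauses like these are the safest ones to practice first.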

Decision Trees in SAS Enterprise Miner and SPSS Clementine

Decision trees are included in SAS Enterprise Miner (EM). The counterpart is SPSS Clementine, which, to be precise, has been called IBM SPSS Modeler since IBM’s acquisition of SPSS.

Recently I read a paper comparing the decision tree and clustering techniques of SAS EM, SPSS Clementine and IBM Intelligent Miner:

Decision Tree Induction & Clustering Techniques in SAS Enterprise Miner, SPSS Clementine, and IBM Intelligent Miner – A Comparative Analysis by Abdullah M. Al Ghoson, Virginia Commonwealth University

The conclusion is not that surprising: SAS EM does better in performance, functionality and auxiliary task support, but worse in usability.


Here are a few comments on the decision tree implementations in SAS EM and SPSS Clementine, based on my own experience. Some advice for beginners is also included.

SPSS Clementine has four nodes that support four tree algorithms respectively: C5.0, Classification And Regression Trees (CART), Quick, Unbiased, Efficient Statistical Tree (QUEST) and Chi-squared Automatic Interaction Detector (CHAID), which are the most famous and popular members of the decision tree family.

Note that CART(R) is a registered trademark of California Statistical Software, Inc., and is licensed exclusively to Salford Systems, San Diego, California. That is why SPSS Clementine uses the name C&R Tree.

In SAS EM, there is only one decision tree node:

The algorithms behind this node are called the SAS tree algorithms, which incorporate and extend the four mentioned above. By simply changing the settings of the decision tree node, you can get the tree you want.

Obviously, the SAS tree algorithms are superior to the separate ones in SPSS Clementine in extensibility and flexibility. On the other hand, the complexity increases. A new user of SAS EM may wonder which tree he/she is actually training. An SPSS Clementine user just picks a node and says: OK, I am now training a CART or a CHAID tree. He/she can therefore communicate with others more smoothly.

Industry applications aside, I think this is the educational benefit of SPSS Clementine. Since almost every data mining book introduces decision trees as separate algorithms (such as ID3/C4.5/C5.0, CART, QUEST, CHAID, . . .), beginners using SPSS Clementine as an instructional tool can get a clear idea of the algorithms one by one. Once they fully understand the differences among the tree algorithms, they will train trees in SAS EM more comfortably.
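
To make one such difference concrete, here is a minimal Python sketch (not taken from either product) of two split criteria: the Gini impurity reduction commonly used by CART for classification, and the Pearson chi-squared statistic that CHAID-style trees build on. The function names and the toy data are hypothetical, and the sketch computes only the raw statistic; a real CHAID implementation also merges categories and works with adjusted p-values.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(parent, children):
    """Impurity reduction obtained by splitting parent into children."""
    n = len(parent)
    weighted = sum(len(ch) / n * gini(ch) for ch in children)
    return gini(parent) - weighted


def chi_square(children):
    """Pearson chi-squared statistic of the branch-by-class count table."""
    total = sum(len(ch) for ch in children)
    class_totals = Counter(y for ch in children for y in ch)
    stat = 0.0
    for ch in children:
        if not ch:
            continue  # an empty branch contributes nothing
        counts = Counter(ch)
        for cls, cls_total in class_totals.items():
            expected = len(ch) * cls_total / total
            stat += (counts[cls] - expected) ** 2 / expected
    return stat


# Toy binary split, invented for illustration only.
left = ["yes"] * 8 + ["no"] * 2
right = ["yes"] * 2 + ["no"] * 8
parent = left + right

print("Gini gain:", round(gini_gain(parent, [left, right]), 4))
print("Chi-squared:", round(chi_square([left, right]), 4))
```

Both numbers score the same candidate split; the algorithms differ in which score they use to rank splits and in how they decide when to stop growing the tree.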

What’s more, SPSS Clementine supplies rich supporting documentation for beginners and self-learners, such as the Tutorial, User Guide, Algorithms Guide and Node Reference. The official documentation of SAS EM 5.x and 6.x is relatively poor. True, there is good SAS Help and Documentation for SAS EM 4.3, including Getting Started with Enterprise Miner. But EM 4.3 is a traditional AF application, while EM 5.x and above are Java clients incorporated into the SAS analysis platform (they are totally different!). For EM 5.x and above, only installation guides and a plain reference are available.

SAS Institute may have its own marketing strategy. Although rich references are not available, the Institute DOES offer rich training programs in data mining and Enterprise Miner applications. Well, the big-budget purchasers of SAS EM can also afford the training.

R or SAS: Quick Links to the Recent Debates


Original post, 7 Jan, 2009

Key Point
The popularity of R at universities could threaten SAS Institute.

A Controversial Review by Anne Milley from SAS
We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.

7 Jan, 2009,
SAS-L
Discussion on SAS-L, the most popular SAS mailing list. Most voices call for using both R and SAS together.

7 Jan, 2009,
R-help
Cheers for the victory of R.

8 Jan, 2009,
Ashlee Vance’s blog
R You Ready for R, with lots of comments

9 Jan, 2009,
SAS Consulting

9 Jan, 2009, Anne Milley
This Post Is Rated R, stating SAS’s viewpoint on open source software: support and participation.
For more, see Google Blog Search.

Links: Risk Intelligence Vendors Review: 2008

You can get the big picture by viewing different sources (REMEMBER: a vendor’s research methodology is as important as its rating):

Chartis RiskTech 100 (October 2008)

FinTech100(2008)
FinTech100(2008): Top 25 Enterprise Companies
FinTech100(2008): Banking Top 10
FinTech100(2008): Capital Market Top 10
FinTech100(2008): Insurance Top 10
Celent Credit Risk/Basel II Vendors (2008)

Links–BI Industry 2008: Review and Prospect

/*Thanks for the hints supplied by:

A look back at 2008 and some crystal ball predictions…, by Tammi Kay George, from SASBlog*/

Major Data Warehousing Events of 2008 (and Predictions for 2009), by Michael Schiff, from TDWI

Major Data Warehousing Events of 2008:

  • Everyone had an appliance story
  • Industry consolidations continued
  • The recessionary environment encouraged further BI developments
  • Open source grew

Predictions for 2009:

  • Further industry consolidation (Informatica by HP, SPSS by SAP)
  • Cloud computing will come down to earth
  • Open source growth will accelerate
  • The IT world will become greener
  • Major emphasis on solutions rather than tools and technology

Business Intelligence Tools: Year in Review, by Cindi Howson, from BeyeNetwork

Top Virtualization Trends for 2009, by John Suit, from ZDNet

Surround the Warehouse: Prediction for 2009, by Neil Raden, from IntelligentEnterprise

Industry Review: SAS and Teradata Partnership

SAS and Teradata Partnership: Press

  1. Leading Companies See Value in SAS and Teradata Partnership
  2. SAS and Teradata Unveil Advantage Program to Bring Powerful In-Database Solutions and Services to Customers
  3. SAS and Teradata Enter into Strategic Partnership


In the BI industry, pure players such as SAS, Teradata and MicroStrategy need to demonstrate their indispensable value against the megavendors: IBM (which acquired Cognos), SAP (which acquired Business Objects), Oracle (which acquired Hyperion) and Microsoft. Teradata is solely focused on enterprise data warehousing. SAS, dominant in business analytics (e.g. advanced statistics and data mining), acts as a check and balance in the BI industry thanks to its privately held structure. The SAS and Teradata Advantage Program partnership covers a wide range of business lines, such as Analytics, AML (Anti-Money Laundering), Credit Risk, Enterprise Intelligence and Optimization Services. I think it’s an effective way for the two to learn from each other through mutual emulation and to counterbalance market concentration.

The Making of an Analyst: A Supplement to What Makes a Good Business Analyst

I once commented on the entry What Makes a Good Business Analyst by Rajan Chandras, in a light tone, under the title If You Can Make it Here, You Can Make it Anywhere. The standards for a good analyst that Rajan draws up set, in my opinion, a very high bar.

In the recent SASCOM Magazine, Ted Cuzzillo published a relatively moderate entry, The Making of An Analyst. This piece is aimed at fresh graduates looking for their first job as analysts. Indeed, the two posts are more complementary than contradictory: Rajan’s targets are veteran analysts with years of experience.

If You Can Make it Here, You Can Make it Anywhere: On What Makes a Good Business Analyst by Rajan Chandras

In the latest post, What Makes a Good Business Analyst?, Rajan Chandras cites some soft items from Forrester’s Business Analyst Assessment Workbook:

  • Ability to think abstractly, identify patterns, and generate ideas and solutions
  • Understanding of when and how to escalate issues or needs
  • Understanding of and ability to deliver the appropriate level of detail needed for each task
  • Interest in exploring and understanding new concepts and topic areas
  • Emotionally invested in the work
  • Ability to learn by shadowing stakeholders
  • Ability to clearly articulate technology in terms stakeholders can understand
  • Understanding of the organizational culture and its impact on processes and projects (this one seems obvious, but the latter phrase is more profound than might seem at first glance)
  • Ability to drive a decision analysis and selection process
  • Ability to recognize patterns in requirements and categorize them appropriately

What’s more, there are some suggestions by Rajan Chandras himself:

  • Know the organization’s external environment: its competitive position, current state of the industry, geographical & social factors, etc.
  • Know the organization’s internal environment: its financial position, organization culture, IT maturity, etc.
  • Adapt to the needs (your language, dress etc.), but be yourself. Imperfect, yet genuine, is fine; falsity comes through easily, and will destroy your credibility in no time.

No doubt, no boss could reject such a perfect analyst. But I’m afraid these standards suit every kind of professional. That is to say, they create a model that explains everything: it is too universal to serve as a good filter for selecting the most suitable analysts. Such a person may be even more marketable in any other line of business.