Category Archive ‘Database‘

 
 

Map and Reduce in MapReduce: a SAS Illustration

In last post, I mentioned Hadoop, the open source implementation of Google’s MapReduce for parallelized processing of big data. In this long National Holiday, I read the original Google paper, MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat and got that the terminologies of “map” and “reduce” were basically borrowed from Lisp, an old functional language that I even didn’t play “hello world” with. For Python users, the idea of Map and Reduce is also very straightforward because the workhorse data structure in Python is just the list, a sequence of values that you can just imagine that they are the nodes(clusters, chunk servers, …) in a distributed system.

MapReduce is a programming framework and really language independent, so SAS users can also get the basic idea from their daily programming practices and here is just a simple illustration using data step array (not array in Proc FCMP or matrix in IML). Data step array in SAS is fundamentally not a data structure but a convenient way of processing group of variables, but it can also be used to play some list operations like in Python and other rich data structure supporting languages(an editable version can be founded in here):

MapReduce

Follow code above, the programming task is to capitalize a string “Hadoop” (Line 2) and the “master” method is just to capitalize the string in buddle(Line 8): just use a master machine to processing the data.

Then we introduce the idea of “big data” that the string is too huge to one master machine, so “master method” failed. Now we distribute the task to thousands of low cost machines (workers, slaves, chunk servers,. . . in this case, the one dimensional array with size of 6, see Line 11), each machine produces parts of the job (each array element only capitalizes a single letter in sequence, see Line 12-14). Such distributing operation is called “map”. In a MapReduce system, a master machine is also needed to assign the maps and reduce.

How about “reduce”?  A “reduce” operation is also called “fold”—for example, in Line 17, the operation to combine all the separately values into a single value: combine results from multiple worker machines.

An Analytical Valley: Big Data and Data Scientists (and SAS Programmers)

hadoop

Tom Davenport reported an observation that Silicon Valley is becoming more analytical since companies in the Valley such as Google, Facebook, eBay, LinkedLn all have strong presences in analytics. Besides such predominant companies, I’d also like to add Yahoo to the list although Yahoo is no longer in its peak. Yahoo is the largest sponsor and contributor of Hadoop, an open source framework for distributed processing of so called “big data”. When taking a look at the outstanding Facebook data team or LinkedIn data team, we can see that Hadoop is also one of the most overwhelmingly successful technical factors. Such Valley companies themselves are the huge consumers of big data and have strong incentives to develop analytical solutions beyond their high technology product pipelines.

Analytical staffs in LinkedLn also helps a lot to promote the widely usage of the term “data scientist”. They identify themselves as data scientists and that’s really cool. Now more and more statisticians are also very glad to accept this brand new title. According to a survey in JSM (2011, Miami), more than 85% (164) statisticians there considered themselves “data scientists”.

McKinsey also released a report this May on big data and the huge gap of qualified analytical talents. You know when a management consulting firm begins to talk something technical, it is no longer a fashion to follow the discussion of the concept. To embrace the challenge of big data, one or the team needs multidiscipline background—basically speaking, computer science and statistics (and data mining or machine learning is just an interdisciplinary subject of them). Here is an ambitious list on “How do I become a data scientist”:

http://www.quora.com/Educational-Resources/How-do-I-become-a-data-scientist

For these learning plans, just feel the meaning and don’t take it too seriously. Check yourself and set up your own priority.

Notes for SAS Programmers

For SAS programmers, I read an exciting post besides High Performance Computing that SAS will also play with Hadoop by introducing some functionality in SAS/Access and SAS Data Integration Studio.

For SAS programmers with no IT background, it is not a good idea to jump into algorithms and data structures and other hard core computer courses immediately. Instead I recommend to take the full advantages of SAS language and system itself to dive into computer world gradually:

1. Learn and practice and practice SAS Proc SQL which is compliant with the SQL-92 standard. SQL is the common language in database world and SAS Proc SQL can help you switch smoothly to Oracle SQL, Teradata SQL, MySql SQL and other SQL implementations although there are some non-critical differences in details.

2. Dig into the operating system specific documentation of SAS, for example in SAS 9.3,  SAS 9.3 Companion for Windows or SAS 9.3 Companion for UNIX Environments or others depending the OS you are working on. They are the critical important documentations but unfortunately often missed in SAS programmers’  reading list.

Such docs will help SAS programmers to deal with the machines and expose to the wide computer world in a way that a SAS programmer can understand. You can’t expect to be an expert on computer via such docs, but at least you can communicate fluently with internal IT staff.

3. Then you get all the confidences to play with computer and can switch to any other topics interested in the list above!

Who is Alfred?

Tell me something about Alfred, male or female? age? height and weight?

Oracle database (version 9 and below) had a well known default demo account SCOTT with a password, TIGER (and TIGER was the name of the real person Bruce Scott ’s cat, see) and in this account, there are some tables named DEPT, EMP, BONUS and SALGRADE (you can read their meaning). Almost every Oracle DBA learn SQL using these database and an joke just says that in DBA’s meetings, people just  warm up saying “how about Smith?” And you should know that in the database, Smith is a clerk and his boss is Ford (whose boss is Jones)!

In the beginning I also raise a question for SAS programmers: who is Alfred? Don’t give quick answer such that “Alfred who”. Actually, you should already go through with Alfred very well as a SAS programmer:

proc print data=sashelp.class;
    where name="Alfred";
run;

As a clinical SAS programmer, I play with data, get acquaintance with the data and subjects and then subjects are no longer “subject”. They have identities and  Alfred is a 14 years old boy. I have such habit mostly because in clinical world, data are very expensive (not like the massive transaction data in financial industry) and should be took more care.

I dare say that “class” is the most famous SAS dataset in sashelp library and then in the SAS world. The first dataset used for demo is almost this “class”. I just did a quick Google search, “sas sashelp.class” returns about 44,400 results. Hope you can find any other SAS datasets to beat it.

Alfred in “class” pops into my mind because today, I do find a strong candidate. In SAS 9.2 (and 9.3), the sashelp library has a new member, Iris. YES, it is the “Fisher Iris Flower Data”, which can be safely considered the most famous and most  used dataset in machine learning and data mining papers and statistical applications. Currently it has only 859 hits in Google, I think the number will reach high accompany with the wide use of SAS 9.2 and above, and to enforce my prediction, I will definitely play with the Iris data in the following future!

Work With Oracle: A Quick Sheet for SAS Programmers

(Note: All the followings are tested on Windows XP environment.)

0. Install Oracle Database 10g Express Edition

Fast (and free) to download, easy to deploy and simple to admin–for learning and testing purpose, Oracle Database 10g Express Edition (Oracle Database XE, a mini version of Oracle Database 10g Release 2) are strongly recommended:

0.1 Download it at its homepage(206MB);

0.2 Install it following default settings;

0.3 Unlock accounts for HR and SCOTT. In a windows DOS prompt, using the following scripts:

sqlplus sys/YourSysPassword as sysdba
alter user HR account unlock
alter user HR identified by YourHRPassword
alter user SCOTT account unlock
alter user SCOTT identified by TIGER
Or you can accomplish these tasks within the Oracle web application:
http://localhost:8081/apex

1. Connect Oracle using SAS libname engine

libname SCOTT oracle user=”SCOTT” password=”TIGER” path=’xe’ ;

Note: xe is the default path for Oracle Database XE.

2. Connect Oracle using SQL Procedure Pass-Through Facility

proc sql ;
connect to oracle as orcl
(user=”SCOTT” password=”TIGER” path=’xe’);

select *
from connection to orcl

(

SELECT …

) ;

disconnect from orcl;

quit;

Note: 1) In this approach, Oracle, instead of SAS, processes the SQL statement (i.e., you use the more powerful and flexible Oracle SQL syntax instead of SAS Proc SQL procedure for querying. For more, see SAS Doc).

2) Your Oracle SQL codes should be placed in the RED blocks, and end without a semicolon(;):

select *

from emp

3) Only basic Oracle SQL statements (select, create table, . . .)can pass through this facility.

3. Load SAS datasets to Oracle database

proc copy in=sashelp
out=scott;
select class;
run;

4. Misc: Some differences between Oracle SQL and SAS Proc SQL

4.1 Table aliases

Oracle: from employees a;

SAS: from employees a; or

from employees as a;

4.2 Column aliases

Oracle: don’t use single quotation marks(‘’).

select job_id as job, job_id job, job_id as “job” , job_id as ‘job’

SAS:

select job_id as job, job_id “job” , job_id ‘job’, job_id job

4.3 Literal Character Strings

Oracle: Date and character literal values must be enclosed within single quotation marks(‘’).

SAS: Use both single and double quotation marks.

4.4 Order the null value

Oracle: Null values are displayed last for ascending sequences and first for descending sequences.

SAS: Null values are displayed first for ascending sequences and last for descending sequences.

4.5 Nesting Group Functions

select max(avg(salary))
from employees
group by department_id

SAS log:

ERROR: Summary functions nested in this way are not supported.

Optional approach for SAS:

select max(avg) as max
from(
select avg(salary) as avg
from employees
group by department_id )

4.6 row number(_n_)

Oracle:

select rownum
from employees

SAS:

select monotonic()
from employees

Note: monotonic() is an undocumented function of SAS, seehttp://www2.sas.com/proceedings/sugi29/040-29.pdf