Happy New Year (Yes Again)

So I feel great to reset my 2012 as a brand new year.

Today, Jan 23, is the first day of Chinese New Year, and it is Monday, the first day of the work week. It is always joyful to have such a coincidence :). HAPPY NEW YEAR!

I made a big move this year. Actually I passed through multiple new years in 2012: I took my flight from Beijing on Jan 1 (happy new year!) and landed in Raleigh, North Carolina on the same day, thanks to the time difference between China and the US, where I have taken a new job as a Life Sciences Consultant at d-Wise Technologies. I’m not supposed to be a pure SAS programmer any more, but I don’t want to change the theme of my blog (“Hello World by A SAS Programmer”), since my handiest language is still SAS and my new role will expose me to more SAS products and solutions.

Chinese New Year is also called “Spring Festival” and “spring” is always the key word for the holiday:

(Chinese characters for “the Beauty of Spring”. I wrote these years ago in Beijing.)

And I also want to share my spring greetings with a few lines from Walt Whitman’s These I Singing in Spring:

THESE, I, singing in spring, collect for lovers,

. . .

Collecting, I traverse the garden, the world—but soon

I pass the gates,

Now along the pond-side—now wading in a little, fearing not the wet,

Now by the post-and-rail fences, where the old stones

thrown there, pick’d from the fields, have accumulated,

(Wild-flowers and vines and weeds come up through

the stones, and partly cover them—Beyond these I pass,)

Far, far in the forest, before I think where I go,

Solitary, smelling the earthy smell, stopping now and then in the silence,

Alone I had thought—yet soon a troop gathers around me,

Some walk by my side, and some behind, and some embrace my arms or neck,

They, the spirits of dear friends, dead or alive—thicker

they come, a great crowd, and I in the middle,

Collecting, dispensing, singing in spring, there I wander with them,

Plucking something for tokens—tossing toward whoever is near me;

. . .

Subscribe to from a logical point of view by Email

Vim as A SAS IDE

A few configurations (just copy this sas.vim file to C:\Program Files\vim\vim73\syntax if you also use gVim 7.3 on Windows) make Vim a simple SAS IDE where:

F3: run SAS code (in batch mode)
F4: close the other two windows (after F3 runs, the active window is the Log window; F4 jumps back to the SAS file in a full window)

F5: jump to SAS file
F6: jump to Log file
F7: jump to lst file (list output)
F8: keep only the current window (full window)

****************

Details and Credits

1. The first post on Vim and SAS I read was by Xiaowei Wang, in Chinese.

The original SAS syntax file was taken from Zhenhuan Hu.

Kent Nassen also maintains some Vim functions to run SAS code and check the log.

2. To run SAS code using F3:

map <F3> :w<CR>:!SAS % -CONFIG "C:\Program Files\SAS\SASFoundation\9.2\nls\en\SASV9.CFG"<CR>:sp %<.lst<CR>:sp %<.log<CR>

3. Close other windows using F4:

map <F4> :close<CR>:close<CR>

4. Keep only current window using F8:

map <F8> :only<CR>

5. Jump to SAS file using F5:

map <F5> :e %<.sas<CR>

6. Jump to Log file using F6:

map <F6> :e %<.log<CR>

7. Jump to Lst file using F7:

map <F7> :e %<.lst<CR>
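
Once F3 has run, the usual next step is to look through the log for problems. As a rough sketch only (this is not part of the sas.vim file, nor one of Kent Nassen's functions), the same check could be scripted in Python. It assumes the batch run writes a .log file next to the .sas file, and that problem lines start with ERROR or WARNING followed by a colon; the names log_path_for and find_problems are made up for illustration.

```python
import os
import re

# Assumed pattern: a line that starts with ERROR or WARNING, optionally
# followed by a message number (e.g. "ERROR 22-322"), and then a colon.
PROBLEM = re.compile(r"^(ERROR|WARNING)[ \d-]*:")


def log_path_for(sas_file):
    """Return the .log path next to a .sas file (what %<.log expands to in Vim)."""
    root, _ext = os.path.splitext(sas_file)
    return root + ".log"


def find_problems(log_lines):
    """Return (line_number, text) pairs for lines that look like errors or warnings."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if PROBLEM.match(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

A typical use would be to open the file returned by log_path_for("demo.sas"), pass its lines to find_problems, and print the pairs that come back.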


My Collection of SAS Macro Repositories

I just found that the most effective and safest way to synchronize bookmarks across machines is to make them Google-searchable, i.e., to put them online.

The following is my personal collection of SAS macro repositories (I will keep it updated as I find new sites and receive your input). Most of them are rich, well documented and easy to navigate and review:

/***General***/

1. SAS Macros by Richard DeVenezia

http://www.devenezia.com/downloads/sas/macros/index.php

Richard is a very active contributor to SAS-L. He also plays with Java, Perl, PHP and JavaScript, and you can find all this code on his homepage. Besides the well-organized macros, there are some interesting utilities:

http://www.devenezia.com/downloads/sas/samples/

2. Roland’s SAS Macros

http://www.datasavantconsulting.com/roland/Spectre/maclist2.html

Roland, a proficient SAS programmer from Europe, also supplies two SAS applications:

Spectre – a Practical and Educational Clinical Trials Reporting Engine
http://www.datasavantconsulting.com/roland/Spectre/index.html

RGPP – Graphical Patient Profiler
http://www.datasavantconsulting.com/roland/RGPP/rgpp.html

And some tips:
http://www.datasavantconsulting.com/roland/sastips.html

3. Chris’s SAS Macros

http://sas.cswenson.com/downloads/macros

I found Chris’s site just weeks ago, and Chris is a pretty cool programmer: in one of his fun macros, an error message takes this form:

ERROR: MWA HA HA! You fool! You are cursed with leprosy!

4. sconsig SAS Coding Tips and Techniques 

http://www.sconsig.com/sastip.htm

Rich, but hard to navigate and review.

5. Arnold Schick’s macros

http://schick.tripod.com/macros.html

Also some macros collected from SAS-L:

http://schick.tripod.com/p-index.html

6. Rodney A. Sparapani’s Macros

http://www.mcw.edu/PCOR/Education/SASMacros.htm

Rodney is best known, as a cool programmer, for his contribution to SAS support in ESS (Emacs Speaks Statistics); as a statistician, he also works with WinBUGS and Bayesian methods.

Rodney’s site also contains lots of statistical stuff.

/***Statistics***/

1. Mayo Clinic Locally Written SAS Macros
http://cancercenter.mayo.edu/mayo/research/biostat/sasmacros.cfm

or in

http://mayoresearch.mayo.edu/biostat/sasmacros.cfm

2. Paul D. Allison
http://www.ssc.upenn.edu/~allison/#Macros
new site: http://www.pauldallison.com/Download3.html

Paul is a prolific writer with books on SAS and statistics.

3. MCHP SAS Macros

http://mchp-appserv.cpe.umanitoba.ca/viewConcept.php?conceptID=1048

4. Ralph O’Brien, UnifyPow: A SAS Module for Sample-Size Analysis
http://www.bio.ri.ccf.org/Power/

5. Usual Dietary Intakes: SAS Macros for the NCI Method
http://riskfactor.cancer.gov/diet/usualintakes/macros.html

6. Clinician’s corner: SAS macros
http://www.medicine.mcgill.ca/epidemiology/Joseph/PBelisle/sas-macros.html

/***Graph***/

1. Robert Allison’s SAS/Graph Examples!
http://robslink.com/SAS/Home.htm

My first stop for SAS graphics.

2. SAS Graphic Programs and Macros by Michael Friendly
http://www.datavis.ca/sasmac/

more popular in academia.


Get Started with WPS, and Call for an Elegant SAS IDE!

I got a trial version of WPS (the latest version, 2.5.2.0, on Windows), whose engine can interpret “some of the language of SAS”. I tested some pieces of code: some passed while others popped up errors (so currently it is only a limited version of SAS). I haven’t dived deep into what WPS can and can’t do yet, but I do love the WPS way of organizing projects, folders and files (including source files):

WPS_SAS

WPS uses a lite version of Eclipse as its GUI (WPS Workbench; “lite” means WPS Workbench can’t be extended like the original Eclipse, but it does respond faster). Besides its Project Explorer for managing folders and files (left panel), I also love its Outline in the right panel, which shows the SAS programming elements, and the error information in the log window:

WPS_LOG

Then I’d like to switch to SAS itself. Frankly speaking, at least on the IDE side, WPS looks much better than its current SAS counterpart:

SAS_GUI

Of course I hold to the principle of “substance over form”, but where available, form itself also makes people comfortable and happy (for example, Apple products…). As far as I know, the new versions of SAS DI Studio and Enterprise Miner both show much improvement in GUI from an ergonomic point of view. Even as a code editor, the Enterprise Guide editor is now superior to the so-called SAS Enhanced Editor. But as a SAS programmer (not only a SAS user), I may spend my whole day in the Base SAS window!

I also spent some time configuring Vim as a relatively simple SAS IDE, as a temporary replacement (F3 to run, F5 to jump to the program window, F6 to the log window, F7 to the output window, just the same as in the SAS IDE):

VIM_SAS

It’s simple, but it can always do the job just as SAS itself does, and it looks cool enough to comfort me as a programmer. So, what’s coming in the next release? Still waiting.


Hello Python

Inspired by Jian’s polyglot programming practice, I have also begun to brush up on Python and C++, which I learned during graduate school. The following is a Python response to one of Jian Dai’s former programming challenges, counting the lines of source code:

import os

# Count the number of lines in a single file.
def lineCount(fileName):
    countSingle = 0
    # "with" closes the file even if reading fails
    with open(fileName) as f:
        for line in f:
            countSingle += 1
    return countSingle

# Count the number of lines in all files with the given
# extension under a directory and its subdirectories.
def dirCount(dir, extension):
    countTotal = 0
    for r, d, f in os.walk(dir):
        for files in f:
            if files.endswith(extension):
                fileName = os.path.join(r, files)
                countTotal += lineCount(fileName)
    return countTotal

a = dirCount("C:/Program Files/CDISC Express/", ".sas")

# print(...) with a single argument works in Python 2.7 and Python 3
print(a)

I use python-2.7.2, from the 2.7 series (the last of the Python 2.x line), mostly because of its wide module support, for learning purposes. The book that helped me get a quick review of Python is Think Python: How to Think Like a Computer Scientist by Allen Downey.

Also, I have begun to use CodeColorer to insert code in this blog.


An Online Latin to English Translator via SAS

Last month I submitted a piece of SAS code for a monthly programming challenge hosted by Jian Dai: translate the Latin motto of Hogwarts School in Harry Potter into English:

draco dormiens nunquam titillandus

You can get the meaning with a Google search, of course, but not with Google Translate (which can’t recognize all of these Latin words!). Jian posted a concise Perl way to parse web pages that contain this Latin phrase together with key words such as “mean” and “means”, and you can always find a page like

"draco dormiens nunquam titillandus," which means "never tickle a sleeping dragon."

My SAS approach can’t return a human-readable sentence like this one, only a 100% word-for-word machine translation, but you can use it to translate any Latin sentence, even one that happens not to appear in any single web page. The usage is also very simple:

filename L2E url "http://jiangtanghu.com/docs/en/Latin2Eng.sas";
%include L2E;

%Latin2Eng(draco dormiens nunquam titillandus)

and you get:

Obs   draco     dormiens               nunquam                     titillandus

1     dragon    sleep, rest            at no time, never           tickle, titillate, provoke
2     snake     be/fall asleep         not in any circumstances    stimulate sensually
3               behave as if asleep
4               be idle, do nothing

and also (2*4*2*2=) 32 Cartesian combinations to get a feel for the meaning if needed.
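
To see where the 32 comes from, here is a small Python sketch (not the SAS macro itself) that enumerates the combinations with itertools.product. The meaning lists are copied from the output table for the motto; the variable names and the " | " separator are only illustrative choices.

```python
from itertools import product

# Candidate meanings for each Latin word, copied from the output table.
meanings = [
    ["dragon", "snake"],                                   # draco
    ["sleep, rest", "be/fall asleep",
     "behave as if asleep", "be idle, do nothing"],        # dormiens
    ["at no time, never", "not in any circumstances"],     # nunquam
    ["tickle, titillate, provoke", "stimulate sensually"], # titillandus
]

# One combination picks exactly one meaning per word.
combinations = [" | ".join(choice) for choice in product(*meanings)]

print(len(combinations))   # 2 * 4 * 2 * 2 = 32
print(combinations[0])     # dragon | sleep, rest | at no time, never | tickle, titillate, provoke
```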

Then you can also test the words of Julius Caesar:

%Latin2Eng(Veni Vidi Vici)

and get:

Obs   Veni    Vidi                                       Vici

1     come    see, look at                               conquer, defeat, excel
2             consider                                   outlast
3             (PASS) seem, seem good, appear, be seen    succeed

This SAS translator is based on WORDS (version 1.97FC) by William Whitaker, and the code still needs some modification for cases where unexpected special symbols pop up in the translation page.


Map and Reduce in MapReduce: a SAS Illustration

In the last post, I mentioned Hadoop, the open source implementation of Google’s MapReduce for parallelized processing of big data. Over this long National Holiday, I read the original Google paper, MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat, and learned that the terms “map” and “reduce” were basically borrowed from Lisp, an old functional language in which I haven’t even written “hello world”. For Python users, the idea of map and reduce is also very straightforward, because the workhorse data structure in Python is the list, a sequence of values that you can imagine as the nodes (clusters, chunk servers, …) of a distributed system.

MapReduce is a programming framework and really language independent, so SAS users can also get the basic idea from their daily programming practice. Here is a simple illustration using a data step array (not an array in Proc FCMP or a matrix in IML). A data step array in SAS is fundamentally not a data structure but a convenient way of processing a group of variables, yet it can also be used to play with some list operations, as in Python and other languages with rich data structures (an editable version can be found here):

MapReduce

Following the code above, the programming task is to capitalize the string “Hadoop” (Line 2), and the “master” method is just to capitalize the string in one bundle (Line 8): a single master machine processes all the data.

Then we introduce the idea of “big data”: the string is too huge for one master machine, so the “master method” fails. Now we distribute the task to thousands of low cost machines (workers, slaves, chunk servers, …; in this case, the one-dimensional array of size 6, see Line 11), and each machine does part of the job (each array element capitalizes only a single letter, in sequence, see Lines 12-14). Such a distributing operation is called “map”. In a MapReduce system, a master machine is still needed to assign the map and reduce tasks.

How about “reduce”? A “reduce” operation is also called a “fold”: for example, in Line 17, the operation that combines all the separate values into a single value, i.e. combines the results from multiple worker machines.
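
For Python users, the same walk-through can be sketched with the built-in map and functools.reduce. This is only an analogy to the SAS data step illustration, not a translation of it, and it reads “capitalize” as “convert to upper case”; the variable names are made up for the example.

```python
from functools import reduce

text = "Hadoop"

# "Master" method: one machine handles the whole string at once.
master_result = text.upper()

# Map: each letter goes to its own "worker", which upper-cases just that letter.
mapped = list(map(str.upper, text))

# Reduce (fold): combine the workers' partial results into a single value.
reduced = reduce(lambda left, right: left + right, mapped)

print(mapped)    # ['H', 'A', 'D', 'O', 'O', 'P']
print(reduced)   # HADOOP
```

Both routes give the same answer; the map/reduce route just makes the “split the work, then fold the pieces back together” steps explicit.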


An Analytical Valley: Big Data and Data Scientists (and SAS Programmers)

hadoop

Tom Davenport reported an observation that Silicon Valley is becoming more analytical, since Valley companies such as Google, Facebook, eBay and LinkedIn all have a strong presence in analytics. Besides these predominant companies, I’d also like to add Yahoo to the list, although Yahoo is no longer at its peak. Yahoo is the largest sponsor of and contributor to Hadoop, an open source framework for distributed processing of so-called “big data”. Looking at the outstanding Facebook data team or LinkedIn data team, we can see that Hadoop is also one of their most overwhelmingly successful technical factors. These Valley companies are themselves huge consumers of big data and have strong incentives to develop analytical solutions beyond their high technology product pipelines.

The analytical staff at LinkedIn also helped a lot to promote the wide usage of the term “data scientist”. They identify themselves as data scientists, and that’s really cool. Now more and more statisticians are also glad to accept this brand new title. According to a survey at JSM (2011, Miami), more than 85% (164) of the statisticians there considered themselves “data scientists”.

McKinsey also released a report this May on big data and the huge gap in qualified analytical talent. You know, when a management consulting firm begins to talk about something technical, following the discussion of the concept is no longer just a fashion. To embrace the challenge of big data, a person or a team needs a multidisciplinary background: basically, computer science and statistics (data mining or machine learning is just an interdisciplinary subject between them). Here is an ambitious list on “How do I become a data scientist”:

http://www.quora.com/Educational-Resources/How-do-I-become-a-data-scientist

As for these learning plans, just get a feel for them and don’t take them too seriously. Check yourself and set your own priorities.

Notes for SAS Programmers

For SAS programmers, besides High Performance Computing, I read an exciting post saying that SAS will also play with Hadoop by introducing some functionality in SAS/Access and SAS Data Integration Studio.

For SAS programmers with no IT background, it is not a good idea to jump immediately into algorithms, data structures and other hard-core computer science courses. Instead I recommend taking full advantage of the SAS language and system itself to dive into the computer world gradually:

1. Learn and practice, and practice again, SAS Proc SQL, which is compliant with the SQL-92 standard. SQL is the common language of the database world, and SAS Proc SQL can help you switch smoothly to Oracle SQL, Teradata SQL, MySQL and other SQL implementations, although there are some non-critical differences in the details.

2. Dig into the operating-system-specific documentation of SAS, for example, for SAS 9.3, the SAS 9.3 Companion for Windows or the SAS 9.3 Companion for UNIX Environments, or others depending on the OS you work on. These are critically important documents, but unfortunately they are often missing from SAS programmers’ reading lists.

Such docs help SAS programmers deal with the machines and expose them to the wider computer world in a way that a SAS programmer can understand. You can’t expect to become a computer expert via such docs, but at least you can communicate fluently with internal IT staff.

3. Then you have all the confidence to play with computers and can switch to any other topic of interest in the list above!


Four Errors in SAS 9.2 Fisher’s Iris Data in SASHELP Library

iris

In the previous post, I mentioned that Fisher’s Iris Data is officially embedded in the SASHELP library in SAS 9.2. Note that even in SAS 9.1.3 you can find several instances of this data in demos in the user guides (just search "Iris" in the "SAS Help and Documentation" that accompanies your SAS product), for example, in SAS 9.1.3 IML.

The Iris dataset is so important and popular that researchers around the world use it as a benchmark to test and compare their algorithms, and also for pedagogical purposes. It is also the overwhelming No. 1 dataset by popularity in the UCI Machine Learning Repository. Here are 4 errors in SASHELP.iris, listed for your consideration in case you are interested, or in case you find slight differences in output when following demos from outside SAS that use this data:

Error 1: Line 35, the PetalWidth of Setosa should be 2 mm, not 1 mm;

Error 2: Line 38, the SepalWidth of Setosa should be 36 mm, not 31 mm;

Error 3: Line 38, the PetalLength of Setosa should be 14 mm, not 15 mm;

Error 4: Line 119, the PetalLength of Virginica should be 69 mm, not 70 mm.
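
If you have the SASHELP.iris rows outside SAS, the four fixes could be applied along these lines. This is a minimal Python sketch, not the code from this post: it assumes that “Line 35” means the 35th observation, that each row is a dict keyed by the SASHELP.iris variable names, and that values are integers in mm. CORRECTIONS and apply_corrections are hypothetical names.

```python
# The four errors listed above, as
# (observation number, variable, value in SASHELP.iris, corrected value), in mm.
CORRECTIONS = [
    (35, "PetalWidth", 1, 2),
    (38, "SepalWidth", 31, 36),
    (38, "PetalLength", 15, 14),
    (119, "PetalLength", 70, 69),
]


def apply_corrections(rows):
    """Return a corrected copy of rows (a list of dicts, one per observation)."""
    fixed = [dict(row) for row in rows]  # copy, so the input is left untouched
    for obs, variable, wrong, right in CORRECTIONS:
        row = fixed[obs - 1]  # assumption: "Line 35" is the 35th observation
        # Only change a cell that still holds the known wrong value.
        if row[variable] == wrong:
            row[variable] = right
    return fixed
```

Checking for the known wrong value before overwriting means the function does nothing harmful if it is given data that is already correct or sorted differently.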

For errors 1-3, there is also an interesting story in the statistical literature. In 1936, Fisher the Great published his famous paper, The use of multiple measurements in taxonomic problems, with the Iris data attached (called the Fisher Version in this post). In the following years (until today), people cited this paper, and the Fisher Version of the Iris data was replicated and distributed worldwide; somewhere along the way, a version with errors 1-3 above gained a very dominant popularity (I don’t know the source of these errors). In the UCI Machine Learning Repository, the dataset iris.data is the one with these 3 errors (called the UCI Version here).

We can see that the copied UCI Version is, to some extent, even more popular than the original Fisher Version (SASHELP.iris also seems to be copied from the UCI Version). The story goes on. In 1998, James Bezdek and other scholars found the three discrepancies between the Fisher Version and the UCI Version (and in some published papers using the same version of the data). You can read about it in Will the Real Iris Data Please Stand Up?

Bezdek then proposed using the original Fisher Version of Iris, and the UCI Machine Learning Repository documented these three errors and added a new dataset called bezdekIris.data (Bezdek Version), which is exactly the Fisher Version (iris.data was kept, I think because the so-called error version is by now also valuable).

Returning to error 4: I can’t figure out its source, so I might as well call this the Iris SAS Version. Note that the unit in the SAS Version is the millimeter (mm), while the other versions all use the centimeter (cm).

The interesting part is that I also checked the Iris data in SAS 9.1.3 IML mentioned before and, not surprisingly, it is exactly the Fisher Version (you can also find a correct one in a demo from SAS 9.2 IML Studio 3.2).

The following code generates several Iris versions:

iris_uci: UCI Version with both CM and MM as units

bezdekiris_uci: Bezdek Version (or Fisher Version) with both CM and MM as units

iris_mm: UCI Version with MM as the unit and attributes like those of SASHELP.iris (SAS Version)

bezdekiris_mm: Bezdek Version (or Fisher Version) with MM as the unit and attributes like those of SASHELP.iris (SAS Version)



Who is Alfred?

Tell me something about Alfred: male or female? Age? Height and weight?

Oracle database (version 9 and below) had a well-known default demo account, SCOTT, with the password TIGER (TIGER was the name of the cat of the real person, Bruce Scott), and in this account there are tables named DEPT, EMP, BONUS and SALGRADE (their names tell their meaning). Almost every Oracle DBA learns SQL using these tables, and a joke says that in DBA meetings people warm up by asking “how about Smith?” And you should know that in the database, Smith is a clerk and his boss is Ford (whose boss is Jones)!

At the beginning I raised a similar question for SAS programmers: who is Alfred? Don’t give a quick answer such as “Alfred who?”. Actually, as a SAS programmer you should already know Alfred very well:

proc print data=sashelp.class;
    where name="Alfred";
run;

As a clinical SAS programmer, I play with data and get acquainted with the data and subjects, and then the subjects are no longer just “subjects”. They have identities, and Alfred is a 14-year-old boy. I have this habit mostly because in the clinical world data are very expensive (unlike the massive transaction data in the financial industry) and deserve more care.

I dare say that “class” is the most famous SAS dataset in the sashelp library, and so in the SAS world. The first dataset used in a demo is almost always this “class”. I just did a quick Google search: “sas sashelp.class” returns about 44,400 results. See if you can find another SAS dataset to beat it.

Alfred in “class” popped into my mind because today I did find a strong candidate. In SAS 9.2 (and 9.3), the sashelp library has a new member, Iris. YES, it is the “Fisher Iris Flower Data”, which can safely be considered the most famous and most used dataset in machine learning and data mining papers and statistical applications. Currently it has only 859 hits in Google, but I think the number will climb with the wide use of SAS 9.2 and above, and to back my prediction, I will definitely play with the Iris data in the near future!
