Category Archive ‘misc‘

 
 

Happy New Year (Yes Again)

So I feel great about resetting my 2012 as a brand new year.

Today, Jan 23, is the first day of Chinese New Year, and it is Monday, the first day of the work week. It is always joyful to have such a coincidence :). HAPPY NEW YEAR!

I had a big move this year. Actually I passed through multiple new years in 2012: I took my flight from Beijing on Jan 1 (happy new year!) and landed in Raleigh, North Carolina, on the same day due to the time difference between China and the US. There I took a new job as a Life Sciences Consultant at d-Wise Technologies. I’m not supposed to be a pure SAS programmer any more, but I don’t want to change the theme of my blog (“Hello World by A SAS Programmer“), since my handiest language is still SAS and my role will expose me to more SAS products and solutions.

Chinese New Year is also called “Spring Festival” and “spring” is always the key word for the holiday:

(Chinese characters for “the Beauty of Spring”. I wrote these years ago in Beijing.)

I also want to spread my spring greetings with a few lines from Walt Whitman's These I, Singing in Spring:

THESE, I, singing in spring, collect for lovers,

. . .

Collecting, I traverse the garden, the world—but soon

I pass the gates,

Now along the pond-side—now wading in a little, fearing not the wet,

Now by the post-and-rail fences, where the old stones

thrown there, pick’d from the fields, have accumulated,

(Wild-flowers and vines and weeds come up through

the stones, and partly cover them—Beyond these I pass,)

Far, far in the forest, before I think where I go,

Solitary, smelling the earthy smell, stopping now and then in the silence,

Alone I had thought—yet soon a troop gathers around me,

Some walk by my side, and some behind, and some embrace my arms or neck,

They, the spirits of dear friends, dead or alive—thicker

they come, a great crowd, and I in the middle,

Collecting, dispensing, singing in spring, there I wander with them,

Plucking something for tokens—tossing toward whoever is near me;

. . .

Hello Python

Inspired by Jian’s polyglot programming practice, I have also begun to brush up on Python and C++, which I learned in graduate school. The following is a Python response to one of Jian Dai’s earlier programming challenges, counting the lines of source code:

import os

# Count the number of lines in a single file.
def lineCount(fileName):
    countSingle = 0
    # "with" closes the file even if reading fails
    with open(fileName) as f:
        for line in f:
            countSingle += 1
    return countSingle

# Count the number of lines in all files with the given
# extension under a directory and its subdirectories.
def dirCount(dir, extension):
    countTotal = 0
    for r, d, f in os.walk(dir):
        for files in f:
            if files.endswith(extension):
                fileName = os.path.join(r, files)
                countTotal += lineCount(fileName)
    return countTotal

a = dirCount("C:/Program Files/CDISC Express/", ".sas")

# print(a) works in Python 2.7 and in Python 3
print(a)

I use Python 2.7.2 (2.7 being the final Python 2.x release series), mostly because of its wide module support, for learning purposes. The book that helped me get a quick review of Python is Think Python: How to Think Like a Computer Scientist by Allen Downey.

I have also begun to use CodeColorer to insert code in this blog.

Four Errors in SAS 9.2 Fisher’s Iris Data in SASHELP Library


In the previous post, I mentioned that Fisher’s Iris Data is officially embedded in the SASHELP library in SAS 9.2. Note that even in SAS 9.1.3 you can find several instances of this data in demos from the user guides (just search "Iris" in the "SAS Help and Documentation" that accompanies your SAS product), for example in SAS 9.1.3 IML.

The Iris dataset is so important and popular that researchers around the world use it as a benchmark to test and compare their algorithms, and also for pedagogical purposes. It is also by far the most popular dataset in the UCI Machine Learning Repository. Four errors in SASHELP.iris are listed here for your consideration, in case you find slight differences in output when following demos from outside SAS that use this data:

Error 1: Line 35, the PetalWidth of Setosa should be 2 mm, not 1 mm;

Error 2: Line 38, the SepalWidth of Setosa should be 36 mm, not 31 mm;

Error 3: Line 38, the PetalLength of Setosa should be 14 mm, not 15 mm;

Error 4: Line 119, the PetalLength of Virginica should be 69 mm, not 70 mm.

For errors 1-3, there is an interesting story in the statistical literature. In 1936, Fisher the Great published his famous paper, The use of multiple measurements in taxonomic problems, with the Iris data attached (called the Fisher Version in this post). In the following years (until today), people cited this paper, and the Fisher Version of the Iris data was replicated and distributed worldwide; along the way, a version with errors 1-3 above gained a dominant popularity (I don’t know the source of these errors). In the UCI Machine Learning Repository, the dataset iris.data is the one with these 3 errors (called the UCI Version here).

We can see that the copied UCI Version is, to some extent, even more popular than the original Fisher Version (SASHELP.iris also seems to have been copied from the UCI Version). The story goes on. In 1998, James Bezdek and other scholars found the three discrepancies between the Fisher Version and the UCI Version (and in some published papers using the same version of the data). You can read about it in Will the Real Iris Data Please Stand Up?

Bezdek then proposed using the original Fisher Version of Iris. The UCI Machine Learning Repository documented the three errors and added a new dataset called bezdekIris.data (the Bezdek Version), which is exactly the Fisher Version (iris.data was kept, I think because the so-called error version is now valuable in its own right).

Returning to error 4: I can’t figure out where it comes from, so I might as well call this the Iris SAS Version. Note that the unit in the SAS Version is millimeters (mm), while all the other versions use centimeters (cm).

The interesting part is that I also checked the Iris data in SAS 9.1.3 IML mentioned before and, not surprisingly, it is exactly the Fisher Version (you can also find a correct one in a demo from SAS 9.2 IML Studio 3.2).
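Discrepancies like errors 1-3 can be located mechanically by converting both versions to the same unit and comparing them cell by cell. Below is a minimal Python sketch of that idea, not the SAS code from this post. It holds only the two Setosa records named above (lines 35 and 38); the measurements not mentioned in the error list are filled in from the UCI repository's note on the dataset and should be treated as illustrative. The names uci_cm, fisher_cm, to_mm and diff_versions are hypothetical.

```python
# Illustrative rows only: the two Setosa records named in the error list
# (lines 35 and 38), in centimeters. Values not named in the post are assumptions.
FIELDS = ("SepalLength", "SepalWidth", "PetalLength", "PetalWidth")

uci_cm = {
    35: (4.9, 3.1, 1.5, 0.1),
    38: (4.9, 3.1, 1.5, 0.1),
}
fisher_cm = {
    35: (4.9, 3.1, 1.5, 0.2),
    38: (4.9, 3.6, 1.4, 0.1),
}

def to_mm(row_cm):
    """Convert a row of centimeter measurements to whole millimeters."""
    # round() guards against floating-point noise such as 4.9 * 10
    return tuple(int(round(value * 10)) for value in row_cm)

def diff_versions(version_a, version_b):
    """Return (line, field, a_mm, b_mm) for every cell that differs."""
    differences = []
    for line in sorted(version_a):
        row_a = to_mm(version_a[line])
        row_b = to_mm(version_b[line])
        for field, a, b in zip(FIELDS, row_a, row_b):
            if a != b:
                differences.append((line, field, a, b))
    return differences

for line, field, uci, fisher in diff_versions(uci_cm, fisher_cm):
    print("Line %d: %s is %d mm in UCI, %d mm in Fisher" % (line, field, uci, fisher))
```

With full copies of iris.data and bezdekIris.data loaded into the two dictionaries, the same function would list every differing cell between the two versions.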

The following code generates several Iris versions:

iris_uci: UCI Version with both CM and MM as unit

bezdekiris_uci: Bezdek Version or Fisher Version with both CM and MM as unit

iris_mm: UCI Version with MM as unit and attributes like SASHELP.iris, SAS Version

bezdekiris_mm: Bezdek Version or Fisher Version with MM as unit and attributes like SASHELP.iris, SAS Version


Click to read more…

Who is Alfred?

Tell me something about Alfred: male or female? Age? Height and weight?

Oracle Database (version 9 and below) had a well-known default demo account, SCOTT, with the password TIGER (Tiger was the name of the cat of the real Bruce Scott). This account contains tables named DEPT, EMP, BONUS and SALGRADE. Almost every Oracle DBA learned SQL using these tables, and a joke says that at DBA meetings people warm up by asking “how about Smith?” You should know that in the database, Smith is a clerk and his boss is Ford (whose boss is Jones)!

At the beginning I raised a question for SAS programmers: who is Alfred? Don’t give a quick answer such as “Alfred who?”. Actually, as a SAS programmer you should already know Alfred very well:

proc print data=sashelp.class;
    where name="Alfred";
run;

As a clinical SAS programmer, I play with data and get acquainted with the data and the subjects, and then subjects are no longer just “subjects”. They have identities, and Alfred is a 14-year-old boy. I have this habit mostly because in the clinical world data are very expensive (unlike the massive transaction data in the financial industry) and should be handled with more care.

I dare say that “class” is the most famous SAS dataset in the sashelp library, and therefore in the SAS world. The first dataset used in a demo is almost always “class”. I just did a quick Google search: “sas sashelp.class” returns about 44,400 results. See if you can find another SAS dataset that beats it.

Alfred in “class” popped into my mind because today I did find a strong candidate. In SAS 9.2 (and 9.3), the sashelp library has a new member, Iris. YES, it is the “Fisher Iris Flower Data”, which can safely be considered the most famous and most used dataset in machine learning and data mining papers and statistical applications. Currently it has only 859 hits in Google; I think the number will climb with the wider use of SAS 9.2 and above, and to back up my prediction, I will definitely play with the Iris data in the near future!

I am a 20% SAS Nerd!

Kirk Paul Lafler drafted a checklist for identifying a SAS nerd (or geek, in the positive sense) in one of his intriguing papers:

You Could be a SAS® Nerd If . . .

Here I’m glad to find that I am roughly a 20% SAS nerd (12 matches out of all 57 items):

8. You blog SAS-related comments and technical solutions frequently.

9. You have more than five SAS blogs in your RSS feed.

10. Your home page is support.sas.com, sasCommunity.org, SAS-L, or LexJansen.com.

11. You know more than ten SAS keyboard shortcuts.

12. You get excited when you find a new match-merge technique that performs better than the one you developed the week before.

21. You have more than one version of SAS on your machine or network so you can compare and contrast program, processing and output differences.

28. You spend your Friday evenings and weekends responding to SAS-L posts, entering sasCommunity blog entries, and reading the latest “hot” SAS topic on LexJansen.com.

38. The first thing you read in the morning is the “Tip of the Day”.

45. You subscribe to five or more SAS groups on LinkedIn, sasCommunity, and Facebook and you use a tabbed browser so you can be online with all of them at the same time.

47. You spend your evenings and weekends SAS-L’ing, Googling and Binging looking for elegant SAS technical solutions.

50. You proudly proclaim that you’re a SAS programmer when asked by a fellow passenger, “What do you do for a living?”

51. You’re amazed when your fellow airline passenger replies, “What is a “SAS programmer?”.

I also asked Kirk if he is a 100% SAS nerd and Kirk replied, NO. He said he is a 99% SAS nerd:)

SAS Bloggers In Action(1): Rick Wicklin, SAS/IML and “Color Revolution”

It is well known that the French writer Alexandre Dumas, author of The Three Musketeers, wrote his masterpieces on different colored paper according to literary genre:

non-fiction on  rose,

fiction on blue,

poetry on yellow

The SAS blog writer Rick Wicklin of SAS Institute, author of Statistical Programming with SAS/IML Software, also leads a strong “color revolution” in the SAS blog community:

[Figure: Johari window]

In an interesting personal statement, Blogging, Programming, and Johari Windows, Rick summarizes his rich and colorful blogging rhythm according to the Johari window above:

  • Mondays: introductory notes (corresponding to the upper right quadrant of the Johari window).
  • Wednesdays: experimental articles on sampling, simulation and other statistical programming topics (lower left quadrant).
  • Fridays: exploratory analysis of data (upper left quadrant).

So what about the lower right quadrant? Rick rediscovers and exposes what he once knew. Just suppose that Rick picks up some code he wrote before (ten years ago, maybe) with great surprise: oh, who on earth wrote such damned clever, beautiful code? He or she must have been in his or her aggressive youth. Then Rick writes it all up in the blog.

Here I produced a summary table of Rick’s blogging activity (number of posts per month per weekday, before July 16, 2011, Beijing time; the next post will introduce how to use SAS to analyze data from a website such as Rick’s blog):

[Table image: Rick’s posts per month per weekday]

Key findings:

  • Rick is really a frequent and productive blogger, averaging about 0.5 posts per day!
  • Rick DOES keep his word. Most of the posts were published on Fridays, Wednesdays and Mondays (44, 44 and 42 posts respectively).
  • None were posted on Saturdays or Sundays.

Rick began writing on Friday, September 3, 2010. Up to Friday, July 15, 2011, there are 48 Fridays, 46 Wednesdays and 46 Mondays. Only 10 of these colored weekdays (4 Fridays, 2 Wednesdays and 4 Mondays) passed with no posts, and most of them were due to national holidays:

06/09/2010, Monday: Labor Day
24/11/2010, Wednesday: around Thanksgiving Day
26/11/2010, Friday: around Thanksgiving Day
22/12/2010, Wednesday
24/12/2010, Friday: Christmas Day (observed)
27/12/2010, Monday
31/12/2010, Friday: New Year’s Day (observed)
30/05/2011, Monday: Memorial Day
10/06/2011, Friday
04/07/2011, Monday: Independence Day

On at least 4 holidays (mostly Mondays), Rick was also active in writing:

11/10/2010, Monday, Columbus Day: How Do You Reshape a Matrix?
11/11/2010, Thursday, Veterans Day: It’s Here!
17/01/2011, Monday, Birthday of Martin Luther King, Jr.: On the Flip Side: Exchanging Rows and Columns
21/02/2011, Monday, Washington’s Birthday: How to Build a Vector from Expressions
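Weekday tallies like the ones above can be derived directly from a list of post dates. The following is a minimal Python sketch of that step, not the SAS analysis promised for the next post. It uses three of the holiday post dates listed above as sample input, in the same day/month/year format; the names post_dates and weekday_counts are hypothetical.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

# Sample input: three of the holiday post dates listed above (day/month/year).
post_dates = ["11/11/2010", "17/01/2011", "21/02/2011"]

def weekday_counts(dates, fmt="%d/%m/%Y"):
    """Count how many of the given date strings fall on each weekday."""
    counts = Counter()
    for text in dates:
        parsed = datetime.strptime(text, fmt)
        # weekday() returns 0 for Monday through 6 for Sunday
        counts[WEEKDAYS[parsed.weekday()]] += 1
    return counts

counts = weekday_counts(post_dates)
for day in WEEKDAYS:
    print("%s: %d" % (day, counts[day]))
```

Fed with the full list of post dates scraped from a blog, the same counter would give the per-weekday totals; grouping by month as well only needs a (month, weekday) key instead of the weekday alone.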

Amazingly, Rick keeps a fixed writing pattern. In the next post, detailed analysis and SAS code will be presented so that you can also keep an eye on the metadata of your favorite bloggers’ writing and perhaps raise a question like:

Hey Rick, what was up on Friday, Jun 10, 2011?

Tango Haiku

Happy weekend and Haiku again!

My finger hurts,

I can not dance

Tango in Bastille.

So… any ideas?

… …

… … …

OK. I am a terrible Haiku writer. The sentences are not self-explanatory without /*commentaries*/. So, the story behind the scenes…

One day my remote workstation was very, very slow due to some network issues. But I still needed to write and run SAS code on the server, which is located in France. It really hurt my fingers, from an ergonomic point of view.

“Tango” was my project code. But why Bastille? Oh, that day happened to be around Bastille Day in France. The performance of the remote server at that time was just as terrible as the Bastille, which I wanted to break down so I could do my project work more smoothly!

—————————–

Note that after completing this Tango Haiku, I did a Google search and found that there really is a French book called Bastille tango!


ADJECTIVE Encounters

Really she is the strangest creature in the world, far from heroic, variable as a weathercock, “bashful, insolent; chaste, lustful; prating, silent; laborious, delicate; ingenious, heavy; melancholic, pleasant; lying, true; knowing, ignorant; liberal, covetous, and prodigal”— in short, so complex, so indefinite, …

            –Virginia Woolf, The Common Reader, First Series (1925)

I don’t know if Ernest Hemingway is still one of the recognized dominant writers in college English education. At least once upon a time he WAS. In an extreme form, he used only nouns and verbs to construct sentences.

In my personal English education (as a second language), admittedly, there is also an absence of adjectives. Everything is just wonderful, nice, great, cool, weird or awesome; all in all, everything is OK or not OK, good or not good. In writing, my sentences lack tone and shade. I write only technical articles in English, and people can usually cope with the so-called technical dullness when acquiring information, knowledge, and opinions.

In reading, when I try to read just for the sake of reading itself, I also find it difficult to dig into pages of pure literature where rich adjectives are heavily assembled. I only read technical papers smoothly. So when I happened to have a paper copy of Woolf and also loaded the corresponding public domain e-book onto my Kindle, I read the most intensive collection of adjectives ever. It is really a totally different experience.

There are three levels in any language. For English:

at the top, pure literature, Shakespeare-like;

in the middle, what could be called universal or international English; it is the dominant English among nations in business, technology and even academia; most popular writers also use this sort of English to extend their global reputation;

at the bottom, street language, slang, talk shows.

I entered English through the middle, like almost all ESL learners. It is the most convenient and effective way in the short run. But for a leap in the long run, some friends suggest that I should go down to the street or climb up to the top. OK, I am on the way. Virginia Woolf is the first stop, and these are my first notes.

Too Big to Be Accurate(1): Which is the Most Powerful Calculator in the World?

Calculate the factorial of 171 (171!)? Just TRY! It is equal to 171*170*169*…*2*1.

1. Google calculator

As a Google fanatic, I first tried to find the answer via Google:

[Screenshot: Google search for 171!]

Whoops, nothing interesting was returned! Type “170!” and get the output:

[Screenshot: Google calculator result for 170!]

Why does this kind of thing happen in the calculator? 171! is just equal to 171*170!.

2. Excel

Switch to an Excel spreadsheet, using the function FACT():

[Screenshots: Excel results for 170! and 171!]

Oh, interesting. The same.

3. SAS

Google and Excel may be only niche players in the family of calculators. Why not try some programming languages?
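As one such attempt, here is a minimal Python sketch (not taken from the rest of this post). Python integers have arbitrary precision, so math.factorial returns 171! exactly. The sketch also tries to convert both factorials to an IEEE 754 double, the number type Excel uses; that Google's calculator relies on doubles as well is only my assumption. The helper name as_double is hypothetical.

```python
import math
import sys

# Python integers are arbitrary precision, so both results are exact.
exact_170 = math.factorial(170)
exact_171 = math.factorial(171)

# Largest finite IEEE 754 double, roughly 1.8e308.
largest_double = sys.float_info.max

def as_double(n):
    """Convert an exact integer to a double, or return None if it overflows."""
    try:
        return float(n)
    except OverflowError:
        return None

print("170! as a double: %r" % as_double(exact_170))
print("171! as a double: %r" % as_double(exact_171))
print("171! has %d digits" % len(str(exact_171)))
print("171! exceeds the largest double: %s" % (exact_171 > largest_double))
```

The conversion succeeds for 170! but fails for 171!, because 171! is larger than the largest finite double. That would explain a calculator built on doubles giving up exactly at 171.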


Click to read more…

SGF: Caesars Palace in Las Vegas again

The mechanic, who wishes to do his work well, must first 
sharpen his tools.
 --Confucian Analects. BOOK XV.WEI LING KUNG.CHAP.IX.

My paper, Work Smarter than Harder-tools for growing up a SAS programmer, was accepted by SAS Global Forum 2011. It will be my first time attending a worldwide SAS user group conference. The draft version is available at

http://jiangtanghu.com/docs/en/SGF2011_JiangtangHU(draft).pdf

Any comments are welcome.

SGF 2011 will be held at Caesars Palace in Las Vegas, Nevada. The interesting thing is that, as far as I know, Caesars Palace in Las Vegas was also the host of SUGI 1978 (Jan 30-Feb 1, 1978). I am looking forward to seeing my SAS gurus next year at Caesars Palace, a place of history.