Monthly archive for October 2010

 
 

Play Matrix within SAS(1): basic files processing

Recently I read Rick Wicklin’s IML blog with great interest (and with anticipation for his forthcoming IML book, Statistical Programming with SAS/IML Software). SAS programmers have the following programming tools to facilitate their daily work:

  • SAS data step: the basic SAS; a fourth-generation programming language, similar to procedural languages such as C.
  • SAS Proc SQL: SAS’s implementation of standard SQL (SQL-92).
  • SAS IML (Interactive Matrix Language): SAS’s matrix manipulation language (like R and Matlab). SAS IML Studio also supplies the IMLPlus programming language (IML+), an enhanced version of IML.
  • SAS SCL (SAS Component Language): built into SAS/AF software, an object-oriented programming (OOP) language for application development.

I am a heavy user of data steps and SQL and want to invest some time in matrix manipulation. Although other wonderful languages are available (such as R and Matlab), I find IML a good choice for SAS programmers like me. It is well integrated with the SAS system and, very importantly, almost all of the SAS Base functions and call routines are also supported by IML. Here are some notes on IML 101 (the code is self-explanatory from a SAS Base point of view):

1. IML style of ‘hello world’

proc iml;
    text="Hello World!";
    print "IML saying" text;
quit;

and you get in the output window:

IML saying Hello World!

Like Proc SQL, IML begins with “proc iml” and ends with “quit”, and every statement ends with a semicolon. The keyword “print” (an IML statement) works just like the “put” statement in data steps.

An enhanced version of Hello World:

options nocenter nodate nonumber;

proc iml;
    reset printall;

    text="Hello World!";
    print "in &sysdate. IML saying" text;
quit;

Some SAS global options are added (“nocenter nodate nonumber”). The IML statement “reset” works like the “options” statement: it sets processing options within IML. You can guess the meaning of the option “printall”: just print all. It is your turn to check the output window.

The SAS system macro variable “&sysdate” is included to encourage you to use any SAS Base programming element in IML.

2. How to create a matrix manually

Actually, we have already created a matrix named “text” in the previous hello-world code. It is a character scalar (a matrix with only one element). If we want to avoid the SAS data step style of assignment, we can use {} to enclose matrix elements:

a={"a"};  /*a _char_ scalar */
b={1};    /*a _num_ scalar*/

and a 2*3 matrix:

c={1 2 3,
      2 3 4}; /*2 rows, 3 cols*/

Commas(,) are used to separate rows.

3. How to create a matrix by functions

Some matrix reshaping functions:

a=I(3);     /*creates a 3*3 identity matrix*/
b=J(2,3,5); /*creates a 2*3 matrix of identical values*/
e=do(1,9,2); /*produces series, from 1 to 9, by increment 2*/
c=block(a,b);/*forms a block-diagonal matrix*/
d=diag(a);   /*creates a diagonal matrix from the diagonal of a*/
m=repeat(a,4,3); /*creates a 12*9 matrix: a repeated 4 times down, 3 times across*/
n=T(b);   /*transpose*/
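To check what these functions return, here is a minimal sketch (assuming SAS/IML is available) that rebuilds a few of the matrices above and prints their dimensions with the NROW and NCOL functions. The matrix name dims is introduced here only for illustration.

```sas
proc iml;
    a = I(3);            /* 3*3 identity matrix */
    b = J(2,3,5);        /* 2*3 matrix filled with 5 */
    c = block(a,b);      /* block-diagonal: 3+2 rows, 3+3 columns */
    m = repeat(a,4,3);   /* a repeated 4 times down, 3 times across */
    n = T(b);            /* transpose of b */

    /* NROW and NCOL return the number of rows and columns;
       || concatenates horizontally, // vertically */
    dims = (nrow(c) || ncol(c)) //
           (nrow(m) || ncol(m)) //
           (nrow(n) || ncol(n));
    print dims[rowname={"c" "m" "n"} colname={"rows" "cols"}];
quit;
```

The printed table should list c as 5 by 6, m as 12 by 9 and n as 3 by 2.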

4. How to create a matrix by reading a SAS data set

proc iml;
    use sashelp.class;
    read all var _char_                    into class_char;
    read all var _num_                     into class_num;
    read all var {"Age" "Height" "Weight"} into class_num2;
    close sashelp.class;

    print class_char class_num class_num2;
quit;

Note that it is a good habit to close the data set after reading or using it (see Rick Wicklin’s Five Reasons to CLOSE Your Data Sets).
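The READ statement can also subset rows while reading. A minimal sketch, assuming the documented clause order VAR, WHERE, INTO; the matrix name female_hw is made up for this example:

```sas
proc iml;
    use sashelp.class;
    /* keep only the rows where Sex is "F" */
    read all var {"Height" "Weight"} where(Sex="F") into female_hw;
    close sashelp.class;

    print female_hw[colname={"Height" "Weight"}];
quit;
```

Character and numeric variables cannot share one matrix, which is why the earlier example reads _char_ and _num_ into separate matrices.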

5. How to output a matrix to SAS dataset

proc iml;
    use sashelp.class;
    read all var _num_ into class_num;
    close sashelp.class;

    create work.class_num from class_num;
    append from class_num;
    show datasets;
    close work.class_num; /*close the output data set as well*/
quit;
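Written this way, the variables in the output data set get default names (COL1, COL2, and so on). A minimal sketch of one way to carry the original variable names through, using the COLNAME= option on both READ and CREATE; the names varNames and work.class_num2 are assumptions for illustration:

```sas
proc iml;
    use sashelp.class;
    /* colname= stores the names of the variables that were read */
    read all var _num_ into class_num[colname=varNames];
    close sashelp.class;

    /* reuse those names for the variables of the new data set */
    create work.class_num2 from class_num[colname=varNames];
    append from class_num;
    close work.class_num2;
quit;

/* list the variable names of the new data set */
proc contents data=work.class_num2 short;
run;
```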

 

6. How to format a matrix

/*version I: use matrix options*/

proc iml;
        use sashelp.class;
        col={"Age" "Height" "Weight"};
        read all var col into class;
        read all var{name} into row;
        close sashelp.class;

        print class[rowname=row
                    colname=col
                    format=5.2
                    label="test, label"];
quit;

/*version II: use mattrib statement*/

proc iml;
        use sashelp.class;
        col={"Age" "Height" "Weight"};
        read all var col into class;
        read all var{name} into row;
        close sashelp.class;

        mattrib class rowname=row
                      colname=col
                      label="test, label"
                     format=5.2;
        print class;
quit;

/*version III: avoid hardcoding by using IML functions and operations*/

proc iml;
        use sashelp.class;
        col=T(contents("sashelp","class")[3:5]); /*3rd to 5th variable names*/
        read all var col into class;
        read all var{name} into row;
        close sashelp.class;

        mattrib class rowname=row
                      colname=col
                      label="test, label"
                      format=5.2;
        print class;
quit;

(IML matrix operations: to be continued)

SAS Algorithmically(1): Newton-Raphson method

For a good reference on the basic Newton-Raphson algorithm for calculating the square root of a number, see

http://mathforum.org/library/drmath/view/52644.html
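For orientation, the divide-and-average update used in the code can be derived by applying Newton-Raphson to f(y) = y^2 - x. A short sketch, where y_n denotes the current guess (notation assumed here, not taken from the linked page):

```latex
\begin{aligned}
f(y) &= y^2 - x, \qquad f'(y) = 2y, \\
y_{n+1} &= y_n - \frac{f(y_n)}{f'(y_n)}
         = y_n - \frac{y_n^2 - x}{2 y_n}
         = \frac{1}{2}\left(y_n + \frac{x}{y_n}\right).
\end{aligned}
```

In the data step, y0 plays the role of y_n and y that of y_{n+1}.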

And the SAS code (self-explanatory):

data root;
        /*question: find the square root of 4*/
        x=4; 

        /*first choose a rough approximation of sqrt(4);
        actually, you can start with any numbers*/
        y0=1;        

        count=0;/*init count number*/

        do until (w<1e-8); /*set a small tolerance error*/
                count=count+1;   /*accumulate count number*/
                y=(y0+x/y0)/2;   /*Newton’s formula*/
                w=abs(y-y0); /*if close, exit;*/
                y0=y;        /* otherwise, keep the new one*/
        end;

        output;
run;

The outputs:          

x    y0    count         w           y

4     2      6      2.2204E-15    2

After 6 iterations, Newton-Raphson (also called divide-and-average) gets an approximate square root. Let’s see what happens during each iteration, compared with the output of the SAS function sqrt():

data root;
        x=4; 
        y0=1;       
        count=0;
        do until (w<1e-8);  
                count=count+1;  
                y=(y0+x/y0)/2; 
                w=abs(y-y0);
                y0=y;
                if y =sqrt(x) then is_eq_sqrt="YES";
                else is_eq_sqrt="NO";
                output;
        end;
run;

Outputs:

x       y0      count       w          y      is_eq_sqrt

4    2.50000      1      1.50000    2.50000      NO
4    2.05000      2      0.45000    2.05000      NO
4    2.00061      3      0.04939    2.00061      NO
4    2.00000      4      0.00061    2.00000      NO
4    2.00000      5      0.00000    2.00000      NO
4    2.00000      6      0.00000    2.00000      YES

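What’s the difference between count 5 and count 6, since their y values look the same? The displayed values are rounded: at count 5, y still differs from 2 by about 2.2E-15 (the w reported at count 6 in the first output), which is enough to make the comparison with sqrt(x) fail. If we reset the tolerance to 1e-3 rather than 1e-8, we get the outputs: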

x       y0      count       w          y      is_eq_sqrt

4    2.50000      1      1.50000    2.50000      NO
4    2.05000      2      0.45000    2.05000      NO
4    2.00061      3      0.04939    2.00061      NO
4    2.00000      4      0.00061    2.00000      NO      

With the looser tolerance, the loop converges faster but with a larger error: the approximation is still slightly away from sqrt(4).
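One way to look past the default display format is to print y with more decimals and its distance from sqrt(x) in scientific notation. A minimal sketch; the variable diff and the data set name root_diff are introduced here for illustration:

```sas
data root_diff;
    x = 4;
    y0 = 1;
    count = 0;
    do until (w < 1e-8);
        count = count + 1;
        y = (y0 + x/y0)/2;     /* Newton's formula */
        w = abs(y - y0);
        y0 = y;
        diff = y - sqrt(x);    /* signed distance from the SAS sqrt() result */
        output;
    end;
    keep count y w diff;
run;

/* show y with 16 decimals; w and diff in scientific notation */
proc print data=root_diff noobs;
    format y 20.16 w diff e12.;
run;
```

At count 5, diff should be tiny but nonzero, in line with the w of 2.2204E-15 reported at count 6 in the first output.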

To unearth the mystery, we need a deep understanding of how SAS stores numeric values, which deserves a full session of its own. Some basic references:

Happy SAS Graphing!

I’m not an R user. Instead, I’m an observer. For example, I love the R Graph Gallery:

http://addictedtor.free.fr/graphiques/thumbs.php

As a SAS programmer, I also love Robert Allison’s SAS/Graph Examples and Michael Friendly’s SAS Graphic Programs and Macros (the only two SAS graph galleries on the web that are maintained by users):

http://robslink.com/SAS/Home.htm

http://www.datavis.ca/sasmac/

I used to be a casual user of SAS Graph. I’d like to add Graph to my toolbox in the coming month: half of the reason is the encouragement of the two SAS graph gurus, the other half is SAS 9.2’s exciting enhancements to graphics.

P.S. The official SAS graph gallery:

http://support.sas.com/sassamples/graphgallery/index.html

Logic in mathematics and in daily life: a statistical programming example

Let’s review some basic logical propositions (or statements):

implication:    if     P then     Q (P → Q)

inverse:        if not P then not Q (¬P → ¬Q)

converse:       if     Q then     P (Q → P)

contrapositive: if not Q then not P (¬Q → ¬P)

contradiction:  if     P then not Q (P → ¬Q)

Mathematically or logically speaking, if the implication holds, then the contrapositive also holds, but the inverse need not. That is, from “if P then Q” we can conclude “if not Q then not P”, but we cannot conclude “if not P then not Q”.
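This can be checked with a small truth table. A minimal sketch in a data step, coding true as 1 and false as 0 and using the fact that “if A then B” is false only when A is true and B is false; all variable names are chosen here for illustration:

```sas
data truth_table;
    do p = 0 to 1;
        do q = 0 to 1;
            /* "if A then B" is written as (not A) or B */
            implication    = (not p) or q;              /* if P then Q         */
            contrapositive = (not (not q)) or (not p);  /* if not Q then not P */
            inverse        = (not (not p)) or (not q);  /* if not P then not Q */
            converse       = (not q) or p;              /* if Q then P         */

            /* 1 when the two statements have the same truth value */
            same_as_contrapositive = (implication = contrapositive);
            same_as_inverse        = (implication = inverse);
            output;
        end;
    end;
run;

proc print data=truth_table noobs;
run;
```

The implication and the contrapositive agree in all four rows, while the inverse differs from the implication in the two rows where exactly one of p and q is true.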

That’s all the logic needed here, so let’s turn to the ambiguous English of daily life. James R. Munkres of MIT gave an example in Topology (2nd edition, 2000, p. 7):

Mr. Jones, if you get a grade below 70 on the final, you are going to flunk this course.

We put it into the form of a logical implication:

Mr. Jones, if P then Q, where

P: you get a grade below 70 on the final

Q: you are going to flunk this course

Considering the context, we can also infer that the inverse holds: if you get a grade above or equal to 70, then you are going to pass this course (if not P then not Q).

Question: when doing statistical programming, what kind of logic do you use?

Answer: Not purely mathematical logic. See:

if score<70 then grade="flunk";  *if P then Q;
else                    grade="pass";  *if not P then not Q;

XML and SAS

Last month, I gave a talk, XML: the SAS Approach, at CDISC Interchange China 2010 (at the Medical School of Fudan University, Shanghai, 2010-09-15). The FDA favors CDISC and HL7, two XML-based standards, so SAS programmers in the biopharmaceutical industry need to add XML technology to their toolboxes. Fortunately, you don’t need to be an XML expert to work with XML in your daily work, and the SAS system DOES offer multiple tools and applications to handle XML files, i.e. to import and export XML data:

  • SAS data steps approach:            import and export
  • SAS XML Libname engine:             import and export
  • SAS ODS XML statement (ODS MARKUP): export
  • PROC CDISC:                         import and export
  • SAS XML Mapper:                     import
  • SAS CDISC Viewer:                   import (in effect)

The SAS CDISC Viewer and PROC CDISC are somewhat toy-like, while the rest really work. The talk also presented a Perl Regular Expression (PRX) approach to exporting and importing XML data.
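Of these, the XML LIBNAME engine is probably the shortest route for rectangular data. A minimal sketch of a round trip, assuming the engine’s default generic markup; the file name class.xml, the librefs xmlout and xmlin, and the data set name class_from_xml are all hypothetical:

```sas
/* export: write SASHELP.CLASS as an XML document */
libname xmlout xml "class.xml";

data xmlout.class;
    set sashelp.class;
run;

libname xmlout clear;

/* import: read the same document back into a SAS data set */
libname xmlin xml "class.xml";

data work.class_from_xml;
    set xmlin.class;
run;

libname xmlin clear;
```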

A simple demo. First, use FILE and PUT statements to generate an XML file:

data _null_;
    file "export.xml";
    put '<?xml version="1.0" encoding="windows-1252" ?>';
    put '<ROWSET>';
    put '<ROW>';
    put '<text> Welcome to CDISC Interchange 2010 China </text>';
    put '</ROW>';
    put '<ROW>';
    put '<text> We are in Shanghai! </text>';
    put '</ROW>';
    put '</ROWSET>';
run;

Then read the whole XML file to SAS dataset:

data import0 ;
    /*truncover: accept lines shorter than the 1024-character informat*/
    infile "export.xml" truncover lrecl = 1024;
    input line $1024.;
    if line = '' then delete; /*drop empty lines*/
run;

Third step: extract the information you want (the text between the <text> and </text> tags) using Perl Regular Expressions:

data import (keep = line );
     retain queName ;
     retain line ;
     set import0;

     /*use PRX to capture the structure of XML data;*/
     if _n_=1 then do;
            queName=prxparse('/^<text> /');
     end;
     queNameN=prxmatch(queName,line);

     /*use PRX to remove the XML tags;*/
     if queNameN>0 then do;
        rx1=prxparse("s/<.*?>//");
        call prxchange(rx1,99,line);
        output;
     end;
run;

The logic of the PRX approach to processing XML data is very simple and can easily be modified according to your needs:

  • extend the PRX patterns to capture the hierarchical structure of the XML data.
  • remove the XML tags and output the information to a SAS dataset.
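As one possible modification, a capture group can pull out the text between the tags in a single step instead of stripping the tags afterwards. A minimal sketch, assuming input lines shaped like those in export.xml; the data set names xml_lines and import_capture and the variable text_value are made up for illustration:

```sas
/* sample lines shaped like the body of export.xml */
data xml_lines;
    infile datalines truncover;
    input line $200.;
    datalines;
<ROW>
<text> Welcome to CDISC Interchange 2010 China </text>
</ROW>
<ROW>
<text> We are in Shanghai! </text>
</ROW>
;
run;

data import_capture (keep = text_value);
    length text_value $200;
    retain rx;
    set xml_lines;

    /* compile once; group 1 is the text between <text> and </text> */
    if _n_ = 1 then rx = prxparse('/<text>\s*(.*?)\s*<\/text>/');

    if prxmatch(rx, line) then do;
        text_value = prxposn(rx, 1, line);  /* contents of capture group 1 */
        output;
    end;
run;

proc print data=import_capture noobs;
run;
```

The same idea extends to other elements by changing the tag name in the pattern or by adding more capture groups.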

On three statistical realms

Peter Petocz and Anna Reid (2010) identified three levels in students’ conceptions of statistics:

  • Level I:   focus on techniques
  • Level II:  focus on using data
  • Level III: focus on meaning

Based on my personal experience, I found that the three conceptions can easily be read as three states of learning and using statistics:

  • State I: focus on techniques. As a student of Economics and (then) Software Engineering, I needed some statistical techniques to support my study of data mining and machine learning. So I invested a lot in fancy skills such as logistic regression, decision trees, neural networks and even support vector machines, in graduate school and at SAS R&D (as an intern). Most of the time, I just threw data at the models and checked their functionality and feasibility (Wula-IT-WORKS! or Oops-crash-again). Looking back, I have to say these techniques were toys played with in labs.
  • State II: focus on using data. Now I work as a SAS programmer (also titled statistical analyst) in pharma. The data are not just rows and columns in tables. They are SUBJECTS! Statistical techniques are used carefully to display and interpret the story of the real world. Why is the denominator 999 when 1000 subjects were recruited in this trial? Because subject 001-127, male, 23 months of age, discontinued due to his father’s wish and opinion!
  • State III: focus on meaning. Peter Petocz and Anna Reid concluded that, regarding the MEANING conception of statistics, “statistics is an inclusive tool used to make sense of the world and develop personal meanings.” The last state of any ideal realm always sounds like philosophy or religion. It may be a life lived in a statistical way or style (if I get there, I will change my blog’s title to From a Statistical Point of View ^).

—————-some notes on non-statistics—————————-

1. the three states of Chan

  • just mountain
  • isn’t mountain
  • still mountain

2. the three ideal realms of Wang Guowei

  • heaven is integrated with man:

Last night the west wind shriveled the green-clad trees,

Alone I climb the high tower

To gaze my fill along the road to the horizon.

  • knowledge is integrated with practice

My clothes grow daily more loose, yet care I not.

For you am I thus wasting away in sorrow and pain.

  • feeling is integrated with scenery

I sought her in the crowd a hundred, a thousand times.

Suddenly with a turn of the head [I saw her],

That one there where the lamplight was fading.

Reference:

Peter Petocz and Anna Reid. On Becoming a Statistician: A Qualitative View. International Statistical Review (2010), 78(2), 271.

Wang Guowei. Ren jian ci hua. Translated by Adele Austin Rickett.