Category Archive ‘Industry Review‘

 
 

What’s New

I haven’t blogged for a while during the first half of March, and there is a bunch of new stuff to catch up on:

I had a new baby! He was delivered on time (and on budget!). Lions, tigers and bears, oh my… His brother is Tiger, so I named him Leo.

And I got the latest SAS 9.3 (TS1M2) installed! SAS just keeps getting more beautiful.

SAS9.3_12.1

OpenCDISC has a new release, Version 1.4, with the new SDTM 3.1.3 validation checks, and yes, CDISC itself also had some significant updates:

SDTM v1.3 and SDTMIG v3.1.3 now have machine-readable metadata online. It’s a nice improvement (just last year I posted The Great, Open, Vendor-neutral, Platform-independent Data Standards, . . . Yet in PDF Formats).

Define-XML has now reached 2.0 (finally).

R had its final 2.* release, Version 2.15.3, and Version 3.0.0 is coming soon. RStudio also had an update recently. RStudio is the best IDE (not just R IDE) I have ever used.

Google will shut down Google Reader, the best RSS reader ever. It’s a huge loss. For example, the famous SAS and statistics blogger Wensui Liu once posted frequently on Windows Live Spaces, then on Blogspot and finally on WordPress. The former two blogs were closed, and the Google Reader feed is the only remaining archive of those lost posts!

Github for Clinical/Statistical Programmers

PhUSE-FDA Working Group 5 (Development of Standard Scripts for Analysis and Programming) just adopted Google Code as its collaborative programming platform. Google Code is one of the most popular and respected open source software hosting sites in the world, and it is definitely a good choice for PhUSE-FDA WG5.

But after viewing one of WG5’s working reports, Sharing Standard Statistical Scripts, and learning why they finally chose Google Code (rather than Github, which was also tested by WG5 members), I think it’s necessary to clear up some misunderstandings about Github, of which I’m also an occasional user.

According to Slide 11 of the report mentioned above, the concerns about Github were:

  • Too complicated an interface
  • Too much overhead for simple development
  • Too much training and education needed
  • Designed for classic programming languages like C and Java (not for things like R and SAS)

On the first point, regarding the interface, it seems only the Git command line was tested, and it may indeed be too complicated for “classic statistical programming users”. Actually, Github offers great GUI tools, for example GitHub for Windows, which help users visually clone repositories, commit changes and perform other management tasks without typing Git commands:

Github_GUI

It’s also worth mentioning that with GitHub for Windows, users don’t need to install any separate version control software like Git, CVS or SVN. GitHub for Windows already includes a fully functional version of msysGit. It just makes users’ lives much simpler. To use Google Code, you must install and configure something like TortoiseSVN.

Second, is Github suitable for “things like R and SAS”? It’s true that all hosts, including Github, are dominated by “classic programming languages like C and Java”. SAS programmers as a whole are just not active in any social coding activities, but R is actually one of the most used languages on Github.

Google Code is good, and the “Google Code vs Github” question is mostly subjective. It seems to me that WG5’s choice of Google Code over Github was based on incomplete information. I personally prefer Github, and there are some good reasons:

  • With the GUI tool, GitHub for Windows, the Git/SVN/CVS setup is kept to a minimum.
  • Github supplies much richer statistics reports, including charts.
  • Github is more socially oriented, which makes it cool in this Web 2.0 world.

The Great, Open, Vendor-neutral, Platform-independent Data Standards, . . . Yet in PDF Formats

You know I mean the CDISC standards, including CDASH, SDTM, SEND, ADaM, … and you are right that a few are not only in PDF format (ODM and define.xml, for example).

Today Jozef (Jos) Aerts from XML4Pharma posted about his frustration with copying and pasting metadata from the PDF-only SDTM-IG 3.1.4. I hate to complain about the volunteer work of the CDISC team, but it is worth a discussion: is there any better way to publish the CDISC standards?

Weekend Clips: Data Scientist Episode II

1. What is A Data Scientist Anyway?

image

2. You Just Can’t Be Replaced by Yourselves!

image

3. We are all Data Scientists!

image

Weekend Clip: Data Scientist

Two tweets:

image

image

One blog post:

Statisticians aren’t the problem for data science. The real problem is too many posers

One job advertisement:

SAS Data Scientist ?(!)

A joke:

The biggest joke about data scientist is that the Google query “data scientist joke” returns nothing interesting.

Is There Any Better Way? Publishing Process For CDISC Standards Documentation

 

1. The Pain

I read from Lex Jansen (@LexJansen) that CDISC SDTM v1.3 and SDTMIG v3.1.3 were newly released. It’s pretty nice, since CDISC SDTM was supposed to be released semiannually in the new publishing cycle. We can see the team put great effort into this new version, but frankly speaking, this delivery (the way it is presented, not the content itself) is far from elegant.

The new SDTM Implementation Guide (IG) v3.1.3 is just a temporary workaround, as an embedded file, “How to Use SDTMIG 3.1.3”, indicates:

SDTMIG 3.1.3 is presented as an annotated version of SDTMIG 3.1.2. This approach was taken for SDTMIG in order for the document to be released quickly without an extensive rewrite. The content presented as annotations will be incorporated into a single version of documentation in a future release.

What does “annotated” mean? To replace “should” with “must” in the file, you

  • strike through the word “should”,
  • insert the replacement “must”, and
  • add a sticky note to indicate the change above

SDTMv313

This is annoying. There are 143 sticky notes throughout the whole document, including replacements, deletions, file attachments and such, and the reason is said to be to allow “the document to be released quickly without an extensive rewrite”. BUT 143 sticky notes in a PDF file! That is already a huge editing effort!

2. The Reason (or The Conjecture)

Almost everybody complains about Microsoft Office Excel and Word, but hooray(!), they are still dominant in our workplaces (especially in the clinical world? I’m not sure). I don’t have any personal connection with the CDISC publishing team, but from the documentation released, I’m pretty confident that these files (SDTMIG v3.1.3 and others) were edited in Word and then published as PDF via Adobe products (a very common practice, isn’t it?).

Now you may understand why the CDISC publishing team delivered this “annotated” version, given limited time and human resources (although editing 143 sticky notes was also a big workload). The clue is Word! Word! Word!

Microsoft Word is extremely popular for its WYSIWYG (What You See Is What You Get) editing, but it can’t separate content from formatting, and it becomes a disaster when multiple users maintain a frequently updated Word file. In this CDISC SDTMIG case, there are about 143 content updates supplied by the CDISC community worldwide, but when applying such updates to the original Word file, it is always reasonable to worry that they would change something (yes, SOMETHING) unexpectedly! The biggest concern for the CDISC standard files, I guess (again with confidence), is whether such updates would destroy the in-text links or other cross-references that offer such nice navigation throughout the documentation.

So, this “annotated” version is at least safe (and SAFE is much more important than how it looks): no links proven workable in v3.1.2 will be broken in the rush to push out this new release, and things should get better in the future (from the same source, “How to Use SDTMIG 3.1.3”):

CDISC is currently discussing how future documentation will be published ensure documentation is easy to navigate and read and at the same time easy to maintain.

3. The Prospective

Yes, I will end with a (set of) suggestion(s). The bottom line is no more Word, and I promise no additional cost or pain compared to digging into Microsoft Word and Adobe Acrobat.

Take SDTM IG v3.1.3 as a demo project:

  • Convert all the contents of SDTM IG v3.1.2 (from the PDF, or the original Word file) to a text-based format. Personally I prefer Markdown and reStructuredText. Actually it doesn’t matter which one is chosen for a test, because such text-based formats can be easily converted to one another (much more easily than from Word/PDF). The benefits of these two formats are the separation of content from formatting, and that they are very intuitive to learn (much easier than HTML; almost WYSIWYG). The conversion can be partly automated but also needs manual cleanup. All in all, it is not a big deal; it is only about 300 pages.
  • Edit these text files according to the new SDTM IG v3.1.3 updates.
  • Distribute these text files (and rendered output files in PDF/HTML formats) on a vendor-supported or self-hosted collaboration site, like GitHub.
  • Call for CDISC team members and users to report any issues, and even encourage them to edit the files directly online (don’t worry, it won’t be a mess; we are in a version control system like GitHub).
  • Then the next version will come out naturally (and peacefully).

Then I’m looking forward to hearing your ideas.
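To illustrate why plain-text sources make such updates cheap, here is a minimal sketch in Python using the standard difflib module. The two sentences and the file names are invented stand-ins, not quotations from SDTMIG; the point is only that a “should” to “must” change shows up as a small, reviewable diff instead of a strikethrough plus a sticky note.

```python
import difflib

# Hypothetical wording, for illustration only (not quoted from SDTMIG).
old_text = [
    "Sponsors should populate the variable for every record.",
    "Dates are represented in ISO 8601 format.",
]
new_text = [
    "Sponsors must populate the variable for every record.",
    "Dates are represented in ISO 8601 format.",
]


def text_diff(old_lines, new_lines,
              old_name="sdtmig-3.1.2.md", new_name="sdtmig-3.1.3.md"):
    """Return a unified diff between two versions of a plain-text document.

    The file names are hypothetical labels used only in the diff header.
    """
    return list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile=old_name, tofile=new_name,
        lineterm="",
    ))


diff = text_diff(old_text, new_text)
for line in diff:
    print(line)
```

A version control system like Git produces this kind of diff for every commit, so each of the 143 updates could be reviewed and reverted on its own.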

4. Additional Notes

The markup formats mentioned in my proposal above, Markdown and reStructuredText, are not replacements for the CDISC metadata standard ODM and its XML derivative, Define.xml. Instead, they are better formats than Microsoft Word for community collaboration on editing the “narrative” parts of the models (the PDF files we read from CDISC), for example the SDTMIG discussed above.

Blogging is Awesome: CDISC Bloggers

I remember when blogging was cool.

Before the specializing and monetizing and Twitter-izing.

                                      —Peter Dewolf

Well, I think blogging is still cool (and awesome and awesome …). The most appealing personal reason is that blog posts are Google searchable and suitable for archiving, while tweets are NOT. Admittedly, I hold to some sort of Existentialism 2.0:

if it is not Google searchable, it doesn’t exist!

Last month I wrote a post on how to keep pace with CDISC through its official channels, and it feels cool to add an appendix of sources from the awesome blogosphere. Fortunately or not, CDISC is still a niche topic, and it takes little effort to compile the list (let me know if there is anyone else! If you are a Google Reader user, simply import this file, my Google Reader subscriptions on CDISC):

1. Blog @ Assero by Dave Iberson-Hurst (“Dave IH”)

http://www.assero.co.uk/category/blog/

Insightful and full of humor. I retweeted all of its latest posts, and you can get a feel for it from these titles (YES, on CDISC):

What I Want, What I Really Really Want

Churchill, the FDA and a Fall

Mad March and the FDA

Btw, I write blog posts in a casual way, while it is very impressive to read IH, whose style reminds me of George Orwell.

2. d-Wise Technologies Blog

http://www.d-wise.com/blog/

It is my employer’s official blog site, where Chris Decker is the key contributor on CDISC. You can check out his latest posts on the FDA/PhUSE Annual Computational Science Symposium, where he served as committee lead:

Overcoming Industry Challenges: A Shift to Collaboration

Validation and Quality: Are They the Same?

I will also commit to updating this blog as my understanding of clinical standards grows. As the saying goes:

look to the master,
follow the master,
walk with the master,
see through the master,
become the master.

3.  XML4Pharma Blog

http://cdiscguru.blogspot.com/

With industry news and hard-core (but cool) writing on XML (CDISC ODM, define.xml).

4. eClinical Trends by Clinovo

http://blog.clinovo.com/category/cdisc/

Clinovo jumped into this topic by launching a CDISC SDTM converter, CDISC Express.

5. eClinicalOpinion

http://eclinicalopinion.blogspot.com/

This blog is mostly focused on EDC, the clinical data management part. I like its series of discussions on CDISC ODM.

6. eCTD Regulatory Submissions Network

http://ectdregulatorysubmissionsnetwork.blogspot.com/

This is a personal blog by Shakul Hameed. I read it mostly to get information on submission requirements from European regulators.

7. HL7 Watch

http://hl7-watch.blogspot.com/

While it is not directly related to CDISC (neither is #6), it’s nice to hear some voices on HL7, which would be the future of CDISC.

8. From a Logical Point of View-CDISC

http://www.jiangtanghu.com/blog/category/cdisc/

Yes, this one, my 2 cents. I will keep recording my personal immersion in and understanding of CDISC and related clinical standards. (It is a privilege to cross-reference oneself in one’s own blog! Keep awesome, keep blogging.)

9. Linked Data and URI:s for Enterprises

http://kerfors.blogspot.com/

Look at the colon (:) in the title of this blog, and you’re right: this blog plays (at least) with XML. I find it a good resource (thanks @kerfors for referencing!) for learning ODM, the foundation of CDISC. The latest post is

Semantic models for CDISC based standard and metadata management

P.S.: Blogger Chris Hemedinger maintains a nice list of SAS bloggers (blogs by SAS employees, and blogs by SAS customers, consultants, and the analytics community).

An Analytical Valley: Big Data and Data Scientists (and SAS Programmers)

hadoop

Tom Davenport reported an observation that Silicon Valley is becoming more analytical, since companies in the Valley such as Google, Facebook, eBay and LinkedIn all have strong presences in analytics. Besides such predominant companies, I’d also like to add Yahoo to the list, although Yahoo is no longer at its peak. Yahoo is the largest sponsor of and contributor to Hadoop, an open source framework for distributed processing of so-called “big data”. When taking a look at the outstanding Facebook data team or LinkedIn data team, we can see that Hadoop is also one of the most overwhelmingly successful technical factors. Such Valley companies are themselves huge consumers of big data and have strong incentives to develop analytical solutions beyond their high technology product pipelines.

Analytical staff at LinkedIn have also helped a lot to promote the wide usage of the term “data scientist”. They identify themselves as data scientists, and that’s really cool. Now more and more statisticians are also glad to accept this brand new title. According to a survey at JSM (2011, Miami), more than 85% (164) of the statisticians there considered themselves “data scientists”.

McKinsey also released a report this May on big data and the huge shortage of qualified analytical talent. You know, when a management consulting firm begins to talk about something technical, following the discussion of the concept is no longer just a fashion. To embrace the challenge of big data, a person or a team needs a multidisciplinary background: basically speaking, computer science and statistics (data mining or machine learning is just an interdisciplinary subject between them). Here is an ambitious list on “How do I become a data scientist”:

http://www.quora.com/Educational-Resources/How-do-I-become-a-data-scientist

For these learning plans, just get the gist and don’t take them too seriously. Check where you stand and set your own priorities.

Notes for SAS Programmers

For SAS programmers, I read an exciting post saying that, besides High Performance Computing, SAS will also play with Hadoop by introducing some functionality in SAS/Access and SAS Data Integration Studio.

For SAS programmers with no IT background, it is not a good idea to jump immediately into algorithms, data structures and other hard-core computer science courses. Instead, I recommend taking full advantage of the SAS language and system itself to dive into the computer world gradually:

1. Learn and practice (and practice) SAS Proc SQL, which largely follows the ANSI SQL-92 standard. SQL is the common language of the database world, and SAS Proc SQL can help you switch smoothly to Oracle SQL, Teradata SQL, MySQL and other SQL implementations, although there are some non-critical differences in the details.

2. Dig into the operating-system-specific documentation of SAS, for example, in SAS 9.3, the SAS 9.3 Companion for Windows or the SAS 9.3 Companion for UNIX Environments, depending on the OS you are working on. These are critically important documents but are unfortunately often missing from SAS programmers’ reading lists.

Such docs help SAS programmers deal with the machines and expose them to the wider computer world in a way that a SAS programmer can understand. You can’t expect to become a computer expert via such docs, but at least you can communicate fluently with internal IT staff.

3. Then you will have all the confidence to play with computers and can switch to any other topics of interest in the list above!
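As a small illustration of item 1, the sketch below runs a plain SQL-92 style query through Python’s built-in sqlite3 module, which stands in here for any SQL engine. The table `ae` and its columns are hypothetical. I would expect the same SELECT statement to work with little or no change inside Proc SQL, though that is an assumption rather than something tested here.

```python
import sqlite3


def severe_event_counts(rows):
    """Count severe adverse events per treatment arm with a standard SQL query.

    `rows` is a list of (usubjid, armcd, aesev) tuples; the table layout is a
    made-up example, not a real SDTM dataset.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ae (usubjid TEXT, armcd TEXT, aesev TEXT)")
    conn.executemany("INSERT INTO ae VALUES (?, ?, ?)", rows)
    # Only basic SQL-92 features: WHERE, GROUP BY, ORDER BY, COUNT(*).
    query = """
        SELECT armcd, COUNT(*) AS n_events
        FROM ae
        WHERE aesev = 'SEVERE'
        GROUP BY armcd
        ORDER BY armcd
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


rows = [
    ("001", "A", "MILD"),
    ("002", "A", "SEVERE"),
    ("003", "B", "SEVERE"),
    ("004", "B", "SEVERE"),
]
counts = severe_event_counts(rows)
print(counts)
```

The query sticks to the common core of SQL on purpose; vendor-specific functions and date handling are where the “non-critical differences” tend to show up.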

Decision Trees in SAS Enterprise Miner and SPSS Clementine

Decision trees are included in SAS Enterprise Miner (EM). The counterpart is SPSS Clementine, which, to be precise, should be called IBM SPSS Modeler after IBM’s acquisition of SPSS.

Recently I read a paper comparing SAS EM, SPSS Clementine and IBM Intelligent Miner on their decision tree and clustering techniques:

Decision Tree Induction & Clustering Techniques in SAS Enterprise Miner, SPSS Clementine, and IBM Intelligent Miner – A Comparative Analysis by Abdullah M. Al Ghoson, Virginia Commonwealth University

The result is not that surprising. SAS EM does better in performance, functionality and auxiliary task support, but worse in usability.

SAS_VS_SPSS

Here are a few comments on the decision tree implementations in SAS EM and SPSS Clementine, based on my own experience. Some advice for beginners is also supplied.

There are four nodes in SPSS Clementine to support four tree algorithms respectively: C5.0, Classification And Regression Trees (CART), Quick, Unbiased, Efficient Statistical Tree (QUEST) and Chi-squared Automatic Interaction Detector (CHAID), which are the most famous and popular in the decision tree family.

SPSS_4_trees

Note that CART(R) is a registered trademark of California Statistical Software, Inc., and is licensed exclusively to Salford Systems, San Diego, California. So SPSS Clementine uses the name C&R Tree.

In SAS EM, there is only one decision tree node:

SAS_tree

The algorithm behind this node is called the SAS tree algorithm, which incorporates and extends the four mentioned above. Just by changing the settings in the decision tree node, you can get the trees you want.

Obviously, the SAS tree algorithm is superior to the separate ones in SPSS Clementine in extensibility and flexibility. But on the other hand, the complexity increases. A new user of SAS EM may wonder which tree he/she is training. An SPSS Clementine user just picks a node and says: OK, I am now training a CART or CHAID tree. He/she can communicate with others more smoothly.

Regardless of the industry application, I think this is the educational benefit of SPSS Clementine. Since almost every data mining book introduces decision trees as separate algorithms (such as ID3/C4.5/C5.0, CART, QUEST, CHAID, . . .), beginners using SPSS Clementine as an instructional tool may get a clear idea of the algorithms one by one. Once they fully understand the differences among the tree algorithms, they will train trees in SAS EM more comfortably.
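To make the differences among tree algorithms a little more concrete, here is a toy sketch in Python of two split criteria usually associated with these families: the Gini impurity reduction used by CART-style trees and the Pearson chi-square statistic used by CHAID-style trees. The counts are made up, and this is a simplified illustration of the textbook formulas, not the implementation inside SAS EM or SPSS Clementine.

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_gain(children):
    """Impurity reduction when a parent node is split into `children`.

    `children` is a list of class-count lists, one per branch.
    """
    parent = [sum(col) for col in zip(*children)]
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted


def chi_square(children):
    """Pearson chi-square statistic of the branch-by-class contingency table."""
    col_totals = [sum(col) for col in zip(*children)]
    n = sum(col_totals)
    stat = 0.0
    for child in children:
        row_total = sum(child)
        for observed, col_total in zip(child, col_totals):
            expected = row_total * col_total / n
            if expected > 0:  # skip empty cells to avoid division by zero
                stat += (observed - expected) ** 2 / expected
    return stat


# Made-up class counts [responders, non-responders] in each branch of a split.
split = [[30, 10], [10, 30]]
print("Gini gain:", gini_gain(split))
print("Chi-square:", chi_square(split))
```

Both functions score the same candidate split; a CART-style tree would pick the split with the largest impurity reduction, while a CHAID-style tree would look at the significance of the chi-square statistic.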

What’s more, SPSS Clementine supplies rich supporting documentation for beginners and self-learners, such as the Tutorial, User Guide, Algorithms Guide and Node Reference. The official documentation of SAS EM 5.x and 6.x is relatively poor. Yes, there is a good SAS Help and Documentation for SAS EM 4.3, including Getting Started with Enterprise Miner. But EM 4.3 is a traditional AF application, while EM 5.x and above are Java clients incorporated in the SAS analysis platform (they are totally different!). For EM 5.x and above, only installation guides and a plain reference are available.

SAS Institute may have its own marketing strategy. While no rich references are available, the Institute DOES offer rich training programs in data mining and Enterprise Miner applications. Well, the big-budget purchasers of SAS EM can also afford the training.

R or SAS: Quick Links to the Recent Debates


Original post, 7 Jan, 2009

Key Point
The popularity of R at universities could threaten SAS Institute.

A Controversial Review by Anne Milley from SAS
We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.

7 Jan, 2009,
SAS-L
Discussion on SAS-L, the most popular SAS mailing list. Most voices call for incorporating both R and SAS.

7 Jan, 2009,
R-help
Cheers for the victory of R.

8 Jan, 2009,
Ashlee Vance’s blog
R You Ready for R, with lots of comments

9 Jan, 2009,
SAS Consulting

9 Jan, 2009, Anne Milley
This Post Is Rated R, stating SAS’s viewpoint on open source software: support and participation.
For more, see Google Blog Search.