What’s New

I didn’t blog for a while in the first half of March, and there is a bunch of new stuff to catch up on:

I had a new baby! He was delivered on time (and on budget!). Lions, tigers and bears, oh my… His brother is Tiger, so I named him Leo.

And I got the latest SAS 9.3 (TS1M2) installed! SAS just keeps getting more beautiful.

SAS9.3_12.1

OpenCDISC had its latest release, Version 1.4, with the new SDTM 3.1.3 validation checks, and yes, CDISC itself also had some significant updates:

SDTM v1.3 and SDTMIG v3.1.3 now have machine-readable metadata online. It’s a nice improvement (last year I posted The Great, Open, Vendor-neutral, Platform-independent Data Standards, . . . Yet in PDF Formats).

Define-XML has now moved to 2.0 (finally).

R had its final 2.* release, Version 2.15.3, and Version 3.0.0 is coming soon. RStudio also had an update recently. RStudio is the best IDE (not just the best R IDE) I have used.

Google will shut down Google Reader, the best RSS reader ever. It’s a huge loss. For example, the famous SAS and statistical blogger Wensui Liu once posted frequently on Windows Live Spaces, then on Blogspot, and finally on WordPress. The former two blogs were closed, and the Google Reader feed is the only way to retrieve those lost posts!

Github for Clinical/Statistical Programmers

PhUSE-FDA Working Group 5 (Development of Standard Scripts for Analysis and Programming) just adopted Google Code as its collaborative programming platform. Google Code is one of the most popular and respected open source software hosting sites in the world, and it is definitely a good choice for PhUSE-FDA WG5.

But after viewing one of WG5’s working reports, Sharing Standard Statistical Scripts, and learning why they finally chose Google Code (rather than Github, which was also tested by WG5 members), I think it’s necessary to clear up some misunderstandings about Github, of which I’m also an occasional user.

As stated on Slide 11 of the report mentioned above, Github:

  • Too complicated an interface
  • Too much overhead for simple development
  • Too much training and education needed
  • Designed for classic programming languages like C and Java (not for things like R and SAS)

On the first point, regarding the interface, it seems only the Git command line was tested, and that may well be too complicated for “classic statistical programming users”. Actually, Github offers great GUI tools, for example GitHub for Windows, which help users visually clone repositories, commit changes and perform other management tasks without typing Git commands:

Github_GUI

It’s also worth mentioning that with GitHub for Windows, users don’t need to install any separate version control software like Git, CVS or SVN. GitHub for Windows already includes a fully functional version of msysGit. It just makes users’ lives much simpler. To use Google Code, you must install and configure something like TortoiseSVN.

Second, is Github suitable for “things like R and SAS”? It’s true that all hosts, including Github, are dominated by “classic programming languages like C and Java”. SAS programmers as a whole are just not active in any social coding activities, but R is actually one of the most used languages on Github.

Google Code is good, and the “Google Code vs Github” question is mostly subjective. It seems to me that WG5’s choice of Google Code over Github was based on incomplete information. I personally prefer Github, and there are some good reasons:

  • The GUI tool, GitHub for Windows, keeps the Git/SVN/CVS setup to a minimum.
  • Github supplies much richer statistics reports, including charts.
  • Github is more socially oriented, which makes it cool in this Web 2.0 world.

Best of SAS: A Personal Nomination 2012

These SAS applications/procedures/features did not necessarily first become available in 2012; this is the year I paid special attention to them, when I began to use SAS 9.3. The following notes are entirely my personal endorsement, based purely on my own experience as a user:

XML File Reading: SAS XML Mapper

SAS_XML_Mapper

SAS XML Mapper itself is not an elegant tool from a software design perspective (an example: I keep multiple versions because the latest version does not seem to carry over all the functionality of the old ones), but it is the best XML file processing tool for SAS programmers like me. The sweet part of SAS XML Mapper is that you can use it to get SAS datasets directly with an automatically generated XML mapping file. I use it to import XML files from

CDISC ODM based files like define.xml

metadata querying results (XML) returned by SAS Metadata Server

and it works pretty well and I just can’t live without it!

Graphics Facility: ODS Graphics

ODS Graphics has been the rising star of the SAS product family since its advent, and I have also fallen in love with it. This system contains

five procedures with “SG” in names,

a template language(GTL), and

two GUI tools, ODS Graphics Editor, and ODS Graphics Designer

in Base SAS and ODS Statistical Graphics in SAS/STAT (and in SAS/QC which I didn’t check out). ODS Graphics just makes SAS graphs much more beautiful and graphics tasks much more fun (and elegant!).

Btw, it may not be a fair game, but it is still nice to compare a SAS ODS graph with some randomly picked R graphs:

This is from Rick Wicklin, with 3 lines of code including one “RUN” statement:

proc sgscatter data=sashelp.iris; 
matrix SepalLength--PetalLength /group=Species diagonal=(histogram kernel);
run;

SAS_corr2

and this is the R homepage graph (and the R code):

R_Graphics

and this is the R Graph Gallery homepage:

r_graph_gallery

I must say SAS ODS Graphics rocks!

Statistical Procedure: PROC TTEST and PROC FREQ

In this category I list two because they are equally relevant and important for me as a statistical SAS programmer.

PROC TTEST makes equivalence testing (which is extremely popular in clinical research) much more accessible by adding a TOST (two one-sided tests) option. Years ago SAS programmers might have used PROC MIXED or other methods for this kind of statistical test (I also took a note on this topic, see here).

What’s new in PROC FREQ, as far as I checked out, is support for much richer methods of calculating confidence intervals for a binomial proportion (my note here) and confidence intervals for the difference between independent binomial proportions (my note here), which are also extremely important in clinical research and which I have programmed a lot.

Report Writing: ODS Report Writing Interface

My first SAS version was 9.1.3, where PROC REPORT was the primary SAS report writing tool (with ODS) while PROC TABULATE was no longer in fashion (and you have rarely heard arguments between these two procedures since then). I used PROC REPORT for all my production reporting work until recently, when I tried the ODS Report Writing Interface in a project for non-rectangular tables. It’s great and everyone was happy!

Basically it is an ODS-enhanced DATA _NULL_ report writing method (DATA _NULL_ with FILE PRINT statements is an even older way for me); a new ODS output object is declared within DATA _NULL_:

dcl odsout obj();

I like this kind of reporting method: you can control your report line by line and cell by cell (although with more lines of structured code!).

Metadata Querying Tool: PROC METADATA

In this category, the other two strong candidates are the Java interface and the SAS DATA step functions. I prefer PROC METADATA to Java because it offers the same full functionality (the SAS DATA step functions do not yet) while I can still work in SAS to produce reports (just add a new line to start using the ODS Report Writing Interface:)). Furthermore, I feel much more comfortable working with SAS!

PROC METADATA uses XML for inputs and outputs, which may be less appealing compared to the DATA step functions. Since I use SAS XML Mapper, it’s not a problem anymore!

Documentation

I like the totally new SAS help and documentation system both online and offline since SAS 9.3.

First in Base SAS, lots of files were separated from “SAS Language Dictionary”,

SAS_Docs_offline

and in all procedures guide, the tab view looks great:

SAS_Docs

The Great, Open, Vendor-neutral, Platform-independent Data Standards, . . . Yet in PDF Formats

You know I mean the CDISC standards including CDASH, SDTM, SEND, ADaM, … and you are right, there are a few that are not only in PDF format (ODM and define.xml, for example).

Today Jozef (Jos) Aerts from XML4Pharma posted his frustration at copying and pasting metadata from the PDF-only SDTM-IG 3.1.4. I hate to complain about the volunteer work of the CDISC team, but it is worth a discussion: is there any better way to publish CDISC standards?

Is There Any Better Way? Publishing Process For CDISC Standards Documentation

 

1. The Pain

I read from Lex Jansen (@LexJansen) that CDISC SDTM v1.3 and SDTMIG v3.1.3 were newly released. It’s pretty nice, since CDISC SDTM was supposed to be released semiannually in the new publishing cycle. We can see the team put great effort into this new version, but frankly speaking, this delivery (the way it is displayed, not the content itself) is far from elegant.

The new SDTM Implementation Guide (IG) v3.1.3 is just a temporary workaround shipment, as an embedded file, “How to Use SDTMIG 3.1.3”, indicates:

SDTMIG 3.1.3 is presented as an annotated version of SDTMIG 3.1.2. This approach was taken for SDTMIG in order for the document to be released quickly without an extensive rewrite. The content presented as annotations will be incorporated into a single version of documentation in a future release.

What does “annotated” mean? When you replace “should” with “must” in the file, you

  • strike through the word “should”
  • insert the replacement “must”, and
  • add a sticky note to indicate the change above

SDTMv313

This is annoying. There are 143 sticky notes throughout the whole document, covering replacements, deletions, file attachments and such, and the reason is said to be to allow “the document to be released quickly without an extensive rewrite”. BUT 143 sticky notes in a PDF file! That is already a huge editing effort!

2. The Reason (or The Conjecture)

Almost everybody complains about Microsoft Office Excel and Word, but hooray(!), they are still dominant in our workplaces (especially in the clinical world? I’m not sure). I don’t have any personal connection with the CDISC publishing team, but from the documentation released, I’m pretty confident that these files (SDTMIG v3.1.3 and others) were edited in Word and then published to PDF via Adobe products (a very common practice, isn’t it?).

Now you may understand why the CDISC publishing team delivered this “annotated” version, given limited time and human resources (although editing 143 sticky notes was also a big workload). The clue is Word! Word! Word!

Microsoft Word is extremely popular for its WYSIWYG (What You See Is What You Get) editing, but it can’t separate content from formatting, and it becomes a disaster when multiple users maintain a frequently updated Word file. In this CDISC SDTMIG case, there are about 143 content updates supplied by the CDISC community worldwide, but when applying such updates to the original Word file, it is always reasonable to worry that they would change something (yes, SOMETHING) unexpectedly! The biggest concern for CDISC standard files, I guess, again with confidence, is whether such updates destroy the in-text links or other cross references that offer nice navigation throughout the documentation.

So this “annotated” version is at least safe (and SAFE is much more important than how it looks): no links proven workable in v3.1.2 will be broken in this time-pressed new release, and things should get better in the future (from the same source, “How to Use SDTMIG 3.1.3”):

CDISC is currently discussing how future documentation will be published ensure documentation is easy to navigate and read and at the same time easy to maintain.

3. The Prospective

Yes I will end with a (set of) suggestion(s). The bottom line is no Word anymore and I promise no additional cost and pain compared to digging into Microsoft Word and Adobe Acrobat.

Take SDTM IG v3.1.3 as a demo project:

  • Convert all the contents of SDTM IG v3.1.2 (from the PDF, or the original Word file) to a text-based format. Personally I prefer Markdown and reStructuredText. Actually it doesn’t matter which one is chosen for test purposes, because such text-based formats can easily be converted into one another (much more easily than from Word/PDF). The benefits of these two formats are the separation of content from formatting and that they are very intuitive to learn (much easier than HTML; almost WYSIWYG). This task can be partly done by machine but also needs manual modification. All in all, it is not a big deal: it is only about 300 pages.
  • Edit these text files according to the new SDTM IG v3.1.3 updates.
  • Distribute these text files (and rendered output files in PDF/HTML formats) to a vendor-supported or self-hosted collaboration site, like GitHub.
  • Call for CDISC team members and users to report any issues, and even encourage them to edit the files directly online (don’t worry, it won’t be a mess; we are in a version control system like GitHub).
  • Then the next version will come out naturally (and peacefully).

then I’m looking forward to hearing your ideas.
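To illustrate why a plain-text source would make the sticky notes unnecessary, here is a minimal sketch of how a “should” to “must” change shows up as an ordinary text diff. The sentence and the file names are hypothetical and not taken from the SDTMIG; a version control system like Git would record the same kind of line-level change automatically.

```shell
# Hypothetical one-line excerpts standing in for two versions of a text-based IG.
workdir=$(mktemp -d)
printf '%s\n' 'Dates should be in ISO 8601 format.' > "$workdir/sdtmig_v312.md"
printf '%s\n' 'Dates must be in ISO 8601 format.'   > "$workdir/sdtmig_v313.md"

# diff exits with status 1 when the files differ, so guard it with "|| true".
changes=$(diff "$workdir/sdtmig_v312.md" "$workdir/sdtmig_v313.md" || true)
echo "$changes"

rm -rf "$workdir"
```

The output marks the old line with “<” and the new line with “>”, which carries the same information as the strikethrough, the inserted word and the sticky note together.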

4. Additional Notes

The markup standards mentioned above in my proposal, Markdown and reStructuredText, are not replacements for the CDISC metadata standard, ODM, and its XML derivative, Define.xml. Instead, they are better formats than Microsoft Word for community collaboration on editing the “narrative” parts of the models (the PDF files we read from CDISC), for example, the SDTMIG we discussed before.

Blogging is Awesome: CDISC Bloggers

I remember when blogging was cool.

Before the specializing and monetizing and Twitter-izing.

                                      —Peter Dewolf

Well, I think blogging is still cool (and awesome and awesome …). The most appealing personal reason is that blog posts are Google searchable and suitable for archiving, while Tweets are NOT. Admittedly I hold some sort of Existentialism 2.0:

if it is not Google searchable, it doesn’t exist!

Last month I published a post on how to keep pace with CDISC through its official channels, and I feel it would be cool to add an appendix of sources from the awesome blogosphere. Fortunately or not, CDISC is still a niche topic, and it took little effort to compile the list (let me know if there are others! If you are a Google Reader user, simply import this file, my Google Reader subscription on CDISC):

1. Blog @ Assero by Dave Iberson-Hurst (“Dave IH”)

http://www.assero.co.uk/category/blog/

Insightful and full of humor. I retweeted all of its latest posts, and you can get a feel for it from these titles (YES, on CDISC):

What I Want, What I Really Really Want

Churchill, the FDA and a Fall

Mad March and the FDA

Btw, I write blogs in a casual way, while IH’s impressive writing reminds me of George Orwell’s style.

2. d-Wise Technologies Blog

http://www.d-wise.com/blog/

It is my employer’s official blog site, where Chris Decker is the key contributor on CDISC. You can check out his latest posts on the FDA/PhUSE Annual Computational Science Symposium, where he served as committee lead:

Overcoming Industry Challenges: A Shift to Collaboration

Validation and Quality: Are They the Same?

I will also commit to contributing to this blog as my understanding of clinical standards grows. As the saying goes:

look to the master,
follow the master,
walk with the master,
see through the master,
become the master.

3.  XML4Pharma Blog

http://cdiscguru.blogspot.com/

with industry news and hard (yet cool) writing on XML (CDISC ODM, define.xml).

4. eClinical Trends by Clinovo

http://blog.clinovo.com/category/cdisc/

Clinovo jumped into this topic by launching a CDISC SDTM converter, CDISC Express.

5. eClinicalOpinion

http://eclinicalopinion.blogspot.com/

This blog is mostly focused on EDC, the clinical data management part. I like its series of discussions on CDISC ODM.

6. eCTD Regulatory Submissions Network

http://ectdregulatorysubmissionsnetwork.blogspot.com/

This is a personal blog by Shakul Hameed. I read it mostly to get information on submission requirements from European regulators.

7. HL7 Watch

http://hl7-watch.blogspot.com/

While it is not directly related to CDISC (nor is #6), it’s nice to hear some voices on HL7, which may be the future of CDISC.

8. From a Logical Point of View-CDISC

http://www.jiangtanghu.com/blog/category/cdisc/

Yes, this one, my 2 cents. I will keep recording my personal immersion in and understanding of CDISC and related clinical standards. (It is a privilege to cross reference oneself in one’s own blog! Keep awesome, keep blogging.)

9. Linked Data and URI:s for Enterprises

http://kerfors.blogspot.com/

Look at the colon (:) in the title of this blog and you’re right, this blog plays (at least) with XML. I find it a good resource (thanks @kerfors for referencing!) for learning ODM, the foundation of CDISC. The latest post is

Semantic models for CDISC based standard and metadata management

P.S.: Blogger Chris Hemedinger maintains a nice list of SAS bloggers (blogs by SAS employees, and blogs by SAS customers, consultants, and the analytics community).

OpenCDISC Validator V1.3: An Unboxing Review (1): counting issue

The latest OpenCDISC Validator, version 1.3, was released on 29 March 2012 (btw, there is a typo in Line 1 of CHANGELOG.txt within the package: it should read “2012”, not “2011”). As usual, you can submit the following SAS script to get some basic information (remember to customize your directory):

filename CDISC url "https://raw.github.com/Jiangtang/Programming-SAS/master/Rules_Count_OpenCDISC_XML.sas";

%include CDISC;

%Rules_Count_OpenCDISC_XML(dir=C:OpenCDISC1.3compareopencdisc-validator_1.3config)

and you get a summary of validation rules of OpenCDISC Validator V1.3 (499 total unique rules):

OpenCDISC_V1.3

where

AD: Analysis Data
CT: Controlled Terminology
DD: Data Definition
OD: Operational Data Model
SD: Study Data
SE: SEND data

As a comparison, here is a summary of V1.2.1 (385 total unique rules), posted before:

The most significant enhancement of V1.3 over V1.2.1 is the addition of rules for SDTM 3.1.2 with Amendment 1 and SEND 3.0. You can see there are also some changes in other modules, such as ADaM 1.0 and SDTM 3.1.2. The OpenCDISC release newsletter said that 43 new SDTM rules were added. Well, with rules deleted, rules added and rules commented out, we now have some arithmetical discrepancies.

The script above captures all instances of validation rule IDs (and drops those that are commented out; for example, four rules are commented out in config-define-1.0.xml: OD0004, OD0005, OD0007, OD0008). We can also double check the counts manually:

  • copy all the contents, for example those for SDTM 3.1.2 on its website, into Notepad++ (where line numbers are displayed)
  • delete all unnecessary entries
  • then the last line number is the total number of the rules (227 in this case).

Another way to check the rules is to open the XML configuration files using a web browser:

Theoretically the three ways should give identical counts, but there is an open bug in the style sheet file …OpenCDISC1.3opencdisc-validatorconfigresourcesxslconfig.xsl, Line 175:

<xsl:template match="val:Unique|val:Condition|val:Match|val:Regex

|val:Required|val:Lookup|val:Metadata">

There is no “val:Find” to render the Find validation rules (e.g. AD0061 in config-adam-1.0.xml), so none of the Find validators are displayed. A suggested workaround is just to add “val:Find” to the pattern:

<xsl:template match="val:Unique|val:Condition|val:Match|val:Regex

|val:Required|val:Lookup|val:Metadata|val:Find">

Actually, on the “OpenCDISC Validation Framework” page of the OpenCDISC website, the “Find” validator is not documented yet.
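A count that does not depend on the style sheet at all can serve as a cross check. The following is a minimal sketch, assuming rule IDs appear as an ID attribute made of two letters and four digits (like AD0061); the sample file is a hypothetical, heavily simplified stand-in for a real OpenCDISC config file, and the sketch does not skip commented-out rules.

```shell
# Hypothetical, simplified stand-in for an OpenCDISC config file.
tmp_config=$(mktemp)
cat > "$tmp_config" <<'EOF'
<val:ValidationRules>
  <val:Required ID="SD0002" Variable="USUBJID"/>
  <val:Match ID="SD0003" Variable="AESTDTC"/>
  <val:Find ID="AD0061" Variable="PARAMCD"/>
  <val:Required ID="SD0002" Variable="STUDYID"/>
</val:ValidationRules>
EOF

# Pull out every ID="XX9999" attribute, de-duplicate, and count the unique IDs.
rule_count=$(grep -o 'ID="[A-Z][A-Z][0-9][0-9][0-9][0-9]"' "$tmp_config" \
  | sort -u | wc -l | tr -d '[:space:]')
echo "unique rules: $rule_count"

rm -f "$tmp_config"
```

Because the pattern looks only at the ID attribute, a Find rule is counted like any other validator type, whether or not the style sheet renders it.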

<to be continued>

Fetch CDISC Controlled Terminology Files in NCI Vocabulary Repository: All in One Click

CDISC Controlled Terminology is the most frequently updated model among the CDISC standards. Take SDTM as an example: the latest SDTM terminology was released on 23 March 2012, and from 2009 to 2011 there were 15 different SDTM terminology versions! If you just rely on your own local repository, you might fall behind.

Here is a simple approach. Just submit the following one line of code in a shell,

wget http://evs.nci.nih.gov/ftp1/CDISC/ -r --no-parent -l 3

and you will get all the CDISC Controlled Terminology files (plus historical versions) with the proper folder structure on your local drive (in the current directory of your shell).

For syntax details, you can refer to its online manual:

-r: retrieve recursively

--no-parent: do not ascend to the parent directory, so only files under the URL are fetched

-l 3: set the maximum recursion depth to 3

If you are a Windows user, you can install Wget for Windows (it is a native GNU tool on Unix/Linux) and add it to your Path environment variable. You can also paste the above command into Notepad and save it as a .bat file (test.bat, for example). Next time, you just click test.bat to get all the updates.
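On Unix/Linux, the counterpart of test.bat is a small shell script. The following is a minimal sketch: the function name and the --dry-run switch are made up for illustration, and the -N (timestamping) option is my addition to the one-liner above, so that repeated runs skip files that have not changed on the server.

```shell
# Hypothetical wrapper around the one-liner; pass --dry-run to print the
# command instead of running it.
fetch_cdisc_ct() {
  mode=${1:-run}
  # -N is an addition: only re-download files newer than the local copies.
  set -- wget -r --no-parent -l 3 -N "http://evs.nci.nih.gov/ftp1/CDISC/"
  if [ "$mode" = "--dry-run" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Show what would be executed without touching the network.
planned=$(fetch_cdisc_ct --dry-run)
echo "$planned"
```

Calling fetch_cdisc_ct without arguments performs the actual download into the current directory.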

Wget is a very powerful tool. For me, the download speed is quite acceptable (it depends on the internet connection), and there is almost no difference between a Windows 7 and an Ubuntu 11 machine:

Wget_Win7

Wget_ubuntu11

Quick Notes on RTP CDISC User’s Group Q1 Meeting

It was my first time attending a local event, the RTP (Research Triangle Park) CDISC User’s Group Q1 meeting, and here are some quick notes.

1. people

Mostly fresh faces for me. It was nice to meet Jack Shostak of Duke Clinical Research Institute again. I visited him at Duke last year after SAS Global Forum in Las Vegas. Jack has a forthcoming book on SAS and CDISC, Implementing CDISC Using SAS: An End-to-End Guide. It’s the first book on this topic and worth waiting for!

I also (unexpectedly and excitingly) met a Chinese friend, Chunmao, at the meeting. Very interesting: after introductions, we realized that we had emailed about CDISC mapping before! Chunmao had just moved from DC to the Triangle as a SAS programmer a few weeks before (a side message: the Triangle is hiring!). A big bonus of attending this meeting.

My colleague Chris Decker of d-Wise Technologies also showed up at this meeting. Actually he and Jack both serve as committee members of the RTP CDISC User’s Group (they are also core members of the CDISC community worldwide).

Tom Soeder of Cato (venue supplier for this meeting) kindly served as host, while Jeff Abolafia of Rho was the moderator.

2. agenda

Jeff and another key member of this group introduced some important updates from CDISC. One of the most interesting messages for me is the regular release cycle of the SDTM Model and Implementation Guide. SDTM will be released semiannually, so we will get SDTMIG 3.1.3 this summer and 3.1.4 at the end of the year, which will mainly hold the recent updates of Trial Summary and an amendment, and the CDISC Device domains, respectively.

SDTM is the flagship model of CDISC. SDTMIG 3.1.1 was published in 2005 and 3.1.2 in 2008. It’s nice to see from the new, more frequent release schedule that the CDISC community is expanding (and becoming more organized and predictable).

Recently SDTM has had lots of updates, including a copy of the Metadata Submission Guideline (MSG). The CDISC organization will also offer periodic webinars on updates.

Chris then gave a summary of the latest FDA/PhUSE Computational Science Symposium (CSS, which Chris organized). You can get more information on Chris’s blog and the CDISC blog. It’s worth keeping CSS on your watch list.

Jack and Jeff commented on working with the FDA/PhUSE working groups.

Peter Schaefer of Certara presented the results of the latest CDISC user network survey, where SDTM and ADaM are still at the top of users’ lists.

The final part (the most practical): group exercises! Three groups were assigned to map some challenging CRF pages to SDTM. Some users also brought CRF pages from their own companies for public discussion (nice to have some flavor!).

3. Links

RTP CDISC User’s Group on Yahoo Group (the traffic is low but still informative):

http://tech.groups.yahoo.com/group/rtp_cdisc/

CDISC official site:

http://www.cdisc.org/

Now we have more reasons to visit the CDISC website frequently for newly updated models (e.g., Controlled Terminology is also released semiannually) and webinar postings.

FDA/PhUSE working groups Wiki:

http://www.phusewiki.org/wiki/index.php?title=PhUSE_Wiki

Lots of activity to follow across the six working groups.

Chris is one of the core members promoting CDISC among industry and regulators, and he is also the most active writer on the d-Wise blog, where you can stay informed:

http://d-wise.com/blog/

GitHub and Weekend Programming

Yihui of Iowa State just texted me that GitHub is programmers’ Facebook. Inspired by him (great thanks!), I have also begun to play with GitHub:

https://github.com/Jiangtang

Currently I have created only one repo, as a personal SAS code repository. To kill weekend time, I uploaded a piece of code to count the OpenCDISC validation rules by model. To use it:

filename CDISC url "https://raw.github.com/Jiangtang/Programming-SAS/master/Rules_Count_OpenCDISC_XML.sas";

%include CDISC;

%Rules_Count_OpenCDISC_XML(dir=C:tempOpenCDISCsoftwareopencdisc-validatorconfig)

and you get:

OC_by_model

Happy weekend and happy programming.