Category Archive ‘XML’

 
 

What’s New

I haven’t blogged for a while during the first half of March, and there is a bunch of new stuff to catch up on:

I had a new baby! He was delivered on time (and on budget!), lions, tigers and bears, oh my… His brother is Tiger, so I named him Leo.

And I got the latest SAS 9.3 (TS1M2) installed! SAS just keeps getting more beautiful.

SAS9.3_12.1

OpenCDISC had its latest release, Version 1.4, with the new SDTM 3.1.3 validation checks, and yes, CDISC itself also had some significant updates:

SDTM v1.3 and SDTMIG v3.1.3 now have machine-readable metadata online. It’s a nice improvement (just last year I posted The Great, Open, Vendor-neutral, Platform-independent Data Standards, . . . Yet in PDF Formats).

Define-XML has now moved to 2.0 (finally).

R had its final 2.* release, Version 2.15.3, and Version 3.0.0 is coming soon. RStudio also had an update recently. RStudio is the best IDE (not just the best R IDE) I have ever used.

Google will shut down Google Reader, the best RSS reader ever. It’s a huge loss. For example, the famous SAS and statistical blogger Wensui Liu once posted frequently on Windows Live Spaces, then on Blogspot and finally on WordPress. The first two blogs were closed, and the Google Reader feed is the only archive of those lost posts!

Blogging is Awesome: CDISC Bloggers

I remember when blogging was cool.

Before the specializing and monetizing and Twitter-izing.

                                      —Peter Dewolf

Well, I think blogging is still cool (and awesome and awesome …). The most appealing reason for me personally is that blog posts are searchable through Google and suitable for archiving, while tweets are NOT. Admittedly I hold to some sort of Existentialism 2.0:

if it is not Google searchable, it doesn’t exist!

Last month I wrote a post on how to keep pace with CDISC through its official channels, and I think it would be cool to add an appendix of sources from the awesome blogosphere. Fortunately or not, CDISC is still a niche topic, so it takes little effort to compile the list (let me know if I missed anyone! If you are a Google Reader user, simply import this file, my Google Reader subscription on CDISC):

1. Blog @ Assero by Dave Iberson-Hurst (“Dave IH”)

http://www.assero.co.uk/category/blog/

Insightful and full of humor. I retweeted all of its latest posts, and you can get a feel for it from these titles (YES, they are about CDISC):

What I Want, What I Really Really Want

Churchill, the FDA and a Fall

Mad March and the FDA

Btw, I write my blog in a casual way, so it is very impressive to read IH, whose style reminds me of George Orwell.

2. d-Wise Technologies Blog

http://www.d-wise.com/blog/

It is my employer’s official blog, where Chris Decker is the key contributor on CDISC topics. You can check out his latest posts on the FDA/PhUSE Annual Computational Science Symposium, where he served as a committee lead:

Overcoming Industry Challenges: A Shift to Collaboration

Validation and Quality: Are They the Same?

I will also commit to updating this blog as my understanding of clinical standards grows. As the saying goes:

look to the master,
follow the master,
walk with the master,
see through the master,
become the master.

3.  XML4Pharma Blog

http://cdiscguru.blogspot.com/

With industry news and hard-core (yet cool) writing on XML (CDISC ODM, define.xml).

4. eClinical Trends by Clinovo

http://blog.clinovo.com/category/cdisc/

Clinovo jumped into this topic by launching a CDISC SDTM converter, CDISC Express.

5. eClinicalOpinion

http://eclinicalopinion.blogspot.com/

This blog is mostly focused on EDC, the clinical data management part. I like its series of discussions on CDISC ODM.

6. eCTD Regulatory Submissions Network

http://ectdregulatorysubmissionsnetwork.blogspot.com/

This is a personal blog by Shakul Hameed. I read it mostly to get information on submission requirements from the European regulators.

7. HL7 Watch

http://hl7-watch.blogspot.com/

While it is not directly related to CDISC (neither is #6), it’s nice to hear some voices on HL7, which may be the future of CDISC.

8. From a Logical Point of View-CDISC

http://www.jiangtanghu.com/blog/category/cdisc/

Yes, this one, my 2 cents. I will keep recording my personal immersion in and understanding of CDISC and related clinical standards. (It is a privilege to cross-reference oneself in one’s own blog! Keep awesome, keep blogging.)

9. Linked Data and URI:s for Enterprises

http://kerfors.blogspot.com/

Look at the colon (:) in the title of this blog and you’re right: this blog plays (at least) with XML. I find it a good resource (thanks @kerfors for referencing!) for learning ODM, the foundation of CDISC. The latest post is

Semantic models for CDISC based standard and metadata management

P.S.: Blogger Chris Hemedinger maintains a nice list of SAS bloggers (blogs by SAS employees, and blogs by SAS customers, consultants, and the analytics community).

OpenCDISC Validator V1.3: An Unboxing Review (1): counting issue

The latest OpenCDISC Validator, version 1.3, was released on 29 March 2012 (btw, there is a typo in Line 1 of CHANGELOG.txt within the package: it should read “2012”, not “2011”). As usual, you can submit the following SAS script to get some basic information (remember to customize your directory):

filename CDISC url "https://raw.github.com/Jiangtang/Programming-SAS/master/Rules_Count_OpenCDISC_XML.sas";

%include CDISC;

%Rules_Count_OpenCDISC_XML(dir=C:OpenCDISC1.3compareopencdisc-validator_1.3config)

and you get a summary of validation rules of OpenCDISC Validator V1.3 (499 total unique rules):

OpenCDISC_V1.3

where

AD: Analysis Data
CT: Controlled Terminology
DD: Data Definition
OD: Operational Data Model
SD: Study Data
SE: SEND data

For comparison, here is the summary of V1.2.1 (385 total unique rules) posted before:

The most significant enhancement of V1.3 over V1.2.1 is the addition of rules for SDTM 3.1.2 with Amendment 1 and SEND 3.0. You can see there are also some changes in other modules, such as ADaM 1.0 and SDTM 3.1.2. The OpenCDISC release newsletter said that 43 new SDTM rules were added. Well, with rules deleted, rules added and rules commented out, we now have some arithmetical discrepancies.

The script above captures all instances of validation rule IDs (and drops the commented-out ones; for example, four rules are commented out in config-define-1.0.xml: OD0004, OD0005, OD0007, OD0008). We can also double-check the counts manually:

  • copy all the contents, for example the SDTM 3.1.2 rules on the OpenCDISC website, into Notepad++ (which displays line numbers)
  • delete all unnecessary entries
  • the last line number is then the total number of rules (227 in this case).
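
The counting idea itself can be sketched in a few lines of SAS. This is a minimal sketch, not the Rules_Count_OpenCDISC_XML macro: it assumes that every rule in a configuration file carries an attribute of the form ID="SD0002" (two capital letters plus four digits), the file name is a placeholder, and, unlike the macro, it does not skip commented-out rules.

```sas
/* Placeholder path: point this at your own copy of a configuration file */
filename cfg "config-sdtm-3.1.2.xml";

data rule_ids (keep = rule_id);
    length rule_id $ 6 line $ 32767;
    retain rx;
    infile cfg truncover lrecl = 32767;
    input line $char32767.;

    /* assumed pattern: ID="XX9999"; \b keeps OID="..." attributes out */
    if _n_ = 1 then rx = prxparse('/\bID="([A-Z]{2}\d{4})"/');

    /* walk through every match on the current line */
    start = 1;
    stop  = length(line);
    call prxnext(rx, start, stop, line, pos, len);
    do while (pos > 0);
        rule_id = prxposn(rx, 1, line);
        output;
        call prxnext(rx, start, stop, line, pos, len);
    end;
run;

/* unique rules per two-letter prefix (AD, CT, DD, OD, SD, SE) */
proc sql;
    select substr(rule_id, 1, 2)   as prefix,
           count(distinct rule_id) as n_rules
    from rule_ids
    group by calculated prefix;
quit;
```

Because commented-out rules are still counted here, the totals may come out slightly higher than those from the macro.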

Another way to check the rules is to open the XML configuration files using a web browser:

In theory the three ways should give identical counts, but there is an open bug in the style sheet file …OpenCDISC1.3\opencdisc-validator\config\resources\xsl\config.xsl, Line 175:

<xsl:template match="val:Unique|val:Condition|val:Match|val:Regex

|val:Required|val:Lookup|val:Metadata">

There is no “val:Find” in the match pattern to render the Find validation rules (e.g. AD0061 in config-adam-1.0.xml), so none of the Find validators are displayed. A suggested workaround is simply to add “val:Find” to the pattern:

<xsl:template match="val:Unique|val:Condition|val:Match|val:Regex

|val:Required|val:Lookup|val:Metadata|val:Find">

Actually, the “Find” validator is not yet documented on the “OpenCDISC Validation Framework” page of the OpenCDISC website.

<to be continued>

Face Off: Review OpenCDISC XML files

OpenCDISC, the first open source CDISC validator, is already in the toolbox of FDA reviewers (CDER/CBER, see CDISC Standards in the Regulatory Submission Process, 26 January 2012, P.33). The key feature of OpenCDISC is the separation of validation rules (XML based) from application logic. Currently OpenCDISC Validator (Version 1.2.1) officially supports the following four CDISC modules:

You can get the corresponding configuration files (validation rules) online or in the software folder (in ..\opencdisc-validator\config, with the extension .xml). Since SDTM 3.1.2 has the richest set of validation rules, drawn from Janus, WebSDM and of course OpenCDISC’s own additional rules, its configuration file (config-sdtm-3.1.2.xml) deserves more attention. A better understanding of config-sdtm-3.1.2.xml is the first step toward customizing the software to business needs. The following are some personal tips and tricks for playing with and even “torturing” the file, using Notepad++, web browsers (IE and Firefox), Excel with MSXML and SAS XML Mapper.

1. DON’T use the Windows default Notepad to open and edit the xml file

XML_Notepad

The reason:

if you use Notepad to open an XML file, you get almost nothing but strings and strings.

For another supporting reason, see the picture below.

2. USE Notepad++ or other REAL text editors to open and edit it

XML_Notepad

Notepad++ makes the difference. It supports a multi-tab view, XML syntax highlighting, XML tag matching and other fancy stuff never found in plain Notepad. And like OpenCDISC, it’s free, in the sense of both free beer and free speech.

Other real text editors include Vim, UltraEdit and such, but for most users I still think Notepad++ is the handiest one.

3. First, use a web browser to review it

XML_IE

This is the web view of config-sdtm-3.1.2.xml. The secret is a style file, define-1.0.xsl, in ..\opencdisc-validator\config\schematron. This is another story of dichotomy: the config-sdtm-3.1.2.xml file itself only stores the metadata (machine-readable), while the style file (also an XML file) tells the browser how to display it (human-readable). Web browsers with a suitable built-in XSLT processor can render it (I tested IE and Firefox; Google Chrome doesn’t work). Excel can also render this XML file well (tested only on Excel 2010 and 2007), although the web view is much better:

XML_Excel

4. The real awesome job: use Microsoft XML parser or other XML parsers to dig into XML structure

XML_Tags_Excel

I use Excel 2010 with the Microsoft XML parser (MSXML 6.0). You can check the version of your MSXML by visiting this website in IE; you will get different results in other web browsers because Firefox and Chrome use other parsers.

You can also get an instance of each XML tag:

XML_Tags_Excel_preview

5. The real awesome job: use SAS XML Mapper to get the tabulation view

You may also want to extract all the tables in the XML file in a tabulation view, ideally as SAS datasets:

For example, the first few rows in config-sdtm-3.1.2.xml:

ODM_xml_tab

and the corresponding SAS dataset:

ODM_tab

Actually you can put all the data in the XML file into one big dataset, but with lots of redundancy. To use SAS XML Mapper (the latest version is 9.3), you should design a mapping file that describes the structure of the XML file. For the simple ODM dataset, you specify the table name and, for each column, its name, path, type and length:

map

It is never fun to play with XML files. SAS XML Mapper is supposed to read CDISC ODM based XML files automatically (OpenCDISC XML files are said to be ODM compliant), but at least for this config-sdtm-3.1.2.xml it failed, and that’s why we have to create a mapping file (see above) ourselves. Fortunately you don’t need to write it from scratch (it would be thousands of lines of code):

  • find a CDISC ODM based XML file that SAS XML Mapper can read automatically; e.g., the file define-example1.xml at http://www.cdisc.org/define-xml works well.
  • use the AutoMap function in SAS XML Mapper to generate the mapping file.
  • modify the mapping file to fit your needs.
  • for details, refer to the SAS XMLMap syntax documentation.
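
Once the mapping file exists, it can be used from SAS code through the XMLMAP= option of the XML LIBNAME engine. The following is only a rough sketch: the file names and the libref are placeholders, the table name ODM is taken from the simple example above, and the XMLV2 engine is an assumption that fits maps saved by XML Mapper 9.3 in the 2.1 syntax (maps in the older 1.x syntax would go with the XML engine instead).

```sas
/* Placeholder paths: point them at your own XML file and XMLMap */
filename cfgxml "config-sdtm-3.1.2.xml";
filename cfgmap "config-sdtm-3.1.2.map";

/* The libref matches the fileref of the XML document,
   so the engine reads that document through the map */
libname cfgxml xmlv2 xmlmap=cfgmap access=readonly;

/* Inspect one mapped table, then copy it into WORK */
proc contents data=cfgxml.ODM varnum;
run;

data work.odm;
    set cfgxml.ODM;   /* ODM: table name assumed from the map */
run;

libname cfgxml clear;
```

Copying the table into WORK first is a deliberate choice: the XML engine reads the file sequentially, so later steps run against an ordinary SAS dataset.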

6. Final Notes for Excel

Right click config-sdtm-3.1.2.xml then open with “Microsoft Excel”:

Excel1

Option 2 goes to section 3. If you go with option 1:

Excel2

Option 1-1 and 1-2: tabulation view in section 5

Option 1-3: tag view in section 4.

Dive into CDISC Express (1): Introductory

Recently, for a personal project, I did some research on Clinovo’s open source application CDISC Express, a SAS application built on an Excel framework and designed to map clinical data to CDISC SDTM domains automatically. It is not perfect yet, but it is easy to understand and practically usable after a few hours of exploring the user guide. Most importantly, it is on the right track: an automatic CDISC converter is the magic weapon in almost every clinical programmer’s dream.

CDISC Express is the first and only practically usable open source CDISC converter I have ever met. I wrote a post a month ago when I first tested it with great interest and reported some issues to its fix system. I then also had the great opportunity to discuss the software via email with its core developer, Romain Miralles. This post is just my personal notes on how to use and dig into the software, and will best serve as working documentation. Feel free to get back to me with any questions and comments.

By the way, there is an opportunity to practice, and you will also have a chance to win an iPad 2 in Clinovo’s CDISC Express Contest:

http://www.clinovo.com/cdisc/game

The due date is July 15th and I have already submitted my work. That’s fun.

1. Download and Installation

You can get CDISC Express for free at

http://www.clinovo.com/cdisc/download

It is a Windows application and will be installed by default in

C:\Program Files\CDISC Express

After installation, this path is stored as the macro variable &CDISCPATH in the following six SAS files, all located in C:\Program Files\CDISC Express\programs:

create_new_study.sas

generate_Definexml.sas

generate_mapping_template.sas

generate_SDTM.sas

Validate_Mapping_File.sas

Validate_SDTM_Domains.sas

The macro variable reads as

%LET CDISCPATH = C:\Program Files\CDISC Express;

If you change the destination folder during installation, e.g., to D:\CDISC Express, the value of the macro variable &CDISCPATH is changed accordingly in the six files mentioned above:

%LET CDISCPATH = D:\CDISC Express;

Note that if you want to copy the whole folder to another destination, you should at least manually change the value of &CDISCPATH in those six files, or add some code to capture the path automatically. From this point of view, the path setting of CDISC Express is not completely portable. If you have such a need, I recommend simply re-installing the software in whatever destination you want. It does not write any records to the registry, so you can have many copies on one machine.
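
One possible way to "capture the path" in code is sketched below. It is not part of CDISC Express, only an illustration under two assumptions: the program is either submitted in batch (so the SYSIN option holds its path) or run from the Enhanced Editor on Windows (which sets the SAS_EXECFILEPATH environment variable), and the program still sits in the programs subfolder of the copied installation.

```sas
data _null_;
    length pgm root $ 1024;

    /* batch mode: the SYSIN option holds the path of the submitted program */
    pgm = getoption('SYSIN');

    /* Windows Enhanced Editor: SAS_EXECFILEPATH holds the path of the open program */
    if missing(pgm) then pgm = sysget('SAS_EXECFILEPATH');

    /* drop the trailing "\programs\<file>" to get the installation root */
    root = prxchange('s/[\\\/]programs[\\\/][^\\\/]+$//i', 1, strip(pgm));

    call symputx('CDISCPATH', root, 'G');
run;

%put NOTE: CDISCPATH resolves to &CDISCPATH;
```

If neither source is available (for example in another editor), &CDISCPATH ends up empty, so a manual %LET remains the fallback.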

The following discussion assumes the software is installed in C:\Program Files\CDISC Express.

2. Workflow

You can follow all six action steps, one by one, as coded in

C:\Program Files\CDISC Express\programs

1) Create a new study (create_new_study.sas)

Simple and easy. Just assign a new study name in a macro call and run.

2) Generate mapping file (generate_mapping_template.sas)

This is the critical and most time-consuming part. You should design mapping rules in Excel spreadsheets (the MAPPING FILE) for every domain needed. Once that is done, all the other tasks, such as generating SDTM datasets, SAS transport files and define.xml, and validation, can be done just by clicking buttons.

3) Validate mapping file (Validate_Mapping_File.sas)

To validate the mapping file, just click the button. As mentioned, the most important work is designing the mapping file, and you will go back and forth between designing the mapping file and validating it.

4) Generate SDTM datasets (generate_SDTM.sas)

If the mapping file is OK, click the button.

5) Validate SDTM datasets (Validate_SDTM_Domains.sas)

Click the button.

6) Generate Define.xml (generate_Definexml.sas)

Click the button.

The following parts will dig into the software step by step.

<to be continued>

XML and SAS

Last month I gave a talk, XML: the SAS Approach, at CDISC Interchange China 2010 (at the Medical School of Fudan University, Shanghai, 2010-09-15). The FDA favors CDISC and HL7, two XML based standards, so SAS programmers in the biopharmaceutical industry need to incorporate XML technology into their toolboxes. Fortunately, you don’t need to be an XML expert to work with XML in your daily work, and the SAS system DOES offer multiple tools and applications for handling XML files, i.e. importing and exporting XML data:

  • SAS DATA step approach: import and export
  • SAS XML LIBNAME engine: import and export
  • SAS ODS XML statement (ODS MARKUP): export
  • PROC CDISC: import and export
  • SAS XML Mapper: import
  • SAS CDISC Viewer: import (sort of)

The SAS CDISC Viewer and PROC CDISC are something of toys, while the rest really work. A Perl Regular Expression (PRX) approach to exporting and importing XML data is also presented.

A simple demo. First, use FILE and PUT statements to generate an XML file:

data _null_;
    file "export.xml";
    put '<?xml version="1.0" encoding="windows-1252" ?>';
    put '<ROWSET>';
    put '<ROW>';
    put '<text> Welcome to CDISC Interchange 2010 China </text>';
    put '</ROW>';
    put '<ROW>';
    put '<text> We are in Shanghai! </text>';
    put '</ROW>';
    put '</ROWSET>';
run;

Then read the whole XML file into a SAS dataset:

data import0;
    /* read each physical line of the file into one character variable */
    infile "export.xml" truncover lrecl = 1024;
    input line $1024.;
    if line = '' then delete;
run;

Third step: extract the information you want (the text between the <text> and </text> tags) using Perl Regular Expressions:

data import (keep = line);
     retain queName rx1;
     set import0;

     /* compile both regular expressions once, on the first observation */
     if _n_ = 1 then do;
            queName = prxparse('/^<text> /');
            rx1     = prxparse('s/<.*?>//');
     end;

     /* use PRX to find the lines that carry the data */
     queNameN = prxmatch(queName, line);

     /* use PRX to remove the XML tags (-1 replaces every match) */
     if queNameN > 0 then do;
        call prxchange(rx1, -1, line);
        output;
     end;
run;

The logic of the PRX approach to processing XML data is very simple and can easily be adapted to your needs:

  • elaborate the PRX code to capture the hierarchical structure of the XML data.
  • remove the XML tags and output the information to a SAS dataset.
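
For comparison, the XML LIBNAME engine from the tool list can read the same demo file without any regular expressions. This is a minimal sketch: it assumes export.xml from the first step is in the current directory, and that the engine's default generic format treats each <ROW> element as one observation of a table named ROW; the libref demo and the output dataset name are made up for illustration.

```sas
/* Assumes export.xml from the demo above is in the current directory */
libname demo xml "export.xml" access=readonly;

/* Each <ROW> element should become one observation of table ROW,
   with its <text> child as a character variable named text */
data import_lib;
    set demo.ROW;
run;

proc print data=import_lib;
run;

libname demo clear;
```

The engine suits regular, table-like XML such as this; the PRX approach gives more control when the structure is irregular.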