Recently I posted a frequency analysis of Rick Wicklin’s popular SAS/IML blog. Sanjay Matange also produced a nice heatmap of Rick’s blogging history using the summary data I published. Here I share the ideas and the SAS code I used to get the data from Rick’s blog dynamically. You can modify the code slightly to obtain data from the other SAS in-house blogs (http://blogs.sas.com/index.php), since they share the same template. For other blogs, you will need to study the web pages to find the most suitable method, and this post can serve as an example.
First step: define the scope
For my purpose, I only need the titles and publish dates of Rick’s posts, the so-called “metadata” of the blog. I do not need the post contents. By the way, if you need everything, you can use a blog backup tool, or write code to retrieve all the pages of http://blogs.sas.com/iml at the maximum depth. Or you can simply write to Rick and say: hey Rick, could you please send me all the contents of your blog? Rick may then go to the management console of his blog, export all the contents to an XML file and get back to you.
Second step: analyze the web pages
Browse to the right panel of Rick’s blog, in the ARCHIVES frame, click “Older”:
And you get
This page gives only the big picture of Rick’s blog (the ARCHIVES section is always a good place to get such metadata; see, for example, the archives of my blog). But we need more. Click “view topics”, for example for September 2010:
This page is exactly what we want, with titles and dates. Open an editor and immediately write code to read all the information on this page? Wait. Currently this blog has posts across 11 months, and you can expect that number to grow. You should design a dynamic method to read all the topics pages: Sep 2010, Oct 2010, … up to today().
Return to the archives page. Right-click and select “View page source” if you use the Google Chrome browser (“View Source” in IE; “View Page Source” in Firefox) and you get the full HTML source (note: you do NOT need any knowledge of HTML to understand this post). Copy and paste it into a text editor that supports HTML syntax highlighting (such as Notepad++). Search for all instances of the “view topics” text we mentioned before:
We are lucky. There are 11 instances of “view topics”, accompanied by 11 hyperlinks for the current 11 months of archives of Rick’s blog. We can read these 11 hyperlinks into a macro array for dynamic retrieval.
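As a minimal sketch of the macro array idea (not the exact code used for this post), the step below pulls the first href="..." value out of each line and stores it in macro variables summary1, summary2, … plus a counter total. The dataset archive_demo, its two sample lines and the example.com links are hypothetical stand-ins for the real “view topics” lines.

```sas
/* Hypothetical input: two lines shaped like the "view topics" lines */
data archive_demo;
  infile datalines truncover;
  input text $char200.;
  datalines;
(<a href="http://example.com/2010/09/summary.html">view topics</a>)
(<a href="http://example.com/2010/10/summary.html">view topics</a>)
;
run;

/* Build the macro array: summary1..summaryN and total */
data _null_;
  set archive_demo end=eof;
  length link $200;
  pos = index(text, 'href="');          /* assumes the link is the first href on the line */
  if pos > 0 then do;
    link = scan(substr(text, pos + 6), 1, '"');  /* text up to the closing quote */
    n + 1;                               /* running count of links found */
    call symputx(cats('summary', n), link);
  end;
  if eof then call symputx('total', n);
run;

%put NOTE: total=&total first=&summary1;
```

Each link can then be referenced inside a macro loop as &&summary&i, with &total as the upper bound.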
Then we return to a single topics page, for example September 2010, and review its HTML source. Search for “posted_by_date” and we get 14 instances, which is the same as the number of posts in September 2010:
We also need to locate all the titles. Search for “/iml/index.php?/archives/” and we get 17 hits:
The 3 instances at the end of the search results don’t contain any titles. You can check other pages to confirm this pattern. Yes, we could use regular expressions to parse the HTML pages and locate the titles more exactly. But for a quick job on relatively simple HTML pages, some basic SAS character functions are enough for our purpose. In the following code, regular expressions are used only to remove HTML tags such as “<a href=”.
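To illustrate the tag removal, here is a small sketch using PRXCHANGE. The sample line and its title are hypothetical, and the pattern simply drops anything between “<” and “>”, which is an assumption that works for simple pages like these but is not a general HTML parser.

```sas
data _null_;
  length text clean $200;
  /* hypothetical sample line, not copied from the real page */
  text = '<a href="/iml/index.php?/archives/1-example.html">An example title</a>';
  /* remove every <...> tag; -1 means replace all matches */
  clean = strip(prxchange('s/<[^>]+>//', -1, text));
  put clean=;
run;
```

The log should show only the text between the tags, here the title itself.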
After this exploratory search of the HTML source, we have a basic idea of where to find the information we are interested in. Now we can begin coding.
Third step: Coding at last!
For our purpose, we first read the archive page to get all the topics links into a macro array, then read all the topics pages dynamically. Finally, we add all the calendar dates with holidays. Some friends may find that they have seen pieces of the following code before. Yes, this code just assembles some skills I learned from Art Carpenter, Richard DeVenezia, Jian Dai and lots of other programmers!
3-1: read archive page
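filename archive URL "&URL"; /* &URL must hold the address of the archives page */
data archive;
length text $1024;
infile archive lrecl=1024 truncover;
input text $1024.; /* read the whole line; "input text $" would stop at the first blank */
if index(text, ">view topics<") then output;
run;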
3-2: read all topics pages
set archive1 end=eof;
if eof then call symputx('total',II);
%do i=1 %to &total;
filename f&i URL "&&summary&i";
length text $1024;
infile f&i lrecl=1024 truncover;
input text $1024.; /* read the whole line, not just the first blank-delimited word */
if index(text, "/iml/index.php?/archives/") or index(text, "posted_by_date") then output;
/*remove HTML tags;*/
if index(text,"201") and length(text)<10 then delete;/*be careful! hard coding;*/
by grpn seq flag;
if first.flag then title=lag(text);
if flag=1 then delete;
set %do i=1 %to &total; fff&i %end; ;
format worddat ddmmyy10.;
proc sort ;
by worddat descending seq ;
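The fragments above lost some of their surrounding statements, so here is a rough sketch of how the loop over the topics pages might be assembled. It assumes the macro variables total and summary1 … summaryN already exist, needs network access to run, and uses the hypothetical names read_topics and all_topics; the cleaning steps (tag removal, titles, dates) are left out.

```sas
%macro read_topics;
  /* one fileref and one dataset per topics page */
  %do i = 1 %to &total;
    filename f&i URL "&&summary&i";
    data fff&i;
      length text $1024;
      infile f&i lrecl=1024 truncover;
      input text $1024.;   /* whole line, up to 1024 characters */
      /* keep only the lines carrying a title link or a post date */
      if index(text, "/iml/index.php?/archives/")
         or index(text, "posted_by_date") then output;
    run;
    filename f&i clear;
  %end;

  /* stack all monthly datasets into one */
  data all_topics;
    set %do i = 1 %to &total; fff&i %end; ;
  run;
%mend read_topics;

%read_topics
```

Generating the SET list with a macro loop means the program picks up new months automatically as the archive grows.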