Parsing JSKit/Echo XML comments files
I just switched over from my temporary JSKit comments to custom local comments. The main reason I went with JSKit to begin with rather than just not have comments is that they provide the comments in an XML file. This meant that I was able to convert the JSKit/Echo comments on my site to the new system.
I wrote the conversion script in Python because my comments database uses Django on the back end.
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
from optparse import OptionParser
import sys, urlparse, datetime
import xml.dom.minidom as minidom

# optparse expands %prog to the script name
parser = OptionParser(u'%prog [options] <jskit file>')
(options, args) = parser.parse_args()

if not args:
    parser.print_help()

def getEntry(comment, key):
    entry = comment.getElementsByTagName(key)
    # an empty element has no firstChild, so check for it before reading .data
    if entry and entry[0].firstChild:
        return entry[0].firstChild.data.strip()
    return None

def getValue(comment, key):
    possibilities = comment.getElementsByTagName('jskit:attribute')
    entry = None
    for possibility in possibilities:
        if possibility.getAttribute('key') == key:
            entry = possibility
            break
    if entry:
        value = entry.getAttribute('value').strip()
        return value
    return None

def getPosterURL(webpresence):
    if '],[' in webpresence:
        webpresences = webpresence.split('],[')
    else:
        webpresences = [webpresence]
    for webpresence in webpresences:
        webpresence = webpresence.strip('["]')
        if webpresence:
            service, serviceURL = webpresence.split('","')
            if service in ['login-twitter', 'login-blogspot']:
                return serviceURL
            if service not in ['login-openid', 'login-gfc']:
                print 'Unknown service:', service, serviceURL
                sys.exit()
    return None
```
The “getEntry” function just gets any subelement from an XML element by name. The “getValue” method gets the value of a specific keyed element named jskit:attribute, which is what JSKit uses to store the commenter’s IP address as well as sometimes their identity and personal web site.
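To make the two lookups concrete, here is a minimal sketch in Python 3 (the script above targets Python 2). The sample XML is invented for illustration: the element layout, the namespace URI, and the values are assumptions, not a real JSKit export. The snake_case names `get_entry` and `get_value` are hypothetical re-spellings of the functions described above.

```python
import xml.dom.minidom as minidom

# Invented sample; the namespace URI and all values are placeholders.
SAMPLE = """<rss xmlns:jskit="http://example.com/jskit-namespace">
  <channel>
    <link>http://example.com/page</link>
    <item>
      <author> Alice </author>
      <description>Nice post!</description>
      <jskit:attribute key="IP" value=" 192.0.2.10 "/>
      <jskit:attribute key="user_identity" value="alice"/>
    </item>
  </channel>
</rss>"""

def get_entry(element, key):
    """Return the stripped text of the first subelement named key, or None."""
    matches = element.getElementsByTagName(key)
    if matches and matches[0].firstChild is not None:
        return matches[0].firstChild.data.strip()
    return None

def get_value(element, key):
    """Return the value of the jskit:attribute whose key matches, or None."""
    for candidate in element.getElementsByTagName('jskit:attribute'):
        if candidate.getAttribute('key') == key:
            return candidate.getAttribute('value').strip()
    return None

item = minidom.parseString(SAMPLE).getElementsByTagName('item')[0]
print(get_entry(item, 'author'))       # Alice
print(get_value(item, 'IP'))           # 192.0.2.10
print(get_value(item, 'Webpresence'))  # None
```

Both helpers return None when nothing matches, which is what lets the calling code fall back from the author element to user_identity and then to a default name.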
Finally, “getPosterURL” tries to get their personal web site from the sites listed in jskit:attribute keyed as “Webpresence”. That element’s value contains both public sites and private login URLs. Since JSKit displayed them publicly, I thought it would be nice to the commenter to continue linking to their Twitter or Blogspot site in my new system. But, at least among my commenters, I can’t see any reason to link to a person’s openid or gfc URL. (And if the webpresence is none of those four types, the function will immediately bail and let you know, so that you can add it to either the good list or the ignore list.)
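As a rough illustration of that filtering, here is a Python 3 sketch. The Webpresence string format is inferred from the parsing code above (bracketed, quoted service/URL pairs) and the sample values are made up. The name `get_poster_url` and the two set names are hypothetical, and this version raises ValueError where the original script prints and exits.

```python
# Services worth linking to, and services to skip silently (from the post).
LINKABLE = {'login-twitter', 'login-blogspot'}
IGNORED = {'login-openid', 'login-gfc'}

def get_poster_url(webpresence):
    """Return the first linkable URL in a Webpresence value, or None."""
    for pair in webpresence.split('],['):
        # remove the leftover brackets and quotes around each pair
        pair = pair.strip('["]')
        if not pair:
            continue
        service, service_url = pair.split('","')
        if service in LINKABLE:
            return service_url
        if service not in IGNORED:
            # unknown service: stop so it can be added to one of the lists
            raise ValueError('Unknown service: %s %s' % (service, service_url))
    return None

# Made-up value in the assumed format.
sample = ('["login-openid","http://openid.example.com/alice"],'
          '["login-twitter","http://twitter.com/alice"]')
print(get_poster_url(sample))  # http://twitter.com/alice
```

Keeping the two lists as named sets means a newly reported service only needs to be added in one place.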
Here is the main loop that goes through every entry in the JSKit file to extract each comment and its associated info: commenter name, IP address, and potentially their web address:
```python
empties = []

for xmlfile in args:
    pages = minidom.parse(xmlfile)
    for page in pages.getElementsByTagName('channel'):
        pageURL = page.getElementsByTagName('link')[0].firstChild.data
        comments = page.getElementsByTagName('item')
        if comments:
            print pageURL
            url = urlparse.urlparse(pageURL)
            for comment in comments:
                pubDate = getEntry(comment, 'pubDate')
                #not too sure about this--it ignores the time zone information
                pubDate = datetime.datetime.strptime(pubDate, '%a, %d %b %Y %H:%M:%S +0000')
                originalComment = getEntry(comment, 'description')
                ipAddress = getValue(comment, 'IP')
                poster = getEntry(comment, 'author')
                if not poster:
                    poster = getValue(comment, 'user_identity')
                if not poster:
                    poster = 'Guest'
                posterURL = None
                webpresence = getValue(comment, 'Webpresence')
                if webpresence:
                    posterURL = getPosterURL(webpresence)
                if not posterURL:
                    posterURL = ''
                print "\t", pubDate
                print "\t\tIP:", ipAddress
                print "\t\tPoster:", poster
                print "\t\tSnippet:", originalComment[:100].replace("\n", ' ')
                if posterURL:
                    print "\t\tPoster’s web URL:", posterURL
        else:
            empties.append(pageURL)

if empties:
    print 'Listed pages with no comments:'
    print "\t"+"\n\t".join(empties)
```
This is pretty basic XML parsing in Python. If there is no name for the commenter, they get the name “Guest”. Python 2’s datetime.datetime.strptime can’t parse a numeric time zone offset, so the format string hard-codes +0000; as far as I can tell, JSKit always provides the pubDate in universal time. So make sure that your database also expects it in universal time.
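If you are on Python 3 and would rather not hard-code the offset, one option is to let the standard library read it and then normalize to UTC. This is only a sketch: `parse_pub_date` is a hypothetical helper, the sample dates are made up, and treating a date without a usable offset as UTC is an assumption based on the observation above.

```python
import datetime
from email.utils import parsedate_to_datetime

def parse_pub_date(pub_date):
    """Parse an RSS-style pubDate and return an aware datetime in UTC."""
    parsed = parsedate_to_datetime(pub_date)
    if parsed.tzinfo is None:
        # no usable offset in the string: assume universal time
        parsed = parsed.replace(tzinfo=datetime.timezone.utc)
    return parsed.astimezone(datetime.timezone.utc)

print(parse_pub_date('Fri, 31 Aug 2012 02:17:00 +0000'))
print(parse_pub_date('Fri, 31 Aug 2012 04:17:00 +0200'))
```

Both calls describe the same instant, so a file that ever did contain a non-zero offset would still end up stored in universal time.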
The JSKit file also has entries for pages that don’t have comments. Just in case I’m not understanding why they do this, I also list out all of the “empties” at the end of the script. They don’t appear to be pages that had comments but which no longer do; at least one page on my site that used to have comments is not in that list.
October 29, 2012: Parsing JSKit/Echo XML using PHP
According to dpusa in the comments, you can manually insert comments into WordPress using something like:
```php
$data = array(
    'comment_post_ID' => 256,
    'comment_author' => 'Dave',
    'comment_author_email' => 'dave@example.com',
    'comment_author_url' => 'http://hal.example.com',
    'comment_content' => 'Lorem ipsum dolor sit amet...',
    'comment_author_IP' => '127.3.1.1',
    'comment_agent' => 'manual insertion',
    'comment_date' => date('Y-m-d H:i:s'),
    'comment_date_gmt' => date('Y-m-d H:i:s'),
    'comment_approved' => 1,
);
$comment_id = wp_insert_comment($data);
```
In PHP, you should be able to loop through a jskit XML file using something like:
```php
$comments = simplexml_load_file("/path/to/comments.xml");

function getJSKitAttribute($item, $key) {
    $attribute = $item->xpath('./jskit:attribute[@key="' . $key . '"]/@value');
    $attribute = $attribute[0];
    return $attribute;
}

foreach ($comments as $page) {
    if ($page->item) {
        $pageURL = $page->link;
        echo $pageURL, "\n";

        foreach ($page->item as $comment) {
            $date = $comment->pubDate;
            $text = $comment->description;
            $IP = getJSKitAttribute($comment, 'IP');
            echo "\t", substr($text, 0, 80), "\n";
            echo "\t\t", $date, "\n";
            echo "\t\t", $IP, "\n";
        }
        echo "\n";
    }
}
```
You could then fill out the $data array with the values of $date, $text, $IP, etc., or hard-code them to default values if they don’t exist. Do this in place of (or in addition to) the three “echo” lines.
```php
$data = array(
    // wp_insert_comment needs the numeric ID of the matching WordPress post here
    'comment_post_ID' => $comment->guid,
    'comment_author' => $comment->author,
    'comment_content' => $text,
    'comment_author_IP' => $IP,
    'comment_agent' => 'manual insertion',
    // WordPress expects a 'Y-m-d H:i:s' string, not a Unix timestamp
    'comment_date_gmt' => gmdate('Y-m-d H:i:s', strtotime((string) $date)),
    'comment_approved' => 1,
);
$comment_id = wp_insert_comment($data);
```
Hi, Can you provide a simple HTML code for installing in blogger template?
[Moved from other page:] Thanks, I am also considering switching from JSKit as it is closing on 01 Oct 12 and was looking for reliable feedback.
NseBse in India at 2:17 a.m. August 31st, 2012
Unfortunately, I don’t think blogger will accept Python code. But if you can find a friend with Mac OS X or Linux they should be able to run this script for you. You’ll need someone who knows Python to modify it to populate your blogger comments.
If you don’t want to do programming, and you want to use Disqus, it looks like they have an import page that you can use to import from JSKit to Disqus.
Jerry in San Diego at 10:21 a.m. August 31st, 2012
Does this work for prepping JSKit comments for import into a different blog using Wordpress's native comments system? I've got a large xml file of comments but can't seem to attach the comments to any of the matching posts which were transferred to WP.
dpusa at 3:35 p.m. October 27th, 2012
dpusa, if you’re a programmer and there’s an API for inserting comments into WordPress, then yes, this will work. But you’ll need to write the program yourself; since WordPress uses PHP, it may be easier to use PHP’s XML functionality.
Jerry Stratton in San Diego at 9 p.m. October 27th, 2012
Thanks, Jerry. I'm not a programmer, so I have to ask if the following is the kind of thing you're talking about:
$agent = $_SERVER['HTTP_USER_AGENT'];
$data = array(
'comment_post_ID' => 256,
'comment_author' => 'Dave',
'comment_author_email' => 'dave@domain.com',
'comment_author_url' => 'http://www.someiste.com',
'comment_content' => 'Lorem ipsum dolor sit amet...',
'comment_author_IP' => '127.3.1.1',
'comment_agent' => $agent,
'comment_date' => date('Y-m-d H:i:s'),
'comment_date_gmt' => date('Y-m-d H:i:s'),
'comment_approved' => 1,
);
$comment_id = wp_insert_comment($data);
dpusa at 11:12 a.m. October 28th, 2012
dpusa, yes, that’s the kind of thing. I’ve written a follow-up to this post showing how to parse JSKit XML using PHP. It might help you get started.
Jerry Stratton in San Diego at 10:45 a.m. October 29th, 2012
Wow! Thank you very much for this. Like I said, I'm no programmer, so I can only guess that you mean this will do something like a bulk replace of the jskit attributes with WP-format ones. If that's the case I might know somebody who'll be able to help me figure out how to actually run it.
dpusa at 2:32 p.m. October 29th, 2012