Auto-closing HTML tags in comments
I was just on Watts Up With That? and saw Anthony Watts’s biggest pet peeve: unclosed italics. His blog uses WordPress, and WordPress uses PHP. As it turns out, I’ve been working on integrated comments for my own blog (they are working now on the Biblyon Broadsheet), and I tried to deal with this potential issue there.
The way I did it was to use PHP’s built-in XML functionality. PHP’s XML objects can take in HTML and write out XML—with all tags fully closed.
Here is a stripped-down version of the method I use to do this, along with a simple test case:
    <?php
    function fixHTML($comment) {
        $xml = new DOMDocument('1.0');

        //loadHTML warns about malformed markup, which is exactly what we expect to receive, so suppress it
        if (@$xml->loadHTML($comment)) {
            //pull just the body out and save it
            $body = $xml->getElementsByTagName('body');
            $body = $body->item(0);
            $xml = $xml->saveXML($body);
            //DOMDocument appears to not use utf8 as its default
            //(note: utf8_decode is deprecated as of PHP 8.2)
            $xml = utf8_decode($xml);
            //strip out the <body></body> tag
            $xml = substr($xml, 6, -7);
            return $xml;
        } else {
            return false;
        }
    }

    $testComment = 'I think your blog is the <i>greatest!';
    echo fixHTML($testComment);
    ?>
As you can see, the test comment has an unclosed italic tag. Once run through this function, that HTML becomes:
    <p>I think your blog is the <i>greatest!</i></p>
Presumably, you will already have used PHP’s strip_tags function to remove all tags except for the ones you want to allow. If not, you can add a strip_tags as the first line of this function.
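As a minimal sketch of that first step, here is strip_tags with a whitelist. The list of allowed tags and the sample comment are assumptions for illustration; use whatever tags you want to permit.

```php
<?php
// Hypothetical whitelist; adjust to the tags you actually want to allow.
$allowedTags = '<a><b><i><em><strong><blockquote>';

$comment = 'Nice <script>alert("hi")</script><i>post';

// strip_tags removes every tag that is not in the whitelist.
$stripped = strip_tags($comment, $allowedTags);

echo $stripped, "\n"; // Nice alert("hi")<i>post
```

Note that strip_tags removes the disallowed tags but keeps the text that was inside them, and it leaves the attributes on allowed tags alone.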
One of the other issues with allowing HTML in your comments, however, is attributes. The most common attribute is the “href” attribute on the “a” tag for making links. If you want to strip all attributes except the href, you can do that, too:
    $xml = new DOMDocument('1.0');

    if (@$xml->loadHTML($comment)) {
        //remove all attributes except href
        $xpath = new DOMXPath($xml);
        $attributeBearingNodes = $xpath->query('//*[@*]');

        foreach ($attributeBearingNodes as $node) {
            //loop over a copy: removing from the live attribute list while iterating can skip attributes
            foreach (iterator_to_array($node->attributes) as $attributeName=>$attributeNode) {
                if (!($node->tagName == 'a' && $attributeName == 'href')) {
                    $node->removeAttribute($attributeName);
                }
            }
        }
        //the rest of fixHTML continues as before
This finds every tag that has an attribute on it, loops through them, and then loops through the attributes on those tags. If the attribute is not an “href” on an “a” tag, the attribute is removed.
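The steps above can be put together into one self-contained function. This is only a sketch: the name stripAttributes and the sample comment are made up for illustration, and it loops over a copy of each attribute list so that removing an attribute does not disturb the loop.

```php
<?php
// Hypothetical helper name; the DOM calls are the same ones described above.
function stripAttributes($comment) {
    $xml = new DOMDocument('1.0');
    if (!@$xml->loadHTML($comment)) {
        return false;
    }

    // Find every element that carries at least one attribute.
    $xpath = new DOMXPath($xml);
    foreach ($xpath->query('//*[@*]') as $node) {
        // Copy the attribute list first; removing from the live list
        // while iterating over it can skip attributes.
        foreach (iterator_to_array($node->attributes) as $attributeNode) {
            $attributeName = $attributeNode->nodeName;
            if (!($node->tagName == 'a' && $attributeName == 'href')) {
                $node->removeAttribute($attributeName);
            }
        }
    }

    // Save only the body, then strip the surrounding <body></body> tag.
    $body = $xml->getElementsByTagName('body')->item(0);
    return substr($xml->saveXML($body), 6, -7);
}

$cleaned = stripAttributes('<a href="http://example.com/" onclick="x()" class="y">link</a>');
echo $cleaned, "\n";
```

The link keeps its href, while the onclick and class attributes are gone.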
That doesn’t necessarily fix everything, though. The “href” attribute itself can contain JavaScript that runs directly from your page.
    //loop over a copy: removing from the live attribute list while iterating can skip attributes
    foreach (iterator_to_array($node->attributes) as $attributeName=>$attributeNode) {
        if (!($node->tagName == 'a' && $attributeName == 'href')) {
            $node->removeAttribute($attributeName);
        } else {
            //this is an HREF on an A tag, but we still want to avoid running javascript directly on the page
            $link = $attributeNode->value;
            $link = strtolower(trim($link));

            if (strpos($link, 'javascript') === 0) {
                $attributeNode->value = 'http://example.com/prettykittens';
            }
        }
    }
This will check every “a” tag’s “href” to make sure it doesn’t start with the word “javascript”. In the example, it replaces the offending link with a link to pretty kittens. In practice, it might be more appropriate to return an error at that point. Depending on the audience for your blog, you might also decide to play it extremely safe and reverse the logic: rather than only getting rid of “href” attributes that start with “javascript”, get rid of all of them that don’t start with “http”.
    if (strpos($link, 'http') !== 0) {
Doing it that way will also bar ftp links, email links, and other less commonly-used links, but you don’t see too many of those around any more.
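Here is a small sketch of that play-it-safe check pulled out into its own function. The name isSafeLink is hypothetical, and it is slightly stricter than the one-line test: it requires the full “http://” or “https://” prefix rather than just the letters “http”.

```php
<?php
// Hypothetical helper: accept only links that start with http:// or https://.
function isSafeLink($link) {
    // Normalize the same way as before: trim whitespace, ignore case.
    $link = strtolower(trim($link));
    return strpos($link, 'http://') === 0 || strpos($link, 'https://') === 0;
}

var_dump(isSafeLink('https://example.com/'));       // bool(true)
var_dump(isSafeLink(' JavaScript:alert(1)'));       // bool(false)
var_dump(isSafeLink('mailto:someone@example.com')); // bool(false)
```

Allowing only known-good prefixes means you don’t have to anticipate every way a “javascript” link might be disguised.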
- Comment on the Broadsheet
- I’ve enabled comments on recent articles on this blog. A long time ago I used Haloscan, but that didn’t work out very well and eventually they were bought by JS-Kit. I use JS-Kit for the main blog, but they started charging sometime afterward, and while the hoboes.com address is grandfathered in, the godsmonsters.com address was not.
- My biggest pet peeve on running this blog: Anthony Watts at Watts Up With That?
- “PLEASE be careful when trying to bold, italicize, link, or blockquote in comments. Just one transposed character is all it takes. Also, there’s no need to try to hyperlink URL’s, WordPress will automatically hyperlink any URL you type in like this.”
This solution works best when applied as the comment is stored, so that you don’t have to run this code every time a comment is displayed.