Django: fix_ampersands and abbreviations

Jerry Stratton, May 22, 2011

I’ve been slowly converting all of my Django fields to use UTF8 instead of named entities. The combination of using named entities, talking about code, and talking about D&D makes it difficult to know when the ampersand should be converted and when it shouldn’t.

For the most part the conversion to UTF8 is going well, except that there’s no easy way to know when < and > need to be converted and when they don’t: sometimes I’m talking about HTML, and sometimes I’m using it. So for those two characters, I’ll need to continue using the named entities (&lt; and &gt;) directly in my blog content.

It turns out, though, that the fix_ampersands filter handles this. The documentation makes it sound like fix_ampersands converts all ampersands, but in fact it uses a simple regular expression to exclude existing named entities and numeric character references.¹

Unfortunately, this still leaves a few edge cases. Any use of ampersand abbreviations, such as Q&A, R&R, M&Ms, or R&D, runs the risk of triggering one of them. For me, this comes up mainly when talking about role-playing games such as D&D; V&V; and T&T. Those first two ampersands look like named entities to fix_ampersands, because each is followed by a letter and a semicolon, and django/utils/html.py uses the fast expedient of a very simple regular expression:

unencoded_ampersands_re = re.compile(r'&(?!(\w+|#\d+);)')
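To see the edge case concretely, here is a minimal sketch that applies the same pattern with plain re, outside Django. The fix helper and the sample strings are my own illustration and not part of Django; the real filter may do more than this one substitution.

```python
import re

# Same pattern as the one quoted from django/utils/html.py above.
unencoded_ampersands_re = re.compile(r'&(?!(\w+|#\d+);)')

def fix(text):
    """Hypothetical stand-in for fix_ampersands: escape every ampersand the pattern matches."""
    return unencoded_ampersands_re.sub('&amp;', text)

samples = ["Q&A", "&lt;b&gt;", "&#38;", "D&D; V&V; and T&T."]
results = [fix(s) for s in samples]

for before, after in zip(samples, results):
    print(repr(before), "->", repr(after))
```

The plain abbreviation and the real entities come out as expected, but in the last sample only the ampersand in T&T is escaped: D&D; and V&V; are each followed by a letter and a semicolon, so the lookahead takes them for entities.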

In my case, a simple change to the regex will handle the edge case examples above:

unencoded_ampersands_re = re.compile(r'&(?!(\w{2,}|#\d+);)')

There are no one-character named entities, so this regular expression treats anything that looks like a single-character named entity as an ampersand to fix. Rather than “\w+” it uses “\w{2,}”, so that only names of two or more characters are excluded from replacement.²
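A quick check of the modified pattern, again as a sketch with plain re rather than Django itself. The helper name fix_patched and the sample strings are assumptions for illustration only.

```python
import re

# The modified pattern: only names of two or more characters count as entities.
patched_re = re.compile(r'&(?!(\w{2,}|#\d+);)')

def fix_patched(text):
    """Hypothetical helper: escape every ampersand the modified pattern matches."""
    return patched_re.sub('&amp;', text)

# Single-character abbreviations followed by a semicolon are now escaped.
single = fix_patched("D&D; V&V; and T&T.")

# Real named entities and numeric character references are still left alone.
entities = fix_patched("&lt;em&gt; &amp; &#38;")

# An abbreviation with two characters after the ampersand still looks like an entity.
longer = fix_patched("F&SF; and AT&SF;")

print(single)
print(entities)
print(longer)
```

The third case is the remaining gap: with two or more word characters before the semicolon, the pattern cannot tell an abbreviation from an entity.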

This won’t fix the problem when the abbreviation looks like a two-character (or more) entity, but in my case the problem has only shown up for one-character abbreviations. In Python it’s possible to “fix” this without hacking the core code directly. At the end of settings.py, I added:

# modify fix_ampersands to handle single-character abbreviations
import re
import django.utils.html
django.utils.html.unencoded_ampersands_re = re.compile(r'&(?!(\w{2,}|#\d+);)')

This overwrites the regular expression used by django.utils.html with one that no longer treats single-character names as entities, so those ampersands get escaped like any other.
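Why replacing a module attribute is enough can be sketched without Django installed. The fake_html module below is a hypothetical stand-in for django.utils.html, and it assumes that fix_ampersands looks up the module-level pattern by name each time it is called; if the real function captured the pattern some other way, the override would have no effect.

```python
import re
import types

# Hypothetical stand-in for django.utils.html, built in memory so the sketch
# runs without Django. The function reads the module-level pattern at call time.
fake_html = types.ModuleType("fake_html")
exec(
    "import re\n"
    "unencoded_ampersands_re = re.compile(r'&(?!(\\w+|#\\d+);)')\n"
    "def fix_ampersands(value):\n"
    "    return unencoded_ampersands_re.sub('&amp;', value)\n",
    fake_html.__dict__,
)

# Before the override, D&D; is mistaken for an entity and left alone.
before = fake_html.fix_ampersands("D&D; rules")

# The same kind of assignment as in settings.py, applied to the stand-in module.
fake_html.unencoded_ampersands_re = re.compile(r'&(?!(\w{2,}|#\d+);)')

# The function body is unchanged, but it now finds the new pattern.
after = fake_html.fix_ampersands("D&D; rules")

print(before)
print(after)
```

The override has to run before any content is filtered, which is presumably why the end of settings.py is a convenient place for it.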

This is only a temporary fix. Once I finish converting every field so that entities appearing in them are switched to their UTF8 equivalents, I will know that there are no entities in my content except for the angle brackets. At that point, I should be able to make a new filter to replace fix_ampersands that excludes nothing except the left and right angle brackets. Something like (and this is untested):

re.compile(r'&(?!(lt|gt);)')

Because the only named entities that will be appearing are these two, I can exclude those directly and escape everything else.
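Here is a small sketch that exercises the angle-bracket-only pattern with plain re. The helper name escape_all_but_angles is hypothetical, and this only checks the regular expression, not a finished Django filter.

```python
import re

# The pattern from the text: leave only &lt; and &gt; alone.
angle_only_re = re.compile(r'&(?!(lt|gt);)')

def escape_all_but_angles(text):
    """Hypothetical helper: escape every ampersand that does not start &lt; or &gt;."""
    return angle_only_re.sub('&amp;', text)

# Abbreviations of any length are escaped; the two angle-bracket entities survive.
checked = escape_all_but_angles("D&D; AT&SF; Q&A and &lt;em&gt;")

# Any other entity left in the content gets escaped a second time.
doubled = escape_all_but_angles("&amp;")

print(checked)
print(doubled)
```

The second case shows why this approach depends on the conversion being finished: an entity other than &lt; or &gt; that is still in the content would be double-escaped.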

¹ I’ve submitted a patch to the documentation to add this.
² I submitted this as a patch, but it was (reasonably) refused, as this does not address the core limitation of fix_ampersands: it will still fail for fantasy and science fiction fans and railroad fans, who have abbreviations (F&SF, AT&SF) with two characters following the ampersand; I expect the military is filled with such abbreviations as well. Two-character entities do exist, such as &lt; and &gt;.

fix_ampersands does not convert abbreviations followed by a semi-colon: “In django/utils/html.py, unencoded_ampersands_re will not convert ampersands if they are followed by at least one alphabetical character and a semicolon. There are no named entities with only a single character, but abbreviations of that form are common in some circles: D&D and R&D for example.”
Update fix_ampersands documentation for behavior with existing entities: “fix_ampersands doesn’t replace all ampersands with &amp; entities; it attempts, usually successfully, to exclude named entities and numeric character references while converting all ampersands that do need replacing. This (and its behavior in some edge cases) should be documented.”




Contents of Negative Space™ as a whole Copyright © 1994-2025 Jerry Stratton. Individual copyrights remain held by their respective authors unless they specify otherwise. Site titles, such as Negative Space, Strange Bedfellows, Biblyon Broadsheet, Highland Games, and FireBlade Coffeehouse are trademarks of Jerry Stratton.

Code and code snippets, to the extent that they are copyrightable, may be re-distributed under the terms of the GNU General Public License 3.

Django: fix_ampersands and abbreviations last modified May 22nd, 2011.
