Django: fix_ampersands and abbreviations
I’ve been slowly converting all of my Django fields to use UTF-8 characters instead of named entities. The combination of using named entities, talking about code, and talking about D&D makes it difficult to know when the ampersand should be converted and when it shouldn’t.
For the most part the conversion to UTF-8 is going well, except that there’s no easy way to know when < and > need to be converted and when they don’t: sometimes I’m talking about HTML, and sometimes I’m using it. So for those two characters, I’ll need to continue using the &lt; and &gt; entities directly in my blog content.
It turns out, though, that the fix_ampersands filter handles the ampersand problem. The documentation makes it sound like fix_ampersands converts all ampersands, but in fact it uses a simple regular expression to exclude existing named entities and numeric character references.[1]
Unfortunately, this still leaves a few edge cases. Any ampersand abbreviation, such as Q&A, R&R, M&Ms, or R&D, runs the risk of triggering one of them whenever a semicolon happens to follow it. For me, this comes up mainly when listing role-playing games such as D&D; V&V; and T&T. The first two ampersands there look like named entities to fix_ampersands, because django/utils/html.py relies on a simple, fast regular expression:
- unencoded_ampersands_re = re.compile(r'&(?!(\w+|#\d+);)')
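To see the edge case without a Django install, here is a minimal sketch that applies the pattern above with nothing but Python’s re module. The helper name escape_ampersands is hypothetical; it only stands in for what the filter does with this pattern.

```python
import re

# The pattern quoted above from django/utils/html.py.
unencoded_ampersands_re = re.compile(r'&(?!(\w+|#\d+);)')


def escape_ampersands(text):
    """Hypothetical stand-in for the filter: escape every ampersand the pattern matches."""
    return unencoded_ampersands_re.sub('&amp;', text)


# Ordinary abbreviations are escaped as expected.
print(escape_ampersands('Q&A and R&D'))          # Q&amp;A and R&amp;D

# Existing named entities and numeric references are left alone.
print(escape_ampersands('&lt;b&gt; &#38;'))      # &lt;b&gt; &#38;

# "D&D;" and "V&V;" look like entities, so only the last ampersand is escaped.
print(escape_ampersands('D&D; V&V; and T&T.'))   # D&D; V&V; and T&amp;T.
```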
In my case, a simple change to the regex will handle the edge case examples above:
- unencoded_ampersands_re = re.compile(r'&(?!(\w{2,}|#\d+);)')
There are no one-character named entities, so this regular expression also escapes ampersands that merely look like single-character named entities. Rather than “\w+” it uses “\w{2,}”, so only names of two or more characters are left unescaped.[2]
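As a rough check of the changed pattern, this sketch runs it over a few sample strings using only Python’s re module. The names modified_re and escape_ampersands_v2 are made up for the illustration, and the sample strings are mine.

```python
import re

# The modified pattern: a name must be at least two characters long to count as an entity.
modified_re = re.compile(r'&(?!(\w{2,}|#\d+);)')


def escape_ampersands_v2(text):
    """Hypothetical helper: escape every ampersand the modified pattern matches."""
    return modified_re.sub('&amp;', text)


# One-character "entities" are now escaped.
print(escape_ampersands_v2('D&D; V&V; and T&T.'))     # D&amp;D; V&amp;V; and T&amp;T.

# Real entities and numeric references are still left alone.
print(escape_ampersands_v2('&lt;p&gt; &amp; &#38;'))  # &lt;p&gt; &amp; &#38;

# Anything with two or more characters before the semicolon still passes for an entity.
print(escape_ampersands_v2('F&SF; AT&SF;'))           # F&SF; AT&SF;
```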
This won’t fix the problem when the abbreviation looks like a two-character (or more) entity, but in my case the problem has only shown up for one-character abbreviations. In Python it’s possible to “fix” this without hacking the core code directly. At the end of settings.py, I added:
- # modify fix_ampersands to handle single-character abbreviations
- import re
- import django.utils.html
- django.utils.html.unencoded_ampersands_re = re.compile(r'&(?!(\w{2,}|#\d+);)')
This overwrites the regular expression used by django.utils.html with one that no longer treats a single character followed by a semicolon as an entity.
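Why does swapping the attribute work? My understanding is that the filter looks the pattern up by its module-level name each time it runs, so replacing that name changes the behavior. The sketch below shows the mechanism on a stand-in module rather than on Django itself: fake_html and its contents are hypothetical and built only for this demonstration.

```python
import re
import types

# Hypothetical stand-in for django.utils.html: a module holding a pattern
# and a function that reads the pattern from module globals when called.
SOURCE = r"""
import re
unencoded_ampersands_re = re.compile(r'&(?!(\w+|#\d+);)')

def fix_ampersands(value):
    return unencoded_ampersands_re.sub('&amp;', value)
"""
fake_html = types.ModuleType('fake_html')
exec(SOURCE, fake_html.__dict__)

before = fake_html.fix_ampersands('D&D; rules')

# The same kind of assignment as in settings.py, applied to the stand-in.
fake_html.unencoded_ampersands_re = re.compile(r'&(?!(\w{2,}|#\d+);)')

after = fake_html.fix_ampersands('D&D; rules')

print(before)  # D&D; rules
print(after)   # D&amp;D; rules
```

One caveat with this approach: code that copied the old pattern into its own namespace before the assignment (for example with a from-import) would keep using the old one.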
This is only a temporary fix. Once I finish converting every field so that the entities in them are switched to their UTF-8 equivalents, I will know that there are no entities in my content except for the angle brackets. At that point, I should be able to make a new filter to replace fix_ampersands that excludes nothing except the left and right angle brackets. Something like (and this is untested):
- re.compile(r'&(?!(lt|gt);)')
Because the only named entities that will be appearing are these two, I can exclude those directly and escape everything else.
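Here is a quick sketch of how that stricter pattern behaves, again using only Python’s re module. The helper name escape_all_but_angles is hypothetical, and this is not a finished Django filter.

```python
import re

# Only the two angle-bracket entities are exempt from escaping.
angle_only_re = re.compile(r'&(?!(lt|gt);)')


def escape_all_but_angles(text):
    """Hypothetical helper: escape every ampersand except those in &lt; and &gt;."""
    return angle_only_re.sub('&amp;', text)


# Abbreviations of any length are escaped; the angle-bracket entities survive.
print(escape_all_but_angles('D&D; and F&SF; use &lt;b&gt;'))
# D&amp;D; and F&amp;SF; use &lt;b&gt;

# Any other leftover entity gets double-escaped, which is why the content
# has to be fully converted before switching to this pattern.
print(escape_all_but_angles('caf&eacute; &amp; bar'))
# caf&amp;eacute; &amp;amp; bar
```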
[1] I’ve submitted a patch to the documentation to add this.
[2] I submitted this as a patch, but it was (reasonably) refused, because it does not address the core limitation of fix_ampersands. It will still fail for fantasy and science fiction fans and for railroad fans, who have abbreviations (F&SF, AT&SF) with two characters following the ampersand; I expect the military is filled with such abbreviations as well. Two-character entities do exist, such as &lt; and &gt;.
- fix_ampersands does not convert abbreviations followed by a semi-colon
- “In django/utils/html.py, unencoded_ampersands_re will not convert ampersands if they are followed by at least one alphabetical character and a semicolon. There are no named entities with only a single character, but abbreviations of that form are common in some circles: D&D and R&D for example.”
- Update fix_ampersands documentation for behavior with existing entities
- “fix_ampersands doesn’t replace all ampersands with &amp; entities; it attempts, usually successfully, to exclude named entities and numeric character references while converting all ampersands that do need replacing. This (and its behavior in some edge cases) should be documented.”
More Django
- Converting an existing Django model to Django-MPTT
- Using a SQL database to mimic a filesystem will, eventually, create bottlenecks when it comes to traversing the filesystem. One solution is modified preordered tree traversal, which saves the tree structure in an easily-used manner inside the model.
- Two search bookmarklets for Django
- Bookmarklets—JavaScript code in a bookmark—can make working with big Django databases much easier.
- Fixing Django’s feed generator without hacking Django
- It looks like it’s going to be a while before the RSS feed generator in Django is going to get fixed, so I looked into subclassing as a way of getting a working guid in my Django RSS feeds.
- ModelForms and FormViews
- This is just a notice because when I did a search, nothing came up. Don’t use ModelForm with FormView, use UpdateView instead.
- Custom managers for Django ForeignKeys
- I’ve got one really annoying model for keywords. There’s one category of keywords that, by default, should not show up when used as a ForeignKey for most models. Key word: most.