Nisus “clean HTML” macro
I’ve recently switched to Nisus Writer Pro for most of my writing. For me, it’s a lot easier to work with than Microsoft Word, especially when it comes to (a) writing and (b) maintaining styles.
One thing that Word arguably does better than Nisus is create HTML from its documents. While Word fills up its HTML-ified documents with a lot of extra crap, it at least creates somewhat structured documents. If you can throw out the crap, a Word-created HTML file is fairly reasonable. Nisus turns pretty much everything into paragraphs, even headlines. Like Word, it tries to recreate the print formatting of the document in HTML styles. Unlike Word, it uses arbitrary class names instead of reusing the document’s own style names in the HTML. So not only is the HTML unstructured, it is also difficult to modify the layout.
I prefer to use different layouts on web pages than I use for print documents. The web is not print, and layouts that work in print are often wildly inappropriate for web browsing. What I want from Nisus is for it to keep headlines as Hx tags and to create predictable style classes so that I can optimize them for web viewing.
One thing Nisus has that leaves Word in the dust, however, is a serious scripting language. Nisus has both a simple macro language and an advanced scripting language built in. The advanced language is Perl, and Nisus macros can move in and out of Perl easily as needed.
- $currentParagraph = Read Selection
- Begin Perl
-     chomp($currentParagraph);
-     $currentParagraph = reverse($currentParagraph);
- End
- Type Text $currentParagraph
This script grabs the current selection from Nisus, and then switches into Perl to work on the selection (reversing the text). Finally, it comes back out of Perl and, back in Nisus, types the now-reversed text.
It turns out to be not particularly difficult to write a Nisus “macro” that will write an entire document to structured HTML, retaining tables and images. The only thing it loses is character-based styles. I don’t use them much, and hopefully, a future version of Nisus will obsolete this script so that I won’t have to worry about them.
The trick is in the RTF
Nisus can grab your selections in two formats: a basically plain-text format that can be written out as paragraphs, and the RTF of the same selection, which contains all of the style information for that selection. RTF is not easy to read, but in this case it’s a little simpler because we don’t have to look at the RTF for an entire document; we can deal with it a paragraph at a time.
The script can loop through every paragraph, and use the “standard” version of the paragraph for writing to the HTML while using the RTF version to get the style name and any embedded images.
For example, here is a quick script that extracts every image in your document, and writes them to a folder called “images”.
- #extract all images in document to “images” folder
- $currentFolder = Document Property "enclosing folder path"
- $imageFolder = "$currentFolder/images"
- If File Exists $imageFolder
-     Prompt "Re-use existing folder?", "An images folder already exists. Cancel this script or overwrite existing items in that folder?", "Overwrite"
- Else
-     #ensure that folder exists
-     Begin Perl
-         mkdir($imageFolder);
-     End
- End
- #get to just in front of the first image
- Select Image 1
- Select Start
- $imageCount = 0
- While Select Next Image
-     $imageCount += 1
-     $currentImage = Read Selection
-     $imageRTF = Encode RTF $currentImage
-     Begin Perl
-         $imageRTF =~ /\\pngblip ([^}]+)/;
-         $image = pack("H*", $1);
-         $imageFileName = "image_$imageCount.png";
-         $imageFilePath = "$imageFolder/$imageFileName";
-         if (open $imageHandle, ">", $imageFilePath) {
-             print $imageHandle $image;
-             close($imageHandle);
-         }
-     End
- End
From “Begin Perl” to “End” is all Perl; the rest is Nisus. Variables created in Nisus can be used—and changed—by Perl, as $imageCount and $imageFolder are here. Variables created in Perl are not available in Nisus.
This script:
- Gets the folder where the current document lives;
- Gives a warning if there is already an “images” folder in the current folder, or creates the folder if it doesn’t exist;
- Moves the selection to just in front of the first image;
- Selects each image in turn and:
  - Adds one to the image count; this will be used for the filename;
  - Gets the selected image;
  - Converts the image to RTF;
  - Grabs the PNG from the RTF in Perl;
  - Converts the PNG back to binary in Perl;
  - Creates the filename as “image_” and the image number in Perl;
  - Writes the binary data to that filename in Perl.
It isn’t necessary to know how to read RTF, just to look for the specific code you need. In this case, Nisus stores images in RTF as a “pngblip” (a PNG image): the keyword is followed by a space and then the image data as hexadecimal text, which ends at the next curly bracket.
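If you want to try the decoding step outside of Nisus, here is a minimal standalone Perl sketch of it. The subroutine name and the sample RTF fragment are made up for illustration (the hex is only the eight-byte PNG signature, not a real image), and it assumes, as the macro does, that the hex data directly follows the pngblip keyword. It also strips any stray non-hex characters before decoding, which the macro does not do.

```perl
use strict;
use warnings;

# Hypothetical helper: return the binary data for every \pngblip
# group found in an RTF string.
sub extract_png_blips {
    my ($rtf) = @_;
    my @images;
    while ($rtf =~ /\\pngblip\s+([^}]+)/g) {
        my $hex = $1;
        # Assumption: anything that is not a hex digit (such as a line
        # break inside the data) can be dropped safely.
        $hex =~ s/[^0-9A-Fa-f]//g;
        # Turn pairs of hex digits back into bytes.
        push @images, pack("H*", $hex);
    }
    return @images;
}

# Made-up RTF fragment containing only the PNG file signature.
my $sample = '{\pict\pngblip 89504e470d0a1a0a}';
my @images = extract_png_blips($sample);
printf "%d image(s), first is %d bytes\n", scalar(@images), length($images[0]);
```

The pack("H*", ...) call is the same one the macro uses; everything else is just scaffolding so that the step can be run on its own.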
You might find this Nisus macro useful for inspecting a selection’s RTF:
- #convert the current selection to RTF and put in new document
- $currentParagraph = Read Selection
- $currentRTF = Encode RTF $currentParagraph
- New
- View:Draft View
- Insert Text $currentRTF
Being able to extract images is a big step towards being able to create clean HTML from Nisus documents.
Clean HTML
Here’s the script I currently have. Except for character-level styles, it works with everything I currently throw at it. There are some things it won’t work with, such as multiple levels of lists. I don’t use those in any of the documents I used for testing, so I haven’t added that functionality in (I’m not sure how easy it would even be).
The script keeps some common functionality in a separate file that I called “nisus.nwm”; more on that below.
- $pageName = Document Property "file name without extension"
- $title = $pageName
- Begin Perl
-     require "/Users/USER/bin/nisus.nwm";
-     $title = cleanText($title);
-     $pageName = slugify($pageName);
- End
- $currentFolder = Document Property "enclosing folder path"
- $htmlPage = "$currentFolder/$pageName.html"
- If File Exists $htmlPage
-     Prompt "Erase existing file?", "The file $htmlPage already exists. Cancel this script or overwrite the existing file?", "Overwrite"
- End
- Write to File "<html>\n", $htmlPage
- Append to File "\t<head>\n", $htmlPage
- Append to File "\t\t<title>$title</title>\n", $htmlPage
- Append to File "\t\t<link href=\"$pageName.css\" rel=\"StyleSheet\" media=\"all\" />\n", $htmlPage
- Append to File "\t</head>\n", $htmlPage
- Select Paragraph 1
- Select Start
- Append to File "\t<body>\n", $htmlPage
- $tabs = "\t\t"
- $inList = false
- $previousRow = 0
- $previousColumn = 0
- $listType = ""
- $imageCount = 0
- While Select Next Paragraph
-     $currentParagraph = Read Selection
-     $row = Selection Row Index
-     $column = Selection Column Index
-     $currentRTF = Encode RTF $currentParagraph
-     $precedingTag = ""
-     Begin Perl
-         require "/Users/USER/bin/nisus.nwm";
-         $currentParagraph = cleanText($currentParagraph);
-         @precedingTags = ();
-         ($tag, $style) = parseParagraph($currentRTF);
-         #deal with ending lists
-         if ($inList && !isList($currentParagraph)) {
-             $inList = 0;
-             chop $tabs;
-             $precedingTags[$#precedingTags+1] = "$tabs</$listType>";
-         }
-         #handle tables
-         $needCell = 0;
-         if ($row > $previousRow) {
-             #new row
-             if ($row == 1) {
-                 #new table
-                 $precedingTags[$#precedingTags+1] = "$tabs<table>";
-                 $tabs .= "\t";
-             } else {
-                 #new row of existing table
-                 chop $tabs;
-                 $precedingTags[$#precedingTags+1] = "$tabs</td>";
-                 chop $tabs;
-                 $precedingTags[$#precedingTags+1] = "$tabs</tr>";
-             }
-             $precedingTags[$#precedingTags+1] = "$tabs<tr>";
-             $tabs .= "\t";
-             $needCell = 1;
-             $previousColumn = 0;
-         } elsif ($previousRow > $row) {
-             #end of table
-             chop $tabs;
-             $precedingTags[$#precedingTags+1] = "$tabs</td>";
-             chop $tabs;
-             $precedingTags[$#precedingTags+1] = "$tabs</tr>";
-             chop $tabs;
-             $precedingTags[$#precedingTags+1] = "$tabs</table>";
-         } elsif ($column > $previousColumn) {
-             #new column in existing row
-             chop $tabs;
-             $precedingTags[$#precedingTags+1] = "$tabs</td>";
-             $needCell = 1;
-         }
-         #need to open any cell(s)?
-         if ($needCell) {
-             #handle any empty cells before this one
-             $columnCount = $column-$previousColumn;
-             while ($columnCount>1) {
-                 $precedingTags[$#precedingTags+1] = "$tabs<td></td>";
-                 $columnCount--;
-             }
-             $precedingTags[$#precedingTags+1] = "$tabs<td>";
-             $tabs .= "\t";
-         }
-         #handle new lists
-         if ($tag eq "p") {
-             if (($newParagraph, $newList) = isList($currentParagraph)) {
-                 $currentParagraph = $newParagraph;
-                 $listType = $newList;
-                 $tag = "li";
-                 if (!$inList) {
-                     $inList = 1;
-                     $precedingTags[$#precedingTags+1] = "$tabs<$listType>";
-                     $tabs .= "\t";
-                 }
-             }
-         }
-         #is there an image here?
-         $imageFile = "";
-         while ($currentRTF =~ /\\pngblip ([^}]+)/) {
-             $image = pack("H*", $1);
-             $currentRTF =~ s/\\pngblip [^}]+//;
-             if ($image ne $oldImage) {
-                 $imageCount++;
-                 $imageFolder = "$currentFolder/images";
-                 mkdir($imageFolder);
-                 $imageFileName = "$imageCount.png";
-                 $imageFilePath = "$imageFolder/$imageFileName";
-                 if (open $imageHandle, ">", $imageFilePath) {
-                     print $imageHandle $image;
-                     close($imageHandle);
-                 }
-                 if (($style ne "image") && $currentParagraph) {
-                     if ($imageCount % 2 == 1) {
-                         $order = "odd";
-                     } else {
-                         $order = "even";
-                     }
-                     $currentParagraph = "<img class=\"inline $order\" src=\"images/$imageFileName\" />" . $currentParagraph;
-                 } else {
-                     if (!$style) {
-                         $style = "image";
-                     }
-                     $currentParagraph .= "<img src=\"images/$imageFileName\" />";
-                 }
-             }
-             $oldImage = $image;
-         }
-         #create tag and class
-         if ($style) {
-             $style = slugify($style);
-             $startTag = "<$tag class=\"$style\">";
-         } else {
-             $startTag = "<$tag>";
-         }
-         $endTag = "</$tag>";
-         if ($currentParagraph) {
-             $currentParagraph = "$tabs$startTag$currentParagraph$endTag";
-         }
-         $precedingTag = join("\n", @precedingTags);
-     End
-     If $precedingTag
-         Append to File "$precedingTag\n", $htmlPage
-     End
-     If $currentParagraph
-         Append to File "$currentParagraph\n", $htmlPage
-     End
-     $previousRow = $row
-     $previousColumn = $column
- End
- $closer = false
- Begin Perl
-     #check for open items
-     @closers = ();
-     #lists
-     if ($inList) {
-         chop $tabs;
-         $closers[$#closers+1] = "$tabs</$listType>";
-     }
-     #tables
-     if ($previousRow) {
-         chop $tabs;
-         $closers[$#closers+1] = "$tabs</td>";
-         chop $tabs;
-         $closers[$#closers+1] = "$tabs</tr>";
-         chop $tabs;
-         $closers[$#closers+1] = "$tabs</table>";
-     }
-     $closer = join("\n", @closers);
- End
- If $closer
-     Append to File "$closer\n", $htmlPage
- End
- Append to File "\t</body>\n", $htmlPage
- Append to File "</html>", $htmlPage
That looks fairly complex, but what it does is easy enough to grasp: it goes through each paragraph and writes it out to an HTML file. It uses the paragraph text to get the content, and it uses the paragraph RTF to get class names for styles, as well as heading levels.
Along the way it checks for lists and tables to make sure that those get converted correctly.
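The list handling is easier to follow in isolation. Here is a minimal standalone sketch of the idea, run on a plain array of paragraphs instead of a Nisus document. The subroutine names are hypothetical; the marker patterns are the same ones the macro looks for (a number, period and tab, or a bullet and tab). Unlike the macro, this sketch also closes and reopens the list when the list type changes, and it ignores tables and tab indentation between list levels.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: return (text, list type) for a list paragraph,
# or an empty list for anything else.
sub list_item {
    my ($text) = @_;
    return ($text, 'ol') if $text =~ s/^[0-9]+\.\t//;
    return ($text, 'ul') if $text =~ s/^•\t//;
    return ();
}

# Wrap runs of list paragraphs in <ol>/<ul>; everything else becomes <p>.
sub paragraphs_to_html {
    my @paragraphs = @_;
    my @html;
    my $open = '';    # type of the currently open list, if any

    for my $paragraph (@paragraphs) {
        my ($text, $type) = list_item($paragraph);
        if (defined $type) {
            if ($open ne $type) {
                # close a list of the other type before opening this one
                push @html, "</$open>" if $open;
                push @html, "<$type>";
                $open = $type;
            }
            push @html, "\t<li>$text</li>";
        } else {
            if ($open) {
                push @html, "</$open>";
                $open = '';
            }
            push @html, "<p>$paragraph</p>";
        }
    }
    # a list that runs to the end of the document still needs closing
    push @html, "</$open>" if $open;
    return @html;
}

print "$_\n" for paragraphs_to_html("Intro", "1.\tFirst", "2.\tSecond", "Outro");
```

The final check after the loop plays the same role as the “check for open items” block at the end of the macro: a list that is still open when the paragraphs run out has to be closed by hand.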
Support file
The above script requires a support file in two places. You’ll need to store this file on your system and put your own path wherever it gets required:
- #Nisus subroutines
- #edit in Nisus for encoding
- use utf8;
- sub cleanText {
-     my($text) = shift;
-     chomp $text;
-     $text =~ s/[ ]+$//;
-     $text =~ s/&/&amp;/g;
-     $text =~ s/‘/&lsquo;/g;
-     $text =~ s/’/&rsquo;/g;
-     $text =~ s/“/&ldquo;/g;
-     $text =~ s/”/&rdquo;/g;
-     $text =~ s/…/&hellip;/g;
-     $text =~ s/—/&mdash;/g;
-     $text =~ s/©/&copy;/g;
-     $text =~ s/ë/&euml;/g;
-     $text =~ s/é/&eacute;/g;
-     $text =~ s/\x{FFFC}//g;
-     $text =~ s/\x{2028}/<br \/>\n/g;
-     $text =~ s/\x{0C}//g;
-     $text =~ s/\x{0A}//g;
-     $text =~ s/^ +//;
-     $text =~ s/ +$//;
-     return $text;
- }
- sub slugify {
-     my($text) = shift;
-     $text = cleanText($text);
-     $text =~ s/\&[a-z]+\;//g;
-     $text =~ s/ /_/g;
-     $text = lc($text);
-     return $text;
- }
- #decide current tag
- sub parseParagraph {
-     my($RTF) = shift;
-     my($tag) = "p";
-     my($style) = "";
-     #is this a heading?
-     if ($RTF =~ /\\tcl([0-9]) /) {
-         my($headingLevel) = $1;
-         if ($headingLevel) {
-             $tag = "h$headingLevel";
-         }
-     }
-     #get the style if one exists
-     #be careful not to get other info such as font name
-     if ($RTF =~ /\\tcl[0-9] ([^;\\]+);}/) {
-         $style = $1;
-     } elsif ($RTF =~ /\in0 ([a-z0-9 ]+);}/i) {
-         $style = $1;
-     } elsif ($RTF =~ /[0-9]+ ([a-z0-9 ]+);}/i) {
-         $style = $1;
-     }
-     #some styles are just the defaults, and don't need classes
-     if ($style eq "Normal") {
-         $style = "";
-     }
-     if ($style =~ /^Heading [0-9]$/) {
-         $style = "";
-     }
-     return ($tag, $style);
- }
- sub isList {
-     my($text) = shift;
-     if ($text =~ s/^[0-9]+\.\t//) {
-         return ($text, 'ol');
-     } elsif ($text =~ s/^•\t//) {
-         return ($text, 'ul');
-     } else {
-         return ();
-     }
- }
- 1;
There’s probably a better way in Perl to convert diacriticals and other special characters to their HTML entities. Notice that the top of the file tells Perl to “use utf8”; that’s what Nisus uses when it sends text to your script. It tells Perl this automatically in the scripts it calls directly, but you need to specify it in any required files.
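One possible “better way”, sketched here with core Perl only, is to stop listing characters one at a time and turn every non-ASCII character into a numeric character reference. The subroutine name is hypothetical, and this is not what the support file does: it produces numeric references such as &#233; where cleanText produces named entities such as &eacute;, and it also escapes angle brackets, which cleanText leaves alone.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical replacement for the per-character substitutions:
# escape the HTML-significant characters first, then convert every
# remaining non-ASCII character to a numeric character reference.
sub encode_entities_numeric {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must come first, or later entities get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/([^\x00-\x7F])/sprintf("&#%d;", ord($1))/ge;
    return $text;
}

print encode_entities_numeric("“Café” & more…"), "\n";
```

Two caveats if you swap this in. Any markup that cleanText adds itself, such as the line-break tag, would have to be added after this step so that it isn’t escaped. And slugify strips entities with a pattern that only matches named ones, so it would need adjusting to strip numeric references as well. If you have the CPAN module HTML::Entities installed, its encode_entities function is another option, though I have not tried it inside a Nisus macro.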
Wish list
Obviously it’d be a whole lot nicer (and likely more reliable) if I could get the style names directly from Nisus instead of having to parse them out of the RTF.
Less obviously, named sections would be useful. It would then be possible to put a DIV around each section with a predictable name, for applying styles to that section.
And if there were an Encode HTML equivalent to Encode RTF, this might allow me to maintain character-level styles.
Of course, the ultimate wish is for Nisus’s save as HTML to do all this automatically.
Note that Nisus also has what looks to be a very useful AppleScript dictionary.
And one final caveat: I don’t know RTF; that this works for me is no guarantee that it will work for you. It probably won’t.