Thursday, February 21, 2008

Word file formats: the Nisus achievement and a gentle wish for .DOC

I use Nisus Writer Express for OS X, and one of these days I'll probably upgrade to Nisus Writer Pro. This high-quality product has many fine features, including the fact that it doesn't emulate Microsoft Word. The key features for me [1], however, are that:

  • Nisus uses RTF as a native file format
  • Nisus can, optionally, use Word .DOC as its native file format, and it does a pretty reasonable job of editing existing Word files without messing them up too much.

I've learned many times over many years that data mobility is a critical requirement of my digital world [2]. In 2008 RTF is the closest thing we have to a mobile word processing file format, .DOC is next, and the OASIS OpenDocument format is a very distant third.

Recently Microsoft released the specification for Word's .DOC binary file format [3]. Joel Spolsky tells us a bit about that format:

Why are the Microsoft Office file formats so complicated? (And some workarounds) - Joel on Software

...The assumption, and a fairly reasonable one at the time, was that the Word file format only had to be read and written by Word. That means that whenever a programmer on the Word team had to make a decision about how to change the file format, the only thing they cared about was (a) what was fast and (b) what took the fewest lines of code in the Word code base. The idea of things like SGML and HTML—interchangeable, standardized file formats—didn’t really take hold until the Internet made it practical to interchange documents in the first place [jg - 4]; this was a decade later than the Office binary formats were first invented. There was always an assumption that you could use importers and exporters to exchange documents. In fact Word does have a format designed for easy interchange, called RTF, which has been there almost since the beginning. It’s still 100% supported.

...Every checkbox, every formatting option, and every feature in Microsoft Office has to be represented in file formats somewhere. That checkbox in Word’s paragraph menu called “Keep With Next” that causes a paragraph to be moved to the next page if necessary so that it’s on the same page as the paragraph after it? That has to be in the file format. And that means if you want to implement a perfect Word clone than can correctly read Word documents, you have to implement that feature. If you’re creating a competitive word processor that has to load Word documents, it may only take you a minute to write the code to load that bit from the file format, but it might take you weeks to change your page layout algorithm to accommodate it. If you don’t, customers will open their Word files in your clone and all the pages will be messed up.... [6]

Hats off to Nisus. Their ability to work with .DOC file formats [5] is a great achievement, and a testimony to coding excellence and true grit.

Beyond Nisus (buy it), this is another illustration of why we need to care about file formats. Is the Word 2007 .DOC binary file format really a good way to carry our documents forward?

Today uber-geeks like Schneier seem to be discovering lock-in - decades after I put (small) audiences to sleep with it. A bit late, but it's a start.

Die .DOC die.

--

[1] Nisus doesn't really promote this as a feature. I think they should, but I'm a market of one.

[2] My biggest concession to data lock-in is iPhoto. Even there I know that with AppleScript I could extract the majority of the data structures in my photo libraries - not easy, but doable. If I'm ever lacking for work to do I might turn that into a product.

[3] A grudging effort to appease governments. I suspect China is a big driver.

[4] Spolsky used to be a Microsoft guy, which is probably why he doesn't remember that WordPerfect, Ami Pro, MacWrite and many other extinct applications used to read and write Word's .DOC format. It was "practical to interchange documents" eons before HTML/SGML - in fact it was a necessity as Microsoft rode its proprietary file formats and trade press control to victory.

[5] Not perfectly, of course. Even Nisus Writer Pro can't read and represent a style-generated Word table of contents -- a feature I use extensively.
