An HTML Spaghetti Western: The Serializer, The Parser and The Ugly Blank Lines

Our content serializer has a lot of flaws. I could mention wrapping, which often does not work as expected and should not wrap Asian scripts, a raw mode that is not really raw, and more. But one of the most painful bugs is the one that makes blank lines suddenly appear in the serialization of HTML documents. To experience that bug with a modern Gecko-based editor, launch BlueGriffon, click on the drop-down arrow of the New button and create a transitional HTML4 document. Switch to source view and back to WYSIWYG a few times.

  1. there is an extra blank line appearing between <head> and <meta> each time you switch to Source!
  2. there are unexplained blank lines after the paragraph

These are two different problems. The latter one is weird and painful: if I load this trivial empty HTML4 document into Firefox and inspect it using the DOM Inspector or Firebug, I see a text node containing multiple CRs. That text node is erroneous under the SGML parsing rules that apply to HTML 4, but it is what the HTML5 parsing model produces. So the serializer is not buggy here: there really are multiple CRs after the end of the paragraph element. All the nodes between just after </body> and the end of the document are appended to the body element in the DOM, and adjacent text nodes are concatenated. The serializer really sees blank lines. I will hack BlueGriffon source view to work around that.
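To make the re-parenting concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not the real parser: the `Text` and `Element` classes and both function names are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Text:
    data: str


@dataclass
class Element:
    tag: str
    children: list = field(default_factory=list)


def append_merging_text(parent, node):
    """Append node to parent; merge it into the last child if both are text."""
    last = parent.children[-1] if parent.children else None
    if isinstance(node, Text) and isinstance(last, Text):
        last.data += node.data
    else:
        parent.children.append(node)


def reparent_trailing_nodes(body, nodes_after_body):
    """Move everything found after </body> into the body element."""
    for node in nodes_after_body:
        append_merging_text(body, node)
    return body


# A paragraph followed by one line break, then whitespace after </body>.
body = Element("body", [Element("p"), Text("\n")])
reparent_trailing_nodes(body, [Text("\n"), Text("\n\n")])
print(repr(body.children[-1].data))  # '\n\n\n\n'
```

The whitespace that was outside the body in the source ends up as one long text node inside it, which is exactly what the serializer then writes back.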

<hsivonen> glazou: that's intentional, spec compliant and by design
<hsivonen> glazou: sucks for editors. makes sense for all other classes of products ingesting HTML
<hsivonen> glazou: this significantly simplifies dealing with random cruft after </body>
<hsivonen> glazou: WebKit has done this for years with great success
<hsivonen> glazou: this is the Web
<aja> "leave all sense of logic at the door" :-)

I perfectly understand the rationale behind that parsing choice for browsers, but it just sucks for non-browsers. It's a pain for editors, filters, and all tools that have to deal with the source or an exact DOM representation of the document instance. A text node can live in the DOM after the body or even the html element. That's how we can have an xml-stylesheet PI or comments living before the root of the document...

I think the parsing model should be changed here, or adapted to be more precise: extra nodes after the body element should be appended to the body element if and only if they are not all blank (matching /^\s*$/) text nodes. If they are all blank text nodes, they remain in place and are ignored by the UAs, or are deleted from the DOM. That would preserve the behaviour hsivonen needs above and would help editors and transformation tools a lot. A good compromise, in my humble opinion. At least a better one than the current one.
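The rule I am proposing fits in a few lines. The Python sketch below is only an illustration of the idea, not spec text: nodes are plain dicts, and `is_blank_text` and `place_trailing_nodes` are hypothetical names.

```python
import re

# Same test as in the proposal: a string made only of whitespace.
BLANK = re.compile(r"^\s*$")


def is_blank_text(node):
    """True for a text node whose data is empty or whitespace only."""
    return node["type"] == "text" and BLANK.match(node["data"]) is not None


def place_trailing_nodes(body_children, trailing_nodes):
    """Return (new body children, nodes left in place after the body).

    If every node found after </body> is a blank text node, nothing is
    moved. Otherwise all of them are appended to the body, as today.
    """
    if all(is_blank_text(n) for n in trailing_nodes):
        return list(body_children), list(trailing_nodes)
    return list(body_children) + list(trailing_nodes), []


body = [{"type": "element", "tag": "p"}]
blank_only = [{"type": "text", "data": "\n"}, {"type": "text", "data": "\n\n"}]
with_cruft = [{"type": "text", "data": "\n"}, {"type": "text", "data": "stray text"}]

print(place_trailing_nodes(body, blank_only)[0] == body)   # True
print(len(place_trailing_nodes(body, with_cruft)[0]))      # 3
```

Random cruft after </body> still lands in the body, so browsers lose nothing, while a document that ends with plain line breaks keeps them where the author put them. The sketch shows the "remain in place" variant; the "deleted from the DOM" variant would simply return an empty second list.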

The head/meta problem is a weakness of Gecko's serializer: the mMayIgnoreLineBreakSequence attribute is not correctly used in nsHTMLContentSerializer and nsXMLContentSerializer. In particular, nsXMLContentSerializer::AppendToString should detect if the string to append contains CR or LF and contains only whitespace (à la \s). If both are true, then mMayIgnoreLineBreakSequence should be set to true. Of course, AppendNewLineToString() should return without proceeding if mMayIgnoreLineBreakSequence is true, and AppendIndentation should probably not set it back to false. A few other spots have to be modified too, but all in all it should be doable. I hope there won't be bad side effects, for instance on some specific elements like pre. Stay tuned.
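Here is the logic I have in mind, modelled in Python rather than in Gecko's C++. This is a toy, not the actual serializer code: `ToySerializer` and its methods only mirror the names above. Two points are assumptions of the sketch: the flag is reset as soon as non-whitespace text is appended, and emitting a newline sets the flag.

```python
import re

WHITESPACE_ONLY = re.compile(r"\s*")


class ToySerializer:
    def __init__(self):
        self.output = []
        # Stands in for mMayIgnoreLineBreakSequence.
        self.may_ignore_line_break_sequence = False

    def append_to_string(self, text):
        self.output.append(text)
        has_line_break = "\r" in text or "\n" in text
        if WHITESPACE_ONLY.fullmatch(text):
            # Whitespace-only text containing CR or LF already ends the line.
            if has_line_break:
                self.may_ignore_line_break_sequence = True
        else:
            # Assumption: real content makes the next newline meaningful again.
            self.may_ignore_line_break_sequence = False

    def append_newline_to_string(self):
        if self.may_ignore_line_break_sequence:
            return  # a line break was just written, do not add another one
        self.output.append("\n")
        self.may_ignore_line_break_sequence = True

    def append_indentation(self, indent):
        # Deliberately leaves the flag untouched.
        self.output.append(indent)


s = ToySerializer()
s.append_to_string("<head>")
s.append_to_string("\n")        # whitespace text node already in the DOM
s.append_newline_to_string()    # pretty-printing newline, skipped
s.append_indentation("  ")
s.append_to_string("<meta>")
print(repr("".join(s.output)))  # '<head>\n  <meta>'
```

Without the check in `append_to_string`, the same sequence would write two line breaks between <head> and <meta>, and one more on every round trip through the source view.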


1. On Wednesday 5 October 2011, 14:05 by Mardeg

The first I noticed of this was when showing BlueGriffon off to someone recently. They asked me why more and more blank lines were being added as they switched views. My lack of an answer ended their chances of using it.
Sadly I don't think linking them to this blog post would have changed the outcome.

2. On Wednesday 5 October 2011, 19:37 by yt75

Ah the simplicity of lisp oriented things such as Postscript, time to get out of this "human readable format dogma" for document files maybe? After all magazines and books are highly readable, add to this an open, bar codes style, IDs space for operators and terminals/literal types and that's it. But too late most probably.

3. On Saturday 8 October 2011, 02:37 by Mathnerd314

Which version / OS of BlueGriffon is this? I can't reproduce with BlueGriffon 1.2.1 / Windows 7; switching back and forth changes nothing...