An HTML Spaghetti Western: The Serializer, The Parser and The Ugly Blank Lines
Our content serializer has a lot of flaws. I could mention wrapping, which often does not work as expected and should not wrap Asian scripts, a raw mode that is not really raw, and more. But one of the most painful bugs is blank lines suddenly appearing in the serialization of HTML documents. To experience that bug with a modern Gecko-based editor, launch BlueGriffon, click on the drop-down arrow of the New button and create a transitional HTML4 document. Switch to source view and back to WYSIWYG a few times. You will notice two things:
- an extra blank line appears between <head> and <meta> each time you switch to Source view!
- unexplained blank lines appear after the paragraph
These are two different problems. The latter is weird and painful: if I load this trivial empty HTML4 document into Firefox and inspect it using the DOM Inspector or Firebug, I see a text node containing multiple CRs inside the body. That is erroneous under the SGML parsing rules that apply to HTML 4, but it is what the HTML5 parsing model prescribes. All the nodes between </body> and the end of the document are appended to the body element in the DOM, and adjacent text nodes are concatenated. So the serializer is not buggy here: there really are multiple CRs after the end of the paragraph element, and it faithfully outputs the blank lines it sees. I will hack BlueGriffon's source view to work around that.
<hsivonen> glazou: that's intentional, spec compliant and by design
<hsivonen> glazou: sucks for editors. makes sense for all other classes of products ingesting HTML
<hsivonen> glazou: this significantly simplifies dealing with random cruft after </body>
<hsivonen> glazou: WebKit has done this for years with great success
<hsivonen> glazou: this is the Web
<aja> "leave all sense of logic at the door" :-)
I perfectly understand the rationale behind that parsing choice for browsers, but it just sucks for non-browsers. It's a pain for editors, filters, and all tools that have to deal with the source or an exact DOM representation of the document instance. A text node can live in the DOM after the body or even the html element. That's how we can have an xml-stylesheet PI or comments living before the root of the document...
I think the parsing model should be changed here, or adapted to be more precise: extra nodes after the body element should be appended to the body element if and only if they are not all blank (matching /^\s*$/) text nodes. If they are all blank text nodes, they remain in place and are ignored by the UAs, or are deleted from the DOM. That would preserve the behaviour hsivonen needs above and would help editors and transformation tools a lot. A good compromise, in my humble opinion. At least a better one than the current one.
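To make the proposed rule concrete, here is a minimal sketch in C++. It is not parser code from Gecko or from the HTML5 spec: the Node struct and the isBlank and shouldAppendToBody helpers are hypothetical names invented for this illustration, and the blank test only approximates \s with ASCII whitespace.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical toy node: either a text node or anything else
// (element, comment...). Not a real Gecko or DOM type.
struct Node {
    bool isText;
    std::string data;  // text content, or a tag name for non-text nodes
};

// ASCII approximation of /^\s*$/ (assumption: Unicode spaces are ignored).
bool isBlank(const std::string& s) {
    return std::all_of(s.begin(), s.end(), [](unsigned char c) {
        return c == ' ' || c == '\t' || c == '\n' ||
               c == '\r' || c == '\f' || c == '\v';
    });
}

// Proposed rule: the nodes found after </body> are appended to <body>
// only if at least one of them is something other than a blank text node.
bool shouldAppendToBody(const std::vector<Node>& trailing) {
    return std::any_of(trailing.begin(), trailing.end(), [](const Node& n) {
        return !n.isText || !isBlank(n.data);
    });
}
```

With this rule, a run of line breaks after </body> would stay out of the body, while stray text or elements after </body> would still be moved into it as they are today.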
The head/meta problem is a weakness of Gecko's serializer: the mMayIgnoreLineBreakSequence attribute is not correctly used in nsHTMLContentSerializer and nsXMLContentSerializer. In particular, nsXMLContentSerializer::AppendToString should detect whether the string to append contains a CR or LF and consists only of whitespace (à la \s). If both are true, then mMayIgnoreLineBreakSequence should be set to true. Of course, AppendNewLineToString() should return without proceeding if mMayIgnoreLineBreakSequence is true, and AppendIndentation should probably not set it back to false. A few other spots have to be modified too, but all in all it should be doable. I hope there won't be bad side effects, for instance on some specific elements like pre. Stay tuned.
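The flag logic described above can be sketched with a toy class. This is not the real nsXMLContentSerializer: ToySerializer and its Output() accessor are hypothetical, the signatures are simplified to std::string, and resetting the flag to false when real content is appended is my own assumption, made so that a line break after real content is still emitted.

```cpp
#include <string>

// Hypothetical stand-in for the serializer state; the method and member
// names mirror Gecko's, the implementation does not.
class ToySerializer {
public:
    // Append text and remember whether it was whitespace-only
    // and contained a CR or LF.
    void AppendToString(const std::string& aStr) {
        mOutput += aStr;
        if (aStr.empty()) {
            return;
        }
        const bool hasLineBreak =
            aStr.find_first_of("\r\n") != std::string::npos;
        const bool onlyWhitespace =
            aStr.find_first_not_of(" \t\r\n\f\v") == std::string::npos;
        // Assumption: real content resets the flag to false.
        mMayIgnoreLineBreakSequence = hasLineBreak && onlyWhitespace;
    }

    // Emit a line break unless the output already ends with one
    // that may be ignored.
    void AppendNewLineToString() {
        if (mMayIgnoreLineBreakSequence) {
            return;
        }
        mOutput += "\n";
        mMayIgnoreLineBreakSequence = true;
    }

    // Indentation deliberately leaves the flag untouched.
    void AppendIndentation(unsigned aLevel) {
        mOutput.append(aLevel * 2u, ' ');
    }

    const std::string& Output() const { return mOutput; }

private:
    std::string mOutput;
    bool mMayIgnoreLineBreakSequence = false;
};
```

In this toy model, appending "<head>", then the whitespace text node "\n  ", then calling AppendNewLineToString() adds no second line break, which is the behaviour the head/meta case needs.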