<Glazblog/>

EPUB3 fun #6

In the EPUB ebook world, there's one sentence I have heard so many times, from so many different sources, that I did not even feel the need to verify it:

"Validation of EPUB is extremely important and we heavily rely on the EPUB Validator"

First thought: "excellent!". People need to distribute validated packages because they don't want to run into big issues on some of the many ebook readers on the market. Excellent.

Unfortunately, after a closer look, the situation seems to me a bit different from that idealistic (I could even say utopian) view... Let me explain (please note I have spent time carefully reading the epubcheck source code for instance):

  • if you count the number of major Standards (de jure, de facto, proprietary) involved in the validation of a given *.epub EPUB3 file, you'll find a few dozen of them. And again, those are only the major ones, the ones a serious industrial validator must absolutely validate against.
  • some validations are complex and expensive. For instance, a serious validation of encryption.xml or signatures.xml involves much more than a RELAX NG (RNG) based validation of the XML instance... Validation of external property vocabularies can be extremely tricky, or painful and expensive, to implement.
  • the first step of validation concerns the ZIP package itself, and the epub30-ocf spec has a few very technical requirements there. I sincerely doubt all of them are validated, and I sincerely doubt that EPUB3 packages all around the world currently pass, or will pass, all the conformance requirements there.
  • in the W3C space alone, an EPUB3 validator must validate against at least a dozen specs.
  • the complexity of some of these specs is huge, and it drastically impacts users (ebook authors), either through the learning curve or financially. Or both.
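
To give an idea of what the ZIP-level step looks like, here is a minimal Python sketch of a few of the epub30-ocf container checks: the mimetype entry must come first, must be stored uncompressed with no extra field, and must contain exactly application/epub+zip. The function name check_ocf_mimetype is mine, not something taken from epubcheck, and the sketch is only an approximation: Python's zipfile module reports the order and extra fields recorded in the ZIP central directory, not the physical local headers the spec talks about, so a real validator has to dig deeper.

```python
import io
import zipfile


def check_ocf_mimetype(data: bytes) -> list[str]:
    """Return a list of problems found in an EPUB's ZIP container.

    Hypothetical helper for illustration only. It reads the central
    directory through zipfile, which approximates (but does not equal)
    a check of the physical local file headers.
    """
    problems = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()
        # The mimetype entry must be the first one in the archive.
        if not infos or infos[0].filename != "mimetype":
            problems.append("first ZIP entry is not 'mimetype'")
            return problems
        first = infos[0]
        # It must be stored, not compressed.
        if first.compress_type != zipfile.ZIP_STORED:
            problems.append("'mimetype' is compressed")
        # It must not carry an extra field.
        if first.extra:
            problems.append("'mimetype' has an extra field")
        # Its content must be the exact media type, without padding.
        if zf.read(first) != b"application/epub+zip":
            problems.append(
                "'mimetype' content is not exactly 'application/epub+zip'"
            )
        # The container must also provide META-INF/container.xml.
        if "META-INF/container.xml" not in zf.namelist():
            problems.append("META-INF/container.xml is missing")
    return problems
```

Calling check_ocf_mimetype(open("book.epub", "rb").read()) on a package (book.epub being a placeholder name) returns an empty list when these few checks pass. It says nothing about the rest of the OCF requirements, and that is exactly the point made above.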

EPUB3 is probably too complex as a spec. In EPUB2, most people did not understand the difference between spine, ncx and guide. The most common question was "why do we have multiple tables of contents and which one is the right one?". In EPUB3, it's a bit better, but only a bit. We have landmarks, multiple tables of contents and still a spine. Personally, I still wonder why there is a manifest of files; probably only because of the MIME types. Hey, even OSes rely on file extensions to infer a MIME type!!!

I'd love to see an EPUB4 appear that is strictly xhtml5/svg/mathml-based. No other XML namespace allowed. No more OCF, OPF, NCX. Manifest of files coming from the ZIP list of entries. Direct inclusion of non-conflicting property vocabularies. No need to have an OCF, we can have a nice index.xhtml. Rely on a vocabulary of classes/IDs/roles (not the ARIA role...) expressing extra constraints/behaviours on existing html5 elements. That would be enough for most ebook authors and publishers, and would drastically decrease the complexity of the publishing chain, IMHO.
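
As a rough illustration of the "manifest coming from the ZIP list of entries" idea, here is a small Python sketch that lists the entries of a package and infers a MIME type from each file extension. Nothing here comes from an existing spec or tool: the function manifest_from_zip, the EXTENSION_TYPES table and the application/octet-stream fallback are all assumptions of mine, made only to show that the information carried by a manifest can be derived from the archive itself.

```python
import io
import mimetypes
import posixpath
import zipfile

# Hypothetical table for the core ebook types, so that the result does
# not depend on the MIME database of the machine running the code.
EXTENSION_TYPES = {
    ".xhtml": "application/xhtml+xml",
    ".svg": "image/svg+xml",
    ".css": "text/css",
    ".png": "image/png",
    ".jpg": "image/jpeg",
}


def manifest_from_zip(data: bytes) -> dict[str, str]:
    """Map every file entry of a ZIP package to an inferred MIME type."""
    manifest = {}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            ext = posixpath.splitext(info.filename)[1].lower()
            media_type = EXTENSION_TYPES.get(ext)
            if media_type is None:
                # Fall back to the standard library's extension database.
                media_type, _encoding = mimetypes.guess_type(info.filename)
            manifest[info.filename] = media_type or "application/octet-stream"
    return manifest
```

A reading system could start from index.xhtml and use such a mapping for everything else. Whether extension-based inference is reliable enough for every kind of resource is an open question that this sketch does not settle.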

Comments

1. On Friday 10 August 2012, 13:10 by Paul Salvette

EPUB2 was too complex as a spec, and now EPUB3 is even more challenging. I fear most eBook authors will keep choosing the "upload a Word doc" to Kindle/Smashwords option to generate nasty, broken eBooks. The learning curve is becoming even steeper, as you mention.

Great series by the way. I just read all 6 posts. Thank you for spending your time writing these.

2. On Friday 10 August 2012, 15:22 by Pete

I think Paul has a point when he mentions that authors simply use Word and expect to get an eBook somehow. Actually, that's the same way print media works. Authors type into Word and do not start using QuarkXPress to get their novels written.

3. On Monday 13 August 2012, 08:17 by Candelaria

With so much content and so many articles, do you ever run into any problems with plagiarism or copyright infringement? My blog has a lot of completely unique content I've either written myself or outsourced, but it seems a lot of it is popping up all over the internet without my permission. Do you know any methods to help stop content from being ripped off? I'd definitely appreciate it.

4. On Wednesday 15 August 2012, 02:26 by karl

And there is validity and… conformance.
Validity makes a file easier to process with the XML toolchain… but conformance is what makes it properly readable, or sort of.

When you open these epub files, you can only notice how badly edited they are.

Each time an epub is badly made, Gutenberg kills a kitten.

5. On Wednesday 15 August 2012, 15:25 by Jean

“xhtml5/svg/mathml-based”

Please don’t forget MusicXML. You can no longer depend on still images, PDF (or worse, Flash!) to display written music on electronic devices.

http://fr.wikipedia.org/wiki/MusicX...

This music sharing website depends on Flash to display lead sheets: http://wikifonia.org