In the EPUB ebook world, there's one sentence I heard so many times from so various sources I did not even feel the need to verify it:

"Validation of EPUB is extremely important and we heavily rely on the EPUB Validator"

First thought: "excellent!". People need to distribute validated packages because they don't want to suffer big issues on the some of the many ebook readers on the market. Excellent.

Unfortunately, after a closer look, the situation seems to me a bit different from that idealistic (I could even say utopian) view... Let me explain (please note I have spent time carefully reading the epubcheck source code for instance):

  • if you count the number of major Standards (de jure, de facto, proprietary) involved in the validation of a given *.epub EPUB3 file, you'll find a few dozens of them. And again, only the major ones, the ones a serious industrial validator must absolutely validate against.
  • some validations are complex and expensive, for instance a serious validation of encryption.xml or signatures.xml will involve much more than just a RNG-based validation of the XML instance... Validation of external property vocabularies can be extremely tricky or painful/expensive to implement.
  • the first step of validation is related to the ZIP package itself, and the epub30-ocf spec has a few very technical requirements there. I sincerely doubt all of them are validated, I sincerely doubt EPUB3 packages all around the world currently pass or will pass all the conformance requirements there.
  • only in the W3C space, an EPUB3 validator must at least validate against a dozen of specs.
  • the complexity of some of these specs is huge, drastically impacting users (ebook authors) either on a learning curve's basis or on a financial one. Or both.

EPUB3 is probably too complex as a spec. In EPUB2, most people did not understand the difference between spine, ncx and guide. Most common question was "why do we have multiple table of contents and which one is the good one?". In EPUB3, it's a bit better but only a bit. We have landmarks, multiple table of contents and still a spine. Personally I still wonder why there is a manifest of files; probably only because of the MIME types. Hey, even OSes rely on file extensions to infer a MIME type!!!

I'd love to see appear an EPUB4 strictly xhtml5/svg/mathml-based. No other XML namespace allowed. No more OCF, OPF, NCX. Manifest of files coming from the ZIP list of entries. Direct inclusion of non-conflicting property vocabularies. No need to have an OCF, we can have a nice index.xhtml. Rely on a vocabulary of classes/IDs/roles (not the ARIA role...) expressing extra constraints/behaviours on existing html5 elements. Would be enough for most ebook authors and publishers and would drastically decrease the complexity of the publishing chain, IMHO.