Word to HTML Conversion—Progress Report

In a blog post, Martin Eve, Professor of Literature, Technology and Publishing at Birkbeck, University of London, and Editoria advisory board member, summed up the impact of Microsoft Word on the current system of scholarly communication:

It may sound overblown, but a crucial stumbling block in reconfiguring the economics of scholarly communications for the digital age is Microsoft Word. Specifically, the fact that users are wedded to this format presents typesetting and conversion costs that are completely out of proportion to the needs of the system.

I encourage everyone to read Martin’s post, which summarizes nicely one of the central challenges posed by the Editoria project: how to get manuscripts out of Word as early in the editorial production process as possible.

From the outset, we felt that the only viable way to both control the outputs of the editorial production process and build a collaborative editing environment was to remove Word from the process entirely. Because Word is so ubiquitous among authors of monographs and journal articles, many add-ins and plug-ins have been built for it over the years. While that approach can work very well in an environment where documents are passed from one participant to the next, it is not appropriate for a collaborative, web-based editorial production environment, which is what we intended to build. Moreover, it forces publishers to keep relying on a piece of proprietary software that was never designed with the needs of scholarly communication in mind.

Thus, one of the first tasks we set ourselves in the development of Editoria was to build an ingest and conversion mechanism that would allow project editors (or anyone else on the team, for that matter) to upload a Word document and convert it to HTML, which is how content is stored in Editoria. This allows authors to continue to use their tool of choice, Word, to craft the manuscript, and allows us to enforce document structure and semantics once the manuscript is submitted and we take over production. We’ve been working with Wendell Piez, Alex Theg, and Adam Hyde at the Collaborative Knowledge Foundation on a set of XSLT 2.0 stylesheets that extract and refine data from the MS Office Open XML (.docx) format and produce HTML for editorial workflows. These stylesheets will be incorporated into Editoria.
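For readers curious about what "extraction" starts from, here is a minimal sketch in Python. It is not the XSLT pipeline described above, only an illustration of the input: a .docx file is a ZIP package whose main body lives in word/document.xml, with paragraphs stored as w:p elements and text as w:t elements. The function names are hypothetical and chosen for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraphs_from_document_xml(xml_bytes):
    """Return the plain text of each w:p paragraph, in document order."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(W + "p"):
        # A paragraph's text is split across runs; join every w:t inside it.
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        paragraphs.append(text)
    return paragraphs


def paragraphs_from_docx(path):
    """Open a .docx package and read paragraphs from its main document part."""
    with zipfile.ZipFile(path) as package:
        return paragraphs_from_document_xml(package.read("word/document.xml"))
```

This sketch discards all formatting and structure, which is exactly the information a real conversion has to preserve and tidy up.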

As it exists now, the first part of the conversion is to extract as much information as possible from the Word document and clean it up. Initially there’s some noise in it: duplicate tags, Word-specific tags we don’t need in the HTML, and empty tags left over from user actions in Word (say, bolding and then unbolding some text). That means there’s a fair amount of cleanup to do. We want to make this extraction and cleanup as robust as possible so it can handle a wide range of author input.
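To give a sense of the kind of cleanup involved, here is a minimal sketch in Python. It is not the Coko stylesheets, just an illustration under stated assumptions: the input is a well-formed XML fragment, the set of inline tags is a guess made for this example, and all function names are hypothetical. It removes empty inline tags and merges adjacent duplicate tags.

```python
import xml.etree.ElementTree as ET

# Assumed set of inline tags worth cleaning; a real pipeline would differ.
INLINE = {"b", "i", "u", "span"}


def remove_empty_inline(parent):
    """Drop inline elements with no text and no children, keeping tail text."""
    for child in list(parent):
        remove_empty_inline(child)
    prev = None
    for child in list(parent):
        if child.tag in INLINE and len(child) == 0 and not child.text:
            tail = child.tail or ""
            # Re-attach the text that followed the removed element.
            if prev is None:
                parent.text = (parent.text or "") + tail
            else:
                prev.tail = (prev.tail or "") + tail
            parent.remove(child)
        else:
            prev = child


def merge_adjacent(parent):
    """Merge sibling inline elements with the same tag and attributes."""
    for child in list(parent):
        merge_adjacent(child)
    prev = None
    for child in list(parent):
        same = (prev is not None and child.tag in INLINE
                and child.tag == prev.tag and child.attrib == prev.attrib
                and not prev.tail)  # no text between the two elements
        if same:
            if len(prev):
                prev[-1].tail = (prev[-1].tail or "") + (child.text or "")
            else:
                prev.text = (prev.text or "") + (child.text or "")
            prev.extend(list(child))
            prev.tail = child.tail
            parent.remove(child)
        else:
            prev = child


def clean(fragment):
    """Run both cleanup passes over a well-formed fragment and serialize it."""
    root = ET.fromstring(fragment)
    remove_empty_inline(root)
    merge_adjacent(root)
    return ET.tostring(root, encoding="unicode")


print(clean("<p>A <b>bold</b><b></b><b> text</b> here</p>"))
```

Running the empty-tag pass first matters here: once the leftover empty tag is gone, the two bold runs become adjacent and can be merged into one.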

The second part of the conversion is enhancing the clean HTML. Currently, the conversion translates footnotes and endnotes from Word into numbered links in the HTML that connect each note to its inline callout. It also captures indentation for paragraphs and block quotes. Soon the converter will be able to recognize things like links, potential headings, and text alignment. It’s a long road, but the Coko team has made significant progress on developing enhancements and refining the tool.
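The note linking can be pictured with a small Python sketch. This is an illustration only, not how the converter is implemented: the [[fn:ID]] placeholder syntax, the fn/fnref id scheme, and the function name are all assumptions invented for this example, and each note is assumed to be called out once.

```python
import re
from html import escape

# Hypothetical placeholder marking where a note is called out in the text.
MARKER = re.compile(r"\[\[fn:([A-Za-z0-9_-]+)\]\]")


def link_notes(body, notes):
    """Replace note markers with numbered links and build the notes list.

    body  -- HTML string containing [[fn:ID]] markers (assumed already safe)
    notes -- dict mapping note IDs to plain-text note content
    """
    order = []  # note IDs in the order their callouts appear

    def callout(match):
        note_id = match.group(1)
        if note_id not in notes:
            return match.group(0)  # leave unknown markers untouched
        order.append(note_id)
        n = len(order)
        # The callout links down to the note; its id lets the note link back.
        return f'<sup><a id="fnref{n}" href="#fn{n}">{n}</a></sup>'

    linked = MARKER.sub(callout, body)
    items = "".join(
        f'<li id="fn{n}">{escape(notes[note_id])} <a href="#fnref{n}">back</a></li>'
        for n, note_id in enumerate(order, start=1)
    )
    return linked, f"<ol>{items}</ol>"


body_html, notes_html = link_notes(
    "<p>First claim.[[fn:a]] Second claim.[[fn:b]]</p>",
    {"a": "Note A text.", "b": "Note B text."},
)
print(body_html)
print(notes_html)
```

Numbering by order of appearance, rather than by Word's internal note IDs, keeps the visible numbers sequential even if notes were added or deleted during drafting.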

Because the MS Office Open XML format is so complex, and because so much is required to tidy up a document authored in Word and present it as structured HTML, some styling and structuring of a manuscript will always need to be done after conversion by a skilled editor or someone familiar with the structure of a book. But the work the Coko team has done on the conversion pipeline is getting us much closer to the day when all work on a manuscript after submission can happen on the web.

Further reading