Structured storage (also known as compound file) is a technology for storing hierarchical data within a single file. Microsoft Office uses structured storage as a container for binary Office documents (doc, xls, ppt), and also for some special objects, such as macros, inside an Open XML document. Information about the StructuredStorage class we are using in this translator project can be found here.
Such a structured storage container is made up of a number of virtual streams that hold the text, data and control structures of the binary Office document; in other words, the container is a small file system of its own. The content of these streams (or subfiles) is specific to the document type: Word documents contain different streams than Excel spreadsheets or PowerPoint presentations.
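As a rough illustration of the container idea, the sketch below checks whether a byte buffer starts with the compound file signature and decodes a few fixed-position header fields. It is written in Python for brevity and is not the StructuredStorage class used in this project; the function name read_cfb_header and the returned dictionary keys are assumptions made for this example.

```python
import struct

# Signature that opens every compound file (structured storage container).
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def read_cfb_header(data: bytes) -> dict:
    """Decode a few fixed-position fields from a compound file header.

    Hypothetical helper for illustration only; it does not walk the
    directory or read any stream content.
    """
    if len(data) < 34:
        raise ValueError("too short to hold a compound file header")
    if data[:8] != CFB_SIGNATURE:
        raise ValueError("not a structured storage (compound file) container")
    # Bytes 8-23 hold a CLSID; the next five fields are little-endian
    # 16-bit values: minor version, major version, byte order mark,
    # sector shift and mini sector shift.
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", data, 24
    )
    return {
        "major_version": major,
        "minor_version": minor,
        "byte_order": byte_order,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
    }
```

Passing the first 512 bytes of a doc, xls or ppt file to this function should be enough to tell a binary Office document apart from an Open XML (ZIP) package, which would raise the ValueError instead.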
Consequently, we have to tackle two levels of format interpretation:
These internal data structures are then translated to equivalent Open XML element structures using mapping tables (the mapping tables are also document type-specific; a very early version of a doc to docx mapping table can be found here).
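The idea of a mapping table can be sketched in a few lines of Python. This is only an illustration, not the project's actual mapping: the binary-side property names ("bold", "italic", "strike") and the function name map_run_properties are hypothetical, while the target elements (w:rPr, w:b, w:i, w:strike) are ordinary WordprocessingML run properties.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (prefix "w" in docx files).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical mapping table: binary-side property name -> Open XML element.
RUN_PROPERTY_MAP = {
    "bold": "b",
    "italic": "i",
    "strike": "strike",
}


def map_run_properties(binary_props: dict) -> ET.Element:
    """Translate decoded run properties into a w:rPr element."""
    rpr = ET.Element(f"{{{W_NS}}}rPr")
    for name, enabled in binary_props.items():
        target = RUN_PROPERTY_MAP.get(name)
        if target is None or not enabled:
            # Unmapped or switched-off properties are skipped in this sketch.
            continue
        ET.SubElement(rpr, f"{{{W_NS}}}{target}")
    return rpr
```

Keeping the mapping in a table rather than in branching code makes it easy to review against the format specification and to maintain one table per document type.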
In the last step, the Open XML element structures are assembled into an Open XML package file. For some content, e.g. macros, this may also require creating a structured storage container inside the Open XML document (cf. Structured Storage Writer in the diagram below).
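To make the packaging step concrete, here is a minimal Python sketch that assembles a bare-bones docx package in memory: a ZIP archive holding the content types part, the root relationships part and the main document part. The function name build_docx is an assumption for this example, and the sketch leaves out everything a real translator needs beyond plain paragraphs (styles, metadata, embedded structured storage for macros).

```python
import io
import zipfile
from xml.sax.saxutils import escape

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

ROOT_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)


def build_docx(paragraphs) -> bytes:
    """Assemble a minimal Open XML Word package from plain-text paragraphs."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w='
        '"http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        f"<w:body>{body}</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", ROOT_RELS)
        package.writestr("word/document.xml", document)
    return buffer.getvalue()
```

Writing the returned bytes to a file with a .docx extension gives a package containing just the three parts above.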
The diagram below shows a high-level view of the architecture of the Office Binary (doc, xls, ppt) Translator to Open XML project. Note: since all the binary Office formats store metadata in the same way, the Property Reader & Generator component can be used for all document types.
In the first phase of this project we focused on the translation of binary Word documents to Open XML Word documents (doc to docx). The other two translators were developed in phase II and phase III of the project.
A binary Word file (doc) consists of the following streams inside the structured storage (the stream structure of the two other binary formats is similar and will be explained in a separate paper):
The diagram below shows the streams in a sample Word document:
During the development of the binary translators we encountered a number of challenging and tricky data structures in the binary Office formats. In addition, the file format specifications are not always precise enough (in particular, the previous versions did not contain any examples!).
We have compiled the experience we gained into a number of "how to" guides, which should be useful for other developers and may also help Microsoft to improve the binary file format specifications.
On June 30, 2008, Microsoft released new technical specifications for the binary formats. They can be downloaded from http://msdn.microsoft.com/en-us/library/cc216514.aspx. Almost all the problems we encountered in the old format specifications have been fixed, and a considerable number of examples have been added. Consequently, the remarks about the format specifications in our documents below refer mainly to the old specifications.
Word, Excel and PowerPoint allow users to draw custom shapes (scribble lines or user-defined polygons). This text shows how such freeform shapes are stored in the files.
Extracting the text contained in a binary Word document is a bit tricky; however, we have managed it. See the document for more details.
Table formatting is another demanding conversion task.
This document gives an overview of the implementation of macros and OLE objects in Microsoft's binary Office file formats. It explains how they are stored in a binary Word file and why it doesn't take a specification to convert them to the new Open XML format.
Running the Binary Office to Open XML translator under Linux or other non-Windows platforms? No problem!
It works out of the box! You only have to install Mono.
Just execute doc2x.exe with Mono and see how it works:
mono doc2x.exe <file_name>
We tested doc2x.exe in a Kubuntu VM with Mono JIT compiler version 1.2.3.1:
The following features are implemented in the M2 Release of the doc/docx translator (this is a fairly high-level view of the feature mapping; a more detailed mapping table is available in XLSX or PDF format):
An initial mapping from ppt to pptx and from xls to xlsx was planned for M2. This mapping covers the following features.