Microsoft did a good job in disclosing the specification of the binary Office file formats in February 2008 (see http://www.microsoft.com/interop/docs/officebinaryformats.mspx). Everyone can now use this information to build tools that access existing content in binary Office documents and convert it to another format (e.g. OpenXML) or use it in some other way.

I emphasize "everyone" for a good reason: from a legal point of view, everyone can actually do it, no question (see also Microsoft’s Open Specification Promise at http://www.microsoft.com/interop/osp/default.mspx).

However, beware! You really need patience and persistence, plus a good amount of willingness to struggle through all the bits and bytes that define the contents and layout of binary Office documents. Sometimes we brooded for many hours over the hex dump of a Word document on one side and Microsoft’s (cryptic :-) February release of the specification on the other, until we understood the intricacies of a binary substructure such as the PICF structure.
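To give you an idea of where such a hex-dump session starts, here is a minimal sketch in Python. It is not taken from our tools; the function names `is_compound_file` and `hex_dump` are made up for this illustration. It relies only on the fact that binary Office documents are stored as compound files, which begin with the 8-byte signature D0 CF 11 E0 A1 B1 1A E1. Decoding actual structures such as PICF still requires the specification.

```python
# Illustrative sketch only; the helper names are hypothetical.

# Signature found in the first 8 bytes of a compound file,
# the container used by binary .doc, .xls and .ppt documents.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def is_compound_file(data: bytes) -> bool:
    """Return True if the data starts with the compound file signature."""
    return data[:8] == CFB_SIGNATURE


def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values and printable ASCII, one row per line."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Non-printable bytes are shown as dots.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08X}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Synthetic sample: the signature followed by eight zero bytes.
    sample = CFB_SIGNATURE + bytes(8)
    print(is_compound_file(sample))
    print(hex_dump(sample))
```

To look at a real document, read its first few hundred bytes in binary mode and pass them to `hex_dump`; from there on it is the specification and a lot of patience.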

But don’t let this scare you off. There is some good help available:

  • Microsoft has since released a fully revamped version of the formats specification (it seems they have taken to heart some of the weak points we found in the old specification :-). The new specification contains many examples that explain the binary structures in detail. It is available at http://msdn.microsoft.com/en-us/library/cc216514.aspx.
  • The "How-To" guides we drew up during the development of the tools (see Documentation tab above). They give you additional hints on how to interpret some of the complex data structures.
  • And last but not least, our tools themselves, including the sources. Because this is an Open Source project under the quite liberal Berkeley Software Distribution (BSD) license, you can re-use the executables and/or sources in many ways in your own projects.

The translator from doc to docx is already quite mature, and only a few feature mappings are missing. The two other translators (ppt to pptx and xls to xlsx) are still more in a proof-of-concept phase and certainly need some improvements.

We are currently discussing and planning how this project can evolve. If you have any special requirements or ideas, don’t hesitate to contact us.

Enjoy the M2 release and stay tuned for our plans for the future!