Office Binary (doc, xls, ppt) Translator to Open XML


High-Level Architecture Description

Structured storage (also known as compound file) is a technology for storing hierarchical data within a single file. Microsoft Office uses structured storage as the container for binary Office documents (doc, xls, ppt).

Such a structured storage container is made up of a number of virtual streams which contain the text, data and control structures of the binary Office document. In other words, the container is like a small file system of its own. The content of these streams (or subfiles) is document type-specific: Word documents contain different streams than Excel spreadsheets or PowerPoint presentations.
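
To make the container level more concrete, here is a minimal Python sketch that checks whether a block of bytes starts with a compound file header and decodes a few header fields. It is an illustration only, not the project's reader component: the function name `parse_cfb_header` and the returned dictionary keys are assumptions made for this example, and the field offsets follow the published compound file header layout.

```python
import struct

# Signature that opens every compound file (structured storage container).
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_cfb_header(header: bytes) -> dict:
    """Decode a few fields from the 512-byte compound file header.

    Hypothetical helper for illustration; it does not walk the FAT
    or the directory, it only validates and reads the header.
    """
    if len(header) < 512:
        raise ValueError("header must be at least 512 bytes")
    if header[:8] != CFB_SIGNATURE:
        raise ValueError("not a structured storage file")

    # Offsets 24..33: minor version, major version, byte order,
    # sector shift, mini sector shift (all little-endian 16-bit).
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", header, 24
    )
    if byte_order != 0xFFFE:
        raise ValueError("unexpected byte order mark")

    # Offset 48: sector number where the directory (the list of
    # storages and streams) starts.
    (first_dir_sector,) = struct.unpack_from("<I", header, 48)

    return {
        "major_version": major,
        "minor_version": minor,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
        "first_directory_sector": first_dir_sector,
    }
```

For a real file, the first 512 bytes read in binary mode would be passed to the function. The directory found at `first_directory_sector` is what lists the individual streams, which is where the document type-specific interpretation begins.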

Consequently, we have to tackle two levels of format interpretation:

These internal data structures are then translated to equivalent Open XML element structures using mapping tables. The mapping tables are also document type-specific; an early draft of a doc to docx mapping table can be found here.
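
The idea of a mapping table can be sketched in a few lines of Python. The table below is purely hypothetical: the binary-side property names (`bold`, `italic`, `strike`) and the function `map_run_properties` are invented for this example and are not taken from the project's actual doc to docx mapping table. Only the target element names (`w:rPr`, `w:b`, `w:i`, `w:strike`) are real WordprocessingML elements.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical mapping table: property name as decoded from the binary
# document -> local name of the equivalent Open XML element.
RUN_PROPERTY_MAP = {
    "bold": "b",
    "italic": "i",
    "strike": "strike",
}


def map_run_properties(binary_props: dict) -> ET.Element:
    """Translate decoded on/off run properties into a w:rPr element.

    Properties that are switched off or have no entry in the
    mapping table are skipped.
    """
    rpr = ET.Element(f"{{{W_NS}}}rPr")
    for name, enabled in binary_props.items():
        target = RUN_PROPERTY_MAP.get(name)
        if target is None or not enabled:
            continue
        ET.SubElement(rpr, f"{{{W_NS}}}{target}")
    return rpr
```

Keeping the mapping in a plain table rather than in code means that a separate table can be maintained per document type without changing the translation logic.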

In a last step, the Open XML element structures are assembled into an Open XML package file.
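
As a rough illustration of this last step, the following Python sketch assembles a minimal Open XML Word package from a list of plain-text paragraphs, using only the standard library. It is a sketch under simplifying assumptions, not the project's packaging component: the function names `build_document_xml` and `assemble_docx` are made up for this example, and the package contains only the three parts that a minimal docx file needs.

```python
import io
import zipfile
from xml.sax.saxutils import escape

XML_DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'

# Declares the content type of every part in the package.
CONTENT_TYPES = XML_DECL + (
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

# Package-level relationship pointing at the main document part.
ROOT_RELS = XML_DECL + (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)


def build_document_xml(paragraphs):
    """Wrap each paragraph string in w:p / w:r / w:t elements."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    return XML_DECL + (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main">'
        f"<w:body>{body}</w:body></w:document>"
    )


def assemble_docx(paragraphs) -> bytes:
    """Return the bytes of a minimal docx package (a ZIP archive)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", ROOT_RELS)
        package.writestr("word/document.xml", build_document_xml(paragraphs))
    return buffer.getvalue()
```

Writing the returned bytes to a file with a .docx extension should give a package that Open XML consumers can open. A full translation would add further parts such as styles, numbering and document properties.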

The diagram below shows a high-level view of the planned architecture of the Office Binary (doc, xls, ppt) Translator to Open XML project. Note: Since all binary Office formats store metadata in the same way, the Property Reader & Generator component can be used for all document types.

Architecture

In the first phase of this project, we will focus on the translation of binary Word documents to Open XML Word documents (doc to docx).

Streams in a Word Document

A binary Word file (doc) consists of the following streams inside the structured storage:

The diagram below shows the streams in a sample Word document:

Streams in a sample Word document

More details will follow in the weeks to come.

How to Retrieve Text from a Binary .doc File

Extracting the text contained in a binary Word document is a bit tricky, but we have managed it. Have a look here for more details.

A Guide to Table Formatting

Table formatting is another demanding conversion task. Some details can be found here.

Mapping Tables

File Format Documentation

Project page on SourceForge
