Structured storage (also known as a compound file) is a technology for storing hierarchical data within a single file. Microsoft Office uses structured storage as the container for binary Office documents (doc, xls, ppt).
Such a structured storage container is made up of a number of virtual streams that hold the text, data and control structures of the binary Office document; in other words, the container is like a small file system of its own. The content of these streams (or subfiles) is specific to the document type: Word documents contain different streams than Excel spreadsheets or PowerPoint presentations.
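To make the "small file system" idea a little more concrete, here is a minimal Python sketch that reads a few fields from the fixed header at the start of a compound file. The function name `parse_cfb_header` and the returned dictionary keys are our own choices for illustration and are not part of the translator project; the field offsets follow the published compound file header layout. It only inspects the header and does not walk the directory or read any streams.

```python
import struct

# Every compound file starts with this 8-byte signature.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_cfb_header(data):
    """Return a few header fields of a compound file (illustrative sketch only)."""
    if len(data) < 76 or data[:8] != CFB_SIGNATURE:
        raise ValueError("not a compound file")
    # Offset 24: minor version, major version, byte order mark,
    # sector shift, mini sector shift (all 16-bit little-endian).
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from("<5H", data, 24)
    # Offset 44: number of FAT sectors, then the first directory sector.
    num_fat_sectors, first_dir_sector = struct.unpack_from("<2I", data, 44)
    return {
        "major_version": major,
        "minor_version": minor,
        "byte_order": byte_order,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
        "num_fat_sectors": num_fat_sectors,
        "first_dir_sector": first_dir_sector,
    }
```

Calling it on the first bytes of a file, for example `parse_cfb_header(open(path, "rb").read(512))`, is a cheap way to check whether a file is a structured storage container before attempting to interpret its streams.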
Consequently, we have to tackle two levels of format interpretation:
These internal data structures are then translated to equivalent Open XML element structures using mapping tables (the mapping tables are also document type-specific; an early draft of a doc to docx mapping table can be found here).
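As a rough illustration of what a mapping-table lookup could look like, the Python sketch below translates one text run and its character properties into a WordprocessingML `<w:r>` fragment. The table `RUN_PROPERTY_MAP`, its keys ("bold", "italic", "strike") and the function `run_to_xml` are hypothetical names invented for this example; the actual mapping tables of the project are far more extensive and are not reproduced here.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping table: internal character property name -> Open XML element.
RUN_PROPERTY_MAP = {
    "bold": "w:b",
    "italic": "w:i",
    "strike": "w:strike",
}


def run_to_xml(text, properties):
    """Translate one text run and its set of property names into a <w:r> fragment."""
    # Look up each property in the table; unknown properties are skipped in this sketch.
    mapped = [RUN_PROPERTY_MAP[p] for p in sorted(properties) if p in RUN_PROPERTY_MAP]
    if mapped:
        run_props = "<w:rPr>" + "".join(f"<{tag}/>" for tag in mapped) + "</w:rPr>"
    else:
        run_props = ""
    return f"<w:r>{run_props}<w:t>{escape(text)}</w:t></w:r>"
```

Keeping the mapping in a plain table rather than in code makes it easy to review and extend per document type, which is presumably why a table-driven approach fits this kind of translation.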
In the last step, the Open XML element structures are assembled into an Open XML package file.
The diagram below shows a high-level view of the planned architecture of the Office Binary (doc, xls, ppt) Translator to Open XML project. (Note: since all binary Office formats store metadata in the same way, the Property Reader & Generator component can be used for all document types.)
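The packaging step can be sketched in a few lines of Python. An Open XML package is a ZIP archive with a content types part, a root relationships part and the document parts themselves; the example below writes the smallest such set for a Word document. The function name `build_docx` and the one-run-per-paragraph structure are assumptions made for this sketch, and it is not the project's own package writer.

```python
import io
import zipfile
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    "</Types>"
)

ROOT_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    "</Relationships>"
)


def build_docx(paragraphs):
    """Assemble a minimal docx package in memory and return its bytes."""
    # One <w:p> with a single run per input string; text is XML-escaped.
    body = "".join(
        f'<w:p><w:r><w:t xml:space="preserve">{escape(p)}</w:t></w:r></w:p>'
        for p in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", ROOT_RELS)
        package.writestr("word/document.xml", document)
    return buffer.getvalue()
```

Writing the returned bytes to a file with a `.docx` extension, for example `open("out.docx", "wb").write(build_docx(["Hello"]))`, gives a package that can be inspected with any ZIP tool.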

In the first phase of this project, we will focus on translating binary Word documents to Open XML Word documents (doc to docx).
A binary Word file (doc) consists of the following streams inside the structured storage:
The diagram below shows the streams in a sample Word document.

More details will follow in the weeks to come.
Extracting the text contained in a binary Word document is a bit tricky, but we've managed it. See here for more details.
Table formatting is another demanding conversion task. Some details can be found here.