Office Binary (doc, xls, ppt) Translator to Open XML

b2xTranslator Team Blog

Friday 26 June 2009

Phase III Reloaded

After the “final” milestone of Phase 3 had already been released, we decided to add two more milestones.


Thursday 7 May 2009

b2xtranslator Phase III – Another Major Release Completed

Phase III of the b2xtranslator project resulted in quite mature tools for translating Microsoft binary Office documents into OpenXML. The main purpose of this project was to demonstrate that it is feasible to develop such translators using only the available file format specifications for the binary formats and for OpenXML.

But you can find some more goodies in this project:

  • the command line oriented architecture allows the translators to be easily integrated into existing systems, e.g. document management systems running on a server
  • the only software requirement is the .Net framework or Mono; thus, portability is practically built in
  • the open source approach allows the code to be re-used like a toolbox
  • implementation notes on developing the translator using the open document specifications of the binary format provided by Microsoft

Is this now the end of the line? No, I don’t think so. Apart from a number of features still missing from the Excel and PowerPoint translators, such as chart translation or pivot tables, we have some more ideas to enhance the b2xtranslators, e.g.

  • broadening the platform support by replacing legacy C-libraries still used for zipping OpenXML documents; thus, Mac OS X could also be supported
  • offering an option to view binary Word documents in a browser by combining the b2xtranslator with another open source project called OpenXML Document Viewer

These are just some of our ideas for a future version.

What about your input? Your feedback would be appreciated in our discussions on how Phase IV of the b2xtranslator should be shaped.

Thursday 26 March 2009

Feature Completion Milestone

All scheduled features have now been implemented in Phase III/Milestone 2 of the b2xTranslator:

  • Excel (xls2x)
    • Shared Formulas
    • String Formatting
    • Data Type Formatting (number, date, currency, etc.)
    • Cell Formatting
  • Power Point (ppt2x)
    • Textbox Formatting
    • Shapes
    • Basic Animations
    • Notes (including Formatting)

The detailed feature mapping description can be found here.

Since the binary translator for Word documents (doc2x) was already quite complete, we mainly concentrated on high-priority bug fixes for it in Phase III.

We will focus in the coming weeks on testing and code stabilization in order to release a stable and highly productive version in April. Then you will also come across an updated b2xTranslator web site on SourceForge including the test reports for the translators and updated documentation on binary Excel and PowerPoint file formats and xls/xlsx & ppt/pptx mapping.

And we are already paving the way for another phase of the b2xTranslator with some exciting new features …

Stay tuned!

Friday 27 February 2009

Phase III - Milestone 1

Time flies when you're having fun developing such awesome software. I can't believe that it's already time to release M1 of Phase III.

The ppt2x and xls2x translators are now in quite good shape, i.e. you can already run some serious translation jobs. Nevertheless, there is still some room for improvement (stay tuned for M2!).

Another nice feature of M1 is the single setup procedure (including context menu registration) for all the translators.

Have some fun with the new release, too!

Wednesday 21 January 2009

b2xTranslator Phase III

Previous releases of the b2xTranslator were mainly focused on the translation from doc to docx (doc2x) while the translators for xls to xlsx (xls2x) and ppt to pptx (ppt2x) have been neglected a bit.

This is going to be changed in Phase III of the b2xTranslator project: The doc2x translator is quite mature now (nevertheless, some of the high priority bugs reported on SourceForge are going to be fixed in Phase III); consequently, the two other translators xls2x and ppt2x will benefit from new feature implementations.

Keeping this in mind the project scope is centered on the following topics:

  • New feature implementation in xls2x and ppt2x
  • Sustaining doc2x (fixing high-priority defects)
  • Quality, performance and regression testing (all translators)
  • Compatibility with Mono (all translators)
  • Checking the completeness and clarity of the file format specifications

While Microsoft provides a migration path from binary Office formats to OpenXML with Office 2007 and the File Format Compatibility Pack for earlier Office versions, the b2xTranslator project is still necessary for the following reasons:

  • Enables the back-office / batch scenario due to its command-line-based architecture
  • Provides a cross-platform story via .Net/Mono, i.e. the translators run, for example, on SUSE Linux
  • Proves the usability and completeness of the file format specifications
  • Allows anyone to use the mapping, code snippets, etc. due to the open source development approach based on the liberal BSD license
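To make the batch scenario a bit more concrete, here is a minimal Python sketch of how a server-side job might decide which translator handles which file. The translator names (doc2x, xls2x, ppt2x) come from this project; the extension table and the function names `translator_for` and `plan_batch` are illustrative assumptions, not part of the b2xTranslator code.

```python
from pathlib import PurePath

# Translator names as used in this blog; the extension table itself
# is an assumption made for this sketch.
TRANSLATORS = {
    ".doc": "doc2x",
    ".dot": "doc2x",
    ".xls": "xls2x",
    ".ppt": "ppt2x",
}


def translator_for(filename):
    """Return the translator name for a binary Office file, or None."""
    suffix = PurePath(filename).suffix.lower()
    return TRANSLATORS.get(suffix)


def plan_batch(filenames):
    """Group file names by the translator that should handle them.

    Files with an unknown extension are skipped.
    """
    plan = {}
    for name in filenames:
        tool = translator_for(name)
        if tool is not None:
            plan.setdefault(tool, []).append(name)
    return plan
```

A document management system could call `plan_batch` on an upload folder listing and then start one translation job per entry.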

There is some other very interesting news coming from Microsoft's document format teams: They've published another set of document-format implementation notes, this time for the ECMA-376 1st Edition implementation in Office 2007 SP2. As with the ODF 1.1 implementation notes published in December, the goal of publishing these notes is to help other implementers improve interoperability with Office, by transparently documenting the details of Microsoft's implementation.

To get to the ECMA-376 implementer notes, go to the DII home page and click on Reference and then select ECMA-376 1st Edition from the dropdown list. You'll then see a treeview control in the panel on the left, which contains the entire structure of the ECMA-376 spec.

Check also Doug's blog for more information ...

Friday 5 December 2008

Phase II Final Milestone

Phase II of the binary translator for Word documents (doc to docx) has been finished. The latest version has been extensively tested and offers a number of new features as described in the previous blog entries.

Let me thank all internal and external contributors who helped to improve the translation quality. In particular, I was amazed and happy at the same time that some of you guys really looked deep into our code and identified quite a few issues. Thanks again!

We are now looking forward to also improving the two other binary translators: xls to xlsx and ppt to pptx. We are currently planning the feature scope of the next release and hope you will follow our work and contribute to it as you did for the Word translator. For example, feature requests can be submitted via the SourceForge tracking system.

Wednesday 5 November 2008

Phase II – Second Milestone

M2 is really a big jump ahead. Let me describe some of the highlights.

Twofold Interoperability

Of course, the main purpose of the binary translator is to provide for interoperability between the binary (legacy) Word documents and the new OpenXML world. However, it also proves to be interoperable on the platform level: The binary translator is not bound to the Windows platform only; it also runs on any platform supporting Mono.

Cross Platform Interoperability -- Use Case

To mention just one use case: You prefer to run Linux on your servers and you want to implement a service on your servers which translates from doc to docx. No problem, just run the binary translator with Mono.
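As a rough sketch of this use case, a server-side service could build the command line as shown below. Starting a .NET executable through the `mono` runtime is standard Mono usage; the executable name and the positional source/target arguments are assumptions made for illustration and may not match the real translator command line.

```python
import sys


def build_command(translator_exe, source, target, platform=sys.platform):
    """Build an argument list for running a translator executable.

    On non-Windows platforms the .NET executable is started through the
    `mono` runtime. The positional source/target arguments are an
    assumption of this sketch, not the documented command line.
    """
    command = [translator_exe, source, target]
    if not platform.startswith("win"):
        command.insert(0, "mono")
    return command
```

The resulting list can be handed to `subprocess.run(command, check=True)` once the argument order has been checked against the translator's own usage message.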

Broadening the Scope

The binary translator is not only a tool for translating your documents from doc to docx but also your templates from dot to dotm. Such a translation includes:

  • styles,
  • macros,
  • autotext entries
  • and even toolbar settings

Of course, the internal document type and the extensions are properly set, i.e. the translation of a doc file containing macros results in a docm file (and not in a docx file).
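The extension rule described above can be sketched in a few lines of Python. The function name `target_extension` is hypothetical, and the assumption that a macro-free template ends up as dotx is ours; only the doc to docx/docm and dot to dotm cases are stated in this post.

```python
def target_extension(source_ext, has_macros):
    """Pick the OpenXML extension for a binary Word file.

    Assumption of this sketch: macro-free templates map to .dotx.
    """
    ext = source_ext.lower().lstrip(".")
    if ext == "doc":
        return ".docm" if has_macros else ".docx"
    if ext == "dot":
        return ".dotm" if has_macros else ".dotx"
    raise ValueError("not a binary Word extension: " + source_ext)
```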

What Else?

Most of the defects reported on SourceForge are fixed (some will be added in the near future, see Next Steps below :-). The shape implementation is now complete (except for shapes that will not be supported, such as the Action* buttons).

The “StructuredStorage” library which was the basic component for reading structured storage files is now also able to create such storages. This extension was necessary for creating macros and OLE objects in the OpenXML documents.
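For readers who have never looked inside a structured storage file, here is a small Python sketch that reads the first fields of the file header. It is not the “StructuredStorage” library itself (which is part of the translator sources), just an illustration; the function name `read_cfb_header` is made up. The signature bytes and field offsets follow the published compound file format.

```python
import struct

# First 8 bytes of every structured storage (compound) file.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def read_cfb_header(data):
    """Return basic header fields, or None if data is not a compound file.

    Layout used here: signature (bytes 0-7), CLSID (8-23), then four
    little-endian 16-bit values starting at offset 24: minor version,
    major version, byte order mark and sector shift.
    """
    if len(data) < 32 or data[:8] != CFB_SIGNATURE:
        return None
    minor, major, byte_order, sector_shift = struct.unpack_from("<HHHH", data, 24)
    return {"major_version": major, "sector_size": 1 << sector_shift}
```

A sector shift of 9 gives the usual 512-byte sectors; a writer has to lay out its streams in units of that size.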

Field handling, in particular for form fields, was improved and revision marks (track changes) and comments are now translated.

An installation routine (MSI) makes it easier for you to install the binary translator under Windows.

Next Steps

Some features still need some more fine-tuning, e.g. charts and SmartArts. This will certainly be taken into account in November. In addition, we are planning to extensively test all the new features of the binary translator before making the final release available at the beginning of December.

Please let us know your feature requests and feedback.

Have fun!

Wednesday 1 October 2008

Phase II – the first milestone has been reached

Today we have uploaded the first milestone release of Phase II – on schedule :-)

You can download the executables and sources as usual from the download area on SourceForge. Some more information is available on the project web site under Supplementary Downloads:

  • the feature scope of the final Phase II release and
  • some results of our "convert & open" tests as part of the unit testing

These "convert & open" tests have proven again that they are a valuable tool for detecting problems: A test run yesterday resulted in an unacceptable error rate of 25%. The analysis of the problems revealed that most of them were related to the newly implemented comment translation feature (the binary file format specification is not so clear here; we will analyse this in more detail and report on it). The issue could easily be remedied, and another "convert & open" test run today resulted in an error rate of only 2% (the remaining erroneous documents will be analysed and the problems fixed in the coming days).
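The idea behind such a test run can be sketched in Python. This is a much weaker check than really opening the result in Word: it only verifies that the output is a ZIP package containing `[Content_Types].xml`, which every OpenXML package must have. Both function names are hypothetical and this is not the team's actual test harness.

```python
import io
import zipfile


def opens_as_package(data):
    """Return True if data is a ZIP archive containing [Content_Types].xml."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return "[Content_Types].xml" in archive.namelist()
    except zipfile.BadZipFile:
        return False


def error_rate(results):
    """Percentage of failed checks in a list of booleans."""
    if not results:
        return 0.0
    failures = sum(1 for ok in results if not ok)
    return 100.0 * failures / len(results)
```

Running `opens_as_package` over every converted file and feeding the results into `error_rate` gives a single number that can be tracked from build to build.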

In a nutshell, M1 is not a big release, yet it is quite an important interim one. We will continue our effort in implementing outstanding features and remedying the problems found.

Stay tuned for future releases!

Monday 22 September 2008

Crossing things off the checklist

It has been three weeks since we started work on the next doc2x build, and we have already implemented some core features of Phase II. So let’s have a look at the current state of development:

  • We enhanced our paragraph conversion and are now able to convert the “Floating Properties” of paragraphs. Thus, the converter fully supports Frames and Textboxes, regardless of whether the Textbox is a floating paragraph or a floating shape.
  • A second core feature that we already finished is the conversion of “Comments”.
  • In addition we fixed many open bugs and made the converter more stable and efficient. By the way, all of you guys are welcome to test the converters and submit bugs to the tracker on!

But there is still something to do before the final release in December: Currently we are tackling the conversion of OLE objects, charts and macros. They are stored as several “Structured Storage Streams” in the binary file format. These streams need to be bundled into a “Structured Storage File” in the “OpenXml” archive.

So one of the main goals of the next week will be to extend our “StructuredStorageReader” library and turn it into a “StructuredStorageWriter” ;) If you want to know more about OLE objects, charts and macros, it might interest you that we just added new documentation to our How To Guides section.

Finally I can tell you that we are going to release the first Milestone of Phase II as planned at the beginning of October.
So long!

Friday 12 September 2008

Binary Translator – Phase II

The Office Binary Translator to OpenXML for Word documents which we developed during the first half year of 2008 was already more than a prototype or proof of concept: A large number of binary Word documents could be translated to OpenXML without any loss of information and in some cases our resulting documents were even better than the documents created using the converter integrated in Office 2007.

Although our translator is already quite mature a number of more complex and less used features are not yet mapped/translated, e.g.

  • frame paragraphs lose their floating properties and are translated to normal paragraphs
  • OLE Objects are currently not supported and are lost after translation
  • SmartArts, Charts and Comments are currently not supported and are lost after translation
  • macros are currently not translated

In addition, some features have not yet been completely implemented:

  • 55 of about 200 different shape types are currently supported
  • Track Changes: Due to its high complexity, the revision marking (track changes) feature is not yet completely implemented; however, paragraph and character property modifications are implemented
  • a number of bugs have been reported which are not yet fixed

We are going to tackle all these features and bugs in Phase II of the Binary Translator project.

Hopefully, our work will be facilitated by the improved specification of the binary formats which was released by Microsoft in June (we will keep you informed about our findings).

Our schedule: Project start is now in September and two intermediate milestones are planned for the beginning of October and November. The final version is planned for the beginning of December. In addition to unit testing we are going to carry out elaborate testing routines to guarantee a stable and high quality translator release.

Some weeks of hard coding and testing work are waiting for us – let’s roll up the sleeves and get it done …

Thursday 10 July 2008

M2 has been released!

Microsoft did a good job in disclosing the binary Office file format specifications in February 2008. Everyone can now use this information to build tools to access existing content in binary Office documents and convert it to another format (e.g. OpenXML) or to use it in some other way.

I put the everyone in italics for some good reason: From a legal point of view everyone can actually do it, no question (see also Microsoft’s Open Specification Promise).

However, be aware! You really need some patience and persistence, including a good amount of willingness to struggle through all these bits and bytes which define the contents and layout of the binary Office documents. Sometimes, we brooded for many hours over the hex dump of a Word document on one side and Microsoft’s (cryptic :-) February release of the specification on the other side until we understood the intricacies of a binary substructure such as the PICF structure.

But don’t be too afraid about this. There is some good help for you:

  • Microsoft has released in the meantime a fully revamped version of the formats specification (it seems that they have taken some of the weak points we have found in the old specification to heart :-). The new specification now contains a lot of examples explaining the binary structures in detail. It is available here
  • The "How-To" guides we have drawn up during the development of the tools (see Documentation tab above). They give you some additional hints how some of the complex data structures are to be interpreted.
  • And last but not least our tools themselves including the sources. Since this is an Open Source project under the quite liberal Berkeley Software Distribution (BSD) license, you can re-use the executables and/or sources in many ways in your own projects.

The translator from doc to docx is already quite mature and only a few feature mappings are missing. The two other translators (ppt to pptx and xls to xlsx) are more in a proof-of-concept phase and certainly need some improvements.

We are currently discussing and planning how this project can evolve in the future. If you have some special requirements or ideas, don’t hesitate to contact us.

Enjoy the M2 release and stay tuned for our plans for the future!
