“Antiword” for Office 2007
I had already written a script that parses Office 2007 documents and extracts their metadata (see here), so it took little effort to rework it into a parser that prints the content of a Word document, much as Antiword does for older versions of the Microsoft Word format.
The script can be downloaded from here. The current version only displays the content of documents created in Word, but it shouldn’t take much work to support the other formats as well (such as Excel and PowerPoint documents).
It works basically the same way as the metadata extractor, but instead of displaying the metadata, it displays the content of the file itself. This is the first version of the script, only slightly modified from the original, so some Word-specific elements (such as tables of contents and cover pages) do not print as nicely as they should. That will be fixed in a future version.
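For readers curious about what "displaying the content" involves, here is a minimal sketch in Python. It is not the script from this post, and the function name docx_to_text is made up for illustration. It relies on two facts about the format: a .docx file is a ZIP archive, and the main body text lives in word/document.xml, where paragraphs are w:p elements and text runs are stored in w:t elements. The sketch ignores tabs, line breaks, headers, footers and footnotes, so treat it as a starting point only.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source):
    """Return the body text of a .docx file, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    Illustrative helper only; not part of the script described above.
    """
    # A .docx file is a ZIP archive; the body is stored in word/document.xml
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    lines = []
    # Each w:p element is a paragraph; its text is split across w:t elements
    for para in root.iter(f"{{{W_NS}}}p"):
        pieces = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        lines.append("".join(pieces))
    return "\n".join(lines)
```

Calling print(docx_to_text("report.docx")), where report.docx stands in for any Word 2007 file, would print the paragraphs in document order. Elements such as tables of contents are stored as fields and structured tags around ordinary runs, which is one reason a simple text dump like this does not render them nicely.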
Thank you for creating this script! I’ve been looking for this kind of thing for a long time. I often use antiword or catdoc, but the new docx format has been slowing me down. But not anymore!
I’m also just starting to learn Perl, and it’s inspiring to see such a great script.
This is awesome! Thank you