
Office 2007 metadata

Document metadata can be a great source of information for investigators, and I’ve often come across documents created in Microsoft Word or other Office applications.  There are several scripts and tools that read the proprietary binary format used by Office 2003 and earlier, and I’ve got nothing to add to those.  But I couldn’t find any tool that lists the metadata from documents created with Office 2007, which uses the OpenXML document format.  So I decided to examine it a bit further.

Microsoft has published a good document describing the structure of OpenXML, for instance here. Essentially, a document in the OpenXML format is a compressed file using the well-known ZIP format.  Inside the ZIP file is a predefined structure of files, mostly XML files that describe the document and its content.  So it can easily be read using the standard libraries available in scripting languages such as Perl.
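To see the ZIP structure for yourself, something like the following should do. This is only a minimal sketch in Python (not the Perl used by my script), and the function name list_parts is made up for the illustration:

```python
import zipfile


def list_parts(package):
    """Return the names of all files stored inside an OpenXML package.

    `package` may be a file path or a binary file-like object, since
    zipfile.ZipFile accepts both.
    """
    with zipfile.ZipFile(package) as archive:
        return archive.namelist()


# Hypothetical usage:
#     for name in list_parts("test.docx"):
#         print(name)
```

For a Word document the listing would normally include entries such as _rels/.rels and the files under docProps/, which are discussed next.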

According to Microsoft, the ZIP archive contains a folder called “_rels”.  This folder contains a file named “.rels” that defines the root relationships within the package, so it is the first place to look when parsing the content of the document.  Within the .rels file you find tags that define the relationships of the document:

<Relationship Id="someID" Type="relationshipType" Target="targetPart"/>

Metadata is stored in the files whose relationship type ends in “properties”, most notably “core-properties” and “extended-properties”. These files are usually stored in the following locations:

  • docProps/core.xml
  • docProps/app.xml
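The lookup described above can be sketched roughly like this. It is a Python illustration under my own assumptions, not the code from read_open_xml.pl: the function name find_property_parts is hypothetical, and the namespace is the one normally used for package relationships.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace normally used by the elements in _rels/.rels
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def find_property_parts(package):
    """Map short relationship type -> target file for '*properties' relationships."""
    with zipfile.ZipFile(package) as archive:
        root = ET.fromstring(archive.read("_rels/.rels"))

    parts = {}
    for rel in root.iter(RELS_NS + "Relationship"):
        rel_type = rel.get("Type", "")
        if rel_type.endswith("properties"):
            # Keep only the last segment of the type URI, e.g. "core-properties",
            # and drop a leading "/" so the target matches the ZIP entry name.
            short_type = rel_type.rsplit("/", 1)[-1]
            parts[short_type] = rel.get("Target", "").lstrip("/")
    return parts
```

Going through .rels rather than hard-coding docProps/core.xml means the metadata is still found if a producer stores it somewhere else in the package.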

These files contain the actual metadata, such as the document creator and who last saved the document. They need to be extracted from the archive and parsed to display the metadata.
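Extracting and parsing one of these files could look something like the sketch below. Again this is a Python illustration rather than my Perl script; read_properties is a hypothetical name, and the sketch only looks at the top-level elements, so nested values (such as the heading lists in app.xml) come back as empty strings.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_properties(package, part_name):
    """Return {element name: text} for the top-level elements of a properties file."""
    with zipfile.ZipFile(package) as archive:
        root = ET.fromstring(archive.read(part_name))

    props = {}
    for child in root:
        # ElementTree reports tags as "{namespace}name"; keep only the name.
        local_name = child.tag.rsplit("}", 1)[-1]
        props[local_name] = (child.text or "").strip()
    return props


# Hypothetical usage:
#     for key, value in read_properties("test.docx", "docProps/core.xml").items():
#         print(" %s = %s" % (key, value))
```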

To do this I wrote the script read_open_xml.pl, which parses the contents of the .rels file to locate the metadata files in the document, then extracts the metadata and prints it to the screen. Example usage:

./read_open_xml.pl test.docx
==========================================================================
 cmd line: ./read_open_xml.pl test.docx
==========================================================================

Document name: test.docx
Date: Tue Jun  9 16:51:23 GMT 2009

--------------------------------------------------------------------------
File Metadata
--------------------------------------------------------------------------
 title = my company template
 subject = Document template
 creator = Kristinn Gudjonsson
 keywords = template, word
 description =
 lastModifiedBy = Kristinn Gudjonsson
 revision = 3
 lastPrinted = 2008-08-15T10:14:00Z
 created = 2008-08-15T10:14:00Z
 modified = 2008-08-15T10:14:00Z
 category = template
--------------------------------------------------------------------------
Application Metadata
--------------------------------------------------------------------------
 Template = my_template.dot
 TotalTime = 0
 Pages = 2
 Words = 159
 Characters = 908
 Application = Microsoft Word 12.1.2
 DocSecurity = 0
 Lines = 7
 Paragraphs = 1
 ScaleCrop = false
 Manager = Some dude
 Company = My Company
 LinksUpToDate = false
 CharactersWithSpaces = 1115
 SharedDoc = false
 HyperlinksChanged = false
 AppVersion = 12.0258

copyright, Kristinn Gudjonsson, 2009

The script also reads the character encoding of the XML documents and encodes the output accordingly.  If you experience any problems using the script, please let me know so I can fix them; so far I haven’t come across any OpenXML document that the script failed to parse correctly.

Update 1

I’ve modified the script slightly so it can be used on Windows.  I’ve tested it on a Windows XP SP3 machine using ActivePerl 5.10, and it should work.  You can get the Windows version here.