Extracting Images from MS Word Documents

The following is a quick demonstration of how to extract images (and other embedded binary objects) from a Microsoft Word document using Java.

by: Brian Doyle

Microsoft started supporting the ability to save a document as XML in Office 2003. Images are embedded in the XML document as binary data using Base64 encoding. MS Word stores embedded binary data in the w:binData tag. The tag's w:name attribute uses the pseudo-protocol wordml, and the URI is an incrementally numbered name with an extension that matches the file type. For example:

<w:binData w:name="wordml://01000002.gif">
<w:binData w:name="wordml://03000001.png">

The content of the tag is the Base64-encoded binary data (there is no markup to indicate that the embedded data is Base64).
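
As a rough illustration of what decoding that content involves, the sketch below turns a short Base64 string back into bytes. It uses java.util.Base64 from Java 8 and later, which is not what the listing further down uses, and the class name BinDataDecodeSketch and the sample string are made up for the example. The MIME decoder is chosen on the assumption that the encoded text may be wrapped across several lines.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BinDataDecodeSketch {

    /** Decodes the text content of a w:binData element into raw bytes. */
    static byte[] decode(String encodedText) {
        // The MIME decoder ignores line breaks and any other characters
        // outside the Base64 alphabet, so wrapped text decodes cleanly.
        return Base64.getMimeDecoder().decode(encodedText);
    }

    public static void main(String[] args) {
        // Hypothetical sample: the six-byte GIF signature, split over two lines.
        String encodedText = "R0lG\nODlh";
        byte[] raw = decode(encodedText);
        System.out.println(new String(raw, StandardCharsets.US_ASCII)); // prints GIF89a
    }
}
```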

A SAX parser offers a simple way to extract the image data. The following example extracts all of the images (and binaries) from a Word document and writes them out as files.

The source may be downloaded as a zip file. To run it, just compile and run:

java com.doylecentral.word.FileTester wordFile.xml outputDirectory

The class FileTester provides a simple example of how to use the classes. The choice to write to the file system is arbitrary and can be changed to send the data to a database or wherever it is needed.
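
The FileTester source is not reproduced here, so the following is only a guess at what the file-writing step might look like, not the author's implementation. The class ImageWriterSketch and its methods are hypothetical. It assumes a map from w:name values to decoded bytes and strips the wordml:// prefix to get a file name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class ImageWriterSketch {

    private static final String PREFIX = "wordml://";

    /** Turns a name such as wordml://01000002.gif into 01000002.gif. */
    static String toFileName(String wordmlName) {
        String name = wordmlName.startsWith(PREFIX)
                ? wordmlName.substring(PREFIX.length())
                : wordmlName;
        // Keep only the last path segment so nothing is written outside the target directory.
        return name.substring(name.lastIndexOf('/') + 1);
    }

    /** Writes every entry of the map as one file inside outputDir. */
    static void writeAll(Map<String, byte[]> images, Path outputDir) throws IOException {
        Files.createDirectories(outputDir);
        for (Map.Entry<String, byte[]> entry : images.entrySet()) {
            Files.write(outputDir.resolve(toFileName(entry.getKey())), entry.getValue());
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for the map an extractor would produce.
        Map<String, byte[]> images = new LinkedHashMap<>();
        images.put("wordml://03000001.png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        Path outputDir = Files.createTempDirectory("word-images");
        writeAll(images, outputDir);
        System.out.println("Wrote " + images.size() + " file(s) to " + outputDir);
    }
}
```

Swapping writeAll for a method that stores the bytes elsewhere (a database, for instance) would leave the extraction step untouched.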

The concept of the code extraction is shown in the code below (the zip file contains a refactored version).

Comments may be left on the blog as normal.

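  import java.io.ByteArrayInputStream;
  import java.io.ByteArrayOutputStream;
  import java.io.CharArrayWriter;
  import java.io.IOException;
  import java.io.InputStream;
  import java.util.HashMap;
  import java.util.Map;
  import javax.xml.parsers.ParserConfigurationException;
  import javax.xml.parsers.SAXParserFactory;
  import org.xml.sax.Attributes;
  import org.xml.sax.Locator;
  import org.xml.sax.SAXException;
  import org.xml.sax.helpers.DefaultHandler;

  public class ImageExtractor {
   CharArrayWriter text = new CharArrayWriter();

   // Maps each w:name value to the decoded bytes (byte[]) of that object
   Map dataMap = new HashMap();

   // Number of w:binData elements seen, whether or not their data was kept
   int foundImages;

   public ImageExtractor() {
      //C
   }
   /**
    * InputStream is closed internally.
    * @param is
    * @throws IOException
    */
   public ImageExtractor(InputStream is) throws IOException {
      parseXmlFile(is, new ImageParseHandler(), false);
      is.close();
   }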
   /**
    * Refuse to validate against the DTD.
    * @param is
    * @param handler
    * @param validating
    */
   private void parseXmlFile(InputStream is, DefaultHandler handler,
        boolean validating)
   {
      try
      {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validating);
        factory.newSAXParser().parse(is , handler);
      }
      catch (SAXException e)
      {
        // A parsing error occurred; the xml input is not valid
      }
      catch (ParserConfigurationException e)
      {
        // The parser could not be configured
      }
      catch (IOException e)
      {
        // The input could not be read
      }
   }

   private class ImageParseHandler extends DefaultHandler {

      private boolean inImage = false;

      private StringBuffer encodedDataSb = null;

      private String imageName;

      Locator locator;

      public void setDocumentLocator(Locator locator)
      {
        this.locator = locator;
      }

      public void characters(char[] chars, int start, int len)
           throws SAXException
      {
        if (inImage)
        {
           encodedDataSb.append(new String(chars, start, len));
        }

      }

      public void startElement(String uri, String localName, String qName,
           Attributes attributes) throws SAXException
      {
        text.reset();
        if (qName.equals("w:binData"))
        {
           imageName = attributes.getValue("w:name");
           // Only .png and .jpg data is captured here; other types are counted but skipped
           if (imageName != null
                 && (imageName.endsWith(".png") || imageName.endsWith(".jpg")))
           {
              encodedDataSb = new StringBuffer();
              inImage = true;
           }
           else
           {
              inImage = false;
           }
           foundImages++;
        }
      }

      public void endElement(String uri, String localName, String qName)
           throws SAXException
      {
        if (qName.equals("w:binData") && inImage)
        {
           // Decode the collected Base64 text and store the bytes under the image name
           ByteArrayInputStream is = new ByteArrayInputStream(encodedDataSb.toString().getBytes());
           ByteArrayOutputStream baos = new ByteArrayOutputStream();
           ImageDecoder id = new ImageDecoder();
           id.decodeImage(is, baos);
           dataMap.put(imageName, baos.toByteArray());
           try
           {
              is.close();
              baos.close();
           }
           catch (IOException e)
           {
              // TODO Auto-generated catch block
              e.printStackTrace();
           }
           inImage = false;
        }
      }

   }

   public int getFoundImages()
   {
      return foundImages;
   }
   public Map getDataMap()
   {
      return dataMap;
   }
}

A small container class holds the Base64 decoding. It could be modified to use the Base64 decoder from the Jakarta Commons Codec package if desired. This would be necessary for users of the GNU Classpath project with GCJ.
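
// Assumption: BASE64Decoder is Sun's internal sun.misc class; the original listing omits its imports
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import sun.misc.BASE64Decoder;

public class ImageDecoder {

   public void decodeImage(InputStream is, OutputStream os)
   {
      BASE64Decoder decoder = new BASE64Decoder();
      try
      {
        // Reads Base64 text from is and writes the decoded bytes to os
        decoder.decodeBuffer(is, os);
      }
      catch (IOException e)
      {
        // TODO Auto-generated catch block
        e.printStackTrace();
      }
   }
}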
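
For readers without Sun's internal decoder, one possible substitute is sketched below. It relies on java.util.Base64 (Java 8 and later) rather than Commons Codec, and the class name PortableImageDecoder is invented for this example. The method keeps the decodeImage(InputStream, OutputStream) shape but, unlike the class above, lets the IOException propagate to the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PortableImageDecoder {

    /** Reads Base64 text from is and writes the decoded bytes to os. */
    public void decodeImage(InputStream is, OutputStream os) throws IOException {
        // wrap() returns a stream that decodes while reading; the MIME
        // decoder skips line breaks in the encoded text.
        InputStream decoded = Base64.getMimeDecoder().wrap(is);
        byte[] buffer = new byte[4096];
        int read;
        while ((read = decoded.read(buffer)) != -1) {
            os.write(buffer, 0, read);
        }
        // Closing the streams is left to the caller.
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample input: the GIF signature, wrapped over two lines.
        byte[] encoded = "R0lG\r\nODlh".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        new PortableImageDecoder().decodeImage(new ByteArrayInputStream(encoded), out);
        System.out.println(new String(out.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```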
