Stream-based processing of zipped XML content

Most office applications store their documents as a set of XML files packed into a ZIP archive.

This article shows how to process and manipulate such content.

1. Manipulating ZIP file content on the fly

ZIP files can be read and written with Java's built-in ZipInputStream and ZipOutputStream classes from java.util.zip. The following loop copies every entry from one ZIP file to another (IOUtils is assumed to be the helper class from Apache Commons IO):

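import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import org.apache.commons.io.IOUtils;

ZipInputStream zipsrc = new ZipInputStream(src);
ZipOutputStream zipdest = new ZipOutputStream(dest);
ZipEntry entry;
while ((entry = zipsrc.getNextEntry()) != null) {
    // Create a fresh entry under the same name; size and CRC are recomputed on write
    zipdest.putNextEntry(new ZipEntry(entry.getName()));
    IOUtils.copy(zipsrc, zipdest);
    zipsrc.closeEntry();
    zipdest.closeEntry();
}
// Write the ZIP central directory; without it the output archive is incomplete
zipdest.finish();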
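To manipulate content rather than just copy it, the same loop can treat one entry differently from the rest. The following is a minimal sketch that uses only the JDK (no IOUtils); the class name ZipRewrite, the method names and the entry names are made up for illustration. It copies all entries unchanged except one, whose content is replaced by new bytes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipRewrite {

    /** Copies all entries from src to dest; the entry called targetName gets newContent instead. */
    static void rewrite(InputStream src, OutputStream dest, String targetName, byte[] newContent)
            throws IOException {
        ZipInputStream zipsrc = new ZipInputStream(src);
        ZipOutputStream zipdest = new ZipOutputStream(dest);
        ZipEntry entry;
        while ((entry = zipsrc.getNextEntry()) != null) {
            zipdest.putNextEntry(new ZipEntry(entry.getName()));
            if (entry.getName().equals(targetName)) {
                // Replaced entry: write the new bytes; closeEntry() below skips the old data
                zipdest.write(newContent);
            } else {
                // Unchanged entry: stream the decompressed bytes through a small buffer
                pipe(zipsrc, zipdest);
            }
            zipsrc.closeEntry();
            zipdest.closeEntry();
        }
        zipdest.finish(); // writes the central directory but leaves dest open
    }

    /** Copies everything that is left in the current stream position to out. */
    static void pipe(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    /** Demo helper: builds an in-memory ZIP from name/content pairs. */
    static byte[] zip(String... namesAndContents) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (int i = 0; i < namesAndContents.length; i += 2) {
                out.putNextEntry(new ZipEntry(namesAndContents[i]));
                out.write(namesAndContents[i + 1].getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Demo helper: returns the content of one entry as a string, or null if it is missing. */
    static String read(byte[] zipBytes, String name) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (entry.getName().equals(name)) {
                    ByteArrayOutputStream content = new ByteArrayOutputStream();
                    pipe(in, content);
                    return new String(content.toByteArray(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        byte[] input = zip("a.txt", "alpha", "index.xml", "<old/>");
        ByteArrayOutputStream output = new ByteArrayOutputStream();
        rewrite(new ByteArrayInputStream(input), output, "index.xml",
                "<new/>".getBytes(StandardCharsets.UTF_8));
        System.out.println(read(output.toByteArray(), "index.xml"));
    }
}
```

Only one entry's worth of buffer is held in memory at a time, so this approach should also work for large archives.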

2. Manipulating XML content on the fly

Next we need a way to transform XML on the fly, for example to insert new content into tags, preferably by working on an XML stream so that the full document never has to be held in memory. In Java this can be done using the Streaming API for XML (StAX). The following loop passes every event through unchanged and is the place to hook in modifications:

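import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;

// Factory for creating new events (text, elements) to insert into the output
XMLEventFactory eventFactory = XMLEventFactory.newInstance();
XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(src);
XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(dest);

while (reader.hasNext()) {
    XMLEvent event = reader.nextEvent(); // typed alternative to casting next()
    // Inspect or replace the event here; unchanged events are passed through
    writer.add(event);
}
writer.flush();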
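As a sketch of what "inserting new content into tags" could look like with this loop: the example below writes a text node directly after every opening tag with a given name. The class name StaxInsert, the method insertText and the tag name used in main are assumptions made for illustration, not part of any library.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class StaxInsert {

    /** Copies XML from src to dest and inserts text right after each opening tag named tagName. */
    static void insertText(Reader src, Writer dest, String tagName, String text)
            throws XMLStreamException {
        XMLEventFactory eventFactory = XMLEventFactory.newInstance();
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(src);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(dest);

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            writer.add(event); // pass the original event through
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(tagName)) {
                // Additional event that was not in the input; the factory escapes the text
                writer.add(eventFactory.createCharacters(text));
            }
        }
        writer.flush();
    }

    public static void main(String[] args) throws XMLStreamException {
        StringWriter out = new StringWriter();
        insertText(new StringReader("<doc><name></name><other>x</other></doc>"), out, "name", "World");
        System.out.println(out);
    }
}
```

Because each event is written as soon as it has been read, memory use stays roughly constant regardless of document size. Whether an XML declaration is emitted at the start of the output depends on the StAX implementation in use.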

Example code

I created an open source library named serialletter that replaces serial letter fields in zipped XML office documents. Currently it supports Apple Pages '09 documents. It is available on GitHub and via Maven Central:

<dependency>
    <groupId>de.ralfebert</groupId>
    <artifactId>serialletter</artifactId>
    <version>1.0.0</version>
</dependency>
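Independently of the library, the two techniques from this article can be combined by running the StAX loop directly on the ZIP streams. The sketch below is not the serialletter API; the class name ZippedXmlTransform and the entry and tag names are hypothetical. One detail matters here: some XML parsers close their input stream when the document ends, which would close the whole ZIP stream, so the sketch wraps it in a small filter that ignores close().

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class ZippedXmlTransform {

    /** Hides close() so that a parser cannot close the underlying ZIP stream. */
    static InputStream shield(InputStream in) {
        return new FilterInputStream(in) {
            @Override
            public void close() {
                // intentionally empty: the caller owns the ZIP stream
            }
        };
    }

    /** Copies a ZIP; in the entry xmlEntryName, text is inserted after each opening tag tagName. */
    static void transform(InputStream src, OutputStream dest, String xmlEntryName,
                          String tagName, String text) throws IOException, XMLStreamException {
        XMLEventFactory eventFactory = XMLEventFactory.newInstance();
        ZipInputStream zipsrc = new ZipInputStream(src);
        ZipOutputStream zipdest = new ZipOutputStream(dest);
        byte[] buffer = new byte[8192];
        ZipEntry entry;
        while ((entry = zipsrc.getNextEntry()) != null) {
            zipdest.putNextEntry(new ZipEntry(entry.getName()));
            if (entry.getName().equals(xmlEntryName)) {
                // The ZIP stream reports end-of-stream at the end of the entry,
                // so the parser sees exactly one XML document
                XMLEventReader reader =
                        XMLInputFactory.newInstance().createXMLEventReader(shield(zipsrc));
                XMLEventWriter writer =
                        XMLOutputFactory.newInstance().createXMLEventWriter(zipdest, "UTF-8");
                while (reader.hasNext()) {
                    XMLEvent event = reader.nextEvent();
                    writer.add(event);
                    if (event.isStartElement()
                            && event.asStartElement().getName().getLocalPart().equals(tagName)) {
                        writer.add(eventFactory.createCharacters(text));
                    }
                }
                writer.flush(); // flush only; the ZIP output must stay open for the next entry
            } else {
                int n;
                while ((n = zipsrc.read(buffer)) != -1) {
                    zipdest.write(buffer, 0, n);
                }
            }
            zipsrc.closeEntry();
            zipdest.closeEntry();
        }
        zipdest.finish();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java ZippedXmlTransform input.zip output.zip
        try (InputStream in = new FileInputStream(args[0]);
             OutputStream out = new FileOutputStream(args[1])) {
            transform(in, out, "content.xml", "name", "World");
        }
    }
}
```

Neither the archive nor the XML document is ever loaded into memory as a whole; each entry is processed as it streams past.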

Ralf Ebert

Ralf Ebert is an independent software developer and trainer. He makes apps for Mac OS X and iOS and offers iOS and Git training courses for software developers.