about Word and PowerPoint formats

As far as I know, the structure of the Microsoft Office file formats is kept secret, meaning there is no way to pull the text out of a .doc file without knowing its exact structure. The same goes for pulling images from a .ppt file. Is this an accurate assumption? If it is, then I was wondering: how do zip archivers turn a 20 KB .doc file into a 2 KB zipped file? Did Microsoft specifically design the format to be friendly to compression algorithms?

In the end, what I am trying to find out is whether it is possible to pull the text out of a .doc file or the images out of a .ppt file. Is this possible?

Google is your friend. Apache POI is a Java library for reading Microsoft Office files:

http://jakarta.apache.org/poi/

Kev

Both OpenOffice and StarOffice can read .doc files. I believe parts of both are written in Java (I'm not quite sure), and OpenOffice is open source, so you could look at how it's done there.

OpenOffice.org 2 is partly Java-based afaik, and it reads all MS Office formats (except Access databases).
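
On the compression part of the question: a zip archiver does not need to know anything about the .doc layout. It usually applies DEFLATE, which only looks for repeated byte sequences, and old Office files tend to contain a lot of padding and repetition. Below is a minimal Python sketch of that idea. The `fake_doc` bytes are a made-up stand-in for illustration, not the real .doc structure, and `compressed_size` is just a helper name chosen for this example.

```python
import os
import zlib


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of data after DEFLATE compression."""
    return len(zlib.compress(data, 9))


# Stand-in for a small .doc file: zero padding plus repeated text.
# This layout is invented for the demo; it is NOT the real .doc format.
fake_doc = b"\x00" * 12000 + b"The quick brown fox jumps over the lazy dog. " * 180

# Random bytes of the same length have no repetition to exploit.
random_data = os.urandom(len(fake_doc))

print("repetitive:", len(fake_doc), "->", compressed_size(fake_doc))
print("random:    ", len(random_data), "->", compressed_size(random_data))
```

The repetitive input shrinks to a small fraction of its size, while the random input barely shrinks at all. So a large compression ratio says the file is redundant, not that the format was designed with zip in mind or that the archiver understands it.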