As far as I know, the structure of the Microsoft Office file formats is kept secret, which would mean there is no way to pull the text out of a .doc file without knowing its exact layout. The same would go for pulling images out of a .ppt file. Is this assumption accurate? If it is, how does a zip archiver turn a 20 KB .doc file into a 2 KB archive? Did Microsoft deliberately design the format to be friendly to compression algorithms?
Ultimately, what I am trying to find out is whether it is possible to extract the text from a .doc file or the images from a .ppt file. Is it?
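To make the compression part of the question concrete, here is a minimal Python sketch using the standard-library zlib module, which implements DEFLATE, the method zip archives usually use. The byte string `fake_doc` is a hypothetical stand-in made of a little text and a lot of zero padding; it is not the real .doc layout. The code only ever sees raw bytes, so whatever shrinking happens here comes from redundancy in the data and not from any knowledge of a file format.

```python
import zlib

# Hypothetical stand-in for an opaque binary file: some repeated text
# followed by zero padding. This is NOT the real .doc structure.
fake_doc = b"Hello, world. " * 50 + b"\x00" * 19000

# DEFLATE compression over raw bytes; nothing here parses the content.
compressed = zlib.compress(fake_doc)

print(len(fake_doc), "->", len(compressed))

# The round trip is lossless.
restored = zlib.decompress(compressed)
assert restored == fake_doc
```

Whether a real .doc file shrinks for the same reason (padding and repeated structures) is my assumption, not something I have verified.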