On little-known web formats.
I’d never paid much attention to the “web archive (.mht)” format used by Internet Explorer, because I do not use IE at all. Yesterday, I was forced to use that dinosaur of a web browser on a bank site, which shall remain nameless except that the programmers of that site would probably give you a “deer in headlights” look if I said anything about palindromes, web standards and cross-browser compatibility. Sigh.
So I saved a completed form in .mht format for later use on the tiny EEE laptop on which I run Windows XP, since it did not have CutePDF installed. Today, to my relief, Opera on my MacBook opened the file, but the relief was short-lived: Opera crashed every time I tried to print it. I looked around the web for a converter from .mht to a more humane format and found only paid software.
That’s when I decided to poke into the file itself and see whether I could salvage the document. Surprisingly, the .mht file format is defined in RFC 2557. Jacob Palme, one of the proposers of the standard, has a web page explaining the MHTML format.
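The format looks simple enough to be parsed with Python’s standard library: an MHTML file is a MIME multipart message, which the email package can read (the mimetypes module only maps file names to content types).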
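To give a rough idea of how little code this might take, here is a minimal sketch. It assumes the archive is an ordinary MIME multipart message, as RFC 2557 describes, and it uses the standard library's email package for the parsing. The function name extract_parts is made up for illustration and is not part of pyMHTML.

```python
import email
from email import policy


def extract_parts(mht_bytes):
    """Return (content_type, content_location, payload_bytes) for each leaf part.

    Hypothetical helper: assumes mht_bytes holds a MIME multipart message.
    """
    message = email.message_from_bytes(mht_bytes, policy=policy.default)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # container parts carry no payload of their own
        # decode=True undoes base64 or quoted-printable transfer encoding
        payload = part.get_payload(decode=True)
        parts.append(
            (part.get_content_type(), part.get("Content-Location"), payload)
        )
    return parts
```

Reading a saved archive in binary mode and passing its bytes to extract_parts should yield the HTML page along with its images and stylesheets, each tagged with the URL it was saved from, provided the file follows the RFC.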
I’ve started a GitHub project, pyMHTML, to write a library to parse web archive files.
Hacking on this library is my fallback for the PyCon sprint sessions.