Why Is HTML Conversion So Hard?

Or maybe “why does MS Word suck so much” would be better.


I’m creating a bunch of content at work, working off a script given to me to turn into articles and activities. One activity (for middle-school-age kids) is a mnemonics page that explains what mnemonics are and then gives the kids the chance to submit their own. Anyway, the Word document has a section of common mnemonics with the key letters highlighted. I had no desire to go and individually select a letter, bold it, find the next letter, etc., so I figured I’d just convert to HTML and copy and paste into the Perl Mason source file I was working in.


What a horrible, horrible mistake.


First of all, why can’t people create document-to-HTML filters that don’t suck?


I tried in both Word and OpenOffice, and neither gave me really satisfactory, “do it and you’re done” results.


The Word source file for the text was 22k. Opening it up in OpenOffice and selecting “save as webpage” resulted in a 4k HTML file. The file (here) contained minimal style information and only a couple of lines of CSS. The information within <HEAD> was only 12 lines. Most of the tags were sane, but it put a lot of font color information in duplicate (why can’t filters be smart enough to realize that putting multiple “font black” tags around the same text is the same as putting just one around it?). Some of the high-bit characters were converted into garbage. So there’d still be some crap to get rid of, and I’d still have to go through and look for one or two stray characters every couple of lines and remove them.
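The duplicate-wrapper problem, at least, is mechanical to fix. Here’s a minimal Python sketch — the sample markup is made up, but it’s the shape of thing the filter emits — that collapses an immediately nested, identical <font> wrapper into one:

```python
import re

# Made-up fragment in the shape of the OpenOffice output: the same
# "font black" wrapper nested twice around one run of text.
bloated = '<font color="#000000"><font color="#000000">MNEMONIC</font></font>'

# Collapse an immediately nested, identical <font color> wrapper;
# \1 backreferences the captured color so only true duplicates match.
nested = re.compile(r'<font color="(#[0-9A-Fa-f]{6})">'
                    r'<font color="\1">(.*?)</font></font>')
clean = nested.sub(r'<font color="\1">\2</font>', bloated)
print(clean)  # <font color="#000000">MNEMONIC</font>
```

It’s a toy (regex, not a real HTML parser), but it’s more smarts than the export filter itself bothered with.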


Converting to HTML in Word was just scary. While their filters got better around the Word 97–2000 era, they’ve since gone back downhill. I realize that MS wants things to look just great and perfect and duplicate the look of the document itself, but hey, it’s HTML! It’s not the Word document anymore! Let it go!


The result of “save as html” in MS Word XP was a 10k file (here). It consisted of 109 lines in <HEAD>, a third or so of it commented-out XML data (author name, save dates, statistics on lines, paragraphs, and so on). The rest was all CSS defining font information. The tags were sane, the high-bit characters were replaced with their proper HTML equivalents (which is good), and the duplicated font tags mentioned above seemed to be taken care of. However, instead of using a plain <B> to bold words, it used a <B> tag with style information in it, making each tag 35 characters longer than needed. And while useless font tags were avoided, it did seem to randomly enjoy throwing font and style information around blank spaces, meaning that a blank line was not a simple <P> tag but 40 characters of other crap.


The end result is that both files render pretty much exactly the same, except that one is 10k and one is 4k.


I could have created the HTML file I wanted by sitting down and typing it out, but like I said, I’m lazy. I’ll bet, though, that my end result would not have been nearly as ugly, just far less fun to laugh at.


I just want a filter that, when I say “save as HTML,” does what I want. Give me the choice of attributes to save. Maybe I want to save font face information; maybe I don’t. Maybe I care about bolding; maybe not. Maybe I just want the table structure (oh gads, I don’t want to think what any sort of layout would have done). I want a blank line to be represented by a simple <P> tag and not 40 extra characters. I don’t want to spend time going through the conversion, fixing tags and getting rid of garbage characters. I want the computer to help me, not hinder me.
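For what it’s worth, the filter I’m asking for isn’t hard to sketch. Here’s a rough, hypothetical Python version — regex-based, so strictly a toy (a real one would use a proper HTML parser), and the sample input is my own imitation of Word’s output, not the real thing — that lets the caller pick which attributes survive:

```python
import re

def clean_export(html, keep_bold=True, keep_font=False):
    """Toy sketch of a 'choose what to keep' HTML export filter.

    keep_bold: keep <b> tags, but stripped of inline style bloat.
    keep_font: keep <font> tags; off by default, since I rarely want them.
    """
    if not keep_font:
        html = re.sub(r'</?font[^>]*>', '', html)
    if keep_bold:
        # Normalize <b style='...'> down to a bare <b>.
        html = re.sub(r'<b\s[^>]*>', '<b>', html)
    else:
        html = re.sub(r'</?b[^>]*>', '', html)
    # A blank line should be a simple <p>, not 40 characters of crap.
    html = re.sub(r'<p\s[^>]*>(\s|&nbsp;)*</p>', '<p></p>', html)
    return html

# Imitation Word output: styled bold tags plus a bloated empty paragraph.
word_style = ("<p class=MsoNormal><b style='mso-bidi-font-weight:normal'>"
              "M</b>y <b style='mso-bidi-font-weight:normal'>V</b>ery</p>"
              "<p class=MsoNormal style='margin-top:0in'>&nbsp;</p>")
print(clean_export(word_style))
```

Twenty lines, and it already does more of what I asked for than either exporter.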


Update: Completely forgot to link to this page, where James rants about similar things.