Freelance consultant for digital heritage

Cleaning up Word HTML

Today, whilst building a new data downloads section for the Archaeology at Heathrow T5 website, I had to convert a load of Word documents full of tables and subheadings into beautiful XHTML Strict for pages in a WordPress environment.

Normally, I’d open the files in Word 2004 (on a Mac), save them as HTML, open each file in Dreamweaver 8, run the “Clean Up Word HTML” command, and then perhaps do a bit of cleaning by hand (e.g. removing the inline CSS).
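For what it’s worth, that last hand-cleaning step could probably be scripted too. Here’s a minimal sketch in Python, assuming all you want is to drop `<style>` blocks and any `style` or `class` attributes. The function name `strip_inline_css` is made up for illustration, and because it relies on regular expressions rather than a real HTML parser, it will trip over unusual markup (an attribute value containing `>`, for example).

```python
import re

# Matches a whole <style>...</style> block, including its contents.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style>", re.IGNORECASE | re.DOTALL)

# Matches one style="..." or class="..." attribute, with its leading whitespace.
CSS_ATTR = re.compile(r"""\s+(?:style|class)\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE)

# Matches a single tag, so attributes are only removed inside tags, not in body text.
TAG = re.compile(r"<[^>]+>")


def strip_inline_css(html: str) -> str:
    """Remove <style> blocks and style/class attributes from an HTML string."""
    html = STYLE_BLOCK.sub("", html)
    return TAG.sub(lambda match: CSS_ATTR.sub("", match.group(0)), html)


if __name__ == "__main__":
    sample = '<p class="MsoNormal" style="margin:0cm">Trench 12 finds</p>'
    print(strip_inline_css(sample))
```

Anything more complicated than this is probably a job for a proper HTML tidying tool rather than a handful of regular expressions.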

But faced with 8 fairly complex documents, I decided that there must be a more efficient way of doing this. A quick Google (“clean word html osx”) revealed a remarkably simple process.

I’ll repeat it here, just for my own notes.

Open the Word documents in TextEdit (I’m a Mac user, remember!). In TextEdit go to Preferences, then go to the “Opening and Saving” tab. In the HTML saving options select “XHTML 1.0 Strict” and “No CSS”. You can also tick “Ignore rich text commands in HTML files” if you like.

Then save each Word document as HTML from TextEdit, and you get beautifully clean code to work with.
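For a bigger batch, the same conversion can probably be scripted. The sketch below is my own assumption rather than part of the tip above: it uses Python to call `textutil`, the command-line document converter that ships with OS X. As far as I know, `textutil` does not read TextEdit’s preferences, so the “XHTML 1.0 Strict” and “No CSS” settings will not apply and the output may still need its styles stripped afterwards. The helper names (`html_target`, `build_command`, `convert_all`) and the folder names are made up for illustration.

```python
from __future__ import annotations

import shutil
import subprocess
import sys
from pathlib import Path


def html_target(source: Path, out_dir: Path) -> Path:
    """Map e.g. docs/report.doc to out_dir/report.html."""
    return out_dir / (source.stem + ".html")


def build_command(source: Path, target: Path) -> list[str]:
    """Build the textutil call that converts one document to HTML."""
    return ["textutil", "-convert", "html", "-output", str(target), str(source)]


def convert_all(in_dir: Path, out_dir: Path) -> list[Path]:
    """Convert every .doc file in in_dir; return the HTML files written."""
    if shutil.which("textutil") is None:
        raise RuntimeError("textutil not found (it is only available on OS X)")
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(in_dir.glob("*.doc")):
        target = html_target(source, out_dir)
        # check=True stops the batch if textutil reports a failure.
        subprocess.run(build_command(source, target), check=True)
        written.append(target)
    return written


if __name__ == "__main__":
    # Hypothetical usage: python convert_docs.py word-docs html
    for path in convert_all(Path(sys.argv[1]), Path(sys.argv[2])):
        print(path)
```

For eight documents the TextEdit route is quick enough by hand; something like this only starts to pay off when there are dozens of files to get through.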

TextEdit’s HTML export options


2 responses to “Cleaning up Word HTML”

  1. Now that is just sublimely straightforward. Thanks for the tip!

  2. Great tip, and so much easier than the old process. I’ll try it out.