Friday, October 15, 2010

HTML to XHTML/XML with tidy

As you probably know, a browser falls back to quirks mode (see www.quirksmode.org)
if your HTML file does not declare what type of document it is.
This makes your life very difficult, because each browser may then render your pages in its own way.
The best thing to do is to declare the document type at the top of every page.
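
To find out which pages still lack such a declaration, a small sketch like the one below may help. It assumes the pages sit in a single directory and end in .html; the function name missing_doctype is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print every .html file in a directory
# that contains no DOCTYPE declaration at all.
missing_doctype() {
    dir=${1:-.}
    for f in "$dir"/*.html; do
        [ -f "$f" ] || continue            # nothing matched, or not a regular file
        # -q: no output, just an exit status; -i: ignore case
        if ! grep -qi '<!DOCTYPE' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling missing_doctype . lists the candidates in the current directory, one per line.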

If there is only one page to change, it can of course be done manually, but if you have
many pages it is always better to use a programmatic approach.
So tidy to the rescue!
Tidy is a small command-line program that runs on Linux and can help with this. More info can be found at tidy.sourceforge.net

So let's get practical:
Problem - you have a large number of files you want to convert from plain HTML to XHTML or XML.
Solution - open a terminal (xterm, gterm or anything you want - it works on the plain console as well)
and type the following:

for i in ./*.html; do [ -f "$i" ] && tidy -asxhtml < "$i" > "$i.xhtml"; done

Q: What just happened?
A1: The command looked for all the *.html files (all files ending with the html extension) in the current directory and fed each match to tidy.
Tidy took each file as input, converted it to XHTML (option -asxhtml) and wrote the result to disk as original_name.html.xhtml; the original file is left untouched.
A2: In each file written by tidy you can see the new tags and attributes that XHTML requires.
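
If the doubled extension is unwelcome, the same loop can be wrapped in a small function. This is only a sketch: the name convert_dir and the page.xhtml naming scheme are assumptions, not something tidy prescribes. It also checks tidy's exit status, which is 1 when there were warnings and 2 when there were errors.

```shell
#!/bin/sh
# Sketch: convert every .html file in a directory to XHTML with tidy.
# convert_dir and the page.html -> page.xhtml naming are assumptions.
convert_dir() {
    dir=${1:-.}
    for src in "$dir"/*.html; do
        [ -f "$src" ] || continue          # nothing matched, or not a regular file
        dest="${src%.html}.xhtml"          # strip .html, append .xhtml
        tidy -asxhtml -quiet < "$src" > "$dest"
        status=$?
        # exit status 2 means tidy found errors it could not repair
        if [ "$status" -ge 2 ]; then
            echo "tidy reported errors for $src" >&2
        fi
    done
}
```

Run convert_dir with no argument for the current directory, or pass a path such as convert_dir ./site. Quoting "$src" and "$dest" keeps file names with spaces intact.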

For example, the original file contained:
<html>
        <head>
                <title> me - home page  </title>
        </head>
        <body>
                <h1> my home page </h1>
                <p> Lorem ipsum dolor sit amet, consectetur adipiscing elit.
                        Morbi vitae lorem justo. Cum sociis natoque penatibus et
                        magnis dis parturient montes, nascetur ridiculus mus.
                        Sed sed consectetur massa. Morbi ac erat sit amet eros
                        malesuada pellentesque tempus ultricies dui. Nam laoreet nibh
                        sit amet massa pellentesque fermentum. Aenean suscipit laoreet
                        lorem sed dignissim. Integer dapibus pulvinar lorem ac placerat.
                        Mauris suscipit, risus commodo tempor feugiat, mi metus facilisis
                        quam, auctor sagittis magna mi vitae lacus. Suspendisse purus velit,
                        ultrices at iaculis dictum, rutrum nec tortor.
                        Nullam a libero ut erat semper mollis in et lectus.
                        Sed suscipit lorem eu mi tristique volutpat. Proin lobortis
                        vehicula nunc vel condimentum. Aliquam mattis, lacus eu adipiscing
                        posuere, dui enim sodales felis, et tincidunt nulla urna lacinia velit.
                </p>
        </body>
</html>


Now you have

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 25 March 2009), see www.w3.org" />
<title>me - home page</title>
</head>
<body>
<h1>my home page</h1>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi
vitae lorem justo. Cum sociis natoque penatibus et magnis dis
parturient montes, nascetur ridiculus mus. Sed sed consectetur
massa. Morbi ac erat sit amet eros malesuada pellentesque tempus
ultricies dui. Nam laoreet nibh sit amet massa pellentesque
fermentum. Aenean suscipit laoreet lorem sed dignissim. Integer
dapibus pulvinar lorem ac placerat. Mauris suscipit, risus commodo
tempor feugiat, mi metus facilisis quam, auctor sagittis magna mi
vitae lacus. Suspendisse purus velit, ultrices at iaculis dictum,
rutrum nec tortor. Nullam a libero ut erat semper mollis in et
lectus. Sed suscipit lorem eu mi tristique volutpat. Proin lobortis
vehicula nunc vel condimentum. Aliquam mattis, lacus eu adipiscing
posuere, dui enim sodales felis, et tincidunt nulla urna lacinia
velit.</p>
</body>
</html>


That's it - now you have XHTML!
You can change the above command to make tidy do other useful things, such as indenting the output (-indent) or replacing presentational tags such as <font> and <center> with CSS (-clean).
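
As a rough sketch of such a variation, the function below converts a single file with a few extra tidy options switched on. The function name to_xhtml_pretty and the page.xhtml output name are assumptions made for illustration; adjust the option list to taste.

```shell
#!/bin/sh
# Hypothetical helper: convert one HTML file to indented, cleaned-up XHTML.
# Writes page.xhtml next to page.html (the naming is an assumption).
to_xhtml_pretty() {
    src=$1
    dest="${src%.html}.xhtml"
    # -asxhtml: emit XHTML        -indent: indent nested elements
    # -clean: replace presentational tags with CSS
    # -quiet: suppress non-essential messages
    tidy -asxhtml -indent -clean -quiet < "$src" > "$dest"
}
```

For example, to_xhtml_pretty index.html would write index.xhtml and leave index.html as it was.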