A first look at XHTML

Unless you are completely new to the Internet, the chances are you have at least heard of XHTML. It is a lot less likely that you will know very much about it. A degree of nervousness is understandable — after all, if you have just learned to use HTML, the idea of learning a new markup language to code web pages is daunting.

Fortunately, there is really nothing to worry about. If you have got to grips with HTML, there is only a little you need to learn if you want to move on to coding in XHTML. But first of all, what is XHTML?

A word about XML

HTML, or the HyperText Markup Language, is derived from SGML, the Standard Generalized Markup Language. SGML is a meta-language — that is, it is a language used to define other languages, one of which is HTML. XML is also a meta-language, used to define other languages — in fact, it is a subset of SGML — and it is the basis of XHTML. Why the switch? The clue lies in the name of XML: the eXtensible Markup Language — languages built using XML can be extended, i.e. have elements added to them. It would be feasible for a company to define elements such as <invoice> or <due> in an XML-based application. On the other hand, extending HTML would mean rewriting the entire HTML document type definition (DTD).

The point of having an extensible language is that, according to certain Net gurus, over the next few years the bulk of Internet access may be via technology other than the desktop computer: palmtops, televisions, telephones, even fridges (seriously: one manufacturer already has an Internet-ready fridge). [I don’t know whether Electrolux still markets such an item, but you can see the old page using the Wayback Machine.] With an extensible language based on XML, porting existing applications to these new environments and creating new ones will be much simplified. That's the theory. Whether this will actually prove useful to most of us in the real world remains to be seen…

What is XHTML 1.0?

In January 2000, the W3C recommendation changed from HTML 4.01 to XHTML 1.0 — so XHTML is now the ‘standard’, in so far as there is one, for writing web pages. HTML 4 was based upon SGML; XHTML 1.0 is HTML 4 reformulated as an XML application.

Clearly, therefore, there are not going to be huge differences between HTML 4 and XHTML 1, since they are really the same language. There are differences, of course, but nothing too drastic as yet. Future versions of XHTML will introduce new concepts such as modularity (although, again, how much use that will be in the real world remains to be seen), but XHTML 1 documents remain very much like HTML documents.

The first question on everyone’s mind, though, is naturally: What are the differences from HTML?

XHTML is case-sensitive

In HTML it was perfectly legal to use uppercase or lowercase letters or a mixture of both for element names, but in XHTML only lowercase is acceptable. This applies to both the element names and their attributes. Where this was fine before:

<IMG SRC=images/logo.GIF width="200" HEIGHT=85
ALIGN=Left>

we must now write:

<img src="images/logo.GIF" width="200" height="85"
align="left" />

Note that case-sensitivity also applies to the values of many of the attributes: "Left", "Right" and "Center" are not legal values for align in XHTML; they must be written as "left", "right" and "center". Of course, where the value is determined outside of XHTML (as in, for example, the URI which is the value for the src attribute), the presentation of the value is not subject to XHTML’s rules. As well as URIs, be careful when it comes to font family names. Note, too, that all attribute values must be enclosed within quotes; this is no longer optional.

One area to particularly note is that event attributes also must be written entirely in lowercase — thus not:

<body onLoad="this.focus()">

but:

<body onload="this.focus()">
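
The case rule is inherited from XML, where names are case-sensitive. As a rough illustration (not part of the XHTML specification, and using made-up test strings), the sketch below feeds two snippets to the XML parser in Python's standard library. It only shows how an XML parser treats case; it does not validate against the XHTML DTD.

```python
import xml.etree.ElementTree as ET

# An XML parser treats <BODY> and </body> as two different names,
# so a mixed-case pair of tags is rejected as mismatched.
try:
    ET.fromstring('<BODY></body>')
    mixed_case_ok = True
except ET.ParseError:
    mixed_case_ok = False

# Attribute names keep their case too: "onLoad" is not "onload".
body = ET.fromstring('<body onLoad="this.focus()"></body>')
has_lowercase_onload = "onload" in body.attrib

print(mixed_case_ok)          # False
print(has_lowercase_onload)   # False
```

Since the XHTML DTD only defines the lowercase names, an attribute written as onLoad is simply unknown to it.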

Close all elements

Anyone who has been writing HTML is likely to find this more irksome than anything else because some of the most commonly used elements in HTML, before the arrival of CSS anyway, did not have to be closed: <p>, <li>, for example. In XHTML, all elements must be closed.

This rule even applies to empty elements such as <br>, <hr> or <img>. This is done by adding a slash at the end of the tag: <br />. Because a form such as <br/> would confuse older browsers, a space is inserted before the terminal slash; this means that browsers which don’t understand XHTML syntax will identify the slash as an unknown attribute and ignore it. Well, that’s the theory, anyway.
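
To see why the closing slash matters to XML software, here is a minimal sketch using Python's standard XML parser. The helper name is_well_formed and the two sample strings are assumptions made for this illustration; the check covers XML well-formedness only, not validity against the XHTML DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

html_style = '<p>First line<br>second line</p>'      # <br> is never closed
xhtml_style = '<p>First line<br />second line</p>'   # closed with a slash

print(is_well_formed(html_style))    # False
print(is_well_formed(xhtml_style))   # True
```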

No minimised attributes

The final new point to note about element tags is that minimised attributes are no longer allowed. What is a minimised attribute? Here's an example. When a horizontal rule is inserted in a page, usually it is given a 3D look like so:

[Image: 3D-effect horizontal rule]

The code for this in HTML is:

<hr size="4">

If we want a solid line, we insert the noshade attribute:

<hr size="4" noshade>

This, which is the correct HTML4 syntax, renders like so:

[Image: horizontal rule with noshade attribute applied]

The use of noshade without any value is an example of what we mean when we say it is “minimised” — it is as if we were taking something like noshade="yes" or noshade="1" and writing it simply as noshade. However, this is not allowed in XHTML: every attribute must have a value. You might expect the syntax, then, to be one of those two options, but it isn’t. Although it seems silly and redundant, the correct syntax is now:

<hr size="4" noshade="noshade" />

Note that the presentational attributes of hr (size and noshade) are only permitted in Transitional XHTML 1.0, not Strict. There are other attributes which require a similar approach, for example:

Element       Attribute   XHTML syntax
ul, dl, ol    compact     compact="compact"
option        selected    selected="selected"
input         checked     checked="checked"
td            nowrap      nowrap="nowrap"

Similarly, attributes which could be optionally minimised, such as border for border="1" in <table> tags, must be written in full.
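
If you have a lot of old markup to convert, the expansion can be automated. The following is a deliberately naive sketch, assuming only the five attributes mentioned above; the function name expand_minimised is invented for this example, and a regular expression like this can be fooled (for instance by attribute values that happen to contain one of these words), so treat it as an illustration of the rule and not as a conversion tool.

```python
import re

# Attributes mentioned in this article; the HTML 4 DTD defines others too.
BOOLEAN_ATTRIBUTES = ("compact", "selected", "checked", "nowrap", "noshade")

# Match one of the names when it is followed by whitespace, "/" or ">"
# and is not already followed by "=".
PATTERN = re.compile(
    r'\s(' + '|'.join(BOOLEAN_ATTRIBUTES) + r')(?=[\s/>])(?!\s*=)'
)

def expand_minimised(tag):
    """Rewrite e.g. noshade as noshade="noshade" inside a single tag."""
    return PATTERN.sub(lambda m: ' {0}="{0}"'.format(m.group(1)), tag)

print(expand_minimised('<hr size="4" noshade />'))
# <hr size="4" noshade="noshade" />
```

Attributes which already carry a value are left alone.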

Nest!

This isn’t actually new: proper nesting of elements has been recommended for some time for HTML. Having said that, HTML 3.2 was fairly lax about this and most browsers would render something like this without problems:

<p><b>Important text</p></b>

Things should have tightened up with the move to HTML 4, but many designers stuck with old, bad habits — particularly those who rarely looked beyond the very forgiving Internet Explorer. Those who started to use CSS or who used the more picky Netscape browsers, though, quickly found that proper nesting really did matter. With XHTML, it is absolutely mandatory to properly nest tags.
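
The nesting rule boils down to "last opened, first closed", which can be checked with a simple stack. Below is a minimal sketch built on Python's standard html.parser module; the class name NestingChecker and the helper check_nesting are assumptions made for this example, and the checker knows nothing about which elements XHTML actually allows inside which.

```python
from html.parser import HTMLParser

class NestingChecker(HTMLParser):
    """Track open tags on a stack and record end tags that arrive out of order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # tags opened but not yet closed
        self.errors = []      # end tags that did not match the innermost open tag

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append(tag)

def check_nesting(markup):
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.errors, checker.open_tags

print(check_nesting('<p><b>Important text</p></b>'))   # (['p'], ['p'])
print(check_nesting('<p><b>Important text</b></p>'))   # ([], [])
```

An empty result on both sides means every tag was closed in the reverse of the order in which it was opened.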

Structure of an XHTML document

Not surprisingly, an XHTML document is structured pretty much like an HTML one. The major divisions are the <head> and the <body>, which are enclosed by the <html> tags. At the start of the document is a document type declaration.

One different feature of XHTML documents is that they should begin with a statement about the version of XML upon which the language used in the document is based:

<?xml version="1.0" encoding="utf-8"?>

The problem is that few browsers understand this declaration, and even some recent browsers will refuse to render, or will render unpredictably, a page which opens with it. For that reason the declaration is recommended but not required, so it can be omitted and the document will still validate as XHTML.

If it is omitted, it is good practice to insert a <meta> tag in the document head with the appropriate information about the content of the document:

<meta http-equiv="Content-Type"
    content="text/html; charset=iso-8859-1" />

Next after the XML declaration comes the document type declaration, which is required. This is the transitional (or “loose”) declaration:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Note that although every other tag in the document must be lowercase, <!DOCTYPE> must be uppercase! The rationale for this is that the document type declaration is not considered part of the document; it is a sort of prologue to it which states what type of document will follow. Here are the strict and the frameset forms:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">

The main difference between the strict and transitional forms is that many of the presentational attributes allowed in the transitional DTD are not permitted in the strict. Examples are noted later in relation to the <img> element. Note that you should not use a <!DOCTYPE> like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "DTD/xhtml1-strict.dtd">

This is something you see from time to time in the source of web pages, where it has been cut-and-pasted straight from the W3C site. The problem is that the URI included in the <!DOCTYPE> is relative; that’s fine if the page is on the W3C’s servers, but meaningless anywhere else. Unless you actually have the DTD on your own server, which is unlikely, you need to use an absolute URI (as in the examples above).

The next part of the structure is, of course, the <html> element which encloses the rest of the document — i.e. it is the root element of the document. This should contain the appropriate XML namespace for the document:

<html xmlns="http://www.w3.org/1999/xhtml">

An XML namespace is a collection of names used in XML documents as elements and attributes, and the namespace used is identified by a URI reference, as above. The URI is only an identifier: it marks the elements and attributes in the document as belonging to XHTML, while the list of tags and attributes which can actually be used is defined by the DTD.
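
One way to see what the xmlns attribute does is to look at how an XML parser reports element names once it is present. In this small sketch (Python's standard library again, with a made-up sample document), every element name comes back qualified by the namespace URI, which is how XML software can tell XHTML's <title> apart from a <title> in some other XML language.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

root = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Document Title</title></head>'
    '<body><p>Content</p></body>'
    '</html>'
)

# ElementTree writes a qualified name as {namespace-uri}local-name.
print(root.tag)   # {http://www.w3.org/1999/xhtml}html

# Child elements inherit the default namespace from <html>.
child_tags = [child.tag for child in root]
print(child_tags == ["{%s}head" % XHTML_NS, "{%s}body" % XHTML_NS])   # True
```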

A minimal XHTML document will look very familiar if you are used to HTML 4:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Document Title</title>
    </head>
    <body>
        <p>This is where the content goes...</p>
    </body>
</html>
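
As a rough self-check, the presence of the required parts can be tested mechanically. The sketch below is an assumption-laden illustration and no substitute for the W3C validator: the function name missing_elements is invented here, and the code checks only that the document is well-formed XML and that the expected elements exist, not that it is valid against the DTD it declares.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/1999/xhtml}"
REQUIRED = ("head", "head/title", "body")

DOCUMENT = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Document Title</title>
    </head>
    <body>
        <p>This is where the content goes...</p>
    </body>
</html>"""

def missing_elements(source):
    """List the required parts that an XHTML document lacks."""
    root = ET.fromstring(source)   # raises ParseError if not well-formed
    missing = []
    if "<!DOCTYPE" not in source:
        missing.append("!DOCTYPE")
    if root.tag != NS + "html":
        missing.append("html")
    for path in REQUIRED:
        qualified = "/".join(NS + name for name in path.split("/"))
        if root.find(qualified) is None:
            missing.append(path.split("/")[-1])
    return missing

print(missing_elements(DOCUMENT))   # []
```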

The differences are quite minor, aren’t they? Remember that the following elements must be present in every XHTML document: <!DOCTYPE>, <html>, <head>, <title>, and (framesets excepted) <body>. As with HTML, the <head> element contains CSS information, either embedded or linked/imported, meta-information, possibly <link> elements, and scripts. Remember that <link> and <meta> elements are empty and require the terminal slash to close the element.

One point to note in particular is that inline page elements (such as text or <br />) must be enclosed in an appropriate block-level element; the Strict DTD enforces this, although the Transitional DTD is more lenient. This markup is not correct:

<body>

<p>This paragraph is OK, but the line 
breaks following it are not.</p>
<br />
<br />
<blockquote cite="http://www.foo.bar/">
  This text needs to be enclosed in an appropriate 
  element, such as a paragraph.
</blockquote>

This text also needs to be in an appropriate element.

</body>
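
A quick way to spot this kind of problem is to look at what sits directly inside <body>. The sketch below is only an approximation, based on assumptions of its own: the function name naked_inline_content is invented, the set of inline elements is a short sample and not the full list from the DTD, and only the direct children of <body> are examined (so the text inside the blockquote above would not be caught). The Strict DTD is the one which actually forbids such content.

```python
import xml.etree.ElementTree as ET

# A small, incomplete sample of inline elements, for illustration only.
INLINE = {"a", "b", "br", "em", "i", "img", "span", "strong"}

def naked_inline_content(body_markup):
    """Report text and inline elements placed directly inside <body>."""
    body = ET.fromstring(body_markup)
    problems = []
    if body.text and body.text.strip():
        problems.append("text: " + body.text.strip())
    for child in body:
        if child.tag in INLINE:
            problems.append("element: " + child.tag)
        # child.tail is the text between this element and the next one.
        if child.tail and child.tail.strip():
            problems.append("text: " + child.tail.strip())
    return problems

sample = '<body><p>This is OK.</p><br /><br />Loose text</body>'
print(naked_inline_content(sample))
# ['element: br', 'element: br', 'text: Loose text']
```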

We will now look at some specific elements to note differences from HTML.

Particular points to note

<a>

The anchor tag. When used to insert an anchor point within a document, the HTML syntax was:

<a name="foo"></a>

The use of the name attribute, however, is deprecated — ultimately, the W3C wants it to disappear and be replaced by id, which should replace name in the example above. However, because older browsers may have problems with id, the current recommended syntax is:

<a id="foo" name="foo"></a>

<br>

The line break: in XHTML this of course becomes <br />. One important point to note is that this must now be enclosed within an appropriate block-level element (<p> or <div>, for example); “naked” inline elements are no longer allowed.

Forms

Take note of what we have said about the minimising of attributes no longer being allowed: some of the elements used within forms are affected. Additionally, if your document is to validate as XHTML, the form element must have an action attribute. If there is no appropriate value for this, it should be written like so:

<form action=" ">
    <input type="button" value="Close"
        onclick="javascript:void window.close()" />
</form>

<img>

The image element. Note that it is an example of an empty element, and therefore needs to be closed with a terminal slash, as with br. Also note that some of its attributes, such as hspace, vspace, border and align are not allowed in the strict XHTML specification. The other important point to make is yet another thing which could just as well be said in relation to HTML 4: every <img> element must have an appropriate alt attribute.

Lists

The important point to repeat here is that all elements must be closed. Even more than <p>, it was the norm to write list items using only the opening <li> tag — and, in fact, there are many HTML editors which insert li elements in this format. The closing tag is mandatory.

<p>

The paragraph element — another tag frequently not closed in HTML but which must be closed in XHTML. It is also worth mentioning that one use of <p> should be dropped:

<p>Some content here...</p>
	<p>
	<p>
	<p>
	<p>
	<p>...and some more here.

This was a bad thing even with HTML, and has no place in XHTML documents, even if you close all the tags! Remember that HTML is supposed to describe the logical structure of a document. The four <p> tags in the middle are not paragraphs; they are simply there to place a gap between the two pieces of text. This is what the line break, <br />, is for (although in most cases it would be better to use CSS to control spacing). Use <p>…</p> as it should be used, to mark a particular part of the structure of the page: the paragraph.

Summary of XHTML’s features

  1. XHTML 1.0 is very similar to HTML 4 and should work well in most browsers.
  2. All tags, attributes and values defined by XHTML are written in lowercase.
  3. All elements, including empty elements, must be closed.
  4. Attributes may not be minimised.
  5. Elements must be properly nested.
  6. All inline page elements must be contained within an appropriate block element.

What does XHTML have to offer?

XHTML is a clear successor to HTML 4, and continues the W3C’s aim to clean up document code through separating the logical structure from presentation, by deprecating proprietary and problematic elements. With XHTML, authors have to make a document type declaration and, it is hoped, keep to the rules of the DTD they declare. Greater standards compliance on the part of designers may encourage better standards compliance by the browser makers, and that would benefit everyone, designers and users both. XHTML also goes beyond HTML 4 in laying the groundwork for extensibility, to provide greater ease in transferring web applications to non-desktop platforms.

The question is — should we be using it?

The first thing to emphasise is that XHTML is no more difficult than HTML 4. In particular, anyone who has been diligent about closing tags, proper nesting of elements and proper use of the alt attribute won’t have too much trouble using XHTML.

Whether or not you use XHTML depends, in the end, on what you are doing. If you are maintaining a large site which has been in place for some time and contains pages written in HTML 3.2 and HTML 4, there’s no real urgency in shifting the site to XHTML right now — or perhaps ever. For the foreseeable future, browsers are going to be able to handle old versions of HTML as well as, hopefully, render properly standards-compliant code.

On the other hand, if you are just starting work on a new site, there is no reason not to build it with XHTML. The coding is as easy as HTML 4 — which is the only alternative you should consider — and it means if you should need to transfer the documents at some point in the future to another XML-based language it will be that much easier, especially if proper use is made of CSS.

Certainly, there is no good reason for anyone just starting out in web authoring to learn anything other than XHTML as their primary language for building web pages: it works in old browsers and new, and it is the standard — and has been the standard since January 26th, 2000. It isn’t something brand new which has just been sprung on us.
