Extensible HyperText Markup Language

Media Types - How The Web Works

Media types (sometimes called MIME types or content types) are a classification system used to identify files commonly found on Web sites. Media types are crucial to the functioning of the Web, because when a client computer requests a Web page from the server, the client computer uses media types to tell the server what type of files the client computer will accept. Conversely, when the server sends files back to the client computer, the server uses media types to identify what type of files it is sending back. This information tells the client computer software (for example a Web browser) how to render or process the files that it receives.

Media types were developed by the Internet Engineering Task Force (IETF) to help the Internet work efficiently and consistently. Software vendors that produce Web servers and Web browsers have widely adopted media types, which are now the primary method for classifying files used on Web sites.

Common examples of media types include text/html (used to identify HTML files) and image/jpeg (used to identify JPEG image files), but there are also media types for multi-media content such as music or video. As discussed below, media types also play a key role in determining how Web browsers process XHTML.

How Media Types Are Passed Between Computers

When the URL http://www.w3.org/Consortium/Overview.html is entered into a Web browser, the browser splits it into three parts:

http://
the protocol used to communicate with the server
www.w3.org
the name of the Web site
/Consortium/Overview.html
the path to the file that is being requested
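The split described above can be sketched with Python's standard urllib.parse module. This is only an illustration of the idea, not how any particular browser does it; the variable names protocol, site_name and path are chosen here for readability. Note that urlsplit reports the scheme as "http", without the "://" separator.

```python
from urllib.parse import urlsplit

url = "http://www.w3.org/Consortium/Overview.html"
parts = urlsplit(url)

protocol = parts.scheme    # the protocol used to communicate with the server
site_name = parts.netloc   # the name of the Web site
path = parts.path          # the path to the file that is being requested

print(protocol)
print(site_name)
print(path)
```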

After looking up the IP address for the Web site name (www.w3.org), the browser sends a text message to the Web server located at that IP address. The message may look like this:

  1. GET /Consortium/Overview.html HTTP/1.1
  2. Host: www.w3.org
  3. User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.9) Gecko/20050711 Firefox/1.0.5
  4. Accept: text/xml, application/xhtml+xml, text/html, text/plain, image/png, image/gif, image/jpeg

On line 1, GET is the type of HTTP request (as opposed to POST or HEAD), /Consortium/Overview.html is the path of the file that needs to be fetched, and HTTP/1.1 is the version of HTTP used.

Line 2, the Host header, gives the name of the Web site being requested.

On line 3, the browser identification string identifies Firefox as the browser that is sending the message.

On line 4 is a list of media types that the browser is able to process (or "accept" from the server). This line is called the Accept header. In our example, the browser is saying it will accept pure XML files, XHTML files served as XML, HTML files, plain text files, PNG images, GIF images and JPEG images.
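To make the structure of the request concrete, here is a minimal Python sketch that takes the request text apart and extracts the media types from the Accept header. The helper names parse_request and accepted_types are hypothetical, the User-Agent line is left out for brevity, and the parsing is deliberately simplified (a real server also handles quality values, repeated headers and malformed input).

```python
request_text = (
    "GET /Consortium/Overview.html HTTP/1.1\r\n"
    "Host: www.w3.org\r\n"
    "Accept: text/xml, application/xhtml+xml, text/html, text/plain, "
    "image/png, image/gif, image/jpeg\r\n"
    "\r\n"
)


def parse_request(text):
    """Split a raw HTTP request into its request line and a dict of headers."""
    head = text.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    method, path, version = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, version, headers


def accepted_types(headers):
    """List the media types in the Accept header, dropping any parameters."""
    raw = headers.get("accept", "")
    return [item.split(";")[0].strip() for item in raw.split(",") if item.strip()]


method, path, version, headers = parse_request(request_text)
accepted = accepted_types(headers)

print(method, path, version)
print(headers["host"])
print(accepted)
```

Header names are lower-cased before they are stored because HTTP header names are case-insensitive.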

If the Web server finds the requested file, it will send the file back to the browser. But in front of the file, it will put additional information about the file, called a header. A blank line separates the header from the file being sent back. Thus the server response to the browser may look like this:

  1. HTTP/1.1 200 OK
  2. Date: Fri, 03 Feb 2006 03:34:33 GMT
  3. Server: Apache/1.3.33 (Unix) PHP/4.3.10
  4. Content-Length: 14582
  5. Content-Type: text/html; charset=utf-8
  6.
  7. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 ...
  8. <html xmlns="http://www.w3.org/1999/xhtml" ...
  9. <head>
  10. <title>About W3C</title>
  11. </head>
  12. <body>
  13. ...
  14. </body>
  15. </html>

On line 1, the version of HTTP being used by the server is mentioned (1.1), together with a status code. In this case, the status code 200 indicates that the Web server was successful in finding the file that the browser requested.

Line 2 gives the date and time of the response.

Line 3 gives information about the type of Web server.

Line 4 gives the size of the file that the server is sending back to the browser.

On line 5, the server identifies the media type for the file it is sending: text/html. The browser uses this information to determine how to render or process the file.

Line 6 is the blank line that separates the header from the file being sent.

Lines 7 onward contain the contents of the file itself.
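The same walk-through can be sketched in Python: split the response at the blank line, read the status code, and pull the media type out of the Content-Type header. The helper names parse_response and media_type are hypothetical, and the body is a shortened stand-in for the real page, so its Content-Length is computed rather than copied from the example above.

```python
# A shortened stand-in for the page; not the real W3C markup.
body = "<html><head><title>About W3C</title></head><body>...</body></html>"

response_text = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    f"Content-Length: {len(body.encode('utf-8'))}\r\n"
    "\r\n"  # the blank line that separates the header from the file
    + body
)


def parse_response(text):
    """Split a raw HTTP response into status code, headers and payload."""
    head, _, payload = text.partition("\r\n\r\n")
    lines = head.split("\r\n")
    _version, status_code, _reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(status_code), headers, payload


def media_type(headers):
    """Return (media type, charset or None) from the Content-Type header."""
    value = headers.get("content-type", "")
    main, *params = [part.strip() for part in value.split(";")]
    charset = None
    for param in params:
        key, _, val = param.partition("=")
        if key.strip().lower() == "charset":
            charset = val.strip().strip('"')
    return main.lower(), charset


status, headers, payload = parse_response(response_text)
mtype, charset = media_type(headers)

print(status)
print(mtype, charset)
print(payload == body)
```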

So, to recap, when a Web browser requests a file from a Web server, the media types in the Accept header tell the server which file types the browser will accept. When the server sends back a file, the media type in the Content-Type header tells the browser what type of file it is receiving.

Below is a partial list of common media types:

text/html
Used to identify HTML files.
text/css
Used to identify Cascading Style Sheets (CSS) files.
text/plain
Used to identify plain text files.
text/xml
Used to identify XML files.
image/gif
Used to identify GIF image files.
image/jpeg
Used to identify JPEG image files.
image/png
Used to identify PNG image files.
application/pdf
Used to identify Portable Document Format (PDF) files.
application/octet-stream
Used as a generic way to identify binary files; often used for transferring EXE or ZIP files.
audio/mpeg
Used to identify MPEG audio files, including MP3.
application/xhtml+xml
Used to identify XHTML files served as XML.

Media Types And XHTML

If a Web server identifies the media type for an XHTML Web page as text/html, the browser to which it is sending the page will parse the Web page as though it were HTML.

If the media type is identified as application/xhtml+xml, the browser will parse the Web page as XML. The browser will then enforce the rules of XML: if the markup contains errors (such as missing closing tags, incorrectly nested elements, or attributes that are not quoted), the browser will not render the Web page.
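The difference between the two parsing modes can be illustrated with Python's standard library, using its XML parser and its HTML parser as stand-ins for a browser. This is only an analogy: browsers use their own parsers, and the helper names parses_as_xml and tags_seen_as_html are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

well_formed = "<p>Hello <em>world</em></p>"
malformed = "<p>Hello <em>world</p>"  # the <em> element is never closed


def parses_as_xml(markup):
    """Return True if the markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records every start tag the tolerant HTML parser encounters."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def tags_seen_as_html(markup):
    """Feed markup to the HTML parser and return the start tags it saw."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


print(parses_as_xml(well_formed))
print(parses_as_xml(malformed))
print(tags_seen_as_html(malformed))
```

The XML parser rejects the malformed fragment outright, while the HTML parser reads it to the end without raising an error.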

Since the media type application/xhtml+xml exists, why doesn't everyone who writes XHTML Web pages serve the pages as XML? The answer is Internet Explorer. Although most other modern browsers such as Firefox, Opera and Safari support the application/xhtml+xml media type, Internet Explorer still does not. The IE development team has indicated that it plans to support the application/xhtml+xml media type, but until IE comes up to speed, most developers continue to serve XHTML 1.0 as HTML. This can be done because, when written correctly, XHTML 1.0 is backward-compatible with HTML 4.01.

In addition, of course, XHTML incorporates all the forward-looking strengths of XML. As the W3C's Working Group wrote in the XHTML 1.0 spec:

The XHTML family is the next step in the evolution of the Internet. By migrating to XHTML today, content developers can enter the XML world with all of its attendant benefits, while still remaining confident in their content's backward and future compatibility.

Frequently Asked Questions

Why not use file extensions (like .htm and .gif) instead of media types to determine the type of file?

Many Web pages are created via server-side scripts (PHP, ASP, ASP.NET, etc.). File extensions for these scripts are different from those used for static files. These server-side scripts can return different types of files - not only Web pages. Also, many URLs point to Web pages that don't have file extensions.
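As a rough illustration, the Python sketch below models a server-side script whose URL path never changes while the type of its output does. The function handle_request, the path /report.php and the format parameter are all hypothetical; the point is only that the Content-Type header, not the file extension, tells the browser what it is receiving.

```python
def handle_request(path, query):
    """Hypothetical server-side script: same path, different output types."""
    fmt = query.get("format", "html")
    if fmt == "text":
        headers = {"Content-Type": "text/plain; charset=utf-8"}
        body = "Quarterly report"
    elif fmt == "xml":
        headers = {"Content-Type": "text/xml; charset=utf-8"}
        body = "<report><title>Quarterly report</title></report>"
    else:
        headers = {"Content-Type": "text/html; charset=utf-8"}
        body = "<html><body><h1>Quarterly report</h1></body></html>"
    return headers, body


# The extension in the path (.php) says nothing about what comes back.
for fmt in ("html", "text", "xml"):
    headers, body = handle_request("/report.php", {"format": fmt})
    print(fmt, headers["Content-Type"])
```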

Why do Web browsers enforce XML rules but not HTML rules?

Early Web browsers did enforce the rules of HTML and stopped rendering Web pages if HTML markup contained errors. However, in the late 1990s, in a race to gain market share, Web browsers started to make adjustments for invalid markup. This tolerance of poor markup came at a price for browser vendors and developers:

  • Web browsers became more complex to build and maintain.
  • They required more computing resources since more and more code was devoted to fixing markup mistakes.
  • Web browser support for invalid markup encouraged sloppy authoring practices.

Tolerance of invalid markup also stifled innovation:

  • Since Web browsers were more complex to build, there were fewer competitors.
  • Since browsers were more complex to maintain, there were fewer updates to the software, and fewer development resources went into new features.
  • Since Web browsers required more computing resources, they could not be incorporated into low-resource devices such as early cell phones and other embedded systems.
  • Sloppy markup practices also undermined any foundation for extending HTML to support more complex sub-languages used to describe mathematics, music, etc.

To kick-start innovation, HTML was reformulated into XHTML.