Media Types - How The Web Works
Media types (sometimes called MIME types or content types) are a classification system used to identify files commonly found on Web sites. Media types are crucial to the functioning of the Web, because when a client computer requests a Web page from the server, the client computer uses media types to tell the server what type of files the client computer will accept. Conversely, when the server sends files back to the client computer, the server uses media types to identify what type of files it is sending back. This information tells the client computer software (for example a Web browser) how to render or process the files that it receives.
Media types were developed by the Internet Engineering Task Force (IETF) to help the Internet work efficiently and consistently. Software vendors that produce Web servers and Web browsers have widely adopted media types, which are now the primary method for classifying files used on Web sites.
Common examples of media types include text/html (used to identify HTML files) and image/jpeg (used to identify JPEG image files), but there are also media types for multimedia content such as music or video. As discussed below, media types also play a key role in determining how Web browsers process XHTML.
How Media Types Are Passed Between Computers
When the URL http://www.w3.org/Consortium/Overview.html is entered into a Web browser, the browser splits it into 3 parts:
http://- the protocol used to communicate with the server
www.w3.org- the name of the Web site
/Consortium/Overview.html- the path to the file that is being requested
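The three-way split described above can be reproduced with Python's standard urllib.parse module. This is only a sketch of the idea, not a description of how any particular browser implements it; the variable names are chosen for illustration.

```python
from urllib.parse import urlsplit

url = "http://www.w3.org/Consortium/Overview.html"
parts = urlsplit(url)

protocol = parts.scheme     # "http": the protocol used to talk to the server
site_name = parts.hostname  # "www.w3.org": the name of the Web site
path = parts.path           # "/Consortium/Overview.html": the requested file

print(protocol, site_name, path)
```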
After looking up the IP address for the Web site name (www.w3.org), the browser sends a text message to the Web server located at that IP address. The message may look like this:
GET /Consortium/Overview.html HTTP/1.1
Host: www.w3.org
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.9) Gecko/20050711 Firefox/1.0.5
Accept: text/xml, application/xhtml+xml, text/html, text/plain, image/png, image/gif, image/jpeg
On line 1, GET is the type of HTTP request (as opposed to POST or HEAD). /Consortium/Overview.html is the path of the file that needs to be fetched, and HTTP/1.1 shows the version of HTTP used.
On line 2, the Host header names the Web site (www.w3.org) that the request is addressed to.
On line 3, the browser identification string identifies Firefox as the browser that is sending the message.
On line 4 is a list of media types that the browser is able to process (or "accept" from the server). This line is called the Accept header. In our example, the browser is saying it will accept pure XML files, XHTML files served as XML, HTML files, plain text files, PNG images, GIF images and JPEG images.
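To make the structure of the request concrete, the sketch below takes the example message apart and extracts the list of media types from the Accept header. The helper names parse_request and accepted_types are hypothetical, chosen for this illustration, and the parsing is deliberately simplified: it ignores the optional quality values (such as q=0.9) that real Accept headers may carry.

```python
# The example request from the text, with the CRLF line endings HTTP uses.
REQUEST = (
    "GET /Consortium/Overview.html HTTP/1.1\r\n"
    "Host: www.w3.org\r\n"
    "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.9) "
    "Gecko/20050711 Firefox/1.0.5\r\n"
    "Accept: text/xml, application/xhtml+xml, text/html, text/plain, "
    "image/png, image/gif, image/jpeg\r\n"
    "\r\n"
)


def parse_request(raw):
    """Split a raw request into method, path, version and a header dict."""
    head = raw.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    method, path, version = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; header values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, version, headers


def accepted_types(headers):
    """Return the media types listed in the Accept header, in order."""
    accept = headers.get("accept", "")
    # Drop any parameters after ';' (simplification: q-values are ignored).
    return [item.split(";")[0].strip() for item in accept.split(",") if item.strip()]


method, path, version, headers = parse_request(REQUEST)
print(method, path, version)
print(accepted_types(headers))
```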
If the Web server finds the requested file, it will send the file back to the browser. But in front of the file, it will put additional information about the file, called a header. A blank line separates the header from the file being sent back. Thus the server response to the browser may look like this:
HTTP/1.1 200 OK
Date: Fri, 03 Feb 2006 03:34:33 GMT
Server: Apache/1.3.33 (Unix) PHP/4.3.10
Content-Length: 14582
Content-Type: text/html; charset=utf-8

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 ...
<html xmlns="http://www.w3.org/1999/xhtml" ...
<head><title>About W3C</title></head>
<body>...</body>
</html>
On line 1, the version of HTTP being used by the server is mentioned (1.1), together with a status code. In this case, the status code 200 indicates that the Web server was successful in finding the file that the browser requested.
Line 2 gives the current date and time.
Line 3 gives information about the type of Web server.
Line 4 gives the size of the file that the server is sending back to the browser.
On line 5, the server identifies the media type for the file it is sending: text/html. The browser uses this information to determine how it will render or process the file.
Line 6 is the blank line that separates the header from the file being sent.
Lines 7 onward contain the contents of the file itself.
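The same walk-through can be expressed in code. The sketch below splits the example response at the blank line, reads the status line, and pulls the media type out of the Content-Type header. The helper names parse_response and split_content_type are hypothetical, and the body is shortened to a placeholder because the original listing elides most of the file.

```python
# The example response from the text; the body is abbreviated to a placeholder.
RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Fri, 03 Feb 2006 03:34:33 GMT\r\n"
    "Server: Apache/1.3.33 (Unix) PHP/4.3.10\r\n"
    "Content-Length: 14582\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<html>...</html>"
)


def parse_response(raw):
    """Split a raw response into status information, headers and body."""
    # The first blank line separates the header from the file being sent.
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


def split_content_type(value):
    """Separate the media type from parameters such as charset."""
    media_type, *params = [part.strip() for part in value.split(";")]
    options = dict(p.split("=", 1) for p in params if "=" in p)
    return media_type.lower(), options


version, code, reason, headers, body = parse_response(RESPONSE)
media_type, options = split_content_type(headers["content-type"])
print(code, reason, media_type, options)
```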
So, to recap, when a Web browser requests a file from a Web server, the media types in the Accept header tell the server which file types the browser will accept. When the server sends a file back, the media type in the Content-Type header tells the Web browser what type of file it is receiving.
Below is a partial list of common media types:
text/html- Used to identify HTML files.
text/css- Used to identify Cascading Style Sheets (CSS) files.
text/plain- Used to identify plain text files.
text/xml- Used to identify XML files.
image/gif- Used to identify GIF image files.
image/jpeg- Used to identify JPEG image files.
image/png- Used to identify PNG image files.
application/pdf- Used to identify Portable Document Format (PDF) files.
application/octet-stream- Used as a generic way to identify binary files; it is often used for transferring EXE or ZIP files.
audio/mpeg- Used to identify MPEG audio files, including MP3.
application/xhtml+xml- Used to identify XHTML files served as XML.
Media Types And XHTML
If a Web server identifies the media type for an XHTML Web page as text/html, the browser to which it is sending the page will parse the Web page as though it were HTML.
If the media type is identified as application/xhtml+xml, the browser will parse the Web page as XML. The browser will therefore enforce the rules of XML: if the markup contains errors (such as missing closing tags, incorrectly nested elements, or attributes that are not quoted), the browser will not render the Web page.
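The difference between strict and tolerant parsing can be illustrated with Python's standard library. In the sketch below, xml.etree.ElementTree stands in for an XML parser and html.parser stands in for a forgiving HTML parser; neither is the parser a browser actually uses, and the sample markup and helper names are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

WELL_FORMED = '<p class="note">Hello <b>world</b></p>'
# Two errors: the attribute value is unquoted and <b> is never closed.
MALFORMED = "<p class=note>Hello <b>world</p>"


def parses_as_xml(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records every start tag the tolerant HTML parser manages to read."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_seen_as_html(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


print(parses_as_xml(WELL_FORMED))    # the XML parser accepts this
print(parses_as_xml(MALFORMED))      # the XML parser rejects this outright
print(tags_seen_as_html(MALFORMED))  # the HTML parser carries on regardless
```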
Since the media type application/xhtml+xml exists, why doesn't everyone who writes XHTML Web pages serve the pages as XML? The answer is Internet Explorer. Although most other modern browsers such as Firefox, Opera and Safari support the application/xhtml+xml media type, Internet Explorer still does not. The IE development team has indicated they plan to support the application/xhtml+xml media type, but until IE comes up to speed, most developers continue to serve XHTML 1.0 as HTML. This can be done because, when written correctly, XHTML 1.0 is backward-compatible with HTML 4.01.
In addition, of course, XHTML incorporates all the forward-looking strengths of XML. As W3C's Working Group wrote in the XHTML 1.0 spec:
The XHTML family is the next step in the evolution of the Internet. By migrating to XHTML today, content developers can enter the XML world with all of its attendant benefits, while still remaining confident in their content's backward and future compatibility.
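Returning to the choice between text/html and application/xhtml+xml: one approach, not described in this article and offered only as a sketch, is for the server to inspect the Accept header and pick the media type per request. The function name choose_xhtml_media_type is hypothetical, the second Accept value below is an invented example of a browser that does not list application/xhtml+xml, and quality values are ignored for simplicity.

```python
def choose_xhtml_media_type(accept_header):
    """Pick the media type to declare for an XHTML 1.0 page."""
    # Simplification: strip parameters such as q-values and compare names only.
    accepted = [item.split(";")[0].strip().lower() for item in accept_header.split(",")]
    if "application/xhtml+xml" in accepted:
        return "application/xhtml+xml"
    # Fall back to HTML for browsers that do not announce XHTML support.
    return "text/html"


firefox_accept = (
    "text/xml, application/xhtml+xml, text/html, text/plain, "
    "image/png, image/gif, image/jpeg"
)
other_accept = "image/gif, image/jpeg, */*"  # hypothetical header, for illustration

print(choose_xhtml_media_type(firefox_accept))
print(choose_xhtml_media_type(other_accept))
```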
Frequently Asked Questions
Why not use file extensions (such as .htm and .gif) instead of media types to determine the type of file?
Many Web pages are created via server-side scripts (PHP, ASP, ASP.NET, etc.). File extensions for these scripts are different from those used for static files. These server-side scripts can return different types of files - not only Web pages. Also, many URLs point to Web pages that don't have file extensions.
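A small experiment shows the limitation. Python's standard mimetypes module guesses a media type from a file extension alone; the helper name guess_from_extension and the extension-less path are made up for this sketch. With an extension the guess succeeds, but for a URL path without one there is nothing to go on.

```python
import mimetypes


def guess_from_extension(path):
    """Guess a media type using only the file extension in the path."""
    media_type, _encoding = mimetypes.guess_type(path)
    return media_type


with_extension = guess_from_extension("/Consortium/Overview.html")
without_extension = guess_from_extension("/Consortium/Overview")

print(with_extension)     # the .html extension maps to a media type
print(without_extension)  # no extension, so no guess is possible
```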
Why do Web browsers enforce XML rules but not HTML rules?
Early Web browsers did enforce the rules of HTML and stopped rendering Web pages if HTML markup contained errors. However, in the late 1990s, in a race to gain market share, Web browsers started to make adjustments for invalid markup. This tolerance of poor markup came at a price for browser vendors and developers:
- Web browsers became more complex to build and maintain.
- They required more computing resources since more and more code was devoted to fixing markup mistakes.
- Web browser support for invalid markup encouraged sloppy authoring practices.
Tolerance of invalid markup also stifled innovation:
- Since Web browsers were more complex to build, there were fewer competitors.
- Since browsers were more complex to maintain, there were fewer updates to the software, and fewer development resources went into new features.
- Since Web browsers required more computing resources, they could not be incorporated into low-resource devices such as early cell phones and other embedded systems.
- Sloppy markup practices also undermined any foundation for extending HTML to support more complex sub-languages used to describe mathematics, music, etc.
To kick-start innovation, HTML was reformulated into XHTML.