HTML / XHTML / WML / XML Validator

HTML-Error documents regarding charset encodings

Identification of the charset encoding of an HTML-document is based on the HTML 4.01 W3C-Specification.
However, the specification leaves some questions open. The following points remain unexplained or inconsistent:
  1. If there is no charset encoding in the HTTP-Header, Byte Order Mark (BOM) or Meta-Tag, and no automatic identification based on the document's binary structure is possible, the specification does not say which charset encoding to use.
    It recommends not assuming any charset, but it does not say how the document should then be processed.
    Furthermore, the HTTP protocol specification RFC2616, which demands a fallback to ISO-8859-1 in such cases, is to be ignored. Again, no alternative is given.
  2. In principle, the charset statement in the HTTP-Header has a higher priority than the one in the Meta-Tag.
    If a document is, for example, encoded in UTF-16 with a Byte Order Mark (BOM), it remains open what priority the BOM has.
    If the charset encoding can be identified, the specification still says nothing about how to process the document.
Because of these circumstances we have to apply our own processing rules. They work as follows:
  1. If there is no charset encoding in the HTTP-Header, Byte Order Mark (BOM) or Meta-Tag, and no automatic identification based on the document's binary structure is possible, processing is aborted and an Error message is shown to the user.
  2. The sources for identifying an HTML-document's charset encoding are consulted in the following priority order (highest to lowest):
    1. HTTP-Header "Content-Type" ("charset" parameter).
    2. If a Byte Order Mark (BOM) exists, it is used.
    3. If the charset encoding can be identified automatically (based on the document's binary structure), it is used.
    4. A Meta declaration with "http-equiv" set to "Content-Type" and a valid "charset" value.
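
The priority order above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Validome's actual implementation: the names `resolve_charset`, `sniff_bom`, `sniff_binary` and `CharsetError` are hypothetical, the BOM table covers only UTF-8 and UTF-16, and the "automatic identification" step is reduced to a crude check for a UTF-16 encoded `<` at the start of the document.

```python
import codecs
import re
from typing import Optional

# BOM signatures considered by this sketch (assumption: UTF-8 and UTF-16 only).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# "charset" parameter of a Content-Type header value.
CHARSET_PARAM = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)

# Meta declaration with http-equiv="Content-Type" and a charset value.
META_CHARSET = re.compile(
    rb"<meta[^>]+http-equiv\s*=\s*[\"']?Content-Type[\"']?[^>]*"
    rb"charset\s*=\s*([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)


class CharsetError(Exception):
    """No source yielded a charset encoding, so processing is aborted."""


def sniff_bom(body: bytes) -> Optional[str]:
    """Return the encoding announced by a leading Byte Order Mark, if any."""
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    return None


def sniff_binary(body: bytes) -> Optional[str]:
    """Crude stand-in for automatic identification: a UTF-16 '<' without BOM."""
    if body.startswith(b"<\x00"):
        return "utf-16-le"
    if body.startswith(b"\x00<"):
        return "utf-16-be"
    return None


def resolve_charset(content_type: Optional[str], body: bytes) -> str:
    """Apply the priority order: HTTP-Header, BOM, binary structure, Meta-Tag."""
    if content_type:
        match = CHARSET_PARAM.search(content_type)
        if match:
            return match.group(1).lower()
    for sniff in (sniff_bom, sniff_binary):
        found = sniff(body)
        if found:
            return found
    meta = META_CHARSET.search(body)
    if meta:
        return meta.group(1).decode("ascii").lower()
    raise CharsetError("no charset encoding could be determined")
```

With these definitions, `resolve_charset("text/html; charset=ISO-8859-1", body)` returns the header value regardless of what the document itself declares, while a document with no usable source at all raises `CharsetError` instead of silently falling back to ISO-8859-1.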

  Validome W3C-Validator WDG-Validator Total-Validator Site Valet-Validator

HTTP-Header:
Content-Type: text/html; charset=ISO-8859-1

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
     "http://www.w3.org/TR/html4/loose.dtd">

<html>
  <head>
    <title>no error</title>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  </head>
  <body></body>
</html>

This document contains a valid charset statement in both the Meta-Tag and the HTTP-Header.

The HTTP-Header charset encoding has to be used.
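
In this example the header says "ISO-8859-1" and the Meta-Tag says "iso-8859-1", so any comparison of the two has to ignore case and aliases. A minimal sketch using only the Python standard library follows; the helper names `header_charset` and `same_charset` are hypothetical and not part of any validator.

```python
import codecs
from email.message import Message
from typing import Optional


def header_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the lower-cased charset, or None if the parameter is absent.
    return msg.get_content_charset()


def same_charset(a: str, b: str) -> bool:
    """Compare two charset labels via Python's codec registry.

    Raises LookupError for labels that Python does not know.
    """
    return codecs.lookup(a).name == codecs.lookup(b).name
```

Here `header_charset("text/html")` yields `None`, which is the situation in which the lower-priority sources have to be consulted.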


HTTP-Header:
Content-Type: text/html

EF BB BF<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
     "http://www.w3.org/TR/html4/loose.dtd">

<html>
  <head>
    <title>Byte order Mark != Meta-Charset</title>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  </head>
  <body>äöüÄÖÜß</body>
</html>

The document is encoded in UTF-8, as its Byte Order Mark (EF BB BF) shows, but the Meta-Tag declares ISO-8859-1.
This conflict should be pointed out to the user.
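
A check for this kind of conflict could look roughly like the following Python sketch. It is an assumption-laden illustration: the function name `bom_conflict` and the wording of the message are invented here, and only the UTF-8 BOM is handled.

```python
import codecs
import re
from typing import Optional

# Any charset value inside a <meta ...> tag (simplified pattern).
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def bom_conflict(body: bytes) -> Optional[str]:
    """Return a notice if a UTF-8 BOM contradicts the Meta-Tag charset."""
    if not body.startswith(codecs.BOM_UTF8):
        return None
    match = META_CHARSET.search(body)
    if match is None:
        return None
    declared = match.group(1).decode("ascii")
    try:
        if codecs.lookup(declared).name == "utf-8":
            return None  # BOM and Meta-Tag agree
    except LookupError:
        pass  # unknown label: cannot agree with the BOM
    return f"BOM indicates UTF-8, but Meta-Tag declares {declared}"
```

The function only reports the mismatch; which of the two sources wins is decided by the priority order, where the BOM ranks above the Meta-Tag.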


HTTP-Header:
Content-Type: text/html; charset=UTF-8

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
     "http://www.w3.org/TR/html4/loose.dtd">

<html>
  <head>
    <title>HTTP-Header != Meta-Charset</title>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  </head>
  <body>äöüÄÖÜß</body>
</html>

The charset encoding in the Meta-Tag differs from the charset encoding in the HTTP-Header.

The HTTP-Header charset encoding has to be used.


HTTP-Header:
Content-Type: text/html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
     "http://www.w3.org/TR/html4/loose.dtd">

<html>
  <head>
    <title>missing http-charset</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>äöüÄÖÜß?</body>
</html>

The HTTP-Header contains no charset encoding.

The Meta-Tag charset encoding has to be used.


HTTP-Header:
Content-Type: text/html; charset=UTF-8

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
     "http://www.w3.org/TR/html4/loose.dtd">

<html>
  <head>
    <title>missing meta-charset</title>
  </head>
  <body>äöüÄÖÜß</body>
</html>

The document itself contains no charset encoding statement; one is communicated only in the HTTP-Header. Because the charset encoding can no longer be identified once the document is not delivered via HTTP (e.g. as a local copy), a notice is shown to the user.


HTTP-Header:
Content-Type: text/html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
     "http://www.w3.org/TR/html4/loose.dtd">

<html>
  <head>
    <title>missing all-charset</title>
  </head>
  <body>äöüÄÖÜß</body>
</html>

No charset encoding statement was found in the Meta-Tag, the Byte Order Mark (BOM) or the HTTP-Header.

The W3C-specification recommends ignoring RFC2616 and consequently not falling back to ISO-8859-1, but it does not say which charset encoding should be used instead.
For this reason, we decided to abort validation and report an Error message.
The examples from here on are UTF-16 encoded HTML-documents.


HTTP-Header:
Content-Type: text/html; charset=UTF-16

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
     "http://www.w3.org/TR/html4/loose.dtd">

<html>
  <head>
    <title>no error UTF-16 without byte order mark</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-16">
  </head>
  <body>äöüÖÄÜß</body>
</html>

This HTML-document has been encoded in UTF-16 and contains no Errors.


HTTP-Header:
Content-Type: text/html; charset=UTF-16

FF FE<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
     "http://www.w3.org/TR/html4/loose.dtd">

<html>
  <head>
    <title>no error UTF-16 with byte order mark</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-16">
  </head>
  <body>äöüÖÄÜß</body>
</html>

This HTML-document has been encoded in UTF-16 with a Byte Order Mark (BOM) and contains no Errors.
The following examples are UTF-16 encoded with a Byte Order Mark (BOM).


HTTP-Header:
Content-Type: text/html; charset=ISO-8859-1

FF FE<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
     "http://www.w3.org/TR/html4/loose.dtd">

<html>
  <head>
    <title>UTF-16; HTTP-Header != BOM</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-16">
  </head>
  <body></body>
</html>

In this example document the HTTP-Header charset encoding statement differs from the BOM.
In such cases the HTTP-Header charset encoding is used.
Because the document is encoded in UTF-16 but has to be processed as ISO-8859-1, parsing Errors should be reported.
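
Why parsing errors are to be expected can be seen by decoding UTF-16 bytes as ISO-8859-1. The following Python sketch is purely illustrative; the function name `misdecode` is hypothetical and little-endian byte order (BOM FF FE) is assumed, as in the example above.

```python
import codecs


def misdecode(html: str) -> str:
    """Encode as UTF-16 LE with BOM, then decode the bytes as ISO-8859-1."""
    raw = codecs.BOM_UTF16_LE + html.encode("utf-16-le")
    # ISO-8859-1 maps every byte to a character, so decoding never fails;
    # the damage only becomes visible to the parser afterwards.
    return raw.decode("iso-8859-1")


garbled = misdecode("<html><body></body></html>")
# The BOM turns into the two characters U+00FF and U+00FE, and every
# ASCII character is followed by a NUL character.
print(repr(garbled[:8]))
```

The decoding step itself succeeds, which is why the mismatch surfaces as markup errors (stray characters before the doctype, NUL characters inside tag names) and not as a decoding failure.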


HTTP-Header:
Content-Type: text/html; charset=UTF-16

FF FE<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
     "http://www.w3.org/TR/html4/loose.dtd">

<html>
  <head>
    <title>UTF-16; HTTP-Header != META</title>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  </head>
  <body>äöüÄÖÜß</body>
</html>

The Meta-Tag charset encoding differs from the HTTP-Header charset encoding.

The HTTP-Header charset encoding has to be used.


HTTP-Header:
Content-Type: text/html

FF FE<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
     "http://www.w3.org/TR/html4/loose.dtd">

<html>
  <head>
    <title>UTF-16; BOM != META</title>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  </head>
  <body>äöüÄÖÜß</body>
</html>

This HTML-document is UTF-16 encoded, but the Meta-Tag charset statement says to process it as ISO-8859-1. The BOM indicates the correct charset.
This conflict should be pointed out to the user.

v3.0.0 - 16.11.2010 © validome.org - all rights reserved