HTML / XHTML / WML / XML Validator

 
Charset encodings in XML and XHTML documents

The charset encoding of an XML or XHTML document can be determined from the following sources:
  1. The "charset" parameter of the HTTP Content-Type header.
  2. The encoding attribute of the XML declaration.
  3. The Byte Order Mark (BOM) of the document.
  4. The binary pattern of "<?xml" in the document.
Because the specifications for XML documents, like those for HTML documents, do not clearly define how charset encoding statements are to be evaluated, Validome uses the following order of priority (highest to lowest):
  1. The "charset" parameter of the HTTP Content-Type header.
  2. The BOM of the document.
  3. The binary pattern of "<?xml" in the document.
  4. The encoding attribute of the XML declaration.
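The priority order above can be sketched as a small Python function. This is only an illustration, not Validome's actual implementation; the name `choose_encoding` and its four parameters are assumptions made for this example. Each parameter holds the encoding reported by one source, or `None` if that source is absent.

```python
from typing import Optional


def choose_encoding(http_charset: Optional[str],
                    bom_encoding: Optional[str],
                    pattern_encoding: Optional[str],
                    declared_encoding: Optional[str]) -> Optional[str]:
    """Return the encoding from the highest-priority source that is present."""
    # Highest priority first: HTTP header, BOM, "<?xml" byte pattern, XML declaration.
    for candidate in (http_charset, bom_encoding, pattern_encoding, declared_encoding):
        if candidate:
            return candidate
    return None  # nothing found; the caller has to apply a fallback


print(choose_encoding("ISO-8859-1", None, None, "ISO-8859-1"))  # ISO-8859-1
print(choose_encoding(None, "UTF-16", None, "UTF-8"))           # UTF-16
```

Returning `None` instead of a default keeps the fallback decision separate from the priority rule.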

  Validome W3C-Validator WDG-Validator Total-Validator Site Valet-Validator

HTTP-Header:
Content-Type: application/xhtml+xml; charset=ISO-8859-1

  <?xml version="1.0" encoding="ISO-8859-1"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>no error</title>
    </head>
    <body/>
  </html>

This document contains no errors.
The charset encoding from the HTTP header is used.
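Reading the charset parameter from a Content-Type value can be sketched with Python's standard library. The helper name `charset_from_content_type` is an assumption for this example, and how Validome itself parses the header is not documented here.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lower-cased charset parameter, or None if it is absent."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the header carries no charset parameter.
    return message.get_content_charset()


print(charset_from_content_type("application/xhtml+xml; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/xml"))                                   # None
```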

HTTP-Header:
Content-Type: text/xml

  <!DOCTYPE root [
    <!ELEMENT foo (#PCDATA)>
    <!ELEMENT root (foo)>
  ]>
  <root>
    <foo>foo</foo>
  </root>

If the charset encoding cannot be detected, the validator falls back to US-ASCII.

HTTP-Header:
Content-Type: application/xhtml+xml

  EF BB BF<?xml version="1.0"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>BOM-Charset</title>
    </head>
    <body>äöüÄÖÜß</body>
  </html>

The charset encoding can only be identified from the Byte Order Mark (BOM), because no other encoding statement exists.
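Detecting an encoding from the BOM amounts to comparing the first bytes of the document with a few known byte sequences. The following Python sketch uses the BOM constants from the standard `codecs` module; the function name `encoding_from_bom` is an assumption for this example, and the sketch covers only the common UTF-8, UTF-16 and UTF-32 marks.

```python
import codecs
from typing import Optional

# Longer marks are tested first, because the UTF-32 little-endian BOM
# (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]


def encoding_from_bom(data: bytes) -> Optional[str]:
    """Return the encoding indicated by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(encoding_from_bom(b"\xef\xbb\xbf<?xml version=\"1.0\"?>"))  # UTF-8
```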

HTTP-Header:
Content-Type: application/xhtml+xml

  <?xml version="1.0"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>automatic</title>
    </head>
    <body>äöüÄÖÜß</body>
  </html>

This UTF-16 encoded document contains no charset encoding in the XML declaration, and no charset encoding is sent in the HTTP header.
Because the charset encoding can be identified from the binary pattern of UTF-16, the correct charset encoding is used.
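Without a BOM, the encoding family can still be guessed from how the first characters "<?" of the XML declaration are laid out in bytes. The sketch below uses the byte patterns listed in Appendix F of the XML 1.0 specification; whether Validome uses exactly this table is an assumption, as is the function name `encoding_from_pattern`. The unusual UTF-32 byte orders and EBCDIC are left out for brevity.

```python
from typing import Optional

# First four bytes of a document that starts with "<?xml" and has no BOM.
_PATTERNS = [
    (b"\x00\x00\x00\x3c", "UTF-32BE"),
    (b"\x3c\x00\x00\x00", "UTF-32LE"),
    (b"\x00\x3c\x00\x3f", "UTF-16BE"),
    (b"\x3c\x00\x3f\x00", "UTF-16LE"),
    # "<?xm" in single bytes: UTF-8, ISO-8859-x or another ASCII-compatible
    # encoding; the XML declaration has to be read to find out which one.
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible"),
]


def encoding_from_pattern(data: bytes) -> Optional[str]:
    """Guess the encoding family from the byte layout of the leading "<?xml"."""
    for pattern, name in _PATTERNS:
        if data.startswith(pattern):
            return name
    return None


sample = '<?xml version="1.0"?>'.encode("utf-16-le")
print(encoding_from_pattern(sample))  # UTF-16LE
```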

HTTP-Header:
Content-Type: application/xhtml+xml

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>XML-Charset</title>
    </head>
    <body>äöüÄÖÜß</body>
  </html>

A charset encoding statement (UTF-8) exists only in the XML declaration; no charset encoding is sent in the HTTP header.
The charset encoding from the XML declaration (UTF-8) is therefore used.
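Reading the encoding attribute from the XML declaration can be sketched with a regular expression over the raw bytes. The sketch assumes the document is in an ASCII-compatible encoding, so that the declaration itself is readable byte by byte; the function name `declared_encoding` is hypothetical, and the expression is deliberately looser than the full XML grammar.

```python
import re
from typing import Optional

# Matches the encoding attribute inside a leading XML declaration.
_ENCODING = re.compile(
    rb"""^<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, or None if absent."""
    match = _ENCODING.match(data)
    if match is None:
        return None
    return match.group(1).decode("ascii")


print(declared_encoding(b'<?xml version="1.0" encoding="UTF-8"?>'))  # UTF-8
print(declared_encoding(b'<?xml version="1.0"?>'))                   # None
```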

HTTP-Header:
Content-Type: application/xhtml+xml

  <?xml version="1.0"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>no XML-Declaration</title>
    </head>
    <body>äöüÄÖÜß</body>
  </html>

If no charset encoding can be found in an XHTML document, Validome falls back to UTF-8.
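Together with the earlier text/xml example, this gives two fallback cases, which the sketch below records in a lookup table. Keying the fallback on the media type is an assumption made for this illustration, since the page only says that XHTML documents fall back to UTF-8 and shows US-ASCII for a text/xml document. The names `FALLBACKS` and `fallback_encoding` are hypothetical.

```python
from typing import Optional

# Only the two cases described in the text; other media types are left undecided.
FALLBACKS = {
    "application/xhtml+xml": "UTF-8",
    "text/xml": "US-ASCII",
}


def fallback_encoding(media_type: str) -> Optional[str]:
    """Return the fallback encoding for a media type, or None if unknown."""
    return FALLBACKS.get(media_type.strip().lower())


print(fallback_encoding("application/xhtml+xml"))  # UTF-8
print(fallback_encoding("text/xml"))               # US-ASCII
```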

HTTP-Header:
Content-Type: application/xhtml+xml

  <?xml version="1.0" encoding="ISO-8859-1"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
      <head>
        <title>XML-Charset</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
      </head>
      <body>äöüÄÖÜß</body>
    </html>

If the charset encoding in the XML declaration differs from the one in the meta element, this should be reported.
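A check of this kind can be sketched by extracting both statements and comparing them case-insensitively. The regular expressions are simplified and the function name `declaration_meta_conflict` is an assumption for this example; a real validator would use a proper parser.

```python
import re
from typing import Optional

_XML_ENCODING = re.compile(r"""<\?xml[^>]*?\sencoding\s*=\s*["']([^"']+)["']""")
_META_CHARSET = re.compile(
    r"""<meta[^>]+http-equiv\s*=\s*["']Content-Type["'][^>]*?charset=([A-Za-z0-9._-]+)""",
    re.IGNORECASE,
)


def declaration_meta_conflict(text: str) -> Optional[str]:
    """Return a notice if the XML declaration and the meta element disagree."""
    xml_match = _XML_ENCODING.search(text)
    meta_match = _META_CHARSET.search(text)
    if not xml_match or not meta_match:
        return None  # nothing to compare
    xml_enc, meta_enc = xml_match.group(1), meta_match.group(1)
    if xml_enc.lower() == meta_enc.lower():
        return None
    return f"XML declaration says {xml_enc}, meta element says {meta_enc}"


document = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
    '</head><body/></html>'
)
print(declaration_meta_conflict(document))
```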
The next examples use UTF-32 encoded documents.

HTTP-Header:
Content-Type: application/xhtml+xml; charset=UTF-32

  00 00 FE FF<?xml version="1.0" encoding="UTF-32"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>UTF-32 (1234)</title>
    </head>
    <body>äöüÄÖÜß</body>
  </html>

Validome can also handle UTF-32 encoded documents.
Every UTF-32 encoded character consists of four bytes, and four different byte orders are possible per document.
The example above and the three that follow demonstrate these byte orders.

1. Byte Order 1234

HTTP-Header:
Content-Type: application/xhtml+xml; charset=UTF-32

  FF FE 00 00<?xml version="1.0" encoding="UTF-32"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>UTF-32 (4321)</title>
    </head>
    <body>äöüÄÖÜß</body>
  </html>

2. Byte Order 4321

HTTP-Header:
Content-Type: application/xhtml+xml; charset=UTF-32

  00 00 FF FE<?xml version="1.0" encoding="UTF-32"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>UTF-32 (2143)</title>
    </head>
    <body>äöüÄÖÜß</body>
  </html>

3. Byte Order 2143

HTTP-Header:
Content-Type: application/xhtml+xml; charset=UTF-32

  FE FF 00 00<?xml version="1.0" encoding="UTF-32"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>UTF-32 (3412)</title>
    </head>
    <body>äöüÄÖÜß</body>
  </html>

4. Byte Order 3412
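The four byte orders can be handled uniformly by rearranging every four-byte group into big-endian order (1234) before decoding. The sketch below does this in Python, which only ships codecs for the 1234 and 4321 orders; the names `BOM_TO_ORDER`, `decode_utf32` and `decode_utf32_with_bom` are assumptions for this example, not part of Validome.

```python
# The BOM character U+FEFF as it appears in each of the four byte orders.
BOM_TO_ORDER = {
    b"\x00\x00\xfe\xff": "1234",
    b"\xff\xfe\x00\x00": "4321",
    b"\x00\x00\xff\xfe": "2143",
    b"\xfe\xff\x00\x00": "3412",
}


def decode_utf32(data: bytes, order: str) -> str:
    """Decode UTF-32 data stored in the given byte order, e.g. "2143"."""
    if len(data) % 4:
        raise ValueError("UTF-32 data must be a multiple of four bytes")
    # order[i] names which byte (1 = most significant) is stored at offset i.
    positions = [int(digit) - 1 for digit in order]
    big_endian = bytearray(len(data))
    for start in range(0, len(data), 4):
        for offset, position in enumerate(positions):
            big_endian[start + position] = data[start + offset]
    return bytes(big_endian).decode("utf-32-be")


def decode_utf32_with_bom(data: bytes) -> str:
    """Pick the byte order from the BOM and decode the rest of the data."""
    order = BOM_TO_ORDER.get(data[:4])
    if order is None:
        raise ValueError("no UTF-32 BOM found")
    return decode_utf32(data[4:], order)


# U+00E4 (ä) is 00 00 00 E4 in order 1234, so it is 00 00 E4 00 in order 2143.
print(decode_utf32_with_bom(b"\x00\x00\xff\xfe" + b"\x00\x00\xe4\x00") == "ä")  # True
```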
The following examples show documents with conflicts between the possible charset sources.

HTTP-Header:
Content-Type: application/xhtml+xml

  FF FE<?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>BOM != XML</title>
    </head>
    <body>äöüÄÖÜß</body>
  </html>

The charset encoding indicated by the BOM (UTF-16 in this case) differs from the charset encoding in the XML declaration (UTF-8).
The BOM charset encoding has to be used.

HTTP-Header:
Content-Type: application/xhtml+xml

  EF BB BF<?xml version="1.0" encoding="UTF-16"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>BOM != XML-AUTO</title>
    </head>
    <body />
  </html>

This document is UTF-16 encoded, but the Byte Order Mark (BOM) specifies UTF-8.
Some error messages should be reported, because the document has to be validated as UTF-8.

HTTP-Header:
Content-Type: application/xhtml+xml

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>XML != XML-Auto</title>
    </head>
    <body />
  </html>

The charset encoding statement in the XML declaration specifies a UTF-8 encoded document. In fact, the document is UTF-16 encoded, so some error messages should be reported.

HTTP-Header:
Content-Type: application/xhtml+xml; charset=UTF-8

  <?xml version="1.0" encoding="ISO-8859-1"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>HTTP != XML</title>
    </head>
    <body>äöüÄÖÜß</body>
  </html>

This document is UTF-8 encoded. The charset encoding statement in the HTTP header specifies the correct encoding, whereas the XML declaration specifies ISO-8859-1. The HTTP header charset encoding (UTF-8) has to be used, but a notice about this conflict has to be reported.
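Choosing the highest-priority source while still reporting disagreements can be sketched as follows. The function name `resolve_with_notices`, the source labels and the wording of the notice are assumptions for this illustration; the sources are expected in priority order, with `None` for a source that is absent.

```python
from typing import List, Optional, Tuple


def resolve_with_notices(
    sources: List[Tuple[str, Optional[str]]]
) -> Tuple[Optional[str], List[str]]:
    """Return the winning encoding plus one notice per conflicting source."""
    chosen_name: Optional[str] = None
    chosen: Optional[str] = None
    notices: List[str] = []
    for name, encoding in sources:
        if encoding is None:
            continue
        if chosen is None:
            # The first source that is present wins.
            chosen_name, chosen = name, encoding
        elif encoding.lower() != chosen.lower():
            notices.append(
                f"{name} specifies {encoding}, but {chosen_name} ({chosen}) takes priority"
            )
    return chosen, notices


encoding, notices = resolve_with_notices([
    ("HTTP header", "UTF-8"),
    ("BOM", None),
    ("byte pattern", None),
    ("XML declaration", "ISO-8859-1"),
])
print(encoding)  # UTF-8
print(notices)
```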

HTTP-Header:
Content-Type: application/xhtml+xml; charset=ISO-8859-1

  <?xml version="1.0"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>HTTP != AUTO</title>
    </head>
    <body>äöüÄÖÜß</body>
  </html>

The HTTP header charset encoding (ISO-8859-1) differs from the actual encoding (UTF-16).
Because ISO-8859-1 (from the HTTP header) has to be used, some error messages have to be reported.
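Why errors are unavoidable here can be seen by decoding UTF-16 bytes as ISO-8859-1. The decoding itself never fails, because every byte maps to a character, but the result is full of NUL characters, which XML 1.0 does not allow. The sketch assumes little-endian UTF-16 without a BOM, which the example does not state; `illegal_xml_chars` is a hypothetical helper.

```python
def illegal_xml_chars(text: str) -> int:
    """Count characters outside the XML 1.0 Char production."""
    def legal(ch: str) -> bool:
        cp = ord(ch)
        return (
            cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF
        )
    return sum(1 for ch in text if not legal(ch))


raw = '<?xml version="1.0"?><body>äöü</body>'.encode("utf-16-le")
misread = raw.decode("iso-8859-1")  # never raises, but yields a NUL after each character

print(illegal_xml_chars(misread) > 0)  # True
```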

HTTP-Header:
Content-Type: application/xhtml+xml; charset=ISO-8859-1

  EF BB BF<?xml version="1.0"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>HTTP != BOM</title>
    </head>
    <body />
  </html>

The Byte Order Mark (BOM) indicates UTF-8, while the charset encoding statement in the HTTP header specifies ISO-8859-1.
The BOM is correct, but the HTTP header charset has to be used, so some error messages should be reported.
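The effect of reading a UTF-8 BOM as ISO-8859-1 can be shown in a few lines of Python: the three BOM bytes turn into three ordinary characters in front of the XML declaration, where no content is allowed. The helper name `text_before_declaration` is an assumption for this example.

```python
def text_before_declaration(raw: bytes, encoding: str) -> str:
    """Decode raw and return whatever precedes the "<?xml" declaration."""
    text = raw.decode(encoding)
    return text[: text.index("<?xml")]


raw = b"\xef\xbb\xbf" + b'<?xml version="1.0"?><html/>'

# "utf-8-sig" recognises the BOM and removes it, so nothing precedes "<?xml".
print(repr(text_before_declaration(raw, "utf-8-sig")))
# ISO-8859-1 maps EF BB BF to the characters U+00EF, U+00BB and U+00BF.
print(repr(text_before_declaration(raw, "iso-8859-1")))
```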

v2.6.9 - 24.4.2008 © validome.org - all rights reserved
