Error converting VARCHAR(MAX) to XML due to "UTF-8" attribute

If the declaration says encoding="UTF-8", then the XML stored in your varchar(max) column should look like this:

<?xml version="1.0" encoding="UTF-8"?>
<PARAM>
  <TAB DIM="30" ID="ZC3D2_1" SIZE="5">
    <LIN NUM = "1">
      <FLD NAME = "ZDOC" TYPE = "Char">Ferran López</FLD>
    </LIN>
  </TAB>
</PARAM>

The ó should be represented by the double-byte sequence ó (0xC3 0xB3, which is how the UTF-8 encoding of ó looks when read as Windows-1252).
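
A quick way to verify what is actually stored is to inspect the raw bytes (a sketch; [your_table] and [data] are placeholder names):

    -- ó stored as UTF-8 shows up as 0xC3B3; stored as Windows-1252 it is a single 0xF3 byte
    SELECT CONVERT(VARBINARY(MAX), [data])
    FROM   [your_table];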

If you don't have a UTF-8 encoded string stored in your column, the right way to go about this is to remove the encoding attribute from the XML declaration before you convert the value to the XML datatype.
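
For example, a minimal sketch of stripping the attribute before the conversion (this assumes the declaration appears exactly as shown; adjust the search string to match your data):

    DECLARE @Data VARCHAR(MAX) =
        '<?xml version="1.0" encoding="UTF-8"?><PARAM><TAB DIM="30" ID="ZC3D2_1" SIZE="5"/></PARAM>';

    -- With the encoding attribute removed, the remaining declaration is still valid
    -- and the bytes are interpreted per the collation's code page:
    SELECT CONVERT(XML, REPLACE(@Data, ' encoding="UTF-8"', ''));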


I think you have a deeper problem. UTF-8 allows for more characters than the regular non-Unicode collations in SQL Server. So, to be safe, you should either use SQL Server 2019, which has UTF-8 collations (and I understand if that isn't doable/desirable for many reasons), or use (try) nvarchar instead of varchar.
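
The difference in behavior is easy to demonstrate (a sketch; the fragment is taken from the XML above):

    -- NVARCHAR is already UTF-16, so no declaration is needed and the conversion just works:
    DECLARE @Data NVARCHAR(MAX) = N'<FLD NAME="ZDOC" TYPE="Char">Ferran López</FLD>';
    SELECT CONVERT(XML, @Data);

    -- But an NVARCHAR value whose declaration still claims UTF-8 fails with
    -- Msg 9402, "unable to switch the encoding", so the encoding attribute
    -- must be removed or changed to "UTF-16":
    DECLARE @Bad NVARCHAR(MAX) = N'<?xml version="1.0" encoding="UTF-8"?><a>ó</a>';
    SELECT CONVERT(XML, @Bad);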

If you are worried about the storage increase of going from varchar to nvarchar, you can possibly use row compression. But that requires Enterprise Edition prior to SQL Server 2016 SP1.
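
If you do try row compression, the syntax itself is simple (a sketch with a placeholder table name; note that, as I understand it, Unicode compression is not applied to nvarchar(max) values, so test whether it actually helps before committing to it):

    ALTER TABLE [your_table]
        REBUILD WITH (DATA_COMPRESSION = ROW);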


What's happening here is:

  1. The XML type stores data internally as UTF-16 Little Endian (most of the time, at least). It doesn't matter what the source encoding is; the end result will be UTF-16 LE, with no XML declaration and hence no encoding="..." attribute (this, and point 6 below, are demonstrated in the repro following this list).
  2. When converting a string to XML:
    1. It's the bytes of the string that get converted, not the characters (I will explain the difference in a moment).
    2. NVARCHAR data is assumed to be UTF-16 LE. If there is an XML declaration and it contains the encoding attribute, the only valid value is "UTF-16".
    3. VARCHAR data is assumed to be in the 8-bit code page associated with the collation of the data when there is no XML declaration, or when one exists but has no encoding attribute. Otherwise, the data will be interpreted as being encoded in the code page specified by the encoding attribute (even though it is actually encoded in the code page associated with the collation of the data).
  3. Your data is most likely encoded as Windows code page 1252 (this is determined by the collation of the column that the data resides in, not the collation of the instance or even the database; but since you mention that the instance is using Latin1_General_BIN, it's safe enough to assume for the moment that the column is using the same collation).
  4. The code point for the ó character in code page Windows-1252 is: 0xF3.
  5. The XML declaration, however, claims that the XML data is encoded as UTF-8.
  6. In UTF-8, 0xF3 is a lead byte that must be followed by three bytes, each between 0x80 and 0xBF, yet in your data it's followed by a p, which has a value of 0x70. Hence you get the "illegal xml character" error (because encoding="UTF-8" tells the parser that the bytes are valid UTF-8 bytes; the parser never sees the ó character).
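
Points 1 and 6 can be seen in a minimal repro (assuming a Windows-1252 based collation context, such as the Latin1_General_BIN mentioned above):

    -- 1. The XML declaration is not preserved by the XML type:
    DECLARE @X XML = '<?xml version="1.0"?><a>test</a>';
    SELECT CONVERT(NVARCHAR(MAX), @X);   -- <a>test</a>

    -- 2. CP-1252 bytes that falsely claim to be UTF-8 fail with
    --    Msg 9420, "illegal xml character":
    DECLARE @Bad VARCHAR(MAX) = '<?xml version="1.0" encoding="UTF-8"?><a>Ferran López</a>';
    SELECT CONVERT(XML, @Bad);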

Your options are:

  1. Ideally, the column would be converted to XML, and the encoding attribute of the XML declaration (or the entire declaration) would be removed on the way in; a migration sketch follows this list. As a bonus, the XML datatype can save space when there are repeating element and/or attribute names, as it creates a dictionary (lookup list) of names internally and records the structure using those ID values.

  2. Set the [data] column to use a UTF-8 collation (new in SQL Server 2019, so not an option for you)

  3. Set the [data] column to be NVARCHAR, and remove the encoding attribute of the XML declaration, or the entire declaration.

  4. Convert the incoming string into UTF-8 bytes. So the ó character is two bytes in UTF-8: 0xC3B3, which appear as ó in Windows-1252.

    DECLARE @Good VARCHAR(MAX) = '<?xml version="1.0" encoding="UTF-8"?><a>hell'
            + CONVERT(VARCHAR(MAX), 0xC3B3)
            + '</a>';
    SELECT @Good, CONVERT(XML, @Good)
    -- <?xml version="1.0" encoding="UTF-8"?><a>helló</a>
    --
    -- <a>helló</a>
    
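For option 1, the migration could look something like the following sketch ([your_table], [data], and [data_xml] are placeholder names, and the REPLACE assumes the encoding attribute appears exactly as in the sample data; normalize it first if the spacing or casing varies):

    -- Add a proper XML column, then populate it while stripping the encoding attribute:
    ALTER TABLE [your_table] ADD [data_xml] XML NULL;
    GO
    UPDATE [your_table]
    SET    [data_xml] = CONVERT(XML, REPLACE([data], ' encoding="UTF-8"', ''));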

NOTES:

  • Simply removing the encoding attribute of the XML declaration, or the entire declaration, is not a complete fix on its own. Sure, it will work in this particular case, but it won't work in all cases, due to the column being VARCHAR and UTF-8 collations not being available in SQL Server 2014. Hence, any Unicode characters not available in Windows code page 1252 will be converted to ? or ?? (depending on whether it is a BMP character or a Supplementary Character):
    -- (💩 is a stand-in for any Supplementary Character, i.e. one above U+FFFF)
    DECLARE @Test VARCHAR(MAX) = '<test>ó - ☢ - 💩</test>';
    SELECT @Test, CONVERT(XML, @Test);
    -- <test>ó - ? - ??</test>
    --
    -- <test>ó - ? - ??</test>
    
  • Do NOT simply change the collation of the column to a different locale / culture. While that might get rid of the error, it would only accomplish that by silently getting rid of the data that was causing the error. For example:
    DECLARE @Data NVARCHAR(MAX) = N'ó';
    SELECT CONVERT(VARCHAR(MAX), @Data COLLATE Latin1_General_BIN) AS [Latin1_General],
           CONVERT(VARCHAR(MAX), @Data COLLATE Latin1_General_BIN)
               COLLATE Cyrillic_General_CI_AS AS [Cyrillic];
    /*
    Latin1_General    Cyrillic
    ó                 o
    */
    
    "Cyrillic" uses a different code page than "Latin1_General", and the ó character is not available on the Cyrillic code page. But, there is a "Best Fit" mapping which is why we end up with an o instead of a ?.
  • You, and anyone working on SQL Server 2008 or newer, really should be using the _100_ level collations. Additionally, anyone working on SQL Server 2012 or newer should ideally be using a _100_ level collation ending in _SC (for Supplementary Characters). Finally, when needing a binary collation on SQL Server 2005 or newer, use one ending in _BIN2 (see my post here as to why). A quick way to list the qualifying collations is shown after these notes.
  • This issue has nothing to do with whether the query is ad hoc or in a stored procedure (T-SQL or SQLCLR).
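
To see which collations qualify, you can filter sys.fn_helpcollations() (a sketch; the LIKE patterns are just one way to slice it):

    -- _100_ level collations that are Supplementary Character-aware:
    SELECT name, description
    FROM   sys.fn_helpcollations()
    WHERE  name LIKE N'%[_]100[_]%'
      AND  name LIKE N'%[_]SC';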