When is `nvarchar/nchar` going to be used with SQL Server 2019?

UTF-8 support gives you a new set of options. Potential space savings (without row or page compression) is one consideration, but the choice of type and encoding should probably be made primarily on the basis of actual requirements for comparison, sorting, data import, and export.

You may need to change more than you think, since e.g. an nchar(1) column provides two bytes of storage, which is enough to store any character in the BMP (code points U+0000 to U+FFFF). Some of the characters in that range are encoded with just 1 byte in UTF-8, while others require 2 or even 3 bytes (see this comparison chart for more details). Therefore, ensuring coverage of the same set of characters in UTF-8 would require char(3).
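As a quick way to see those per-character byte counts outside SQL Server (Python is used here purely to inspect the encodings; SQL Server itself is not involved, and the three characters are just arbitrary BMP examples):

```python
# UTF-8 byte counts for a few BMP code points, compared with UTF-16,
# checked with Python's codecs (independent of SQL Server).
for ch in ("A", "é", "日"):        # U+0041, U+00E9, U+65E5 — all in the BMP
    cp = ord(ch)
    print(f"U+{cp:04X} {ch!r}: "
          f"UTF-8 = {len(ch.encode('utf-8'))} byte(s), "
          f"UTF-16 = {len(ch.encode('utf-16-le'))} byte(s)")
```

UTF-8 needs 1, 2, and 3 bytes respectively for these, while UTF-16 needs a fixed 2 bytes for each.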

For example:

DECLARE @T AS table 
(
    n integer PRIMARY KEY,
    UTF16 nchar(1) COLLATE Latin1_General_CI_AS,
    UTF8 char(1) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);

INSERT @T (n, UTF16, UTF8)
SELECT 911, NCHAR(911), NCHAR(911);

gives the familiar error:

Msg 8152, Level 16, State 30, Line xxx
String or binary data would be truncated.

Or if trace flag 460 is active:

Msg 2628, Level 16, State 1, Line xxx
String or binary data would be truncated in table '@T', column 'UTF8'. Truncated value: ' '.

Expanding the UTF8 column to char(2) or varchar(2) resolves the error for NCHAR(911):

DECLARE @T AS table 
(
    n integer PRIMARY KEY,
    UTF16 nchar(1) COLLATE Latin1_General_CI_AS,
    UTF8 varchar(2) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);

INSERT @T (n, UTF16, UTF8)
SELECT 911, NCHAR(911), NCHAR(911);

However, if it were e.g. NCHAR(8364) (the euro sign, €), you would need to expand the column further, to char(3) or varchar(3).
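The byte counts behind those column sizes can be verified outside SQL Server as well; a small Python sketch (again, Python only to inspect the raw encodings):

```python
# NCHAR(911) is U+038F (Ώ) and NCHAR(8364) is U+20AC (€).
# Their UTF-8 encodings explain the char(2) vs char(3) requirement.
for cp in (911, 8364):
    b = chr(cp).encode('utf-8')
    print(f"U+{cp:04X} {chr(cp)!r}: {b!r} -> {len(b)} bytes in UTF-8")
```

U+038F encodes to two bytes, so varchar(2) suffices for it, while U+20AC encodes to three.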

Note also that the UTF-8 collations all support supplementary characters (they are all _SC collations), so they will not work with replication.

Aside from anything else, UTF-8 support is only in preview at this time, so not available for production use.


this *can* reduce the size of tables and indexes (emphasis added)

Reduction in size is only possible if most of the characters are essentially [space], 0 - 9, A - Z, a - z, and some basic punctuation. Outside of that specific set of characters (in practical usage terms, standard ASCII values 32 - 126), you will be at best equal in size to NVARCHAR / UTF-16, or in many cases larger.
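To illustrate with a few whole strings (Python again, just to compare encoded sizes; the sample strings are arbitrary):

```python
# Encoded size of sample strings: UTF-8 wins only when the text is
# (mostly) ASCII; Greek breaks even, and CJK text is larger than UTF-16.
for s in ("Hello", "Ωμέγα", "日本語"):
    print(f"{s}: UTF-8 = {len(s.encode('utf-8'))} bytes, "
          f"UTF-16 = {len(s.encode('utf-16-le'))} bytes")
```

The ASCII string is half the size in UTF-8, the Greek string is identical in size, and the CJK string is half as large again.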

I am planning to migrate the data as I believe reading less data will lead to better performance for the system overall.

Be careful. UTF-8 is not a magic "fix everything" switch. All other things being equal, yes, reading less does improve performance. But here "all other things" are not equal. Even when storing only standard ASCII characters (meaning: all characters are 1 byte, hence requiring half the space as compared to storing in NVARCHAR), there is a slight performance penalty for using UTF-8. I believe the issue is due to UTF-8 being a variable-length encoding, which means that each byte must be interpreted as it is read in order to know whether it is a complete character or whether the next byte is part of it. This means that all string operations need to start at the beginning and proceed byte-by-byte. On the other hand, NVARCHAR / UTF-16 is always 2 bytes (even Supplementary Characters are made up of two 2-byte Code Points), so everything can be read in 2-byte chunks.
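That byte-by-byte walk can be sketched roughly like this (a simplified illustration in Python, not SQL Server's actual implementation; real decoders also validate continuation bytes):

```python
# Determine the length of each UTF-8 sequence from its lead byte,
# then step through a byte string one character at a time.
def seq_len(lead: int) -> int:
    if lead < 0x80:
        return 1        # 0xxxxxxx: ASCII, complete in one byte
    if lead < 0xC0:
        raise ValueError("continuation byte cannot start a character")
    if lead < 0xE0:
        return 2        # 110xxxxx: start of a two-byte sequence
    if lead < 0xF0:
        return 3        # 1110xxxx: start of a three-byte sequence
    return 4            # 11110xxx: four-byte (supplementary) sequence

data = "AΏ€".encode('utf-8')
i = 0
while i < len(data):
    n = seq_len(data[i])
    print(data[i:i + n].decode('utf-8'), n, "byte(s)")
    i += n
```

By contrast, UTF-16 can always be consumed in fixed 2-byte code units, a supplementary character simply being a pair of them.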

In my testing, even with only standard ASCII characters, storing the data as UTF-8 provided no savings in elapsed time, but was definitely worse for CPU time. And that was without Data Compression, so at least there was less disk space used. But when using compression, the space required for UTF-8 was only 1% - 1.5% smaller. So, effectively, no space savings yet higher CPU time for UTF-8.

Things get more complicated when using NVARCHAR(MAX), since Unicode Compression does not work with that datatype, even if the value is small enough to be stored in-row. But, if the data is small enough, it should still benefit from Row or Page Compression (in which case it actually becomes faster than UTF-8). However, off-row data cannot use any compression. Still, making the table a Clustered Columnstore Index does greatly reduce the size of NVARCHAR(MAX) (even if it is still slightly larger than UTF-8 when using Clustered Columnstore Index).

Can anyone point out a scenario and reason not to use the char data types with UTF-8 encoding?

Definitely. In fact, I don't really find a compelling reason to use it in most cases. The only scenario that truly benefits from UTF-8 is:

  1. Data is mostly standard ASCII (values 0 - 127)
  2. It needs to be Unicode because it might need to store a wider range of characters than are available on any single 8-bit Code Page (i.e. VARCHAR)
  3. Most of the data is stored off-row (so Page compression doesn't even work)
  4. You have enough data that you need / want to reduce the size for non-query-performance reasons (e.g. reduce backup size, reduce time required to backup / restore, etc)
  5. You cannot use Clustered Columnstore Index (perhaps the usage of the table makes performance worse in this case?)

My testing shows that in nearly all cases, NVARCHAR was faster, especially when there was more data. In fact, 21k rows with an average of 5k characters per row required 165 MB for UTF-8 and 236 MB for NVARCHAR uncompressed. And yet the NVARCHAR was 2x faster in elapsed time, and at least 2x faster (sometimes more) in CPU time. Still, it did take up 71 MB more on disk.

Outside of that, I still wouldn't recommend using UTF-8, at least as of CTP 2, due to a variety of bugs that I have found in this feature.

For a detailed analysis of this new feature, including an explanation of the differences between UTF-16 and UTF-8, and a listing of those bugs, please see my post:

Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?