Detect if any values in NVARCHAR columns are actually unicode

Suppose one of your columns does not contain any Unicode data. To verify that, you would need to read the column value for every row. For a rowstore table, unless you have an index on the column, that means reading every data page in the table. With that in mind, I think it makes a lot of sense to combine all of the column checks into a single query against the table. That way you won't be reading the table's data many times, and you don't have to code a cursor or some other kind of loop.

To check a single column, I believe that you can just do this:

SELECT COLUMN_1
FROM [P].[Q]
WHERE CAST(COLUMN_1 AS VARCHAR(80)) <> CAST(COLUMN_1 AS NVARCHAR(80));

A cast from NVARCHAR to VARCHAR should give you the same result unless there are Unicode characters; any character that doesn't exist in the VARCHAR code page is converted to ?. The above code also handles NULL correctly, since NULL <> NULL does not evaluate to true, so NULL values are simply not returned. You have 24 columns to check, so you can check all of them in a single query by using scalar aggregates. One implementation is below:

SELECT 
  MAX(CASE WHEN CAST(COLUMN_1 AS VARCHAR(80)) <> CAST(COLUMN_1 AS NVARCHAR(80)) THEN 1 ELSE 0 END) COLUMN_1_RESULT
...
, MAX(CASE WHEN CAST(COLUMN_14 AS VARCHAR(30)) <> CAST(COLUMN_14 AS NVARCHAR(30)) THEN 1 ELSE 0 END) COLUMN_14_RESULT
...
, MAX(CASE WHEN CAST(COLUMN_23 AS VARCHAR(8)) <> CAST(COLUMN_23 AS NVARCHAR(8)) THEN 1 ELSE 0 END) COLUMN_23_RESULT
FROM [P].[Q];

For each column you will get a result of 1 if any of its values contain Unicode characters. A result of 0 means that all of that column's data can be safely converted.

I strongly recommend making a copy of the table with the new column definitions and copying your data there. You'll be doing expensive conversions if you do it in place, so making a copy might not be all that much slower. Having a copy means that you can easily validate that all of the data is still there (one way is to use the EXCEPT keyword, sketched below) and you can undo the operation very easily.
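For example, here is a minimal sketch of the EXCEPT validation, assuming the copy is created as [P].[Q_New] (a placeholder name) and listing every column in both SELECTs. Both queries should return zero rows if the two tables contain the same set of values:

-- Rows in the original that are missing (or changed) in the copy:
-- (If the copy uses a different Collation, add COLLATE clauses to avoid collation-conflict errors.)
SELECT COLUMN_1, COLUMN_2 /* ..., COLUMN_24 */ FROM [P].[Q]
EXCEPT
SELECT COLUMN_1, COLUMN_2 /* ..., COLUMN_24 */ FROM [P].[Q_New];

-- And the reverse direction, to catch extra rows in the copy:
SELECT COLUMN_1, COLUMN_2 /* ..., COLUMN_24 */ FROM [P].[Q_New]
EXCEPT
SELECT COLUMN_1, COLUMN_2 /* ..., COLUMN_24 */ FROM [P].[Q];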

Also, be aware that even if you don't have any Unicode data currently, it's possible that a future ETL run could load Unicode into a previously clean column. If there is no check for this in your ETL process, you should consider adding one before doing this conversion.
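For example, a minimal sketch of such a check, assuming the ETL lands data in a staging table first ([Staging].[Q_Load] is a placeholder name; repeat the test for each column you convert):

-- Fail the load if any incoming value would lose characters in the VARCHAR conversion.
IF EXISTS
(
    SELECT 1
    FROM   [Staging].[Q_Load]
    WHERE  CAST(COLUMN_1 AS VARCHAR(80)) <> CAST(COLUMN_1 AS NVARCHAR(80))
)
BEGIN
    RAISERROR(N'Incoming data contains characters that will not convert to VARCHAR in COLUMN_1.', 16, 1);
END;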


Before doing anything, please consider the questions posed by @RDFozz in a comment on the question, namely:

  1. Are there any other sources besides [Q].[G] populating this table?

    If the response is anything outside of "I am 100% certain that this is the only source of data for this destination table", then don't make any changes, regardless of whether or not the data currently in the table can be converted without data loss.

  2. Are there any plans / discussions related to adding additional sources to populate this data in the near future?

    And I would add a related question: Has there been any discussion around supporting multiple languages in the current source table (i.e. [Q].[G]) by converting it to NVARCHAR?

    You will need to ask around to get a sense of these possibilities. I assume you haven't currently been told anything that would point in this direction, or else you wouldn't be asking this question; but if these questions have simply been assumed to be "no", then they need to be asked, and asked of a wide-enough audience to get the most accurate / complete answer.

The main issue here is not so much having Unicode code points that can't convert (ever), but more so having code points that won't all fit onto a single code page. That is the nice thing about Unicode: it can hold characters from ALL code pages. If you convert from NVARCHAR – where you don't need to worry about code pages – to VARCHAR, then you will need to make sure that the Collation of the destination column is using the same code page as the source column. This assumes having either one source, or multiple sources using the same code page (not necessarily the same Collation, though). But if there are multiple sources with multiple code pages, then you can potentially run into the following problem:

DECLARE @Reporting TABLE
(
  ID INT IDENTITY(1, 1) PRIMARY KEY,
  SourceSlovak VARCHAR(50) COLLATE Slovak_CI_AS,
  SourceHebrew VARCHAR(50) COLLATE Hebrew_CI_AS,
  Destination NVARCHAR(50) COLLATE Latin1_General_CI_AS,
  DestinationS VARCHAR(50) COLLATE Slovak_CI_AS,
  DestinationH VARCHAR(50) COLLATE Hebrew_CI_AS
);

INSERT INTO @Reporting ([SourceSlovak]) VALUES (0xDE20FA);
INSERT INTO @Reporting ([SourceHebrew]) VALUES (0xE820FA);

UPDATE @Reporting
SET    [Destination] = [SourceSlovak]
WHERE  [SourceSlovak] IS NOT NULL;

UPDATE @Reporting
SET    [Destination] = [SourceHebrew]
WHERE  [SourceHebrew] IS NOT NULL;

SELECT * FROM @Reporting;

UPDATE @Reporting
SET    [DestinationS] = [Destination],
       [DestinationH] = [Destination]

SELECT * FROM @Reporting;

Returns (2nd result set):

ID    SourceSlovak    SourceHebrew    Destination    DestinationS    DestinationH
1     Ţ ú             NULL            Ţ ú            Ţ ú             ? ?
2     NULL            ט ת             ? ?            ט ת             ט ת

As you can see, all of those characters can convert to VARCHAR, just not in the same VARCHAR column.

Use the following query to determine what the code page is for each column of your source table:

SELECT OBJECT_NAME(sc.[object_id]) AS [TableName],
       COLLATIONPROPERTY(sc.[collation_name], 'CodePage') AS [CodePage],
       sc.*
FROM   sys.columns sc
WHERE  OBJECT_NAME(sc.[object_id]) = N'source_table_name';

THAT BEING SAID....

You mentioned being on SQL Server 2008 R2, BUT you didn't say what Edition. IF you happen to be on Enterprise Edition, then forget about all this conversion stuff (since you are likely doing it just to save space), and enable Data Compression:

Unicode Compression Implementation
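For example, a minimal sketch, assuming the table is the [P].[Q] table from above and is not currently compressed (ROW compression is sufficient to get Unicode compression on NVARCHAR(n) columns):

-- Estimate the savings first:
EXEC sys.sp_estimate_data_compression_savings
     @schema_name      = N'P',
     @object_name      = N'Q',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = N'ROW';

-- Then rebuild with ROW compression; Unicode compression kicks in automatically
-- for NVARCHAR columns, so mostly-ASCII data drops to roughly 1 byte per character.
ALTER TABLE [P].[Q] REBUILD WITH (DATA_COMPRESSION = ROW);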

If using Standard Edition (and it now seems that you are), then there is another looooong-shot possibility: upgrade to SQL Server 2016, since SP1 includes the ability for all Editions to use Data Compression (remember, I did say "long-shot").

Of course, now that it has been clarified that there is only one source for the data, you don't have anything to worry about, since the source couldn't contain any Unicode-only characters, or characters outside of its specific code page. In which case, the only thing you should need to be mindful of is using the same Collation as the source column, or at least one that is using the same Code Page. Meaning, if the source column is using SQL_Latin1_General_CP1_CI_AS, then you could use Latin1_General_100_CI_AS at the destination.
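You can confirm that two Collations share a code page with COLLATIONPROPERTY; for the two Collations mentioned above, both calls return 1252:

SELECT COLLATIONPROPERTY(N'SQL_Latin1_General_CP1_CI_AS', 'CodePage') AS [SourceCodePage],
       COLLATIONPROPERTY(N'Latin1_General_100_CI_AS', 'CodePage') AS [DestinationCodePage];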

Once you know what Collation to use, you can either:

  • ALTER TABLE ... ALTER COLUMN ... to be VARCHAR (be sure to specify the current NULL / NOT NULL setting), which requires a bit of time and a lot of transaction log space for 87 million rows, OR

  • Create new "ColumnName_tmp" columns for each one and slowly populate via UPDATE doing TOP (1000) ... WHERE new_column IS NULL. Once all rows are populated (and validated that they all copied over correctly; you might need a trigger to handle UPDATEs, if there are any), in an explicit transaction, use sp_rename to swap the column names of the "current" columns to be "_Old" and then the new "_tmp" columns to simply remove the "_tmp" from the names. Then call sp_recompile on the table to invalidate any cached plans referencing the table, and if there are any Views referencing the table you will need to call sp_refreshview. Once you have validated that the app and ETL are working correctly with it, you can drop the "_Old" columns. A rough sketch of this approach follows the list.
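Here is that rough sketch of the second option, using placeholder names (dbo.TableName, COLUMN_1, a VARCHAR(80) target, and the SQL_Latin1_General_CP1_CI_AS Collation; adjust the length, Collation, and batch size to your situation):

-- 1) Add the new column (nullable while back-filling).
ALTER TABLE dbo.TableName
    ADD COLUMN_1_tmp VARCHAR(80) COLLATE SQL_Latin1_General_CP1_CI_AS NULL;

-- 2) Back-fill in small batches to keep transaction log usage and blocking low.
WHILE 1 = 1
BEGIN
    UPDATE TOP (1000) dbo.TableName
    SET    COLUMN_1_tmp = COLUMN_1
    WHERE  COLUMN_1_tmp IS NULL
    AND    COLUMN_1 IS NOT NULL;

    IF @@ROWCOUNT = 0 BREAK;
END;

-- 3) Swap the columns in an explicit transaction
--    (only after validating the copy and handling any concurrent UPDATEs).
BEGIN TRANSACTION;
    EXEC sp_rename N'dbo.TableName.COLUMN_1',     N'COLUMN_1_Old', 'COLUMN';
    EXEC sp_rename N'dbo.TableName.COLUMN_1_tmp', N'COLUMN_1',     'COLUMN';
COMMIT TRANSACTION;

-- 4) Invalidate cached plans that reference the table.
EXEC sp_recompile N'dbo.TableName';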


I have some experience with this from back when I had a real job. Since at the time I wanted to preserve the base data, and I also had to account for new data that could possibly have characters that would get lost in the shuffle, I went with a non-persisted computed column.

Here's a quick example using a copy of the Super User database from the SO data dump.

We can see right off the bat that there are DisplayNames with Unicode characters:

[screenshot: DisplayName values containing Unicode characters]

So let's add a computed column to figure out how many! The DisplayName column is NVARCHAR(40).

USE SUPERUSER;
GO

ALTER TABLE dbo.Users
ADD DisplayNameStandard AS CONVERT(VARCHAR(40), DisplayName);
-- GO separates the batches so that the SELECT below compiles after the new column exists.
GO

SELECT COUNT_BIG(*)
FROM dbo.Users AS u
WHERE u.DisplayName <> u.DisplayNameStandard;

The count returns ~3000 rows:

[screenshot: count of rows where DisplayName differs from DisplayNameStandard]

The execution plan is a bit of a drag, though. The query finishes fast, but this data set isn't terribly large.

[screenshot: execution plan for the query above]

Since computed columns don't need to be persisted to add an index, we can do one of these:

CREATE UNIQUE NONCLUSTERED INDEX ix_helper
ON dbo.Users(DisplayName, DisplayNameStandard, Id)

Which gives us a slightly tidier plan:

[screenshot: execution plan using the new index]

I understand if this isn't the answer, since it involves architectural changes, but considering the size of the data, you're probably looking at adding indexes to cope with queries that self join the table anyway.

Hope this helps!