nvarchar(max) converting to varchar and table optimization

{ This is a little lengthy, perhaps, but your actual problem(s) cannot be solved by looking at execution plans. There are two main issues, and both are architectural. }

Distractions

Let's start with the items that are not your major problem areas. These are still worth looking into, since it definitely helps performance to use the datatypes that you actually need rather than a general, one-size-fits-most datatype. There is a very good reason why the different datatypes exist, and if storing 100 characters in NVARCHAR(MAX) had no negative impact on queries (or any other aspect of the system), then everything would be stored as NVARCHAR(MAX). However, cleaning up these areas won't lead to true scalability.

to MAX, or not to MAX

I am working with a table that has all character types set to nvarchar some of them are nvarchar(max).

Ok. This is not necessarily a bad thing, though most often there is at least one field of a numeric type to act as an ID. Still, there are certainly valid cases for the scenario described so far. And there is nothing inherently bad about MAX fields, since they will store the data on the data page (i.e. in row) if the data can fit there, and in that situation they should perform as well as a non-MAX value of the same datatype. But yes, a bunch of MAX type fields is a sign of sloppiness in the data modeling, and it makes it far more likely that most (or all) of that MAX data gets stored on separate data pages (i.e. off row) that need an extra lookup, hence less efficient.
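
If you want to check how much of the MAX data is actually going off-row, something along these lines gives a rough picture (assuming the table is dbo.[Listings], as used in the queries later in this answer; 'DETAILED' mode reads every page, so run it during a quiet period):

SELECT ips.index_id,
       ips.alloc_unit_type_desc,  -- IN_ROW_DATA vs LOB_DATA vs ROW_OVERFLOW_DATA
       ips.page_count,
       ips.record_count
FROM   sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.Listings'),
                                      NULL, NULL, 'DETAILED') ips;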

VARCHAR vs NVARCHAR

We are converting all these to varchar...

Ok, but why exactly? (Yes, I know that the info and comments following this statement add clarity, but I am going in order so as to preserve the conversational aspect.) Each datatype has its place. VARCHAR is 1 byte per character and can represent 256 characters (most of the time), as defined on a single code page. While character values 0 - 127 are the same between code pages, character values 128 - 255 can change:

;WITH chars ([SampleCharacters]) AS
(
  SELECT CHAR(42) + ' '   -- *
       + CHAR(65) + ' '   -- A
       + CHAR(126) + ' '  -- ~
   -------------------------------
       + CHAR(128) + ' '  -- €
       + CHAR(149) + ' '  -- •
       + CHAR(165) + ' '  -- ¥, Y, ?
       + CHAR(183) + ' '  -- ·, ?
       + CHAR(229) + ' '  -- å, a, ?
)
SELECT chr.SampleCharacters COLLATE SQL_Latin1_General_CP1_CI_AS AS [SQL_Latin1_General_CP1_CI_AS],
       chr.SampleCharacters COLLATE SQL_Latin1_General_CP1255_CI_AS AS [SQL_Latin1_General_CP1255_CI_AS],
       chr.SampleCharacters COLLATE Thai_CI_AS_KS_WS AS [Thai_CI_AS_KS_WS],
       chr.SampleCharacters COLLATE Yakut_100_CS_AS_KS AS [Yakut_100_CS_AS_KS],
       chr.SampleCharacters COLLATE Albanian_CS_AI AS [Albanian_CS_AI]
FROM   chars chr;

Please note that it is possible for VARCHAR data to take up 2 bytes per character and represent more than 256 characters. For more information on Double-Byte Character Sets, please see the following answer: Storing Japanese characters in a table.

NVARCHAR is stored as UTF-16 (Little Endian) and is either 2 or 4 bytes per character, which can represent the full Unicode spectrum. So, if your data will need to ever store more characters than can be represented by a single code page, then switching to VARCHAR won't truly help you.

Prior to converting to VARCHAR, you need to make sure that you are not storing any Unicode-only characters (i.e. characters that cannot be represented on the code page of the target VARCHAR column's collation). Try the following query to see if there are any rows that cannot be converted to VARCHAR without losing data:

SELECT tbl.PKfield, tbl.SubType
FROM   dbo.[Listings] tbl
WHERE  tbl.SubType <> CONVERT(NVARCHAR(MAX), CONVERT(VARCHAR(MAX), tbl.SubType))
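
If that query returns no rows (and you are confident that Unicode-only values will not show up later), then the column can simply be altered in place. The length and NULLability here are placeholders; match them to the real column definition (and remember that any index containing the column has to be dropped first, which you are doing anyway):

-- Hypothetical example: pick the real target length / NULLability for your data
ALTER TABLE dbo.[Listings]
    ALTER COLUMN [SubType] VARCHAR(100) NULL;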

To clarify how NVARCHAR works: the max length of an NVARCHAR field is the number of 2-byte characters. Hence, NVARCHAR(50) will allow for a maximum of 100 bytes. How many characters fit into those 100 bytes depends on how many 4-byte characters there are: none allows you to fit all 50 characters, all characters being 4 bytes fits only 25 characters, and there are many combinations in between.
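
A quick way to see this 2-bytes-vs-4-bytes behavior (just a sketch; the emoji is simply a convenient 4-byte / supplementary character, and the script needs to be saved as Unicode for the literal to survive):

-- N'😀' is 1 supplementary character = 2 UTF-16 code units = 4 bytes
DECLARE @val NVARCHAR(50) = REPLICATE(N'😀', 50);  -- try to store 50 of them
SELECT DATALENGTH(@val) AS [BytesStored];          -- 100 bytes: only 25 actually fit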

Another thing to consider regarding space taken up by VARCHAR vs NVARCHAR: starting in SQL Server 2008 (Enterprise and Developer editions only!) you can enable Row or Page compression on Tables, Indexes, and Indexed Views. For situations where much of the data within an NVARCHAR field can actually fit within VARCHAR without any data loss, compression will allow for characters that do fit into VARCHAR to be stored as 1 byte. And only characters that require either 2 or 4 bytes will take up that space. This should remove one of the larger reasons that people often choose to stick with VARCHAR. For more info on Compression, please see the MSDN page for Creating Compressed Tables and Indexes. Please note that data in MAX datatypes that is being stored off-row is not compressible.
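
To see whether compression is worth it for your table, you can get an estimate before committing to a rebuild (table and schema names assumed from the rest of this answer; note that since SQL Server 2016 SP1, compression is available in all editions):

-- Estimate the space savings of ROW compression (NULL index_id = all indexes on the table)
EXEC sys.sp_estimate_data_compression_savings
     @schema_name      = N'dbo',
     @object_name      = N'Listings',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = N'ROW';

-- If the estimate looks worthwhile:
ALTER TABLE dbo.[Listings] REBUILD WITH (DATA_COMPRESSION = ROW);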

Real Areas of Concern

The following areas should be addressed if you want this table to be truly scalable.

Problemo Numero Uno

...and specifying a character width based upon the actual usage in production. The production data uses a range of 2 characters up to 900 characters of actual used width for any given column. We are going to add padding of 10% when applicable.

Uh, what? Have you added all of those values up? Given how many MAX fields you have, it is possible that 1 or more of those fields has 900 characters, and even though that should equate to 1800 bytes, the value stored on the main data page is just 24 bytes (not always 24, as the size varies depending on several factors). And that could be why there are so many MAX fields: the values couldn't fit in another NVARCHAR(100) (taking up to 200 bytes), but the row did have room for a 24-byte pointer.
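
Before deciding on new lengths (with or without the 10% padding), it is worth measuring what is actually in each column today, both in characters and in bytes; something along these lines, repeated (or generated dynamically) per column:

-- Measure the real usage of a column (shown here for [SubType] as an example)
SELECT MAX(LEN(tbl.[SubType]))        AS [MaxChars],
       MAX(DATALENGTH(tbl.[SubType])) AS [MaxBytes],
       AVG(DATALENGTH(tbl.[SubType])) AS [AvgBytes]
FROM   dbo.[Listings] tbl;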

If the goal is to improve performance, then converting the full strings to codes is, on some levels, a step in the right direction. You are drastically reducing the size of each row which is more efficient for the buffer pool and disk I/O. And shorter strings take less time to compare. That's good, but not awesome.

If the goal is to dramatically improve performance, then converting to codes is the wrong step in the right direction. It still relies upon string-based scans (with 30 indexes and 140 columns, there should be a lot of scans, unless most of the fields are not used for filtering), and I assume that those will be case-insensitive scans at that, which are less efficient than if they were case-sensitive or binary (i.e. using a case-sensitive, or binary, collation).

Additionally, converting to string-based codes ultimately misses the point of how to properly optimize a transactional system. Are these codes going to be entered in on a search form? Having people use 'S' for [SubType] is far less meaningful than searching on 'Single Family'.

There is a way to retain your full descriptive text while both reducing the space used and greatly speeding up queries: create a lookup table. You should have a table named [SubType] that stores each of the descriptive terms distinctly and has a [SubTypeID] for each one. If the data is part of the system (i.e. an enum), then the [SubTypeID] field should not be an IDENTITY field as the data should get populated via a release script. If the values are entered by end users, then the [SubTypeID] field should be an IDENTITY. In both situations:

  • [SubTypeID] is the Primary Key.
  • Most likely use INT for [SubTypeID].
  • If the data is internal / system data, and you know that the max number of distinct values will always be under 40k, then you might be able to get away with SMALLINT. If you start numbering at 1 (either manually or via the IDENTITY seed), then you get a max of 32,767 values. But if you start at the lowest value, -32,768, then you get the full 65,536 values to use.
  • If you are using Enterprise Edition, then enable Row or Page Compression
  • Descriptive text field can be called either [SubType] (same as the table name), or maybe [SubTypeDescription]
  • UNIQUE INDEX on [SubTypeDescription]. Keep in mind that indexes have a max size of 900 bytes. If the max length of this data in Production is 900 characters, and if you do need NVARCHAR, then this might work with compression enabled, OR use VARCHAR only if you definitely do NOT need to store Unicode characters, ELSE enforce uniqueness via an AFTER INSERT, UPDATE trigger.
  • [Listings] table has [SubTypeID] field.
  • [SubTypeID] field in [Listings] table is Foreign Key, referencing [SubType].[SubTypeID].
  • Queries can now JOIN the [SubType] and [Listings] tables and search on the full text of [SubTypeDescription] (case-insensitive, even, same as current functionality), while using that ID to perform a very efficient search on the indexed FK field in [Listings].

This approach can (and should) be applied to other fields in this table (and other tables) that behave similarly.
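
To make that concrete, here is a rough sketch of the lookup-table pattern for [SubType] (names, lengths, NULLability, and the decision to not use IDENTITY are illustrative; adjust per the list above):

CREATE TABLE dbo.[SubType]
(
    [SubTypeID]          INT NOT NULL   -- add IDENTITY(1, 1) only if end users create the values
        CONSTRAINT [PK_SubType] PRIMARY KEY,
    [SubTypeDescription] NVARCHAR(450) NOT NULL  -- 900 bytes: right at the index key limit (see caveats above)
        CONSTRAINT [UQ_SubType_SubTypeDescription] UNIQUE
);

ALTER TABLE dbo.[Listings]
    ADD [SubTypeID] INT NULL
        CONSTRAINT [FK_Listings_SubType]
        FOREIGN KEY REFERENCES dbo.[SubType] ([SubTypeID]);

CREATE NONCLUSTERED INDEX [IX_Listings_SubTypeID] ON dbo.[Listings] ([SubTypeID]);

-- Search on the full descriptive text, but filter [Listings] via the indexed FK
SELECT lst.PKfield, st.[SubTypeDescription]
FROM   dbo.[Listings] lst
INNER JOIN dbo.[SubType] st
        ON st.[SubTypeID] = lst.[SubTypeID]
WHERE  st.[SubTypeDescription] = N'Single Family';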

Problemo Numero Dos

A large overhaul of this table that consists literally 140 (nvarchar) columns, 11 being MAX. I am dropping 30 indexes and recreating them afterwards.

If this is a transactional system and not a data warehouse, then I would say that (again, generally), 140 columns is too much to handle efficiently. I highly doubt that all 140 fields are used at the same time and/or have the same use cases. The fact that 11 are MAX is irrelevant if they need to contain more than 4000 characters. BUT, having 30 indexes on a transactional table is again a bit unwieldy (as you are clearly seeing).
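
Before blindly recreating all 30 indexes, it might be worth checking how they are actually used; sys.dm_db_index_usage_stats resets on instance restart, so treat the numbers as a rough guide only (table name assumed, as elsewhere in this answer):

SELECT ind.[name]       AS [IndexName],
       usg.user_seeks, usg.user_scans, usg.user_lookups, usg.user_updates
FROM   sys.indexes ind
LEFT JOIN sys.dm_db_index_usage_stats usg
       ON  usg.[object_id] = ind.[object_id]
       AND usg.index_id    = ind.index_id
       AND usg.database_id = DB_ID()
WHERE  ind.[object_id] = OBJECT_ID(N'dbo.Listings');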

Is there a technical reason why the table needs to have all 140 fields? Can those fields be split into a few smaller groups? Consider the following (a rough DDL sketch follows this list):

  • Find the "core" (most important / frequently used) fields and put them into the "main" table, named [Listing] (I prefer to keep with singular words so that the ID field can easily be just TableName + "ID").
  • [Listing] table has this PK: [ListingID] INT IDENTITY(1, 1) NOT NULL CONSTRAINT [PK_Listing] PRIMARY KEY
  • "secondary" tables are named as [Listing{GroupName}] (e.g. [ListingPropertyAttribute] -- "attributes" as in: NumberOfBedrooms, NumberOfBathrooms, etc).
  • [ListingPropertyAttribute] table has this PK: [ListingID] INT NOT NULL CONSTRAINT [PK_ListingPropertyAttribute] PRIMARY KEY CONSTRAINT [FK_ListingPropertyAttribute_Listing] FOREIGN KEY REFERENCES [Listing] ([ListingID])
    • notice no IDENTITY here
    • notice PK name is the same between "core" and "secondary" tables
    • notice PK and FK to "core" table is the same field
  • "core" [Listing] table gets both [CreatedDate] and [LastModifiedDate] fields
  • "secondary" tables only get [LastModifiedDate] field. The assumption is that all secondary tables get their rows populated at the same time as the "core" table (i.e. all rows should always be represented across all "secondary" tables). Hence the [CreatedDate] value in the "core" [Listing] table would always be the same across all "secondary" tables, on a per-row basis, so no need to duplicate it across the "secondary" tables. But they can each be updated at different times.

This structure does increase the number of JOINs that many queries will need, though one or more Views can be created to encapsulate the more frequently used JOINs for coding convenience (a sketch of such a View follows the list below). But on the plus side:

  • when it comes to DML statements, there should be much less contention since the "core" table should get most of the updates.
  • most updates will take less time since they are modifying smaller rows.
  • index maintenance on each of the new tables (both "core" and "secondary" tables) should be faster, at least on a per-table basis.
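
And a sketch of the kind of convenience View mentioned above (the column list is illustrative):

CREATE VIEW dbo.[ListingWithPropertyAttributes]
AS
SELECT lst.[ListingID],
       lst.[CreatedDate],
       lst.[LastModifiedDate],
       lpa.[NumberOfBedrooms],
       lpa.[NumberOfBathrooms]
FROM   dbo.[Listing] lst
INNER JOIN dbo.[ListingPropertyAttribute] lpa
        ON lpa.[ListingID] = lst.[ListingID];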

Recap

The current model is designed to be inefficient, and it seems to be fulfilling that design goal (i.e. it is slow). If you want the system to be fast, then the data model needs to be designed to be efficient, not merely less inefficient.


In what situations is varchar(max) preferred

Commenters have addressed this point in detail already. I'd say that VARCHAR(MAX) generally makes sense if you are 100% sure that the column will never need characters outside of its collation's code page and the column's maximum length is either unknown or greater than 8,000 bytes (the limit for non-MAX VARCHAR). You can read https://stackoverflow.com/questions/7141402/why-not-use-varcharmax for a similar question.

We are having update procedures timing out

Based on the execution plan, I think that a major factor impacting the performance of your update could be that you have a full-text index on the column being updated and you are using CHANGE_TRACKING = AUTO for that full-text index.

The script at the bottom of this answer shows a simple update statement on a modest number of rows where the performance goes from 19ms to over 500ms just by adding such a full-text index.

Depending on the business needs for your full-text search, you might be able to create the full-text index with CHANGE_TRACKING = OFF (which will incur none of this overhead) and periodically run ALTER FULLTEXT INDEX...START FULL POPULATION or START INCREMENTAL POPULATION in order to sync the table's data into the full-text index. See the BOL article here: https://msdn.microsoft.com/en-us/library/ms187317.aspx
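
For example (assuming the full-text index is on the dbo.[Listings] table from your question; INCREMENTAL POPULATION also requires a timestamp/rowversion column on the table, otherwise use FULL POPULATION):

-- Stop maintaining the full-text index synchronously with each DML statement
ALTER FULLTEXT INDEX ON dbo.[Listings] SET CHANGE_TRACKING OFF;

-- Then refresh it on your own schedule (e.g. from a SQL Agent job)
ALTER FULLTEXT INDEX ON dbo.[Listings] START INCREMENTAL POPULATION;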

-- Create test data on a new database
CREATE DATABASE TestFullTextUpdate
GO
USE TestFullTextUpdate
GO
CREATE TABLE dbo.fulltextUpdateTest (
    id INT NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_fulltextUpdateTest PRIMARY KEY,
    object_counter_name NVARCHAR(512) NOT NULL
)
GO

--13660 row(s) affected
INSERT INTO dbo.fulltextUpdateTest (object_counter_name)
SELECT object_name + ': ' + counter_name
FROM sys.dm_os_performance_counters
CROSS JOIN (SELECT TOP 10 * FROM master..spt_values) x10
GO

-- ~19ms per update
DECLARE @startDate DATETIME2 = GETDATE()
SET NOCOUNT ON
UPDATE dbo.fulltextUpdateTest SET object_counter_name = object_counter_name
SET NOCOUNT OFF
DECLARE @endDate DATETIME2 = GETDATE()
PRINT(DATEDIFF(ms, @startDate, @endDate))
GO 10

-- Add a fulltext index with the default change-tracking behavior
CREATE FULLTEXT CATALOG DefaultFulltextCatalog AS DEFAULT AUTHORIZATION dbo
GO
CREATE FULLTEXT INDEX ON dbo.fulltextUpdateTest (object_counter_name)
KEY INDEX PK_fulltextUpdateTest
WITH CHANGE_TRACKING = AUTO
GO

-- ~522ms per update
-- Execution plan, like the plan in your question, shows that we must
-- touch the fulltext_index_docidstatus for each row that is updated
DECLARE @startDate DATETIME2 = GETDATE()
SET NOCOUNT ON
UPDATE dbo.fulltextUpdateTest SET object_counter_name = object_counter_name
SET NOCOUNT OFF
DECLARE @endDate DATETIME2 = GETDATE()
PRINT(DATEDIFF(ms, @startDate, @endDate))
GO 10

-- Cleanup
USE master
GO
DROP DATABASE TestFullTextUpdate
GO