Are duplicated NVARCHAR values stored as copies in SQL Server?

Under "normal" conditions, no, data in VARCHAR and NVARCHAR columns is not de-duped (although duplicate attribute and/or element names in a single XML value are reduced to a unique instance).

Using one of the Data Compression options is probably your best bet. Here are some things to consider:

  1. Unicode Compression (part of Row Compression) only works on NVARCHAR(1 - 4000), not NVARCHAR(MAX) (please vote for / support: Unicode compression NVARCHAR(MAX)).
  2. Page Compression can work with NVARCHAR(MAX), but only for in-row data. Off-row data (LOB pages) is not compressed.

Since the data won't really be changing, you should look into the Columnstore Index options (also available in Azure SQL Database):

  • Columnstore indexes: Overview
  • Columnstore indexes - Design guidance

Columnstore compression should be better than Page compression.

 

Also, you should probably avoid using the SPARSE option, due to:

  1. The SPARSE option offers no benefit compared to Data Compression.
  2. It is mainly intended for columns sets / wide tables (i.e. up to 30,000 columns).
  3. It mostly helps with fixed-length datatypes (e.g. INT, DATETIME). So, prior to Data Compression being available, it was useful for CHAR and NCHAR, but not for VARCHAR or NVARCHAR as they don't take up space when NULL.
  4. It only benefits columns set to NULL.
  5. It slightly hurts non-NULL values by adding 2 bytes to each one.
  6. You should probably have NULL for 50% - 60% of the rows in order to get enough savings for it to be worth using this option.

For more details on working with character data, please see the following post of mine:

How Many Bytes Per Character in SQL Server: a Completely Complete Guide


Yes duplicated data is stored as copies in SQL Server

To change this behavior, you would need to implement PAGE COMPRESSION feature - create or rebuild indexes with (data_compression = on) option

It is a great feature that helps to save space. Once you enable it, SQL Server is pointing to the same reference of the string behind the scenes

Beware that PAGE COMPRESSION is not available in every SQL Server Edition, and it can have some CPU overhead

So you might want to make a lookup table if your edition does not allow for page compression


Compression is fine and probably useful in your case, but you should normalize your tables anyway and try to achieve a structure where you're only storing one copy of the same address. This will lead to less redundancy, and lighten the primary table you're currently asking about.

Another two things to consider too are:

  1. NVARCHAR can use twice as much space as a VARCHAR of equally used length (Solomon Rutzky's answer links a good article regarding this), so it can be more data heavy. If you can use VARCHAR instead, you can probably drop the length down to something much more reasonable for an address field and save a lot more space. Here's an article comparing the two: SQL varchar data type deep dive

  2. You should also look into sparse columns which can save you a significant amount of space too when used correctly. Here's the Microsoft docs: Use Sparse Columns

    (Important to note that a minimum amount of NULL values need to be present for sparse columns to be worth looking into.)