How to create Unicode parameter and variable names

Well, identifiers are always Unicode / NVARCHAR, so technically you can't create anything that doesn't have a Unicode name 🙃.

The problem you are having here is due entirely to the classification of the character(s) being used. The rules for regular (i.e. non-delimited) identifiers are:

First letter must be:
- A letter as defined by the Unicode Standard 3.2.
- underscore (_), at sign (@), or number sign (#)
Subsequent letters can be:
- Letters as defined in the Unicode Standard 3.2.
- Decimal numbers from either Basic Latin or other national scripts.
- underscore (_), at sign (@), number sign (#), or dollar sign ($)
Embedded spaces or special characters are not allowed.
Supplementary characters are not allowed.

The rules that matter in this context are the "Subsequent letters" rules, chiefly the ones covering letters and decimal numbers. The "First letter" rules are not relevant here because the first character of every local variable and parameter name is always the at sign (@).

And to be clear: what is considered a "letter" and what is considered a "decimal digit" is based upon the properties that each character is assigned in the Unicode Character Database. Unicode assigns many properties to each character, such as: is_uppercase, is_lowercase, is_digit, is_decimal, is_combining, etc, etc. This is not a matter of what we mortals would consider letters or decimal digits, but which characters have been assigned these properties. These properties are often used in Regular Expressions to match on "punctuation", etc. For example, \p{Lu} matches any upper-case letter (across all languages / scripts), and \p{IsDingbats} matches any "Dingbats" character.
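To make the idea of per-character properties concrete, here is a minimal Python sketch (Python is only used for illustration; it is not part of the original question). It relies on the standard library's `unicodedata.ucd_3_2_0` object, which exposes the Unicode 3.2 database, and looks up the general category and decimal value of a few characters. The sample characters and the helper name `describe` are assumptions picked for this example.

```python
import unicodedata

# Python bundles a frozen copy of the Unicode 3.2 database next to the current one.
ucd32 = unicodedata.ucd_3_2_0

def describe(ch):
    """Return (code point, general category, decimal value or None) per Unicode 3.2."""
    return (f"U+{ord(ch):04X}", ucd32.category(ch), ucd32.decimal(ch, None))

# Hypothetical sample set: a letter, two decimal digits, a "letter number", punctuation.
samples = {
    "katakana letter tu": "\u30C4",
    "devanagari digit six": "\u096C",
    "fullwidth digit nine": "\uFF19",
    "roman numeral eight": "\u2167",
    "solidus": "/",
}

for label, ch in samples.items():
    print(label, describe(ch))
```

Categories starting with "L" are letters, "Nd" is a decimal number, "Nl" is a letterlike number, and "Po" is punctuation. The Roman numeral is a number to a human reader, yet it has no decimal value in the database, which is the kind of distinction that matters here.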

So, in your attempt to do:

DECLARE @¯\_(ツ)_/¯ INT;

only the _ (underscore or "low line") and ツ (Katakana Letter Tu U+30C4) characters fit into those rules. Now, all of the characters in ¯\_(ツ)_/¯ are fine for delimited identifiers, but unfortunately it seems that variable / parameter names and GOTO labels cannot be delimited (although cursor names can be).
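As a quick cross-check of that claim, the following Python sketch applies the documented "subsequent letters" rule to each character of the attempted name. It is an approximation built on Python's bundled Unicode 3.2 data (`unicodedata.ucd_3_2_0`), not SQL Server's parser, and the function name `allowed_after_first` is hypothetical.

```python
import unicodedata

ucd32 = unicodedata.ucd_3_2_0  # Unicode 3.2 data shipped with Python

def allowed_after_first(ch):
    """Documented rule for subsequent characters: letter, decimal digit, or _ @ # $."""
    if ord(ch) > 0xFFFF:  # supplementary characters are not allowed
        return False
    cat = ucd32.category(ch)
    return cat.startswith("L") or cat == "Nd" or ch in "_@#$"

# The name from the DECLARE statement, without the leading @, written with escapes.
name = "\u00AF\\_(\u30C4)_/\u00AF"

report = [(ch, ucd32.category(ch), allowed_after_first(ch)) for ch in name]
for row in report:
    print(row)

# Characters that pass the rule, in order of appearance.
accepted = [ch for ch, _, ok in report if ok]
```

Only the two underscores and the Katakana letter pass; the macron, backslash, parentheses, and slash are all classified as symbols or punctuation.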

So, for variable / parameter names, since they cannot be delimited, you are stuck with using only characters that qualify as being either "letters" or "decimal digits" as of Unicode 3.2 (well, according to the documentation; I need to test if classifications have been updated for newer versions of Unicode since classifications are handled differently than sort weights).

HOWEVER #1, things are not as straightforward as they should be. I have now been able to complete my research and have found that the stated definition is not entirely correct. The precise (and verifiable) definition of which characters are valid for regular identifiers is:

First character:
- Can be anything classified in Unicode 3.2 as "ID_Start" (which does include "Letters" but also "letterlike numeric characters")
- Can be _ (low line / underscore) or ＿ (fullwidth low line)
- Can be @, but only for variables / parameters
- Can be #, but for a schema-bound object only for Tables and Stored Procedures (in which case it indicates that the object is temporary)
Subsequent characters:
- Can be anything classified in Unicode 3.2 as "ID_Continue" (which includes "decimal" numbers, but also "spacing and nonspacing combining marks", and "connecting punctuation marks")
- Can be @, #, or $
- Can be any of the 26 characters classified in Unicode 3.2 as format control characters

(fun fact: the "ID" in "ID_Start" and "ID_Continue" stands for "Identifier". Imagine that ;-)
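The rules above can be sketched as a small checker. This is a rough Python approximation, not what SQL Server actually runs: it assumes ID_Start and ID_Continue can be derived from general categories (Lu, Ll, Lt, Lm, Lo, Nl, plus Mn, Mc, Nd, Pc for continuation), it treats every "Cf" character in Python's Unicode 3.2 data as one of the format control characters, and it ignores reserved words and the object-type restriction on #. All function names are hypothetical.

```python
import unicodedata

ucd32 = unicodedata.ucd_3_2_0  # Unicode 3.2 data shipped with Python

# Assumption: category-based approximation of the two derived properties.
ID_START_CATS = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_CATS = ID_START_CATS | {"Mn", "Mc", "Nd", "Pc"}

def is_start(ch, is_variable=False):
    """First-character rule; '@' is only accepted for variables / parameters."""
    if ord(ch) > 0xFFFF:          # supplementary characters are rejected
        return False
    if ch in "_\uFF3F#":          # low line, fullwidth low line, number sign
        return True
    if ch == "@":
        return is_variable
    return ucd32.category(ch) in ID_START_CATS

def is_continue(ch):
    """Subsequent-character rule, including @ # $ and format control characters."""
    if ord(ch) > 0xFFFF:
        return False
    cat = ucd32.category(ch)
    return cat in ID_CONTINUE_CATS or cat == "Cf" or ch in "@#$"

def is_regular_identifier(name, is_variable=False):
    """True if every character of name passes the approximated rules."""
    if not name:
        return False
    return is_start(name[0], is_variable) and all(is_continue(c) for c in name[1:])

print(is_regular_identifier("@\u30C4", is_variable=True))
print(is_regular_identifier("@\u00AF\\_(\u30C4)_/\u00AF", is_variable=True))
```

The first call checks `@ツ` and the second checks the full shrug name; anything the sketch accepts should still be confirmed against a real instance.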

According to "Unicode Utilities: UnicodeSet":

Valid starting characters

[:Age=3.2:] & [:ID_Start=Yes:]

-- Test one "Letter" from each of 10+ languages, as of Unicode 3.2
DECLARE @ᔠᑥᑒᏯשፙᇏᆇᄳᄈლဪඤaൌgೋӁｳﺲﶨ   INT;
-- works


-- Test a Supplementary Character that is a "Letter" as of Unicode 3.2
DECLARE @𝒲 INT; -- Mathematical Script Capital W (U+1D4B2)
/*
Msg 102, Level 15, State 1, Line XXXXX
Incorrect syntax near '0xd835'.
*/

Valid continuation characters

[:Age=3.2:] & [:ID_Continue=Yes:]

-- Test various decimal numbers, but none are Supplementary Characters
DECLARE @६৮༦൯௫୫９ INT;
-- works (including some Hebrew and Arabic, which are right-to-left languages)


-- Test a Supplementary Character that is a "decimal" number as of Unicode 3.2
DECLARE @𝟜 INT; -- MATHEMATICAL DOUBLE-STRUCK DIGIT FOUR (U+1D7DC)
/*
Msg 102, Level 15, State 1, Line XXXXX
Incorrect syntax near '0xd835'.
*/
-- D835 is the first (high surrogate) code unit of the surrogate pair D835 DFDC that encodes U+1D7DC
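The reason both failing tests report the same value can be seen by splitting the characters into UTF-16 code units. Here is a minimal Python sketch; the helper name `utf16_code_units` is hypothetical, and big-endian encoding is used only so the code units print in their natural order.

```python
def utf16_code_units(ch):
    """Return the UTF-16 code units of a single character as upper-case hex strings."""
    data = ch.encode("utf-16-be")
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]

print(utf16_code_units("\u30C4"))      # BMP character: a single code unit
print(utf16_code_units("\U0001D4B2"))  # Mathematical Script Capital W
print(utf16_code_units("\U0001D7DC"))  # Mathematical Double-Struck Digit Four
```

Both supplementary characters start with the same high surrogate, D835, which matches the value named in both error messages.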

HOWEVER #2, not even searching the Unicode database can be that easy. Those two searches do produce a list of valid characters for those categorizations, and those characters are from Unicode 3.2, BUT the definitions of the various categorizations change across versions of the Unicode Standard. Meaning, the definition of "ID_Start" in Unicode v10.0 (what that search is using today, 2018-03-26) is not what it was in Unicode v3.2. So, the online search cannot provide an exact list. But you can download the Unicode 3.2 data files and extract the list of "ID_Start" and "ID_Continue" characters from there to compare to what SQL Server actually uses. I have done this and confirmed an exact match to the rules I stated above in "HOWEVER #1".
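To see that classifications really do drift between versions, one can compare Python's bundled Unicode 3.2 data with whatever Unicode version the running interpreter ships. This sketch only compares general categories (the main input to ID_Start and ID_Continue), so it is an illustration of the effect rather than a reproduction of the data-file comparison described above; the function name `category_changes` is hypothetical.

```python
import unicodedata

old = unicodedata.ucd_3_2_0  # Unicode 3.2
new = unicodedata            # whichever version this Python ships with

def category_changes(start=0x0000, stop=0xFFFF):
    """Yield (code point, 3.2 category, current category) where the two differ,
    restricted to code points that were already assigned in Unicode 3.2."""
    for cp in range(start, stop + 1):
        ch = chr(cp)
        before, after = old.category(ch), new.category(ch)
        if before != "Cn" and before != after:
            yield cp, before, after

changes = list(category_changes())
print("current Unicode version:", unicodedata.unidata_version)
print("reclassified BMP code points:", len(changes))
for cp, before, after in changes[:10]:
    print(f"U+{cp:04X} {before} -> {after}")

# A letter that did not exist yet in 3.2: GLAGOLITIC CAPITAL LETTER AZU.
print(old.category("\u2C00"), new.category("\u2C00"))
```

Characters added after 3.2 show up as "Cn" (unassigned) in the old data, so a letter from a newer script would not be expected to qualify no matter how it is classified today.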

The following two blog posts detail the steps taken to find the exact list of characters, including links to the import scripts:

The Uni-Code: The Search for the True List of Valid Characters for T-SQL Regular Identifiers, Part 1
The Uni-Code: The Search for the True List of Valid Characters for T-SQL Regular Identifiers, Part 2

Finally, for anyone that just wants to see the list and is not concerned with what it took to discover and verify it, you can find that here:

Completely Complete List of Valid T-SQL Identifier Characters
(please give the page a moment to load; it's 3.5 MB and almost 47k lines)

Regarding "valid" ASCII characters, such as / and -, not working: the issue has nothing to do with whether or not the characters are also defined in the ASCII character set. In order to be valid, the character has to have either the ID_Start or ID_Continue property, or be one of the few custom characters noted separately. There are quite a few "valid" ASCII characters (62 of the 128 total — mostly punctuation and control characters) that are not valid in "Regular" Identifiers.
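The count of 62 can be reproduced with a few lines of Python. As before, this assumes that the valid set can be approximated from Unicode 3.2 general categories plus the custom characters @, #, and $; the names `CONTINUE_CATS` and `valid_in_regular_identifier` are hypothetical.

```python
import unicodedata

ucd32 = unicodedata.ucd_3_2_0  # Unicode 3.2 data shipped with Python

# Assumption: category-based approximation of ID_Continue, plus format controls.
CONTINUE_CATS = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Mn", "Mc", "Nd", "Pc", "Cf"}

def valid_in_regular_identifier(ch):
    """True if ch may appear somewhere in a regular identifier."""
    return ucd32.category(ch) in CONTINUE_CATS or ch in "@#$"

ascii_chars = [chr(cp) for cp in range(128)]
invalid = [ch for ch in ascii_chars if not valid_in_regular_identifier(ch)]

print(len(invalid))
print("".join(ch for ch in invalid if ch.isprintable()))
```

The 66 ASCII characters that remain are the 52 letters, the 10 digits, the underscore, and the three custom characters.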

Regarding Supplementary Characters: they certainly can be used in delimited identifiers (and the documentation does not appear to state otherwise). If it is true that they cannot be used in regular identifiers, that is most likely because they were not fully supported in built-in functions before Supplementary Character-Aware Collations were introduced in SQL Server 2012 (without such a Collation they are treated as two individual "unknown" characters), nor could they even be differentiated from each other in non-binary Collations before the 100-level Collations (introduced in SQL Server 2008).

Regarding ASCII: 8-bit encodings are not being used here since all identifiers are Unicode / NVARCHAR / UTF-16 LE. The statement SELECT ASCII('ツ'); returns a value of 63 which is a "?" (try: SELECT CHAR(63); ) since that character, even if prefixed with an upper-case "N", is certainly not in Code Page 1252. However, that character is in the Korean Code Page and it produces the correct result, even without the "N" prefix, in a Database with a Korean default Collation:

SELECT UNICODE('ツ'); -- 12484
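The same behavior can be mimicked outside SQL Server. This Python sketch assumes that "the Korean Code Page" means Windows code page 949 and uses Python's `cp1252` and `cp949` codecs as stand-ins for the conversions SQL Server performs; the variable names are hypothetical.

```python
ch = "\u30C4"  # KATAKANA LETTER TU

# Code point, comparable to UNICODE(N'...') in T-SQL.
code_point = ord(ch)

# Code Page 1252 has no slot for this character, so it degrades to "?".
in_1252 = ch.encode("cp1252", errors="replace")

# Code page 949 does contain it, as a two-byte sequence.
in_949 = ch.encode("cp949")

# How the character is laid out in UTF-16 LE, the encoding used for NVARCHAR.
utf16le = ch.encode("utf-16-le")

print(code_point)  # 12484
print(in_1252, ord("?"))
print(in_949.hex(), utf16le.hex())
```

The "?" fallback has the value 63, which is why `ASCII()` reports 63 when the database's code page cannot represent the character.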

Regarding the first letter affecting the outcome: this is not possible since the first letter for local variables and parameters is always @. The first letter that we get to control for these names is actually the 2nd character of the name.

Regarding why local variable names, parameter names, and GOTO labels cannot be delimited: I suspect this is due to these items being a part of the language itself and not something that will find its way into a system table as data.

I don't think it's Unicode that's causing the problem; in the case of local variable or parameter names, it's that the character isn't a valid ASCII/Unicode 3.2 character (and there isn't any escape sequence for variables/parameters like there is for other entity types).

This batch works fine; it uses a Unicode character that simply doesn't violate the rules for non-delimited identifiers:

CREATE OR ALTER PROCEDURE dbo.[]
  @ツ int
AS
  CREATE TABLE [#ツ] (ツ int);
  INSERT [#ツ](ツ) SELECT @ツ;
  SELECT ツ+1 FROM [#ツ];
GO
EXEC dbo.[] @ツ = 1;

As soon as you try to use a slash or a dash, both of which are valid ASCII characters, it bombs:

Msg 102, Level 15, State 1, Procedure Incorrect syntax near '-'.

The documentation does not address why these identifiers are subject to slightly different rules than all other identifiers, or why they can't be escaped like the others.
