How To Strip Hebrew Accent Marks

The trick here is to realize that these characters that you see in the question with the "accents" aren't really the characters (i.e. "These aren't the ~~droids~~characters you are looking for" ;-) ). The "accents" are various types of notations indicating things like:

vowels (lines and dots that are typically under the letters):

base letter "ה" = "h"; "הֶ" = "heh" and "הָ" = "hah"
pronunciation (dots that are usually inside or above letters):

"בּ" = "b" vs "ב" = "v", or "שׂ" = "s" vs "שׁ" = "sh"
punctuation
cantillation (how it should be sung)

The actual Hebrew letters are what is shown in the stripped down version (i.e. the end result of what is being requested here). What we are referring to here as "accents" are known as diacritical marks. The Wikipedia article on Hebrew diacritics has a lot of good information about these marks, including the following image and caption:

^{Gen. 1:9 And God said, "Let the waters be collected".
Letters in black, pointing in red, cantillation in blue}

Getting from those base characters to what the first line (with the vowels, etc) shows is a matter of adding one or more "accents". Unicode (UTF-16 in SQL Server, though default interpretation only handles the UCS-2 / Basic Multilingual Plane (BMP) code points) allows for some characters to overlay another non-overlay character when adjacent to them. These are known as Combining Characters.

Meaning:

SELECT DATALENGTH(N'מַ֖'); -- character taken from original given text

Returns:

not 2 as most people would expect from seeing a single, double-byte character. So maybe we try to find what character is there by doing:

SELECT UNICODE(N'מַ֖');

which returns:

Of course, the UNICODE and ASCII functions only return the INT value of the first character of whatever string they are given. But a value of 1502 only covers 2 bytes, which leaves 4 bytes unaccounted for. Looking at the binary/hex values of that same Hebrew "character":

SELECT NCHAR(1502), CONVERT(BINARY(2), UNICODE(N'מַ֖')), CONVERT(VARBINARY(10), N'מַ֖');

we get:

מ
0x05DE  0xDE05B7059605

Now, 0x05DE is the hex representation of 1502, and the 1502 is only the "מ". The next part can be separated into three 2-byte sets: DE05 B705 9605. Now, Unicode string values are stored in Little Endian, which means the byte-order is reversed. If we switch each of those three sets we get:

05DE (the base character) 05B7 0596 (the unaccounted for 4 bytes).

Ok. So what happens if we remove that base character?

SELECT REPLACE(N'מַ֖' COLLATE Hebrew_BIN2, NCHAR(1502) COLLATE Hebrew_BIN2, '');

That returns the two remaining characters (not easy to see here so I have made the following line a header in order to increase the font size; you can also run the above REPLACE to see them):

Removing the מ from the מַ֖ leaves two characters at the bottom: ַ֖

Hence, we need to strip out each individual code-point that is one of these "extra" combining characters (found at: http://unicode-table.com/en/search/?q=hebrew) and that will leave us with the base characters. We can do that via:

CREATE FUNCTION dbo.RemoveHebrewAccents (@txeTwerbeH NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
WITH SCHEMABINDING
AS
BEGIN

  WITH base (dummy) AS
  (
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
  ), nums AS
  (
    -- we will want to generate code points 1425 - 1479
    SELECT TOP (55) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS [Num]
    FROM   base b1
    CROSS JOIN base b2
  )
  SELECT @txeTwerbeH = REPLACE(
                               @txeTwerbeH COLLATE Hebrew_BIN2,
                               NCHAR(1424 + nums.[Num]) COLLATE Hebrew_BIN2,
                               ''
                              )
  FROM   nums;

  RETURN @txeTwerbeH;
END;

And then we can test it with the original text as follows:

DECLARE @Hebrew NVARCHAR(200) = N'בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ';

SELECT dbo.RemoveHebrewAccents(@Hebrew);

Returns:

בראשית ברא אלהים את השמים ואת הארץ

Additional notes:

Technically, there is a set of code-points between 64298 and 64334 that do have some vowels and pronunciation "accents" built into the character. If those need to be handled, that can be a second step in the function to do a simple replacement of those characters.
It seems that these accent, punctuation, etc code-points only match when using a binary collation. Even using Hebrew_100_CS_AS_KS_WS_SC did not match them. But the following did work: Hebrew_BIN, Hebrew_BIN2, Latin1_General_BIN, and Latin1_General_BIN2. In the function I ended up using Hebrew_BIN2. Please note that when using binary collations, unless you have a specific need to use the older _BIN collations, you should only be using the newer _BIN2 collations.
For anyone who is curious, the Hebrew sample text is actually Bereishis 1:1 (that is also the first word on the right-side as Hebrew is read right-to-left; in English it would be "Genesis 1:1" though that is not a direct translation of the word, just the name of the first book of the Torah / Bible; the direct translation is "in the beginning"):

In the beginning of God's creating the heavens and the Earth
2015-01-19: I found some great resources which explain both Combining Characters and the Hebrew character set:
- http://en.wikipedia.org/wiki/Combining_character (also linked towards the top of this answer)
- http://www.alanwood.net/unicode/combining_diacritical_marks.html
- http://www.alanwood.net/unicode/hebrew.html
- http://people.w3.org/rishida/scripts/pickers/hebrew/ (play with all of the characters in the Hebrew character set; very cool!)

How To Strip Hebrew Accent Marks

base letter "ה" = "h"; "הֶ" = "heh" and "הָ" = "hah"

"בּ" = "b" vs "ב" = "v", or "שׂ" = "s" vs "שׁ" = "sh"

Removing the מ from the מַ֖ leaves two characters at the bottom: ַ֖

בראשית ברא אלהים את השמים ואת הארץ

Tags:

Sql Server

Unicode

Encoding

Collation

Related

Recent Posts