How do I find this character (by Unicode search) in Notepad++: ﻁ (\uFEC1, and only that character)?

To search by Unicode codepoint (which, for this character, is the same number as its UTF-16 code unit), you'd use \x{FEC1} in a regular-expression search, and it works whether the file is encoded as UTF-8 or UTF-16.
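As a rough illustration of why the file encoding doesn't matter, here is a small Python sketch (Python is my choice for illustration; it is not what Notepad++ runs internally). The helper name find_code_point is hypothetical. The same text is stored as UTF-8 and as UTF-16, and a search by codepoint finds the same positions in both once the bytes are decoded.

```python
# The sample string from the question, written with escapes so the order is unambiguous:
# U+FEC1 U+FEC1 U+FEC9 U+FEC1 U+FEC9 U+FEC1 U+FEC9
sample = "\uFEC1\uFEC1\uFEC9\uFEC1\uFEC9\uFEC1\uFEC9"

def find_code_point(text: str, code_point: int) -> list[int]:
    """Return the character indices where the given codepoint occurs."""
    return [i for i, ch in enumerate(text) if ord(ch) == code_point]

# Two different on-disk representations of the same text.
utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16")  # includes a byte-order mark

# After decoding, the search sees codepoints, not bytes.
hits_utf8 = find_code_point(utf8_bytes.decode("utf-8"), 0xFEC1)
hits_utf16 = find_code_point(utf16_bytes.decode("utf-16"), 0xFEC1)

print(hits_utf8 == hits_utf16)
```

The byte strings differ in length and content, but the codepoint search gives the same answer for both.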

Bear in mind you don't need to search by the UTF-8 code, because you can search by the UTF-16 code. But to address the part of your question that asks how you would search for that character by its UTF-8 code...

You can't. Well, you sort of can, but it's a hideous hack and you really shouldn't.

The obvious thing to try would be to search for \xef\xbb\x81 in your UTF-8 encoded document, but that doesn't work. (Note there's no {} here: Notepad++ expects either \xNN for 2 hex digits, or \x{NNNN} for 4 hex digits.) It fails because Notepad++ doesn't actually search for byte values; it searches for Unicode codepoints. So you can search for the codepoint U+FEC1, but not for the UTF-8 bytes 0xEF 0xBB 0x81, because Notepad++ "hides" the encoding details from you. (In nearly every scenario, someone editing a text file will care far more about finding the actual character than about finding the UTF-8 bytes.)
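If you're wondering where the bytes 0xEF 0xBB 0x81 come from, the following Python sketch works out the three-byte UTF-8 form of a codepoint by hand. The function name utf8_three_byte is made up for this example, and it only covers the U+0800 to U+FFFF range that U+FEC1 falls in (it does not reject surrogates, which a real encoder would).

```python
def utf8_three_byte(code_point: int) -> bytes:
    """Hand-encode a codepoint in U+0800..U+FFFF as three UTF-8 bytes.

    Bit layout: 1110xxxx 10xxxxxx 10xxxxxx
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("only the three-byte range is handled in this sketch")
    first = 0xE0 | (code_point >> 12)            # top 4 bits
    second = 0x80 | ((code_point >> 6) & 0x3F)   # middle 6 bits
    third = 0x80 | (code_point & 0x3F)           # low 6 bits
    return bytes([first, second, third])

tah = utf8_three_byte(0xFEC1)
ain = utf8_three_byte(0xFEC9)

print(tah.hex(" "))  # ef bb 81
print(ain.hex(" "))  # ef bb 89
```

Both results agree with Python's built-in encoder, so the hand calculation and "\uFEC1".encode("utf-8") give the same three bytes.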

There's another trick you might try, which is to take that UTF-8 encoded file and choose the Encoding → Encode in ANSI menu option, at which point ﻁﻁﻉﻁﻉﻁﻉ appears to become ï»ï»ï»‰ï»ï»‰ï»ï»‰. (I say "appears to become" rather than "becomes" because... well, read on.) This happens because Notepad++ has taken the UTF-8 bytes of your file and reinterpreted them as "ANSI" (a terrible encoding name because it's completely wrong; it should really be called "Windows-1252", but that's a different question).

(By the way, the reason ﻁﻁﻉﻁﻉﻁﻉ looks reversed in my text compared with your screenshot: Notepad++ doesn't care that Arabic is written right-to-left, so it shows the characters left-to-right in the order they were pasted into the file. Your browser does care about presenting Arabic in proper right-to-left order, so the first two letters of that string (ﻁﻁ) appear on the right-hand side of the string, not on the left-hand side as they seem to in Notepad++.)

Digressions aside, here's why this is helpful. In the "ANSI" (really Windows-1252) encoding, each byte is a single character, so now you can search by individual bytes. If you search for \xef\xbb\x81 (which doesn't need to be a regular expression, just an "Extended" search), it will find the characters. Sort of. It will look like it's highlighting the two characters ï», but it's really highlighting three characters: ï, », and an "invisible" 0x81 character that doesn't really exist, because Windows-1252 defines no character at 0x81. And now you see why I said "appears to become": your UTF-8 encoded text has really become ï»_ï»_ï»‰ï»_ï»‰ï»_ï»‰, where _ represents an "invisible" character that doesn't officially exist in the Windows-1252 codepage.

Anyway, now that you've found the sequence of three characters with the byte values 0xEF, 0xBB, and 0x81 in Windows-1252, and Notepad++ has highlighted them, you can choose the Encoding → Encode in UTF-8 menu option. Your text will convert itself back to UTF-8 while Notepad++ keeps the highlight in the same place, and thus you'll find that one ﻁ character has been highlighted.
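What the trick amounts to is a byte-level search whose hit is then mapped back to a character position. Outside Notepad++ you can do that directly; here is a minimal Python sketch, assuming the file's contents are already in memory as UTF-8 bytes. The helper find_byte_sequence is a hypothetical name for this example.

```python
sample = "\uFEC1\uFEC1\uFEC9\uFEC1\uFEC9\uFEC1\uFEC9"
data = sample.encode("utf-8")   # what is actually stored in the file
needle = b"\xef\xbb\x81"        # the UTF-8 bytes of U+FEC1

def find_byte_sequence(data: bytes, needle: bytes) -> list[tuple[int, int]]:
    """Return (byte_offset, character_index) for each occurrence of needle."""
    hits = []
    start = data.find(needle)
    while start != -1:
        # Decoding the prefix converts a byte offset into a character index.
        # This is safe here because 0xEF can only start a UTF-8 sequence.
        char_index = len(data[:start].decode("utf-8"))
        hits.append((start, char_index))
        start = data.find(needle, start + len(needle))
    return hits

hits = find_byte_sequence(data, needle)
print(hits)
```

Each hit lands on a ﻁ and skips the ﻉ characters, which differ only in their third byte (0x89 instead of 0x81).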

So why do I say that you really shouldn't do this? Because the only reason that it works is that Notepad++ didn't do the right thing when you switched codepages. The right thing to do when you find a missing character is to complain, or insert a character like the Unicode replacement character � (or a simple ? if you're in a legacy codepage that doesn't have � in it), or do something so that the user will know they had an invalid character in their text. Errors should never be silently ignored, and having a 0x81 value in Windows-1252 text is an error; Notepad++ does the wrong thing with invalid characters (that is, it ignores them). So you really shouldn't rely on this trick: with any update to Notepad++, it could change its undocumented (and wrong) behavior and start putting proper replacement characters in wrongly-encoded text, at which point this trick would fail. Stick to searching for real Unicode codepoints, and you'll be much better off.
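For comparison, here is how Python's cp1252 codec treats the same bytes. This is only a sketch of the "complain or substitute" behavior described above; it says nothing about what Notepad++ does internally. The variable names are mine.

```python
tah_bytes = "\uFEC1".encode("utf-8")   # EF BB 81
ain_bytes = "\uFEC9".encode("utf-8")   # EF BB 89

# 0x89 is defined in Windows-1252, so these bytes reinterpret cleanly.
ain_as_ansi = ain_bytes.decode("cp1252")

# 0x81 is not defined, so a strict decoder complains...
try:
    tah_bytes.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ...and a lenient one substitutes U+FFFD, the replacement character.
tah_replaced = tah_bytes.decode("cp1252", errors="replace")

print(ain_as_ansi, strict_failed, tah_replaced)
```

Once the invalid byte has been turned into a replacement character, the original 0x81 is gone, which is why the trick would stop working if Notepad++ ever adopted this behavior.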

By the way, the reason your original attempt ([\uFEC1]) failed is that, according to Notepad++'s regular expression syntax, \u means "an uppercase letter". (Remember that in regular expressions, brackets represent "any of these characters".) The docs further say, "See note about lower case [sic] letters," and the note about lowercase letters says this will fall back on "a word character" if the "Match case" search option is off, as it is in your screenshot. Therefore, the regex [\uFEC1] is searching for "any word character, or F, or E, or C, or 1", which matches every single character in your sample text.
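To see the difference between the two interpretations, here is a Python sketch. Python's re module is not the regex engine Notepad++ uses, and in Python \uFEC1 is a plain codepoint escape, so I am assuming [\wFEC1] as a stand-in for what Notepad++ makes of [\uFEC1] when "Match case" is off.

```python
import re

sample = "\uFEC1\uFEC1\uFEC9\uFEC1\uFEC9\uFEC1\uFEC9"

# Stand-in for the failed attempt: any word character, or F, E, C, or 1.
broad_matches = re.findall(r"[\wFEC1]", sample)

# What was actually wanted: the single codepoint U+FEC1.
exact_matches = re.findall("\uFEC1", sample)

print(len(broad_matches), len(exact_matches))
```

The broad class matches all seven characters, because Arabic letters count as word characters; the exact pattern matches only the four ﻁ characters.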

Phew, that turned out to be a very long answer for what I said would be "very simple". I hope this helps you understand Unicode a bit better; if so, the hour I spent typing this up will have been worth it.

Take a look: Anyone know how to use Regex in notepad++ to find Arabic characters?

Notepad++'s implementation of regular expressions requires that you use the

\x{NNNN}

notation to match Unicode characters.


In your example, search for:

\x{FEC1}
