Java - removing strange characters from a String

Justin Thomas's answer was close, but this is probably closer to what you're looking for:

String nonStrange = strangeString.replaceAll("\\p{Cntrl}", ""); 

The selector \p{Cntrl} selects "A control character: [\x00-\x1F\x7F]."
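For illustration, here is a minimal sketch of that call wrapped in a helper. The class and method names are made up for this example. Note that \p{Cntrl} also matches tabs and newlines, so those are removed as well.

```java
final class ControlCharStripper {
    // Hypothetical helper: removes ASCII control characters (0x00-0x1F and 0x7F).
    static String stripControlChars(String input) {
        return input.replaceAll("\\p{Cntrl}", "");
    }

    public static void main(String[] args) {
        // Contains BEL (U+0007), a tab, and DEL (U+007F) between the letters.
        String strange = "ab\u0007c\tde\u007Ff";
        System.out.println(stripControlChars(strange)); // prints abcdef
    }
}
```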


To delete non-ASCII characters from the string (note that this also strips accented Latin letters such as é), I use the following code:

String s = "小米体验版 latin string 01234567890";
s = s.replaceAll("[^\\x00-\\x7F]", "");

The output string will be: " latin string 01234567890"
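If accented Latin letters should survive, one possible variation is to keep everything in the Latin script in addition to ASCII. This is only a sketch: the class and method names are invented here, and it relies on the \p{IsLatin} script property that java.util.regex supports since Java 7.

```java
final class NonLatinStripper {
    // Hypothetical helper: keeps ASCII plus any character in the Latin script,
    // and removes everything else (for example CJK ideographs).
    static String keepAsciiAndLatin(String input) {
        return input.replaceAll("[^\\x00-\\x7F\\p{IsLatin}]", "");
    }

    public static void main(String[] args) {
        // Two CJK ideographs followed by " caf\u00E9 01" (e with acute accent).
        String s = "\u5C0F\u7C73 caf\u00E9 01";
        System.out.println(keepAsciiAndLatin(s)); // the ideographs are gone, the accented e stays
    }
}
```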


You can use String.replaceAll("[my-list-of-strange-and-unwanted-chars]", "").

There is no Character.isStrangeAndUnWanted(); you have to define what you want.
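One way to "define what you want" is to write the predicate yourself and filter code points with it. The rule below (drop control, unassigned, and private-use characters, but keep newline and tab) is purely an assumption for the sake of the example, as are the class and method names. It needs Java 8 or later for codePoints().

```java
final class UnwantedCharFilter {
    // Hypothetical rule for what counts as "strange and unwanted".
    static boolean isUnwanted(int codePoint) {
        if (codePoint == '\n' || codePoint == '\t') {
            return false; // keep ordinary whitespace controls
        }
        int type = Character.getType(codePoint);
        return type == Character.CONTROL
                || type == Character.UNASSIGNED
                || type == Character.PRIVATE_USE;
    }

    // Rebuilds the string from the code points that pass the rule above.
    static String removeUnwanted(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> !isUnwanted(cp))
             .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // NUL and a private-use character (U+E000) are dropped; the newline is kept.
        System.out.println(removeUnwanted("a\u0000b\n\uE000c"));
    }
}
```

Working on code points rather than chars means characters outside the Basic Multilingual Plane are judged as a whole instead of as two separate surrogates.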

If you want to remove control characters, you can do:

String str = "\u0000\u001f hi \n";
str = str.replaceAll("[\u0000-\u001f]", "");

str is now " hi ": the control characters, including the trailing newline, are removed and the spaces are kept.

EDIT: If you want to know the numeric Unicode value of any 16-bit char in the string, you can do:

int num = string.charAt(n);
System.out.println(num);
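charAt returns a single UTF-16 code unit, so a character outside the Basic Multilingual Plane shows up as two surrogate values. As a sketch (the class and method names are invented for this example), the whole string can be dumped by code point instead:

```java
final class CodePointDump {
    // Hypothetical helper: lists every code point of the string in U+XXXX notation.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // 'A' followed by U+1F600, written as its surrogate pair.
        System.out.println(describe("A\uD83D\uDE00")); // prints U+0041 U+1F600
    }
}
```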

The black diamond with a question mark is U+FFFD (�), the Unicode REPLACEMENT CHARACTER. It is a placeholder that stands in for something that could not be represented, most often bytes that could not be decoded with the charset in use. A glyph that is simply missing from your font is usually drawn as an empty box instead. The exact appearance of U+FFFD varies depending on the font you're using.
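One common way U+FFFD ends up in a Java String is decoding bytes that are not valid in the chosen charset. The sketch below reproduces that and then removes the placeholder; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class ReplacementCharDemo {
    // Decodes bytes as UTF-8; malformed input is replaced with U+FFFD by the String constructor.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: drops every U+FFFD from the string.
    static String stripReplacementChars(String s) {
        return s.replace("\uFFFD", "");
    }

    public static void main(String[] args) {
        // 0xFF can never appear in valid UTF-8.
        byte[] broken = {'h', 'i', (byte) 0xFF};
        String decoded = decodeUtf8(broken);
        System.out.println(decoded.indexOf('\uFFFD') >= 0);  // true
        System.out.println(stripReplacementChars(decoded));  // prints hi
    }
}
```

Stripping the placeholder only hides the symptom. If the original text matters, it is usually better to find out which charset the bytes were really in and decode them with that.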

You can use java.text.Normalizer to decompose accented characters into a base letter plus combining marks, and then remove the marks and anything else that is not in the "normal" ASCII character set.
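A minimal sketch of the Normalizer approach, assuming the goal is plain ASCII output: decompose with NFD, drop the combining marks, then drop whatever non-ASCII is left. The class and method names are invented for this example.

```java
import java.text.Normalizer;

final class AccentStripper {
    // Hypothetical helper: turns accented letters into their base letters
    // and removes any remaining non-ASCII characters.
    static String toPlainAscii(String input) {
        // NFD splits e.g. U+00E9 into 'e' followed by a combining acute accent.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining marks produced by the decomposition.
        String withoutMarks = decomposed.replaceAll("\\p{M}", "");
        // Anything still outside ASCII has no base letter to fall back to.
        return withoutMarks.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        System.out.println(toPlainAscii("na\u00EFve caf\u00E9")); // prints naive cafe
    }
}
```

Unlike a bare replaceAll("[^\\x00-\\x7F]", ""), this keeps the base letter of an accented character instead of deleting it outright.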

Tags:

Java

String