Ruby: Limiting a UTF-8 string by byte-length

I think I found something that works.

def limit_bytesize(str, size)
  str.encoding.name == 'UTF-8' or raise ArgumentError, "str must have UTF-8 encoding"

  # Change to canonical unicode form (compose any decomposed characters).
  # Works only if you're using active_support
  str = str.mb_chars.compose.to_s if str.respond_to?(:mb_chars)

  # Start with a string of the correct byte size, but
  # with a possibly incomplete char at the end.
  new_str = str.byteslice(0, size)

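  # We need to force_encoding from utf-8 to utf-8 so ruby will re-validate
  # (idea from halfelf). The empty? guard ends the loop when size is smaller
  # than the first character; without it new_str[-1] is nil and the call raises.
  until new_str.empty? || new_str[-1].force_encoding('utf-8').valid_encoding?
    # remove the trailing byte of the incomplete char
    new_str = new_str.slice(0..-2)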
  end
  new_str
end

Usage:

>> limit_bytesize("abc\u2014d", 4)
=> "abc"
>> limit_bytesize("abc\u2014d", 5)
=> "abc"
>> limit_bytesize("abc\u2014d", 6)
=> "abc—"
>> limit_bytesize("abc\u2014d", 7)
=> "abc—d"
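
On Ruby 2.1 or later, the re-validation loop can probably be replaced with String#scrub, which drops invalid byte sequences. This is only a minimal sketch: the name limit_bytesize_scrub is made up for illustration, and it assumes the input is already valid UTF-8 (otherwise scrub would also remove invalid bytes from the middle of the string).

```ruby
# Hypothetical helper, not part of Ruby or Rails.
# Assumes str is valid UTF-8 before slicing.
def limit_bytesize_scrub(str, size)
  # byteslice may cut through a multibyte character, leaving an
  # incomplete sequence at the end; scrub('') deletes those bytes.
  str.byteslice(0, size).scrub('')
end

limit_bytesize_scrub("abc\u2014d", 5) # => "abc"
limit_bytesize_scrub("abc\u2014d", 6) # => "abc\u2014"
```

Unlike the loop version, this returns an empty string when size is smaller than the first character.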

Update...

Decomposed behavior without active_support:

>> limit_bytesize("abc\u0065\u0301d", 4)
=> "abce"
>> limit_bytesize("abc\u0065\u0301d", 5)
=> "abce"
>> limit_bytesize("abc\u0065\u0301d", 6)
=> "abcé"
>> limit_bytesize("abc\u0065\u0301d", 7)
=> "abcéd"

Decomposed behavior with active_support:

>> limit_bytesize("abc\u0065\u0301d", 4)
=> "abc"
>> limit_bytesize("abc\u0065\u0301d", 5)
=> "abcé"
>> limit_bytesize("abc\u0065\u0301d", 6)
=> "abcéd"
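
If active_support is not available, the composition step can likely be done with plain Ruby (2.2 or later) through String#unicode_normalize. The snippet below is only an illustration of why composing first changes the byte counts shown above; the variable names are arbitrary.

```ruby
# "e" followed by U+0301 (combining acute accent): 1 + 2 bytes.
decomposed = "abc\u0065\u0301d"

# NFC composes the pair into the single code point U+00E9 (2 bytes).
composed = decomposed.unicode_normalize(:nfc)

decomposed.bytesize      # => 7
composed.bytesize        # => 6
composed == "abc\u00E9d" # => true
```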

Rails >= 3.0 provides the ActiveSupport::Multibyte::Chars#limit method.

From API docs:

- (Object) limit(limit) 

Limit the byte size of the string to a number of bytes without breaking characters. Usable when the storage for a string is limited for some reason.

Example:

'こんにちは'.mb_chars.limit(7).to_s # => "こん"

Rails 6 provides String#truncate_bytes, which behaves like truncate but takes a byte count instead of a character count. And, of course, it returns a valid string (it does not cut blindly in the middle of a multibyte char).

Taken from the doc:

>> "🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪".size
=> 20
>> "🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪".bytesize
=> 80
>> "🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪".truncate_bytes(20)
=> "🔪🔪🔪🔪…"
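
Outside Rails, similar behavior can be approximated in plain Ruby. The following is a rough sketch and not the Rails implementation: truncate_bytes_sketch and its omission: keyword are names chosen here for illustration, and the input is assumed to be valid UTF-8.

```ruby
# Hypothetical stand-in for String#truncate_bytes; not the Rails code.
def truncate_bytes_sketch(str, limit, omission: "…")
  # Nothing to cut if the string already fits.
  return str.dup if str.bytesize <= limit

  if omission.bytesize > limit
    raise ArgumentError, "omission is larger than the byte limit"
  end

  # Reserve room for the omission, cut at that byte offset, then
  # drop any incomplete character left at the end of the slice.
  kept = str.byteslice(0, limit - omission.bytesize).scrub("")
  kept + omission
end

truncate_bytes_sketch("🔪" * 20, 20) # => "🔪🔪🔪🔪…"
```

The result is 19 bytes: four 4-byte characters plus the 3-byte ellipsis.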

bytesize gives you the length of the string in bytes, while character-based operations such as slice won't mangle the string (as long as the string's encoding is set properly).

A simple approach is to iterate through the string, appending characters until the next one would exceed the limit (255 bytes here):

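# Append whole characters until the next one would push the result past 255 bytes.
# The unary plus keeps the accumulator mutable under frozen_string_literal.
s.each_char.each_with_object(+'') do |char, result|
  if result.bytesize + char.bytesize > 255
    break result
  else
    result << char
  end
end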

If you wanted to be crafty, you could copy the first 63 characters directly, since any Unicode character is at most 4 bytes in UTF-8 (63 × 4 = 252, which fits within 255).
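
That shortcut might look something like the sketch below. The method name limit_bytes_fast_prefix and the max_bytes parameter are invented for this example; the idea is only that the first max_bytes / 4 characters can never exceed the limit, so they need no per-character check.

```ruby
# Hypothetical helper illustrating the "copy the safe prefix" idea.
def limit_bytes_fast_prefix(s, max_bytes = 255)
  # Each UTF-8 character is at most 4 bytes, so this many always fit.
  safe_chars = max_bytes / 4

  # s[0, n] returns a new, mutable string with the first n characters.
  result = s[0, safe_chars]

  # s[n..-1] is nil when the string is shorter than n characters.
  tail = s[safe_chars..-1] || ""
  tail.each_char do |char|
    break if result.bytesize + char.bytesize > max_bytes
    result << char
  end
  result
end

limit_bytes_fast_prefix("a" * 300).bytesize  # => 255
limit_bytes_fast_prefix("🔪" * 100).bytesize # => 252
```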

Note that this is still not perfect. For example, imagine that the last 3 bytes of your string are the character 'e' (1 byte) followed by a combining acute accent (U+0301, 2 bytes). Slicing off the last 2 bytes produces a string that is still valid UTF-8, but in terms of what the user sees it changes the output from 'é' to 'e', which could change the meaning of the text. This is probably not a huge deal when you're just naming RabbitMQ queues, but it could be important in other circumstances. For example, in French a newsletter headline reading 'Un policier tué' means 'A policeman was killed', whereas 'Un policier tue' means 'A policeman kills'.
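
On Ruby 2.5 or later, one way to avoid separating a base letter from its combining mark is to cut on grapheme clusters instead of characters, using String#each_grapheme_cluster. This is a sketch under that assumption; limit_bytesize_by_grapheme is a hypothetical name.

```ruby
# Hypothetical helper: keeps only whole grapheme clusters that fit.
def limit_bytesize_by_grapheme(str, size)
  result = String.new(encoding: str.encoding)
  str.each_grapheme_cluster do |cluster|
    # A cluster such as "e" + U+0301 is kept or dropped as a unit.
    break if result.bytesize + cluster.bytesize > size
    result << cluster
  end
  result
end

# "tue" + combining acute accent is 5 bytes; the final cluster is 3 of them.
limit_bytesize_by_grapheme("tue\u0301", 4) # => "tu"
limit_bytesize_by_grapheme("tue\u0301", 5) # => "tue\u0301"
```

With a 4-byte limit this returns "tu" rather than the misleading "tue".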