Delphi XE - RawByteString vs AnsiString

RawByteString is an AnsiString with no code page set by default.

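When you assign another 8-bit string to a RawByteString variable, the variable takes on the code page of the source string, and no conversion takes place. Assigning it to or from a UnicodeString (the default string type since Delphi 2009) will still include a conversion, though. Sorry.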

But there is another use of RawByteString: storing plain byte content (e.g. the content of a database BLOB field, just like an array of byte).

To summarize:

  • RawByteString should be used as a "code page agnostic" parameter to a method or function;
  • RawByteString can be used as a variable type to store some BLOB data.
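The first point can be sketched as follows, assuming Delphi 2009 or later. The names ByteCount, CodePageOf and Win1252String are made up for the illustration; StringCodePage is the RTL function that reports the code page stored with an 8-bit string.

```pascal
program RawByteParamDemo;
{$APPTYPE CONSOLE}

type
  // Illustrative 8-bit string type tagged with the Windows-1252 code page
  Win1252String = type AnsiString(1252);

const
  // U+00E9 (e with acute accent), written as a code so that the
  // source file encoding does not matter
  Accented: UnicodeString = #$00E9;

// Hypothetical helper: counts payload bytes, whatever the code page is.
// The RawByteString parameter accepts any AnsiString flavour as-is.
function ByteCount(const Data: RawByteString): Integer;
begin
  Result := Length(Data);
end;

// Hypothetical helper: reports the code page carried by the argument.
function CodePageOf(const Data: RawByteString): Word;
begin
  Result := StringCodePage(Data);
end;

var
  U: UTF8String;
  W: Win1252String;
begin
  U := UTF8String(Accented);     // explicit conversion to UTF-8 (2 bytes)
  W := Win1252String(Accented);  // explicit conversion to Windows-1252 (1 byte)
  WriteLn(ByteCount(U), ' byte(s), code page ', CodePageOf(U));
  WriteLn(ByteCount(W), ' byte(s), code page ', CodePageOf(W));
end.
```

Both calls go through the same function, and each argument should keep its own byte count and code page, since nothing is converted on the way in.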

If you want to reduce conversions, and would rather use 8-bit strings in your application, you should:

  • Avoid the generic AnsiString type, which depends on the current system code page, and with which you'll lose data;
  • Rely on UTF-8 encoding, i.e. an 8-bit code page / charset which won't lose any data when converted from or to a UnicodeString;
  • Leave no compiler warning about implicit conversions unresolved: every conversion should be made explicit;
  • Use your own dedicated set of functions to handle your UTF-8 content.
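As a minimal sketch of what "explicit" means here (Delphi 2009 or later assumed), the two hypothetical helpers below, ToUtf8 and FromUtf8, wrap a typecast in each direction. They use the stock UTF8String type for simplicity, not our own type, and they are not the framework's functions.

```pascal
program ExplicitUtf8Demo;
{$APPTYPE CONSOLE}

const
  // U+00E9 followed by U+4E2D: two UTF-16 code units, five UTF-8 bytes
  Sample: UnicodeString = #$00E9#$4E2D;

// Hypothetical helper: UnicodeString -> UTF-8, through an explicit cast,
// so the conversion is visible in the source
function ToUtf8(const Text: UnicodeString): UTF8String;
begin
  Result := UTF8String(Text);
end;

// Hypothetical helper: UTF-8 -> UnicodeString, again through an explicit cast
function FromUtf8(const Utf8: UTF8String): UnicodeString;
begin
  Result := UnicodeString(Utf8);
end;

var
  Encoded: UTF8String;
  Decoded: UnicodeString;
begin
  Encoded := ToUtf8(Sample);
  Decoded := FromUtf8(Encoded);
  WriteLn('UTF-16 length: ', Length(Sample));   // counted in 16-bit code units
  WriteLn('UTF-8 length:  ', Length(Encoded));  // counted in bytes
  WriteLn('Round trip is lossless: ', Decoded = Sample);
end.
```

Keeping every conversion inside a couple of named functions like these makes it easy to see where your text changes encoding.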

That's exactly what we did for our framework. We wanted to use UTF-8 in its kernel because:

  • We rely on UTF-8 encoded JSON for data transmission;
  • Memory consumption will be smaller;
  • The SQLite3 engine we use stores text as UTF-8 in its database file;
  • We wanted a way of handling Unicode text with no loss of data with all versions of Delphi (from Delphi 6 up to XE), and WideString was not an option because it's dead slow and you've got the same problem of implicit conversions.

But, in order to achieve the best speed, we wrote some optimized functions to handle our custom string type:

  {{ RawUTF8 is an UTF-8 String stored in an AnsiString
    - use this type instead of System.UTF8String, which behavior changed
     between Delphi 2009 compiler and previous versions: our implementation
     is consistent and compatible with all versions of Delphi compiler
    - mimic Delphi 2009 UTF8String, without the charset conversion overhead
    - all conversion to/from AnsiString or RawUnicode must be explicit }
{$ifdef UNICODE} RawUTF8 = type AnsiString(CP_UTF8); // Codepage for an UTF8string
{$else}          RawUTF8 = type AnsiString; {$endif}

/// our fast RawUTF8 version of Trim(), for Unicode only compiler
// - this Trim() is seldom used, but this RawUTF8 specific version is needed
// by Delphi 2009/2010/XE, to avoid two unnecessary conversions into UnicodeString
function Trim(const S: RawUTF8): RawUTF8;

/// our fast RawUTF8 version of Pos(), for Unicode only compiler
// - this Pos() is seldom used, but this RawUTF8 specific version is needed
// by Delphi 2009/2010/XE, to avoid two unnecessary conversions into UnicodeString
function Pos(const substr, str: RawUTF8): Integer; overload; inline;

And we reserved the RawByteString type for handling BLOB data:

{$ifndef UNICODE}
  /// define RawByteString, as it does exist in Delphi 2009/2010/XE
  // - to be used for byte storage into an AnsiString
  // - use this type if you don't want the Delphi compiler to do any
  // code page conversions when you assign a typed AnsiString to a RawByteString,
  // i.e. a RawUTF8 or a WinAnsiString
  RawByteString = AnsiString;
  /// pointer to a RawByteString
  PRawByteString = ^RawByteString;
{$endif}

/// create a File from a string content
// - uses RawByteString for byte storage, whatever the codepage is
function FileFromString(const Content: RawByteString; const FileName: TFileName;
  FlushOnDisk: boolean=false): boolean;

Source code is available in our repository. In this unit, UTF-8 related functions were deeply optimized, with both pascal and asm versions for better speed. We sometimes overloaded default functions (like Pos) to avoid conversion. More information about how we handled text in the framework is available here.

Last word:

If you are sure that you will only have 7-bit content in your application (no accented characters), you may use the default AnsiString type in your program. But in this case, you should add the AnsiStrings unit to your uses clause, to get overloaded string functions which will avoid most unwanted conversions.


RawByteString is still an "AnsiString." It is best described as a "universal receiver" which means it will take on whatever the source-string's codepage is at the point of assignment without forcing a codepage conversion. RawByteString was intended to be used only as a function parameter so that you will, as you've discovered, not incur a conversion between AnsiStrings with differing code-page affinities when calling utility functions which take AnsiStrings.

However, in the case above, you're assigning what is essentially an AnsiString to a UnicodeString which will incur a conversion. It must do a conversion because the RawByteString has a payload of 8bit-based characters, whereas a string (UnicodeString) has a payload of 16bit-based characters.
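To make the difference in payload concrete, here is a small sketch (Delphi 2009 or later assumed). RawToUnicode is a hypothetical helper named for this example; StringElementSize and StringCodePage are RTL functions.

```pascal
program PayloadSizeDemo;
{$APPTYPE CONSOLE}

const
  // U+00E9 followed by U+4E2D, written as codes to stay encoding-neutral
  Sample: UnicodeString = #$00E9#$4E2D;

// Hypothetical helper: the explicit cast makes the 8-bit to 16-bit
// conversion visible; the RTL decodes using the code page stored with Raw
function RawToUnicode(const Raw: RawByteString): UnicodeString;
begin
  Result := UnicodeString(Raw);
end;

var
  Raw: RawByteString;
  Wide: UnicodeString;
begin
  // UTF8String -> RawByteString: the payload and its code page are kept as-is
  Raw := UTF8String(Sample);
  // RawByteString -> UnicodeString: a real conversion has to happen
  Wide := RawToUnicode(Raw);
  WriteLn('Raw:  ', StringElementSize(Raw), ' byte(s) per element, code page ',
    StringCodePage(Raw));
  WriteLn('Wide: ', StringElementSize(Wide), ' byte(s) per element');
  WriteLn('Same text after conversion: ', Wide = Sample);
end.
```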

Tags: Unicode, Delphi