Detecting 'text' file type (ANSI vs UTF-8)

There is no 100% reliable way to tell an ANSI encoding (e.g. Windows-1250) from UTF-8. Some ANSI files cannot be valid UTF-8, but every valid UTF-8 file might just as well be a different ANSI file. (Not to mention ASCII-only data, which is both ANSI and UTF-8 by definition, but that is a purely theoretical point.)

For instance, the byte sequence C4 8D might be the “č” character in UTF-8, or it might be “ÄŤ” in Windows-1250. Both readings are possible and correct. However, 8D 9A can be “Ťš” in Windows-1250, but it is not a valid UTF-8 string, because 8D is a continuation byte and cannot start a sequence.

You have to resort to some kind of heuristic, e.g.

  1. If the file contains a byte sequence that is not valid UTF-8, assume it is ANSI.
  2. Otherwise, if the file begins with the UTF-8 BOM (EF BB BF), assume it is UTF-8 (it might not be, but a plain-text ANSI file beginning with those characters is very improbable).
  3. Otherwise, assume it is UTF-8. (Or, try more heuristics, maybe using the knowledge of the language of the text, etc.)
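
The three steps above can be sketched roughly as follows. This is only a minimal illustration, assuming the whole file has already been read into a byte array; the names IsValidUTF8, GuessEncoding and TEncodingGuess are hypothetical and do not come from any library. The validity check follows the well-formed byte ranges of UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF).

```pascal
program GuessEncodingDemo;
{$APPTYPE CONSOLE}

type
  // Hypothetical result type for this sketch.
  TEncodingGuess = (egAnsi, egUtf8Bom, egUtf8Probable);

// Hypothetical helper: True if Data is well-formed UTF-8.
function IsValidUTF8(const Data: array of Byte): Boolean;
var
  i, n, Need: Integer;
  b, MinB, MaxB: Byte;
begin
  Result := False;
  n := Length(Data);
  i := 0;
  while i < n do
  begin
    b := Data[i];
    // Allowed range of the first continuation byte; tightened below.
    MinB := $80;
    MaxB := $BF;
    if b < $80 then
      Need := 0                          // plain ASCII
    else if (b >= $C2) and (b <= $DF) then
      Need := 1                          // two-byte sequence
    else if (b >= $E0) and (b <= $EF) then
    begin
      Need := 2;                         // three-byte sequence
      if b = $E0 then MinB := $A0;       // reject overlong forms
      if b = $ED then MaxB := $9F;       // reject UTF-16 surrogates
    end
    else if (b >= $F0) and (b <= $F4) then
    begin
      Need := 3;                         // four-byte sequence
      if b = $F0 then MinB := $90;       // reject overlong forms
      if b = $F4 then MaxB := $8F;       // reject values above U+10FFFF
    end
    else
      Exit;                              // stray continuation or invalid lead byte
    Inc(i);
    while Need > 0 do
    begin
      if (i >= n) or (Data[i] < MinB) or (Data[i] > MaxB) then
        Exit;                            // truncated or malformed sequence
      MinB := $80;
      MaxB := $BF;
      Inc(i);
      Dec(Need);
    end;
  end;
  Result := True;
end;

// Hypothetical helper implementing steps 1 to 3 of the heuristic.
function GuessEncoding(const Data: array of Byte): TEncodingGuess;
begin
  if not IsValidUTF8(Data) then
    Result := egAnsi                     // step 1
  else if (Length(Data) >= 3) and (Data[0] = $EF) and
          (Data[1] = $BB) and (Data[2] = $BF) then
    Result := egUtf8Bom                  // step 2
  else
    Result := egUtf8Probable;            // step 3
end;

begin
  // C4 8D is valid UTF-8, so this is expected to report True.
  Writeln(IsValidUTF8([$C4, $8D]));
  // 8D 9A starts with a continuation byte, so this is expected to report False.
  Writeln(IsValidUTF8([$8D, $9A]));
  Writeln(GuessEncoding([$8D, $9A]) = egAnsi);
end.
```

Keeping the BOM case as a separate result is a design choice of this sketch: both step 2 and step 3 end in "UTF-8", but a caller may want to know how strong the evidence was.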

See also the method used by Notepad.


To summarize:

  • For basic usage, the simplest solution is the older IsTextUnicode() function.
  • For advanced usage, use the function above, then check for a BOM (~1 KB), then check the locale info of the particular OS; only then do you reach, perhaps, about 98% accuracy.
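
For reference, a call to IsTextUnicode() from Delphi might look like the sketch below. It assumes the declaration found in the Windows unit (buffer pointer, byte count, optional pointer to a test mask); the wrapper name LooksLikeUTF16 is hypothetical. Note that this API is documented as testing for Unicode (UTF-16) text, so its result says nothing about whether a buffer is UTF-8 or ANSI.

```pascal
program IsTextUnicodeDemo;
{$APPTYPE CONSOLE}

uses
  Windows;

// Hypothetical wrapper: asks the Windows API whether the bytes look like
// UTF-16 text. It does not detect UTF-8.
function LooksLikeUTF16(const Data: array of Byte): Boolean;
begin
  Result := False;
  if Length(Data) > 0 then
    // Passing nil as the last argument lets the API apply all of its tests.
    Result := IsTextUnicode(@Data[0], Length(Data), nil);
end;

begin
  // The word "Hi" encoded as UTF-16LE bytes.
  Writeln(LooksLikeUTF16([$48, $00, $69, $00]));
end.
```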

OTHER INFO PEOPLE MAY FIND INTERESTING:

https://groups.google.com/forum/?lnk=st&q=delphi+WIN32+functions+to+detect+which+encoding++is+in+use&rnum=1&hl=pt-BR&pli=1#!topic/borland.public.delphi.internationalization.win32/_LgLolX25OA

function FileMayBeUTF8(FileName: WideString): Boolean;
var
  Stream: TMemoryStream;
  BytesRead: Integer;
  ArrayBuff: array[0..127] of Byte;
  PreviousByte: Byte;
  i: Integer;
  YesSequences, NoSequences: Integer;
begin
  // Give Result a defined value for the early exit below.
  Result := False;
  if not WideFileExists(FileName) then
    Exit;
  YesSequences := 0;
  NoSequences := 0;
  // Kept across buffer reads so that pairs spanning two chunks are also
  // checked. Starting at 0 (ASCII) counts a leading continuation byte as a "no".
  PreviousByte := 0;
  Stream := TMemoryStream.Create;
  try
    Stream.LoadFromFile(FileName);
    repeat
      // Read the next chunk from the TMemoryStream.
      BytesRead := Stream.Read(ArrayBuff, SizeOf(ArrayBuff));
      for i := 0 to BytesRead - 1 do
      begin
        // Is this a continuation byte (10xxxxxx)?
        if (ArrayBuff[i] and $C0) = $80 then
        begin
          if (PreviousByte and $C0) = $C0 then
            Inc(YesSequences)   // follows a lead byte (11xxxxxx): looks like UTF-8
          else if (PreviousByte and $80) = 0 then
            Inc(NoSequences);   // follows an ASCII byte: cannot be UTF-8
        end;
        PreviousByte := ArrayBuff[i];
      end;
    until BytesRead < SizeOf(ArrayBuff);
    // Below, >= makes ASCII files count as UTF-8, which is no problem.
    // A simple > would catch only files that contain UTF-8 sequences.
    Result := (YesSequences >= NoSequences);
  finally
    Stream.Free;
  end;
end;

Now testing this function...

In my humble opinion, the only way to START doing this check correctly is to check the OS charset first, because in almost all cases the result ends up depending on the OS anyway. There is no way to escape it.
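
As a small illustration of checking the OS charset, the sketch below reads the system ANSI code page with the Windows API function GetACP (declared in the Windows unit). How the value is then used, for example as the assumed encoding once UTF-8 has been ruled out, is an assumption of this sketch and not something the function decides.

```pascal
program ShowAnsiCodePage;
{$APPTYPE CONSOLE}

uses
  Windows;

begin
  // GetACP returns the current system ANSI code page identifier,
  // e.g. 1250 for Central European or 1252 for Western European Windows.
  Writeln('System ANSI code page: ', GetACP);
end.
```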

Remarks:

  • The WideFileExists() function is taken from TntClasses.pas (Koders.net source).

If the UTF-8 file begins with the UTF-8 byte-order mark (BOM), this is easy:

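function UTF8FileBOM(const FileName: string): boolean;
var
  txt: file;
  bytes: array[0..2] of byte;
  amt: integer;
  OldFileMode: byte;
begin
  // FileMode is a global variable, so remember it and restore it afterwards.
  OldFileMode := FileMode;
  FileMode := fmOpenRead;
  try
    AssignFile(txt, FileName);
    Reset(txt, 1);  // untyped file with a record size of 1 byte
    try
      // Read the first three bytes and compare them with EF BB BF.
      BlockRead(txt, bytes, 3, amt);
      result := (amt=3) and (bytes[0] = $EF) and (bytes[1] = $BB) and (bytes[2] = $BF);
    finally
      CloseFile(txt);
    end;
  finally
    FileMode := OldFileMode;
  end;
end;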

Otherwise, it is much more difficult.