In Part I, we saw how Unicode support is a huge benefit for Delphi developers,
enabling communication with all character sets in the Unicode universe. We saw
the basics of the UnicodeString type and how it will be used in Delphi.

In Part II, we’ll look at some of the new features of the Delphi Runtime
Library that support Unicode and general string handling.
TCharacter Class
The Tiburon RTL includes a new class called TCharacter, which is found in the Character unit. It is a sealed class
that consists entirely of static class functions. Developers should not create
instances of TCharacter, but
rather merely call its static class methods directly. Those class functions do a
number of things, including:
- Convert characters to upper or lower case
- Determine whether a given character is of a certain type, e.g. whether the
character is a letter, a number, a punctuation mark, etc.
TCharacter uses the standards set forth by the Unicode consortium.
Developers can use the TCharacter class to do many things previously done
with sets of chars. For instance, this code:
uses
  Character;

begin
  if MyChar in ['a'..'z', 'A'..'Z'] then
  begin
    ...
  end;
end;
can be easily replaced with
uses
  Character;

begin
  if TCharacter.IsLetter(MyChar) then
  begin
    ...
  end;
end;
The Character unit also
contains a number of standalone functions that wrap up the functionality of each
class function from TCharacter, so
if you prefer a simple function call, the above can be written as:
uses
  Character;

begin
  if IsLetter(MyChar) then
  begin
    ...
  end;
end;
Thus the TCharacter class can be used for almost any manipulation or checking
of characters that you might care to do.
In addition, TCharacter
contains class methods to determine if a given character is a high or low
surrogate of a surrogate pair.
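As a minimal sketch of the surrogate checks, the console program below walks a
string that contains one character outside the Basic Multilingual Plane. The
program name and the sample string are illustrative assumptions; the calls used
are TCharacter.IsHighSurrogate and TCharacter.IsLowSurrogate from the Character
unit as it ships with this release.

```pascal
program SurrogateDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Character;

var
  S: string;
  I: Integer;
begin
  // 'A' followed by U+1D11E (musical G clef), which UTF-16 stores as the
  // surrogate pair $D834 $DD1E. The sample text is only an illustration.
  S := 'A' + WideChar($D834) + WideChar($DD1E);

  // Classify each UTF-16 element of the string.
  for I := 1 to Length(S) do
    if TCharacter.IsHighSurrogate(S[I]) then
      Writeln(I, ': high surrogate')
    else if TCharacter.IsLowSurrogate(S[I]) then
      Writeln(I, ': low surrogate')
    else
      Writeln(I, ': ordinary character');
end.
```

Here Length(S) is 3 even though the string holds only two characters, which is
why code that walks a string element by element may need these checks.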
TEncoding Class
The Tiburon RTL also includes a new class called TEncoding. Its purpose is to define a
specific type of character encoding so that you can tell the VCL what type of
encoding you want used in specific situations.
For instance, you may have a TStringList instance that contains text
that you want to write out to a file. Previously, you would have written:
begin
  ...
  MyStringList.SaveToFile('SomeFilename.txt');
  ...
end;
and the file would have been written out using the default ANSI encoding.
That code will still work fine – it will write out the file using ANSI string
encoding as it always has, but now that Delphi supports Unicode string data,
developers may want to write out string data using a specific encoding. Thus,
SaveToFile (as well as LoadFromFile) now take an optional
second parameter that defines the encoding to be used:
begin
  ...
  MyStringList.SaveToFile('SomeFilename.txt', TEncoding.Unicode);
  ...
end;
Execute the above code and the file will be written out as a Unicode (UTF-16)
encoded text file.
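The same optional parameter works in both directions. The following is a small
sketch, not a definitive recipe, that writes a list as UTF-8 and reads it back
with the same encoding. The file name demo.txt and the sample text are
assumptions made for the example.

```pascal
program SaveLoadDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

var
  Source, Target: TStringList;
begin
  Source := TStringList.Create;
  Target := TStringList.Create;
  try
    // Two Cyrillic letters, written as escapes so that the example does not
    // depend on how this source file itself is encoded.
    Source.Add(#$0414#$0430);

    // Write the text as UTF-8, then read it back with the same encoding.
    Source.SaveToFile('demo.txt', TEncoding.UTF8);
    Target.LoadFromFile('demo.txt', TEncoding.UTF8);

    if Target[0] = Source[0] then
      Writeln('Round trip preserved the text')
    else
      Writeln('Text was altered');
  finally
    Target.Free;
    Source.Free;
  end;
end.
```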
TEncoding will also convert a
given set of bytes from one encoding to another, retrieve information about the
bytes and/or characters in a given string or array of characters, convert any
string into an array of byte
(TBytes), and other functionality
that you may need with regard to the specific encoding of a given string or
array of chars.
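To make these operations concrete, here is a short sketch that turns a string
into UTF-16 bytes, converts those bytes to UTF-8, and decodes them again. The
program and variable names are illustrative; GetBytes, Convert and GetString
are the TEncoding methods being demonstrated.

```pascal
program EncodingDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: string;
  Utf16Bytes, Utf8Bytes: TBytes;
begin
  S := 'Delphi';

  // String -> bytes in UTF-16 (two bytes per character for this text).
  Utf16Bytes := TEncoding.Unicode.GetBytes(S);

  // Bytes in one encoding -> bytes in another.
  Utf8Bytes := TEncoding.Convert(TEncoding.Unicode, TEncoding.UTF8, Utf16Bytes);

  Writeln(Length(Utf16Bytes));  // 12
  Writeln(Length(Utf8Bytes));   // 6

  // Bytes -> string again.
  Writeln(TEncoding.UTF8.GetString(Utf8Bytes));  // Delphi
end.
```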
The TEncoding class includes
the following class properties that give you singleton access to a TEncoding instance of the given
encoding:
class property ASCII: TEncoding read GetASCII;
class property BigEndianUnicode: TEncoding read GetBigEndianUnicode;
class property Default: TEncoding read GetDefault;
class property Unicode: TEncoding read GetUnicode;
class property UTF7: TEncoding read GetUTF7;
class property UTF8: TEncoding read GetUTF8;
The Default property refers to
the ANSI active codepage. The Unicode property refers to UTF-16.
TEncoding also includes the following class function:
class function TEncoding.GetEncoding(CodePage: Integer): TEncoding;
It returns an instance of TEncoding that has the affinity for the code page
passed in the parameter.
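A brief sketch of GetEncoding in use follows. It assumes a Windows system on
which code page 1251 is installed, and it assumes that the caller owns the
returned instance and should free it, which is worth verifying against your RTL
version. The variable names are illustrative.

```pascal
program GetEncodingDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  Cyrillic: TEncoding;
  S: string;
  Bytes: TBytes;
begin
  // Assumption: unlike the singleton class properties, an instance obtained
  // from GetEncoding is owned by the caller.
  Cyrillic := TEncoding.GetEncoding(1251);
  try
    S := #$0414#$0430;               // two Cyrillic letters
    Bytes := Cyrillic.GetBytes(S);   // one byte per letter in Windows-1251
    Writeln(Length(Bytes));
  finally
    Cyrillic.Free;
  end;
end.
```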
In addition, it includes the following function:
function GetPreamble: TBytes;
which will return the correct BOM (byte order mark) for the given encoding.
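For illustration, the following sketch prints the preamble bytes of two of the
standard encodings in hexadecimal. The helper procedure PrintPreamble is a
hypothetical name introduced only for this example.

```pascal
program PreambleDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: writes the BOM of an encoding as hex byte values.
procedure PrintPreamble(const Name: string; Encoding: TEncoding);
var
  Bom: TBytes;
  B: Byte;
begin
  Bom := Encoding.GetPreamble;
  Write(Name, ': ');
  for B in Bom do
    Write(IntToHex(B, 2), ' ');
  Writeln;
end;

begin
  PrintPreamble('UTF8', TEncoding.UTF8);        // UTF8: EF BB BF
  PrintPreamble('Unicode', TEncoding.Unicode);  // Unicode: FF FE
end.
```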
TEncoding is also interface compatible with the .NET class called Encoding.
TStringBuilder
The RTL now includes a class called TStringBuilder. Its purpose is revealed
in its name: it is a class designed to “build up” strings. TStringBuilder contains a number of
overloaded functions for adding, replacing, and inserting content into a given
string. The string builder class makes it easy to create single strings out of a
variety of different data types. All of the Append, Insert, and Replace functions return an instance of
TStringBuilder, so they can easily
be chained together to create a single string.
For example, you might choose to use a TStringBuilder in place of a complicated
Format statement by writing the following code:
procedure TForm86.Button2Click(Sender: TObject);
var
  MyStringBuilder: TStringBuilder;
  Price: Double;
begin
  MyStringBuilder := TStringBuilder.Create('');
  try
    Price := 1.49;
    Label1.Caption := MyStringBuilder.Append('The apples are $').Append(Price).
      Append(' a pound.').ToString;
  finally
    MyStringBuilder.Free;
  end;
end;
TStringBuilder is also interface compatible with the .NET class called
StringBuilder.
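Because Insert and Replace return the builder as well, they can join the same
chain as Append. The following console sketch shows one such chain; the
variable names and sample text are assumptions chosen for the example.

```pascal
program BuilderDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  SB: TStringBuilder;
  Year: Integer;
begin
  Year := 2009;
  SB := TStringBuilder.Create;
  try
    // Each call returns the same builder, so the calls can be chained.
    SB.Append('Hello').Append(' ').Append(Year)
      .Insert(0, '>> ')              // insert at the zero-based index 0
      .Replace('Hello', 'Delphi');   // replace every occurrence
    Writeln(SB.ToString);            // >> Delphi 2009
  finally
    SB.Free;
  end;
end.
```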
The RTL adds a number of routines that support the use of Unicode
strings.
StringElementSize
StringElementSize returns the typical size, in bytes, of an element (code unit)
in a given string. Consider the following code:
procedure TForm88.Button3Click(Sender: TObject);
var
  A: AnsiString;
  U: UnicodeString;
begin
  A := 'This is an AnsiString';
  Memo1.Lines.Add('The ElementSize for an AnsiString is: '
    + IntToStr(StringElementSize(A)));
  U := 'This is a UnicodeString';
  Memo1.Lines.Add('The ElementSize for an UnicodeString is: '
    + IntToStr(StringElementSize(U)));
end;
The result of the code above will be:
The ElementSize for an AnsiString is: 1
The ElementSize for an UnicodeString is: 2
StringCodePage
StringCodePage will return the
Word value that corresponds to the
codepage for a given string.
Consider the following code:
procedure TForm88.Button2Click(Sender: TObject);
type
  CyrillicString = type AnsiString(1251);
var
  A: AnsiString;
  U: UnicodeString;
  U8: UTF8String;
  C: CyrillicString;
begin
  A := 'This is an AnsiString';
  Memo1.Lines.Add('AnsiString Codepage: ' + IntToStr(StringCodePage(A)));
  U := 'This is a UnicodeString';
  Memo1.Lines.Add('UnicodeString Codepage: ' + IntToStr(StringCodePage(U)));
  U8 := 'This is a UTF8string';
  Memo1.Lines.Add('UTF8string Codepage: ' + IntToStr(StringCodePage(U8)));
  C := 'This is a CyrillicString';
  Memo1.Lines.Add('CyrillicString Codepage: ' + IntToStr(StringCodePage(C)));
end;
On a system whose active ANSI code page is 1252, the above code will result in
the following output:
AnsiString Codepage: 1252
UnicodeString Codepage: 1200
UTF8string Codepage: 65001
CyrillicString Codepage: 1251
Other RTL Features for Unicode
There are a number of other routines for converting strings from one encoding
to another, including:
- UnicodeStringToUCS4String
- UCS4StringToUnicodeString
- UnicodeToUtf8
- Utf8ToUnicode
In addition, the RTL also declares a type called RawByteString, which is a
string type with no encoding affiliated with it:
RawByteString = type AnsiString($FFFF);
The purpose of the RawByteString type is to enable the passing of string data
of any code page without doing any codepage conversions. This is most useful
for routines that do not care about a specific encoding, such as byte-oriented
string searches. Normally, this means that parameters of routines that process
strings without regard for the string's code page should be of type
RawByteString. Declaring variables of type RawByteString should rarely, if
ever, be done, as this can lead to undefined behavior and potential data loss.
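As a sketch of that guidance, the routine below takes a RawByteString parameter
so that strings of different code pages can be passed to it without conversion.
CountByte and the CyrillicString type are hypothetical names introduced for the
example.

```pascal
program RawByteDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  CyrillicString = type AnsiString(1251);

// Hypothetical byte-oriented routine: counts occurrences of one byte value.
// The RawByteString parameter accepts any AnsiString-based type as-is.
function CountByte(const S: RawByteString; B: AnsiChar): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if S[I] = B then
      Inc(Result);
end;

var
  U8: UTF8String;
  C: CyrillicString;
begin
  U8 := 'a-b-c';
  C := 'x-y-z';
  Writeln(CountByte(U8, '-'));  // 2
  Writeln(CountByte(C, '-'));   // 2
end.
```

Note that RawByteString appears only as the parameter type; the variables keep
their specific string types, in line with the advice above.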
In general, string types are assignment compatible with each other.
For instance:
MyUnicodeString := MyAnsiString;
will perform as expected – it will take the contents of the AnsiString and place them into a UnicodeString. You should in general be
able to assign one string type to another, and the compiler will do the work
needed to make the conversions, if possible.
Some conversions, however, can result in data loss, and one must watch out for
this when moving from a string type that includes Unicode data to one that does
not. For instance, you can assign a UnicodeString to an AnsiString, but if the
UnicodeString contains characters that have no mapping in the active ANSI code
page at runtime, those characters will be lost in the conversion. Consider the
following code:
procedure TForm88.Button4Click(Sender: TObject);
var
  U: UnicodeString;
  A: AnsiString;
begin
  U := 'This is a UnicodeString';
  A := U;
  Memo1.Lines.Add(A);
  U := 'Добро пожаловать в мир Юникода с использованием Дельфи 2009!!';
  A := U;
  Memo1.Lines.Add(A);
end;
The output of the above when the current OS code page is 1252 is:
This is a UnicodeString
????? ?????????? ? ??? ??????? ? ?????????????? ?????? 2009!!
As you can see, because Cyrillic characters have no mapping in Windows-1252,
information was lost when assigning this UnicodeString to an AnsiString. The
characters that could not be represented in the code page of the AnsiString
were lost and replaced by question marks.
SetCodePage
SetCodePage, declared in the System.pas unit as
procedure SetCodePage(var S: RawByteString; CodePage: Word; Convert: Boolean = True);
is a new RTL procedure that sets a new code page for a given AnsiString. The
optional Convert parameter determines whether the payload of the string itself
should be converted to the given code page. If the Convert parameter is False,
then only the code page recorded for the string is altered. If the Convert
parameter is True, then the payload of the passed string will be converted to
the given code page.
SetCodePage should be used
sparingly and with great care. Note that if the codepage doesn’t actually match
the existing payload (i.e. Convert
is set to False), then
unpredictable results can occur. Also if the existing data in the string is
converted and the new codepage doesn’t have a representation for a given
original character, data loss can occur.
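The following sketch shows the converting form of the call on a plain ASCII
payload, where no data loss can occur. It assumes that the shipped declaration
takes a var RawByteString parameter, which is why the AnsiString variable is
cast at the call site; check the declaration in your System.pas before relying
on this.

```pascal
program SetCodePageDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  A: AnsiString;
begin
  A := 'abc';
  // Code page of the string as first assigned (the active ANSI code page).
  Writeln(StringCodePage(A));

  // Convert = True: the payload is converted along with the code page.
  // The cast assumes a var RawByteString parameter, as noted above.
  SetCodePage(RawByteString(A), 1251, True);

  // Code page recorded for the string after the call.
  Writeln(StringCodePage(A));
end.
```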
Getting TBytes from Strings
The RTL also includes a set of overloaded routines for extracting an array of
bytes from a string. As we’ll see in Part III, it is recommended that you use
TBytes rather than string as a data buffer. The RTL makes this easy by
providing overloaded versions of BytesOf() that take the different string types
as a parameter.
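A minimal sketch of these routines follows. It also calls WideBytesOf, a
related SysUtils routine not discussed above, to contrast the ANSI bytes that
BytesOf produces for a UnicodeString with the raw UTF-16 bytes. The variable
names are illustrative.

```pascal
program BytesOfDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  A: AnsiString;
  U: UnicodeString;
  Bytes: TBytes;
begin
  A := 'abc';
  U := 'abc';

  Bytes := BytesOf(A);      // the bytes of the AnsiString payload
  Writeln(Length(Bytes));   // 3

  Bytes := BytesOf(U);      // UnicodeString encoded with the default ANSI encoding
  Writeln(Length(Bytes));   // 3

  Bytes := WideBytesOf(U);  // the UTF-16 bytes, two per character here
  Writeln(Length(Bytes));   // 6
end.
```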
Tiburon’s Runtime Library is now completely capable of supporting the new
UnicodeString. It includes new classes and routines for handling, processing,
and converting Unicode strings, for managing codepages, and for ensuring an easy
migration from earlier versions.
In Part III, we’ll cover the specific code constructs that you’ll need to
look out for in ensuring that your code is Unicode ready.