Copyright © 2003-2007 ZeroC, Inc.

32.21 C++ String Conversion

For languages other than C++, Ice encodes strings in their native Unicode representation, so applications can transparently use characters from non-English alphabets. For C++, however, the string encoding depends on which mapping is chosen for a particular string: the default mapping to std::string or the alternative mapping to std::wstring (see Section 6.6.1). [1] This section explains how the Ice run time encodes strings and how you can arrange for strings to be converted automatically into a particular encoding. [2]
On the wire, Ice transmits all strings as Unicode strings in UTF‑8 encoding (see Chapter 38). However, the native C++ representation for strings that contain non-English characters depends on the platform, as well as on locale settings and whether you use the narrow or wide mapping for Slice strings. By default, the Ice run time encodes strings as follows:
• Narrow strings (that is, strings mapped to std::string) are presented to the application in UTF‑8 encoding and, similarly, the application is expected to provide narrow strings in UTF‑8 encoding to the Ice run time for transmission.
With this default behavior, the application code is responsible for converting between the native codeset for 8‑bit characters and UTF‑8. For example, if the native codeset is ISO Latin‑1, the application is responsible for converting between UTF‑8 and narrow (8‑bit) characters in ISO Latin‑1 encoding.
Also note that the default behavior does not require the application to do anything if it only uses characters in the ASCII range. (This is because a string containing only characters in the ASCII range is also a valid UTF‑8 string.)
• Wide strings (that is, strings mapped to std::wstring) are automatically encoded as Unicode by the Ice run time as appropriate for the platform. For example, for AIX in 32‑bit mode, the Ice run time converts between UTF‑8 and UTF‑16 in big-endian representation whereas, for AIX in 64‑bit mode, the Ice run time converts between UTF‑8 and UTF‑32 in big-endian representation.
With this default behavior, wide strings are transparently converted between their on-the-wire representation and their native C++ representation as appropriate, so application code need not do anything special. (The exception is if an application uses a non-Unicode encoding, such as Shift‑JIS, as its native codeset.)
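To make the default narrow-string behavior concrete, the following is a minimal sketch of the conversion that application code would have to perform itself when its native codeset is ISO Latin‑1. The helper names isAscii and latin1ToUtf8 are hypothetical and not part of Ice; the sketch only illustrates why ASCII-only strings need no conversion while other Latin‑1 characters do.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Ice): true if every byte is 7-bit ASCII.
// Such a string is already valid UTF-8 and can be passed to Ice unchanged.
bool isAscii(const std::string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (static_cast<unsigned char>(s[i]) > 0x7F) {
            return false;
        }
    }
    return true;
}

// Hypothetical helper (not part of Ice): converts ISO Latin-1 bytes to UTF-8.
// Latin-1 code points 0x80..0xFF become two-byte UTF-8 sequences.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string utf8;
    utf8.reserve(latin1.size());
    for (std::size_t i = 0; i < latin1.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(latin1[i]);
        if (c < 0x80) {
            utf8 += static_cast<char>(c);                   // ASCII: copied as-is
        } else {
            utf8 += static_cast<char>(0xC0 | (c >> 6));     // lead byte
            utf8 += static_cast<char>(0x80 | (c & 0x3F));   // continuation byte
        }
    }
    return utf8;
}
```

With helpers like these, an application would call latin1ToUtf8 on each narrow string before handing it to the Ice run time, and could skip the call whenever isAscii returns true.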
The default behavior of the run time can be changed by providing application-specific string converters. If you install such converters, all Slice strings will be passed to the appropriate converter when they are marshaled and unmarshaled. The string converters therefore allow you to convert all strings transparently into their native representation without having to insert explicit conversion calls whenever a string crosses a Slice interface boundary.
You can install string converters on a per-communicator basis when you create a communicator by setting the stringConverter and wstringConverter members of the InitializationData structure (see Section 32.3). Any strings that use the default (std::string) mapping are passed through the specified stringConverter, and any strings that use the wide (std::wstring) mapping are passed through the specified wstringConverter.
The string converters are defined as follows:
namespace Ice {

class ICE_API UTF8Buffer {
public:
    virtual Byte* getMoreBytes(size_t howMany,
                               Byte* firstUnused) = 0;
    virtual ~UTF8Buffer() {}
};

template<typename charT>
class BasicStringConverter : public IceUtil::Shared {
public:
    virtual Byte*
        toUTF8(const charT* sourceStart, const charT* sourceEnd,
               UTF8Buffer&) const = 0;

    virtual void fromUTF8(const Byte* sourceStart,
                          const Byte* sourceEnd,
                          std::basic_string<charT>& target) const = 0;
};

typedef BasicStringConverter<char> StringConverter;
typedef IceUtil::Handle<StringConverter> StringConverterPtr;

typedef BasicStringConverter<wchar_t> WstringConverter;
typedef IceUtil::Handle<WstringConverter> WstringConverterPtr;

}
As you can see, the narrow and wide string converters are instantiations of the same class template, BasicStringConverter, with either a narrow or a wide character type (char or wchar_t) as the template parameter.

32.21.1 Converting to UTF‑8

If you have a string converter installed, the Ice run time calls its toUTF8 member function whenever it needs to convert a native string into UTF‑8 representation for transmission. The sourceStart and sourceEnd pointers point at the first character and one beyond the last character of the source string, respectively. The implementation of toUTF8 must return a pointer to the first unused byte following the converted string.
Your implementation of toUTF8 must allocate the returned string by calling the getMoreBytes member function of the UTF8Buffer class that is passed as the third argument. (getMoreBytes throws a MemoryLimitException if it cannot allocate enough memory.) The firstUnused parameter must point at the first unused byte of the allocated memory region. You can make several calls to getMoreBytes to incrementally allocate memory for the converted string. If you do, getMoreBytes may relocate the buffer in memory. (If it does, it copies the part of the string that was converted so far into the new memory region.) The function returns a pointer to the first unused byte of the (possibly relocated) memory.
Conversion with toUTF8 can fail because no more memory is available, in which case you should throw a MemoryLimitException. Conversion can also fail because the encoding of the source string is internally incorrect. In that case, you should throw a StringConversionFailed exception from toUTF8.
The Ice run time deallocates the returned string once it has marshaled it.

32.21.2 Converting from UTF‑8

During unmarshaling, the Ice run time calls the fromUTF8 member function on the corresponding string converter. The function converts a UTF‑8 string into its native form as a std::basic_string<charT>, that is, a std::string for the narrow converter or a std::wstring for the wide converter. (The string into which the function must place the converted characters is passed to fromUTF8 as the target parameter.)
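As an illustration of the reverse direction, here is a minimal sketch of a UTF‑8 to ISO Latin‑1 conversion with the same parameters as the narrow fromUTF8. The Byte typedef is a local stand-in for Ice::Byte and utf8ToLatin1 is a hypothetical free function; a real converter would implement this logic in its fromUTF8 override. The text above does not say which exception fromUTF8 should raise on failure, so std::runtime_error is used here purely as a placeholder.

```cpp
#include <stdexcept>
#include <string>

typedef unsigned char Byte;  // stand-in for Ice::Byte

// UTF-8 to Latin-1, with the same parameters as the narrow fromUTF8.
void utf8ToLatin1(const Byte* sourceStart, const Byte* sourceEnd,
                  std::string& target)
{
    std::string result;
    result.reserve(static_cast<std::string::size_type>(sourceEnd - sourceStart));

    const Byte* p = sourceStart;
    while (p != sourceEnd) {
        Byte b = *p++;
        if (b < 0x80) {
            // ASCII byte: identical in UTF-8 and Latin-1.
            result += static_cast<char>(b);
        } else if ((b == 0xC2 || b == 0xC3) && p != sourceEnd &&
                   (*p & 0xC0) == 0x80) {
            // Two-byte sequences led by 0xC2 or 0xC3 encode U+0080..U+00FF,
            // which is exactly the upper half of Latin-1.
            result += static_cast<char>(((b & 0x03) << 6) | (*p & 0x3F));
            ++p;
        } else {
            // Malformed input or a character outside Latin-1.
            // Placeholder exception type; an assumption of this sketch.
            throw std::runtime_error("cannot represent UTF-8 input in ISO Latin-1");
        }
    }
    target.swap(result);  // target is only modified if conversion succeeded
}
```

Building the result in a local string and swapping at the end is a design choice of the sketch: it leaves target untouched when the input cannot be converted.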

[1] The remainder of this section is not relevant to languages other than C++.

[2] See the demo directory in the Ice for C++ distribution for an example of using string converters.
