UnicodeConverter differences between Win32 and Linux

Post by deisenhut » 30 Jul 2013, 23:33

I'm converting an application originally developed for native WinCE/Win32 to use Poco, and I'm test-building it on Linux for cross-platform support (with a possible move to Linux later). I'm using Poco v1.4.6p1.

I'm running into a problem utilizing the UnicodeConverter to switch from wstring (UTF16) to string (UTF8). My test code converts from UTF16 to UTF8 and back again. The code is as follows (where _tstring is defined as std::wstring):

Code:
   _tstring chinese = _T("\x4F60\x597D\x4E16\x754C");

   for(unsigned int i = 0; i < chinese.length(); ++i)
      printf(" %04x", chinese[i]);
   printf("\n");

   string utf8Chinese ;
   Poco::UnicodeConverter::toUTF8(chinese, utf8Chinese);

   for(unsigned int i = 0; i < utf8Chinese.length(); ++i)
      printf(" %02x", utf8Chinese[i]);
   printf("\n");

   _tstring convBackChinese ;
   Poco::UnicodeConverter::toUTF16(utf8Chinese, convBackChinese);

   for(unsigned int i = 0; i < convBackChinese.length(); ++i)
      printf(" %04x", convBackChinese[i]);
   printf("\n");


When I run this on Win32 (Windows 7 x64), the output that I'm receiving is as follows:

Code:
 4f60 597d 4e16 754c
 ffffffe4 ffffffbd ffffffa0 ffffffe5 ffffffa5 ffffffbd ffffffe4 ffffffb8 ffffff96 ffffffe7 ffffff95 ffffff8c
 4f60 597d 4e16 754c


This output appears correct, with the UTF8 version containing 3 bytes per character. The ff's prepended to each UTF8 byte appear because char is signed here, so bytes of 0x80 and above are sign-extended when they are passed to printf; I didn't bother fixing that since it is just test code. The problem comes when I run this code on Linux (Ubuntu x64), where I receive the following output:

Code:
 4f60 597d 4e16 754c
 ffffffe4 ffffffbd ffffffa0 00 ffffffe5 ffffffa5 ffffffbd 00 ffffffe4 ffffffb8 ffffff96 00 ffffffe7 ffffff95 ffffff8c 00
 4f60 0000 597d 0000 4e16 0000 754c 0000


For some reason the conversion from UTF16 to UTF8 is creating four bytes per character with the fourth byte being 0. Then this 0 is retained in the conversion back to UTF16.

I would try to use the TextConverter class, but it only takes string objects, not wstring objects. I've been working at this for several days now and I'm preparing to look at the latest development version (v1.5.1), which has some expanded UTF support listed in the Changelog. Does anybody know what I'm doing wrong, or whether this has been fixed in the development version?

Thanks,
Dan
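
On the ff prefixes in the UTF8 dump above: a minimal sketch of one way to avoid them, assuming the cause is a signed char being promoted to int. The helper name hexDump is hypothetical and not part of Poco; it casts each byte to unsigned char before formatting.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of Poco): formats every byte of s
// as a space-prefixed, two-digit hex value.
std::string hexDump(const std::string& s)
{
    std::string out;
    char buf[4]; // space + two hex digits + terminating NUL
    for (std::string::size_type i = 0; i < s.length(); ++i)
    {
        // Go through unsigned char first so that bytes >= 0x80 are not
        // sign-extended when they are promoted for the variadic call.
        const unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof(buf), " %02x", value);
        out += buf;
    }
    return out;
}
```

Calling std::printf("%s\n", hexDump(utf8Chinese).c_str()) in place of the second loop should then print two hex digits per byte.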

Re: UnicodeConverter differences between Win32 and Linux

Post by guenter » 31 Jul 2013, 11:24

As its documentation says, Poco::UnicodeConverter won't be of much use unless wchar_t is 16 bits, as it is on Windows:

A convenience class that converts strings from UTF-8 encoded std::strings to UTF-16 encoded std::wstrings and vice-versa.
This class is mainly used for working with the Unicode Windows APIs and probably won't be of much use anywhere else.


Internally, UnicodeConverter::toUTF8() assumes that wchar_t/std::wstring strings are 2 bytes per character using UTF-16 encoding. This assumption fails on most non-Windows platforms, where wchar_t is 32 bits, with unspecified encoding.
I suggest avoiding std::wstring in portable code and using std::string with UTF-8 instead. If C++11 is an option, use std::u16string.

The problem with 32-bit wchar_t and the POCO text encoding classes is that the POCO text encoding classes always assume a properly aligned code sequence, e.g. a sequence of UTF-8 bytes or UTF-16 16-bit integers. This cannot be satisfied with 32-bit wchar_t, unless you use UTF-32 encoding (which, however, is different from UTF-16 for some characters). So, using UTF-32 with wchar_t on Linux may be an option, but then again you'll have to deal with different encodings depending on platform. POCO 1.5.1 and later have a Poco::UTF32Encoding class. 1.4.6 does not.
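
To illustrate where the extra 00 bytes in the Linux output could come from, here is a rough sketch that does not use Poco at all. It assumes a little-endian machine and simply treats every 32-bit element as two consecutive 16-bit units, which is what a converter expecting 16-bit wchar_t would effectively see. The names appendUtf8 and misreadAsUtf16 are made up for this example, and surrogate pairs are deliberately ignored.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Encodes a single 16-bit unit as UTF-8 (BMP only, no surrogate handling).
void appendUtf8(std::uint16_t unit, std::string& out)
{
    if (unit < 0x80)
    {
        out += static_cast<char>(unit);
    }
    else if (unit < 0x800)
    {
        out += static_cast<char>(0xC0 | (unit >> 6));
        out += static_cast<char>(0x80 | (unit & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xE0 | (unit >> 12));
        out += static_cast<char>(0x80 | ((unit >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (unit & 0x3F));
    }
}

// Hypothetical illustration: reads each 32-bit element as two 16-bit
// units in little-endian order (low half first, then high half).
std::string misreadAsUtf16(const std::vector<std::uint32_t>& wide)
{
    std::string out;
    for (std::size_t i = 0; i < wide.size(); ++i)
    {
        appendUtf8(static_cast<std::uint16_t>(wide[i] & 0xFFFF), out);
        // The high half is 0 for BMP characters, so it becomes a 00 byte.
        appendUtf8(static_cast<std::uint16_t>(wide[i] >> 16), out);
    }
    return out;
}
```

Under these assumptions, the input 0x4F60 yields the bytes e4 bd a0 00, which matches the pattern in the Linux dump from the first post.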

Re: UnicodeConverter differences between Win32 and Linux

Post by deisenhut » 31 Jul 2013, 15:32

Thanks. I'll look at having the Linux build treat wchar_t as 32-bit (UTF-32). Though it looks like I may be better off converting the whole application to UTF8 on the Linux build, as there is no readily available conversion between 32-bit wstrings and UTF8 strings.

Dan
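
For the missing conversion between 32-bit strings and UTF8 mentioned above, a hand-rolled encoder is one possible stopgap. The following is only a sketch under a few assumptions: C++11 is available (for std::u32string), the input really holds UTF-32 code points, and invalid code points should be rejected with an exception. The function name utf32ToUtf8 is hypothetical and not part of Poco.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Poco): encodes UTF-32 as UTF-8.
std::string utf32ToUtf8(const std::u32string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        const char32_t cp = in[i];
        // Reject values outside Unicode and the surrogate range.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid Unicode code point");

        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

On a platform where wchar_t is 32 bits and holds Unicode code points, a std::wstring w could be passed as utf32ToUtf8(std::u32string(w.begin(), w.end())). That would not be correct on Windows, where wchar_t holds UTF-16 units.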