Unicode problem in Linux

Post by ehinc » 24 Jul 2012, 18:10

Hi there!
Please consider the following snippet:
Code:
wstring wstrUTF16 = L"12345";
string strUTF8;
Poco::UnicodeConverter::toUTF8(wstrUTF16, strUTF8);
if ( strUTF8 != "12345" )
{
   printf("Conversion failed!\n");
}
else
{
   printf("Conversion worked!\n");
}

This code prints "Conversion worked!" on Windows and "Conversion failed!" on Linux!
Further investigating the Linux version I noticed that the string strUTF8 contains "1\02\03\04\05\0".
Is this a bug? How can I fix it?
(I'm using Poco-1.4.3p1)
Regards,
Christoph
ehinc
 
Posts: 6
Joined: 24 Jul 2012, 18:00

Re: Unicode problem in Linux

Post by alex » 24 Jul 2012, 21:48

With gcc, wchar_t (and therefore the element type of std::wstring) is 32 bits wide by default. Compile your code with -fshort-wchar, then try this and see what you get:
Code:
wstring wstrUTF16 = L"12345";
string strUTF8;
if ( 2 == sizeof(wchar_t) ) // sizeof yields bytes, so a 16-bit wchar_t has size 2
{
  Poco::UnicodeConverter::toUTF8(wstrUTF16, strUTF8);
  if ( strUTF8 != "12345" )
  {
     printf("Conversion failed!\n");
  }
  else
  {
     printf("Conversion worked!\n");
  }
}
else printf("std::wstring is not of appropriate size (%zu bytes)!\n", sizeof(wchar_t));
alex
 
Posts: 1101
Joined: 11 Jul 2006, 16:27
Location: United_States
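
The contents reported in the first post are what one would expect if 4-byte wchar_t elements are read as 2-byte UTF-16 units. The following is a minimal sketch of that effect using only the standard library, not POCO's actual code path. The helper names toLittleEndianBytes and misreadAsUtf16 are made up for illustration, std::u32string stands in for a 32-bit std::wstring, and a little-endian layout is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Lay out 32-bit code units as little-endian bytes, which is how a 4-byte
// wchar_t string sits in memory on x86 Linux (assumption: little-endian).
std::vector<unsigned char> toLittleEndianBytes(const std::u32string& wide)
{
    std::vector<unsigned char> bytes;
    for (char32_t c : wide)
        for (int shift = 0; shift < 32; shift += 8)
            bytes.push_back(static_cast<unsigned char>((c >> shift) & 0xFF));
    return bytes;
}

// Read the same bytes as UTF-16LE code units and emit UTF-8.
// Illustration only: surrogate pairs are not handled.
std::string misreadAsUtf16(const std::vector<unsigned char>& bytes)
{
    assert(bytes.size() % 2 == 0);
    std::string utf8;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
    {
        unsigned unit = bytes[i] | (bytes[i + 1] << 8);
        if (unit < 0x80)
        {
            utf8 += static_cast<char>(unit);
        }
        else if (unit < 0x800)
        {
            utf8 += static_cast<char>(0xC0 | (unit >> 6));
            utf8 += static_cast<char>(0x80 | (unit & 0x3F));
        }
        else
        {
            utf8 += static_cast<char>(0xE0 | (unit >> 12));
            utf8 += static_cast<char>(0x80 | ((unit >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (unit & 0x3F));
        }
    }
    return utf8;
}
```

Calling misreadAsUtf16(toLittleEndianBytes(U"12345")) yields ten bytes: each digit followed by a NUL byte, because the upper half of every 32-bit element is read as a separate U+0000 code unit.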

Re: Unicode problem in Linux

Post by ehinc » 25 Jul 2012, 13:58

Yes, I know that Linux uses a 4-byte wchar_t by default.
However, I cannot and don't want to change this compiler setting in my project.
IMHO this is a bug in POCO.
There should at least be something like this:
Code:
if ( sizeof(wchar_t) == 4 )
{
  doThis();
}
else if ( sizeof(wchar_t) == 2 )
{
  doThat();
}


I solved my immediate problems by "shrinking" the 4-byte wchar_t elements to 2 bytes before calling UnicodeConverter.
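
A minimal sketch of the sizeof(wchar_t) branching idea, written against the standard library only. It is not POCO code: appendUtf8 and wideToUtf8 are hypothetical helper names, and the sketch assumes the wide string holds UTF-16 when wchar_t is 2 bytes and UTF-32 when it is 4 bytes. Unpaired surrogates are not validated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to a UTF-8 string.
void appendUtf8(std::uint32_t cp, std::string& out)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert a wide string to UTF-8 regardless of wchar_t size.
std::string wideToUtf8(const std::wstring& wide)
{
    std::string utf8;
    for (std::size_t i = 0; i < wide.size(); ++i)
    {
        std::uint32_t cp = static_cast<std::uint32_t>(wide[i]);
        // With a 2-byte wchar_t, combine a surrogate pair into one code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < wide.size())
        {
            std::uint32_t low = static_cast<std::uint32_t>(wide[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        appendUtf8(cp, utf8);
    }
    return utf8;
}
```

With this approach wideToUtf8(L"12345") gives "12345" with either wchar_t size, since each element is treated as a whole code unit rather than as raw bytes.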
Regards,
Christoph.

Re: Unicode problem in Linux

Post by alex » 25 Jul 2012, 14:37

This way of communicating won't take you far.

If you know wchar_t is 32 bits, why are you trying to deal with it as if it were 16 bits, and why are you surprised when it does not do what you intend?
If you think this is a bug, why don't you file a bug report? The documentation (and the parameter names) clearly state that this is a 16-bit value.
This is not a bug any more than the non-standardized wchar_t size is a bug. Right or wrong, this is currently a feature. You can try to fix it by submitting a patch on SourceForge. I will personally look into it and do my best to make it more robust.

Thanks.

Re: Unicode problem in Linux

Post by ehinc » 29 Jul 2012, 22:07

alex wrote: This way of communicating won't take you far.

If you know wchar_t is 32 bits, why are you trying to deal with it as if it were 16 bits, and why are you surprised when it does not do what you intend?
If you think this is a bug, why don't you file a bug report?

Because I wanted to be sure that it really is a bug. Thus my original post: "Is this a bug?"
alex wrote: The documentation (and the parameter names) clearly state that this is a 16-bit value.

Actually it doesn't! Using "wchar_t" determines the size of a single data element, not the type of data you choose to store in it. You could just as well store UTF-32 using multiple "char"s.
There simply is no standard character type that is 16 bits wide, so you have to use the 32-bit wchar_t. Putting shorter data into a longer data type must work; otherwise it would also be a problem to assign a "char" to an "int" variable.
alex wrote: This is not a bug any more than the non-standardized wchar_t size is a bug. Right or wrong, this is currently a feature.

Well, I do expect the function to work as described in the documentation, which it does not.
alex wrote: You can try to fix it by submitting a patch on SourceForge. I will personally look into it and do my best to make it more robust.

I'll look into it.
Regards,
Christoph.

Re: Unicode problem in Linux

Post by alex » 29 Jul 2012, 22:44

ehinc wrote: There simply is no standard character type that is 16 bits wide, so you have to use the 32-bit wchar_t.

The problem is that there is no standard character type width, period. I understand your widening logic, but storing UTF-16 in a 32-bit wchar_t container would be a somewhat unusual way of doing business, IMO. We'll definitely appreciate and look into your proposal/contribution here.
ehinc wrote: Well, I do expect the function to work as described in the documentation, which it does not.

UnicodeConverter documentation wrote: This class is mainly used for working with the Unicode Windows APIs and probably won't be of much use anywhere else.

The code probably should have been #ifdef'd for Windows only and the problem would have been solved (until someone had an itch to port it elsewhere).
ehinc wrote: I'll look into it.

Thanks.

Re: Unicode problem in Linux

Post by ehinc » 30 Jul 2012, 15:12

alex wrote: The problem is that there is no standard character type width, period. I understand your widening logic, but storing UTF-16 in a 32-bit wchar_t container would be a somewhat unusual way of doing business, IMO.

I agree that this is not beautiful, but it works!
All std::wstring methods work as well. That's why I was surprised that this method doesn't work.
UnicodeConverter documentation wrote: This class is mainly used for working with the Unicode Windows APIs and probably won't be of much use anywhere else.

If my previous posts sounded disrespectful that was purely unintentional - I love the POCO framework and its readable code!

My suggestion:
Code:
...
typedef std::basic_string<UInt16> UTF16String;
...
void UnicodeConverter::toUTF8(const std::wstring& utf16String, std::string& utf8String)
{
   utf8String.clear();
   UTF8Encoding utf8Encoding;
   UTF16Encoding utf16Encoding;
   // Element-wise copy into 16-bit units: with a 32-bit wchar_t the upper 16 bits of each element are stripped here
   UTF16String tempUtf16String(utf16String.begin(), utf16String.end());
   TextConverter converter(utf16Encoding, utf8Encoding);
   // Length is given in bytes; write into the utf8String output parameter
   converter.convert(tempUtf16String.data(), (int) (tempUtf16String.size() * sizeof(UInt16)), utf8String);
}

When using Linux with a 32-bit wchar_t the excess is stripped; when using Windows with a 16-bit wchar_t it's just copied over.
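
One caveat with stripping the excess, sketched below with the standard library only: it is lossless only when every 32-bit element already holds a value that fits in 16 bits. If the wide string actually holds UTF-32 code points above U+FFFF, the narrowing copy changes them, whereas a real UTF-32 to UTF-16 conversion emits a surrogate pair. The names narrowCopy and toUtf16 are hypothetical, and std::u32string stands in for a 32-bit std::wstring.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Narrowing copy: keeps only the low 16 bits of each 32-bit element,
// like constructing a 16-bit string from a 32-bit string's iterators.
std::vector<std::uint16_t> narrowCopy(const std::u32string& wide)
{
    std::vector<std::uint16_t> units;
    for (char32_t c : wide)
        units.push_back(static_cast<std::uint16_t>(c));
    return units;
}

// UTF-32 to UTF-16 conversion that emits surrogate pairs above U+FFFF.
std::vector<std::uint16_t> toUtf16(const std::u32string& wide)
{
    std::vector<std::uint16_t> units;
    for (char32_t c : wide)
    {
        assert(c <= 0x10FFFF);
        if (c < 0x10000)
        {
            units.push_back(static_cast<std::uint16_t>(c));
        }
        else
        {
            std::uint32_t v = static_cast<std::uint32_t>(c) - 0x10000;
            units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));   // high surrogate
            units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
        }
    }
    return units;
}
```

For plain text such as U"12345" both functions return the same units. For U+1F600 the narrowing copy yields the single unit 0xF600, while toUtf16 yields the pair 0xD83D 0xDE00.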

Regards,
Christoph.

Re: Unicode problem in Linux

Post by alex » 30 Jul 2012, 16:53

There is a UTF32Encoding in the trunk (which will become 1.5), so if you need the full set of conversions between UTF-8, UTF-16 and UTF-32, the trunk would be the best place to contribute.

Re: Unicode problem in Linux

Post by alex » 31 Jul 2012, 21:09

One of our contributors is currently working on this, see rev. 1904.
If you want to contribute to the effort, let me know and I'll put you in touch with him.