
Unicode int to UTF8 encoded string



Post by deisenhut » 13 Sep 2013, 23:20

I have a UTF-8 encoded string and want to extract the n-th character from it into a new string object of its own (so that it can be written to the UI). If it were ASCII encoded (or a UTF-16 wstring), I would just use myString.at(index). Naturally, this doesn't work with UTF-8.

I'm trying to use Poco::TextIterator with a Poco::UTF8Encoding object to iterate over the string and find the appropriate character. This seems fine, but when I dereference the iterator, it returns the character's Unicode code point as an int. I need the character as a UTF-8 encoded string instead.

How do I convert this Unicode int value into a UTF-8 encoded std::string object? This piece seems to be missing. I would have expected this functionality to be in Poco's UTF8 or Unicode class. Or is there something in the standard library that I'm missing?
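
For reference, the conversion being asked about can also be written by hand with only the standard library. The following is a minimal sketch, not a Poco API: encodeUtf8 is a hypothetical helper name, and it assumes the int holds a Unicode scalar value (0 to 0x10FFFF, excluding the surrogate range).

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Poco): encode one Unicode code point
// as a UTF-8 byte sequence of 1 to 4 bytes.
std::string encodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80)
    {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        // Surrogates (U+D800..U+DFFF) are not valid scalar values.
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogate code point");
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp <= 0x10FFFF)
    {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        throw std::invalid_argument("code point out of range");
    }
    return out;
}
```

With this sketch, the value obtained by dereferencing the iterator could be passed along as encodeUtf8(static_cast<std::uint32_t>(*it)), provided the value is non-negative.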

Thanks,
Dan

Re: Unicode int to UTF8 encoded string

Post by deisenhut » 16 Sep 2013, 16:04

I have written the following code to find the character at the specified index within the source UTF-8 string (utf8String).
   // utf8String (std::string), index (int) and retString (std::string)
   // are defined by the surrounding function.
   Poco::UTF8Encoding utf8Encoding;
   Poco::TextIterator it(utf8String, utf8Encoding);
   Poco::TextIterator end(utf8String); // If no encoding given, produces an end iterator.

   while (it != end)
   {
      --index;
      if (index < 0)
      {
         // Found the requested character: wrap its code point in a
         // one-character wide string, then convert that to UTF-8.
         wchar_t widechars[2];
         widechars[0] = *it;
         widechars[1] = L'\0';
         wstring wideString(widechars);
         cout << "String Size: " << wideString.size() << endl;
         Poco::UnicodeConverter::toUTF8(wideString, retString);
         cout << "String Size: " << retString.size() << endl;
         break;
      }
      else
      {
         ++it;
      }
   }

This finds the correct character within utf8String, but for some reason the Poco::UnicodeConverter::toUTF8() function appends an extra 0 byte to the result.

For example, when the character at index is just a backtick (`) 0x60, the debug statements above output the following:
String Size: 1
String Size: 2

A dump of the resulting string is 60 00. When the character at index is (é) 0xE9 (UTF-8 \xC3\xA9), the debug statements output:
String Size: 1
String Size: 3

And the resulting string dump is C3 A9 00.

Why is toUTF8() adding this extra 0? It's not the end of the world, but it causes the unit tests to fail because the resulting strings do not match. Note that this is being built on Linux.
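
Since the source string is already UTF-8, one alternative that sidesteps the wide-string round trip (and with it the question of the extra 0) is to copy the character's bytes straight out of the source. The following is a minimal sketch under the assumption that the input is well-formed UTF-8; utf8CharAt is a hypothetical helper name, not a Poco function.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Poco): return the bytes of the
// index-th character (0-based) of a well-formed UTF-8 string, or an
// empty string if index is past the last character.
std::string utf8CharAt(const std::string& s, std::size_t index)
{
    std::size_t pos = 0;
    while (pos < s.size())
    {
        // The lead byte tells how many bytes the character occupies.
        unsigned char lead = static_cast<unsigned char>(s[pos]);
        std::size_t len = 1;            // 0xxxxxxx
        if (lead >= 0xF0) len = 4;      // 11110xxx
        else if (lead >= 0xE0) len = 3; // 1110xxxx
        else if (lead >= 0xC0) len = 2; // 110xxxxx

        if (index == 0)
            return s.substr(pos, len);  // substr clamps at the end of s

        --index;
        pos += len;
    }
    return std::string();
}
```

For a source string containing a, é and a backtick in that order, utf8CharAt(source, 1) would yield the two bytes C3 A9 with no trailing 0, because no terminator is ever copied. The sketch does not validate continuation bytes, so malformed input would need extra checks.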

Thanks,
Dan

