Special characters in XML

Post by tnarol » 14 Jun 2007, 13:27

Hi,

I'm having trouble with special characters such as "é" in the XML output of the DOMWriter.

The following program writes child1 as "text1" but child2 as "t?2".
If I then try to read child2 back, it contains the wrong text.
What should I do to be able to write these special characters and then read them back?

#include "Poco/DOM/Document.h"
#include "Poco/DOM/Element.h"
#include "Poco/DOM/Text.h"
#include "Poco/DOM/AutoPtr.h"
#include "Poco/DOM/DOMWriter.h"
#include "Poco/XML/XMLWriter.h"
#include "Poco/ASCIIEncoding.h"
#include <fstream>

using namespace Poco::XML;

int main(int argc, char** argv)
{
    // Build a DOM document and write it to the file out.xml.
    AutoPtr<Document> pDoc = new Document;

    AutoPtr<Element> pRoot = pDoc->createElement("root");
    pDoc->appendChild(pRoot);

    AutoPtr<Element> pChild1 = pDoc->createElement("child1");
    AutoPtr<Text> pText1 = pDoc->createTextNode("text1");
    pChild1->appendChild(pText1);
    pRoot->appendChild(pChild1);

    AutoPtr<Element> pChild2 = pDoc->createElement("child2");
    AutoPtr<Text> pText2 = pDoc->createTextNode("téxt2");
    pChild2->appendChild(pText2);
    pRoot->appendChild(pChild2);

    DOMWriter writer;
    Poco::ASCIIEncoding encoding;

    writer.setEncoding("Default", encoding);
    writer.setNewLine("\n");
    writer.setOptions(XMLWriter::PRETTY_PRINT);
    std::ofstream out("out.xml");
    writer.writeNode(out, pDoc);
    return 0;
}
tnarol
 
Posts: 22
Joined: 21 Mar 2007, 18:56
Location: France

Re: Special characters in XML

Post by tnarol » 14 Jun 2007, 15:36

One more piece of information that might be important: I'm using Visual C++ (under Windows) with the Unicode character set.

Re: Special characters in XML

Post by guenter » 14 Jun 2007, 18:02

Any characters written to XML documents must be UTF-8 encoded. If you use the Windows default UTF-16 Unicode encoding, you must use the Poco::UnicodeConverter class in Foundation (or something equivalent) to transcode your text from UTF-16 (or any other encoding such as Latin-1) to UTF-8.
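
To make the transcoding step concrete, here is a minimal sketch of what a UTF-16 to UTF-8 conversion does, written with the standard library only. It is not POCO's UnicodeConverter and makes no claim about how that class is implemented. The names appendUtf8 and utf16ToUtf8 are hypothetical, and std::u16string stands in for the UTF-16 std::wstring used on Windows so that the sketch behaves the same on every platform.

```cpp
#include <cstddef>
#include <string>

// Append one Unicode code point (at most U+10FFFF) to a UTF-8 string.
void appendUtf8(std::string& out, unsigned long cp)
{
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: transcode UTF-16 code units to UTF-8 bytes.
// Unpaired surrogates are replaced with U+FFFD.
std::string utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        unsigned long cp = in[i];
        const bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        const bool hasLow = i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF;
        if (isHigh && hasLow)
        {
            // Combine the surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        else if (cp >= 0xD800 && cp <= 0xDFFF)
        {
            cp = 0xFFFD; // unpaired surrogate
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

With this sketch, utf16ToUtf8(u"t\u00E9xt2") yields the six bytes 74 C3 A9 78 74 32, which is the form the XML classes expect for "téxt2".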
guenter
 
Posts: 1129
Joined: 11 Jul 2006, 16:27
Location: Austria

Re: Re: Special characters in XML

Post by tnarol » 14 Jun 2007, 23:43

Thanks for answering. Actually, I don't know whether I'm really using UTF-16 encoding. My Visual C++ project uses the Unicode character set, the data that I want to put in the XML document is stored as a std::string, and I want to be able to read it back into a std::string.
Is that possible?

The UnicodeConverter takes a std::wstring as input for its "toUTF8" method, so I don't know whether it makes sense to build a std::wstring from my std::string and then convert it back to another std::string using toUTF8().

Is std::string suitable for storing special characters?
Should I convert the strings before I put them into the nodes that the DOMWriter writes?
Should I convert the strings that I read back from the nodes parsed by the DOMParser?
Which encoding should I choose for the DOMWriter and the DOMParser?

Re: Re: Special characters in XML

Post by tnarol » 15 Jun 2007, 11:05

I managed to perform the expected operation using the TextConverter as follows. Could you confirm that this is the most efficient way to proceed?

XML writing

Latin1Encoding latin1Enc;
UTF8Encoding utf8Enc;

// input text, assumed to be Latin-1 encoded
std::string str = "téxt2";
std::string strConv;

TextConverter textConv(latin1Enc, utf8Enc);
textConv.convert(str, strConv);

AutoPtr<Text> pText2 = pDoc->createTextNode(strConv);

XML reading

std::string str1 = node->nodeValue();

Latin1Encoding latin1Enc;
UTF8Encoding utf8Enc;
std::string strConv;

TextConverter textConv(utf8Enc, latin1Enc);
// output text is in strConv
textConv.convert(str1, strConv);
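
For readers who want to see what these two conversions do at the byte level, the following is a rough standard-library sketch of a Latin-1/UTF-8 round trip. It is not how POCO's TextConverter is implemented; the function names latin1ToUtf8 and utf8ToLatin1 are hypothetical, and substituting '?' for characters that Latin-1 cannot represent is a choice made for this sketch only.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: every Latin-1 byte is the code point of the same value,
// so bytes >= 0x80 become a two-byte UTF-8 sequence.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string utf8;
    for (std::size_t i = 0; i < latin1.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(latin1[i]);
        if (c < 0x80)
        {
            utf8 += static_cast<char>(c);
        }
        else
        {
            utf8 += static_cast<char>(0xC0 | (c >> 6));
            utf8 += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return utf8;
}

// Hypothetical helper: code points above U+00FF (and malformed sequences)
// have no Latin-1 representation and are replaced with '?'.
std::string utf8ToLatin1(const std::string& utf8)
{
    std::string latin1;
    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        const bool hasNext = i + 1 < utf8.size();
        const unsigned char next =
            hasNext ? static_cast<unsigned char>(utf8[i + 1]) : 0;
        if (c < 0x80)
        {
            latin1 += static_cast<char>(c);
        }
        else if ((c == 0xC2 || c == 0xC3) && hasNext && (next & 0xC0) == 0x80)
        {
            // Two-byte sequence encoding U+0080..U+00FF.
            latin1 += static_cast<char>(((c & 0x03) << 6) | (next & 0x3F));
            ++i;
        }
        else
        {
            latin1 += '?';
            // Skip the continuation bytes that belong to this sequence.
            while (i + 1 < utf8.size()
                   && (static_cast<unsigned char>(utf8[i + 1]) & 0xC0) == 0x80)
            {
                ++i;
            }
        }
    }
    return latin1;
}
```

In this sketch the Latin-1 byte E9 ("é") becomes the UTF-8 pair C3 A9 on the way in and is folded back to E9 on the way out, so utf8ToLatin1(latin1ToUtf8(s)) returns s for any Latin-1 input.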

Re: Re: Re: Special characters in XML

Post by guenter » 15 Jun 2007, 11:47

Starting with POCO 1.3.0, POCO uses UTF-8 for all strings used internally (e.g., filesystem paths, etc.). So the best way to deal with Unicode in POCO is to convert everything into UTF-8 as soon as it comes into the application, and then only work with UTF-8 strings in your application. The XML parser also uses UTF-8 everywhere (SAX and DOM), so, once you've standardized on UTF-8 in your application, you don't have to think about encodings anymore. So you'd use the TextConverter only when you read some user input or file that's not UTF-8 encoded, as well as when you need to provide output that's not UTF-8 encoded.
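
Converting at the point of entry presupposes knowing whether incoming bytes are already UTF-8. One possible way to check, sketched here with the standard library only, is to validate the byte sequence before deciding whether a TextConverter pass is needed. This is an illustration, not something POCO provides under this name: isValidUtf8 is a hypothetical helper, and a successful check does not prove the text was meant as UTF-8 (pure ASCII passes, for instance).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if s is well-formed UTF-8.
// Rejects stray or missing continuation bytes, overlong encodings,
// surrogate code points, and values above U+10FFFF.
bool isValidUtf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        unsigned long minCp = 0;

        if (c < 0x80) { ++i; continue; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; minCp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; minCp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; minCp = 0x10000; }
        else return false; // stray continuation byte or invalid lead byte

        if (i + len > n) return false; // truncated sequence

        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < minCp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                // outside Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate

        i += len;
    }
    return true;
}
```

For example, the Latin-1 bytes of "téxt2" (74 E9 78 74 32) fail this check because E9 announces a three-byte sequence that never follows, whereas the UTF-8 bytes 74 C3 A9 78 74 32 pass.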

Re: Re: Re: Re: Special characters in XML

Post by tnarol » 15 Jun 2007, 13:36

> Starting with POCO 1.3.0, POCO uses UTF-8 for all strings used internally (e.g., filesystem paths, etc.). So the best way to deal with Unicode in POCO is to convert everything into UTF-8 as soon as it comes into the application, and then only work with UTF-8 strings in your application. The XML parser also uses UTF-8 everywhere (SAX and DOM), so, once you've standardized on UTF-8 in your application, you don't have to think about encodings anymore. So you'd use the TextConverter only when you read some user input or file that's not UTF-8 encoded, as well as when you need to provide output that's not UTF-8 encoded.

OK, thanks. You're right, it's better to have all strings UTF-8 encoded as soon as they enter the application. It means I have some extra work to do on string input and output, but I'm glad that it won't hurt the performance of the XML import/export.
