SAX Parser and newlines in data

Postby drowley » 15 Jan 2008, 23:49

I have a question. It appears that the SAX parser is converting CRLF found in data into simply LF. I am using 1.1.2, but it also appears to be the same for 1.3.1. What is the reason for this? I can understand why you might want to do that outside of the tag, but shouldn't the data inside the tag be left alone? The relevant code for 1.1.2 in xmlparse.cpp is:
Code: Select all

  case XML_TOK_DATA_NEWLINE:
    if (characterDataHandler) {
      XML_Char c = 0xA;
      characterDataHandler(handlerArg, &c, 1);
    }
    else if (defaultHandler)
      reportDefault(parser, enc, s, next);
    break;

This, combined with the way the content tokenizer works, will turn the XML fragment:

<MyTag>Data line 1 [CRLF] Data line2</MyTag>

into

<MyTag>Data line 1 [LF] Data line2</MyTag>

Is this a bug? It seems to me that CRLF pairs inside a tag should not be mangled this way.
drowley
 
Posts: 3
Joined: 26 Jul 2006, 20:29
Location: Canada

Re: SAX Parser and newlines in data

Postby guenter » 16 Jan 2008, 12:18

The XML parser shouldn't do any CRLF conversion. How are you invoking the parser? If you create an istream yourself (e.g., an ifstream), do you make sure that the stream is opened in binary mode?
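
To illustrate the binary-mode point: on Windows, a stream opened in text mode translates CRLF to LF before the parser ever sees the data. A minimal sketch of reading a file without that translation follows. The helper name readFileBinary and the choice to read the whole file into a string are assumptions made for illustration, not part of POCO or expat.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not a POCO or expat API): reads the whole file
// exactly as stored on disk. std::ios::binary disables the CRLF -> LF
// translation that text mode performs on some platforms.
std::string readFileBinary(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    return std::string((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
}
```

If the string returned here still contains CR characters, the stream is not the place where they get lost.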
guenter
 
Posts: 1155
Joined: 11 Jul 2006, 16:27
Location: Austria

Re: Re: SAX Parser and newlines in data

Postby drowley » 16 Jan 2008, 21:39

> The XML parser shouldn't do any CRLF conversion. How are you invoking the parser? If you create an istream yourself (e.g., an ifstream), do you make sure that the stream is opened in binary mode?

Thanks for the reply. That was my first thought as well; however, I confirmed that the stream was indeed being passed into the parser correctly. In the function PREFIX(contentTok) in xmltok_impl.c, if a CR is found and followed by an LF, the CR is explicitly stripped.

The offending code is:
Code: Select all

  case BT_CR:
    ptr += MINBPC(enc);
    if (ptr == end)
      return XML_TOK_TRAILING_CR;
    if (BYTE_TYPE(enc, ptr) == BT_LF)
      ptr += MINBPC(enc);
    *nextTokPtr = ptr;
    return XML_TOK_DATA_NEWLINE;


When a CR is encountered, the pointer is incremented. When a following LF is encountered, the pointer is incremented again. The token XML_TOK_DATA_NEWLINE is returned, and as you can see in my original post, only an LF character is passed into the characterDataHandler, rather than a CR followed by an LF.

I patched my own local copy so that CRs are not stripped and the actual values are passed along. This solved the problem: CRLF pairs are no longer converted to a single LF.

The patch is:

Code: Select all

Index: xmlparse.cpp
===================================================================
--- xmlparse.cpp   (.../trunk/external/poco-1.1.2/XML/src)   (revision 10293)
+++ xmlparse.cpp   (.../branches/tt6523/external/poco-1.1.2/XML/src)   (revision 10293)
@@ -2471,7 +2471,7 @@
       return XML_ERROR_MISPLACED_XML_PI;
     case XML_TOK_DATA_NEWLINE:
       if (characterDataHandler) {
-        XML_Char c = 0xA;
+        XML_Char c = s[0];
         characterDataHandler(handlerArg, &c, 1);
       }
       else if (defaultHandler)
Index: xmltok_impl.c
===================================================================
--- xmltok_impl.c   (.../trunk/external/poco-1.1.2/XML/src)   (revision 10293)
+++ xmltok_impl.c   (.../branches/tt6523/external/poco-1.1.2/XML/src)   (revision 10293)
@@ -802,8 +802,6 @@
     ptr += MINBPC(enc);
     if (ptr == end)
       return XML_TOK_TRAILING_CR;
-    if (BYTE_TYPE(enc, ptr) == BT_LF)
-      ptr += MINBPC(enc);
     *nextTokPtr = ptr;
     return XML_TOK_DATA_NEWLINE;
   case BT_LF:
drowley
 
Posts: 3
Joined: 26 Jul 2006, 20:29
Location: Canada

Re: SAX Parser and newlines in data

Postby guenter » 16 Jan 2008, 22:59

Both the improbability that the underlying expat parser (one of the most widely used XML parsers) gets something that fundamental wrong, and a faint memory of once reading something about whitespace handling in XML parsers, made me consult the XML specification (http://www.w3.org/TR/xml/), and here it is in section 2.11:

> 2.11 End-of-Line Handling
>
> XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters CARRIAGE RETURN (#xD) and LINE FEED (#xA).
>
> To simplify the tasks of applications, the XML processor must behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.

So, the parser's behavior is correct. Not that this will help you in your particular case, though...
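
To make the rule from section 2.11 concrete, here is a minimal sketch of that normalization applied to a plain string. The function name normalizeLineBreaks is hypothetical and this is not expat's code; expat does the equivalent inside its tokenizer rather than as a separate pass.

```cpp
#include <string>

// Hypothetical helper (not part of expat or POCO): applies the line-break
// normalization described in section 2.11 of the XML specification.
std::string normalizeLineBreaks(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i)
    {
        if (in[i] == '\r')
        {
            // CR LF collapses to one LF; a CR not followed by LF also becomes LF.
            if (i + 1 < in.size() && in[i + 1] == '\n')
                ++i;
            out += '\n';
        }
        else
        {
            out += in[i];
        }
    }
    return out;
}
```

The rule makes no distinction between text inside element content and text elsewhere in the document, which is why the data between the tags is affected as well.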
guenter
 
Posts: 1155
Joined: 11 Jul 2006, 16:27
Location: Austria

Re: Re: SAX Parser and newlines in data

Postby drowley » 16 Jan 2008, 23:27

> Both the improbability that the underlying expat parser (one of the most widely used XML parsers) gets something that fundamental wrong, and a faint memory of once reading something about whitespace handling in XML parsers, made me consult the XML specification (http://www.w3.org/TR/xml/), and here it is in section 2.11:
>
> > 2.11 End-of-Line Handling
> >
> > XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters CARRIAGE RETURN (#xD) and LINE FEED (#xA).
> >
> > To simplify the tasks of applications, the XML processor must behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
>
> So, the parser's behavior is correct. Not that this will help you in your particular case, though...

Hmmm, that appears to be correct. I can understand converting CRLF->LF outside of the tag contents, but inside? That seems completely unnecessary and very intrusive.

Thanks for the help. I guess I am stuck using a patched version of expat.
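
For anyone who would rather not patch expat, one possible alternative, assuming you control the code that writes the XML, is to emit each CR as the character reference &#13;. Line-break normalization is applied to the raw input before parsing, so a CR written as a character reference should reach the character data handler intact. A minimal sketch follows; the helper name escapeCarriageReturns is hypothetical.

```cpp
#include <string>

// Hypothetical writer-side helper: replaces every CR with the character
// reference "&#13;". It handles CR only; characters such as '&' and '<'
// still need their usual XML escaping elsewhere.
std::string escapeCarriageReturns(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        if (text[i] == '\r')
            out += "&#13;";
        else
            out += text[i];
    }
    return out;
}
```

With this, the content "Data line 1", CR, LF, "Data line2" would be written as "Data line 1&#13;" followed by LF and "Data line2".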
drowley
 
Posts: 3
Joined: 26 Jul 2006, 20:29
Location: Canada

