XML to JSON - IOException when parsing XML with XMLMapper #226

Closed
dbories opened this issue Mar 13, 2017 · 11 comments

@dbories

dbories commented Mar 13, 2017

Hello,

I found a weird error when trying to parse XML documents with a com.fasterxml.jackson.dataformat.xml.XmlMapper. If an XML element has no attributes and contains both text nodes and sub-elements, the XML parser fails with the following exception:

 Exception in thread "main" java.io.IOException: Expected END_ELEMENT, got event of type 1
 	at com.fasterxml.jackson.dataformat.xml.deser.XmlTokenStream.skipEndElement(XmlTokenStream.java:180)
 	at com.fasterxml.jackson.dataformat.xml.deser.FromXmlParser.nextToken(FromXmlParser.java:558)
 	at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:223)
 	at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:69)
 	at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:15)
 	at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3798)
 	at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2381)
 	at test.mvn.TestJackson.main(TestJackson.java:18)

Example of an XML document that produces the error:

<?xml version="1.0" encoding="UTF-8"?>
 <root>
   <a>lorem <b>ipsum</b> dolor</a>
 </root>

If the parent element has at least one attribute, the parsing is OK.
Example of XML document that does not produce any error:

<?xml version="1.0" encoding="UTF-8"?>
 <root>
   <a id="1">lorem <b>ipsum</b> dolor</a>
 </root>

The Java code used:

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.FileUtils;

import com.fasterxml.jackson.dataformat.xml.XmlMapper;

public class TestJackson {

  public static void main(
      final String[] args) throws IOException {
    final String iString = FileUtils.readFileToString(new File("C:/test.xml"), StandardCharsets.UTF_8);

    new XmlMapper().readTree(iString);
  }

}

Reproduced with 2.8.0 and 2.8.7
Not reproduced with 2.7.8

@cowtowncoder
Member

The exception does not make sense, and I'll see what can be done. But beyond that, this usage cannot be supported all that well, since it is so-called "mixed content": the element contains both text ("cdata" in XML lingo) and child elements. Such a structure cannot be expressed in a useful way with JsonNode.

But I don't think there should be such a low-level exception.
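
For readers unfamiliar with the term, here is a minimal sketch of what "mixed content" means structurally. It uses only the JDK's DOM parser, not Jackson; the class and method names (`MixedContentCheck`, `hasMixedContent`) are hypothetical and chosen for illustration. It only describes the shape of the XML and does not predict when Jackson will throw: by this definition the reported document with `id="1"` is mixed content too, even though it parses without the exception.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Jackson: reports whether any element
// contains both child elements and non-whitespace text.
final class MixedContentCheck {

  public static boolean hasMixedContent(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return isMixed(doc.getDocumentElement());
    } catch (Exception e) {
      // Wrap checked parser exceptions so callers do not have to declare them.
      throw new IllegalArgumentException("Could not parse XML", e);
    }
  }

  private static boolean isMixed(Element element) {
    boolean hasText = false;
    boolean hasChildElement = false;
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node child = children.item(i);
      if (child instanceof Element) {
        hasChildElement = true;
        if (isMixed((Element) child)) {
          return true; // a descendant is already mixed
        }
      } else if (child.getNodeType() == Node.TEXT_NODE
          || child.getNodeType() == Node.CDATA_SECTION_NODE) {
        if (!isXmlWhitespaceOnly(child.getNodeValue())) {
          hasText = true;
        }
      }
    }
    return hasText && hasChildElement;
  }

  // XML 1.0 white space is only space, tab, carriage return and line feed.
  private static boolean isXmlWhitespaceOnly(String text) {
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c != ' ' && c != '\t' && c != '\r' && c != '\n') {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // The failing document from the report: <a> holds text and a <b> child.
    System.out.println(hasMixedContent("<root><a>lorem <b>ipsum</b> dolor</a></root>")); // true
    // Only indentation between tags: not mixed content.
    System.out.println(hasMixedContent("<root>\n  <a>\n    <b>ipsum</b>\n  </a>\n</root>")); // false
  }
}
```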

cowtowncoder added a commit that referenced this issue Apr 7, 2017
@jollygoodfellowz

After upgrading from 2.7.8 to 2.8.8, I am also seeing this issue, which did not exist previously. This is the XML that fails for me:

<blah>BLAH<blah>BLAH<blah>BLAH<blah>BLAH!</blah></blah></blah></blah>

@pijusn

pijusn commented Aug 1, 2017

This comment doesn't add much to the issue itself, but it might save someone some time.

At first glance, my input did not contain any mixed-content elements. However, the XML contained the zero-width character \u2028 (RegExp: \x{2028}). Removing it fixed the problem. I used the Zero Width Characters Locator IntelliJ plugin to find it. You can also search for it in any editor that supports regular expressions, but there are more characters like this one.

Whether Jackson should interpret it as whitespace is arguable. Probably not worth the effort.

@dbories
Author

dbories commented Aug 1, 2017

I'm not sure I understand. In which XML did you find this character? The problem in this thread was related to "mixed content" in XML.

@pijusn

pijusn commented Aug 1, 2017

In the XML that was being parsed (the input). It contained a zero-width character, which got interpreted as CDATA rather than as whitespace. So Jackson saw an element containing other elements plus CDATA (that invisible character) and interpreted it as a mixed-content element.
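
As a rough illustration of how such characters could be located without an editor plugin, the sketch below scans a string for code points that look blank but are not XML white space. XML 1.0 treats only space, tab, carriage return and line feed as white space, so U+2028 (named LINE SEPARATOR in Unicode) counts as character data. The class name `InvisibleCharScan` and the `isSuspicious` heuristic are assumptions made for this example; the heuristic is not exhaustive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists characters that render as blank or invisible
// but that an XML parser will report as character data.
final class InvisibleCharScan {

  // The four characters XML 1.0 accepts as white space.
  public static boolean isXmlWhitespace(int cp) {
    return cp == ' ' || cp == '\t' || cp == '\r' || cp == '\n';
  }

  // Heuristic only: Unicode separators, other Java "whitespace", and
  // format characters such as U+200B (zero width space).
  public static boolean isSuspicious(int cp) {
    if (isXmlWhitespace(cp)) {
      return false;
    }
    return Character.isWhitespace(cp)
        || Character.isSpaceChar(cp)
        || Character.getType(cp) == Character.FORMAT;
  }

  public static List<String> findSuspicious(String xml) {
    List<String> hits = new ArrayList<>();
    for (int i = 0; i < xml.length(); ) {
      int cp = xml.codePointAt(i);
      if (isSuspicious(cp)) {
        hits.add(String.format("U+%04X at index %d", cp, i));
      }
      i += Character.charCount(cp);
    }
    return hits;
  }

  public static void main(String[] args) {
    String xml = "<root>\u2028<a>text</a></root>";
    System.out.println(findSuspicious(xml)); // prints [U+2028 at index 6]
  }
}
```

Running a check like this on the input before handing it to the mapper would show whether an element that looks like it contains only child elements actually contains stray character data as well.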

@dbories
Author

dbories commented Aug 1, 2017

From the way you commented, I thought you were talking about my initial problem, and I got confused because my XML has absolutely no zero-width characters.

So you mean you found a case where, if some hypothetical XML contains a zero-width character, it is considered as CDATA and we get the same exception?

Maybe that's another problem, worth its own thread.

@cowtowncoder
Member

@pijusn For what it is worth, a zero-width character is not considered white space from the XML specification's perspective. Jackson's handling could choose to interpret it differently, but the challenge is that there are probably quite a few similar non-letter/number codepoints, so the logic can get unwieldy.

@dbories Yes, this is a bit of a different issue, although I can see similarities (i.e., the same challenge, but only once you realize that what looks like nothing/whitespace is actually considered character data).
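
Since the module does not do this itself, one possible caller-side approach is to normalize the input before parsing. The following is only a sketch under stated assumptions: the class `InvisibleCharNormalizer` and its rules are hypothetical, and it rewrites the whole document, including real text content. That may be unacceptable where non-breaking spaces or zero-width joiners are meaningful, so treat it as a starting point rather than a recommended fix.

```java
// Hypothetical pre-processing step, not a Jackson feature: makes blank-looking
// characters harmless before the XML is parsed.
final class InvisibleCharNormalizer {

  public static String normalize(String xml) {
    StringBuilder out = new StringBuilder(xml.length());
    xml.codePoints().forEach(cp -> {
      boolean xmlWhitespace = cp == ' ' || cp == '\t' || cp == '\r' || cp == '\n';
      if (xmlWhitespace) {
        out.appendCodePoint(cp);           // already valid XML white space
      } else if (Character.isWhitespace(cp) || Character.isSpaceChar(cp)) {
        out.append(' ');                   // e.g. U+2028, U+00A0 become a plain space
      } else if (Character.getType(cp) == Character.FORMAT) {
        // e.g. U+200B, U+FEFF: dropped entirely
      } else {
        out.appendCodePoint(cp);           // everything else is kept as-is
      }
    });
    return out.toString();
  }

  public static void main(String[] args) {
    String cleaned = normalize("<root>\u2028<a>x\u200By</a></root>");
    System.out.println(cleaned); // prints <root> <a>xy</a></root>
  }
}
```

The number of special cases even in this short version gives a sense of why building such rules into the parser could get unwieldy.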

@michaelcgood

I am still experiencing this error with 2.8.10.

@kchelluri

kchelluri commented Aug 22, 2018

Still an issue with 2.9.6. Does anyone have a workaround?

@cowtowncoder
Member

@kchelluri Mixed content can not be handled with this module at this point; there is no simple fix for that, nor active plans for adding something to work around it.

@cowtowncoder
Member

Created #402 as catch-all for ideas to improve mixed content handling.

@cowtowncoder cowtowncoder added this to the 2.12.0 milestone Jun 5, 2020
cowtowncoder added a commit that referenced this issue Jun 5, 2020
alex-bel-apica pushed a commit to ApicaSystem/jackson-dataformat-xml that referenced this issue Sep 4, 2020
alex-bel-apica pushed a commit to ApicaSystem/jackson-dataformat-xml that referenced this issue Sep 4, 2020
# Conflicts:
#	release-notes/VERSION-2.x
#	src/test/java/com/fasterxml/jackson/dataformat/xml/failing/MixedContentTreeRead226Test.java