Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elements containing <CDATA/> parsed incorrectly when at the end of another element #422

Closed
0xe1f opened this issue Aug 26, 2020 · 5 comments
Milestone

Comments

@0xe1f
Copy link

0xe1f commented Aug 26, 2020

I'm experiencing a really odd issue while trying to deserialize some basic XML. Here's my setup:

        final XMLInputFactory xmlInputFactory = XMLInputFactory.newFactory();
        xmlInputFactory.setProperty("javax.xml.stream.isNamespaceAware", false);
        final XmlMapper xmlMapper = new XmlMapper(xmlInputFactory);
        xmlMapper.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);

        final RssDocument doc = xmlMapper.readValue(content, RssDocument.class);

POJOs:

    private static class RssDocument {
        public RssChannel channel = null;
    }

    private static class RssChannel {
        public String title = null;
        public String description = null;
        @JacksonXmlProperty(localName = "link")
        public String siteUrl = null;
        @JacksonXmlProperty(localName = "lastBuildDate")
        public String updated = null;
        @JacksonXmlElementWrapper(useWrapping = false)
        @JacksonXmlProperty(localName = "item")
        public List<RssItem> items = null;
    }

    private static class RssItem {
        @JacksonXmlProperty(localName = "guid")
        public String localId = null;
        @JacksonXmlProperty(localName = "pubDate")
        public String updated = null;
        public String title = null;
        @JacksonXmlProperty(localName = "link")
        public String articleUrl = null;
        @JacksonXmlProperty(localName = "dc:creator")
        public String author = null;
        @JacksonXmlProperty(localName = "content:encoded")
        public String encodedContent = null;
        @JacksonXmlProperty(localName = "description")
        public String content = null;
    }

XML I'm trying to parse:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://feeds.arstechnica.com/~d/styles/itemcontent.css"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" version="2.0">
  <channel>
    <title>Ars Technica</title>
    <link>https://arstechnica.com</link>
    <description>Serving the Technologist for more than a decade. IT news, reviews, and analysis.</description>
    <lastBuildDate>Sun, 23 Aug 2020 23:48:25 +0000</lastBuildDate>
    <language>en-US</language>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <generator>https://wordpress.org/?v=4.9.15</generator>
    <image>
      <url>https://cdn.arstechnica.net/wp-content/uploads/2016/10/cropped-ars-logo-512_480-32x32.png</url>
      <title>Ars Technica</title>
      <link>https://arstechnica.com</link>
      <width>32</width>
      <height>32</height>
    </image>
    <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.arstechnica.com/arstechnica/index" />
    <feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="arstechnica/index" />
    <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" />
    <item>
      <title>Trump announces a COVID-19 Emergency Use Authorization for blood plasma</title>
      <link>https://arstechnica.com/?p=1700815</link>
      <pubDate>Sun, 23 Aug 2020 23:26:43 +0000</pubDate>
      <dc:creator><![CDATA[John Timmer]]></dc:creator>
      <category><![CDATA[Policy]]></category>
      <category><![CDATA[Science]]></category>
      <category><![CDATA[antibodies]]></category>
      <category><![CDATA[COVID-19]]></category>
      <category><![CDATA[Emergency Use Authorization]]></category>
      <category><![CDATA[fda]]></category>
      <category><![CDATA[plasma]]></category>
      <category><![CDATA[SARS-CoV-2]]></category>
      <category><![CDATA[Trump]]></category>
      <guid isPermaLink="false">https://arstechnica.com/?p=1700815</guid>
      <description><![CDATA[Move reportedly happens over objections of health officials about limited data.]]></description>
      <content:encoded><![CDATA[<div id="rss-wrap">
<figure class="intro-image intro-left"><img src="https://cdn.arstechnica.net/wp-content/uploads/2020/08/GettyImages-1220853125-800x604.jpg" alt="Image of a man gesturing towards another."><p class="caption" style="font-size:0.8em"><a href="https://cdn.arstechnica.net/wp-content/uploads/2020/08/GettyImages-1220853125.jpg" class="enlarge-link" data-height="773" data-width="1024">Enlarge</a> <span class="sep">/</span> Donald Trump gestures to Stephen Hahn, head of the FDA, at an earlier press conference. (credit: <a rel="nofollow" class="caption-link" href="https://www.gettyimages.com/detail/news-photo/president-donald-trump-and-stephen%C2%A0hahn-director-of-the-news-photo/1220853125?adppopup=true">Drew Angerer</a>)</p>  </figure><div><a name="page-1"></a></div>
<p>Today, President Trump held a news conference to announce that the FDA has granted an Emergency Use Authorization for the treatment of COVID-19 cases using blood plasma from those formerly infected. The move comes despite significant uncertainty regarding just how effective this treatment is, and comes just days after Trump attacked the FDA for delaying its work as part of a plot to sabotage his re-election.</p>
<h2>In the blood</h2>
<p>Plasma is the liquid portion of the blood, which (among other things) contains antibodies. It has been used to treat other infections, as some antibodies can be capable of neutralizing the infecting pathogen—binding to the bacteria or virus in a way that prevents it from entering cells. Early studies have indicated that it's relatively common for those who have had a SARS-CoV-2 infection to generate antibodies that can neutralize the virus in lab tests, although the antibody response to the virus is also <a href="https://arstechnica.com/science/2020/06/antibody-testing-suggests-immune-response-post-covid-is-very-variable/">highly variable</a>.</p>
<p>In the absence of any effective treatments, people started testing this "convalescent plasma" <a href="https://arstechnica.com/science/2020/03/hospitals-in-nyc-will-start-testing-therapy-using-plasma-of-those-infected/">as early as March</a>, and testing has been expanded as the pool of post-infected individuals has continued to grow. But so far, the evidence has been mixed. One of the largest studies, led by researchers at the Mayo Clinic and including over 35,000 patients, <a href="https://doi.org/10.1101/2020.08.12.20169359">did see an effect</a>, but it was a very mild one: mortality dropped from 11.9 percent in people who received plasma four days or more after starting treatment, compared with 8.7 percent if treatment was started earlier than that. But, critically, the study lacked a control group, leaving its authors talking about "signatures of efficacy," rather than actual evidence of efficacy.</p></div><p><a href="https://arstechnica.com/?p=1700815#p3">Read 7 remaining paragraphs</a> | <a href="https://arstechnica.com/?p=1700815&comments=1">Comments</a></p><div class="feedflare">
<a href="http://feeds.arstechnica.com/~ff/arstechnica/index?a=P614tPda1Eo:cpKHHEXnQ5w:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/arstechnica/index?i=P614tPda1Eo:cpKHHEXnQ5w:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.arstechnica.com/~ff/arstechnica/index?a=P614tPda1Eo:cpKHHEXnQ5w:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/arstechnica/index?i=P614tPda1Eo:cpKHHEXnQ5w:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.arstechnica.com/~ff/arstechnica/index?a=P614tPda1Eo:cpKHHEXnQ5w:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/arstechnica/index?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.arstechnica.com/~ff/arstechnica/index?a=P614tPda1Eo:cpKHHEXnQ5w:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/arstechnica/index?d=yIl2AUoC8zA" border="0"></img></a>
</div>]]></content:encoded>
    </item>
  </channel>
</rss>

Attempting to run this as-is yields the following error:

Cannot construct instance of `readerless.service.feed.Rss2Parser$RssItem` (although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('Move reportedly happens over objections of health officials about limited data.')
 at [Source: (StringReader); line: 39, column: 111] (through reference chain: readerless.service.feed.Rss2Parser$RssDocument["channel"]->readerless.service.feed.Rss2Parser$RssChannel["item"]->java.util.ArrayList[1])
com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot construct instance of `readerless.service.feed.Rss2Parser$RssItem` (although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('Move reportedly happens over objections of health officials about limited data.')
 at [Source: (StringReader); line: 39, column: 111] (through reference chain: readerless.service.feed.Rss2Parser$RssDocument["channel"]->readerless.service.feed.Rss2Parser$RssChannel["item"]->java.util.ArrayList[1])

If I remove <description/>, the same error will be thrown for <content:encoded/>. If I remove both elements, the errors go away - even though there are multiple other CDATA elements further up. If I move the two elements so that they occur before <guid/>, the error goes away and both elements are parsed without any issues.

If, to resolve the error, I add a constructor to RssItem that accepts a String, that constructor will be called for both <description/> and <content:encoded/>, which seems highly unusual - it's almost as if those two are being parsed as separate <item/> nodes themselves.

I'm using version 2.11.2, though I've tried multiple other versions. I've also tried using woodstox-core, version 5.2.1.0, but the error persists. I've also tried marking up both content and encodedContent @JacksonXmlCData and @JacksonXmlText, to no avail.

@cowtowncoder
Copy link
Member

cowtowncoder commented Sep 10, 2020

(NOTE: updated on 13-Nov-2020)

I don't think this is due to CDATA sections, and you definitely should not have to use @JacksonXmlCData (it only affects serialization) and probably not @JacksonXmlText (unless I misread content and model).

One quick note: this

@JacksonXmlProperty(localName = "dc:creator")

seemed incorrect: local name can and should not contain colon (it is namespace separator so local name should be "creator" -- "dc" is just prefix mapping to namespace URI) -- however, code does disable namespace handling so it might be ok.

@cowtowncoder
Copy link
Member

cowtowncoder commented Nov 13, 2020

Another note: a potential problem is this:

xmlMapper.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);

since doing that will tend to hide problems that would otherwise be easier to troubleshoot. So in future it makes sense to NOT disable this feature, at least during figuring out problems: once things work it is ok to disable it for production.

However I can see that this was done to skip "xmlns" declarations so that may be reasonable, although it is possible to alternatively use @JsonIgnoreProperties(ignoreUnknown = true) for individual classes.

@cowtowncoder
Copy link
Member

Looks like this passes for 2.12: it looks like skipping of category may have been problematic with 2.11.x or earlier.

@cowtowncoder cowtowncoder added this to the 2.12.0 milestone Nov 13, 2020
cowtowncoder added a commit that referenced this issue Nov 13, 2020
@0xe1f
Copy link
Author

0xe1f commented Nov 14, 2020

Thanks!

Another note: a potential problem is this:
`xmlMapper.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);`

Agreed, though unfortunately, not disabling isn't an option, since I don't control the quality of the source XML, and I need to read what I can.

@cowtowncoder
Copy link
Member

@0xe1f understood, that's why I mentioned "during testing / troubleshooting". For production it may make sense to be more forgiving.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants