Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JsonLocation consistently off by one character for many invalid JSON parsing cases #1173

Closed
hal7df opened this issue Dec 15, 2023 · 9 comments
Labels
2.17 Issues planned (at earliest) for 2.17

Comments

@hal7df
Copy link
Contributor

hal7df commented Dec 15, 2023

Issue

The JsonLocation attached to a JsonProcessingException thrown when parsing an invalid JSON string is consistently one character to the right of the invalid character, except in cases where the error is due to an unexpected EOF. This affects both JsonLocation.getCharOffset() and JsonLocation.getColumnNr() (presumably JsonLocation.getByteOffset() as well, although I haven't tested that explicitly); because this information is used to construct the exception message, that is affected as well.

I first noticed this on Jackson 2.10.0, but the issue persists on Jackson 2.16.0.

Minimally reproducible example

package example;

import com.fasterxml.jackson.core.JsonLocation;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonLocationTest {
    public static void main(String[] args) {
        for (String input : INVALID_JSON) {
            try {
                MAPPER.readTree(input);
            } catch (JsonProcessingException jpe) {
                JsonLocation location = jpe.getLocation();

                System.out.printf(
                    "%s (at line %d, column %d):%n  %s%n  ",
                    jpe.getOriginalMessage(),
                    location.getLineNr(),
                    location.getColumnNr(),
                    input
                );

                // annotate the location of the error on the previous line
                for (int i = 0; i < location.getCharOffset(); ++i) {
                  System.out.print(' ');
                }

                System.out.println('^');
            }
        }
    }

    private static final String[] INVALID_JSON = new String[] {
        "{\"invalid\" \"json\"}",
        "{\"also\", \"invalid\"}",
        "{\"still\":[\"invalid\"[]]}",
        "[\"json\":\"doesn't\",\"work\":\"like\",\"this\"]",
        "{\"really\": \"funky💃\" \"unicode\"}",
        "{",
        "}"
    };
    private static final ObjectMapper MAPPER = new ObjectMapper();
}

Expected output

Unexpected character ('"' (code 34)): was expecting a colon to separate field name and value (at line 1, column 12):
  {"invalid" "json"}
             ^
Unexpected character (',' (code 44)): was expecting a colon to separate field name and value (at line 1, column 8):
  {"also", "invalid"}
         ^
Unexpected character ('[' (code 91)): was expecting comma to separate Array entries (at line 1, column 20):
  {"still":["invalid"[]]}
                     ^
Unexpected character (':' (code 58)): was expecting comma to separate Array entries (at line 1, column 8):
  ["json":"doesn't","work":"like","this"]
         ^
Unexpected character ('"' (code 34)): was expecting comma to separate Object entries (at line 1, column 22):
  {"really": "funky💃" "unicode"}
                       ^
Unexpected end-of-input: expected close marker for Object (start marker at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` disabled); line: 1, column: 1]) (at line 1, column 2):
  {
   ^
Unexpected close marker '}': expected ']' (for root starting at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` disabled); line: 1]) (at line 1, column 1):
  }
  ^

Actual output

(The unexpected end-of-input case here is correct IMO, just including it to be thorough)

Unexpected character ('"' (code 34)): was expecting a colon to separate field name and value (at line 1, column 13):
  {"invalid" "json"}
              ^
Unexpected character (',' (code 44)): was expecting a colon to separate field name and value (at line 1, column 9):
  {"also", "invalid"}
          ^
Unexpected character ('[' (code 91)): was expecting comma to separate Array entries (at line 1, column 21):
  {"still":["invalid"[]]}
                      ^
Unexpected character (':' (code 58)): was expecting comma to separate Array entries (at line 1, column 9):
  ["json":"doesn't","work":"like","this"]
          ^
Unexpected character ('"' (code 34)): was expecting comma to separate Object entries (at line 1, column 23):
  {"really": "funky💃" "unicode"}
                        ^
Unexpected end-of-input: expected close marker for Object (start marker at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` disabled); line: 1, column: 1]) (at line 1, column 2):
  {
   ^
Unexpected close marker '}': expected ']' (for root starting at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` disabled); line: 1]) (at line 1, column 2):
  }
   ^
@cowtowncoder cowtowncoder added the 2.16 Issue planned (at earliest) for 2.16 label Dec 18, 2023
@cowtowncoder
Copy link
Member

Sounds like a flaw indeed. This is likely due to code not compensating for already read (invalid) character.
Unfortunately this may require fixing in quite a few places, both wrt:

  1. Cases of unexpected/invalid character (dozens of places per parser)
  2. 4 different parsing backends (... for JSON); byte-based, char-based, DataInput-backed, async

One thing that would be useful as the first step would be addition of (failing) test cases.
There are a few location-checking tests under src/test/java/com/fasterxml/jackson/core/read/; tests with names like LocationXXX.java

@hal7df
Copy link
Contributor Author

hal7df commented Dec 22, 2023

I've taken a stab at adding some tests in #1175.

@cowtowncoder
Copy link
Member

@hal7df Thank you for the tests!

One thing I forgot to ask earlier: if you haven't yet done so (I don't think you have but just in case if you did just LMK), we'd need a CLA:

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

which is needed for the first contribution (but is good for all future contributions, so one time chore).
It's usually easiest to do by printing, filling & signing, scanning/photo, email document to cla at fasterxml dot com.

If you could send it whenever you have a change that'd be great. Apologies for forgetting to mention this earlier; looking forwards to all the contributions!
And thank you once again for the unit tests.

cowtowncoder added a commit that referenced this issue Dec 24, 2023
@hal7df
Copy link
Contributor Author

hal7df commented Dec 24, 2023

@cowtowncoder I did notice the mention of the CLA in the contributing doc in the main Jackson repo, but it appeared to make an exception for test code changes. If you'd like one nonetheless I can probably get one to you after sometime the holidays (my employer is involved, so I'd have to run it up the chain when everyone gets back).

@cowtowncoder
Copy link
Member

@hal7df Yeah you are right that CLA is really needed for code that we ship, and test code isn't covered.
So no hurry -- we just need it for non-test code contributions. We can until there's a PR for such.

In the meantime I've been able to fix many cases, as well as realized that there's one trickier case (that of unrecognized tokens).

@cowtowncoder cowtowncoder changed the title JsonLocation consistently off by one character when attempting to parse invalid JSON strings (except for unexpected EOF) JsonLocation consistently off by one character for many invalid JSON parsing cases Dec 28, 2023
@cowtowncoder cowtowncoder modified the milestones: 2.10, 2.16.2 Dec 28, 2023
@cowtowncoder
Copy link
Member

Fixed most cases; 3 fail in a way that is bit trickier to fix and may need refactoring of decoding -- those I'll tackle for 2.17.
And one case wrt Unicode characters that affects byte-backed and async will also need bit more thought.

But the bulk have been resolved for 2.16 branch, to be included in eventual 2.16.2.

@cowtowncoder
Copy link
Member

@hal7df Ok. So, turns out that about 2/3 are simple cases where last character just needs to be "unread" (offset adjusted) and things will work.

3 cases are bit more involved since they read invalid token; and while in theory location could be constructed before the first character is read, that'd be wasteful for the common case of not needing location. So will preferably try to adjust location after the fact. This, however, may be challenging for case of buffer boundaries etc.

But one failing case cannot be solved per se but needs to be document (if not already done): one showing case of non-ASCII Unicode character, Column number for byte-based input.
For byte-based input, column offset is (and has to remain) byte-based offset and not character based.

@cowtowncoder
Copy link
Member

Ugggh. Fix here breaks something else -- ability to continue parsing in some cases with error recovery (single-character syntax problems). Only caught by jackson-jr test, will need to add a test or two in jackson-core itself and figure out if there is a way to resolve this problem.

cowtowncoder added a commit that referenced this issue Feb 11, 2024
cowtowncoder added a commit that referenced this issue Feb 12, 2024
@cowtowncoder cowtowncoder reopened this Feb 12, 2024
@cowtowncoder cowtowncoder removed this from the 2.16.2 milestone Feb 12, 2024
@cowtowncoder cowtowncoder removed the 2.16 Issue planned (at earliest) for 2.16 label Feb 12, 2024
@cowtowncoder cowtowncoder added the 2.17 Issues planned (at earliest) for 2.17 label Feb 12, 2024
@cowtowncoder
Copy link
Member

Had to revert initial fix; this will also mean eventual fix will only go in 2.17, not earlier.

Added new test against regression wrt error recovery.

Hoping to fix in 2.17 very soon now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.17 Issues planned (at earliest) for 2.17
Projects
None yet
Development

No branches or pull requests

2 participants