Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Short NUL-only keys incorrectly detected as duplicates in CBOR and SMILE #312

Closed
DaveCTurner opened this issue Feb 18, 2022 · 8 comments
Closed
Labels
2.14 cbor has-failing-test Indicates that there exists a test case (under `failing/`) to reproduce the issue smile
Milestone

Comments

@DaveCTurner
Copy link

The following object apparently cannot be deserialised in CBOR and SMILE:

{"\u0000":"","\u0000\u0000":""}

The following are base64-encoded copies of its representation in various formats:

JSON: eyJcdTAwMDAiOiIiLCJcdTAwMDBcdTAwMDAiOiIifQ==
YAML: LS0tCiJcMCI6ICIiCiJcMFwwIjogIiIK
SMILE: OikKBfqAACCBAAAg+w==
CBOR: v2EAYGIAAGD/

As far as I can tell these all look correct, but on deserialization the field names are reported as duplicates:

com.fasterxml.jackson.core.JsonParseException: Duplicate field '�'
 at [Source: (byte[])":)
���� ��� �"; line: -1, column: 11]

	at __randomizedtesting.SeedInfo.seed([29F6747C61080477:49491DCB06A6B558]:0)
	at com.fasterxml.jackson.core.json.JsonReadContext._checkDup(JsonReadContext.java:204)
	at com.fasterxml.jackson.core.json.JsonReadContext.setCurrentName(JsonReadContext.java:198)
	at com.fasterxml.jackson.dataformat.smile.SmileParser._handleFieldName(SmileParser.java:1346)
	at com.fasterxml.jackson.dataformat.smile.SmileParser.nextToken(SmileParser.java:374)
	at com.fasterxml.jackson.core.JsonGenerator._copyCurrentContents(JsonGenerator.java:1935)
	at com.fasterxml.jackson.core.JsonGenerator.copyCurrentStructure(JsonGenerator.java:1914)
	at org.elasticsearch.xcontent.json.JsonXContentGenerator.copyCurrentStructure(JsonXContentGenerator.java:396)

Relates elastic/elasticsearch#84146

@DaveCTurner
Copy link
Author

No idea how one might fix this but here's a test for it at least (against 45e8780):

diff --git a/smile/src/test/java/com/fasterxml/jackson/dataformat/smile/parse/BasicParserTest.java b/smile/src/test/java/com/fasterxml/jackson/dataformat/smile/parse/BasicParserTest.java
index 26bfd377..a31feba8 100644
--- a/smile/src/test/java/com/fasterxml/jackson/dataformat/smile/parse/BasicParserTest.java
+++ b/smile/src/test/java/com/fasterxml/jackson/dataformat/smile/parse/BasicParserTest.java
@@ -478,6 +478,30 @@ public class BasicParserTest extends BaseTestForSmile
         p.close();
     }

+    public void testNULInFieldNames() throws IOException
+    {
+        ByteArrayOutputStream bytes = new ByteArrayOutputStream(1000);
+        JsonGenerator jgen = new SmileFactory().createGenerator(bytes);
+        jgen.writeStartObject();
+        for (String fieldName = ""; fieldName.length() < 20; fieldName += "\0") {
+            jgen.writeStringField(fieldName, "");
+        }
+        jgen.writeEndObject();
+        jgen.close();
+
+        SmileParser p = _smileParser(bytes.toByteArray());
+
+        assertToken(JsonToken.START_OBJECT, p.nextToken());
+        for (String fieldName = ""; fieldName.length() < 20; fieldName += "\0") {
+            assertToken(JsonToken.FIELD_NAME, p.nextToken());
+            assertEquals(fieldName, p.getCurrentName());
+            assertToken(JsonToken.VALUE_STRING, p.nextToken());
+            assertEquals("", p.getValueAsString());
+        }
+        assertToken(JsonToken.END_OBJECT, p.nextToken());
+        assertNull(p.nextToken());
+        p.close();
+    }

     public void testBufferRelease() throws IOException
     {

@cowtowncoder
Copy link
Member

@DaveCTurner first of all, thank you for reporting this and especially contributing a test to reproduce!

I have a suspicion that the issue is with zero-padding of symbol table used; I vaguely recall something similar with JSON (with escaping, null bytes are legal in JSON names too).
But since Smile and CBOR codecs are bit different (with more straight-forward name decoding due to lack of escaping) it is possible that handling has non working edge case.

But the gist of this is that symbol table lookups are keyed by "quads" (int containing up to 4 bytes from name; last quad possibly being incomplete). Special padding is used to avoid conflicts for case of "unused" bits at the end; this only occurs in case of raw 0 bytes.
I forget the exact details of how this was handled in JSON but given time should be able to refresh my memory.

Now: I hope to look into this in near future, but I do have bit of a backlog. I just wanted to add a note saying above. :)

@cowtowncoder cowtowncoder added 2.13 cbor smile has-failing-test Indicates that there exists a test case (under `failing/`) to reproduce the issue labels Feb 19, 2022
@cowtowncoder
Copy link
Member

Ok, yes, I can reproduce this. Somehow looks like two-null-byte case gets decoded as single null byte, if I got that right.

@cowtowncoder
Copy link
Member

cowtowncoder commented Feb 26, 2022

Interestingly enough, JSON parser from jackson-core does NOT suffer from this. Could be due to escaping required, maybe decoding goes through different route.

Actually, hmmh. Yes, it goes through different decoder but ends up with similar look up, with a call to _padLastQuad; to fix:

FasterXML/jackson-core#148

and looks like JSON backend has no mismatch. Need to see if something similar would help with CBOR, Smile.

@cowtowncoder
Copy link
Member

Hmmmh. Ok... this is tricky. Code in Smile, CBOR codecs does not keep track of "unused" bytes in quads, unlike JSON codec. But we do need to distinguish between "unused" and 0 bytes since only latter is part of actual decoded value.
Value of 0xFF (or -1 as signed) byte works because it is not a legal UTF-8 byte; that's what JSON codec uses.

Two ways to go about it, really:

  1. Fill last possibly incomplete quad with -1s when starting (so anything unshifted remains so)
  2. Pad at the end

JSON codec uses (2) for what that is worth.

@DaveCTurner
Copy link
Author

👍 that ties in with what I thought I was seeing yes. I'd completely forgotten that 0xFF wasn't a legal byte here, but yes that's just what we need. Option 1 seems simpler:

diff --git a/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileParser.java b/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileParser.java
index ea4fe91c..72560434 100644
--- a/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileParser.java
+++ b/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileParser.java
@@ -1825,7 +1825,7 @@ versionBits));
         if (len < 5) {
             int inPtr = _inputPtr;
             final byte[] inBuf = _inputBuffer;
-            int q = inBuf[inPtr] & 0xFF;
+            int q = inBuf[inPtr] & 0xFF | 0xFFFFFF00;
             if (len > 1) {
                 q = (q << 8) + (inBuf[++inPtr] & 0xFF);
                 if (len > 2) {
@@ -1849,7 +1849,7 @@ versionBits));
         q1 =  (q1 << 8) | (inBuf[inPtr++] & 0xFF);

         if (len < 9) {
-            int q2 = (inBuf[inPtr++] & 0xFF);
+            int q2 = (inBuf[inPtr++] & 0xFF) | 0xFFFFFF00;
             int left = len - 5;
             if (left > 0) {
                 q2 = (q2 << 8) + (inBuf[inPtr++] & 0xFF);
@@ -1871,7 +1871,7 @@ versionBits));
         q2 =  (q2 << 8) | (inBuf[inPtr++] & 0xFF);

         if (len < 13) {
-            int q3 = (inBuf[inPtr++] & 0xFF);
+            int q3 = (inBuf[inPtr++] & 0xFF) | 0xFFFFFF00;
             int left = len - 9;
             if (left > 0) {
                 q3 = (q3 << 8) + (inBuf[inPtr++] & 0xFF);
@@ -1920,7 +1920,7 @@ versionBits));
         } while ((len -= 4) > 3);
         // and then leftovers
         if (len > 0) {
-            int q = inBuf[inPtr] & 0xFF;
+            int q = inBuf[inPtr] & 0xFF | 0xFFFFFF00;
             if (len > 1) {
                 q = (q << 8) + (inBuf[++inPtr] & 0xFF);
                 if (len > 2) {

That fixes it for me, but I see other similar code that might also have the same problem.

cowtowncoder added a commit that referenced this issue Feb 26, 2022
@cowtowncoder
Copy link
Member

@DaveCTurner ah. I was looking at CBOR end of things. There are indeed a few other code paths to consider... partly for different lengths, partly for boundary conditions. So I'll need to improve the tests I think.

I'll see if I can fix this but just for sake of safety, for 2.14 branch. I have a feeling that there's non-trivial chance for regression for either correctness or performance here.

cowtowncoder added a commit that referenced this issue Feb 26, 2022
@cowtowncoder cowtowncoder added 2.14 and removed 2.13 labels Feb 26, 2022
@cowtowncoder cowtowncoder added this to the 2.14.0 milestone Feb 26, 2022
@cowtowncoder
Copy link
Member

Ok, fixed; yes, pre-padding is probably bit simpler although both work (and had to use post-padding for one case in Smile).
Curious as to whether there's any measurable effect on decoding speed; probably worth checking 2.13 vs 2.14 decoding speed with jackson-benchmarks (https://github.com/FasterXML/jackson-benchmarks/)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.14 cbor has-failing-test Indicates that there exists a test case (under `failing/`) to reproduce the issue smile
Projects
None yet
Development

No branches or pull requests

2 participants