Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add mechanism for processing invalid XML names (transforming to valid ones) #531

Merged
merged 1 commit into from Sep 12, 2022

Conversation

mensinda
Copy link
Contributor

@mensinda mensinda commented Jun 2, 2022

This commit introduces the PROCESS_ESCAPED_MALFORMED_TAGS and
ESCAPE_MALFORMED_TAGS features that control whether invalid
tag names will be escaped with an attribute.

fixes #523
fixes #524

@cowtowncoder
Copy link
Member

First of all, thank you for contributing this PR.

But as to proposed changes.. Hmmh. I really don't like the approach of magic names like this. While I understand it could be used to retain information it results in kind of special-purpose XML only processable by Jackson. I don't think I want to consider this approach; instead what I could consider is brute-force replacement of "bad" characters from Java property name to XML name used. This would result in ugly XML names but retain the structure.

@mensinda
Copy link
Contributor Author

mensinda commented Jun 2, 2022

While I understand it could be used to retain information

Retaining information is an absolute must-have in my use case (serializing and deserializing data).

I don't think I want to consider this approach; instead what I could consider is brute-force replacement of "bad" characters from Java property name to XML name used. This would result in ugly XML names but retain the structure.

So, basically something like my base32 proposal from #523? Or do you mean something that isn't reversible like replacing those chars with _?

@mensinda
Copy link
Contributor Author

mensinda commented Jun 2, 2022

If you also don't like something like base32 tags, would it be OK to have the default being replacing invalid chars with _ and then having an option that either does the current logic from this PR or base32? I really do need 100% reversibility and would like to have an upstream solution for this...

@mensinda
Copy link
Contributor Author

mensinda commented Jun 7, 2022

I have refactored the PR a bit, and now there is support for multiple strategies for escaping tags:

Key Description Why
NONE No escaping - can produce invalid XML Backwards compatibility
REPLACE Replaces invalid characters with _ No jackson specific magic in the output
ATTRIBUTE_ESCAPE The real tag name in an attribute Valid XML and the tags are human readable
BASE64 base64url encoded tag with a base64_tag_ prefix. Valid XML, 100% reversible, and no magic attributes.

Each strategy has its pros and cons. So instead of us choosing what tradeoffs to make, why not let the user decide?

@mensinda
Copy link
Contributor Author

Anything that else that is still needed? Is it OK to include multiple strategies for escaping tag names or do you only want one that can be toggled with a flag?

@cowtowncoder
Copy link
Member

I have nothing against configuration, but I am not going to accept structure-changing modifications (adding wrapping element, additional attributes and so on). These will be unlikely to work reliably and will end up maintenance nightmares.
I will also not consider 100% information retaining a goal in the sense that one could do transformations only using XML content -- XML module is built to assume there is certain amount of property metadata that comes from Java types (POJOs). This is why name-mangling is acceptable: binding to/from POJO properties can be made even if name transformation itself was lossy.
All that is needed is that there is a one-way transformation from logical property name (which may not be valid XML name) into valid XML name; reverse transform is not needed.
This is how XML module is designed to work: it may not be what you would prefer but it is the design that I will follow.

Having said all of that I am not against one-way transformations that also happen to be reversible. This could include base64 (and similar) encoding with prefix. SInce this is not a structural transformation and can work with existing handlers it is acceptable to me.

@mensinda
Copy link
Contributor Author

So, for this PR specifically, do you want me to just drop the ATTRIBUTE_ESCAPE in this case?

@cowtowncoder
Copy link
Member

cowtowncoder commented Jun 15, 2022

Sigh. I am not sure I will want to go with this approach at all. My main concern is complexity it adds to already fragile processing.
If I was to go with this approach, yes, dropping ATTRIBUTE_ESCAPE would be needed.

I guess backtracking a bit, the way I would see transformations would not operate at streaming level at all, but at property handling (databind). That's where it is possible (and necessary) to use one-way transform from logical property name to physical to match. This leaves streaming parser/generator as-is without knowing anything about changes.
One challenging case there would be that of wrappers, I think, but other than that modifying property names with Bean[De]SerializerModifier would be relatively straight-forward.

Translating things at lower level would have some benefits but my worry is that handling is already quite convoluted. I see why reversible transformation would be necessary there.
But doing it there also adds significant overhead, per-element/attribute processing that is not needed if bindings are defined at databind level, where (de)serializers are constructed ones and name translation similarly.

So: I don't think I will accept approach as a whole as defined here.

But. I would accept extension points that allow user to do this if changes for default case are as non-intrusive as possible. This may even include out-of-the-box implementations; but also needs to allow custom implementation of converter(s). So instead of Enum selection, there'd have to be something like XmlNameConverter to be implemented by user (but possibly also with one or more standard implementations).
It would be ok to have default "no-op" implementation invoked.

... but not sure if this can be done without adding overhead of keeping QName and so on.
That is another part I would want to avoid.

@mensinda
Copy link
Contributor Author

So: I don't think I will accept approach as a whole as defined here.

Fine by me, bit (I know we have different priorities here) in the end I need something where I can put in an arbitrary map convert it to XML and get the same map back out. Ideally, I would also like to avoid having to maintain a proprietary fork :)

But. I would accept extension points that allow user to do this if changes for default case are as non-intrusive as possible.

Would that be something dataformat XML specific or would be a generic data-bind solution preferred?

... but not sure if this can be done without adding overhead of keeping QName and so on.
That is another part I would want to avoid.

I can refactor this PR into a generic extension point (I am also fine with dropping support for extra magic attributes), but could you then specify more clearly what would be acceptable and what not? Do you just want me to just map on Strings or do you want me to convert Strings to QName and vice versa?

@cowtowncoder
Copy link
Member

@mensinda So: extension I was thinking of would be XML-specific, and registered similar to what PR does but instead of pre-defined enum, with actual (stateless) handler.

One tricky part is that I do not want additional conversions to/from QName in case handler is not registered (or default no-op one is used). So ideally default case without new handler would not have additional overhead. I am fine having some sort of opaque state/storage if need be, for handler to provide/take, if that is needed. Just not additional processing when custom handler not needed.

Does this make more sense?

As to databind-level approach: that could be pursued separately and would probably just allow replacing all invalid-in-QName characters with underscore (or maybe some other configurable character). I think that is not something that would work for your use case.

This commit adds an extendable `XmlTagProcessor` that is used for
escaping invalid characters in XML tag names.

fixes FasterXML#523
fixes FasterXML#524
@mensinda
Copy link
Contributor Author

@cowtowncoder I have updated the PR to use a new extension point (XmlTagProcessor) as requested.

@mensinda
Copy link
Contributor Author

ping :)

@cowtowncoder
Copy link
Member

Hi there! Sorry, haven't had any time to look into this. It's on my list, hoping to get back to it in near future.

@mensinda
Copy link
Contributor Author

Hi, because of the timeline "The plan is to get the first Release Candidate (2.14.0-rc1) out during August 2022" from https://cowtowncoder.medium.com/jackson-2-14-sneak-peek-79859babaa4, I was wandering how likely it would be that this PR could go into the 1.14 release?

@cowtowncoder
Copy link
Member

@mensinda I have my long (but shortening slowly) list of things to work through prior to release; this is an entry. So timing of RC1 may well move but I will have a look here before that, or at least final release (possible to have multiple RCs, even with new features).

@cowtowncoder
Copy link
Member

@mensinda Hi there! Apologies for this taking so long but I FINALLY had a chance to go back and read the PR.
I like the approach and hope we can get it in 2.14.

I will be adding some smaller notes as comments but I have only one bigger thing I'd like to change: avoiding construction of XmlTag container for every element. I think there are 3 possibilities for this:

  1. Since this is sort of "internal" value, could just change it to be mutable value class, passed by stream to processor and processor can modify local name and/or namespace URI as it sees fit.
  2. Or avoid helper class completely: either by only allowing change of local name (or is namespace URI needed as marker in some cases?), or by having separate calls.
  3. Or make tag processor optional so instead of "no-op" default instance, leave it as null and only call mutation methods if non-null one defined

I guess my question is whether modification of namespace URI is needed or not; if not could simply pass String.
But if it is (apologies I did not read processor implementations which might answer this question), then making value class mutable and passing reused instance would avoid allocation.

I know this may sound like over-optimizing but I hope this can be done since most use cases will probably not configure mutation so allocations are unnecessary overhead.

return _tagProcessor;
}

public void setXmlTagProcessor(XmlTagProcessor _tagProcessor) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless something in setup requires this, let's leave out this mutator: being a new feature should be possible to just use Builder approach. 3.0 (master) will have to remove it otherwise.

/**
* @since 2.14
*/
public void setXmlTagProcessor(XmlTagProcessor tagProcessor) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also remove this mutator; should be passed via XmlFactory (and then avoids requiring the other setter in factory)

@cowtowncoder
Copy link
Member

@mensinda I think I could just merge this and make change I want (wrt mutability of tag info to avoid construction of instances). But I realized there is one practical thing to do first if we haven't done it yet (apologies if we did and I just forgot): I need the CLA. It's here:

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

(there is also alternate Corporate CLA if individual one linked above doesn't work)

and it's a one-time thing (good for all future contributions).
The easiest way is usually to print it, fill, sign & scan/photo, then email to info at fasterxml dot com.
There are other possibilities (if you can't scan, modifying PDF with info + name as signature) too.

I would really like to get this in the first 2.14.0-rc1 if possible!

@mensinda
Copy link
Contributor Author

mensinda commented Sep 6, 2022 via email

@cowtowncoder
Copy link
Member

@mensinda No problem, that's fine & thank you for the quick reply. It's OSS so availability expected to be variable.

Copy link
Contributor Author

@mensinda mensinda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my question is whether modification of namespace URI is needed or not; if not could simply pass String.
But if it is (apologies I did not read processor implementations which might answer this question), then making value class mutable and passing reused instance would avoid allocation.

I, don't see any reason for or against processing the URI. I personally don't have a use case for this (and can think of one, to be honest), but it was easy to include and someone might have a use case for it, so I included it.

Would you be OK with just dropping URI processing for now?

@cowtowncoder
Copy link
Member

@mensinda Ah. If only single copy created per-mapper yeah no need to optimize, ignore that suggestion.

As to other suggestions I guess my thinking was that if new methods were to be added, it'd be easier to have a "base" implementation (empty). But with just 2 methods maybe that's overthinking things. It does not look like this interface was likely to need expansion; and if it does, can use default method implementations for backwards compatibility.

@cowtowncoder
Copy link
Member

I guess my question is whether modification of namespace URI is needed or not; if not could simply pass String.
But if it is (apologies I did not read processor implementations which might answer this question), then making value class mutable and passing reused instance would avoid allocation.

I, don't see any reason for or against processing the URI. I personally don't have a use case for this (and can think of one, to be honest), but it was easy to include and someone might have a use case for it, so I included it.

Would you be OK with just dropping URI processing for now?

I think that actually my preferred choice is to make XmlTag mutable, simple value class.
So caller sets up namespace and local name; calls processor; uses values it finds. Just working around the lack of return Tuples in Java.
I do think it plausible that someone might want to base (part of) processing on namespace URI in future.

@cowtowncoder
Copy link
Member

@mensinda I think that I can easily make the minor change wrt mutability after merging. So given that I think you sent CLA (will check that), I think we are good now.

Thank you for this contribution!

@cowtowncoder cowtowncoder merged commit 1f7d83d into FasterXML:2.14 Sep 12, 2022
@cowtowncoder
Copy link
Member

Merged, will change XmlTagName handling.

I also realized that ReplaceTagProcessor Javadoc needs some changing: it does not really replace only invalid characters but rather more -- all non-ASCII character, for example. That is fine for many users but not for international users. This is fine as long as explanation is accurate.

@cowtowncoder
Copy link
Member

... and I don't think this PR handles attribute names either (should have spent bit more time reading details).
I'll add that.

@cowtowncoder cowtowncoder changed the title Fix generating invalid XML tag names Add mechanism for processing invalid XML names (transforming to valid ones) Sep 12, 2022
cowtowncoder added a commit that referenced this pull request Sep 12, 2022
@mensinda mensinda deleted the tagEscape branch September 12, 2022 18:30
cowtowncoder added a commit that referenced this pull request Sep 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dollars in POJO property names are not escaped on serialization Bad map keys can not be unmarshaled
2 participants