Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Avro] Add support for @Union and polymorphic types #60

Merged
merged 2 commits into from Mar 21, 2017

Conversation

baharclerode
Copy link
Contributor

@baharclerode baharclerode commented Mar 10, 2017

Alright, another stab at the native type ID stuff to support Generics / unions / polymorphic types. I think I managed to take care of the remaining comments that were left on the original PR, and additionally fixed handling of primitives when deserializing into Object

(Note, this was quickly rebased at 2:20am... make sure I didn't accidentally revert your performance fixes. I think I preserved them...)

@@ -382,6 +390,17 @@ public final void writeStartObject() throws IOException {
}

@Override
public void writeStartObject(Object forValue) throws IOException {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented because we need to know the actual value type to pick the correct branch of a union

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good idea. While method was originally added for slightly different purpose (simply to allow keeping track of "current POJO"), it makes sense, and is great additional use.
The only (?) potential concern is to ensure that this method gets always called; mostly concerned wrt traversal through array/Collection serializers or such.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to maintain the old codepath as well, so if you call writeStartObject() instead of writeStartObject(Object) on a non-union, it shouldn't care.

@cowtowncoder
Copy link
Member

Interesting. I think I'll have to contemplate bit more on this. It's potentially pretty cool, but I just want to convince myself that @Union handling works like this. My main/only concern is just wrt whether

  1. This works for expected cases; considering that Avro allows true union types, whereas Jackson expects subtype relationships (perhaps this is why it only applies to Object?)
  2. Is it ever possible that use of this annotation should NOT imply polymorphism? A common use of making things nullable requires use of union in schema, for example, but generally wouldn't be polymorphic usage.

I do hope to merge this in for 2.9, and I am very excited about all the great work you have done so far. This really makes 2.9 release quite interesting I think.

@baharclerode
Copy link
Contributor Author

The annotation always implies polymorphism, but a union in a schema does not imply polymorphism (as you have pointed out with nullable fields), and the Apache implementation does not enforce this during schema gen even though the deserializer assumes it to be true. This is one of many bugs in the Apache implementation that drove me to switch to Jackson for Avro encoding/decoding :)

With the Apache implementation, the following is legal:

@Union({Bird.class, Cat.class, Integer.class})
public interface Animal {}

the resulting schema is Bird|Cat|Integer, which is blindly substituted anywhere that it encounters an Animal field, and if you create a message that has an Integer for that field (via a GenericRecord, for example), the Apache implementation will happily use sun.misc.Unsafe to shove an Integer into an Animal field when deserializing it back into a POJO. This of course works great right up until you try to access the field and call a method on it, at which point your JVM will crash.

The way I handled this in my implementation was:

  1. Check if Integer is a subtype of _baseType (Animal), and if so, just deserialize.
  2. If not, check if we have a deserializer for _baseType; If so, ignore the native type ID, and use this deserializer instead
  3. If there is no deserializer for the _baseType, then follow the usual procedure for unknown type IDs.

This provided the expected behavior in cases where types match up with Type IDs correctly, and sane behavior when they don't (assume that I'm mapping to a different class that has similar or the same structure).

@cowtowncoder
Copy link
Member

@baharclerode Ok that works, and if need be can be changed as necessary. Scary thing about Avro impl just shoving value... did not know this occurs. :)

@cowtowncoder cowtowncoder merged commit dccd551 into FasterXML:master Mar 21, 2017
@cowtowncoder cowtowncoder added this to the 2.9.0.r milestone Mar 21, 2017
@baharclerode
Copy link
Contributor Author

Alright, cool. I'll get the final PR opened for the @AvroEncode support now that this has been merged 😃

@cowtowncoder
Copy link
Member

cowtowncoder commented Apr 14, 2017

Unfortunately I think I will need to undo much of this PR. The way polymorphic handling is changed here is against the way Jackson handling is designed to be used, and although I understand it would fuilfill your needs I don't think it should be done this way.

I don't know if there is a good way to solve the problem: fundamentally Jackson divides handling into two layers (which are mixed here):

  1. At low-level streaming parser/generator schema object is used for encoding/decoding to/from encoded data into logical token stream (compatible with JSON model). Schema is only used at this level; it could, perhaps, be used to provide type id.
  2. At databinding schema is not available for constructing handlers because schema is handled by parser/generator and because it is not statically available (only based on types being involved, and not actively used parser/generator). As such decision to use (or not) of polymorphic handling must be based solely on annotation and static configuration (as is the case for "default typing")

In case of Avro the challenge is that schema would be needed as extra information for databinding. This patch works around this by always enabling potential polymorphic handling and rerouting handling with this assumption.

So... without any other changes what I think of doing is to enabling polymorphic handling if @Union annotation is used: this seems like reasonable intent. It will not be sufficient for general handling of, say, java.lang.Object types.

This will also result in massive set of test failures, which I am trying to resolve now...

@cowtowncoder
Copy link
Member

One thing that I would accept, however... having a configurable setting (defaulting to disabled) which does allow aggressive polymorphic handling (that is, what this PR adds)?
If so, users can opt in for enhanced handling with overhead that results (surprisingly tho my test case seemed to only get -10% for throughput which is much less than what I expected).

@cowtowncoder
Copy link
Member

@baharclerode Actually... looks like maybe there is a simpler compromise. Looks like enabling equivalent of "default typing", but just for java.lang.Object, will work to the degree unit tests check.
If so, two ways polymorphic handler enabled in Avro context:

  1. Use of @Union with non-empty value types
  2. Declaring property as java.lang.Object

What do you think? I'll go ahead and check this variant in; changes so far have been minor.

I will go ahead and some more profiling; I think this approach should have very little overhead.

@baharclerode
Copy link
Contributor Author

I definitely see where you're coming from, and have been thinking about this since you first brought it up in the custom encoder PR. My thoughts were about just reducing the places where the TypeResolver is used; Specifically, only if the following criteria are met:

  1. No registered deserializer for base type
  2. baseType must either be a) java.lang.Object, b) abstract, or c) have resolved subtypes

With these, we're only installing a type resolver in places that polymorphic type handling is mandatory, and letting the databinder do its normal behavior in the remaining places. The following update to the AvroAnnotationIntrospector I think implements the above requirements, and doesn't fail any of the tests:

    protected TypeResolverBuilder<?> _findTypeResolver(MapperConfig<?> config, Annotated ann, JavaType baseType) {
        // If there's a custom deserializer, then it's responsible for handling any type information
        if (config.getAnnotationIntrospector().findDeserializer(ann) != null) {
            return StdTypeResolverBuilder.noTypeInfoBuilder();
        }
        Collection<NamedType> subtypes = null;
        if (ann instanceof AnnotatedClass) {
            subtypes = config.getSubtypeResolver().collectAndResolveSubtypesByTypeId(config, (AnnotatedClass) ann);
        } else if (ann instanceof AnnotatedMember){
            subtypes = config.getSubtypeResolver().collectAndResolveSubtypesByTypeId(config, (AnnotatedMember) ann, baseType);
        }
        if (!baseType.isAbstract() && (subtypes == null || subtypes.isEmpty())) {
            return StdTypeResolverBuilder.noTypeInfoBuilder();
        }
        TypeResolverBuilder<?> resolver = new AvroTypeResolverBuilder();
        JsonTypeInfo typeInfo = ann.getAnnotation(JsonTypeInfo.class);
        if (typeInfo != null && typeInfo.defaultImpl() != JsonTypeInfo.class) {
            resolver = resolver.defaultImpl(typeInfo.defaultImpl());
        }
        return resolver;
    }

@cowtowncoder
Copy link
Member

My assumption here is that if support for @JsonTypeInfo is needed, it should be achieved by using AnnotationIntrospectorPair, so that JacksonAnnotationIntrospector detects that.
Also note that returning of null is preferable in cases where no handling is expected because the way (de)serializers are done there is bit of additional work if resolver exists, so I think first part can be simplified.

Please let me know if some tests are failing after my latest commit(s): locally everything passes, but some tests seem to be skipped. And obviously I don't understand intent or (more) advanced use cases that well from Avro perspective.

One thing on abstract types: I don't know if that is a valid assumption -- there are a few mechanisms that allow for mapping. For example, List becomes ArrayList (or, possibly something else); JsonDeserialize(as=....) allows mapping to implementation, and Mr Bean module could just create matching bean type.
This is why it seems preferable to just require java.lang.Object or annotation.

@baharclerode
Copy link
Contributor Author

I just ran everything locally myself and they all passed. The 50 skipped tests represent features not supported by the apache implementation (but supported and tested in Jackson), or bugs in the apache implementation-- something to revisit when there's a new version of the apache implementation available, but otherwise skipping the tests is the easiest way to ignore them without making the test bed much more complex. Since your updates also fix the issue encountered in #71 I'll update that PR to just include the new tests (which also pass with your latest changes) so that those can be merged in.

I think with your changes, this still covers my requirements at the moment, since all of my stuff will be annotated for the time being. Thank you!

@baharclerode baharclerode deleted the bah.UnionSerialization branch April 19, 2017 02:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants