Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Avro] Add logicalType support for some java.time types; add AvroJavaTimeModule for native ser/deser #283

Conversation

MichalFoksa
Copy link
Contributor

@MichalFoksa MichalFoksa commented Jun 8, 2021

Functionality to support Avro schema with logicaltype for few java.time types. It is achieved by new DateTimeVisitor, new module AvroJavaTimeModule and couple of serializers and deserializers.

AvroJavaTimeModule works in milliseconds precision.

AvroMapper.builder()
    .addModule(new AvroJavaTimeModule())
    .build();    

Supported java types:

  • java.time.OffsetDateTime
  • java.time.ZonedDateTime
  • java.time.Instant
  • java.time.LocalDate
  • java.time.LocalTime
  • java.time.LocalDateTime

Risk

I think risk comes for backwards compatibility.

In 2.11 these Java types were somehow supported, they have got serialized either into array. If JSR310 JavaTimeModule with some combination of WRITE_DATE_TIMESTAMPS_AS_NANOSECONDS it coudd have been serialize also into string or long.

Since 2.12 an exception is throw on serialization of any of these types on out of the box configuration, AvroMapper.builder().build(). (I did not investigate further what might be idea behind this new behavior.)

Exception after OffsetDateTime serialization in 2.12:

InvalidDefinitionException: "Any" type (usually for `java.lang.Object`) not supported: `expectAnyFormat` 
called with type [simple type, class java.time.OffsetDateTime]`

DateTimeVisitor cannot be turned off. Anyone already using JSR310 JavaTimeModule and disabling WRITE_DATE_TIMESTAMPS_AS_NANOSECONDS will suddenly receive logicalType instead of java-class in Avro schema.

java.time.OffsetDateTime schema before PR:

:
{
  "type" : "long",
  "java-class" : "java.time.OffsetDateTime"
}

java.time.OffsetDateTime schema after PR:

{
  "type" : "long",
  "logicalType" : "timestamp-millis"
}

It could be a problem for someone who already dealt with java.time types and found this workaround, but that somebody can adapt to in my opinion more Avro compatible solution.

Open points

Precision

Avro supports milliseconds and microseconds previsions for date and time related logicalType(s). I have implemented milliseconds precision only (based on my needs).

It is easy to change PR to microseconds prevision only. To support both previsions I think a new Feature switch have to be created - also not a problem.

java.util.Date

I have decided to remove java.util.Date from supported types due to potential backwards compatibility issues. At the moment schema for java.util.Date looks like this:

{
  "type" : "long",
  "java-class" : "java.util.Date"
}

After it would look like:

{
  "type" : "long",
  "logicalType" : "timestamp-millis"
}

I feel like it could cause issue to more people than issue with java.time types.

@MichalFoksa MichalFoksa force-pushed the feature/2.13/avro/logicalType_support_for_date_time_types branch from b29c981 to 1bd02ba Compare June 8, 2021 09:54
@MichalFoksa MichalFoksa marked this pull request as ready for review June 8, 2021 09:55
@cowtowncoder
Copy link
Member

Looks good; I added some suggestions.

Couple of things I would need:

  1. Addition(s) to avro/README.md to explain usage would be good
  2. Unless I have gotten CLA (apologies if I missed that), I'd need one (see https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf) -- the usual way is to print, fill & sign, scan/photo, email to info at fasterxml dot com

and with that I would be able to merge this PR to add neat new functionality in for 2.13!

Also I have one question: do you think there is any likelihood that some usage might be broken due to addition of date/time type detection? If there was, adding a configurability option for AvroMapper would make sense.

… anything, expected pattern is to let exception pass through if any.
@MichalFoksa
Copy link
Contributor Author

@cowtowncoder Thank you for your time. I am sorry for later response, I had a short time off.

  • avro/README.md is updated

  • Contributor License Agreement is submitted (since today).

  • Since 2.12 java.time types cannot be serialized anyway. I believe it will be more helpful then not.

@MichalFoksa
Copy link
Contributor Author

@cowtowncoder

  • How to submit additional commits to PR?
  • Is commit modification (branch rebase / commit squash) OK?
  • Do you squash branch at merge?

@cowtowncoder
Copy link
Member

First of all: absolutely no problem if things take time -- I take my time and this is all voluntary! Great to get contributions how much trouble it can be to get submission "just right" so I appreciate your work here.

Additional PRs: I think one possibility is to just close existing one(s), push new? Could also be that you can push to your original forked branch (I actually do not know for sure).
Commit modification fine from my end, whatever github allows.

I do try to squash merges where applicable, but I am not the tidiest git committer myself.

I did receive CLA, will check that in.

@MichalFoksa MichalFoksa force-pushed the feature/2.13/avro/logicalType_support_for_date_time_types branch from 59f82a3 to ba76375 Compare June 15, 2021 06:56
@cowtowncoder
Copy link
Member

Looks good; hoping to review again today or tomorrow, get merged.

avro/README.md Outdated Show resolved Hide resolved
}

@Test
public void testSchemaCreation() throws JsonMappingException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One quick question here: since changes to schema visitor are applied regardless of module (I think?), would there be problems/changes to logic for existing users? That is, is there any possible backwards-compatibility concerns wrt creating more granular Avro schema types?
If there are, it might make sense to add configurability to allow disabling use of date/time types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will think about it.

Hmmm, is there a way how to control such switch from a module ?
I guess not.

Copy link
Member

@cowtowncoder cowtowncoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, looks quite good; some minor issues wrt module metadata (added notes on how to change) but nothing major.

One question I have however is wrt definition/use of Local date/time types: this is a perennial challenge as users are VERY confused about what "local" means, assuming it means "their timezone" (for example). My understanding is that these are rather "abstract" date/time/datetime values which do not have concrete time point without addition time zone or offset.

But to serialize such values numerically, one either needs to use more complex notation (separate fields) OR bind them to a concrete timezone/offset. In latter case it's common to just use UTC, and I think this is what is done here?
That is fine in itself, esp. if that's what typical Avro usage (with apache lib, frameworks) is.
I am only concerned about documentation aspects, explaining how it works.

So... thought it's good to be explicit on exact behavior, and trying to make things as straight-forward as possible (which is hard; this is Date/Time handling which as a domain is complicated).

Anyway: just wanted to ask about this before merging. It's always possible add improvements, changes (doc and otherwise) but thought I'll mention these before merging just in case.

`com.fasterxml.jackson.dataformat.avro.PackageVersion`
instead of
'com.fasterxml.jackson.core.json.PackageVersion'

With correct parent constructor call version(), getModuleName() and setupModule() can be removed.
@MichalFoksa
Copy link
Contributor Author

My understanding is that these are rather "abstract" date/time/datetime values which do not have concrete time point without addition time zone or offset.

This is correct and now I understand how mention of "local" might be confusing.
So in essence I will replace mentions of "local" with some "without timezone" meaning in documentation.
BTW: According to Avro specification logicalType for datetime without timezone is local-timestamp-millis.

Next I will think of a switch to enable/disable logicalType functionality.

Copy link
Member

@dinomite dinomite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caveat that I've worked a fair bit with Java time types, but very little with Avro.

avro/README.md Outdated Show resolved Hide resolved
avro/README.md Outdated
Comment on lines 124 to 125
Please note that time zone information is at serialization. Serialized values represent point in time,
independent of a particular time zone or calendar. Upon reading a value back time instant is reconstructed but not the original time zone.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of OffsetDateTime and ZonedDateTime, aren't they deserialized using whatever defaultZoneId is passed to AvroInstantDeserializer.fromLong()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is correct.
They are deserialized using whatever `defaultZoneId1 is passed to AvroInstantDeserializer.fromLong()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be clearer to say something like, "Note that time zone & offset information is not serialized—the serialized representation is only a point in time. For local time types (OffsetDateTime and ZonedDateTime) the time zone defaultZoneId in the Avro context is used for converting to & from the serialized representation. The same defaultZoneId must be used at serialization & deserialization time to obtain meaningful results."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having written that, the behavior feels weird to me. Would it be possible to store the offset/zoneId for the OffsetDateTime and ZonedDateTime types?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, zoned things absolutely should retain time zone and/or offset, and changing that to something else feels very much Wrong.

For local variants it may be necessary to do interim binding if (but only if) representation uses a fixed timepoint (like long for "milliseconds since 1. 1. 1970") -- but if so, it should be something really fixed like UTC and not whatever user happens to configure (because "default" zone id would otherwise have to match on writer and reader).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cowtowncoder

To me, zoned things absolutely should retain time zone and/or offset, and changing that to something else feels very much Wrong.

Avro specification does not aim to preserve time zone for non local-XYZ logical types.

For correct deserialization into OffsetDateTime or ZonedDateTime reader and writer have to agree on time zone.

Support of OffsetDateTime and ZonedDateTime is here for convenience - I can drop it.

For local variants it may be necessary to do interim binding if (but only if) representation uses a fixed timepoint (like long for "milliseconds since 1. 1. 1970") -- but if so, it should be something really fixed like UTC and not whatever user happens to configure (because "default" zone id would otherwise have to match on writer and reader).

For local variants, contextual time zone is not used at all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichalFoksa Ok, I think I better go over the changes once again & try to find what Avro specification says. Although handling of local/zoned types has expected semantics in Java 8, I vaguely recall Avro proscribing behvavior that seemed to differ... and I think for interoperability the letter of Avro spec should usually have precedence (even if I disagreed with how it was defined).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even after reading and re-reading what Avro spec says about timestamp, regular and local, I have no idea what are those supposed to mean -- it seems nonsense to be blunt.

Not the part about physical storage itself (although why on earth are there separate milli- vs micro-second types?) but ... well, if NEITHER stores timezone information NOR is there ANY WAY to sync send/receiver zones, then... there seems to be no actual reason for 2 types. At all. I mean, timestamp in this sense can NEITHER be local (it is concrete physical time offset) NOR non-local (no time offset or time zone!). It is much like java.util.Date yet named in a confusing way, using 4 different but rather non-distinct types.
What a mess!

But to try to untangle the mess I guess there is only the one question of how would read and write operations handle these differently.

On writing side there probably cannot be any difference: physical timestamp is what it is. Whatever "local" timezone could be thought to be does not matter; change of zone/offset would not change that value.

On reading side timezone/offset is sort of arbitrary as well: value itself is concrete, although we can use whatever zone we might want.
I guess use of "context timezone" could make sense for "local" variant? For non-local I don't have strong opinion: either UTC or context timezone would be fine as far as I see it.

@MichalFoksa WDYT? Apologies for this taking long -- but I have some time now and will get this merged during this week :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess use of "context timezone" could make sense for "local" variant?

"local" variants do not contain time zone.

For non-local I don't have strong opinion: either UTC or context timezone would be fine as far as I see it

I would leave it on user.

Yeah, Avro is Avro ... :) But it is not bad.

MichalFoksa and others added 2 commits June 26, 2021 14:21
Co-authored-by: Drew Stephens <drew@dinomite.net>
Co-authored-by: Drew Stephens <drew@dinomite.net>
@MichalFoksa MichalFoksa force-pushed the feature/2.13/avro/logicalType_support_for_date_time_types branch from 02ed97f to 707f3a4 Compare June 27, 2021 06:00
avro/README.md Outdated
```

#### Note
Please note that time zone information is at serialization. Serialized values represent point in time,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should "is at serialization" instead be "is not included at serialization" (or something like that)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right,
I meant "is lost at serialization"

@cowtowncoder
Copy link
Member

@MichalFoksa I think I am mostly happy here; added minor notes on things to mention in README but I think I agree wrt ser/deser aspects given how Avro spec defines things (which is bit confusing wrt "local"/"non-local" timestamp but whatever).

The one last thing I noticed was your mention of allowing disabling of generation of logical type, as an option.
If you want to do that, I could first merge this PR, and let you work on that separately?
There are couple of ways to do that, f.ex:

  1. Add AvroGenerator.Feature(s) (since schema generation is done on generation side, for Jackson 2.x at least)
  2. Add new AvroMapper.Feature (since it's bit of cross-cutting concern)
  3. Configure via AvroModule, instead of mapper -- ideally adding Builder (similar to how AvroMapper etc are configured)

@MichalFoksa MichalFoksa force-pushed the feature/2.13/avro/logicalType_support_for_date_time_types branch from db256e8 to a05b4ab Compare June 29, 2021 06:00
Copy link
Contributor Author

@MichalFoksa MichalFoksa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cowtowncoder
Here you are requested improvements.

Rather worry that sorry - I am really happy you review this and give feedback.

allowing disabling of generation of logical type, as an option

I want to do it. Doing it in another PR would be best.

avro/README.md Outdated
```

#### Note
Please note that time zone information is at serialization. Serialized values represent point in time,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right,
I meant "is lost at serialization"

avro/README.md Outdated
Comment on lines 124 to 125
Please note that time zone information is at serialization. Serialized values represent point in time,
independent of a particular time zone or calendar. Upon reading a value back time instant is reconstructed but not the original time zone.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess use of "context timezone" could make sense for "local" variant?

"local" variants do not contain time zone.

For non-local I don't have strong opinion: either UTC or context timezone would be fine as far as I see it

I would leave it on user.

Yeah, Avro is Avro ... :) But it is not bad.

avro/README.md Show resolved Hide resolved
@cowtowncoder cowtowncoder added this to the 2.13.0 milestone Jun 29, 2021
@cowtowncoder cowtowncoder changed the title [Avro] logicalType support for few java.time types. [Avro] logicalType support for some java.time types. Jun 29, 2021
@cowtowncoder cowtowncoder merged commit c4ed624 into FasterXML:2.13 Jun 29, 2021
@cowtowncoder cowtowncoder changed the title [Avro] logicalType support for some java.time types. [Avro] Add logicalType support for some java.time types; add AvroJavaTimeModule for native ser/deser Jun 29, 2021
cowtowncoder added a commit that referenced this pull request Jun 29, 2021
@cowtowncoder
Copy link
Member

Finally merged! Thank you @MichalFoksa and @dinomite -- will go in 2.13, although there can still be improvements.
I also realized that BigDecimal part of logical types is missing but nice to get in.

@MichalFoksa
Copy link
Contributor Author

@cowtowncoder
Great :) Thank you.

@MichalFoksa MichalFoksa deleted the feature/2.13/avro/logicalType_support_for_date_time_types branch June 30, 2021 08:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants