Skip to content
This repository has been archived by the owner on Jan 22, 2019. It is now read-only.

Changed logic in Parser _readHeaderLine #89

Merged
merged 2 commits into from Aug 18, 2015
Merged

Changed logic in Parser _readHeaderLine #89

merged 2 commits into from Aug 18, 2015

Conversation

ghost
Copy link

@ghost ghost commented Aug 17, 2015

Hi,

I would like to propose a sightly different way to process headers during parsing. In some cases, the CSV file consumer does know the columns a file may contain, but not necessarily in the exact order.

In this scenario, the consumer sets the schema but it also ask to process the header, in such way that actual order of the columns in the schema is determined by the header itself.

This patch tweaks sightly the method _readHeaderLine() in such way that the header is always processed, ensuring the proper order of the columns in the schema. Previously, if a schema was defined, the header line was skipped altogether.

If no schema is provided, a default schema is built (as it did previously).

All existing tests run correctly after this change.

Thanks,

Justo.

When the header line is present and the settings ask for it  to be processed, two different options are possible:

a) The schema has been populated.  In this case, build a new  schema where the order matches the *actual* order in which the given CSV file offers its columns; there cases the consumer of the csv file knows about the columns but not necessarily the order in which they are defined.  A further check can be done in this case, by not permitting new columns that are not defined in the schema, by adding a new flag to the schema (say, strict) and reporting an error when a new column is found in the header.

b) The schema has not been populated.  In this case, build a default schema based on the columns found in the header.
@cowtowncoder
Copy link
Member

Very cool. I think this makes sense.

The only thing that I am slightly concerned about is whether there is any possibility that someone might want to force different order; and/or just ignore header line. It is difficult to know if such usage exists, but if it does, this change in 2.7 would break that usage. And it would seem that there also would be no way to revert to old handling if so.

So: would it perhaps make sense to add either a CsvParser.Feature, or boolean property for CsvSchema that would enable merging of existing schema, the way you describe. And if that property was disabled, merging would NOT occur (that is, old 2.6 behavior).
Usually I would suggest defaulting to backwards compatible method, but for this particular case I think benefits of new behavior as default would seem sufficient to do that.
If so, why add feature at all? So that if anyone is relying on old logic, they could just set feature/schema value properly, if it proves to be a problem

Given that it seems possible that feature would not be heavily used, perhaps CsvParser.Feature would make most sense, as that is bit less work than adding code for CsvSchema.

@cowtowncoder
Copy link
Member

One practical thing as well: I am happy to merge the change, but before first contribution, we will need a filled-in Contributor License Agreement. It can be found at:

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

and most developers print it, fill & sign, scan, and email to info at fasterxml dot com.
We just need it to make sure Jackson and its modules can be freely redistributed by others (Eclipse foundation and Linux distributions are quite strict about this).
But once we have CLA we can accept any and all contributions for all modules, so this only needs to be done once.

Thank you once again for your contribution; looking forward to merging it!

@ghost
Copy link
Author

ghost commented Aug 18, 2015

Hi,

Thanks for you comments. I should amend my request and add a few unit tests. To be clear this will be the new functionality:

If ENCODING_FEATURE_USE_HEADER is set to false, no effect (same as before, since the method _readHeaderLine() at CsvParser won't be invoked.

Otherwise,

  • if schema is given and the new reorder flag, ENCODING_FEATURE_REORDER_COLUMNS, is set to true, columns in the schema will be reordered as they appear in the header.
  • if schema is given and the new reorder flag is not set, the header will be skipped, as it used to do.
  • if no schema is given, a new one will be created (as before) using the information in the header, independently of the new reordering flag.

I should send the sign agreement throughout the day.

Thanks!

@cowtowncoder
Copy link
Member

Yes, I think your explanation is correct.

…er is present and processed. Flag is called ENCODING_FEATURE_REORDER_COLUMNS

Added tests to ensure previous functionality was intact and the new flag behaves as expected when set to true.
@ghost
Copy link
Author

ghost commented Aug 18, 2015

Hi,

All done (I believe!) from my end. I've also send the CLA signed as requested. Let me know if code / test reads correctly.

Cheers!!

@cowtowncoder cowtowncoder added this to the 2.7.0 milestone Aug 18, 2015
@cowtowncoder
Copy link
Member

Excellent!

cowtowncoder added a commit that referenced this pull request Aug 18, 2015
Changed logic in Parser _readHeaderLine
@cowtowncoder cowtowncoder merged commit f615d77 into FasterXML:master Aug 18, 2015
@cowtowncoder
Copy link
Member

Oh actually I do have one question: as I mentioned, the safest way is to default to exact settings from before. But in this particular case, it seems that by default allowing reordering might make most sense for users. So I would be ok with default being "yes, do reorder columns". What do you think?

@ghost
Copy link
Author

ghost commented Aug 19, 2015

IMHO, I think it will make sense to have it on as default; if you think about it, the columns should be the same "type", regardless of its position, allowing a better experience whilst parsing... however I am using the low level interface (raw parsers, no mappers or bindings) so I cannot really say if a change of order may affect other parts of the system.

If any, a check that I would add would be to a "strict" mode to the schema: whilst allowing column reorder, it will prevent any column creation and all the columns should be satisfied by the file (in the presence of a header and the schema asking to process it).

@TuomasKiviaho
Copy link

I noticed that elementSeparator is not preserved when reordering is done.

I think that there should in my opinion be mode allows/prevents new columns to be introduced and old columns to be omitted.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants