Changed logic in Parser _readHeaderLine #89
When the header line is present and the settings ask for it to be processed, two different options are possible:
a) The schema has been populated. In this case, build a new schema where the order matches the *actual* order in which the given CSV file offers its columns; this covers cases where the consumer of the CSV file knows about the columns but not necessarily the order in which they are defined. A further check can be done in this case: disallow columns that are not defined in the schema, by adding a new flag to the schema (say, strict) and reporting an error when an unknown column is found in the header.
b) The schema has not been populated. In this case, build a default schema based on the columns found in the header.
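The two cases above can be sketched roughly as follows. This is only an illustration of the idea in plain Java, not the actual Jackson parser code: the class `HeaderReorderSketch`, the method `resolveColumns`, the string-valued column types, and the `"STRING"` default are all hypothetical names made up for the example, and the `strict` parameter stands in for the flag suggested above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the header handling described above; not Jackson's API.
final class HeaderReorderSketch {

    /**
     * @param schemaColumns column name -> type as declared by the consumer (empty if no schema was set)
     * @param headerNames   column names in the order they appear in the header line
     * @param strict        if true and a schema was set, an unknown header column is an error
     * @return columns in header order, with types taken from the schema where known
     */
    static Map<String, String> resolveColumns(Map<String, String> schemaColumns,
                                              List<String> headerNames,
                                              boolean strict) {
        Map<String, String> resolved = new LinkedHashMap<>();
        for (String name : headerNames) {
            String type = schemaColumns.get(name);
            if (type == null) {
                // Case a) with strict checking: the header offers a column the schema does not know.
                if (strict && !schemaColumns.isEmpty()) {
                    throw new IllegalArgumentException("Unknown column in header: " + name);
                }
                // Case b), or a non-strict unknown column: fall back to an assumed default type.
                type = "STRING";
            }
            resolved.put(name, type);
        }
        return resolved;
    }
}
```

With a schema declaring `id` then `name` and a header reading `name,id`, the result lists `name` first while keeping each column's declared type; with an empty schema, every header column simply gets the default type.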
Very cool. I think this makes sense. The only thing that I am slightly concerned about is whether there is any possibility that someone might want to force a different order, and/or just ignore the header line. It is difficult to know if such usage exists, but if it does, this change in 2.7 would break that usage. And it would seem that there also would be no way to revert to the old handling if so. So: would it perhaps make sense to add either a Given that it seems possible that feature would not be heavily used, perhaps
One practical thing as well: I am happy to merge the change, but before a first contribution, we will need a filled-in Contributor License Agreement. It can be found at: https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf and most developers print it, fill & sign, scan, and email to Thank you once again for your contribution; looking forward to merging it!
Hi, Thanks for your comments. I should amend my request and add a few unit tests. To be clear, this will be the new functionality: If Otherwise,
I will send the signed agreement later today. Thanks!
Yes, I think your explanation is correct.
…er is present and processed. The flag is called ENCODING_FEATURE_REORDER_COLUMNS. Added tests to ensure the previous functionality is intact and the new flag behaves as expected when set to true.
Hi, All done (I believe!) from my end. I've also sent the signed CLA as requested. Let me know if the code / tests read correctly. Cheers!!
Excellent!
Oh actually I do have one question: as I mentioned, the safest way is to default to the exact behavior from before. But in this particular case, it seems that allowing reordering by default might make the most sense for users. So I would be ok with the default being "yes, do reorder columns". What do you think?
IMHO, it makes sense to have it on by default; if you think about it, a column should have the same "type" regardless of its position, which allows a better experience whilst parsing. However, I am using the low-level interface (raw parsers, no mappers or bindings), so I cannot really say whether a change of order may affect other parts of the system. If anything, the check I would add is a "strict" mode on the schema: whilst allowing column reordering, it would prevent any column creation, and all the schema's columns would have to be satisfied by the file (in the presence of a header and with the schema asking to process it).
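As a rough illustration of what such a "strict" check could look like, here is a small plain-Java sketch. It is not part of the patch or of Jackson: `StrictHeaderCheck` and `violations` are hypothetical names, and columns are reduced to their names for brevity. Order is deliberately ignored, so a reordered header passes, while extra or missing columns are reported.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical strict-mode validation; names are illustrative, not Jackson's API.
final class StrictHeaderCheck {

    /** Returns one message per violation; an empty list means the header satisfies the schema. */
    static List<String> violations(List<String> schemaNames, List<String> headerNames) {
        Set<String> declared = new LinkedHashSet<>(schemaNames);
        Set<String> found = new LinkedHashSet<>(headerNames);
        List<String> problems = new ArrayList<>();
        // No column creation: every header column must already be declared in the schema.
        for (String name : found) {
            if (!declared.contains(name)) {
                problems.add("unexpected column: " + name);
            }
        }
        // All schema columns must be satisfied by the file.
        for (String name : declared) {
            if (!found.contains(name)) {
                problems.add("missing column: " + name);
            }
        }
        return problems;
    }
}
```

Collecting messages rather than throwing on the first problem is just one possible design; a parser could equally fail fast on the first unexpected column.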
I noticed that I think that there should be a mode that allows/prevents new columns from being introduced and old columns from being omitted.
Hi,
I would like to propose a slightly different way to process headers during parsing. In some cases, the CSV file consumer knows which columns a file may contain, but not necessarily their exact order.
In this scenario, the consumer sets the schema but also asks for the header to be processed, in such a way that the actual order of the columns in the schema is determined by the header itself.
This patch slightly tweaks the method _readHeaderLine() so that the header is always processed, ensuring the proper order of the columns in the schema. Previously, if a schema was defined, the header line was skipped altogether.
If no schema is provided, a default schema is built (as before).
All existing tests run correctly after this change.
Thanks,
Justo.