Skip to content
This repository has been archived by the owner on Jan 22, 2019. It is now read-only.

Inconsistent quoting of header columns #26

Closed
benson-basis opened this issue Nov 17, 2013 · 6 comments
Closed

Inconsistent quoting of header columns #26

benson-basis opened this issue Nov 17, 2013 · 6 comments
Milestone

Comments

@benson-basis
Copy link

I instructed the writer to write the schema as a header line. Why are two columns quoted and the other three not? This is with 2.1.1.

threads,"fundamentalMeter-count","fundamentalMeter-meanRate",opTimer-count,opTimer-meanRate
1,100000,9788.112806825524,100000,9786.141492731049
2,300000,14731.272857104346,300000,14736.13115156725
@cowtowncoder
Copy link
Member

Code uses simple heuristics to determine whether quoting might be needed, and conservatively quotes Strings it thinks may require quoting.
But my first guess (that hyphen triggers it) seems unlikely.

I am open to improvement ideas, if this is a problem; for example forcing quoting for all header names. But streaming nature of API makes it difficult to add limitations like "all-or-nothing".

@whitlockjc
Copy link

I'm noticing this as well but it doesn't seem specific to headers. I see some headers and some values that from a content perspective, would not require quotes around them but they have quotes around them. The resulting output is ugly unnecessarily. I'll look around and see if there is anything I can help with.

@dmitry87
Copy link

date,fcts_transaction_id,fcts_pos_fx_rate_id,fcts_pos_fx_rate,"fcts_provider_fx_effective_rate","fcts_pos_fx_rate_noteq_fcts_provider_fx_effective_rate"
"2014-05-04 15:05:05","c7c9c0af-71b2-4190-9d2f-5e209bcb8c14",161770,1.4102062742056063,1.4100802082316688,1
"2014-05-04 15:05:09","ba7de6ae-a873-41b7-b56b-6e6fda84259e",161342,1.0679709284857943,1.0624477989835623,1
"2014-05-04 15:05:11","80c22e3c-3034-4461-88f5-510e061950ea",161770,1.4102062742056063,1.4100802082316688,1

Nothing to escape here but...
And it is impossible to overwrite because of protected final method. If field is longer than 20 symbols it will always be quoted.... :(

 protected final boolean _mayNeedQuotes(String value, int length)
    {
        // 21-Mar-2014, tatu: If quoting disabled, don't quote
        if (_cfgQuoteCharacter < 0) {
            return false;
        }
        // let's not bother checking long Strings, just quote already:
        if (length > MAX_QUOTE_CHECK) {
            return true;
        }
        for (int i = 0; i < length; ++i) {
            if (value.charAt(i) < _cfgMinSafeChar) {
                return true;
            }
        }
        return false;
    }

You can take a look at beanIO library for parsing files. It's slower than jackson but with valid escaping and quotes.

@cowtowncoder
Copy link
Member

I can easily remove final from there (as a design principle, it's one of two choices, white vs blacklisting).

But bigger question is this: should there be different logic for handling header fields? As I have commented on multiple entries, choice of whether to quote or not can be done in two ways:

  1. By doubling up the cost of operation, and checking first if quoting is needed, then doing it if need be, or
  2. Heuristically checking only short(er) Strings, and erring in side of caution for longer ones

Jackson does (2). I am perfectly fine in changing this, but want to do it in a way that tries to balance overhead with benefits. To me, for example, mix of quoted and unquoted works just fine. For others less so.

What I do not like is getting complaints of how things suck, without suggestions of what could be done, and proposing improved handling.

On that note: how about:

  1. I will see if handling of header line (only) could be changed to either:
    • Always check the whole value, regardless of length -- names of columns should not be too long anyway, and it is just one row
    • OR, always check all values first, and either quote them all (if at least one requires it), or quote none
  2. Add a feature to force checking of the whole value for content rows (default: false); either as
    • CvsGenerator.Feature
    • OR CvsSchema value

How does this sound?

@dmitry87
Copy link

Thank you for your great Work!

Handling of quotes must be the same everywhere I think(both headers and data). I've pasted a sample of text and quoting is inconsistent.

Sorry, have never worked with GitHub.

To my mind

1. always check all values first, and either quote them all (if at least one requires it), or quote none
2. Add a feature to force checking of the whole value for content rows (default: false); (Feature or CvsSchema - the simpliest one is better)

is better and the final decision is up to you. Here is the implementation for escaping from beanIO library.

 /**
     * Returns <tt>true</tt> if the given field must be quoted.
     * @param cs the field to test
     * @return <tt>true</tt> if the given field must be quoted
     */
    private boolean mustQuote(char [] cs) {
        for (char c : cs) {
            if (c == delim)
                return true;
            if (c == quote)
                return true;
            if (c == '\n')
                return true;
            if (c == '\r')
                return true;
        }
        return false;
    }

cowtowncoder added a commit that referenced this issue May 21, 2014
@cowtowncoder cowtowncoder added this to the 2.4. milestone May 21, 2014
@cowtowncoder
Copy link
Member

Added CsvGenerator.Feature.STRICT_CHECK_FOR_QUOTING, enabling of which will make checks strict and optimal (but with some more overhead for long Strings -- although unlikely to be an issue for most users). Default remains false at least for now, to keep behavior identical to earlier versions.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants