Inconsistent quoting of header columns #26

benson-basis · 2013-11-17T17:10:04Z

I instructed the writer to write the schema as a header line. Why are two columns quoted and the other three not? This is with 2.1.1.

threads,"fundamentalMeter-count","fundamentalMeter-meanRate",opTimer-count,opTimer-meanRate
1,100000,9788.112806825524,100000,9786.141492731049
2,300000,14731.272857104346,300000,14736.13115156725

The text was updated successfully, but these errors were encountered:

cowtowncoder · 2013-11-18T02:44:40Z

Code uses simple heuristics to determine whether quoting might be needed, and conservatively quotes Strings it thinks may require quoting.
But my first guess (that hyphen triggers it) seems unlikely.

I am open to improvement ideas, if this is a problem; for example forcing quoting for all header names. But streaming nature of API makes it difficult to add limitations like "all-or-nothing".

whitlockjc · 2013-12-11T01:51:29Z

I'm noticing this as well but it doesn't seem specific to headers. I see some headers and some values that from a content perspective, would not require quotes around them but they have quotes around them. The resulting output is ugly unnecessarily. I'll look around and see if there is anything I can help with.

dmitry87 · 2014-05-13T11:18:07Z

date,fcts_transaction_id,fcts_pos_fx_rate_id,fcts_pos_fx_rate,"fcts_provider_fx_effective_rate","fcts_pos_fx_rate_noteq_fcts_provider_fx_effective_rate"
"2014-05-04 15:05:05","c7c9c0af-71b2-4190-9d2f-5e209bcb8c14",161770,1.4102062742056063,1.4100802082316688,1
"2014-05-04 15:05:09","ba7de6ae-a873-41b7-b56b-6e6fda84259e",161342,1.0679709284857943,1.0624477989835623,1
"2014-05-04 15:05:11","80c22e3c-3034-4461-88f5-510e061950ea",161770,1.4102062742056063,1.4100802082316688,1

Nothing to escape here but...
And it is impossible to overwrite because of protected final method. If field is longer than 20 symbols it will always be quoted.... :(

 protected final boolean _mayNeedQuotes(String value, int length)
    {
        // 21-Mar-2014, tatu: If quoting disabled, don't quote
        if (_cfgQuoteCharacter < 0) {
            return false;
        }
        // let's not bother checking long Strings, just quote already:
        if (length > MAX_QUOTE_CHECK) {
            return true;
        }
        for (int i = 0; i < length; ++i) {
            if (value.charAt(i) < _cfgMinSafeChar) {
                return true;
            }
        }
        return false;
    }

You can take a look at beanIO library for parsing files. It's slower than jackson but with valid escaping and quotes.

cowtowncoder · 2014-05-13T18:32:26Z

I can easily remove final from there (as a design principle, it's one of two choices, white vs blacklisting).

But bigger question is this: should there be different logic for handling header fields? As I have commented on multiple entries, choice of whether to quote or not can be done in two ways:

By doubling up the cost of operation, and checking first if quoting is needed, then doing it if need be, or
Heuristically checking only short(er) Strings, and erring in side of caution for longer ones

Jackson does (2). I am perfectly fine in changing this, but want to do it in a way that tries to balance overhead with benefits. To me, for example, mix of quoted and unquoted works just fine. For others less so.

What I do not like is getting complaints of how things suck, without suggestions of what could be done, and proposing improved handling.

On that note: how about:

I will see if handling of header line (only) could be changed to either:
- Always check the whole value, regardless of length -- names of columns should not be too long anyway, and it is just one row
- OR, always check all values first, and either quote them all (if at least one requires it), or quote none
Add a feature to force checking of the whole value for content rows (default: false); either as
- CvsGenerator.Feature
- OR CvsSchema value

How does this sound?

dmitry87 · 2014-05-15T11:50:43Z

Thank you for your great Work!

Handling of quotes must be the same everywhere I think(both headers and data). I've pasted a sample of text and quoting is inconsistent.

Sorry, have never worked with GitHub.

To my mind

1. always check all values first, and either quote them all (if at least one requires it), or quote none
2. Add a feature to force checking of the whole value for content rows (default: false); (Feature or CvsSchema - the simpliest one is better)

is better and the final decision is up to you. Here is the implementation for escaping from beanIO library.

 /**
     * Returns <tt>true</tt> if the given field must be quoted.
     * @param cs the field to test
     * @return <tt>true</tt> if the given field must be quoted
     */
    private boolean mustQuote(char [] cs) {
        for (char c : cs) {
            if (c == delim)
                return true;
            if (c == quote)
                return true;
            if (c == '\n')
                return true;
            if (c == '\r')
                return true;
        }
        return false;
    }

cowtowncoder · 2014-05-21T02:41:44Z

Added CsvGenerator.Feature.STRICT_CHECK_FOR_QUOTING, enabling of which will make checks strict and optimal (but with some more overhead for long Strings -- although unlikely to be an issue for most users). Default remains false at least for now, to keep behavior identical to earlier versions.

…ccount data

dmitry87 mentioned this issue May 7, 2014

Impossible to turn off Alphanumeric Sort for bean properties #40

Closed

cowtowncoder added a commit that referenced this issue May 21, 2014

Implement #26

579de90

cowtowncoder added this to the 2.4. milestone May 21, 2014

cowtowncoder closed this as completed May 21, 2014

pierre referenced this issue in killbill/killbill Nov 2, 2016

util: Fix somewhat random quotes around column names when exporting a…

a0df460

…ccount data

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent quoting of header columns #26

Inconsistent quoting of header columns #26

benson-basis commented Nov 17, 2013

cowtowncoder commented Nov 18, 2013

whitlockjc commented Dec 11, 2013

dmitry87 commented May 13, 2014

cowtowncoder commented May 13, 2014

dmitry87 commented May 15, 2014

cowtowncoder commented May 21, 2014

Inconsistent quoting of header columns #26

Inconsistent quoting of header columns #26

Comments

benson-basis commented Nov 17, 2013

cowtowncoder commented Nov 18, 2013

whitlockjc commented Dec 11, 2013

dmitry87 commented May 13, 2014

cowtowncoder commented May 13, 2014

dmitry87 commented May 15, 2014

cowtowncoder commented May 21, 2014