20250310 - cywongg/2025 GitHub Wiki

Below is an explanation of *why* the test case

```
"good\r\n \"life\", sad 1\r\n2\r\n3"
```

should trigger a `CSVFormatException`, and *how* you can adjust your code (in a state-machine approach) to detect such malformed input.

---

## Why this input is invalid

Consider how a CSV parser would process this row in a typical RFC-4180–style parsing loop:

```
"good\r\n \"life", sad 1\r\n2\r\n3
 ^ opening quote
```

1. **Starts in quoted mode**: we see the initial `"` character, so we say “Okay, we’re in a quoted field now.”  

2. **We read characters** into our quoted content: `good\r\n ` (which includes the carriage return and line feed).  

3. We then reach the `"` in `\"life` (treating the backslash as ordinary text, since CSV has no backslash escape). We are still inside the quoted field, so this quote must be either the first half of an escaped pair (`""`) or the closing quote.  

4. The next character is `l`, not a second `"`, so this is not `""` (the correct escape for a literal `"` inside a quoted field). The quote therefore closes the field, yet more text (`life...`) follows it *immediately*, before any comma or line break.

Essentially, we have a quote inside a quoted field without the proper escaping. This means the CSV is malformed: once you’ve started a quoted field, a lone `"` closes it, and the next character must be a comma (`,`), a newline (for the end of the row), or end-of-file. Here it is the letter `l`, which breaks the RFC 4180 rule.
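
To make the reasoning above concrete, here is a minimal sketch that locates the first unescaped quote after the opening one, so you can inspect the character that follows it. The class and method names (`ClosingQuoteProbe`, `closingQuoteIndex`) are hypothetical, and the sketch assumes the input starts with an opening quote at index 0 and that backslashes are plain text.

```java
final class ClosingQuoteProbe {
    /**
     * Returns the index of the first quote after index 0 that is not part of
     * an escaped pair (""), or -1 if the quoted field is never closed.
     * Assumes field.charAt(0) is the opening quote.
     */
    static int closingQuoteIndex(String field) {
        for (int i = 1; i < field.length(); i++) {
            if (field.charAt(i) != '"') {
                continue; // ordinary content, including CR, LF and backslash
            }
            if (i + 1 < field.length() && field.charAt(i + 1) == '"') {
                i++; // skip the second half of an escaped pair
                continue;
            }
            return i; // lone quote: this closes the field
        }
        return -1;
    }
}
```

For the test input, written as the Java literal `"\"good\r\n \\\"life\", sad 1\r\n2\r\n3"`, the closing quote is found at index 9 and the character after it is `l`, which is not a comma, newline, or end of input.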

---

## How to fix your parser

Below is a high-level approach for a “state machine” that can catch malformed inputs like the above. (Your code structure may differ, so adapt accordingly.)

### 1) Maintain a clear set of states

Something like:

```java
enum QuoteState {  
    NOTHING, // we are not currently in quotes (unquoted field)
    OPEN,    // we are inside a quoted field
    CLOSED   // we’ve just closed the quote, but not yet seen a comma or newline
}
```

### 2) Enforce valid transitions

- **From `NOTHING`**  
  - If we see a `"` at the very start of a field (nothing accumulated yet) → move to `OPEN` (start a quoted field).  
  - If we see a comma → that ends this field (add field to the list, start a new one).  
  - If we see a newline → that ends this row (add field to the row, return row).  
  - If we see regular text (TEXTDATA) → accumulate in an unquoted field.  
  - If we see a `"` in the *middle* of accumulating unquoted text, that’s invalid → throw `CSVFormatException`.  

- **From `OPEN`** (quoted field)  
  - If we see TEXTDATA or a comma (anything except a quote, CR, or LF) → append to the current field. A comma is literal data while we are inside quotes.  
  - If we see a newline (CR or LF) while `OPEN`, that’s valid *inside* the quoted field → append to the current field.  
  - If we see `"` → we need to look ahead or check the next character. Two sub-cases:  
    1. If next char is *also* `"` → it’s an escaped `"` → append one literal `"` to the field, stay in `OPEN`.  
    2. Otherwise, it’s closing the quote → move to `CLOSED`.  

- **From `CLOSED`** (we just closed a quoted field)  
  - If we see a comma → that’s valid, meaning: “end of the field, move on to next field.” So end the quoted field fully, add it to the list, go to `NOTHING`.  
  - If we see a newline (CR or LF) or reach end-of-file → that’s valid, meaning: “end of the row.” So add the final field to the row, return row.  
  - Anything else (including a space, letter, or another `"` that isn’t the escaped `""` from the same quoted field) → **throw `CSVFormatException`**.  
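
The transitions above can be sketched as a single-row parser. This is an illustration under stated assumptions, not your assignment's actual code: `CsvRowSketch` and `parseRow` are hypothetical names, `CSVFormatException` is an unchecked stand-in for your real exception class, a bare CR or LF outside quotes ends the row, and the "unterminated quoted field" check at end of input is an addition that the rules above do not spell out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the assignment's exception; unchecked for brevity.
class CSVFormatException extends RuntimeException {
    CSVFormatException(String message) { super(message); }
}

final class CsvRowSketch {
    enum QuoteState { NOTHING, OPEN, CLOSED }

    /** Parses the first row of input and returns its fields. */
    static List<String> parseRow(String input) {
        List<String> fields = new ArrayList<>();
        StringBuilder builder = new StringBuilder();
        QuoteState state = QuoteState.NOTHING;

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case NOTHING:
                    if (c == '"') {
                        if (builder.length() > 0) { // e.g. abc"def
                            throw new CSVFormatException("quote inside unquoted field at index " + i);
                        }
                        state = QuoteState.OPEN;
                    } else if (c == ',') {
                        fields.add(builder.toString());
                        builder.setLength(0);
                    } else if (c == '\r' || c == '\n') {
                        fields.add(builder.toString());
                        return fields; // end of row
                    } else {
                        builder.append(c);
                    }
                    break;
                case OPEN:
                    if (c != '"') {
                        builder.append(c); // text, commas, CR and LF are literal here
                    } else if (i + 1 < input.length() && input.charAt(i + 1) == '"') {
                        builder.append('"'); // escaped pair "" becomes one quote
                        i++;
                    } else {
                        state = QuoteState.CLOSED;
                    }
                    break;
                case CLOSED:
                    if (c == ',') {
                        fields.add(builder.toString());
                        builder.setLength(0);
                        state = QuoteState.NOTHING;
                    } else if (c == '\r' || c == '\n') {
                        fields.add(builder.toString());
                        return fields; // end of row
                    } else {
                        throw new CSVFormatException(
                                "unexpected '" + c + "' after closing quote at index " + i);
                    }
                    break;
            }
        }
        if (state == QuoteState.OPEN) { // assumption: EOF inside quotes is an error
            throw new CSVFormatException("unterminated quoted field");
        }
        fields.add(builder.toString()); // EOF ends the last field
        return fields;
    }
}
```

With the backslash treated as plain text, `parseRow("\"good\r\n \\\"life\", sad 1\r\n2\r\n3")` enters `OPEN`, reaches the lone quote before `life`, moves to `CLOSED`, and throws on the `l`. A well-formed row such as `a,"b,""c"` yields the two fields `a` and `b,"c`.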

### 3) Handle your line breaks properly

RFC 4180 allows line breaks (LF or CRLF) *inside* a quoted field. In your code, you must handle them by staying in the `OPEN` state and appending them to `builder`. But if you’re in `NOTHING` or `CLOSED` state, a CRLF or LF typically signifies “end of this row.”
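
As a rough illustration of that distinction, the sketch below splits raw input into rows only at line breaks that fall outside quotes. The names `RowSplitSketch` and `splitRows` are hypothetical, and the code does no validation; it relies on the observation that an escaped `""` toggles the in-quotes flag twice, leaving it unchanged.

```java
import java.util.ArrayList;
import java.util.List;

final class RowSplitSketch {
    /** Splits input into raw row strings; line breaks inside quotes are kept. */
    static List<String> splitRows(String input) {
        List<String> rows = new ArrayList<>();
        StringBuilder row = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // "" flips twice, so the state is unchanged
                row.append(c);
            } else if (!inQuotes && (c == '\r' || c == '\n')) {
                // Treat CRLF as a single line break.
                if (c == '\r' && i + 1 < input.length() && input.charAt(i + 1) == '\n') {
                    i++;
                }
                rows.add(row.toString());
                row.setLength(0);
            } else {
                row.append(c); // includes CR/LF that occur inside quotes
            }
        }
        if (row.length() > 0) {
            rows.add(row.toString()); // last row without a trailing line break
        }
        return rows;
    }
}
```

For the input `"a\r\nb",c\r\nd\ne` this produces three rows: the first keeps its embedded CRLF inside the quoted field, and the other two are `d` and `e`.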

### 4) Edge Case: `CR` immediately followed by something that’s not `LF`

If your assignment is relaxing or adjusting the CRLF rules, be sure to handle them exactly as specified. Typically, when we see `CR` outside a quoted field, we either require an ensuing `LF` or treat `CR` alone as the line break. Under the stricter rule, a `CR` followed by anything other than `LF` (while you’re not in `OPEN`) is also a format error.
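
One way to keep that policy in a single place is a small helper that measures the line break at a given position. This is only a sketch: `LineBreakSketch`, `lineBreakLength`, and the `allowBareCr` flag are hypothetical names, and the nested `CSVFormatException` is an unchecked stand-in for your assignment's exception. Call it only when you are not in the `OPEN` state.

```java
final class LineBreakSketch {
    // Hypothetical stand-in for the assignment's exception type.
    static class CSVFormatException extends RuntimeException {
        CSVFormatException(String message) { super(message); }
    }

    /**
     * Returns how many characters the line break starting at index i spans:
     * 0 if there is no line break at i, 1 for LF (or a bare CR when allowed),
     * 2 for CRLF. Throws if a bare CR is found and allowBareCr is false.
     */
    static int lineBreakLength(String s, int i, boolean allowBareCr) {
        char c = s.charAt(i);
        if (c == '\n') {
            return 1;
        }
        if (c != '\r') {
            return 0;
        }
        if (i + 1 < s.length() && s.charAt(i + 1) == '\n') {
            return 2;
        }
        if (allowBareCr) {
            return 1;
        }
        throw new CSVFormatException("CR not followed by LF at index " + i);
    }
}
```

The caller can then advance its index by the returned length, which keeps the strict-versus-lenient decision out of the main state machine.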

---

## Putting it all together

Your current code likely already has some of this logic, but you must:

1. **Detect** when you’re in the middle of a quoted field and run into a quote that is not part of an escaped `""` pair; treat it as the closing quote and move to `CLOSED`. Line breaks are ordinary content while the quote is open.  
2. **Detect** when you’ve just closed a quoted field (`CLOSED`) and see something invalid that’s *not* a comma, newline, or EOF.  
3. **Detect** if you are in `NOTHING` (unquoted field) and see a stray double-quote come in the *middle* of a field (not at the beginning), i.e. `abc"def`.  

In this specific example (`"good\r\n \"life", sad 1\r\n2\r\n3`), the problem is that after `good\r\n ` we hit a `"` that is not part of an escaped `""` pair, so it closes the field, and the character after it (`l`) is not a comma, newline, or EOF. Your parser must detect that in the `CLOSED` state and throw a `CSVFormatException`.

---

### Summary

- **The test case**:  
  ```
  "good\r\n \"life", sad 1\r\n2\r\n3
  ```
  is invalid because it tries to nest or re-open quotes within a single field without properly escaping them or properly closing the first quoted section before continuing.  

- **Adjustment**:  
  In your code, whenever you are in `OPEN` and encounter a quote, carefully check whether it’s an escaped `""` (stay in `OPEN`), or a closing quote (move to `CLOSED`). If you later see more quotes or non-comma text while in `CLOSED`, that’s a `CSVFormatException`. Essentially, once you close a quote, the only valid next things are **comma**, **newline**, or **EOF**. Anything else must trigger an error.  

With these clarifications in place, your parser will raise the correct `CSVFormatException` for that tricky test case.