On Sentence Splitting - robbiemu/aclarai GitHub Wiki
✅ Sentence Splitting Strategy for Claimify Input
🎯 Goal: Break Tier 1 blocks into semantically coherent, complete-enough chunks for the Claimify pipeline, with every chunk going through Selection.
🎯 Your Intent for Sentence Splitting
Split Tier 1 blocks into coherent, agent-ready sentence units for Claimify, with these qualities:
- Merge incomplete lead-ins (e.g. `...I get:`) with the next line
- Keep quoted diagnostics or phrases as standalone chunks if meaningful
- Avoid over-splitting around code or newline boundaries
- Language-aware, but not rigid: we want utility, not strict grammar
- Output should be ready for Selection → Disambiguation → Decomposition
✅ Top Sentence Splitting Options
| Option | Summary | Fit for Your Use Case |
|---|---|---|
| spaCy `doc.sents` | Rule-based sentence boundary detection (for known languages) | 🟡 Decent, needs patching for fragments |
| NLTK `PunktSentenceTokenizer` | Unsupervised, trained on language samples | 🟡 Over-splits code, too formal |
| LlamaIndex `SentenceSplitter` | Chunking utility w/ token count + overlap + line breaks | 🟢 Strong candidate, customizable |
| LangChain `TextSplitter` | Similar to LlamaIndex, token-aware + overlap support | 🟡 Needs manual alignment with IDs |
| Custom rule-based splitter | Build your own with regex, indents, punctuation, colons | 🟢 Best control, most effort |
| Stanza / syntactic tools | Tree-aware, language-specific | 🔴 Overkill for this preprocessing task |
Absolutely, you're right to emphasize this: we are not discarding content. Every chunk should still go through the Claimify pipeline and either:
- Produce one or more `(:Claim)` nodes, or
- Get stored as a `(:Sentence)` node (with `claimified: false`)

So let's revise the approach with that in mind.
🔥 Recommended: Hybrid Sentence Splitter Strategy
🧠 Base Layer:
- Use `LlamaIndex.SentenceSplitter` with:
  - `chunk_size=300`
  - `chunk_overlap=30`
  - `keep_separator=True`
This gives:
- Token-aware splitting
- Language-agnostic sentence boundaries
- Works well on Markdown blocks, even informal ones
🧩 Post-processing Rules (Optional Enhancements)
These improve semantic coherence, especially for agentic processing:
- Merge colon-ended lead-ins
  - If a sentence ends in `:` and the next one starts with lowercase or a quote/code token → merge
  - e.g. `in the else block I get:` + `The argument of type...`
- Short prefix merger
  - If a sentence is < 5 tokens and is followed by something more complete → merge forward
  - e.g. `Example:` + `"The model failed"`
- No discards
  - Code fragments, single symbols, or diagnostics are retained
  - They are passed to Claimify like any other sentence
  - If no claim is found → they are stored as `(:Sentence)` nodes only
- Linebreak preservation
  - Keep newline structure intact to help with aligning back to the original Markdown if needed
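The two merge rules above can be sketched as a small post-processing pass in plain Python. The function name `merge_chunks` and the 5-token threshold are illustrative assumptions, not part of any library:

```python
import re


def merge_chunks(chunks: list[str]) -> list[str]:
    """Merge colon-ended lead-ins and short prefixes into the following chunk.

    Illustrative sketch; the patterns and thresholds are assumptions.
    """
    merged: list[str] = []
    i = 0
    while i < len(chunks):
        current = chunks[i].rstrip()
        nxt = chunks[i + 1].lstrip() if i + 1 < len(chunks) else None
        should_merge = False
        if nxt:
            # Rule 1: colon-ended lead-in, next chunk starts with
            # lowercase, a quote, or a backtick (code).
            if current.endswith(":") and re.match(r"[a-z\"'`]", nxt):
                should_merge = True
            # Rule 2: short prefix (< 5 tokens) followed by a fuller chunk.
            elif len(current.split()) < 5 and len(nxt.split()) >= 5:
                should_merge = True
        if should_merge:
            merged.append(current + " " + nxt)
            i += 2  # consume both chunks
        else:
            merged.append(current)
            i += 1
    return merged
```

For example, `merge_chunks(["in the else block I get:", "the argument of type 'str' is not assignable"])` yields a single merged chunk, while two already-complete sentences pass through untouched.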
✅ Result
Each Tier 1 block produces a list of chunked sentence inputs, where:
- Each chunk is grammatically plausible (enough)
- Each chunk is guaranteed to:
  - Get an `aclarai:id`
  - Be sent to the Claimify pipeline
  - Be recorded as either a `(:Claim)` or `(:Sentence)` node
🧾 Output Format
Each output chunk includes:
- `text`: the chunked input for Claimify
- `aclarai_block_id`: ID of the parent Tier 1 block
- `chunk_index`: ordinal within the block
- (optional) `offset_start`, `offset_end` if you track line or char spans
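A chunk record matching this format could be modeled as a small dataclass. The class name `ChunkInput` and the sample values are made up for illustration:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChunkInput:
    """One Claimify input chunk; fields mirror the output format above."""

    text: str               # the chunked input for Claimify
    aclarai_block_id: str   # ID of the parent Tier 1 block
    chunk_index: int        # ordinal within the block
    offset_start: Optional[int] = None  # optional char/line span start
    offset_end: Optional[int] = None    # optional char/line span end


# Hypothetical example record:
chunk = ChunkInput(
    text="in the else block I get: ...",
    aclarai_block_id="blk_example",
    chunk_index=0,
)
```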
📊 Summary
| Stage | Tool | Behavior |
|---|---|---|
| Base splitting | LlamaIndex `SentenceSplitter` | Token-aware, chunked output |
| Postprocess merge | Custom wrapper | Handles colons, short prefixes, etc. |
| Result per chunk | Claimify input | One sentence (or merged) → Selection → Disambiguation → Decomposition |
| If no claim | `(:Sentence)` | Preserves traceability and context |