Meeting Mon 5 May
Agenda
Status
- Refactored all code into a package llm_programs on GH. PRs: https://github.com/swiss-ai/dsl25-8-llm-programs/pulls
- TODOs now tracked on GH: https://github.com/swiss-ai/dsl25-8-llm-programs/issues
PR 1 (severin): Cleaning etc https://github.com/swiss-ai/dsl25-8-llm-programs/pull/1
- DocumentDirectory, Document, DocumentTransform abstractions for easier document directory processing (see the sketch after this list)
- incl. iterating over pages and windows
- Cleaning: deleting Markdown tables results in a 3x average size reduction
- LLM Functions a la v-agent, but supporting local LLMs (for redaction), and more closely inspired by LLM Nodes in http://arxiv.org/abs/2407.14788
- Refactored previously-written redaction & contract-generation programs with LM functions
- RecurrentLM & MapReduce abstractions
PR 2 (severin): Generalized Extraction: map + filter https://github.com/swiss-ai/dsl25-8-llm-programs/pull/25
- Unary, binary, N-ary LM functions are realised via ArgsPrompter, converting positional arguments first into a keyword dictionary, which is applied to the template via format_map (see the sketch below).
- Predicate LM functions are realised via "PredicateParse", that is, parsers that return a boolean given the LM response.
- The generalized ExtractionProgram is a composition of mapping and filtering
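
A hedged sketch of the mechanism described above; the actual ArgsPrompter / PredicateParse implementations in the PR may differ, and the helper names below are illustrative only:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ArgsPrompter:
    """Turn positional arguments into a keyword dict and fill a prompt template."""
    template: str           # e.g. "Does the clause below mention {topic}?\n\n{clause}"
    arg_names: list[str]    # positional -> keyword mapping, e.g. ["topic", "clause"]

    def __call__(self, *args: str) -> str:
        kwargs = dict(zip(self.arg_names, args))
        return self.template.format_map(kwargs)


def predicate_parse(response: str) -> bool:
    """PredicateParse-style parser: map an LM response onto a boolean."""
    return response.strip().lower().startswith("yes")


def lm_predicate(llm: Callable[[str], str], prompter: ArgsPrompter) -> Callable[..., bool]:
    """Compose prompter -> LM call -> boolean parser into a predicate LM function."""
    return lambda *args: predicate_parse(llm(prompter(*args)))


def extraction_program(items, map_fn, keep_fn):
    """Generalized extraction as a composition of mapping and filtering."""
    return [map_fn(x) for x in items if keep_fn(x)]
```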
PR 3 (severin): Redaction eval
- Generated a DocumentDirectory of 10 synthetic documents (ground-truth keywords -> description -> full contract)
- Evaluated naive precision & recall, and "Redaction" precision & recall (see the metric sketch after the results table)
- Results (Gemma 3 27B, window size 10K characters):

| Method | Avg Naive Precision | Avg Naive Recall | Avg Redaction Precision | Avg Redaction Recall |
|---|---|---|---|---|
| 1 call, 1 prompt | 0.61 | 0.78 | 0.66 | 0.82 |
| 1 call, M prompts (specialized prompt per information type) | 0.83 | 0.95 | 0.84 | 0.96 |
| N calls, M prompts, OR | 0.83 | 0.95 | 0.84 | 0.96 |
| N calls, M prompts, AND | 0.86 | 0.95 | 0.86 | 0.96 |
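
The exact metric definitions are not spelled out above, so the sketch below states its assumptions explicitly: "naive" precision & recall are computed over exact matches of extracted strings against ground-truth keywords, the "redaction" variant credits a ground-truth string whenever it no longer appears in the redacted output, and OR/AND combine the keyword sets from N calls by union/intersection:

```python
def precision_recall(predicted: set[str], truth: set[str]) -> tuple[float, float]:
    """Naive set-based precision & recall over extracted sensitive strings."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


def redaction_recall(redacted_text: str, truth: set[str]) -> float:
    """'Redaction'-style recall (assumed definition): a ground-truth string counts
    as caught if it no longer appears in the redacted document, regardless of
    which extracted keyword removed it."""
    caught = sum(1 for t in truth if t not in redacted_text)
    return caught / len(truth) if truth else 0.0


def aggregate_calls(call_outputs: list[set[str]], mode: str = "OR") -> set[str]:
    """Combine keyword sets from N independent calls:
    OR = union of all predictions, AND = intersection of all predictions."""
    if not call_outputs:
        return set()
    if mode == "OR":
        return set.union(*call_outputs)
    return set.intersection(*call_outputs)
```

Under these assumptions, union across calls can only add predictions (recall can only go up, precision may drop), while intersection keeps only predictions all calls agree on (typically favouring precision), which is consistent with the direction of the OR/AND numbers above.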
- PR 4 (youssef): Exploratory work for classifying synthetic legal clauses as AI-related or not, using both basic and orchestrated prompting strategies via Claude API
- Loaded synthetic legal clause data from clauses.json, which includes id, text, references, and label fields.
- Visualized clause interdependencies as a directed graph using NetworkX to support analysis of clause-referencing behavior.
- Applied two prompt-based classification strategies:
  - A simple approach using direct LLM calls on clause text.
  - A multi-agent orchestrator-worker pipeline inspired by Anthropic's agent design principles. (link)
- The simple approach performed best. Referenced clauses were then incorporated into the prompt inputs to improve classification accuracy in cases where relevance was indirect (sketched below).
- Some misclassifications occurred when AI relevance was implied only through references, which motivated including referenced clause content in the prompts.
Next steps
- Ensure 100% on institution names and the most sensitive (identifying) info
  - Use regex for emails and IP addresses (see the sketch below), and a classical method for names?
- Show evaluations to a lawyer?
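
A possible shape for the regex idea (illustrative patterns only, not tuned for production redaction):

```python
import re

# Conservative sketch patterns; real redaction would need more careful rules
# (e.g. IPv6, obfuscated emails) and validation of each match.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact_deterministic(text: str) -> str:
    """Replace emails and IPv4 addresses with placeholder tags before any LM pass."""
    text = EMAIL.sub("[EMAIL]", text)
    return IPV4.sub("[IP]", text)

print(redact_deterministic("Contact a.b@ethz.ch from 192.168.0.1"))
# -> "Contact [EMAIL] from [IP]"
```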
Discussion points
- Discussed status so far
- Timeline:
- Four weeks left; we should have a result for ETH before then so we can get some feedback
- Out-there-ideas
Actions
- Continue on redaction (convert all docs with the M-prompts method) & paragraph retrieval
- Give more thought to LMPrograms-writing-LMPrograms, and other theoretical conceptions