Meeting Mon 5 May

Agenda

Status

Refactored all code into a package llm_programs on GH. PRs: https://github.com/swiss-ai/dsl25-8-llm-programs/pulls

TODOs now tracked on GH: https://github.com/swiss-ai/dsl25-8-llm-programs/issues

  • PR 1 (severin): Cleaning etc https://github.com/swiss-ai/dsl25-8-llm-programs/pull/1

    • DocumentDirectory, Document, and DocumentTransform abstractions for easier processing of document directories
    • incl. iterating over pages and windows
    • Cleaning: deleting Markdown tables results in a 3x average size reduction
    • LLM Functions à la v-agent, but supporting local LLMs (for redaction) and inspired more closely by the LLM Nodes of http://arxiv.org/abs/2407.14788
    • Refactored previously-written redaction & contract-generation programs with LM functions
    • RecurrentLM & MapReduce abstractions
  • PR 2 (severin): Generalized Extraction: map + filter https://github.com/swiss-ai/dsl25-8-llm-programs/pull/25

    • Unary, binary, and N-ary LM functions are realised via ArgsPrompter, which first converts positional arguments into a keyword dictionary and then applies it to the prompt template via format_map.
    • Predicate LM functions are realised via "PredicateParse", i.e. parsers that return a boolean given the LM response.
    • The generalized ExtractionProgram is a composition of mapping and filtering (see the sketch after this list)
  • PR 3 (severin): Redaction eval

    • Generate a DocumentDirectory of 10 synthetic documents (ground-truth keywords -> description -> full contract)
    • Evaluate naive Precision & Recall, and "Redaction" Precision & Recall
    • Results (Gemma 3 27B, window size 10K characters):
Method                    | Avg Naive Prec. | Avg Naive Rec. | Avg Redaction Prec. | Avg Redaction Rec.
1 call, 1 prompt          | 0.61            | 0.78           | 0.66                | 0.82
1 call, M prompts         | 0.83            | 0.95           | 0.84                | 0.96
N calls, M prompts, OR    | 0.83            | 0.95           | 0.84                | 0.96
N calls, M prompts, AND   | 0.86            | 0.95           | 0.86                | 0.96
(M prompts = specialized prompts per information type)
  • PR 4 (youssef): Exploratory work on classifying synthetic legal clauses as AI-related or not, using both basic and orchestrated prompting strategies via the Claude API
    • Loaded synthetic legal clause data from clauses.json, which includes id, text, references, and label fields.

    • Visualized clause interdependencies as a directed graph using NetworkX to support analysis of clause referencing behavior.

    • Applied two prompt-based classification strategies:

      • A simple approach using direct LLM calls on clause text.
      • A multi-agent orchestrator-worker pipeline inspired by Anthropic’s agent design principles. (link)
    • The simple approach worked best. I then incorporated referenced clauses into the prompt inputs to improve classification accuracy in cases where relevance was only indirect (see the sketch after this list).

    • Some misclassifications occurred when AI relevance was implied only through references. This informed the move toward including referenced clause content in prompts.
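
A minimal sketch of how the map + filter composition from PR 2 might fit together. The names ArgsPrompter, PredicateParse, and ExtractionProgram come from the notes above, but the constructor signatures, the call_llm helper, and the prompt/parse details are illustrative assumptions, not the package's actual API.

```python
# Hypothetical stand-in for the LLM client used by llm_programs; not the real interface.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a local or hosted LLM here")

class ArgsPrompter:
    """Converts positional arguments into a keyword dict and fills a prompt template."""
    def __init__(self, template: str, arg_names: list[str]):
        self.template = template
        self.arg_names = arg_names

    def prompt(self, *args) -> str:
        kwargs = dict(zip(self.arg_names, args))   # positional args -> keyword dictionary
        return self.template.format_map(kwargs)    # applied to the template via format_map

class PredicateParse:
    """Parses an LM response into a boolean."""
    def parse(self, response: str) -> bool:
        return response.strip().lower().startswith("yes")

def extraction_program(windows, map_prompter, filter_prompter, predicate=PredicateParse()):
    """Sketch of the composition behind the generalized ExtractionProgram: LM map over windows, then LM filter over the mapped results."""
    mapped = [call_llm(map_prompter.prompt(w)) for w in windows]
    return [m for m in mapped
            if predicate.parse(call_llm(filter_prompter.prompt(m)))]
```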

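A sketch of the reference-aware classification step from PR 4, assuming clauses.json is a JSON list of clause objects with the id, text, references, and label fields mentioned above; the edge direction and prompt wording are illustrative assumptions rather than the code actually used.

```python
import json
import networkx as nx

# Assumes clauses.json is a JSON list of objects with id, text, references, label.
with open("clauses.json") as f:
    clauses = {c["id"]: c for c in json.load(f)}

# Directed graph of clause interdependencies: edge u -> v means clause u references clause v.
G = nx.DiGraph()
G.add_nodes_from(clauses)
for c in clauses.values():
    G.add_edges_from((c["id"], ref) for ref in c["references"])

def build_prompt(clause_id: str) -> str:
    """Include the text of referenced clauses so indirect AI relevance is visible to the model."""
    referenced = "\n".join(clauses[r]["text"] for r in G.successors(clause_id))
    return (
        "Classify the following legal clause as AI-related or not. Answer 'AI' or 'NOT AI'.\n\n"
        f"Clause:\n{clauses[clause_id]['text']}\n\n"
        f"Referenced clauses:\n{referenced or '(none)'}"
    )
```
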
Next steps

  • Ensure 100% on institution names and most sensitive (identifying) info

    • Use regexes for emails and IP addresses, and a classical method (e.g. NER) for names? (see the sketch below)
  • Show evaluations to a lawyer?
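
Regarding the regex idea above, a minimal sketch of deterministic matching for emails and IPv4 addresses that could back up the LLM pass; the exact patterns and placeholder strings are illustrative, not settled choices.

```python
import re

# Illustrative patterns only: a pragmatic email matcher and IPv4 addresses (no IPv6).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact_deterministic(text: str) -> str:
    """Replace emails and IPv4 addresses with placeholders, independent of the LLM."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return IPV4_RE.sub("[IP]", text)

# redact_deterministic("Contact jane.doe@example.org from 10.0.0.7")
# -> "Contact [EMAIL] from [IP]"
```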


Discussion points

  • Discussed status so far
  • Timeline:
    • Four weeks left; we should have a result for ETH before then so we can get some feedback
  • Out-there ideas

Actions

  • Continue with redaction (convert all docs with the M-prompts method) & paragraph retrieval
  • Give more thought to LMPrograms-writing-LMPrograms and other theoretical ideas