Meeting Mon 5 May

Agenda

Status

Refactored all code into a package llm_programs on GH. PRs: https://github.com/swiss-ai/dsl25-8-llm-programs/pulls

TODOs now tracked on GH: https://github.com/swiss-ai/dsl25-8-llm-programs/issues

  • PR 1 (severin): Cleaning etc https://github.com/swiss-ai/dsl25-8-llm-programs/pull/1

    • DocumentDirectory, Document, and DocumentTransform abstractions for easier processing of document directories
    • incl. iterating over pages and windows
    • Cleaning: deleting Markdown tables results in a 3x average size reduction
    • LLM Functions à la v-agent, but supporting local LLMs (for redaction) and inspired more closely by the LLM Nodes of http://arxiv.org/abs/2407.14788
    • Refactored previously-written redaction & contract-generation programs with LM functions
    • RecurrentLM & MapReduce abstractions
  • PR 2 (severin): Generalized Extraction: map + filter https://github.com/swiss-ai/dsl25-8-llm-programs/pull/25

    • Unary, binary, and N-ary LM functions are realised via ArgsPrompter, which first converts positional arguments into a keyword dictionary and then applies it to the prompt template via format_map.
    • Predicate LM functions are realised via "PredicateParse", i.e. parsers that return a boolean given the LM response.
    • The generalized ExtractionProgram is a composition of mapping and filtering (see the sketch after this list)
  • PR 3 (severin): Redaction eval

    • Generate a DocumentDirectory of 10 synthetic documents (ground-truth keywords -> description -> full contract)
    • Evaluate naive Precision & Recall, and "Redaction" Precision & Recall
    • Results (Gemma 3 27B, window size 10K characters):
Method                    | Avg Naive Prec. | Avg Naive Rec. | Avg Redaction Prec. | Avg Redaction Rec.
1 call, 1 prompt          | 0.61            | 0.78           | 0.66                | 0.82
1 call, M prompts         | 0.83            | 0.95           | 0.84                | 0.96
N calls, M prompts, OR    | 0.83            | 0.95           | 0.84                | 0.96
N calls, M prompts, AND   | 0.86            | 0.95           | 0.86                | 0.96
(M prompts = specialized prompts per information type)
  • PR 4 (youssef): Exploratory work on classifying synthetic legal clauses as AI-related or not, using both basic and orchestrated prompting strategies via the Claude API
    • Loaded synthetic legal clause data from clauses.json, which includes id, text, references, and label fields.

    • Visualized clause interdependencies as a directed graph using NetworkX to support analysis of clause referencing behavior.

    • Applied two prompt-based classification strategies:

      • A simple approach using direct LLM calls on clause text.
      • A multi-agent orchestrator-worker pipeline inspired by Anthropic’s agent design principles. (link)
    • The simple approach worked best. I then incorporated referenced clauses into the prompt inputs to improve classification accuracy in cases where relevance was only indirect (see the sketch after this list).

    • Some misclassifications occurred when AI relevance was implied only through references. This informed the move toward including referenced clause content in prompts.
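
A minimal sketch of how the map + filter composition from PR 2 might fit together. The names ArgsPrompter, PredicateParse, and ExtractionProgram come from the notes above, but the constructor signatures, the call_llm helper, and the prompt/parse details are illustrative assumptions, not the package's actual API.

```python
# Hypothetical stand-in for the LLM client used by llm_programs; not the real interface.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a local or hosted LLM here")

class ArgsPrompter:
    """Converts positional arguments into a keyword dict and fills a prompt template."""
    def __init__(self, template: str, arg_names: list[str]):
        self.template = template
        self.arg_names = arg_names

    def prompt(self, *args) -> str:
        kwargs = dict(zip(self.arg_names, args))   # positional args -> keyword dictionary
        return self.template.format_map(kwargs)    # applied to the template via format_map

class PredicateParse:
    """Parses an LM response into a boolean."""
    def parse(self, response: str) -> bool:
        return response.strip().lower().startswith("yes")

def extraction_program(windows, map_prompter, filter_prompter, predicate=PredicateParse()):
    """Sketch of the composition behind the generalized ExtractionProgram: LM map over windows, then LM filter over the mapped results."""
    mapped = [call_llm(map_prompter.prompt(w)) for w in windows]
    return [m for m in mapped
            if predicate.parse(call_llm(filter_prompter.prompt(m)))]
```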

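A sketch of the reference-aware classification step from PR 4, assuming clauses.json is a JSON list of clause objects with the id, text, references, and label fields mentioned above; the edge direction and prompt wording are illustrative assumptions rather than the code actually used.

```python
import json
import networkx as nx

# Assumes clauses.json is a JSON list of objects with id, text, references, label.
with open("clauses.json") as f:
    clauses = {c["id"]: c for c in json.load(f)}

# Directed graph of clause interdependencies: edge u -> v means clause u references clause v.
G = nx.DiGraph()
G.add_nodes_from(clauses)
for c in clauses.values():
    G.add_edges_from((c["id"], ref) for ref in c["references"])

def build_prompt(clause_id: str) -> str:
    """Include the text of referenced clauses so indirect AI relevance is visible to the model."""
    referenced = "\n".join(clauses[r]["text"] for r in G.successors(clause_id))
    return (
        "Classify the following legal clause as AI-related or not. Answer 'AI' or 'NOT AI'.\n\n"
        f"Clause:\n{clauses[clause_id]['text']}\n\n"
        f"Referenced clauses:\n{referenced or '(none)'}"
    )
```
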
Next steps

  • Ensure 100% on institution names and most sensitive (identifying) info

    • Use regexes for emails and IP addresses, and a classical method (e.g. NER) for names? (see the sketch below)
  • Show evaluations to a lawyer?
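
Regarding the regex idea above, a minimal sketch of deterministic matching for emails and IPv4 addresses that could back up the LLM pass; the exact patterns and placeholder strings are illustrative, not settled choices.

```python
import re

# Illustrative patterns only: a pragmatic email matcher and IPv4 addresses (no IPv6).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact_deterministic(text: str) -> str:
    """Replace emails and IPv4 addresses with placeholders, independent of the LLM."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return IPV4_RE.sub("[IP]", text)

# redact_deterministic("Contact jane.doe@example.org from 10.0.0.7")
# -> "Contact [EMAIL] from [IP]"
```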


Discussion points

  • Discussed status so far
  • Timeline:
    • Four weeks left; we should have a result for ETH before then so we can get some feedback
  • Out-there ideas

Actions

  • Continue with redaction (convert all docs with the M-prompts method) & paragraph retrieval
  • Give more thought to LMPrograms-writing-LMPrograms and other theoretical ideas