Towards a LASER Co‐Pilot - laser-base/laser-core GitHub Wiki

Introduction

I wanted to run some experiments to see how close one could come, and with how much effort, to creating a RAG-style chatbot for LASER. The idea is essentially to have a full co-pilot tool out of the box.

Enabling Technology

GPT

At this point everyone knows you can use ChatGPT or various co-pilot tools to get all sorts of help writing code in Python, C, SQL, etc. That's great, but those existing tools are only good at what they were trained on. You can't ask them for help with a product that has only just been released, or that we're just releasing. That said, you can manually copy-paste chunks of code or documentation into your prompt, or, in the case of GPT-4, attach files and make custom queries about them. For example, I've copy-pasted my update_ages.cpp file into the prompt and asked GPT-3.5:
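That manual workflow (paste source into the prompt, then ask a question about it) can be sketched in a few lines. The helper name `make_prompt` and the inline C snippet below are my own stand-ins, not part of the LASER codebase; in practice you would read the real update_ages.cpp from disk.

```python
# Stand-in for the contents of update_ages.cpp (illustrative only).
SOURCE_SNIPPET = """
extern "C" void update_ages(int n, float *ages) {
    for (int i = 0; i < n; i++) ages[i] += 1.0f / 365.0f;
}
"""

def make_prompt(source: str, question: str) -> str:
    """Combine pasted source code with a question into a single prompt."""
    return (
        "Here is an existing C/C++ file from my project:\n"
        "```cpp\n" + source.strip() + "\n```\n\n" + question
    )

prompt = make_prompt(
    SOURCE_SNIPPET,
    "How would I write a new C function in this style that "
    "implements a SQL-style step function?",
)
```

The resulting string is what gets sent to the model, either pasted into the chat window or via an API call.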

How would I write a new C function in this style that implements the following SQL-style step function (...)?

And gotten excellent results.

RAG

RAG (Retrieval Augmented Generation) is an excellent approach for pointing an LLM at a private corpus of text so that a custom chatbot can answer questions about it.
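The core retrieve-then-generate loop can be sketched without any LLM libraries. Below is a deliberately toy version: the "embedding" is a bag-of-words vector, the scoring is cosine similarity, and `answer` stops at building the augmented prompt rather than actually calling an LLM. None of these names come from langchain.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=1):
    """Return the k corpus chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def answer(query, chunks):
    """RAG: retrieve relevant context, then hand it to the LLM.
    Here we stop at the augmented prompt instead of calling a model."""
    context = "\n".join(retrieve(query, chunks))
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = [
    "cli.py is the main entry point for running simulations",
    "update_ages.cpp ages agents by one day per timestep",
]
```

A real pipeline swaps in learned embeddings and a vector store, but the retrieve-then-prompt shape is the same.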

Langchain

langchain is a popular, rapidly evolving Python library for building RAG pipelines and other custom LLM applications.

Methods

Here we use some simple langchain code to ingest a single composite file of code and documentation from my early LASER prototype, then let the user ask 'typical' questions, so we can get a sense of the quality of the answers.

Code:

from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain_community.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator

# Load the single composite file of code and documentation.
loader = TextLoader("/home/jbloedow/DEV/LASER/laser/jb/combo_code_for_gpt.txt")

# Embed and index the document (the org id may not be necessary).
embeddings = OpenAIEmbeddings(organization='<org id>')
index = VectorstoreIndexCreator(embedding=embeddings).from_loaders([loader])

# temperature=0 keeps answers as deterministic as possible.
llm = OpenAI(temperature=0)

while True:
    question = input("Enter a question (or 'Q' to quit): ")
    if question.upper() == 'Q':
        break
    print(index.query(question, llm=llm))
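Under the hood, VectorstoreIndexCreator splits the loaded document into chunks before embedding them. Here is a toy stdlib sketch of what fixed-size character chunking with overlap looks like; the function name and the sizes are illustrative, not langchain's defaults.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into fixed-size character chunks, where each chunk
    repeats the last `overlap` characters of the previous one so that
    sentences straddling a boundary are not lost."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("x" * 500, chunk_size=200, overlap=40)
```

Each chunk is embedded separately, which is why chunk size and overlap affect retrieval quality: too small and context is lost, too large and unrelated material dilutes the embedding.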

Implementation Notes

  1. Note that I manually decided which files to put into my combo file: I included all Python scripts, all .md files (the main README.md and a quickstart.md I created by running nbconvert on my Quickstart.ipynb notebook), and the output of running tree on the root directory of the package. Being judicious about what goes in here, and what stays out, is fairly important.
  2. I tried using a DirectoryLoader to just slurp up every file found under a root node. While the range of file formats langchain can process that way is impressive, it ultimately broke my install, because it pulled in NVIDIA libraries and all sorts of other dependencies, so I gave up on it. It still seems promising, but it's too much of a rabbit hole at this point.
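The manual concatenation step from note 1 can be scripted. The sketch below is my own: the name `build_combo` and the glob patterns are assumptions, and the `tree` output is emulated with a simple pathlib listing rather than shelling out.

```python
from pathlib import Path
import tempfile

def build_combo(root, patterns=("*.py", "*.md")):
    """Concatenate selected files under root into one combo string,
    prefixed with a directory listing (a stand-in for `tree` output)."""
    root = Path(root)
    listing = "\n".join(str(p.relative_to(root)) for p in sorted(root.rglob("*")))
    parts = [f"# Directory listing\n{listing}"]
    for pattern in patterns:
        for path in sorted(root.rglob(pattern)):
            parts.append(f"# ===== {path.name} =====\n{path.read_text()}")
    return "\n\n".join(parts)

# Demo on a throwaway directory with stand-in contents.
tmp = Path(tempfile.mkdtemp())
(tmp / "README.md").write_text("LASER prototype docs")
(tmp / "model.py").write_text("print('hello')")
combo = build_combo(tmp)
```

Keeping the selection logic in one small script makes it easy to experiment with what belongs in the corpus and what doesn't.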

Results

Enter a question (or 'Q' to quit): Give me an overview of the code

The code in this package is used to model the spread of infectious diseases in a population. It includes tools for creating and manipulating population data, as well as functions for running simulations and analyzing results. The main entry point for running simulations is the cli.py file, which allows users to specify various settings and parameters for their simulation. The examples directory contains sample input files and settings for different scenarios, such as the E&W example and the CCS examples. The utils directory contains various utility functions for tasks such as creating population data and fitting EULA parameters. The model_numpy and model_sql directories contain the actual simulation models, with the former using NumPy arrays for faster computation and the latter using SQL databases for larger simulations. The update_ages.cpp file contains functions for updating agent ages in the simulation. Overall, the code is designed to be flexible and customizable for different scenarios and research purposes.

Enter a question (or 'Q' to quit): Is there a quickstart or some simple examples?

Yes, there is a QuickStart.ipynb notebook in the examples directory that walks through running the model and visualizing the results. There are also some pre-configured scenarios in the examples directory that you can run to see how the model works.

Enter a question (or 'Q' to quit): What is the first example?

The first example is the E&W example, which models measles in England and Wales during the post-war period.

Enter a question (or 'Q' to quit): What is the second example?

The second example is the CCS (1 node) example.

Enter a question (or 'Q' to quit): How do I run the second example?

To run the second example, you would need to specify the path to the sandbox directory where the workspace will be created. You can do this by typing in the path or using the default path provided. Once the workspace is created, you can change directory to the sandbox or workspace and run the model by typing in the command "python3 -m idmlaser.measles".

Enter a question (or 'Q' to quit): I typed 'python3 -m idmlaser.measles' and got "ValueError: You will need to provide a settings.py file. If running for first time, use: python3 -m idmlaser.utils.build_template_workspace".

You need to provide a settings.py file for the model to run. If you are running the model for the first time, you can use the command "python3 -m idmlaser.utils.build_template_workspace" to create a template workspace with the necessary files.

Enter a question (or 'Q' to quit): What happens if I run "python3 -m idmlaser.utils.build_template_workspace"?

This command will create a template workspace for the IDMLaser package, including a settings.py and demographics_settings.py file.

Enter a question (or 'Q' to quit): Give me an overview of the things I should set in demographics_settings.py?

In demographics_settings.py, you should set the following key-value pairs:

  • pop_file: filename of compressed csv with all the agents to be modeled. Columns are attributes. Rows are agents.
  • eula_pop_fits: filename of npy file which is the slope and intercept of the eula population over time by node.
  • cbr_file: filename of csv file with the crude birth rates of each node by year.

Enter a question (or 'Q' to quit): Give me an overview of the things I will set in settings.py?

In settings.py, you will set parameters related to the simulation duration, base infectivity, seasonal and infectivity multipliers, reporting, and burnin. These parameters include the simulation duration in days, base infectivity, seasonal multiplier, infectivity multiplier, report filename, report start time, burnin delay, import cases, and the time after which to stop importing.

Enter a question (or 'Q' to quit): simulation duration?

The simulation duration is set to 1000 timesteps in the main simulation function.

Enter a question (or 'Q' to quit): What diseases are modeled in the examples?

The E&W example models measles, the CCS (1 node) example models a generic disease with a birth rate of 17.5, and the CCS (100 nodes) example models a generic disease with a birth rate of 17.5 and a population of 1e7.

Enter a question (or 'Q' to quit): How is measles modeled in terms of susceptibilities, incubation, transmission, and recovery?

Measles is modeled by assigning each agent an "age" in years, an "infected" boolean flag, an "infection_timer" countdown, an "incubation_timer" countdown, an "immunity" boolean flag, an "immunity_timer" countdown, and an "expected_lifespan" value. The "age" is used in the "update_ages.cpp" code to age agents by 1 day each day. The "infected" flag is set to True when an agent is infected and is used in the "progress_infections" function to simulate the spread of the disease. The "infection_timer" and "incubation_timer" countdowns are used to track the progression of the disease in infected agents. The "immunity" flag is set to True when an agent has developed immunity to the disease and the "immunity_timer" countdown is used to track the duration of immunity. The "expected_lifespan" value is used to determine the age at which an agent will die.

Analysis

I need to say more here. I'm generally pretty impressed by the quality of the answers I'm getting, but I'm also very demanding about what I want this tool to be able to do. Ultimately I want it to generate all the code the user needs, including the command-line invocations to run as well as new model code.

Next Steps

  1. Add a few more files to the training corpus.
  2. Turn this into a simple webservice.
  3. Test on the CL version.
  4. Put it in the hands of actual users.
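For step 2, a minimal stdlib webservice wrapper could look like the sketch below. The `answer` function here is a stub standing in for the `index.query(question, llm=llm)` call from the Methods section, and the route and port are my own choices.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def answer(question: str) -> str:
    """Stub: in the real service this would call index.query(question, llm=llm)."""
    return f"(answer for: {question})"

class QAHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expect a JSON body like {"question": "..."}.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"answer": answer(payload.get("question", ""))}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), QAHandler).serve_forever()
```

A production version would want request validation and a proper framework, but this is enough to put the loop behind an HTTP endpoint for early testers.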