Developer Docs

Welcome to the Developer Docs!

Required Software

To work on the project you need Git, Docker, Python, an environment manager for Python (this documentation follows Anaconda), Ollama for managing the different models, and PostGIS for interfacing with Datahub's provided data.

Setup and Installation

Below you will find the walkthrough for installing and running the dh-chatbot-llm project.

Cloning the Repository

  • Clone the repository: $ git clone https://github.com/yunussozeri/dh-chatbot-llm
  • Copy the .env.example to .env using the following command: $ cp .env.example .env
  • Open the .env file and make sure the following variables are set: SECRET_KEY and DATAHUB_NAME (instructions are inside the .env file; see the sketch after this list)
  • Run $ docker compose up -d
  • Wait/check until http://localhost:8000/ shows the chat interface
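
A minimal sketch of what the relevant part of the .env file could look like (the values are placeholders; follow the instructions inside the file for the real values):

SECRET_KEY=<your generated secret key>
DATAHUB_NAME=<name of your Datahub instance>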

After this you can start/stop the system with:

$ docker compose start
$ docker compose stop

If you change the .env file, run the following command to apply the changes:

$ docker compose up -d

Now either import an existing data dump or create a new instance. The steps for importing an existing dump are below.

If you get the error "Cannot connect to the Docker daemon at...", Docker is not running; start Docker and run the command again.
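
On Linux systems that use systemd, for example, the daemon can usually be started with:

$ sudo systemctl start docker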

Preparing the Data-Dump

Datahub provides a ready-to-use database export for Ghana that lets you see and use the system directly, without having to download and process the raw data on your local machine.

Go to the releases page, download the latest *.dump file, and place it in the ./data/ folder.

Run the following command from the root of the repository:

$ docker compose exec datahub python manage.py restore ./data/<downloaded *.dump file>
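
For example, if the downloaded file were named datahub-ghana.dump (a hypothetical name; use the actual file name from the releases page), the command would be:

$ docker compose exec datahub python manage.py restore ./data/datahub-ghana.dump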

Installing LLM Models

After the Docker containers are running, you can install the required models with the following commands:

$ docker compose exec ollama ollama pull mxbai-embed-large:latest
$ docker compose exec ollama ollama pull deepseek-r1:8b
$ docker compose exec ollama ollama pull gemma2:9b
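
You can verify that the models were installed with:

$ docker compose exec ollama ollama list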

Recommended: to run the LLMs on the GPU, take the steps described in the official Ollama Docker documentation and modify the docker-compose.yaml in the root directory of the project.
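
As a rough sketch (based on Docker Compose's GPU reservation syntax; verify the details against the Ollama Docker documentation and adapt them to this project's compose file), enabling an NVIDIA GPU for the ollama service could look like this:

services:
  ollama:
    # requires the NVIDIA Container Toolkit on the host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]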

Installing Ollama for non-Docker runs (optional)

Go to the Ollama website and download the installer for your operating system.

After the installation is complete, open a terminal. You can then download and install the required models with the following commands:

$ ollama pull mxbai-embed-large:latest
$ ollama pull deepseek-r1:8b
$ ollama pull gemma2:9b

Edit the .env file so it points to the new Ollama host.
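
For example, with a native Ollama listening on its default port 11434 (OLLAMA_HOST is an assumed variable name here; check .env.example for the actual key):

OLLAMA_HOST=http://localhost:11434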

Creating a Virtual Environment

This documentation uses the Anaconda Distribution as its Python environment manager. You can use any environment manager you like.

To download it, go to the Anaconda Download Page and get the installer for your operating system.

After the installation is complete, go to the repository location and open a terminal in the root directory.

Create a new virtual environment with the following command:

$ conda create -n <name of your virtual environment> python=3.9.13

Activate it with:

$ conda activate <name of your virtual environment>

and deactivate it again with:

$ conda deactivate

Creating a Django-Superuser (optional)

Run the following command to create a new user with which you can log in to the backend (http://localhost:8000/admin):

$ docker compose exec datahub python manage.py createsuperuser

Building the Rest of the Required Containers

After importing the data dump, you are ready to build the container that will communicate with the language model.

To do this, run the following command in the root directory: $ docker compose build

You need an active internet connection for the build.

After the process is complete, you can use the AI Chat in the Datahub interface.


Reaching the Interface

The interface is available under AI Chat.

[Screenshot: AI Chat interface]

Enter your prompt into the text box and send it to the AI to get information about the data.

Architecture and Design

We decided to separate our components by responsibility; each component also runs in its own Docker container.

The RAG pipeline is designed as follows:

[Diagram: RAG pipeline]
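
For illustration, here is a minimal, hypothetical sketch of the general flow; it is not the project's actual code, assumes Ollama's REST API on the default port, and omits the retrieval store that matches table descriptions to the question:

import requests

OLLAMA = "http://localhost:11434"  # default Ollama port

def embed(text):
    # Embed the question with the embedding model installed earlier.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "mxbai-embed-large:latest", "prompt": text})
    return r.json()["embedding"]

def answer(question, context):
    # Ask the chat model, grounded in the retrieved table descriptions.
    r = requests.post(f"{OLLAMA}/api/chat", json={
        "model": "gemma2:9b",
        "stream": False,
        "messages": [
            {"role": "system", "content": "Answer using this context:\n" + context},
            {"role": "user", "content": question},
        ],
    })
    return r.json()["message"]["content"]

# query_vector = embed("How many districts does Ghana have?")
# ... look up the most similar table descriptions in the database ...
# print(answer("How many districts does Ghana have?", retrieved_context))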

Here are the roles and responsibilities of each component:

  • Datahub Module: Acts as the central service point for end users by providing a UI and offering an interface to inspect the data manually.
  • mongo-db: Contains the data about active tables and persists it locally.
  • PostGIS: Database also used by Datahub to store the demographic and geographical data. Offers an interface to Datahub and Datahub-AI.
  • Ollama: Hosts the LLMs (Large Language Models) and their embeddings.
  • Datahub-AI: The main component, developed by us. Manages the AI and the active tables, and offers a REST-like API to communicate with the AI and handle the table descriptions.

[Diagram: component architecture]

Choosing a Large Language Model

We wanted a model that is free, capable enough to meet our requirements, and small enough to run on many devices.

However, model performance and interface cohesion varied drastically across the models we experimented with.

In the end we decided to go for dolphin-llama3.

Limitations

Missing know-how about the Django architecture sometimes made it hard to adapt solutions to the problems that surfaced, although only in very few cases were we unable to find a workaround.

One option was to deploy the project on the Kubernetes cluster of the University of Applied Sciences of Hamburg, which we are still working on.

Contact and Feedback

The chatbot extension to datahub-ghana was made by students of the University of Applied Sciences in Hamburg.

You can create pull requests to our repository or send an email to one of us with feedback!