User Docs - benedikt-weyer/datahub-ai GitHub Wiki

[Screenshot: landing page]

This project combines a Data Hub instance with an LLM-powered chatbot for interacting with the data.

The Data Hub is a geographic information system (GIS) featuring a data fusion engine designed for data harmonization, alongside an interactive dashboard for effective data exploration and collaboration. Its key objective is to merge data of multiple formats and sources across temporal and spatial axes, allowing users to combine, analyze, and interpret the data.


Required Software

You need Git, Docker, and Python 3.13 installed.

Setup and Installation

The following is a walkthrough for installing and running the dh-chatbot-llm project.

Cloning the Repository

  • Clone the repository: $ git clone https://github.com/yunussozeri/dh-chatbot-llm
  • Copy .env.example to .env using the following command: $ cp .env.example .env
  • Open the .env file and make sure the variables SECRET_KEY and DATAHUB_NAME are set (instructions are inside the .env file; see the sketch after this list)
  • Run $ docker compose up -d
  • Wait until http://localhost:8000/ shows the chat interface
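
For orientation, a minimal sketch of what these two entries look like (the values here are placeholders; follow the instructions inside the .env file for the real values):

SECRET_KEY=<some-long-random-string>
DATAHUB_NAME=<name-of-your-datahub-instance>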

After this you can start/stop the system with:

$ docker compose start
$ docker compose stop

If you change the .env file, run the following command to apply the changes:

$ docker compose up -d

Now either import an existing data dump or create a new instance. Steps for importing an existing dump are below.

If you get the error "Cannot connect to the Docker daemon at...", your Docker daemon is not running; start Docker and run the command again.
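
On Linux with systemd you can, for example, start the daemon with the command below; on macOS and Windows, start Docker Desktop instead.

$ sudo systemctl start docker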

Preparing the Data-Dump

Datahub provides a ready-to-use database export for Ghana, so you can see and use the system directly without downloading and processing the raw data on your local machine.

Go to the releases page, download the latest *.dump file, and place it in the ./data/ folder.

Run the following command from the root of the repository:

$ docker compose exec datahub python manage.py restore ./data/<downloaded *.dump file>
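
For example, if the downloaded file were named datahub-ghana.dump (a hypothetical name; use the actual filename from the releases page), the command would be:

$ docker compose exec datahub python manage.py restore ./data/datahub-ghana.dump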

Installing LLM Models

After the Docker containers are running, you can install the required models with the following commands:

$ docker compose exec ollama ollama pull mxbai-embed-large:latest
$ docker compose exec ollama ollama pull deepseek-r1:8b
$ docker compose exec ollama ollama pull gemma2:9b
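
You can verify that the models were downloaded by listing them:

$ docker compose exec ollama ollama list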

Recommended: to run the LLMs on the GPU, follow the steps in the official Ollama Docker documentation and modify the docker-compose.yaml in the root directory of the project accordingly.
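
For NVIDIA GPUs, the change usually amounts to a device reservation on the Ollama service. A sketch of what this looks like in docker-compose.yaml, assuming the service is named ollama and the NVIDIA Container Toolkit is installed (the official Ollama documentation is authoritative):

  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]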

Creating a Django-Superuser (optional)

Run the following command to create a new user with which you can log in to the backend (http://localhost:8000/admin):

$ docker compose exec datahub python manage.py createsuperuser
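
The command asks interactively for a username, email address, and password. If you prefer a non-interactive setup, you can use Django's standard superuser environment variables (generic Django behavior, not specific to this project; the credentials below are placeholders):

$ docker compose exec \
    -e DJANGO_SUPERUSER_USERNAME=admin \
    -e DJANGO_SUPERUSER_PASSWORD=change-me \
    -e DJANGO_SUPERUSER_EMAIL=admin@example.com \
    datahub python manage.py createsuperuser --noinput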

Building the Rest of the Required Containers

After importing the data dump you are ready to build the container that will communicate with the language model.

To do this, run the following command in the root directory: $ docker compose build

You need an active internet connection for the build.
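
Note that building an image does not by itself recreate running containers. To make sure the newly built image is actually used, bring the stack up again:

$ docker compose up -d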

After the process is complete, you can use the AI Chat in the Datahub interface.


Reaching the Interface

The interface is available under AI Chat.

[Screenshot: AI Chat interface]

Enter your prompt into the text box and send it to the AI to get information about the data.

Verbose Mode

Verbose Mode provides a more detailed overview of what happens inside the AI after you send your request. Click the check-box to enable this mode.

Active Tables

Currently, the AI depends on manually entered descriptions of the tables in the database. These are called active tables.

[Screenshot: Data Description interface]

You can manage them in the interface under Data Description.

Updating Table Descriptions

You can either import prepared table descriptions or write and refine them yourself. To import an existing one, click Choose File (shown as "Datei auswählen" in German-language browsers) and select the JSON file:

[Screenshot: file selection for import]

After selecting the JSON file you will return to the page; now press Import Descriptions.

[Screenshot: Import Descriptions button]

This will load the descriptions into the database.
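
The exact schema of the descriptions file is determined by the export you are importing; purely for illustration, such a file might contain entries of the form (the field names and values here are hypothetical):

[
  {
    "table": "regions",
    "description": "Administrative regions of Ghana with their boundaries."
  }
]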

After that, your data descriptions should look something like this:

[Screenshot: active tables with descriptions]


Architecture and Design

We decided to separate our components by responsibility; each component runs in its own Docker container.

The RAG pipeline is designed as follows:

[Diagram: RAG Pipeline]

Here are the roles and responsibilities of each component:

  • Datahub Module: Acts as the central service point for end users by providing a UI and offering an interface to inspect the data manually.
  • mongo-db: Contains the data about the active tables and persists it locally.
  • PostGIS: Database also used by Datahub to store the demographic and geographic data. Offers an interface to Datahub and Datahub-AI.
  • Ollama: Hosts the LLMs (Large Language Models) and their embeddings.
  • Datahub-AI: The main component, developed by us. Manages the AI and the active tables, and offers a REST-like API for communicating with the AI and handling the table descriptions.

[Diagram: architecture overview]
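
As an illustration of the REST-like API mentioned above (the endpoint path and payload here are hypothetical placeholders, not the project's documented routes; inspect the Datahub-AI code for the actual API), a chat request could look like:

$ curl -X POST http://localhost:8000/api/chat \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Which tables are active?"}'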

Choosing a Large Language Model

We wanted a model that is free, capable enough to meet our requirements, but small enough to run on many devices.

However, the performance of the model and interface cohesion varied drastically when we experimented with different models.

In the end we decided to go for dolphin-llama3.

Limitations

Missing know-how about the Django architecture sometimes made it hard to adapt solutions to the problems that surfaced; in a very few cases we were not able to find a workaround.

One option was to deploy the project on the Kubernetes cluster of the Hamburg University of Applied Sciences, which we are still working on.

Contact and Feedback

The chatbot extension to datahub-ghana was made by students of the Hamburg University of Applied Sciences.

You can create pull requests to our repository or send an email to one of us for feedback!