03.07 preset projects - advantech-EdgeAI/edge_agent GitHub Wiki
The Adaptive Door Status Detection project showcases how to enhance the accuracy and adaptability of a Vision Language Model (VLM) for determining a door's status (e.g., open, closed, slightly open). It utilizes Retrieval-Augmented Generation (RAG) with the NanoDB multimodal database. The core idea is that when the VLM initially makes errors, these instances can be tagged and stored in NanoDB. Subsequently, NanoDB provides this stored contextual information to the VLM, helping it make more accurate predictions over time. This project demonstrates a learning loop where the system's performance improves by learning from corrected past examples.
To run this project, load the preset named `Door_RAG_webrtc_advan`.
Note:
- Ensure the Edge Agent is running and accessible.
- The demonstration video file (`Door_RAG_advan.mp4`) must be in `/ssd/jetson-containers/data/videos/demo/`.
- The `nanodb` folder from `pre_install` should be in `/ssd/jetson-containers/data/`.
- A directory for NanoDB to store its database and the tagged images (e.g., `/data/nanodb/images`, which might map to `/ssd/jetson-containers/data/nanodb/images` on the host) must be accessible and writable.
This project is best understood as a multi-phase workflow:
Phase 1: Initial VLM Assessment (Error Identification)
- Goal: Observe the VLM's baseline performance on door status detection and identify images/scenarios where it makes mistakes.
- Key Nodes: `VideoSource`, `RateLimit`, `AutoPrompt_ICL` (basic prompt), `VILA-1.5-13B` (or another VLM), `VideoOverlay`, `VideoOutput`.
- Data Flow: Video frames are fed to the VLM, which attempts to determine the door status.
Phase 2: Database Learning & Enrichment (Image Tagging)
- Goal: Populate NanoDB with images where the VLM previously erred, along with correct descriptive tags.
- Key Nodes: `VideoSource`, `VideoOutput` (with a "Stop" button to pause on specific frames), `NanoDB_Fashion`.
- Data Flow: The user pauses the video on a problematic frame, types a corrective tag (e.g., "door slightly open") into the `NanoDB_Fashion` UI's "Insert" field, and adds it to the database.
Phase 3: Intelligent Retrieval & Adaptive Detection
- Goal: Run the VLM with RAG enabled, where NanoDB provides relevant context from the tagged images to improve detection accuracy.
- Key Nodes: `VideoSource`, `RateLimit`, `NanoDB_Fashion` (in RAG mode), `AutoPrompt_ICL` (modified to include RAG context), `VILA-1.5-13B`, `VideoOverlay`, `VideoOutput`.
- Data Flow: For each new frame, `NanoDB_Fashion` retrieves similar tagged images/text. This RAG context is passed to `AutoPrompt_ICL`, which combines it with the current frame and the question for the VLM. The VLM then makes a more informed decision.
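The retrieval step in this data flow can be sketched in isolation. The following is an illustrative stand-in, not the actual NanoDB implementation: it assumes L2-normalized embeddings (as CLIP-style models produce) compared by cosine similarity, and the names `db_embeddings`, `db_tags`, and `retrieve` are made up for the example:

```python
import numpy as np

# Stand-in database: unit-norm embeddings paired with their text tags.
rng = np.random.default_rng(0)
db_embeddings = rng.normal(size=(3, 8))
db_embeddings /= np.linalg.norm(db_embeddings, axis=1, keepdims=True)
db_tags = ["door slightly open", "door fully closed", "door open"]

def retrieve(frame_embedding, k=1, threshold=0.90):
    """Return up to k tags whose cosine similarity clears the threshold."""
    q = frame_embedding / np.linalg.norm(frame_embedding)
    sims = db_embeddings @ q            # cosine similarity (unit vectors)
    order = np.argsort(sims)[::-1][:k]  # best matches first
    return [db_tags[i] for i in order if sims[i] >= threshold]

# A stored example retrieves its own tag with similarity 1.0.
print(retrieve(db_embeddings[0]))
```

The `k` and `threshold` parameters play the same roles as the `RAG Sample Size` and `RAG Threshold` settings described later (with the UI threshold on a 0-100 scale rather than 0.0-1.0).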
Configurations vary across the phases. The `Door_RAG_webrtc_advan` preset provides a baseline for one of these stages.
- `VideoSource` Node:
  - Input: `/data/videos/demo/Door_RAG_advan.mp4`.
  - Loops: `-1` for continuous playback.
- `AutoPrompt_ICL` Node:
  - Phase 1 Template: `<reset><image>Check the current status of the door. Is it open or closed?`
  - Phase 3 Template (RAG-enabled): `<reset><text><image>Check the current status of the door. Is it open or closed?` (The `<text>` token is crucial for injecting RAG context.)
  - `seq_replace_mode`: `true`.
  - `Roi`: `false`.
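Conceptually, the `<text>` token in the RAG-enabled template is a placeholder where retrieved context is injected before the prompt reaches the VLM. A toy sketch of that substitution (the `fill_template` helper and the "Similar example" phrasing are hypothetical, not edge_agent internals; `<image>` is left in place for the VLM runtime):

```python
# RAG-enabled template, as in the Phase 3 configuration above.
TEMPLATE = "<reset><text><image>Check the current status of the door. Is it open or closed?"

def fill_template(template, rag_tags):
    """Replace the <text> token with retrieved tags, or drop it if none."""
    context = " ".join(f"Similar example: {t}." for t in rag_tags)
    return template.replace("<text>", context)

print(fill_template(TEMPLATE, ["door slightly open"]))
```

With no retrieved tags, the placeholder collapses to an empty string and the prompt degrades gracefully to the Phase 1 form.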
- `VILA-1.5-13B` (NanoLLM_ICL Node):
  - Model Selection: `Efficient-Large-Model/VILA-1.5-13B`.
  - API Selection: `MLC`.
  - Quantization Setting: `q8f16_ft` is used in the PDF example for this VLM.
  - Chat Template: `llava-v1`.
  - System Prompt: "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions."
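For orientation, a `llava-v1`-style chat template wraps the system prompt and user turn roughly as shown below. This sketch only illustrates the general shape; the exact tokens and spacing NanoLLM uses may differ, and `build_prompt` is a hypothetical helper:

```python
# System prompt from the node configuration above.
SYSTEM = ("A chat between a curious human and an artificial intelligence "
          "assistant. The assistant gives helpful, detailed, and polite "
          "answers to the human's questions.")

def build_prompt(user_msg):
    """Assemble a llava-v1-style turn (approximate format, for illustration)."""
    return f"{SYSTEM} USER: {user_msg} ASSISTANT:"

prompt = build_prompt("<image>\nCheck the current status of the door. Is it open or closed?")
print(prompt)
```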
- `NanoDB_Fashion` Node:
  - `Path` (for all phases): `/data/nanodb/images` (or your chosen path).
  - `Model` (embedding model for all phases): `openai/clip-vit-large-patch14-336` is used in the PDF example.
  - Phase 2 (Tagging, UI interaction): Use the "Insert" field to type descriptive tags for images (e.g., "door is slightly open," "door is fully closed," "door is open with a person visible") and click "Add."
  - Phase 3 (RAG mode settings):
    - `RAG Sample Size`: e.g., `1` (the number of retrieved samples to use as context).
    - `RAG Threshold`: e.g., `90` (the similarity threshold for retrieved samples, 0-100).
Phase 1: Initial VLM Assessment (using a baseline VLM pipeline)
- Load the `Door_RAG_webrtc_advan` preset. It might already be configured for Phase 3 (RAG-enabled detection) or a simpler VLM setup. If it is set for Phase 3, you may need to temporarily disconnect or bypass the `NanoDB_Fashion` RAG input to `AutoPrompt_ICL` and use the Phase 1 `AutoPrompt_ICL` template to assess baseline VLM performance without RAG.
- Run the `Door_RAG_advan.mp4` video.
- Observe the `VideoOutput` panel. Note down frames or scenarios where the VLM's assessment of the door status is incorrect (e.g., calling a slightly open door "closed").
Phase 2: Database Learning & Enrichment (Image Tagging)
- Modify the pipeline (or use a simpler one) to focus on `VideoSource`, `VideoOutput`, and `NanoDB_Fashion` for tagging.
- In the `NanoDB_Fashion` node settings, configure the `Path` and embedding `Model`.
- Play the `Door_RAG_advan.mp4` video.
- When you reach a frame where the VLM previously made an error, pause the video using the "Stop" button.
- Open the `NanoDB_Fashion` node's grid widget/UI.
- In its "Insert" section, type an accurate tag describing the door's status (e.g., "The door is slightly open").
- Click the "Add" button. The current frame is captured, tagged, and stored in NanoDB.
- Repeat for several different scenarios to build a representative database. Example tags from the PDF include "slightly open," "close," and "open."
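The tagging loop above amounts to accumulating (frame, tag) pairs. A toy sketch with a plain list standing in for NanoDB (`add_tagged_frame` is a hypothetical helper; the real database stores image embeddings keyed for similarity search, not raw frame IDs):

```python
# Stand-in for the NanoDB store built up during Phase 2.
database = []

def add_tagged_frame(frame_id, tag):
    """Record a (frame, tag) pair, mimicking the UI's "Add" button."""
    entry = {"frame": frame_id, "tag": tag}
    database.append(entry)
    return entry

# Example tags from the walkthrough; frame IDs are made up.
for frame_id, tag in [(120, "slightly open"), (340, "close"), (560, "open")]:
    add_tagged_frame(frame_id, tag)
print(len(database))  # 3
```

The key point is that each "Add" click grows the retrieval corpus, so Phase 3 accuracy improves only for scenarios represented here.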
Phase 3: Intelligent Retrieval & Adaptive Detection
- Ensure the pipeline is configured for RAG: `VideoSource` -> `RateLimit` -> `NanoDB_Fashion` (RAG mode) -> `AutoPrompt_ICL` (Phase 3 template with the `<text>` token). `RateLimit` also connects to the image input of `AutoPrompt_ICL`, and `NanoDB_Fashion`'s RAG output connects to the text input of `AutoPrompt_ICL`. From there, `AutoPrompt_ICL` -> `VILA-1.5-13B` -> `VideoOverlay` -> `VideoOutput`. The `Door_RAG_webrtc_advan` preset should represent this stage.
- Configure `NanoDB_Fashion` with the correct `Path`, `Model`, `RAG Sample Size`, and `RAG Threshold`.
- Run the `Door_RAG_advan.mp4` video again.
- Observe the `VideoOutput` panel. The VLM's responses should now be more accurate for the scenarios you tagged, as the model leverages RAG context from NanoDB.
- Initial Phase: The VLM may make errors in determining the door's status, especially in ambiguous situations.
- Tagging Phase: Users successfully add images with descriptive tags to the NanoDB database. The `NanoDB_Fashion` UI will show the growing collection of tagged images.
- Adaptive Phase: When the RAG-enhanced pipeline (represented by the `Door_RAG_webrtc_advan` preset, appropriately configured) is run, the VLM's accuracy in describing the door's status should improve for scenarios similar to those tagged in NanoDB. The `VideoOutput` panel will display these more nuanced and correct assessments.
- VLM Still Making Errors in Phase 3:
  - Insufficient/Poor Tags: The quality and quantity of tagged images in NanoDB are crucial. Ensure varied scenarios, especially VLM error cases, are tagged accurately.
  - RAG Settings: Experiment with `RAG Sample Size` and `RAG Threshold` in `NanoDB_Fashion`.
  - Prompt for RAG: Verify that the `AutoPrompt_ICL` template in Phase 3 correctly uses the `<text>` token and effectively instructs the VLM.
- NanoDB Not Storing/Retrieving Images:
  - Verify the `Path` in `NanoDB_Fashion` is correct and writable.
  - Check the embedding `Model` in `NanoDB_Fashion`.
  - Ensure correct pipeline connections for Phase 3 RAG operation.
- Preset Behavior: If the `Door_RAG_webrtc_advan` preset loads directly into the Phase 3 configuration, you might need to manually adjust it (e.g., by temporarily simplifying the `AutoPrompt_ICL` template and disconnecting the RAG input from NanoDB) to perform the Phase 1 assessment. The core value of this project lies in understanding and working through these distinct phases.