03.07 Preset Projects

3.7 Adaptive Door Status Detection

The Adaptive Door Status Detection project showcases how to enhance the accuracy and adaptability of a Vision Language Model (VLM) for determining a door's status (e.g., open, closed, slightly open). It utilizes Retrieval-Augmented Generation (RAG) with the NanoDB multimodal database. The core idea is that when the VLM initially makes errors, these instances can be tagged and stored in NanoDB. Subsequently, NanoDB provides this stored contextual information to the VLM, helping it make more accurate predictions over time. This project demonstrates a learning loop where the system's performance improves by learning from corrected past examples.

3.7.1 Prerequisites

To run this project, load the preset named:

  • Door_RAG_webrtc_advan

Note:

  • Ensure the Edge Agent is running and accessible.
  • The demonstration video file (Door_RAG_advan.mp4) must be in /ssd/jetson-containers/data/videos/demo/.
  • The nanodb folder from pre_install should be in /ssd/jetson-containers/data/.
  • A directory for NanoDB to store its database and the tagged images (e.g., /data/nanodb/images, which might map to /ssd/jetson-containers/data/nanodb/images on the host) must be accessible and writable.
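
A quick way to confirm these prerequisites from the host is a short Python check. This is just a convenience sketch using the paths listed above; adjust it if your layout differs:

```python
import os

# Host-side paths from the prerequisites list above.
video = "/ssd/jetson-containers/data/videos/demo/Door_RAG_advan.mp4"
db_dir = "/ssd/jetson-containers/data/nanodb/images"

print("demo video present:    ", os.path.isfile(video))
print("NanoDB image dir exists:", os.path.isdir(db_dir))
print("NanoDB image dir writable:", os.access(db_dir, os.W_OK))
```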

3.7.2 Pipeline Overview

This project is best understood as a multi-phase workflow:

Phase 1: Initial VLM Assessment (Error Identification)

  • Goal: Observe the VLM's baseline performance on door status detection and identify images/scenarios where it makes mistakes.
  • Key Nodes: VideoSource, RateLimit, AutoPrompt_ICL (basic prompt), VILA-1.5-13B (or other VLM), VideoOverlay, VideoOutput.
  • Data Flow: Video frames are fed to the VLM, which attempts to determine the door status.

Phase 2: Database Learning & Enrichment (Image Tagging)

  • Goal: Populate NanoDB with images where the VLM previously erred, along with correct descriptive tags.
  • Key Nodes: VideoSource, VideoOutput (with a "Stop" button to pause on specific frames), NanoDB_Fashion.
  • Data Flow: User pauses the video on a problematic frame, types a corrective tag (e.g., "door slightly open") into the NanoDB_Fashion UI's "Insert" field, and adds it to the database.

Phase 3: Intelligent Retrieval & Adaptive Detection

  • Goal: Run the VLM with RAG enabled, where NanoDB provides relevant context from the tagged images to improve detection accuracy.
  • Key Nodes: VideoSource, RateLimit, NanoDB_Fashion (in RAG mode), AutoPrompt_ICL (modified to include RAG context), VILA-1.5-13B, VideoOverlay, VideoOutput.
  • Data Flow: For each new frame, NanoDB_Fashion retrieves similar tagged images/text. This RAG context is passed to AutoPrompt_ICL, which combines it with the current frame and the question for the VLM. The VLM then makes a more informed decision.
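
As a rough illustration of this data flow, the sketch below mimics one Phase 3 iteration. The `nanodb.search` and `vlm.generate` calls are hypothetical stand-ins for the Edge Agent node connections, not a real API:

```python
# Phase 3 per-frame flow (conceptual). The <text> token marks where
# AutoPrompt_ICL splices in the RAG context retrieved by NanoDB_Fashion.
TEMPLATE = ("<reset><text><image>Check the current status of the door. "
            "Is it open or closed?")

def answer_for_frame(frame, nanodb, vlm):
    # Retrieve tags of visually similar stored frames (assumed call signature).
    tags = nanodb.search(frame, top_k=1, min_similarity=0.90)
    context = " ".join(tags)  # empty string if nothing cleared the threshold
    # Substitute the RAG context for the <text> token, as AutoPrompt_ICL does.
    prompt = TEMPLATE.replace("<text>", context)
    # The VLM sees the assembled prompt plus the current frame (assumed call).
    return vlm.generate(prompt, image=frame)
```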

3.7.3 Key Node Configurations

Configurations vary across the three phases; the Door_RAG_webrtc_advan preset provides a baseline for one of these stages.

  • VideoSource Node:

    • Input: /data/videos/demo/Door_RAG_advan.mp4.
    • Loops: -1 for continuous playback.
  • AutoPrompt_ICL Node:

    • Phase 1 Template: <reset><image>Check the current status of the door. Is it open or closed?
    • Phase 3 Template (RAG-enabled): <reset><text><image>Check the current status of the door. Is it open or closed? (The <text> token is crucial: it is where the RAG context is injected.)
    • seq_replace_mode: true.
    • Roi: false.
  • VILA-1.5-13B (NanoLLM_ICL Node):

    • Model Selection: Efficient-Large-Model/VILA-1.5-13B.
    • API Selection: MLC.
    • Quantization Setting: q8f16_ft is used in the PDF example for this VLM.
    • Chat Template: llava-v1.
    • System Prompt: "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions."
  • NanoDB_Fashion Node:

    • Path (for all phases): /data/nanodb/images (or your chosen path).
    • Model (Embedding Model for all phases): openai/clip-vit-large-patch14-336 is used in the PDF example.
    • Phase 2 (Tagging - UI Interaction): Use the "Insert" field to type descriptive tags for images (e.g., "door is slightly open," "door is fully closed," "door is open with a person visible") and click "Add."
    • Phase 3 (RAG Mode Settings):
      • RAG Sample Size: e.g., 1 (the number of retrieved samples used as context).
      • RAG Threshold: e.g., 90 (similarity threshold for retrieved samples, on a 0-100 scale). See the sketch below for how these two settings interact.
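
The following is a minimal sketch of how this kind of retrieval filtering can work, assuming L2-normalized CLIP embeddings so that a dot product equals cosine similarity; NanoDB's actual implementation may differ:

```python
import numpy as np

def rag_context(query_emb, db_embs, db_tags, sample_size=1, threshold=90):
    """Keep at most `sample_size` tags whose similarity (scaled to 0-100)
    clears `threshold`. Assumes every embedding row is L2-normalized."""
    sims = db_embs @ query_emb        # one similarity score per stored image
    order = np.argsort(sims)[::-1]    # most similar first
    return [db_tags[i] for i in order[:sample_size]
            if sims[i] * 100 >= threshold]
```

With sample_size=1 and threshold=90, at most one tag is retrieved, and only when its frame is a very close visual match; lowering the threshold trades precision for recall.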

3.7.4 Step-by-Step Running Instructions

Phase 1: Initial VLM Assessment (using a baseline VLM pipeline)

  1. Load the Door_RAG_webrtc_advan preset. It might already be configured for Phase 3 (RAG-enabled detection) or a simpler VLM setup. If it's set for Phase 3, you may need to temporarily disconnect or bypass the NanoDB_Fashion RAG input to AutoPrompt_ICL and use the Phase 1 AutoPrompt_ICL template to assess baseline VLM performance without RAG.
  2. Run the Door_RAG_advan.mp4 video.
  3. Observe the VideoOutput panel. Note down frames or scenarios where the VLM's assessment of the door status is incorrect (e.g., calling a slightly open door "closed").

Phase 2: Database Learning & Enrichment (Image Tagging)

  1. Modify the pipeline (or use a simpler one) to focus on VideoSource, VideoOutput, and NanoDB_Fashion for tagging.
  2. In the NanoDB_Fashion node settings, configure the Path and embedding Model.
  3. Play the Door_RAG_advan.mp4 video.
  4. When you reach a frame where the VLM previously made an error, pause the video using the "Stop" button.
  5. Open the NanoDB_Fashion node's grid widget/UI.
  6. In its "Insert" section, type an accurate tag describing the door's status (e.g., "The door is slightly open").
  7. Click the "Add" button. The current frame is captured, tagged, and stored in NanoDB.
  8. Repeat for several different scenarios to build a representative database. Examples of tags from the PDF include "slightly open," "close," "open."
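
Conceptually, each "Add" click performs something like the sketch below: it embeds the paused frame with the configured CLIP model (here via Hugging Face transformers) and stores the embedding next to the corrective tag. The `db` list is an illustrative stand-in for NanoDB's index, not its real storage format:

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-large-patch14-336"   # embedding model from 3.7.3
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

db = []  # stand-in for the NanoDB index

def add_tagged_frame(frame_path, tag):
    image = Image.open(frame_path)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)   # normalize for cosine search
    db.append({"embedding": emb.squeeze(0), "tag": tag})

add_tagged_frame("paused_frame.jpg", "The door is slightly open")
```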

Phase 3: Intelligent Retrieval & Adaptive Detection

  1. Ensure the pipeline is configured for RAG: VideoSource -> RateLimit -> NanoDB_Fashion (RAG mode) -> AutoPrompt_ICL (Phase 3 template with <text> token). The RateLimit also connects to the image input of AutoPrompt_ICL. NanoDB_Fashion's RAG output connects to the text input of AutoPrompt_ICL. AutoPrompt_ICL -> VILA-1.5-13B -> VideoOverlay -> VideoOutput. The Door_RAG_webrtc_advan preset should represent this stage.
  2. Configure NanoDB_Fashion with the correct Path, Model, RAG Sample Size, and RAG Threshold.
  3. Run the Door_RAG_advan.mp4 video again.
  4. Observe the VideoOutput panel. The VLM's responses should now be more accurate for the scenarios you tagged, as it leverages RAG context from NanoDB.

3.7.5 Expected Behavior & Output

  • Initial Phase: The VLM may make errors in determining the door's status, especially in ambiguous situations.
  • Tagging Phase: Users successfully add images with descriptive tags to the NanoDB database. The NanoDB_Fashion UI will show the growing collection of tagged images.
  • Adaptive Phase: When the RAG-enhanced pipeline (represented by the Door_RAG_webrtc_advan preset, appropriately configured) is run, the VLM's accuracy in describing the door's status should improve for scenarios similar to those tagged in NanoDB. The VideoOutput will display these more nuanced and correct assessments.

3.7.6 Troubleshooting

  • VLM Still Making Errors in Phase 3:
    • Insufficient/Poor Tags: The quality and quantity of tagged images in NanoDB are crucial. Ensure varied scenarios, especially VLM error cases, are tagged accurately.
    • RAG Settings: Experiment with RAG Sample Size and RAG Threshold in NanoDB_Fashion.
    • Prompt for RAG: Verify the AutoPrompt_ICL template in Phase 3 correctly uses the <text> token and effectively instructs the VLM.
  • NanoDB Not Storing/Retrieving Images:
    • Verify the Path in NanoDB_Fashion is correct and writable.
    • Check the embedding Model in NanoDB_Fashion.
    • Ensure correct pipeline connections for Phase 3 RAG operation.
  • Preset Behavior: If the Door_RAG_webrtc_advan preset loads directly into the Phase 3 configuration, you might need to manually adjust it (e.g., by temporarily simplifying the AutoPrompt_ICL template and disconnecting the RAG input from NanoDB) to perform the Phase 1 assessment. The core value of this project is understanding and working through these distinct phases.
