App A White Paper

A Advancing Edge-Agent with Multimodal & RAG Technologies

This white paper explores recent advancements in AI technology for edge computing environments. It highlights three core innovations: Video Transformer technology; image, text, and speech multimodal technology augmented with Retrieval-Augmented Generation (RAG); and a dynamic model-switching architecture with intuitive interfaces and optimized model execution. Designed for AI technology users, this document demonstrates how these features improve AI application deployment, adaptability, and operational efficiency.


A.1 Revolutionizing Edge AI with Video Transformer Technology

A.1.1 Overview

Video Transformer technology integrates deep learning-based vision models with temporal sequence analysis to efficiently process and interpret video data. Unlike traditional approaches, this technology leverages state-of-the-art transformer architectures, enabling:

  • High Accuracy : Precise detection and analysis of complex video patterns.
  • Scalability : Adaptability to a wide range of Advantech MIC-AI devices.
  • Efficiency : Real-time processing with optimized computational requirements.

A.1.2 Key Benefits

Enhanced Performance :

  • Accurate object detection, classification, and video/image comprehension.
  • Effective in challenging scenarios, such as low-light environments and occlusion.

Applications across Industries :

  • Industrial Safety : Detect anomalies such as fires or unauthorized access in real time.
  • Public Safety : Monitor crowded areas for potential threats.

A.1.3 Practical Examples

  • Door Detection : Video Transformers monitor factory doors, triggering alerts if left open for a prolonged period.
  • Forbidden Zone Monitoring : AI models ensure compliance by detecting objects placed in restricted areas, with real-time alerts for violations.
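The door-detection example can be reduced to a simple temporal rule applied to per-frame detections. The following minimal sketch illustrates the idea, assuming a hypothetical stream of per-frame booleans produced by a Video Transformer; it is not the actual edge_agent implementation.

    import time

    MAX_OPEN_SECONDS = 30  # alert once the door has been open longer than this

    def monitor_door(door_open_stream):
        """door_open_stream yields one boolean per frame: True while the door is detected open."""
        opened_at = None
        for is_open in door_open_stream:
            if is_open:
                opened_at = opened_at or time.time()
                if time.time() - opened_at > MAX_OPEN_SECONDS:
                    print("ALERT: door left open for a prolonged period")
            else:
                opened_at = None  # door closed, reset the timer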

A.2 Revolutionizing Multimodal Models: Integrating LLM, VLM, Speech, and RAG

A.2.1 Expanding AI Multimodality

Multimodal AI refers to the ability to process and integrate multiple data modalities, such as text, images, audio, and video, within a single system. This approach enhances AI’s versatility and contextual understanding.

  • Data Integration :
    • Combines inputs like text (LLM), images (VLM), and audio (Speech models) for comprehensive analysis.
    • Example: A security system integrating video surveillance and real-time alerts.
  • Enhanced Contextual Understanding :
    • By leveraging diverse data streams, AI systems gain a richer and more nuanced understanding.
    • Example: Interactive educational tools combining text-to-speech, image recognition, and knowledge retrieval to create personalized learning experiences.
  • Technological Innovations :
    • Synchronized text, audio, and visual processing for seamless user interaction.
    • Retrieval-Augmented Generation (RAG) enhances output by incorporating external knowledge bases.

Multimodal models unlock new possibilities by leveraging the strengths of:

  • LLM (Large Language Models) :
    • Powerful natural language understanding and generation.
    • Context-aware interactions for nuanced responses.
  • VLM (Vision-Language Models) :
    • Seamless integration of visual and textual data.
    • Superior performance in image captioning, video analysis, and object detection.
  • Speech Technologies :
    • Piper : High-quality text-to-speech synthesis with multi-language support.
    • Riva : Real-time speech-to-text and natural language processing.

Combined with Retrieval-Augmented Generation, these components deliver accurate, up-to-date, and contextually relevant answers. A minimal sketch of the retrieval step is shown below.
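The sketch below illustrates the basic RAG pattern under stated assumptions: a sentence-transformers embedding model retrieves the most relevant entries from a small in-memory knowledge base and prepends them to the user's question before it reaches the LLM or VLM. The embedding model name and the knowledge-base contents are placeholders drawn from this document, not the edge_agent implementation.

    from sentence_transformers import SentenceTransformer, util  # assumed embedding backend

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # Tiny in-memory knowledge base standing in for an external document store.
    knowledge_base = [
        "Piper provides high-quality text-to-speech synthesis with multi-language support.",
        "Riva provides real-time speech-to-text and natural language processing.",
    ]
    kb_embeddings = embedder.encode(knowledge_base, convert_to_tensor=True)

    def build_rag_prompt(question: str, top_k: int = 2) -> str:
        """Retrieve the most similar knowledge-base entries and prepend them to the question."""
        q_emb = embedder.encode(question, convert_to_tensor=True)
        scores = util.cos_sim(q_emb, kb_embeddings)[0]
        top = scores.topk(k=min(top_k, len(knowledge_base)))
        context = "\n".join(knowledge_base[i] for i in top.indices.tolist())
        return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

    # The resulting prompt is then passed to the LLM/VLM of choice.
    print(build_rag_prompt("Which speech engine offers text-to-speech?"))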

A.3 Simplifying AI Deployment, Interaction, and Adaptability

A.3.1 The Need for Flexibility

AI deployment often faces challenges, such as the need for diverse models tailored to specific tasks and environments. The plugin architecture and Web UI design provide a model-switching feature that addresses this by enabling:

  • Rapid Deployment : Switching between pre-trained models without downtime.
  • User-Friendly Interface : Intuitive design for users of all technical backgrounds.
  • Efficient Resource Utilization : Loading only necessary models, reducing memory and power consumption.

A.3.2 Benefits of Plugin Architecture Design

  • Modular: Plugins are self-contained and can be developed independently.
  • Extensible: Adding a new plugin requires only placing the file in the plugins directory.
  • Decoupled: The core application is unaware of plugin implementation details.
  • Dynamic: Plugins can be loaded or replaced at runtime without modifying the core application.
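A minimal sketch of how such a plugin mechanism can be realized in Python is shown below: every module dropped into a plugins directory that exposes a register() function is discovered and loaded at runtime, while the core application stays unaware of the plugin's implementation details. The directory name and the register() convention are illustrative assumptions, not the actual edge_agent plugin API.

    import importlib.util
    import pathlib

    def load_plugins(plugin_dir: str = "plugins") -> dict:
        """Discover every *.py file in plugin_dir and load the plugins it contains."""
        plugins = {}
        plugin_path = pathlib.Path(plugin_dir)
        if not plugin_path.is_dir():
            return plugins
        for path in plugin_path.glob("*.py"):
            spec = importlib.util.spec_from_file_location(path.stem, path)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)             # run the plugin file
            if hasattr(module, "register"):
                plugins[path.stem] = module.register()  # plugin hands back its node/model handle
        return plugins

    # Adding a new capability is then just a matter of dropping another file into plugins/.
    print("Loaded plugins:", list(load_plugins()))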

A.3.3 How It Works


  • Dynamic Switching :
    • Models are preloaded and can be activated or replaced via a simple interface.
    • Switching occurs without restarting the system, minimizing operational disruptions.
  • Optimized for Edge Devices :
    • Specifically designed for hardware with limited resources, such as NVIDIA Jetson and Advantech MIC series devices.
  • Real-Time Inspection :
    • Provides real-time visual feedback to facilitate user reviews.
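The hot-swap behaviour described above can be sketched as a small model registry that keeps preloaded models in memory and changes the active one on request, so inference continues without a restart. The class and the toy models below are illustrative assumptions rather than the actual edge_agent interface.

    class ModelRegistry:
        """Keeps preloaded models and lets the active one be swapped at runtime."""

        def __init__(self):
            self._models = {}
            self._active = None

        def register(self, name, model):
            self._models[name] = model   # preload once, reuse afterwards

        def switch(self, name):
            if name not in self._models:
                raise KeyError(f"model '{name}' is not loaded")
            self._active = name          # no restart: only the active pointer changes

        def infer(self, data):
            return self._models[self._active](data)

    # Usage: preload two detectors and switch between them per zone.
    registry = ModelRegistry()
    registry.register("door_detector", lambda frame: "door status")
    registry.register("zone_monitor", lambda frame: "zone status")
    registry.switch("door_detector")
    print(registry.infer("frame-0"))
    registry.switch("zone_monitor")      # switching happens without reloading anything
    print(registry.infer("frame-1"))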

A.3.4 Practical Tips

  • Drag-and-drop tools for assembling AI pipelines.
  • Visual monitoring via VideoOverlay nodes to track system performance.
  • Configurable presets for quick adaptation to various scenarios.

A.4 Video Transformers & Multimodal Model Optimization

  • Frameworks : The model is built using TensorFlow and PyTorch, leveraging ONNX for interoperability across different platforms. This allows for flexibility in deployment and integration with existing systems.
  • Deployment : Optimized for deployment using Docker containers, ensuring portability and ease of integration. It's also compatible with a range of edge platforms, including NVIDIA Jetson devices for efficient real-time processing at the edge.
  • Performance : We've implemented several optimization techniques, including 8-bit quantization, to significantly reduce the model's size and improve inference speed without a substantial loss in accuracy. Dynamic model loading further enhances performance by only loading the necessary components based on the current task, minimizing resource consumption.
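As an example of the 8-bit quantization mentioned above, ONNX Runtime's dynamic quantization utility can convert an exported FP32 ONNX model to INT8 weights, as sketched below. The file names are placeholders, and the actual edge_agent build may rely on a different toolchain (for example, TensorRT on Jetson devices).

    from onnxruntime.quantization import quantize_dynamic, QuantType

    # Convert FP32 weights to INT8 to shrink the model and speed up inference.
    quantize_dynamic(
        model_input="vit_fp32.onnx",    # placeholder path to the exported FP32 model
        model_output="vit_int8.onnx",   # quantized model written here
        weight_type=QuantType.QInt8,
    )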

A.4.1 Fine-Tuning ViT – LoRA Fine-Tuning Steps

Fine-tuning a Vision Transformer (ViT) using Low-Rank Adaptation (LoRA) is an efficient approach to adapt large pre-trained models to specific tasks without updating all parameters.

  1. Label Mapping: Before fine-tuning, we build mappings between label IDs (numerical representations) and label names (human-readable descriptions). This keeps the model's internal representation consistent with our understanding of the data.
  2. Image Processing: Images are resized to a consistent dimension to avoid issues with varying input sizes, and pixel values are normalized to a specific range (e.g., 0 to 1), ensuring optimal input for the Vision Transformer (ViT) model. A minimal sketch of steps 1 and 2 follows the step list below.
  3. Load Pre-trained ViT Model:
    from transformers import ViTForImageClassification
    model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
  4. Apply LoRA Configuration:
    from peft import get_peft_model, LoraConfig
    # peft's TaskType enum has no image-classification entry, so task_type is omitted;
    # LoRA is applied to the ViT attention projections and the classifier head stays trainable.
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules=["query", "value"],
        modules_to_save=["classifier"]
    )
    lora_model = get_peft_model(model, lora_config)
  5. Load and Prepare Dataset:
    from datasets import load_dataset
    dataset = load_dataset('your_dataset_name')
  6. Train the Model:
    import numpy as np
    from transformers import TrainingArguments, Trainer

    # Accuracy metric so that evaluate() in step 7 can report 'eval_accuracy'.
    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return {"accuracy": (predictions == labels).mean()}

    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=5,
        per_device_train_batch_size=16,
        learning_rate=5e-5,
        remove_unused_columns=False  # keep image columns when batching the dataset
    )
    trainer = Trainer(
        model=lora_model,
        args=training_args,
        train_dataset=dataset['train'],
        eval_dataset=dataset['validation'],
        compute_metrics=compute_metrics
    )
    trainer.train()
  7. Evaluate the Model:
    results = trainer.evaluate()
    print(f"Test Accuracy: {results['eval_accuracy']}")

A.4.2 Fine-Tuning Vision-Language Models (VLM) with Retrieval-Augmented Generation (RAG) for QA Models

Our VLM RAG (Visual Language Model Retrieval-Augmented Generation) is an advanced technology designed to address the limitations of QA (Question Answering) models when dealing with image-related questions. By combining Visual Language Models (VLM) with Retrieval-Augmented Generation (RAG), this flow significantly enhances the model's response accuracy. Below is an overview of the key workflows:

  1. Error Identification and Tagging: When the QA model provides unsatisfactory answers to certain image-related questions, these "poorly answered" images are further analyzed. Our system generates additional tags (e.g., core content, context, or object descriptions) for these images. These tags, along with the images, are stored in a database to strengthen the foundation for future queries.
  2. Database Enrichment and Learning: The tagged images form an intelligent database that acts as the system's "memory". When users pose new questions with accompanying images, the system first checks the database to see if there are images with a semantic similarity above a certain threshold. If similar images are found, the system retrieves their associated tags.
  3. Intelligent Retrieval and Enhanced QA: During the QA process, the system doesn’t rely solely on the new image. Instead, it combines the retrieved tags from similar images in the database with the user’s original query, creating richer contextual information. This allows the QA model to provide more accurate and insightful answers based on additional background knowledge. With RAG, the VLM therefore gains (1) dynamic contextual enhancement, (2) iterative learning, and (3) efficient resource utilization. A minimal sketch of the similarity-based retrieval step follows.
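The retrieval described in workflows 2 and 3 hinges on comparing image embeddings against the tag database and reusing only those tags whose similarity clears a threshold. The sketch below illustrates that check with CLIP image embeddings from sentence-transformers; the threshold value, the embedding model, and the database layout are illustrative assumptions, not the edge_agent implementation.

    from PIL import Image
    from sentence_transformers import SentenceTransformer, util  # assumed embedding backend

    encoder = SentenceTransformer("clip-ViT-B-32")  # encodes images into a shared embedding space
    SIMILARITY_THRESHOLD = 0.85                     # illustrative value; tune per deployment

    # Tag database: each entry pairs an image embedding with the tags generated for it.
    tag_db = []

    def add_tagged_image(path, tags):
        emb = encoder.encode(Image.open(path), convert_to_tensor=True)
        tag_db.append((emb, tags))

    def retrieve_tags(path):
        """Return tags of stored images whose similarity to the query image exceeds the threshold."""
        query = encoder.encode(Image.open(path), convert_to_tensor=True)
        retrieved = []
        for emb, tags in tag_db:
            if util.cos_sim(query, emb).item() >= SIMILARITY_THRESHOLD:
                retrieved.extend(tags)
        return retrieved

    # The retrieved tags are appended to the user's question before it reaches the VLM.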

A.4.3 Applications and Advantages

  • Improved Model Performance: Significantly enhances the accuracy and adaptability of the model in challenging scenarios.
  • Efficient Knowledge Expansion: As tagged images accumulate, the system builds a continuously evolving database, enabling more precise answers over time.
  • Semantic Similarity Retrieval: Leverages advanced semantic analysis to provide greater flexibility when dealing with new images.
  • Reduced Costs and Complexity: Eliminates the need for expensive and complex model retraining. The system achieves performance improvements through database enrichment alone.
  • Enhanced User Experience: Users no longer need to struggle with the model’s inability to interpret images correctly. The system automatically learns and optimizes based on historical data.

A.5 Use Case Highlights

  • Case 1: Industrial Automation: Using Video Transformers, an AI-powered system monitors factory floors for safety compliance, detecting hazards such as open flames or unprotected workers. With the Web UI, operators dynamically switch to specialized models for different zones, enhancing efficiency and safety.
  • Case 2: Smart Retail: Retailers utilize AI to analyze video feeds for customer demographics and behavior. By dynamically switching models, businesses can optimize data collection during peak hours and focus on different demographics in real-time.
  • Case 3: Safety Zone Enforcement: Using Vision-Language Models and predefined zones, AI systems monitor restricted areas and trigger alerts for policy violations, such as unauthorized entry or object placement.
  • Case 4: Sandbox Testing: The interactive web UI enables users to rapidly build AI pipelines by adding or replacing VLMs or LLMs, introducing new rules such as frame rate limitation, or applying pre- or post-processing to enhance results. Its versatility allows users to experiment freely without needing to configure the environment or GPU resources.