# 02 Functional Node Descriptions
The Edge Agent includes a variety of functional nodes that can be interconnected to design and experiment with automation agents, personal assistants, and edge AI systems. These nodes are optimized for deployment on Jetson devices.
The Advantech Edge Agent is built upon NVIDIA's Agent Studio. Both share a common foundation of multimodal LLMs, speech and vision transformers, vector databases, and I/O connectivity.
- NanoLLM_ICL: This node loads quantized Large Language Models (LLMs) or Vision Language Models (VLMs) using MLC for speed, AWQ for quality, or HF Transformers for compatibility. It also supports In-Context Learning (ICL). It accepts various inputs such as strings, lists of strings, or image data, and produces outputs such as generated text (delta, partial, final, words), chat history, and tool usage capabilities.
- AutoPrompt_ICL: This node automatically applies a template to incoming data. For instance, it can tag images with a text prompt for VLMs. It supports structured messages that reference previous inputs (like `<image>` or `<text>`) in a last-in, first-out sequence. It also supports ICL by allowing users to input images via file paths in the template, and it can use a Region of Interest (ROI).
- UserPrompt: This plugin allows text input via the keyboard (terminal or UI text box). It can also load prompts from text or JSON files, which can reference other files.
- TextStream: A basic plugin to display any text stream from the system in a UI text box. It can add color highlights to partial and final responses from Automatic Speech Recognition (ASR) or LLM sources.
- WhisperASR: This node performs streaming speech-to-text using Whisper models with TensorRT. It supports 'tiny', 'base', and 'small' Whisper models. It takes audio data as input and outputs final and partial transcripts.
- PiperTTS: This node handles text-to-speech using Piper models with CUDA and onnxruntime. It can download available Piper models and speaker voices. It takes text strings as input and outputs audio as a NumPy array.
- VAD Filter: This is a voice activity detection model using Silero. It filters incoming audio, only passing it through if it exceeds a set VAD threshold. This helps reduce spurious transcripts from background noise when used before ASR plugins.
- AudioRecorder: Saves an audio stream to a WAV file on the server.
- WebAudioIn: Receives audio samples streamed from a client over WebSockets.
- WebAudioOut: Transmits audio samples to a client over WebSockets.
- VideoSource: Captures images from cameras (V4L2/CSI), network streams (RTP, RTSP), or video files (MP4, MKV, AVI, FLV).
- VideoOutput: Outputs H264/H265 encoded video to network streams (RTP, RTSP, WebRTC), display, ROI, or saves it to a file.
- VideoOverlay: Overlays text on video streams for HUD or OSD-style displays.
- RateLimit: Limits the transmission speed of video or audio data to a specified rate.
- NanoDB_Fashion: An in-memory multimodal vector database optimized for text-to-image, image-to-image similarity searches, and image tagging. It automatically generates a database for images placed in a folder and supports insert, delete, and Retrieval-Augmented Generation (RAG) searches.
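These nodes are wired together into processing pipelines. As a rough illustration, the sketch below connects a camera feed through a text overlay to a WebRTC output stream. It is a minimal sketch only: the import path, constructor arguments, and the `add()`/`start()` chaining calls are assumptions based on the upstream nano_llm / Agent Studio conventions, so verify them against the actual edge_agent sources.

```python
# Illustrative sketch only: a minimal camera -> overlay -> WebRTC pipeline.
# Constructor arguments and chaining calls are assumptions; verify against
# the actual edge_agent sources before use.
from nano_llm.plugins import VideoSource, VideoOverlay, VideoOutput

camera  = VideoSource(video_input='/dev/video0')   # V4L2 camera (assumed argument name)
overlay = VideoOverlay()                           # HUD/OSD-style text overlay
stream  = VideoOutput('webrtc://@:8554/output')    # WebRTC output stream (assumed URI)

# Connect each node's output to the next node's input.
camera.add(overlay)
overlay.add(stream)

# Plugins run in their own threads once started.
camera.start()
```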
> **Note**
> For more information on the core functionalities and the general Agent Studio environment, see the official documentation at https://www.jetson-ai-lab.com/agent_studio.html.
While built on Agent Studio, the following nodes appear to be specific additions or significantly customized developments by Advantech.
- OpenWord_detector: Uses an OpenWord model for detection tasks. It takes an image as input and outputs the visualized detection, detection status message, JSON results for MQTT (first detected instance per object), and a JSON string of all detection results in the current frame. It can also be triggered via MQTT if a topic is defined during setup.
- OpenWord_FineTune: This node fine-tunes the OpenWord detector model. Users can supply annotated data to improve the model when its original performance is suboptimal. It is suggested to clear the pipeline before running this node.
- One_Step_Alert: Triggers alerts based on specified keywords by monitoring their frequency within a set time. An alarm activates if alert keywords are predominant. It outputs a JSON result with the state, check time, and collected text.
- Two_Steps_Alert: Monitors specified alert keywords and their frequency. If an initial alert is triggered, it moves to a second level, checking for a "normal" keyword within a timeframe to indicate resolution. If not resolved, the alarm continues. It outputs a JSON result with the state, check time, and collected text.
- Save_Pics: Monitors specified alert keywords and saves images if the keywords are frequent enough within a set timeframe. The number of images to store is configurable by the user.
- MQTT_Publisher: This node receives JSON messages, packages them, and publishes them to a predefined broker under a specified topic for real-time reception by subscribers. Each published message includes a device ID and timestamp along with the original JSON payload.
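To make the MQTT_Publisher behavior concrete, the sketch below shows how an incoming JSON result might be wrapped with a device ID and timestamp before publishing. It uses the common paho-mqtt client; the envelope field names (`device_id`, `timestamp`, `payload`) and the broker/topic values are placeholders for illustration, not the node's actual schema.

```python
# Hypothetical illustration of the MQTT_Publisher packaging step; the
# envelope field names and topic are placeholders, not the node's schema.
import json
import time

import paho.mqtt.client as mqtt

def publish_result(client, topic, device_id, result_json):
    """Wrap a pipeline JSON result with a device ID and timestamp, then publish."""
    envelope = {
        'device_id': device_id,
        'timestamp': time.time(),            # seconds since epoch at publish time
        'payload': json.loads(result_json),  # original JSON result from the node
    }
    client.publish(topic, json.dumps(envelope))

# paho-mqtt 1.x style constructor; 2.x additionally requires a
# CallbackAPIVersion argument, e.g. mqtt.Client(mqtt.CallbackAPIVersion.VERSION2).
client = mqtt.Client()
client.connect('broker.example.com', 1883)   # placeholder broker address
publish_result(client, 'edge/alerts', 'jetson-01',
               '{"state": "alert", "check_time": 5.0, "text": "smoke detected"}')
```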
- Users can create custom plugins by writing a Python file within `edge_agent/nano_llm/plugins/custom_func`. The class in this file must inherit from the `Plugin` parent class. An example, `RecDetRes`, is provided, which processes OpenWord detector results by converting JSON strings into a single output string.
- Plugin Development Steps (a complete sketch follows this list):
  1. Create a class inheriting from `Plugin`.
  2. Define `__init__()` to set up outputs and parameters.
  3. Write a `process(self, input)` method for the logic.
  4. Optionally, add `type_hints()` for UI configuration.
  5. Use `self.output(data, port_id)` to send results.
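Putting the steps together, below is a minimal custom plugin sketch in the spirit of the provided `RecDetRes` example: it converts a JSON string of detection results into a single summary string. The detection field names (`label`, `confidence`), the `outputs=['text']` constructor argument, and the `type_hints()` return keys are illustrative assumptions; only the overall `Plugin`/`process()`/`self.output()` pattern comes from the steps above.

```python
# edge_agent/nano_llm/plugins/custom_func/detection_summary.py
# Minimal custom plugin sketch following the steps above. The JSON field
# names ("label", "confidence") are assumptions for illustration; adapt
# them to the actual OpenWord detector output format.
import json

from nano_llm import Plugin

class DetectionSummary(Plugin):
    """Convert a JSON string of detection results into one summary string."""

    def __init__(self, min_confidence: float = 0.5, **kwargs):
        # Step 2: set up outputs and parameters (one text output port assumed).
        super().__init__(outputs=['text'], **kwargs)
        self.min_confidence = min_confidence

    def process(self, input, **kwargs):
        # Step 3: parse the incoming JSON detections and keep confident ones.
        detections = json.loads(input)
        labels = [d['label'] for d in detections
                  if d.get('confidence', 0.0) >= self.min_confidence]
        summary = ', '.join(labels) if labels else 'nothing detected'
        # Step 5: send the result out on port 0.
        self.output(summary, 0)

    @classmethod
    def type_hints(cls):
        # Step 4 (optional): UI configuration hints; exact keys are assumed.
        return {'min_confidence': {'display_name': 'Min Confidence'}}
```

Assuming the loader scans `custom_func` as described above, the class should then be available as a node that can be wired after the OpenWord detector's JSON output.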