# 02 Functional Node Descriptions
The Edge Agent includes a variety of functional nodes that can be interconnected to design and experiment with automation agents, personal assistants, and edge AI systems. These nodes are optimized for deployment on Jetson devices.
The Advantech Edge Agent is built upon NVIDIA's Agent Studio. Both share a common foundation of multimodal LLMs, speech and vision transformers, vector databases, and I/O connectivity.
- NanoLLM_ICL: This node loads quantized Large Language Models (LLMs) or Vision Language Models (VLMs) using MLC for speed, AWQ for quality, or HF Transformers for compatibility. It also supports In-Context Learning (ICL). It accepts various inputs such as strings, lists of strings, or image data, and produces outputs such as generated text (delta, partial, final, words), chat history, and tool usage capabilities.
- AutoPrompt_ICL: This node automatically applies a template to incoming data. For instance, it can tag images with a text prompt for VLMs. It supports structured messages that reference previous inputs (like `<image>` or `<text>`) in a last-in, first-out sequence. It also supports ICL by allowing users to input images via file paths in the template, and it can use a Region of Interest (ROI).
- UserPrompt: This plugin allows text input via the keyboard (terminal or UI text box). It can also load prompts from text or JSON files, which can reference other files.
- TextStream: A basic plugin to display any text stream from the system in a UI text box. It can add color highlights to partial and final responses from Automatic Speech Recognition (ASR) or LLM sources.
- WhisperASR: This node performs streaming speech-to-text using Whisper models with TensorRT. It supports 'tiny', 'base', and 'small' Whisper models. It takes audio data as input and outputs final and partial transcripts.
- PiperTTS: This node handles text-to-speech using Piper models with CUDA and onnxruntime. It can download available Piper models and speaker voices. It takes text strings as input and outputs audio as a NumPy array.
- VAD Filter: This is a voice activity detection model using Silero. It filters incoming audio, only passing it through if it exceeds a set VAD threshold. This helps reduce spurious transcripts from background noise when used before ASR plugins.
- AudioRecorder: Saves an audio stream to a WAV file on the server.
- WebAudioIn: Receives audio samples streamed from a client over WebSockets.
- WebAudioOut: Transmits audio samples to a client over WebSockets.
- VideoSource: Captures images from cameras (V4L2/CSI), network streams (RTP, RTSP), or video files (MP4, MKV, AVI, FLV).
- VideoOutput: Outputs H264/H265 encoded video to network streams (RTP, RTSP, WebRTC), display, ROI, or saves it to a file.
- VideoOverlay: Overlays text on video streams for HUD or OSD-style displays.
- RateLimit: Limits the transmission speed of video or audio data to a specified rate.
- NanoDB_Fashion: An in-memory multimodal vector database optimized for text-to-image, image-to-image similarity searches, and image tagging. It automatically generates a database for images placed in a folder and supports insert, delete, and Retrieval-Augmented Generation (RAG) searches.
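These nodes are wired together into processing pipelines. As a rough illustration, the sketch below connects a camera feed through a text overlay to a WebRTC output stream. It is a minimal sketch only: the import path, constructor arguments, and the `add()`/`start()` chaining calls are assumptions based on the upstream nano_llm / Agent Studio conventions, so verify them against the actual edge_agent sources.

```python
# Illustrative sketch only: a minimal camera -> overlay -> WebRTC pipeline.
# Constructor arguments and chaining calls are assumptions; verify against
# the actual edge_agent sources before use.
from nano_llm.plugins import VideoSource, VideoOverlay, VideoOutput

camera  = VideoSource(video_input='/dev/video0')   # V4L2 camera (assumed argument name)
overlay = VideoOverlay()                           # HUD/OSD-style text overlay
stream  = VideoOutput('webrtc://@:8554/output')    # WebRTC output stream (assumed URI)

# Connect each node's output to the next node's input.
camera.add(overlay)
overlay.add(stream)

# Plugins run in their own threads once started.
camera.start()
```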
> **Note**
> For more information on the core functionalities and the general Agent Studio environment, see the official documentation at https://www.jetson-ai-lab.com/agent_studio.html.
While built on Agent Studio, the following nodes appear to be specific additions or significantly customized developments by Advantech.
- OpenWord_detector: Uses an OpenWord model for detection tasks. It takes an image as input and outputs the visualized detection, detection status message, JSON results for MQTT (first detected instance per object), and a JSON string of all detection results in the current frame. It can also be triggered via MQTT if a topic is defined during setup.
- OpenWord_FineTune: This node fine-tunes the OpenWord detector model. Users can supply annotated data to improve the model when its original performance is suboptimal. It is suggested to clear the pipeline before running this node.
- One_Step_Alert: Triggers alerts based on specified keywords by monitoring their frequency within a set time. An alarm activates if alert keywords are predominant. It outputs a JSON result with the state, check time, and collected text.
- Two_Steps_Alert: Monitors specified alert keywords and their frequency. If an initial alert is triggered, it moves to a second level, checking for a "normal" keyword within a timeframe to indicate resolution. If not resolved, the alarm continues. It outputs a JSON result with the state, check time, and collected text.
- Save_Pics: Monitors specified alert keywords and saves images if the keywords are frequent enough within a set timeframe. The number of images to store is configurable by the user.
- MQTT_Publisher: This node receives JSON messages, packages them, and publishes them to a predefined broker under a specified topic for real-time reception by subscribers. Each published message includes a device ID and timestamp along with the original JSON payload.
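To make the MQTT_Publisher behavior concrete, the sketch below shows how an incoming JSON result might be wrapped with a device ID and timestamp before publishing. It uses the common paho-mqtt client; the envelope field names (`device_id`, `timestamp`, `payload`) and the broker/topic values are placeholders for illustration, not the node's actual schema.

```python
# Hypothetical illustration of the MQTT_Publisher packaging step; the
# envelope field names and topic are placeholders, not the node's schema.
import json
import time

import paho.mqtt.client as mqtt

def publish_result(client, topic, device_id, result_json):
    """Wrap a pipeline JSON result with a device ID and timestamp, then publish."""
    envelope = {
        'device_id': device_id,
        'timestamp': time.time(),            # seconds since epoch at publish time
        'payload': json.loads(result_json),  # original JSON result from the node
    }
    client.publish(topic, json.dumps(envelope))

# paho-mqtt 1.x style constructor; 2.x additionally requires a
# CallbackAPIVersion argument, e.g. mqtt.Client(mqtt.CallbackAPIVersion.VERSION2).
client = mqtt.Client()
client.connect('broker.example.com', 1883)   # placeholder broker address
publish_result(client, 'edge/alerts', 'jetson-01',
               '{"state": "alert", "check_time": 5.0, "text": "smoke detected"}')
```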
- Users can create custom plugins by writing a Python file within `edge_agent/nano_llm/plugins/custom_func`. The class in this file must inherit from the `Plugin` parent class. An example, `RecDetRes`, is provided, which processes OpenWord detector results by converting JSON strings into a single output string.
- Plugin Development Steps (a complete sketch follows this list):
  1. Create a class inheriting from `Plugin`.
  2. Define `__init__()` to set up outputs and parameters.
  3. Write a `process(self, input)` method for the logic.
  4. Optionally, add `type_hints()` for UI configuration.
  5. Use `self.output(data, port_id)` to send results.
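Putting the steps together, below is a minimal custom plugin sketch in the spirit of the provided `RecDetRes` example: it converts a JSON string of detection results into a single summary string. The detection field names (`label`, `confidence`), the `outputs=['text']` constructor argument, and the `type_hints()` return keys are illustrative assumptions; only the overall `Plugin`/`process()`/`self.output()` pattern comes from the steps above.

```python
# edge_agent/nano_llm/plugins/custom_func/detection_summary.py
# Minimal custom plugin sketch following the steps above. The JSON field
# names ("label", "confidence") are assumptions for illustration; adapt
# them to the actual OpenWord detector output format.
import json

from nano_llm import Plugin

class DetectionSummary(Plugin):
    """Convert a JSON string of detection results into one summary string."""

    def __init__(self, min_confidence: float = 0.5, **kwargs):
        # Step 2: set up outputs and parameters (one text output port assumed).
        super().__init__(outputs=['text'], **kwargs)
        self.min_confidence = min_confidence

    def process(self, input, **kwargs):
        # Step 3: parse the incoming JSON detections and keep confident ones.
        detections = json.loads(input)
        labels = [d['label'] for d in detections
                  if d.get('confidence', 0.0) >= self.min_confidence]
        summary = ', '.join(labels) if labels else 'nothing detected'
        # Step 5: send the result out on port 0.
        self.output(summary, 0)

    @classmethod
    def type_hints(cls):
        # Step 4 (optional): UI configuration hints; exact keys are assumed.
        return {'min_confidence': {'display_name': 'Min Confidence'}}
```

Assuming the loader scans `custom_func` as described above, the class should then be available as a node that can be wired after the OpenWord detector's JSON output.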