Software Requirement Specification
Methodology
The Walking Assistant Robot for Visually Impaired Users tackles this problem with a modular, iterative methodology that unifies deep learning, natural language processing, and embedded systems. The principal components of the methodology are hardware configuration, object detection, Bangla voice navigation, system calibration, and final integration, all focused on keeping real-time capability on low-power hardware.
DL in Action (Object Detection Models)
Object detection is a fundamental computer vision and image processing capability that detects and localizes objects such as people, obstacles, or vehicles in digital images and video streams. In our project, the Walking Assistant Robot for Blind People, object detection is a critical component: it lets the system perceive its surroundings and detect potential obstacles in real time, enabling safe and effective movement.
Several deep learning models are optimized for, and run well on, low-powered processing units. Some of them include:
- Tiny-YOLOv3 / Tiny-YOLOv4:
Tiny-YOLOv4 (or v3) provides a good trade-off between speed and accuracy, enabling real-time detection on the Jetson Orin Nano (4GB) without external accelerators. It supports multiple object classes and can be deployed with TensorFlow Lite or TensorRT, or converted to ONNX for efficient GPU inference (see the inference sketch after this list).
- MobileNet-SSD:
Light and fast, MobileNet-SSD performs well on the Jetson Orin Nano with low GPU usage and attains decent accuracy while consuming far less memory than full-precision YOLO models.
- EfficientDet-Lite:
Optimized for edge devices like the Jetson Orin Nano, EfficientDet-Lite0 and Lite1 offer low-latency inference with reasonable accuracy for detecting small objects. TensorRT acceleration and model quantization can be applied to improve their performance further.
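As a concrete illustration, the snippet below runs Tiny-YOLOv4 through OpenCV's DNN module. This is a minimal sketch, not the project's final pipeline: the weight, config, and class-name file names, the camera index, and the thresholds are assumptions, and the CUDA backend lines only apply if OpenCV was built with CUDA support.

```python
# Minimal sketch: real-time Tiny-YOLOv4 inference with OpenCV's DNN module.
# yolov4-tiny.cfg / yolov4-tiny.weights / coco.names are assumed to be the
# standard Darknet release files placed alongside this script.
import cv2

net = cv2.dnn.readNetFromDarknet("yolov4-tiny.cfg", "yolov4-tiny.weights")
# Only effective if OpenCV was compiled with CUDA support (as on Jetson builds).
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

with open("coco.names") as f:
    class_names = [line.strip() for line in f]

cap = cv2.VideoCapture(0)  # camera index is an assumption
while True:
    ok, frame = cap.read()
    if not ok:
        break
    class_ids, scores, boxes = model.detect(frame, confThreshold=0.4, nmsThreshold=0.4)
    for cid, score, box in zip(class_ids, scores, boxes):
        # box is (x, y, w, h); these results would feed the NLP/voice module
        print(class_names[int(cid)], float(score), box)
cap.release()
```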
Schematic Diagram
Model Architecture (YOLOv4)
- Input Layer: Accepts the input image and normalizes it to a uniform size and pixel value range for processing.
- 13 Convolutional Layers: Use Conv2D operations to extract visual features, with Batch Normalization to prevent training instability and LeakyReLU to maintain gradient flow for negative values. These layers learn incrementally, from simple edges and textures to complex object structures (a minimal sketch of one such block follows this list).
- 6 MaxPooling Layers: Downsample by selecting the maximum value within each region, reducing spatial dimensions to cut computation and highlight salient features while providing a degree of spatial invariance.
- 2 Fully Connected Layers: Convert extracted features into predictions—bounding box coordinates, objectness scores, and class probabilities. Detection is performed at two spatial scales to improve accuracy for small and large objects.
- Output Layer: Ends with two or more output layers; their size is normally determined by the number of classes in the dataset or by the model's classification problem.
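The following is a minimal Keras sketch of the convolution-and-pooling pattern described above, assuming TensorFlow is available. The filter counts and input size mirror typical Tiny-YOLO settings but are illustrative, not the project's trained network.

```python
# Illustrative sketch (not the exact Tiny-YOLO weights): a Conv2D + BatchNorm +
# LeakyReLU block followed by MaxPooling, repeated through the backbone.
import tensorflow as tf

def conv_block(x, filters):
    # 3x3 convolution extracts local visual features; bias is omitted because BN follows
    x = tf.keras.layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)   # stabilizes training
    return tf.keras.layers.LeakyReLU(0.1)(x)      # keeps gradient flow for negative values

inputs = tf.keras.Input(shape=(416, 416, 3))      # normalized input image
x = conv_block(inputs, 32)
x = tf.keras.layers.MaxPooling2D(2)(x)            # downsample: 416 -> 208
x = conv_block(x, 64)
x = tf.keras.layers.MaxPooling2D(2)(x)            # 208 -> 104
# ... further blocks would continue doubling filters while halving resolution
model = tf.keras.Model(inputs, x)
model.summary()
```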
Why Tiny or Low-Powered Models?
Suppose, for one Tiny-YOLOv4 convolutional layer:
- Input channels: 32
- Filter size: 3 × 3
- Number of filters: 64
Then:
- Conv2D parameters: (3 × 3 × 32) × 64 = 18,432
- BatchNorm parameters: 4 × 64 = 256
- Total for this layer: 18,688 parameters
Tiny-YOLOv4 has approximately 6.06 million parameters in total.
This includes all convolutional + batch normalization layers.
On the other hand, YOLOv4 has approximately 64 million parameters in total.
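The per-layer arithmetic above can be checked with a few lines of Python; `conv_bn_params` below is just an illustrative helper, not a framework API.

```python
# Worked check of the per-layer count above.
def conv_bn_params(in_channels, kernel_size, out_channels):
    conv = kernel_size * kernel_size * in_channels * out_channels  # no bias when BN follows
    bn = 4 * out_channels  # gamma, beta, running mean, running variance
    return conv, bn

conv, bn = conv_bn_params(in_channels=32, kernel_size=3, out_channels=64)
print(conv)       # 18432
print(bn)         # 256
print(conv + bn)  # 18688
```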
The computational requirements of a deep learning model grow with its architectural complexity and parameter count. Models with many parameters, such as YOLOv4 or YOLOv7, are highly accurate but expensive in terms of memory consumption, training time, inference time, and power. Such models usually require high-end computing setups with GPUs or TPUs that provide parallel processing, high memory bandwidth, and adequate thermal management.
In real-world applications, however, especially on embedded systems or edge devices, such resource-heavy models are not practical. Instead, lightweight models like MobileNet or Tiny-YOLOv4 are used. These models are designed to offer a compromise between performance and efficiency by reducing the number of parameters and using techniques such as depthwise separable convolutions or quantization.
Tiny-YOLOv4, for instance, is a lightweight version of the YOLO object detector with a significantly smaller computational burden yet good accuracy. Its compact architecture allows it to run in real time on power-constrained, low-compute devices, making it ideal for real-time use cases such as surveillance, assistive technology for visually impaired individuals, robotics, or autonomous navigation, where rapid decisions based on visual input are critical.
Moreover, reducing model size not only improves responsiveness but also decreases heat generation and power consumption, which matters greatly for battery-powered systems. In practice, optimizing models for such constraints through techniques like pruning, quantization, or model distillation is essential to deploy deep learning applications in the field.
NLP in Action
Natural Language Processing (NLP) acts as a translation layer that converts AI system outputs and sensor information into understandable Bangla voice commands for visually impaired users. These features make communication more effective and allow users to make sound decisions while navigating. The principal functions of the NLP component are:
- The NLP component converts object detection results and navigation directions into direct, easily understandable Bangla instructions (a minimal phrase-mapping sketch follows this list).
- The system must produce voice instructions that match the situation, sound natural, and carry enough expressiveness to build trust between the device and its users.
- Processing must run in real time on minimal power so that the component is suitable for deployment in wearable assistive systems.
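A minimal phrase-mapping sketch is shown below. The class names, box layout, frame width, and Bangla wording are placeholders chosen for illustration, not the project's final phrase set.

```python
# Illustrative sketch: mapping detector output to short Bangla navigation phrases.
BANGLA_PHRASES = {
    "person": "সামনে মানুষ আছে",        # "there is a person ahead"
    "chair": "সামনে বাধা আছে",           # "there is an obstacle ahead"
    "car": "সামনে গাড়ি আছে",             # "there is a car ahead"
}

def detection_to_instruction(label: str, box, frame_width: int) -> str:
    """Build one Bangla instruction from a single detection (box = x, y, w, h)."""
    phrase = BANGLA_PHRASES.get(label, "সামনে বাধা আছে")
    x, _, w, _ = box
    center = x + w / 2
    if center < frame_width / 3:
        return phrase + ", ডানে যান"     # obstacle on the left -> "go right"
    if center > 2 * frame_width / 3:
        return phrase + ", বামে যান"     # obstacle on the right -> "go left"
    return phrase + ", থামুন"            # obstacle straight ahead -> "stop"

print(detection_to_instruction("person", (50, 80, 120, 200), frame_width=416))
```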
Speech Synthesis Models
- The system produces real-time Bangla voice instructions from navigation text using both rule-based and deep-learning techniques (see the synthesis sketch after this list).
- Data-driven models sound more natural than rule-based ones, but rule-based systems are faster.
- Real-time operation is achieved through model compression and quantization, and by keeping outputs short and direct.
- The resulting system provides distinct, responsive voice guidance in Bangla for safe navigation.
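As one possible synthesis path, the sketch below uses the gTTS Python package (a client for the Google TTS service listed later under Model Optimization). The output file name and the mpg123 playback command are assumptions; an offline engine such as FastSpeech would be wired in at the same point.

```python
# Hedged sketch of the cloud-backed TTS fallback using the gTTS package.
from gtts import gTTS
import subprocess

def speak_bangla(text: str, out_path: str = "instruction.mp3") -> None:
    tts = gTTS(text=text, lang="bn")   # "bn" selects Bangla
    tts.save(out_path)
    # Any local audio player works; mpg123 is just an example.
    subprocess.run(["mpg123", "-q", out_path], check=False)

speak_bangla("সামনে বাধা আছে, থামুন")   # "obstacle ahead, stop"
```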
Speech Challenges and Solutions
- Producing the robot's Bangla voice involved overcoming several obstacles. Regional variation in Bangla pronunciation required adjustments; we addressed this by collecting data across dialects and adapting the system so that a broader range of users could understand it.
- Off-the-shelf systems either sounded artificial or could not meet real-time requirements. The team trained lighter, clearer, and more responsive models to address this.
- Running on the available hardware required model compression and short speech outputs. The pitch and vocal tone were also adjusted to give the voice a more melodic quality, which made it easier for listeners to understand.
Model Optimization
Preparing AI models for execution on the Jetson Orin Nano requires specific optimization steps to keep the system fast and frugal with resources while maintaining effective performance. The optimization process combined size reduction, speed enhancement, and memory optimization while keeping outputs accurate.
1. Quantization: Quantization kept the models small and fast. Converting 32-bit floating-point weights to 8-bit integers let the deep learning engine run faster on the Jetson Orin Nano (4GB) without sacrificing much accuracy (a post-training quantization sketch follows this list).
2. Pruning: The models were pruned to eliminate nonessential parameters. Removing superfluous weights from Tacotron 2 reduced both the computational load at inference time and the memory footprint.
3. Lightweight Models: System performance can be accelerated further by adopting these lightweight models:
   - Tiny DeepSpeech for speech-to-text.
   - Festival TTS and the Google TTS API for straightforward Bangla voice generation.
   - FastSpeech for low-latency, expressive TTS.
4. Vocabulary Limiting: Restricting both speech recognition and synthesis to a limited set of navigation terms improved performance and made the system simpler, faster, and more predictable.
5. Frame Skipping and Buffering: Frame skipping and buffering are used to manage CPU load and prevent audio glitches. Skipping unneeded audio frames keeps the voice instructions flowing smoothly and without interruption.
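Below is a hedged sketch of the post-training INT8 quantization step using the TensorFlow Lite converter. The SavedModel path, input size, and the random representative-data generator are placeholders; TensorRT INT8 calibration on the Jetson follows the same principle.

```python
# Illustrative post-training INT8 quantization with the TensorFlow Lite converter.
import numpy as np
import tensorflow as tf

def representative_data():
    # In practice this would yield real preprocessed camera frames.
    for _ in range(100):
        yield [np.random.rand(1, 416, 416, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("detector_saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

with open("detector_int8.tflite", "wb") as f:
    f.write(converter.convert())
```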
System Integration Challenges
- Real-Time Synchronization: Integrating multiple real-time subsystems, including perception, navigation, and communication, was difficult to manage because of hardware restrictions.
- Modular Design Approach: A modular framework allowed the different subsystems to operate independently while remaining coordinated. This improved overall system reliability and maintainability.
- Performance Optimization: The optimization process minimized model memory demands and processing requirements to achieve better performance on minimal hardware.
- Efficient Communication: A fast communication protocol was developed to provide dependable data transfer among the system's components.
- Sensor Reliability: Filtering and cross-validation techniques were used to handle inconsistent, noisy sensor readings, improving accuracy and stability (a simple filtering sketch follows this list).
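The filtering idea can be as simple as a sliding median over the ultrasonic distance readings, as in the sketch below; the window size and sample values are assumptions for illustration.

```python
# Illustrative sketch: a sliding median filter suppresses spikes in noisy
# ultrasonic distance readings before they are cross-checked against the
# camera's obstacle detections.
from collections import deque
from statistics import median

class DistanceFilter:
    def __init__(self, window: int = 5):
        self.samples = deque(maxlen=window)

    def update(self, raw_cm: float) -> float:
        """Add one raw reading (in cm) and return the median-filtered value."""
        self.samples.append(raw_cm)
        return median(self.samples)

f = DistanceFilter()
for raw in [102, 101, 950, 99, 100]:   # 950 is a spurious spike
    print(f.update(raw))
```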
Process of Work
- Hardware & System design: Assembling all the hardware components to begin constructing the robot. Structural and weight-distribution issues may arise at this stage.
- Object detection: Preprocess and augment the data, then train different DL models on several large datasets and compare their performance to select the most optimized model.
- Bangla Voice Guidance using NLP: Utilize Bangla TTS to provide clear, real-time navigation instructions on low-power devices.
- Sensor-based issue evaluation: Resolving sensor-related issues in user tracking and stair climbing.
- Testing & optimization: Optimizing the AI models, followed by testing the robot.
This system places greater emphasis on balancing performance with computational efficiency, making it ideal for low-power assistive hardware.
Future Scope
The developed system creates a cost-effective platform designed specifically to assist visually impaired users with their mobility needs. Future development will concentrate on improvements to both system functionality and the user interface:
- Users would benefit from basic voice command functionality that lets them control the robot with spoken commands.
- The device would incorporate an SOS alert system that notifies emergency contacts during unsafe situations or accidents.
- An upgrade to the robot will enable automatic user tracking and movement adaptation for following the user.
- GPS integration would give the robot outdoor navigation capabilities, allowing users to receive guided travel in outdoor environments.
- Enhancing the obstacle detection system to recognize more types of obstacles would lead to more precise guidance.
- Custom Bangla language models should be developed for more natural, dialect-aware speech output.
- Future development will also focus on creating a smaller, portable version of the device that is easy to wear.
Conclusion
The primary work of this project involved developing user-robot communication tools using TTS technologies and Bangla NLP systems. To meet the requirements, the voice interface had to be natural and accurate as well as fast and reliable. Our solution meets the needs of blind users in Bangladesh thanks to optimized models, tailored speech training, and considered implementation choices. Through its development, the project achieves two main goals: solving the technological challenges and improving accessibility and safety in public spaces for blind users.