LCR Fleet Alerts - directedmachines/customer-support GitHub Wiki
Email alerts are sent from: "[email protected]" with the subject "message from LCR24ZS0-xxxxxxxxxxxxxx"
This alert is sent when the robot is active. It may also be sent when a user connects to the robot UI over Wi-Fi or LTE.
The following charge-level alerts are sent as different thresholds are reached. If the charge level is dropping while the robot is plugged in or out in the sun, diagnose whether there is a charging issue or whether something is draining the battery (e.g. a device plugged into the inverter). In low-power mode the robot can still drive, but it will be slower and AUX will be disabled.
- Charge level at 100%
- Voltage is 47.710000, estimated charge is 49.965795
- Voltage is 53.470000, estimated charge is 19.873410, setting power state to [LOW_POWER]
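When triaging a mailbox full of these alerts, it can help to pull the numbers out of the message text. The following is a minimal sketch, assuming the alert body matches the sample lines above. The function name `parse_charge_alert` is hypothetical, and treating the two values as volts and percent is an assumption based on the "Charge level at 100%" wording.

```python
import re

# Matches the two numeric fields and the optional power-state suffix
# seen in the sample alert lines above.
CHARGE_ALERT = re.compile(
    r"Voltage is (?P<voltage>\d+(?:\.\d+)?), "
    r"estimated charge is (?P<charge>\d+(?:\.\d+)?)"
    r"(?:, setting power state to \[(?P<state>\w+)\])?"
)


def parse_charge_alert(line):
    """Return (voltage, estimated_charge, power_state), or None if no match.

    power_state is None when the alert does not announce a state change.
    """
    match = CHARGE_ALERT.search(line)
    if match is None:
        return None
    return float(match["voltage"]), float(match["charge"]), match["state"]
```

Lines that are not voltage alerts, such as "Charge level at 100%", return `None` and can be handled separately.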
The self monitoring service could not sense battery voltage, most likely because none of the motor controllers or sensors that measure it are enumerated due to hardware faults. Likely causes are:
- USB hub for motor controllers is faulty due to damaged USB peripheral or frame short.
- Use Main Electronics and/or USB Troubleshooting diagnostic trees
- See self monitoring fault troubleshooting if a self monitoring fault is reported. If all 3 motor controllers report disconnected or just disappear from the device list (possibly due to a short), we may also see a "Battery Voltage Not Detected" alert. Note that this can also be a method to diagnose a short for TC1.1.
- The electronics went through a deep power cycle because the battery ran out of charge, and motor controllers from a specific bad batch are stuck in DFU mode. See the Motor Controller Disconnected section.
Code branch mismatch: GIT branch set to development, service state (JSON configuration) set to production
No action should be taken unless this message repeats periodically due to a restart loop.
When the code branch is mismatched and the runtime did not self-PATCH correctly, the UI will periodically restart and rebuild to try to sync the latest code. To fix this error, either SSH into the robot and switch the git branch to the correct branch, or patch the "sourceBranch" for "cap" in the self-monitoring-tasks UI: http://<ROBOT_IP>:8000/dashboard/services/ui/?path=/mgmt/self-monitoring-tasks/default
If this happened suddenly without explanation it could be caused by data corruption on the SD card. When corruption is detected the configuration files are reloaded and changes may be lost.
Low disk space can lead to unsafe, unpredictable behavior. SSH into the device, determine how much disk space remains, and check the log file sizes.
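The disk check described above can be sketched in Python, assuming a Python 3 interpreter is available on the device. The function name `disk_report` and the default `/var/log` directory are assumptions for illustration, not paths confirmed by this guide.

```python
import os
import shutil


def disk_report(path="/", log_dir="/var/log", top=5):
    """Return (free_percent, largest_files) for a quick disk-space check.

    largest_files is a list of (size_in_bytes, file_path), biggest first.
    """
    usage = shutil.disk_usage(path)
    free_percent = 100.0 * usage.free / usage.total

    sizes = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            full_path = os.path.join(root, name)
            try:
                sizes.append((os.path.getsize(full_path), full_path))
            except OSError:
                # File rotated away or unreadable; skip it.
                continue
    sizes.sort(reverse=True)
    return free_percent, sizes[:top]


if __name__ == "__main__":
    free, largest = disk_report()
    print(f"free space: {free:.1f}%")
    for size, name in largest:
        print(f"{size / 1e6:10.1f} MB  {name}")
```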
Failure loading config mgmt-self-monitoring-tasks-default.json: java.lang.IllegalArgumentException: Unparseable JSON body: java.lang.IllegalStateException: Expected BEGIN_OBJECT but was STRING at line 1 column 1 path $
This type of alert indicates a corruption of the self monitoring state. Syslog may show abrupt reboots, such as starting runtime and abruptly running the reboot cycle.
The cause of this issue may be one of two things:
- The Pi is not powered properly. In this case, verify that the wiring is undamaged and routed properly before diagnosing further.
- The SD card is not seated properly or is otherwise corrupt. In this case, verify that the SD card is seated properly before diagnosing the SD card further.
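The "Expected BEGIN_OBJECT but was STRING" message above means the parser expected the file to start with a JSON object and found something else. As a rough offline check on a copy of the config file, the sketch below tests whether some text parses to a JSON object. It uses Python's `json` module rather than the Java parser the runtime reports, so the error wording will differ, and the function name `check_config_text` is hypothetical.

```python
import json


def check_config_text(text):
    """Return None if text parses to a JSON object, else a short reason."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError as err:
        return f"unparseable JSON: {err.msg} at line {err.lineno} column {err.colno}"
    if not isinstance(value, dict):
        # Valid JSON, but not an object (e.g. a bare string or number).
        return f"expected a JSON object but found {type(value).__name__}"
    return None
```

A result of `None` only shows that the file is syntactically a JSON object. It does not confirm that the contents are what the runtime expects.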
Large delta (1800 s) since last sample, self monitoring possibly stuck
Some false positives are possible with this alert. Its general purpose is to prompt careful examination of the robot software runtime state. Some possible causes:
- The robot might have been disconnected from the internet / cell network and just synchronized time. The alert can be ignored after the operator verifies the robot state.
- The CAP runtime is no longer updating state in critical services such as the self monitoring task service. A software team member should check the statistics (/stats) JSON of the self-mon service and see whether the maintenance stat version is incrementing. If not, they need to check the maintenance stat counter at other key services, such as pose-estimators/default and navigators/default, and determine whether periodic updates happened around the same time. The runtime logs should be captured along with syslog and kernel logs and added to an issue tracking "stuck" runtime occurrences.
- The host-specific (override) JSON file is corrupt. SSH into the robot, then:
  - delete file: dCentralizedSystems/cap-config/config/mgmt-self-monitoring-task-default..json
  - restart CAP runtime
The system log has grown and might take up the remaining free space on the SD card. SSH into the reporting device and inspect the log. Delete it if necessary.
A slow update rate occurs when a plan has too many items in it or its paths are too long. This means that the topological planner is taking too long to localize. To verify that the topological planner is the issue, search the logs for java.util.concurrent.CancellationException: queue limit exceeded (operationQueue for update on /navigation/topological-path-planners/default)]
- To ensure smoother planning on the robot's end, try to keep the total path length under 20 kilometers. All paths in use are split into 1 m samples in both directions (i.e. a 20 meter path has 40 total samples). Once paths reach kilometers, the number of samples adds up very quickly.
- To alleviate the issue in larger plans, split the large plan into multiple smaller plans.
- Follow Plan Design: Best Practices to prevent this issue.
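To make the sample arithmetic above concrete, here is a small sketch. It assumes the sample count is simply path length divided by the 1 m spacing, doubled for the two directions, which matches the "20 meter path has 40 total samples" example. The planner's exact sampling (for instance at path endpoints) is not documented here, and the function names are hypothetical.

```python
SAMPLE_SPACING_M = 1.0    # from the text: paths are split into 1 m samples
DIRECTIONS = 2            # from the text: each path is sampled in both directions
MAX_TOTAL_PATH_KM = 20.0  # guideline from the text


def total_samples(path_lengths_m):
    """Estimate the total sample count for a plan, given path lengths in meters."""
    return sum(int(length / SAMPLE_SPACING_M) * DIRECTIONS for length in path_lengths_m)


def within_guideline(path_lengths_m):
    """True if the plan's total path length is under the 20 km guideline."""
    return sum(path_lengths_m) / 1000.0 < MAX_TOTAL_PATH_KM


print(total_samples([20]))      # a single 20 m path
print(total_samples([20_000]))  # a plan at the 20 km guideline
```

Under these assumptions, a plan at the 20 km guideline already carries 40,000 samples, which is why splitting a large plan into smaller ones helps.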
Diagnose MC issues using the Motor Controller Troubleshooting Tree
AXLE_AUX_MOTOR HW fault: R: M:0 T:0 P:0 V:0 T1:076.70C T2:063.70C VB:52.33V M+:10.90V M-:10.55V IF:-000.68A IR:-000.49A W:1500 MF:1 FS:4719104(1544) ES:0 I:0
This email is sent when a motor controller (AUX1, LEFT_MOTOR, or RIGHT_MOTOR) experiences a fault and includes the associated "R: 0" report string. This is usually due to over-current on the motor driver and the fault should be auto-cleared by CAP.
If the alert is recurring, review the Mowing Deck Stalled Motor Guide.
If the alert will not clear, attempt a runtime restart. If this fails, SSH into the robot and verify that the MC is not stuck in DFU mode. If it is not and the alert persists, a physical inspection is needed, which typically results in the MC being swapped.
LEFT_MOTOR HW fault: R: M:0 T:0 P:0 V:0 T1:026.75C T2:026.40C VB:56.16V M+:00.00V M-:28.05V IF:-000.29A IR:-000.19A W:900 MF:1 FS:4718608(1552) ES:0 I:1
This variant of the hardware fault alert contains "I:1" in the report. It is raised when current is detected while the robot is supposed to be idle. See details here: LCR Drivetrain Troubleshooting: Motor Controller Faults.
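To tell the two hardware fault variants apart in bulk, the report string can be split into its `KEY:value` tokens. The sketch below assumes the report always follows an `R:` token and that fields are whitespace-separated, as in the sample alerts above. It does not interpret any field other than `I`, and the function names are hypothetical.

```python
def parse_mc_report(line):
    """Split a motor controller report into a dict of raw string fields.

    Everything before the "R:" token (motor name, "HW fault:") is ignored.
    Values are kept as strings, including unit suffixes such as "C" or "V".
    """
    tokens = line.split()
    if "R:" not in tokens:
        return {}
    fields = {}
    for token in tokens[tokens.index("R:") + 1:]:
        key, sep, value = token.partition(":")
        if sep:
            fields[key] = value
    return fields


def current_while_idle(line):
    """True if the report carries the "I:1" flag described above."""
    return parse_mc_report(line).get("I") == "1"
```

For the LEFT_MOTOR sample above, `current_while_idle` returns `True`. For the AXLE_AUX_MOTOR sample, which ends in `I:0`, it returns `False`.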
Over current detected on AXLE_AUX_MOTOR: 200.945750 amps over 60 seconds
The motor is likely stalling due to a clog or a failed motor. Review the Mowing Deck Stalled Motor Guide
LEFT|RIGHT|AXLE_AUX_MOTOR: Motor disconnected (check motor disconnect switch, wiring, motor brushes)
This notification is sent when there is voltage across the motor controller (AUX, LEFT_MOTOR, or RIGHT_MOTOR) leads but no current. This is usually caused by an unplugged motor, the motor disconnect switch being turned off, or broken wiring inside the motor. For the AUX motor, it can also be triggered by an overheating motor. As temperature goes up, resistance goes up and current goes to 0. The warning can also be caused by a dirty or clogged motor.
For a LEFT or RIGHT MOTOR disconnect, review the Drivetrain Disconnected Motor Guide
For an AXLE_AUX_MOTOR disconnect, review the Mowing Deck Disconnected Motor Guide
It can also be caused by the Pi receiving undervoltage. Details here: LCR Pi Troubleshooting
(1.000000) LEFT_MOTOR: No response for 5 seconds, toggling serial connection
This email is sent when the robot stops receiving the "R: 0" report strings from a motor controller (AUX1, LEFT_MOTOR, or RIGHT_MOTOR). The serial connection is automatically closed and re-opened after this alert is sent, so it should recover on its own. Note that this can be a symptom of a Realsense Camera causing serious USB faults. See Camera Troubleshooting for details and more detailed diagnostics.
If reported on a drive motor, disable the AUX motor and drive the robot a bit. If the issue still occurs, there is likely a short in the AUX motor (leads to motor frame, damaged brush ring).
LEFT_MOTOR: device address opened previously but no longer exists, check for USB disconnects or chassis shorts
Could be caused by:
- A motor controller being physically unplugged during repairs or a flaky USB cable
- A motor controller reset that is done every ~50 days (will be designed out in the future)
- Electronics instability or chassis short.
High temperature sensed on LEFT_MOTOR: 696.250000 C
- The high-temperature alert is only sent at very high temperatures, and firmware should stop the MC before it gets hot enough to trigger alerts
- This alert likely indicates a broken temperature sensor
- In rare cases this could be due to a serious problem such as a fire in the electronics tray
Device detected but unresponsive to version request on startup, not starting processing loop
- The pico serial device is detected by the kernel but not responding to serial commands
- This can be caused by firmware getting hung up, software bugs, or mismatched versions
- Power cycling the device, or reinstalling the firmware should fix it. If not, contact the engineering team
Max failed initializations reached, firmware may require power cycle
- The pico serial device is detected by the kernel and semi-responsive to serial commands but initialization failed too many times.
- This can be caused by firmware getting hung up, software bugs, or mismatched versions
- Power cycling the device, or reinstalling the firmware should fix it. If not, contact the engineering team
Out of date firmware version detected (VER: DM_PICO_##_##_##)
- The pico serial device is detected by the kernel and responsive
- An older version of firmware was detected. The CAP runtime will restart and auto-update the firmware.
Device address opened previously but no longer exists, check for USB disconnects, electrical noise from sense cables near high-current power cables, or chassis shorts
- Caused when a Pico device's USB cable is unplugged, the USB cable is damaged or has a bad connection
- Could also be an electrical issue, see: https://github.com/directedmachines/customer-support/wiki/Electronics-and-Chassis-Short-Troubleshooting
Frame short detected.
Check for damaged USB peripherals, motors, antennas, or wiring.
Out of range output detected (-72.79 amps). Inspect microcontroller, sensor, and sensor wiring.
Unexpected ADC reference voltage (measured: 0.89, expected: 1.65). Inspect Pico Tray 004260 and sensor wiring.
- NOTE: robots with an unexpected ADC reference will not report charge current (the Pico Tray assumes the sensor reading is inaccurate).
- See Pico Tray: ADC Reference Voltage Alert
IMU Fault Status Flag nonzero (<FAULT_CODE>)
- Details of the parsed fault code will be included in the email
- See Inertial Measurement Unit - Pico Brain troubleshooting: https://github.com/directedmachines/customer-support/wiki/LCR-Sensor-Troubleshooting#pico-brain
Restarting (100.000000) capture process for left, counter now: 2626, last sample at 0, process live: true
Note that this can be a symptom of a Realsense Camera causing serious USB faults. See Camera Troubleshooting for details and more detailed diagnostics.
Capture process returned nonzero exit code (<EXIT_CODE>)
This alert describes a camera-capture issue. It could be caused by an unplugged camera, corrosion, or a short to chassis.
See:
LCR Realsense Camera Troubleshooting
LCR USB Side Camera Troubleshooting
GPS Antenna Fault(s)
This email alert indicates issues with the GPS antenna. The list of fault messages should give details (e.g. ANT1:GPS antenna max gain threshold exceeded). Antenna issues are usually caused by one of the following:
- Missing GPS antenna (SMA disconnected or wire severed)
- GPS antenna has fallen and has bad signal
- GPS antenna is water damaged
Example below:
GPS Antenna Fault(s):
ANT1:GPS antenna max gain threshold exceeded
Antenna Status Message
UBXMonRFData:
version: 0
nBlocks: 2
Block[0]: RFBlock { blockId=0, flags=2, antStatus=2, antPower=1, postStatus=0x00000000, noisePerMS=69, agcCnt=7020, cwSuppression=11, ofsI=8, magI=234, ofsQ=14, magQ=214, rfBand=0 }
Block[1]: RFBlock { blockId=1, flags=0, antStatus=2, antPower=1, postStatus=0x00000000, noisePerMS=48, agcCnt=7020, cwSuppression=8, ofsI=8, magI=231, ofsQ=-3, magQ=205, rfBand=0 }
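When comparing antenna status across several alerts, the `RFBlock` lines can be parsed into dictionaries. This is a minimal sketch that assumes the layout shown in the example above (`Block[n]: RFBlock { key=value, ... }`). Values are kept as raw strings, and their meaning should be looked up in the GPS receiver's interface documentation rather than inferred from this code. The function name `parse_rf_blocks` is hypothetical.

```python
import re

# One match per "Block[n]: RFBlock { ... }" entry in the alert body.
RF_BLOCK = re.compile(r"Block\[(\d+)\]: RFBlock \{ (.*?) \}")


def parse_rf_blocks(text):
    """Return {block_index: {field_name: raw_value}} for each RFBlock in text."""
    blocks = {}
    for index, body in RF_BLOCK.findall(text):
        # Fields are separated by ", " and each one is "name=value".
        fields = dict(pair.split("=", 1) for pair in body.split(", "))
        blocks[int(index)] = fields
    return blocks
```

For the example above, `parse_rf_blocks(alert_text)[0]["antStatus"]` is the string `"2"`.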
Failure initializing GPS device:java.lang.IllegalStateException: Device query timeout
The GPS can be reset by following the GPS Accuracy Troubleshooting guide.