LCR Fleet Alerts - directedmachines/customer-support GitHub Wiki

Overview

Email alerts are sent from: "[email protected]" with the subject "message from LCR24ZS0-xxxxxxxxxxxxxx"

General Alerts

Active

This alert is sent when the robot is active. It may also be sent when a user connects to the robot UI over Wi-Fi or LTE.

Charge Level

The following charge-level alerts are sent when different thresholds are reached. If the charge level is dropping while the robot is plugged in or out in the sun, diagnose whether there is a charging issue or whether something is draining the battery (e.g. a device plugged into the inverter). In low-power mode the robot can still drive, but it will be slower and AUX will be disabled.

  • Charge level at 100%
  • Voltage is 47.710000, estimated charge is 49.965795
  • Voltage is 53.470000, estimated charge is 19.873410, setting power state to [LOW_POWER]
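When triaging many of these emails, it can help to extract the numbers from the alert body. The following is a minimal sketch, assuming the message format matches the samples above; `parse_charge_alert` is a hypothetical helper name and not part of the robot software.

```python
import re

# Pattern inferred from the sample alert bodies above (an assumption);
# real messages may differ in wording or precision.
CHARGE_RE = re.compile(
    r"Voltage is (?P<voltage>\d+(?:\.\d+)?), "
    r"estimated charge is (?P<charge>\d+(?:\.\d+)?)"
    r"(?:, setting power state to \[(?P<state>\w+)\])?"
)


def parse_charge_alert(body):
    """Return voltage, charge and optional power state, or None if no match."""
    match = CHARGE_RE.search(body)
    if match is None:
        return None
    return {
        "voltage": float(match.group("voltage")),
        "charge": float(match.group("charge")),
        # None when the message does not report a power-state change.
        "power_state": match.group("state"),
    }
```

Messages that do not carry a voltage reading, such as "Charge level at 100%", return `None` and need separate handling.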

Battery Voltage Not Detected

The self-monitoring service could not sense battery voltage, likely because all motor controllers or sensors that sense it are not enumerated due to hardware faults. Likely causes are:

  • The USB hub for the motor controllers is faulty due to a damaged USB peripheral or a frame short.
  • All 3 motor controllers report disconnected or just disappear from the device list (possibly due to a short). See the self-monitoring fault troubleshooting if a self-monitoring fault is reported. Note that this alert can also be a method to diagnose a short for TC1.1.
  • The electronics went through a deep power cycle because the battery ran out of charge, and motor controllers from a specific bad batch are in DFU mode. See the Motor Controller Disconnected section.

Code Branch Mismatch

Code branch mismatch: GIT branch set to development, service state (JSON configuration) set to production

No action should be taken unless this message repeats periodically due to a restart loop.

When the code branch is mismatched and the runtime did not self-PATCH correctly, the UI will periodically restart and rebuild to try to sync the latest code. To fix this error, either SSH into the robot and switch the git branch to the correct branch, or patch the "sourceBranch" for "cap" in the self-monitoring-tasks UI: http://<ROBOT_IP>:8000/dashboard/services/ui/?path=/mgmt/self-monitoring-tasks/default

If this happened suddenly without explanation, it could be caused by data corruption on the SD card. When corruption is detected, the configuration files are reloaded and changes may be lost.

Low Disk Space Detected

Low disk space can lead to unsafe, unpredictable behavior. SSH into the device, determine the remaining disk space, and check log file sizes.
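Once connected over SSH, a check like this can be scripted. The sketch below uses only the Python standard library; the `/var/log` default and the `disk_report` name are assumptions for illustration, so adjust the directory to wherever logs are stored on the device.

```python
import os
import shutil


def disk_report(path="/", log_dir="/var/log", top_n=5):
    """Return (percent_free, largest_files) for a filesystem and a log directory.

    largest_files is a list of (size_in_bytes, file_path), biggest first.
    """
    usage = shutil.disk_usage(path)
    percent_free = 100.0 * usage.free / usage.total

    sizes = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            full_path = os.path.join(root, name)
            try:
                sizes.append((os.path.getsize(full_path), full_path))
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    sizes.sort(reverse=True)
    return percent_free, sizes[:top_n]


if __name__ == "__main__":
    free, largest = disk_report()
    print(f"{free:.1f}% free")
    for size, file_path in largest:
        print(f"{size / 1e6:10.1f} MB  {file_path}")
```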

Failure loading config mgmt-self-monitoring-tasks-default.json

Failure loading config mgmt-self-monitoring-tasks-default.json: java.lang.IllegalArgumentException: Unparseable JSON body: java.lang.IllegalStateException: Expected BEGIN_OBJECT but was STRING at line 1 column 1 path $

This type of alert indicates a corruption of the self monitoring state. Syslog may show abrupt reboots, such as starting runtime and abruptly running the reboot cycle.

The cause of this issue is usually one of two things:

  1. The Pi is not powered properly. In this case, verify that the wiring is undamaged and routed properly before diagnosing further.
  2. The SD card is not seated properly or is otherwise corrupt. In this case, verify that the SD card is seated properly before diagnosing the SD card further.
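The exception in this alert says the parser expected a JSON object but found something else at the start of the file. As a rough way to confirm that a copied config file is damaged, the sketch below checks whether its text parses as JSON with an object at the top level. `check_config_text` is a hypothetical helper, and Python's parser may not reject exactly the same inputs as the parser used by the runtime.

```python
import json


def check_config_text(text):
    """Return (ok, reason) describing whether text is a JSON object."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return False, (
            f"not valid JSON: {exc.msg} at line {exc.lineno} column {exc.colno}"
        )
    if not isinstance(data, dict):
        # Valid JSON, but not an object (e.g. a bare string or number).
        return False, (
            f"top-level value is {type(data).__name__}, expected an object"
        )
    return True, "ok"
```

For example, `check_config_text(open(path, errors="replace").read())` can be run against a copy of the file pulled from the SD card.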

Self Monitoring Possibly Stuck

Large delta (1800 s) since last sample, self monitoring possibly stuck

Some false positives are possible with this alert; its general purpose is to prompt a careful examination of the robot software runtime state. Some possible causes:

  • The robot might have been disconnected from the internet / cell network and just synchronized its time. The alert can be ignored after an operator verifies the robot state.
  • The CAP runtime is no longer updating state in critical services such as the self-monitoring task service. A software team member should check the statistics (/stats) JSON of the self-mon service and see whether the maintenance stat version is incrementing. If it is not, they need to check the maintenance stat counter at other key services, such as pose-estimators/default and navigators/default, and determine whether periodic updates happened around the same time. The runtime logs should be captured along with the syslog and kernel logs and added to an issue tracking "stuck" runtime occurrences.
  • The host-specific (override) JSON file is corrupt. SSH into the robot, delete the file dCentralizedSystems/cap-config/config/mgmt-self-monitoring-task-default..json, and restart the CAP runtime.
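The "is the counter incrementing" check for the CAP runtime cause can be expressed as a comparison of two readings taken a short interval apart. This is only a sketch: the counter values are assumed to have been read by hand from each service's /stats JSON, and `stalled_services` is a hypothetical helper, not an existing tool.

```python
def stalled_services(first, second):
    """Return the services whose counter did not advance between readings.

    first and second map a service path to the maintenance counter value
    read from that service's /stats JSON, with second taken later.
    A service missing from second is treated as stalled.
    """
    return sorted(
        name
        for name, before in first.items()
        if second.get(name, before) <= before
    )


# Illustrative values only, not real readings.
earlier = {"self-monitoring": 120, "pose-estimators/default": 88}
later = {"self-monitoring": 120, "pose-estimators/default": 93}
print(stalled_services(earlier, later))
```

If several services stall together, that points at the runtime as a whole rather than a single service.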

System Log Size

The system log has grown and might take up the remaining free space on the SD card. SSH into the reporting device and inspect the log; delete it if necessary.

Slow Update Rate

A slow update rate occurs when a plan has too many items in it or its paths are too long. This means that the topological planner is taking too long to localize. To verify that the topological planner is the issue, search the logs for: java.util.concurrent.CancellationException: queue limit exceeded (operationQueue for update on /navigation/topological-path-planners/default)

  • To ensure smoother planning on the robot's end, try to keep the total path length under 20 kilometers. All paths that are used are split into 1 m samples in both directions (i.e. a 20 meter path has 40 total samples). Once paths reach kilometers, the number of samples adds up very quickly.
  • To alleviate the issue in larger plans, split the large plan into multiple smaller plans.
  • Follow Plan Design: Best Practices to prevent this issue.
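The sample arithmetic from the first bullet can be worked through in a few lines. This sketch takes the 1 m spacing, the two directions, and the 20 km guideline from the text above; the function names and the rounding of partial meters are assumptions, since the planner's exact handling is not documented here.

```python
SAMPLE_SPACING_M = 1.0  # paths are split into 1 m samples
DIRECTIONS = 2          # each path is sampled in both directions
MAX_TOTAL_KM = 20.0     # suggested ceiling on total path length


def total_samples(path_lengths_m):
    """Estimate the total sample count for a plan's paths (lengths in meters)."""
    return sum(
        int(round(length / SAMPLE_SPACING_M)) * DIRECTIONS
        for length in path_lengths_m
    )


def within_guideline(path_lengths_m):
    """True when the plan's total path length is under the 20 km guideline."""
    return sum(path_lengths_m) / 1000.0 < MAX_TOTAL_KM


print(total_samples([20]))              # the 20 m example from the text
print(total_samples([20_000]))          # a plan at the 20 km ceiling
print(within_guideline([15_000, 6_000]))
```

A plan that fails `within_guideline` is a candidate for splitting into smaller plans.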

Hardware Issues

300A Motor Controller (300AMC)

Diagnose MC issues using the Motor Controller Troubleshooting Tree

Hardware/Firmware Fault

AXLE_AUX_MOTOR HW fault: R: M:0 T:0 P:0 V:0 T1:076.70C T2:063.70C VB:52.33V M+:10.90V M-:10.55V IF:-000.68A IR:-000.49A W:1500 MF:1 FS:4719104(1544) ES:0 I:0

This email is sent when a motor controller (AUX1, LEFT_MOTOR, or RIGHT_MOTOR) experiences a fault and includes the associated "R: 0" report string. This is usually due to over-current on the motor driver and the fault should be auto-cleared by CAP.

If the alert is recurring, review the Mowing Deck Stalled Motor Guide.

If the alert will not clear, attempt a runtime restart. If this fails, SSH into the robot and verify the MC is not stuck in DFU mode. If the MC is not in DFU mode and the alert persists, a physical inspection is needed, which typically results in the MC being swapped.

Over Current while Idle

LEFT_MOTOR HW fault: R: M:0 T:0 P:0 V:0 T1:026.75C T2:026.40C VB:56.16V M+:00.00V M-:28.05V IF:-000.29A IR:-000.19A W:900 MF:1 FS:4718608(1552) ES:0 I:1

This is a hardware fault alert whose report string contains "I:1". It is raised when current is detected while the robot is supposed to be idle. See details here: LCR Drivetrain Troubleshooting: Motor Controller Faults.
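To separate ordinary hardware faults from the idle over-current case, the report string can be split into its `KEY:value` tokens and the `I` field inspected. The sketch below assumes the layout of the two sample alerts above; the helper names are hypothetical, and the values are kept as raw strings because the meaning of most fields is not described on this page.

```python
import re

# Layout inferred from the sample alerts: "<MOTOR> HW fault: R: K:v K:v ..."
HEADER_RE = re.compile(r"^(?P<motor>\w+) HW fault: R:\s*(?P<fields>.*)$")


def parse_hw_fault(line):
    """Split a HW fault alert into the motor name and a dict of raw fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    fields = {}
    for token in match.group("fields").split():
        key, sep, value = token.partition(":")
        if sep:
            fields[key] = value
    return {"motor": match.group("motor"), "fields": fields}


def is_idle_over_current(parsed):
    """True when the report carries I:1, the over-current-while-idle marker."""
    return parsed is not None and parsed["fields"].get("I") == "1"
```

Running `is_idle_over_current(parse_hw_fault(alert_text))` on each incoming alert gives a quick way to route it to the right troubleshooting guide.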

Over Current Detected

Over current detected on AXLE_AUX_MOTOR: 200.945750 amps over 60 seconds

The motor is likely stalling due to a clog or a failed motor. Review the Mowing Deck Stalled Motor Guide

Motor Disconnected

LEFT|RIGHT|AXLE_AUX_MOTOR: Motor disconnected (check motor disconnect switch, wiring, motor brushes)

This notification is sent when there is voltage across the motor controller (AUX, LEFT_MOTOR, or RIGHT_MOTOR) leads but no current. This is usually caused by an unplugged motor, the motor disconnect switch being turned off, or broken wiring inside the motor. For the AUX motor, it can also be triggered by an overheating motor. As temperature goes up, resistance goes up and current goes to 0. The warning can also be caused by a dirty or clogged motor.

For a LEFT or RIGHT MOTOR disconnect, review the Drivetrain Disconnected Motor Guide

For an AXLE_AUX_MOTOR disconnect, review the Mowing Deck Disconnected Motor Guide

It can also be caused by the Pi receiving undervoltage. Details here: LCR Pi Troubleshooting

Toggling Serial Connection

(1.000000) LEFT_MOTOR: No response for 5 seconds, toggling serial connection

This email is sent when the robot stops receiving the "R: 0" report strings from a motor controller (AUX1, LEFT_MOTOR, or RIGHT_MOTOR). The serial connection is automatically closed and re-opened after this alert is sent, so it should recover on its own. Note that this can be a symptom of a Realsense Camera causing serious USB faults. See Camera Troubleshooting for details and more detailed diagnostics.

If reported on a drive motor, disable the AUX motor and drive the robot a bit. If the issue still occurs, there is likely a short in the AUX motor (leads to motor frame, damaged brush ring).

Device No Longer Exists

LEFT_MOTOR: device address opened previously but no longer exists, check for USB disconnects or chassis shorts

Could be caused by:

  1. A motor controller being physically unplugged during repairs or a flaky USB cable
  2. A motor controller reset that is done every ~50 days (will be designed out in the future)
  3. Electronics instability or chassis short.

High Temperature

High temperature sensed on LEFT_MOTOR: 696.250000 C

  • The high-temperature alert is only sent at very high temperatures, and firmware should stop the MC before it gets hot enough to trigger alerts
  • This alert likely indicates a broken temperature sensor
  • In rare cases this could be due to a serious problem such as a fire in the electronics tray

Pico Devices

Unresponsive Pico device

Device detected but unresponsive to version request on startup, not starting processing loop

  • The pico serial device is detected by the kernel but not responding to serial commands
  • This can be caused by firmware getting hung up, software bugs, or mismatched versions
  • Power cycling the device or reinstalling the firmware should fix it. If not, contact the engineering team.

Max failed Pico initializations

Max failed initializations reached, firmware may require power cycle

  • The pico serial device is detected by the kernel and semi-responsive to serial commands but initialization failed too many times.
  • This can be caused by firmware getting hung up, software bugs, or mismatched versions
  • Power cycling the device or reinstalling the firmware should fix it. If not, contact the engineering team.

Out of date Pico firmware version

Out of date firmware version detected (VER: DM_PICO_##_##_##)

  • The pico serial device is detected by the kernel and responsive
  • An older firmware version was detected. The CAP runtime will restart and auto-update the firmware.

Pico device no longer exists

Device address opened previously but no longer exists, check for USB disconnects, electrical noise from sense cables near high-current power cables, or chassis shorts

Pico Tray 004260

Frame Short Detected

Frame short detected.
Check for damaged USB peripherals, motors, antennas, or wiring.

Out of range output detected

Out of range output detected (-72.79 amps). Inspect microcontroller, sensor, and sensor wiring.

Unexpected ADC reference voltage

Unexpected ADC reference voltage (measured: 0.89, expected: 1.65). Inspect Pico Tray 004260 and sensor wiring.

Pico Brain 004216

IMU Fault

IMU Fault Status Flag nonzero (<FAULT_CODE>)

Realsense and Generic Cameras

Capture Process Restart

Restarting (100.000000) capture process for left, counter now: 2626, last sample at 0, process live: true

Note that this can be a symptom of a Realsense Camera causing serious USB faults. See Camera Troubleshooting for details and more detailed diagnostics.

Nonzero Exit Code

Capture process returned nonzero exit code (<EXIT_CODE>)

This alert describes a camera-capture issue, which could be caused by an unplugged camera, corrosion, or a short to chassis. See:
LCR Realsense Camera Troubleshooting
LCR USB Side Camera Troubleshooting

GPS Antenna Fault

The "GPS Antenna Fault(s)" email alert indicates issues with the GPS antenna. The list of fault messages should give details (e.g. ANT1:GPS antenna max gain threshold exceeded). Antenna issues are usually caused by one of the following:

  1. Missing GPS antenna (SMA disconnected or wire severed)
  2. GPS antenna has fallen and has bad signal
  3. GPS antenna is water damaged

Example below:

GPS Antenna Fault(s):
ANT1:GPS antenna max gain threshold exceeded
Antenna Status Message
UBXMonRFData:
version: 0
nBlocks: 2
Block[0]: RFBlock { blockId=0, flags=2, antStatus=2, antPower=1, postStatus=0x00000000, noisePerMS=69, agcCnt=7020, cwSuppression=11, ofsI=8, magI=234, ofsQ=14, magQ=214, rfBand=0 }
Block[1]: RFBlock { blockId=1, flags=0, antStatus=2, antPower=1, postStatus=0x00000000, noisePerMS=48, agcCnt=7020, cwSuppression=8, ofsI=8, magI=231, ofsQ=-3, magQ=205, rfBand=0 }
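To compare the per-block values across several alerts, the `RFBlock { ... }` lines can be parsed into dictionaries. This is a minimal sketch based on the layout of the example above. The helper names are hypothetical, and the `ANT_STATUS` labels assume these fields mirror the u-blox UBX-MON-RF message, which this page does not confirm.

```python
import re

BLOCK_RE = re.compile(r"Block\[(\d+)\]: RFBlock \{\s*(.*?)\s*\}")

# Assumed mapping, based on the antStatus field of u-blox UBX-MON-RF.
ANT_STATUS = {0: "INIT", 1: "DONTKNOW", 2: "OK", 3: "SHORT", 4: "OPEN"}


def _to_int(text):
    """Convert a decimal or 0x-prefixed hexadecimal field value to int."""
    text = text.strip()
    return int(text, 16) if text.lower().startswith("0x") else int(text)


def parse_rf_blocks(message):
    """Return one dict of integer fields per RFBlock line in the alert body."""
    blocks = []
    for _index, body in BLOCK_RE.findall(message):
        fields = {}
        for pair in body.split(","):
            key, sep, value = pair.partition("=")
            if sep:
                fields[key.strip()] = _to_int(value)
        blocks.append(fields)
    return blocks


def antenna_state(block):
    """Translate antStatus into a label, or UNKNOWN if unmapped."""
    return ANT_STATUS.get(block.get("antStatus"), "UNKNOWN")
```

Logging fields such as `agcCnt` and `noisePerMS` for each alert makes it easier to see whether a robot's readings are drifting over time.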

Failure initializing GPS

Failure initializing GPS device:java.lang.IllegalStateException: Device query timeout

The GPS can be reset by following the GPS Accuracy Troubleshooting.