Perception

ArUco Markers

Overview

ArUco markers (a type of fiducial marker) are special patterns that encode numbers. The course will often have these placed around so we can orient our rover and execute a certain task.


High Level Detection Method

OpenCV is used to run detection on the camera stream. This gives us information about where the tag is in pixel space, specifically its four corners. We can then fuse this with point cloud data, which gives us the xyz position for any given pixel relative to the camera. Specifically, we query the pointcloud at the center of the marker and thus find its transform relative to the rover.
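
As a rough sketch of this step (not the actual mrover code; the dictionary choice and helper names are placeholders), the detection plus point cloud lookup could look something like this using OpenCV's aruco module and an organized sensor_msgs/PointCloud2:

```cpp
#include <cmath>
#include <optional>
#include <vector>

#include <opencv2/aruco.hpp>
#include <opencv2/core.hpp>
#include <sensor_msgs/PointCloud2.h>
#include <sensor_msgs/point_cloud2_iterator.h>

struct TagDetection {
    int id{};
    cv::Point2f center;              // pixel-space center of the marker
    std::optional<cv::Point3f> xyz;  // camera-space position, if the depth reading was valid
};

std::vector<TagDetection> detectTags(cv::Mat const& image, sensor_msgs::PointCloud2 const& cloud) {
    // Dictionary choice is an assumption; the real detector's settings may differ
    auto dictionary = cv::aruco::getPredefinedDictionary(cv::aruco::DICT_4X4_50);
    std::vector<int> ids;
    std::vector<std::vector<cv::Point2f>> corners;
    cv::aruco::detectMarkers(image, dictionary, corners, ids);

    std::vector<TagDetection> detections;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        // The marker center is the average of its four corner vertices
        cv::Point2f center{0.f, 0.f};
        for (cv::Point2f const& corner : corners[i]) center += corner * 0.25f;

        TagDetection detection{ids[i], center, std::nullopt};

        // The ZED point cloud is organized (width x height), so a pixel maps directly to a point
        int index = static_cast<int>(center.y) * static_cast<int>(cloud.width) + static_cast<int>(center.x);
        sensor_msgs::PointCloud2ConstIterator<float> xIt(cloud, "x"), yIt(cloud, "y"), zIt(cloud, "z");
        xIt += index, yIt += index, zIt += index;
        if (std::isfinite(*xIt) && std::isfinite(*yIt) && std::isfinite(*zIt))
            detection.xyz = cv::Point3f{*xIt, *yIt, *zIt};

        detections.push_back(detection);
    }
    return detections;
}
```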


We then publish the tags to the tf tree.
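
Publishing to the TF tree amounts to broadcasting a transform between a parent frame and a per-tag child frame. A minimal sketch with tf2_ros, assuming illustrative frame names (the real node's names may differ):

```cpp
#include <geometry_msgs/TransformStamped.h>
#include <ros/ros.h>
#include <tf2_ros/transform_broadcaster.h>

void publishTag(tf2_ros::TransformBroadcaster& broadcaster, int tagId, float x, float y, float z) {
    geometry_msgs::TransformStamped tf;
    tf.header.stamp = ros::Time::now();
    tf.header.frame_id = "zed2i_left_camera_frame";                   // parent frame (name is an assumption)
    tf.child_frame_id = "immediateFiducial" + std::to_string(tagId);  // child frame (name is an assumption)
    tf.transform.translation.x = x;
    tf.transform.translation.y = y;
    tf.transform.translation.z = z;
    tf.transform.rotation.w = 1.0;  // identity rotation; we only know the tag's position
    broadcaster.sendTransform(tf);
}
```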

Details

Update Loop:

  1. Detect the IDs and vertices in pixel space of ArUco tags from the current camera frame.
  2. Add any new tags to the "immediate" map or update existing ones. We calculate the center here by finding the average of the four vertices. If we also have a point cloud reading for this tag, publish it to the TF tree as an immediate tag relative to the rover. These readings are filled in by another callback.
  3. Decrement the hit counter of any tags that were not seen this frame. If it reaches zero, remove them entirely from the immediate map (see the hit-counter sketch after this list).
  4. Publish all tags that have been seen enough times to the TF tree. Importantly, this time they will be relative to the map frame, not the rover.
  5. Draw the detected markers onto an image and then publish it.
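
A minimal sketch of the hit-counter bookkeeping from steps 2-4; the thresholds, struct, and map below are placeholders rather than the real implementation:

```cpp
#include <algorithm>
#include <unordered_map>

struct ImmediateTag {
    int hitCount = 0;
    // ... pose, last-seen corners, etc.
};

constexpr int MAX_HIT_COUNT = 5;      // cap so a long-seen tag does not linger forever
constexpr int PUBLISH_THRESHOLD = 3;  // seen in enough frames to trust

std::unordered_map<int, ImmediateTag> immediateTags;

void updateTag(int id, bool seenThisFrame) {
    auto& tag = immediateTags[id];  // inserts a fresh entry for new tags
    if (seenThisFrame) {
        tag.hitCount = std::min(tag.hitCount + 1, MAX_HIT_COUNT);
    } else if (--tag.hitCount <= 0) {
        immediateTags.erase(id);  // not seen for a while: drop it from the immediate map
        return;
    }
    if (tag.hitCount >= PUBLISH_THRESHOLD) {
        // publish to the TF tree relative to the map frame
    }
}
```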

Object Detection

Overview

As part of URC the rover must be able to identify two objects: a hammer and a water bottle. These objects will be placed in close proximity to two GNSS coordinates, and the rover has to identify, locate, and drive to them.

Objects

High Level Detection Method

At a high level, the detection algorithm uses a custom ML model (trained using Roboflow) to identify hammers and water bottles. This model is loaded and executed on the GPU using NVIDIA's TensorRT framework. TensorRT takes advantage of the Tensor Cores available on the Jetson's GPU to accelerate the network's forward pass. The forward pass returns a pair of coordinates in image space, and using the point cloud this location is then converted into an xyz position. Finally, the xyz position is published to the tf tree in map space, where navigation can then move towards the object.
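
For a sense of what the TensorRT side involves, here is a rough sketch of loading a pre-built engine file and running one forward pass (TensorRT 8.x C++ API). The engine path, buffer sizes, and output layout are assumptions, not the actual model's values:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, char const* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cerr << msg << '\n';
    }
};

int main() {
    // Load a serialized engine produced offline (e.g. by trtexec or the TensorRT builder)
    std::ifstream file("model.engine", std::ios::binary);
    std::vector<char> engineData((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());

    Logger logger;
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(engineData.data(), engineData.size());
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    // Allocate device buffers for the input image tensor and the output detections.
    // Sizes here are placeholders; query the engine's bindings for the real ones.
    void* bindings[2];
    cudaMalloc(&bindings[0], 3 * 640 * 640 * sizeof(float));  // input blob (CHW)
    cudaMalloc(&bindings[1], 25200 * 7 * sizeof(float));      // raw detections

    // ... cudaMemcpy the preprocessed image into bindings[0] ...
    context->executeV2(bindings);  // synchronous forward pass on the GPU
    // ... cudaMemcpy the detections back and pick the best hammer / bottle box ...

    cudaFree(bindings[0]);
    cudaFree(bindings[1]);
    delete context;
    delete engine;
    delete runtime;
    return 0;
}
```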

Example Video

https://youtu.be/1mlohZMx3wQ

Details

Update Loop:

  1. Grab an image from the ZED and convert it to the CNN input format (see the preprocessing sketch after this list).
  2. Perform the forward pass of the CNN.
  3. Locate the object in 3D space.
  4. Add the immediate object to the ZED camera frame and increment its hit counter.
  5. If the hit count is above a certain threshold, publish the object to the map frame.
  6. Decrement the hit counter if the object is not seen. If the hit count falls below a certain threshold, stop publishing the object's location.
  7. Draw the object's bounding box onto the image and then publish it.
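
Step 1 (image to CNN input) can be done in one call with OpenCV's dnn module; the 640x640 input size, scaling, and channel swap below are assumptions about the model, not confirmed values:

```cpp
#include <opencv2/core.hpp>
#include <opencv2/dnn.hpp>

cv::Mat makeCnnInput(cv::Mat const& bgrFrame) {
    // Resize to the network's input resolution, scale pixels to [0, 1],
    // swap BGR -> RGB, and reorder to NCHW in a single call.
    return cv::dnn::blobFromImage(
            bgrFrame,
            1.0 / 255.0,          // scale factor (assumed)
            cv::Size{640, 640},   // network input size (assumed)
            cv::Scalar{},         // no mean subtraction
            /*swapRB=*/true,
            /*crop=*/false);
}
```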

Visual Odometry

Visual Odometry is a method of determining where we are by tracking unique features across multiple camera frames.

We have the option of using the ZED's built-in tracking or rtabmap stereo odometry. We have found that both are high quality, but the ZED's built-in tracking runs at a higher refresh rate at the cost of being more of a black box.

Nodelet Design

In ROS, communication between nodes has to use sockets by default since they all run in separate processes. We use nodelets instead, which all run inside the same process. This way they share a virtual address space and can pass messages via pointers (zero-copy), which is ~50x faster.

One important note is that the message now needs to be thread-safe. For this reason, a new point cloud message is made every update by the point cloud publisher thread (instead of reusing the same one), as sketched below.
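
Putting both points together, a publisher nodelet might look like the sketch below: it lives in the shared process, allocates a brand-new message every cycle, and publishes it by shared pointer so intra-process subscribers receive the pointer directly. All class, topic, and frame names here are illustrative, not the actual mrover nodelet:

```cpp
#include <boost/make_shared.hpp>
#include <nodelet/nodelet.h>
#include <pluginlib/class_list_macros.h>
#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>

namespace example {

class PointCloudNodelet : public nodelet::Nodelet {
    ros::Publisher mCloudPub;
    ros::Timer mTimer;

    void onInit() override {
        ros::NodeHandle& nh = getNodeHandle();
        mCloudPub = nh.advertise<sensor_msgs::PointCloud2>("camera/points", 1);
        mTimer = nh.createTimer(ros::Duration(1.0 / 15.0), &PointCloudNodelet::update, this);
    }

    void update(ros::TimerEvent const&) {
        // A brand-new message every iteration: subscribers in this process receive
        // this exact shared_ptr (no serialization), so we must never mutate it afterwards.
        auto cloud = boost::make_shared<sensor_msgs::PointCloud2>();
        cloud->header.stamp = ros::Time::now();
        cloud->header.frame_id = "zed2i_left_camera_frame";
        // ... fill in fields, width/height, and data from the ZED grab ...
        mCloudPub.publish(cloud);
    }
};

} // namespace example

PLUGINLIB_EXPORT_CLASS(example::PointCloudNodelet, nodelet::Nodelet)
```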

Best ZED Settings

At least 720p is recommended. Anything lower will not work at long ranges. We also try to hit at least 15 Hz so information propagates fast enough to navigation.

For depth quality, we found that QUALITY is best. PERFORMANCE allows a 50 Hz .grab() loop but results in lots of bad locations for the post at long ranges.

Limit the maximum depth of the ZED 2i to around 10-14 meters (the default is 20). The ArUco detector cannot really find tags beyond this depth, so a larger value is not necessary.
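
These recommendations map onto the ZED SDK's InitParameters roughly as follows (a sketch using the suggested values, not necessarily the exact configuration running on the rover):

```cpp
#include <sl/Camera.hpp>

int main() {
    sl::InitParameters init;
    init.camera_resolution = sl::RESOLUTION::HD720;  // at least 720p for long range
    init.camera_fps = 15;                            // >= 15 Hz so navigation gets timely data
    init.depth_mode = sl::DEPTH_MODE::QUALITY;       // better long-range depth than PERFORMANCE
    init.coordinate_units = sl::UNIT::METER;
    init.depth_maximum_distance = 12.0f;             // ~10-14 m; tags are not detectable farther out

    sl::Camera zed;
    if (zed.open(init) != sl::ERROR_CODE::SUCCESS) return 1;

    sl::Mat image;
    if (zed.grab() == sl::ERROR_CODE::SUCCESS) {
        zed.retrieveImage(image, sl::VIEW::LEFT);  // left rectified image used for detection
    }
    zed.close();
    return 0;
}
```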

Testing

For off-rover testing a zed_test.launch file is provided. Here are sample usages:

roslaunch mrover zed_test.launch use_builtin_visual_odom:=true

roslaunch mrover zed_test.launch use_rtabmap_stereo_odom:=true

Other configurable options: run_rviz (useful for looking at the TF tree), run_dynamic_reconfigure (useful for configuring tag detection settings), and run_tag_detector.

Glossary

  • ArUco: A special pattern of black and white blocks that encodes a number. Often called markers/tags/targets
  • Stereo Camera: A camera that uses stereo matching to produce point clouds
  • Features: Unique patterns in an image that are persistent across frames. Corners are a good example
  • ZED 2i: The stereo camera that we use
  • OpenCV: A computer vision library
  • Point Cloud: A collection of 3D points that roughly describe a scene
  • Odometry: The "pose" of an object, in other words a description of where it is in the world (usually position and rotation)
  • Pixel Space (or Camera Space): The x and y coordinates of where a pixel is in an image