Stanford JackRabbot2 - 360° 2D to 3D Multi-Target Tracker

This post follows from a project at the Stanford Vision and Learning Lab, where I've been working for some time, and serves as documentation for our social navigation robot's onboard multi-target person-tracker I developed under the guidance of Roberto, Marynel, Hamid, Patrick and Silvio.

The above video shows the results of the tracker. The top portion of the video is the result of the 360° cylindrical image having been passed through YOLOv2 and Deep SORT. The bottom portion of the video shows the results from fusing the top image with 2x VLP-16 LIDARs to generate people tracks. The many intricacies involved in this process are described below.

Motivation

While working on research in the use of imitation learning to create motion policies for our social navigation robot, Jackrabbot2, we needed to fuse data from different sources on the robot in order to track people trajectories in real time.

Just like you and I, when we walk around other people, we wanted JR2 to be able to understand the location and trajectory of people in its environment, in order to make intelligent decisions about how to interact with and move naturally among those people.

We also wanted to be able to identify and re-identify specific people in the scene, so that JR2 is able to have context-aware interactions with specific people in a natural way.

Re-identification is a challenging problem. Re-identification should occur after a track is lost or a target moves between frames, e.g. moves between different cameras or in and out of the view of one camera. Even state of the art tracking systems have difficulty associating a previous track with a current track when the same object re-appears in a frame. To minimize the need for this difficult re-identification process, we use the 360° camera on JR2 to keep people in the same frame as much as possible, even as they move around the robot. The exception is at the seam of the 360° image, where Deep SORT still sees the track go in and out of frame.

Sensors and Hardware

The sensor suite on JackRabbot2 is similar to that of a small self driving car.

On the tower portion of the robot, from top to bottom, the main sensors are:

GPS
VLP-16 LIDAR (number 1)
ZED Stereo RGB Camera
OCCAM Omni Stereo RGB Camera, RGB 360°
VLP-16 LIDAR (number 2)

Around the base are 2x 2D SICK LIDARs -- one in the front and one in the back.

Inside the chassis is a computer with lots of cores, ram and 2x GPUs along with a plethora of communication devices as well as batteries for the computer and motion systems.

On the side you can see the KINOVA JACO arm and the whole thing is built on a Segway base.

Method

The general idea is to perform tracking on the 2D cylindrical image and project this 2D tracking information into 3D, producing 3D tracking information.

The ROS node that performs the 2D to 3D operations looks something like this:

I developed a model for the cylindrical camera, similar to a pinhole camera model, using some simple geometry to allow us to project between 2D and 3D. This is detailed in the next section.

For extrinsic calibration, we use the CAD model provided by the manufacturer of JR2. The LIDARs' frames are transformed to match the OCCAM RGB 360° camera. We project the color information into the point cloud for visual inspection of the quality of the calibration. Moreover, since the LIDARs are both higher and lower than the OCCAM, in some cases these sensors can see parts of the scene invisible to the RGB camera. Therefore, I use a custom depth map to color only the foreground point cloud points, visible to the RGB camera.

With this information, along with the 2D tracking information, aka. "Darknet Bounding Box" in the diagram, frustum culling is used to segment the point cloud. I then find the center of mass on the segmented points and create the 3D track corresponding to the 2D tracking information.

Finally a Kalman filter is applied (top down) to the 3D tracks. The Kalman filter model is currently linear, which works well for targets in motion. A random walk filter may be added in the future to account for stationary or mostly-still targets.

Cylindrical Camera Model

Given $(X,Y,Z)$ are lidar points in 3D and $(u,v)$ are the coordinates of a pixel in the RGB 360 image.

Figure 1:

The planar projective pinhole camera model is applied to a cylinder as follows.

Planar Pinhole Model

Recall the pinhole model in pixel space to translate $(X,Y,Z)$ -> $(u,v)$ is given by:

$(u,v) = \Big(f_m\big(\frac{w_{img}}{w_{ccd}}\big)\big(\frac{X}{Z}\big),f_m\big(\frac{h_{img}}{h_{ccd}}\big)\big(\frac{Y}{Z}\big)\Big)$

Where the camera matrix is given by $K$ :

$K = \left( \begin{array}{ccc} f_x & s & x_0 \\ 0 & f_y & y_0 \\ 0 & 0 & 1 \end{array} \right)$

We want an estimate for the focal length of the cylindrical camera. The cylindrical image is composed of images from 5 monocular RGB cameras which are stitched to create the cylindrical image. The individual monocular cameras are calibrated independently giving us one focal length estimate per camera. Therefore, we use the median focal length as proposed by Szeliski and Shum in Creating Full View Panoramic Image Mosaics and Environment Maps.

$f_m$ is the median metric focal length of the calibrated cameras from which the 360 image is composed

$w_{img}, h_{img}$ is the image width, height in pixels

$w_{ccd}, h_{ccd}$ is the metric sensor size

Then the median focal length in pixels along the y axis, from all calibrated cameras:

$\hat{f_{px}} = f_m\Big(\frac{w_{img}}{w_{ccd}}\Big)$

Cylindrical Pinhole Model

Let $(H_{img},W_{img})$ be the height and width of the cylindrical image, in pixels.

The horizontal component is calculated under the assumption that our image spans the full 360 degree rotation of the cylinder.

$u = \left\{ \begin{array}{ll} W_{img}\big(\frac{\arctan{\frac{X}{Z}}}{2\pi}\big) & \quad x \geq 0 \\ W_{img}\big(\frac{\arctan{\big(\frac{X}{Z}\big)}+\pi}{2\pi}\big) & \quad x < 0 \end{array} \right\}$

For the vertical component, $Z\prime$ in cylindrical space is computed from $Z$ in planar projective coordinates:

$Z\prime = Z \times sec(θ)$

(see $Z\prime$ in Figure 1)

And substitute $Z$ with $Z\prime$ in the pinhole model:

$v = f_m\big(\frac{w_{img}}{w_{ccd}}\big)\big(\frac{Y}{Z \times \sec{(θ)}}\big)$

Therefore:

$v = f_{px}\big(\frac{Y}{Z \times \sec{(θ)}}\big)$

Taking into account the cameras composing the RGB360 image, where $\hat{f_{px}}$ is the median focal length in pixels along the y axis, from all calibrated cameras:

$v = \hat{f_{px}}\big(\frac{Y}{Z \times \sec{(θ)}}\big)$

Accounting for the calibrated optical center, where:

$\hat{y}_{0}$ is the median optical center along the y axis, from all calibrated cameras

$\hat{h}_{img}$ is the median calibrated image height, from all calibrated cameras $v = \hat{f_{px}}\big(\frac{Y}{Z \times \sec{(θ)}}\big) + \big(\frac{\hat{y_0}}{\hat{h_{img}}}\big)(H_{img})$

Implementation

The package is implemented as a ROS Kinetic node in C++.

Default values are read from a yaml config file. ROS allows these default values to be overridden on the command line or in a ROS launch file.

PointCloud2 (LIDAR) and BoundingBox (2D Tracking) messages are synced with an approximate time sync policy.
When an in-sync message pair is received by the rgb360_lidar node, the incoming cloud is transformed to the rgb360 frame.
- The target frame is configurable, given by the input parameter frames/rgb360 (default is occam).
- By default the incoming pointcloud's upper_velodyne_frame and lower_velodyne_frame are transformed to match the occam frame:

To allow fine tuning this transformation, the incoming cloud is optionally transformed again, as given by the calibration parameters under lidar_to_rgb/calibrated.
- The defaults are specified in defaults.yaml
- 0 values in these parameters indicate no transformation will be made.
- This transformation is relative to the incoming cloud's frame (this is upper_velodyne_frame by default, but defined by the incoming cloud)
2D bounding boxes from the darknet detections (this is the message type, actual detections are output of Deep SORT) are then iterated. For each bounding box:
- The original bounding box is reduced by bounding_box/reduce_by to minimize any extra noise captured around the edges of the following steps
- The PCL FrustumCulling filter is applied to the incoming cloud by transforming the filter's camera pose and field of view such that:
  - Horizontal field of view: bounding box horizontal pixel count / total horizontal pixel count
  - Vertical field of view is initialized to 90 in the first pass
  - Horizontal rotation: bounding box center in pixel coordinates / total pixels
  - Vertical rotation: bounding box center in pixel coordinates x radians per pixel value from calibration
- For each point in the initial frustum culling filter:
  - The 3D point is projected into 2D space
  - The computed y coordinate is compared to the 2D RGB bounding box. Points within the box are kept and the rest discarded.
  - Remaining points are sorted by distance from the origin (closest first) and re-drawn onto a depth map with pixel size given by reprojected_pixel_size_[x|y]
    - Points that would be obscured by a closer pixel (of the size given above) are removed from the segmented cloud
- Results of the above segmentation are passed through a statistical outlier removal filter
- If enough points remain after the above steps (more than min_segmented_cloud_size), people detections and tracks are extracted from the segmented and de-noised cloud by:
  - Taking the centroid of the segment
  - Comparing this position with previous observations and if a large jump occurs, inferring the location from previous velocity
- People locations, as top-down (x,y) coordinates, are passed through an Kalman filter.
- People tracks are smoothed by continuing to publish missing tracks (given by track_id) until tracking_persist_upto_misses synced messages have passed, even when they do not result in a tracking location for the given track_id.
Results from the above filter operations are aggregated into one cloud composed of the two VLP-16s with colorization from the OCCAM
- The cloud's frame is updated to match the target frame (default is occam)
- The aggregated cloud is published