This post follows from a project at the Stanford Vision and Learning Lab, where I've been working for some time, and serves as documentation for our social navigation robot's onboard multi-target person-tracker I developed under the guidance of Roberto, Marynel, Hamid, Patrick and Silvio.
The above video shows the results of the tracker. The top portion of the video is the result of the 360° cylindrical image having been passed through YOLOv2 and Deep SORT. The bottom portion of the video shows the results from fusing the top image with 2x VLP-16 LIDARs to generate people tracks. The many intricacies involved in this process are described below.
While working on research in the use of imitation learning to create motion policies for our social navigation robot, Jackrabbot2, we needed to fuse data from different sources on the robot in order to track people trajectories in real time.
Just like you and I, when we walk around other people, we wanted JR2 to be able to understand the location and trajectory of people in its environment, in order to make intelligent decisions about how to interact with and move naturally among those people.
We also wanted to be able to identify and re-identify specific people in the scene, so that JR2 is able to have context-aware interactions with specific people in a natural way.
Re-identification is a challenging problem. Re-identification should occur after a track is lost or a target moves between frames, e.g. moves between different cameras or in and out of the view of one camera. Even state of the art tracking systems have difficulty associating a previous track with a current track when the same object re-appears in a frame. To minimize the need for this difficult re-identification process, we use the 360° camera on JR2 to keep people in the same frame as much as possible, even as they move around the robot. The exception is at the seam of the 360° image, where Deep SORT still sees the track go in and out of frame.
The sensor suite on JackRabbot2 is similar to that of a small self driving car.
On the tower portion of the robot, from top to bottom, the main sensors are:
Around the base are 2x 2D SICK LIDARs -- one in the front and one in the back.
Inside the chassis is a computer with lots of cores, ram and 2x GPUs along with a plethora of communication devices as well as batteries for the computer and motion systems.
On the side you can see the KINOVA JACO arm and the whole thing is built on a Segway base.
The general idea is to perform tracking on the 2D cylindrical image and project this 2D tracking information into 3D, producing 3D tracking information.
The ROS node that performs the 2D to 3D operations looks something like this:
I developed a model for the cylindrical camera, similar to a pinhole camera model, using some simple geometry to allow us to project between 2D and 3D. This is detailed in the next section.
For extrinsic calibration, we use the CAD model provided by the manufacturer of JR2. The LIDARs' frames are transformed to match the OCCAM RGB 360° camera. We project the color information into the point cloud for visual inspection of the quality of the calibration. Moreover, since the LIDARs are both higher and lower than the OCCAM, in some cases these sensors can see parts of the scene invisible to the RGB camera. Therefore, I use a custom depth map to color only the foreground point cloud points, visible to the RGB camera.
With this information, along with the 2D tracking information, aka. "Darknet Bounding Box" in the diagram, frustum culling is used to segment the point cloud. I then find the center of mass on the segmented points and create the 3D track corresponding to the 2D tracking information.
Finally a Kalman filter is applied (top down) to the 3D tracks. The Kalman filter model is currently linear, which works well for targets in motion. A random walk filter may be added in the future to account for stationary or mostly-still targets.
Given are lidar points in 3D and are the coordinates of a pixel in the RGB 360 image.
The planar projective pinhole camera model is applied to a cylinder as follows.
Recall the pinhole model in pixel space to translate -> is given by:
Where the camera matrix is given by :
We want an estimate for the focal length of the cylindrical camera. The cylindrical image is composed of images from 5 monocular RGB cameras which are stitched to create the cylindrical image. The individual monocular cameras are calibrated independently giving us one focal length estimate per camera. Therefore, we use the median focal length as proposed by Szeliski and Shum in Creating Full View Panoramic Image Mosaics and Environment Maps.is the median metric focal length of the calibrated cameras from which the 360 image is composed
Then the median focal length in pixels along the y axis, from all calibrated cameras:
Let be the height and width of the cylindrical image, in pixels.
The horizontal component is calculated under the assumption that our image spans the full 360 degree rotation of the cylinder.
For the vertical component, in cylindrical space is computed from in planar projective coordinates:
(see in Figure 1)
And substitute with in the pinhole model:
Taking into account the cameras composing the RGB360 image, where is the median focal length in pixels along the y axis, from all calibrated cameras:
Accounting for the calibrated optical center, where:is the median optical center along the y axis, from all calibrated cameras
The package is implemented as a ROS Kinetic node in C++.
Default values are read from a
yaml config file. ROS allows these default values to be overridden on the command line or in a ROS launch file.
PointCloud2 (LIDAR) and BoundingBox (2D Tracking) messages are synced with an approximate time sync policy.
When an in-sync message pair is received by the
rgb360_lidar node, the incoming cloud is transformed to the rgb360 frame.
frames/rgb360 (default is
lower_velodyne_frame are transformed to match the
To allow fine tuning this transformation, the incoming cloud is optionally transformed again, as given by the calibration parameters under
upper_velodyne_frame by default, but defined by the incoming cloud)
2D bounding boxes from the darknet detections (this is the message type, actual detections are output of Deep SORT) are then iterated. For each bounding box:
bounding_box/reduce_by to minimize any extra noise captured around the edges of the following steps
bounding box horizontal pixel count /
total horizontal pixel count
bounding box center in pixel coordinates /
bounding box center in pixel coordinates x
radians per pixel value from calibration
min_segmented_cloud_size), people detections and tracks are extracted from the segmented and de-noised cloud by:
tracking_persist_upto_misses synced messages have passed, even when they do not result in a tracking location for the given
Results from the above filter operations are aggregated into one cloud composed of the two VLP-16s with colorization from the OCCAM