This project was completed as a final project for 6.869 (Advances in Computer Vision), in collaboration with Mark Mazumder. Full details are in the final report, available here. Source code is available here.
After a semester-long course on machine learning and convolutional neural networks applied to computer vision, it was time for the class final project. Having run neural network models on beefed-up servers and AWS instances, we wanted to see how far we could get with computer vision in embedded environments for mobile computing. A great example of this is robotics, where platforms are often SWaP (size, weight and power) constrained.
As our platform, we borrowed the MIT RACECAR from MIT’s 6.141 class. It has an embedded GPU with 4 GB of shared memory, a 720p camera, and an API for control of velocity and steering. The idea was to incorporate object detection in an autonomous control loop on a small robot car to recognize and follow a specific person as they walk around. We track shoes since the car’s camera is at ground level. We also set a goal of relying only on onboard compute.
TensorFlow is running real-time inference as part of the control loop and computing the bounding boxes that you see in the car’s view. To command the car to autonomously track our shoes, we implemented two PID control loops, one each for velocity and steering. The steering controller’s error estimate attempts to keep the shoes centered in the image.
The velocity controller attempts to keep the bounding boxes at a fixed size, which serves as a proxy for a fixed following distance. A median filter helps maintain robustness and stability.
After some tuning of the loop parameters, the robot showed reasonable convergence in its step response for both steering and distance:
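The two control loops described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class and function names, the `(x_min, y_min, x_max, y_max)` pixel box format, the normalisation of the errors, and the choice to use box height as the size measure are all assumptions made for the example. Where exactly the median filter sits in the loop is also assumed.

```python
from collections import deque
from statistics import median


class PID:
    """Textbook PID controller. Gains are placeholders, not tuned values."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        # Accumulate the integral term and estimate the derivative
        # from the previous error sample.
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


class MedianFilter:
    """Sliding-window median, used here to suppress outlier detections."""

    def __init__(self, size=5):
        self.window = deque(maxlen=size)

    def update(self, value):
        self.window.append(value)
        return median(self.window)


def steering_error(box, image_width):
    """Horizontal offset of the box centre from the image centre, in [-1, 1]."""
    x_min, _, x_max, _ = box
    center = 0.5 * (x_min + x_max)
    half_width = image_width / 2
    return (center - half_width) / half_width


def distance_error(box, target_height):
    """Positive when the box is smaller than the target (shoes too far away)."""
    _, y_min, _, y_max = box
    return (target_height - (y_max - y_min)) / target_height
```

In a loop of this shape, each new detection would produce the two errors, each error would pass through its own `MedianFilter`, and the filtered values would feed one `PID` instance for steering and one for velocity.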
For object detection, we used the Single Shot MultiBox Detector (SSD) network. SSD predicts object locations inside a set of default boxes at many aspect ratios and scales. For example, the cat matches the small blue box, and the dog matches the larger red box. TensorFlow’s SSD-MobileNet implementation has been optimized for embedded devices. For instance, MobileNet’s depthwise separable convolutions require far fewer multiplications than standard convolutions. As a result, SSD-MobileNet can achieve performance similar to more conventional CNNs while using a fraction of the resources.
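To make the savings concrete, the multiplication counts of the two layer types can be compared directly. The sketch below uses the standard cost formulas for a k x k convolution over an h x w feature map; the function names and the layer dimensions in the example are illustrative choices, not values taken from our network.

```python
def standard_conv_mults(k, c_in, c_out, h, w):
    """Multiplications for a standard k x k convolution (stride 1, same padding)."""
    return k * k * c_in * c_out * h * w


def separable_conv_mults(k, c_in, c_out, h, w):
    """Multiplications for a depthwise separable convolution.

    A k x k depthwise filter is applied per input channel, followed by
    a 1 x 1 pointwise convolution that mixes channels.
    """
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise


# Illustrative layer: 3x3 kernel, 64 -> 128 channels, 56x56 feature map.
standard = standard_conv_mults(3, 64, 128, 56, 56)
separable = separable_conv_mults(3, 64, 128, 56, 56)
ratio = separable / standard  # equals 1/c_out + 1/k**2
print(f"separable / standard = {ratio:.3f}")
```

For a 3 x 3 kernel the ratio works out to 1/c_out + 1/9, so the separable layer needs roughly 8 to 9 times fewer multiplications once the output channel count is large.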
The RACECAR’s GPU shares memory with the CPU, so to fit our network on the device we also had to cap TensorFlow’s memory allocator at under 2 GB. We were able to achieve an on-device inference rate of 7 FPS. Unexpectedly, we also had to recompile the Linux kernel so that it would support both the speed controller and TensorFlow.
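One common way to impose such a cap is to give TensorFlow a fraction of total device memory rather than an absolute size. The sketch below computes that fraction; the helper name is hypothetical, and the TensorFlow 1.x calls shown in the comments are an assumption about how a cap like this could be applied, not a record of what we actually ran.

```python
def gpu_memory_fraction(cap_gb, total_gb):
    """Fraction of total memory corresponding to an absolute cap in GB."""
    if not 0 < cap_gb <= total_gb:
        raise ValueError("cap must be positive and no larger than total memory")
    return cap_gb / total_gb


# 2 GB cap on a device with 4 GB of memory shared between CPU and GPU.
fraction = gpu_memory_fraction(2.0, 4.0)

# Assuming the TensorFlow 1.x session API, the fraction could be applied as:
#   import tensorflow as tf
#   gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=fraction)
#   session = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
```

Because the memory is shared, whatever TensorFlow reserves is no longer available to the rest of the system, which is why the cap matters more here than on a discrete GPU.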
We used weights pre-trained on the COCO dataset (which doesn’t actually contain a shoe category) and fine-tuned them using 1000 labeled images of our own shoes. Our model was fine-tuned for 14268 steps. Mean Average Precision (mAP) performance at 0.5 IoU (Intersection over Union) can be seen below:
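For readers unfamiliar with the metric, IoU is the area of overlap between a predicted and a ground-truth box divided by the area of their union; at the 0.5 threshold, a detection counts as correct only if this ratio is at least 0.5. A minimal sketch follows, assuming boxes given as `(x_min, y_min, x_max, y_max)`; the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_match(predicted, ground_truth, threshold=0.5):
    """True if the prediction overlaps the ground truth enough to count."""
    return iou(predicted, ground_truth) >= threshold
```

mAP then averages precision over recall levels and classes, using this match test to decide which detections are true positives.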
We tried going a step further and implemented gesture control on our car. As the user moves their hands up and down, the car is commanded to move back and forth. We calculate the slope of the line between the two hand bounding boxes and output a velocity command proportional to it. The car stops when the two hands are held level with each other. An example application for this might be in medical assistive technology for human-robot interactions.
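The slope-to-velocity mapping can be sketched as below. The function names, the gain, the speed limit, the `(x_min, y_min, x_max, y_max)` box format, and the sign convention (which tilt direction means forward) are all assumptions for illustration; only the proportional relationship and the stop-when-level behaviour come from the description above.

```python
def box_center(box):
    """Centre point of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return 0.5 * (x_min + x_max), 0.5 * (y_min + y_max)


def velocity_from_hands(box_a, box_b, gain=1.0, max_speed=1.0):
    """Velocity command proportional to the slope between two hand boxes."""
    # Order the hands left to right so the result does not depend on
    # which detection happens to come first.
    (lx, ly), (rx, ry) = sorted([box_center(box_a), box_center(box_b)])
    dx = rx - lx
    if dx == 0:
        # Hands stacked vertically: slope is undefined, so stop as a safe default.
        return 0.0
    slope = (ry - ly) / dx
    # Clamp the proportional command to the allowed speed range.
    return max(-max_speed, min(max_speed, gain * slope))
```

Level hands give a slope of zero and therefore a zero velocity command, which matches the stopping behaviour described above.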
As a final stretch goal, we fit a license plate recognition model onto our autonomous vehicle. Using existing models borrowed from the OpenALPR library, we were able to achieve this in real time on-device. The system can detect multiple plates simultaneously. We successfully simulated the use of low-end cameras by running inference on low-resolution video. Potential applications include traffic monitoring and self-driving cars.