Overview :
This task was done as part of the Autonomous Stair Climbing Robot project, and a paper titled 'Deep Learning based Stair Detection and Statistical Image Filtering for Autonomous Stair Climbing' was published at the IEEE International Conference on Robotic Computing (IRC) 2019 in Naples, Italy.
YOLO (You Only Look Once) is a state-of-the-art, real-time object detection algorithm. We have used transfer learning on YOLOv3 to achieve real-time stair detection. It predicts the object class and the bounding boxes that contain the object.
Data Collection and Annotation :
We have collected various images of stairs, mainly from the campus of Visvesvaraya National Institute of Technology, Nagpur, and the remainder from the Internet. As our target is real-time detection of stairs on our stair climbing robot, we have tried to ensure that all images are captured from a height of about 20 cm above the ground (the height of the camera on the bot). We collected a total of 848 images.
We have used the LabelImg tool to annotate the images, and a separate Python script to convert the resulting .xml files to .txt files in the YOLO format. (We later realized that LabelImg has an option to export labels directly in YOLO format.) Some sample images with ground-truth bounding boxes are shown on the left.
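The conversion from LabelImg's Pascal VOC-style .xml to the YOLO .txt format boils down to normalizing corner coordinates into centre/width/height form. A minimal sketch of such a script (the `CLASSES` list and function name are illustrative, not the exact script we used):

```python
import os
import xml.etree.ElementTree as ET

CLASSES = ["stair"]  # assumed single-class setup for this project

def voc_to_yolo(xml_path, out_dir):
    """Convert one Pascal VOC .xml annotation to a YOLO-format .txt file."""
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls_id = CLASSES.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin = float(box.find("xmin").text)
        ymin = float(box.find("ymin").text)
        xmax = float(box.find("xmax").text)
        ymax = float(box.find("ymax").text)
        # YOLO format: class x_center y_center width height, normalized to [0, 1]
        xc = (xmin + xmax) / 2.0 / w
        yc = (ymin + ymax) / 2.0 / h
        bw = (xmax - xmin) / w
        bh = (ymax - ymin) / h
        lines.append(f"{cls_id} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    name = os.path.splitext(os.path.basename(xml_path))[0] + ".txt"
    with open(os.path.join(out_dir, name), "w") as f:
        f.write("\n".join(lines))
```

Each image gets a .txt file of the same base name, which is the layout darknet expects alongside the images.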
Data Augmentation :
Due to the difficulty of collecting a large number of images, we have applied data augmentation to increase the size of our dataset. Currently, we use only horizontal flipping, which doubles the number of images in our dataset.
This brings the total to 1696 images.
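Horizontal flipping also requires mirroring the labels; in YOLO's normalized format only the x-centre changes. A minimal sketch of the label update (the function name is illustrative, and mirroring the image itself can be done with any image library):

```python
def flip_yolo_labels(label_lines):
    """Mirror YOLO-format labels for a horizontally flipped image.

    Each line is "class x_center y_center width height" with coordinates
    normalized to [0, 1]; flipping maps x_center -> 1 - x_center, while
    y_center, width and height are unchanged.
    """
    flipped = []
    for line in label_lines:
        cls_id, xc, yc, bw, bh = line.split()
        xc = 1.0 - float(xc)
        flipped.append(f"{cls_id} {xc:.6f} {yc} {bw} {bh}")
    return flipped
```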
Dataset can be downloaded from here.
Transfer Learning :
Finally, we have trained the deep Convolutional Neural Network on our dataset (described above).
YOLOv3 provides 2 versions of the deep CNN, namely YOLOv3 and tiny-YOLOv3. Tiny-YOLOv3 has a shallower CNN (around 9 convolutional layers) compared to the full sized YOLOv3 (around 24 convolutional layers). Tiny-YOLOv3 is aimed at lower-end hardware (embedded systems without GPUs or with lower-end GPUs).
We have trained both these variants on our dataset and present the mAP (mean Average Precision), F1-score, IoU (Intersection over Union) and other metrics obtained after training for a specific number of iterations in this document :
Note that a 25% probability threshold (i.e. all bounding boxes with probability > 0.25 are considered valid detections) is used by default for all these metrics.
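The 25% threshold simply discards low-confidence boxes before the metrics are computed. A minimal sketch of this filtering step (the detection tuple layout is illustrative):

```python
def filter_detections(detections, threshold=0.25):
    """Keep only detections whose confidence exceeds the threshold.

    Each detection is assumed to be (class_name, confidence, (x, y, w, h)).
    """
    return [d for d in detections if d[1] > threshold]
```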
Also note that performance is measured on a test set (images unseen by the network).
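Of the metrics above, IoU is the building block: it scores how well a predicted box overlaps a ground-truth box. A minimal sketch for axis-aligned boxes given as (xmin, ymin, xmax, ymax):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection is typically counted as a true positive when its IoU with a ground-truth box exceeds 0.5, which is the basis for the mAP and F1 numbers reported here.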
Performance on tiny-YOLOv3 with augmented dataset
Results :
Demonstration :
Conclusion :
We are able to achieve over 100 fps with tiny-YOLOv3 when testing on a video on an Nvidia GTX 1080 Ti. On an Nvidia Jetson TX1, we get around 20-25 fps when the network input size is 288x288, and 10-15 fps when it is 416x416.
Note that a larger input size gives better accuracy.
However, the result obtained on the TX1 is more relevant, given that this task is to be performed on a customized stair climbing bot, which will house a Jetson TX1 as the main controller board.
Note that the test video is 640x320 at 60 fps.