Technical Approach

In this work, we propose a composite deep convolutional neural network architecture that learns to predict both the semantic category and the motion status of each pixel from a pair of consecutive monocular images. The SMSnet architecture consists of three components: a section that learns motion features from generated optical flow maps, a parallel section that generates features for semantic segmentation, and a fusion section that combines the motion and semantic features and learns further deep representations for pixel-wise semantic motion segmentation.
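
The data flow through the three components can be sketched at the level of array shapes. This is a minimal illustration, not the SMSnet implementation: the function names, the channel counts, the random 1×1 projections standing in for learned layers, and the use of channel concatenation for fusion are all assumptions made for this example. The optical flow map is taken as a given input rather than computed from the image pair.

```python
import numpy as np

NUM_CLASSES = 10  # semantic categories annotated in the datasets
NUM_MOTION = 2    # motion status: static or moving

rng = np.random.default_rng(0)

def project(x, out_channels, relu=True):
    """Placeholder for learned layers: a random 1x1 projection (assumption)."""
    w = rng.standard_normal((x.shape[-1], out_channels))
    y = x @ w  # (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)
    return np.maximum(y, 0.0) if relu else y

def motion_stream(flow):
    # flow: (H, W, 2) optical flow map -> (H, W, 16) motion features
    return project(flow, 16)

def semantic_stream(image):
    # image: (H, W, 3) RGB frame -> (H, W, 32) semantic features
    return project(image, 32)

def fusion(motion_feat, semantic_feat):
    # Assumed fusion: concatenate along channels, then score every pixel.
    fused = np.concatenate([motion_feat, semantic_feat], axis=-1)
    semantic_scores = project(fused, NUM_CLASSES, relu=False)
    motion_scores = project(fused, NUM_MOTION, relu=False)
    # One semantic label and one motion label per pixel.
    return semantic_scores.argmax(axis=-1), motion_scores.argmax(axis=-1)

def predict(image, flow):
    return fusion(motion_stream(flow), semantic_stream(image))
```

Calling `predict` with an `(H, W, 3)` image and an `(H, W, 2)` flow map returns two `(H, W)` integer maps, one holding a semantic class index per pixel and one holding a motion status per pixel.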

SMSnet Architecture

A detailed description of the architecture can be found in our IROS 2017 paper.



To facilitate training of neural networks for semantic motion segmentation and to allow for credible quantitative evaluation, we make the following datasets publicly available. Each of these datasets has pixel-wise semantic labels for 10 object classes and their motion status (static or moving). Annotations are provided for the following classes: sky, building, road, sidewalk, cyclist, vegetation, pole, car, sign and pedestrian.
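
As an illustration of how such annotations might be handled in code, the sketch below pairs a semantic class with a motion status in a single integer label. The class order, the integer indices, and the joint-label scheme are hypothetical choices for this example; the label format of the released datasets may differ.

```python
# The 10 annotated classes; this ordering is an assumption for the example.
CLASSES = ["sky", "building", "road", "sidewalk", "cyclist",
           "vegetation", "pole", "car", "sign", "pedestrian"]
STATIC, MOVING = 0, 1

def encode(class_name, moving):
    """Map (class, motion status) to one joint label (assumed scheme)."""
    return CLASSES.index(class_name) * 2 + (MOVING if moving else STATIC)

def decode(label):
    """Invert encode(): return (class name, is_moving)."""
    class_id, motion = divmod(label, 2)
    return CLASSES[class_id], motion == MOVING
```

Under this scheme, `decode(encode("car", True))` returns `("car", True)`, so the semantic class and the motion status can be stored together and recovered separately.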


Please cite our work if you use the Cityscapes-Motion Dataset or the KITTI-Motion Dataset and report results based on it.

author = {Johan Vertens and Abhinav Valada and Wolfram Burgard},
title = {SMSnet: Semantic Motion Segmentation using Deep Convolutional Neural Networks},
booktitle = {Proc.~of the IEEE Int.~Conf.~on Intelligent Robots and Systems (IROS)},
year = 2017,
url = {},
address = {Vancouver, Canada}

License Agreement

The data is provided for non-commercial use only. By downloading the data, you accept the license agreement which can be downloaded here. If you report results based on the Cityscapes-Motion or the KITTI-Motion datasets, please consider citing the first paper mentioned under publications.



The Cityscapes-Motion dataset is a supplement to the semantic annotations provided by the Cityscapes dataset, containing 2975 training images and 500 validation images. We provide manually annotated motion labels for the category of cars. The images have a resolution of 2048×1024 pixels.


The KITTI-Motion dataset contains pixel-wise semantic class labels and moving object annotations for 255 images taken from the KITTI Raw dataset. The images have a resolution of 1280×384 pixels and show scenes of freeways, residential areas and inner cities.


A software implementation of this project can be found on our GitHub repository. The implementation is based on Caffe and is licensed for non-commercial use (license summary).

Video Demo


  • Johan Vertens, Abhinav Valada, Wolfram Burgard
    SMSnet: Semantic Motion Segmentation using Deep Convolutional Neural Networks
    Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Vancouver, Canada, 2017.
