RGB Road Scene Material Segmentation

Sudong Cai, Ryosuke Wakaki, Shohei Nobuhara, and Ko Nishino
Kyoto University

We address RGB road scene material segmentation, i.e., per-pixel segmentation of materials in real-world driving views from RGB images alone, by building a new benchmark dataset and a model tailored to the task. Our new dataset, KITTI-Materials, is based on the well-established KITTI dataset and consists of 1000 frames covering 24 different road scenes of urban/suburban landscapes, with every pixel annotated at high quality with one of 20 material categories. It is the first dataset tailored to RGB material segmentation in realistic driving scenes, and it allows us to train and test any RGB material segmentation model. Based on an analysis of KITTI-Materials, we identify the extraction and fusion of texture and context as the key to robustly modeling road scene material appearance. We introduce the Road scene Material Segmentation Network (RMSNet), a new Transformer-based framework that will serve as a baseline for this challenging task. RMSNet encodes multi-scale hierarchical features with self-attention. We construct the decoder of RMSNet based on a novel lightweight self-attention model, which we refer to as SAMixer. SAMixer achieves adaptive fusion of informative texture and context cues across multiple feature levels. It also significantly accelerates self-attention for feature fusion with a balanced query-key similarity measure. We also introduce a built-in bottleneck of local statistics to achieve further efficiency and accuracy. Extensive experiments on KITTI-Materials validate the effectiveness of RMSNet. We believe our work lays a solid foundation for further studies on RGB road scene material segmentation.
  • RGB Road Scene Material Segmentation
    S. Cai, R. Wakaki, S. Nobuhara and K. Nishino,
    in Asian Conference on Computer Vision ACCV’22, 2022.
    [ paper ][ supp. PDF ][ project ][ code/data ]

KITTI-Materials Dataset

We build a new dataset tailored to RGB road scene material segmentation by annotating images from the KITTI dataset, and refer to it as the KITTI-Materials dataset. Building on a widely adopted road scene dataset keeps the new dataset directly relevant to autonomous driving research. KITTI-Materials consists of 1000 frames covering 24 different road scenes of common urban/suburban landscapes, with every pixel densely annotated with one of 20 material categories. It is the first benchmark dataset for pure RGB road scene material segmentation, enabling us, and others who follow, to train and evaluate ideas for the task.
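As a rough illustration of what evaluating a model on a densely annotated 20-category dataset can look like, the sketch below computes mean intersection-over-union (mIoU) from per-pixel predictions and labels. The choice of mIoU as the metric, the function name `mean_iou`, and the `ignore_index` value of 255 are assumptions made for this example; they are not taken from the KITTI-Materials release.

```python
def mean_iou(pred, target, num_classes=20, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred, target: flat sequences of per-pixel class ids (same length).
    ignore_index: hypothetical label for unannotated pixels (an assumption).
    """
    inter = [0] * num_classes  # pixels where prediction and label agree on class c
    union = [0] * num_classes  # pixels where prediction or label is class c
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip pixels without a valid label
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # the pixel counts toward the union of both classes involved
            union[t] += 1
            union[p] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with two classes and four pixels.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Classes absent from both the prediction and the labels are left out of the average, which is one common convention; other conventions would count them as zero or one.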

RMSNet

We identify the extraction and fusion of texture and image context as the key to robustly modeling road scene material appearance. To this end, we extract features of local textures and long-range context from the multi-level hierarchy of an efficient transformer encoder, and merge these multi-level, multi-scale features with a novel self-attention-based feature fusion model, which we refer to as SAMixer. SAMixer achieves N-to-1 multi-level feature fusion by generalizing vanilla spatial self-attention to a new axis created by concatenating the feature maps at aligned positions. We build this self-attention on a newly proposed balanced query-key similarity measure and a Bottleneck Local Statistics Encoding-Decoding (BLSED) strategy, which improve the effectiveness and efficiency of the self-attention operation for multi-level feature fusion.
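To make the idea of attention along a level axis more concrete, here is a minimal sketch of N-to-1 fusion at a single pixel: the N feature vectors found at the same aligned position are treated as a short sequence, and one fused vector is produced as their attention-weighted sum. This is not the authors' SAMixer. It uses plain scaled dot-product attention with no learned projections, it takes the mean of the levels as the query (an assumption made purely for illustration), and it reproduces neither the balanced query-key similarity measure nor BLSED. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_levels_at_pixel(level_feats):
    """Fuse N same-length feature vectors (one per level) into one vector.

    level_feats: list of N lists of floats, all taken from the same aligned
    spatial position after the feature maps are resized to a common resolution.
    Returns (fused_vector, attention_weights).
    """
    n = len(level_feats)
    d = len(level_feats[0])
    # Hypothetical query: the mean of the level features at this pixel.
    query = [sum(f[k] for f in level_feats) / n for k in range(d)]
    # Scaled dot-product score between the query and each level (keys = features).
    scores = [sum(q * x for q, x in zip(query, f)) / math.sqrt(d)
              for f in level_feats]
    weights = softmax(scores)
    # N-to-1: weighted sum over the level axis (values = features).
    fused = [sum(w * f[k] for w, f in zip(weights, level_feats))
             for k in range(d)]
    return fused, weights


def fuse_feature_maps(maps):
    """Apply the per-pixel fusion to N maps given as H x W x D nested lists."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[fuse_levels_at_pixel([m[y][x] for m in maps])[0]
             for x in range(w)]
            for y in range(h)]


# Two levels at one pixel: the level better aligned with the query gets more weight.
fused, weights = fuse_levels_at_pixel([[2.0, 0.0], [0.0, 0.0]])
```

Because attention here runs over only N levels per pixel rather than over all spatial positions, the cost of this sketch grows linearly with the number of pixels.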

Results

We show examples of visualized segmentation results on our KITTI-Materials dataset. RMSNet achieves cleaner segmentation than the CNN baseline DeepLabv3+ (denoted by DLv3+) and the state-of-the-art transformer segmentation framework SegFormer on different materials, including “fabric,” “glass,” “metal,” “rubber,” and “human body,” which span a wide range of appearances as parts of road scene objects.
We also show visualized segmentation results for moving cars at different scales. RMSNet recovers richer detail in the contours and shapes of objects composed of multiple materials, e.g., the windows, headlights, bodies, and wheels of moving cars.