Kyoto University Computer Vision Lab

Dept. of Intelligence Science and Technology, Graduate School of Informatics

HeatFormer: A Neural Optimizer for Multiview Human Mesh Recovery

Yuto Matsubara and Ko Nishino
Kyoto University

We introduce a novel method for human shape and pose recovery that fully leverages multiple static views. We target fixed-multiview people monitoring, such as elderly care and safety monitoring, in which calibrated cameras can be installed at the corners of a room or an open space but their configuration may vary with the environment. Our key idea is to formulate the problem as neural optimization. We achieve this with HeatFormer, a neural optimizer that iteratively refines the SMPL parameters given multiview images and is fundamentally agnostic to the configuration of views. HeatFormer realizes this SMPL parameter estimation as heatmap generation and alignment with a novel transformer encoder and decoder. We demonstrate the effectiveness of HeatFormer, including its accuracy, robustness to occlusion, and generalizability, through an extensive set of experiments. We believe HeatFormer can play a key role in passive human behavior modeling.

HeatFormer: A Neural Optimizer for Multiview Human Mesh Recovery
Yuto Matsubara and Ko Nishino,
in Proc. of Conference on Computer Vision and Pattern Recognition CVPR’25, Jun., 2025.
[ arXiv ][ video ][ project ][ code ]

Overview

architecture of HeatFormer: We achieve occlusion-robust multiview human body shape and pose recovery with a novel transformer, which we refer to as HeatFormer. HeatFormer realizes neural optimization for human mesh recovery (HMR) in a feed-forward inference in which the transformer encoder-decoder model serves as an unrolled iteration of SMPL fitting to the observed images. A key idea underlying HeatFormer is to represent and align the body joints as heatmaps. HeatFormer forms heatmaps from the current view-dependent SMPL estimates, which are iteratively aligned with the input multiview heatmaps through the encoder-decoder inference iteration. It first extracts image features and a heatmap for each view, which are aggregated with a novel encoder and input to the decoder. The decoder also takes in heatmaps generated from the current SMPL estimate and, through its unrolled inference, iteratively aligns them with the observed heatmaps.
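To make the idea of forming heatmaps from a current estimate concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a pinhole camera with known intrinsics `K` and extrinsics `R`, `t`, and it assumes one isotropic Gaussian per joint; the names `project_points` and `render_heatmaps`, the `sigma` value, and the toy joints (stand-ins for joints regressed from SMPL) are all hypothetical.

```python
import numpy as np

def project_points(joints_3d, K, R, t):
    """Project (J, 3) world-space joints into a pinhole camera; returns (J, 2) pixels."""
    cam = joints_3d @ R.T + t          # world -> camera coordinates
    uv = cam @ K.T                     # apply the intrinsics
    return uv[:, :2] / uv[:, 2:3]      # perspective divide

def render_heatmaps(joints_2d, height, width, sigma=2.0):
    """Render one Gaussian heatmap per joint; returns an array of shape (J, height, width)."""
    ys, xs = np.mgrid[0:height, 0:width]
    dx = xs[None] - joints_2d[:, 0, None, None]
    dy = ys[None] - joints_2d[:, 1, None, None]
    return np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))

# Toy setup (assumed values): two joints seen by a camera at the origin looking along +z.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)
joints_3d = np.array([[0.0, 0.0, 2.0],
                      [0.2, -0.1, 2.0]])

joints_2d = project_points(joints_3d, K, R, t)
heatmaps = render_heatmaps(joints_2d, height=64, width=64)
```

Running the same two functions once per calibrated camera yields one heatmap stack per view, which is the kind of view-dependent representation that can then be compared with heatmaps extracted from the input images.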

Results

effectiveness of neural optimization — HeatFormer is an unrolled iterative optimizer realized through its forward inference. HeatFormer converges to accurate SMPL estimates within three unrolled inferences.
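The structure of an unrolled iterative optimizer can be sketched with a toy example. This is an illustration under stated assumptions, not HeatFormer itself: the learned transformer update is replaced by a hypothetical hand-written `toy_update` that moves a fixed fraction toward a known target, and `refine`, `step`, and the three-element parameter vector are names and values invented for this sketch.

```python
import numpy as np

def refine(params, observed, update_fn, num_iters=3):
    """Apply a residual update a fixed number of times, keeping every iterate."""
    history = [params]
    for _ in range(num_iters):
        params = params + update_fn(params, observed)
        history.append(params)
    return history

def toy_update(params, observed, step=0.7):
    """Stand-in for a learned update: move a fixed fraction toward the observation."""
    return step * (observed - params)

target = np.array([1.0, -2.0, 0.5])
history = refine(np.zeros(3), target, toy_update, num_iters=3)

# Distance to the target after each unrolled step (index 0 is the initial estimate).
errors = [float(np.linalg.norm(p - target)) for p in history]
```

In this toy setting the error shrinks by a constant factor per step. In a learned optimizer the update would instead be predicted by the network from the current estimate and the observations, with the number of unrolled steps fixed in advance.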

qualitative results on Human3.6M — Qualitative results on the Human3.6M dataset. HeatFormer successfully leverages the multiview observations to resolve the complex occlusions.

qualitative results on mpii3d — Qualitative results on the MPI-INF-3DHP [35] dataset. The body shape and pose behind various kinds of occlusions are successfully recovered.

qualitative results on BEHAVE: Qualitative results on the BEHAVE dataset, which contains object occlusions, captured from four views. This dataset is not used in training. HeatFormer generalizes well to unseen scenes and unseen types of occlusion thanks to its neural optimization formulation.

qualitative results on RICH: Qualitative results on the RICH dataset captured from four views. This is a real-scene dataset that is not used in training. The results clearly demonstrate the strong generalization capability and occlusion robustness of HeatFormer. HeatFormer produces accurate estimates regardless of the view configuration.