A Multimodal Egocentric Human Motion Dataset with HMD and Body-Worn IMUs (2024)

Zhen Fan†, Peng Dai†, Zhuo Su†, Xu Gao, Zheng Lv,
Jiarui Zhang, Tianyuan Du, Guidong Wang, Yang Zhang‡
† Equal contribution  ‡ Corresponding author
PICO

Abstract

Egocentric human pose estimation (HPE) using wearable sensors is essential for VR/AR applications. Most methods rely solely on either egocentric-view images or sparse Inertial Measurement Unit (IMU) signals, leading to inaccuracies due to self-occlusion in images or the sparseness and drift of inertial sensors. Most importantly, the lack of real-world datasets containing both modalities is a major obstacle to progress in this field. To overcome this barrier, we propose EMHI, a multimodal Egocentric human Motion dataset with Head-Mounted Display (HMD) and body-worn IMUs, with all data collected using a real VR product suite. Specifically, EMHI provides synchronized stereo images from downward-sloping cameras on the headset and IMU data from body-worn sensors, along with pose annotations in SMPL format. The dataset consists of 885 sequences captured by 58 subjects performing 39 actions, totaling about 28.5 hours of recording. We evaluate the annotations by comparing them with optical marker-based SMPL fitting results. To further demonstrate the usefulness of our dataset, we introduce MEPoser, a new baseline method for multimodal egocentric HPE, which employs a multimodal fusion encoder, a temporal feature encoder, and MLP-based regression heads. The experiments on EMHI show that MEPoser outperforms existing single-modal methods and demonstrates the value of our dataset in solving the problem of egocentric HPE. We believe the release of EMHI and the method will advance research on egocentric HPE and expedite the practical implementation of this technology in VR/AR products.

[Figure 1]

1 Introduction

Egocentric human pose estimation (HPE) has gained significant attention in computer vision, driven by the demand for accurate motion tracking in immersive VR/AR environments. Unlike traditional exocentric HPE, which relies on external sensors, egocentric HPE employs body-worn sensors such as egocentric cameras or sparse IMUs. Although rapid progress has been made in this field, obtaining accurate full-body poses from single-modal data remains challenging due to 1) self-occlusion and viewpoint variations in egocentric vision and 2) the sparsity and drift of IMU data. Above all, the lack of real-world multimodal training data remains the most significant obstacle.

Previous works[35, 47] introduced egocentric datasets using experimental fisheye camera setups to capture images and annotate 3D joints. However, these setups are impractical for real VR/AR products, which require compact, lightweight designs. Synthetic datasets[38, 2, 9] use physics engines to render egocentric images but suffer from a domain gap with real images due to the complexity of human motion and environments. Meanwhile, in a vision-based setting the lower body may be occluded, and some body parts may fall outside the field of view (FOV) depending on the body pose. IMU-based datasets avoid occlusion but suffer from drift over time and the ill-posedness of sparse observations. Besides, existing methods[17, 52, 10] typically use synthetic IMU data from AMASS[28], which may not accurately reflect real-world noise and drift. Some datasets[39, 15] provide real 3-Degrees-of-Freedom (3DoF, rotation) data from Xsens, while others[10] include 6DoF (rotation and position) data for the head and hands and 3DoF data for the lower legs, but these are small-scale and primarily used for evaluation. Recently, several large-scale multimodal datasets[27, 14] have been released, offering RGB images, upper-body IMU data, and motion narrations. However, their forward-facing cameras limit the egocentric view of the wearer's body, and the missing lower-body IMU signals can cause ambiguity.

Combining egocentric cameras and body-worn IMUs offers a promising multimodal solution due to their lightweight and flexible design. This configuration is also commonly found in VR scenarios. Our proposed EMHI dataset, as shown in Fig. 1, features a VR headset with two downward-sloping cameras for egocentric image capture, 6DoF head and hand tracking, and additional IMUs on an actual VR device for lower-leg 3DoF tracking. We use a markerless multi-view camera system for SMPL[25] ground-truth acquisition, with accuracy and consistency refined using IMU data and synchronization performed via OptiTrack. Furthermore, we propose a new baseline method, MEPoser, which integrates egocentric images and IMU data to perform real-time HPE on a standalone VR headset. The method employs a multimodal fusion encoder, a temporal feature encoder, and MLP-based regression heads to estimate SMPL body model parameters, effectively demonstrating the advantages of multimodal data fusion in enhancing pose accuracy and the value of our dataset. This approach paves the way for further research in egocentric HPE using multimodal inputs.

In summary, our work makes the following contributions:

  • We introduce EMHI, the first large-scale multimodal egocentric motion dataset captured on a real VR device, including stereo downward-sloping egocentric images, full-body IMU signals, and accurate human pose annotations.

  • We propose a baseline method MEPoser, which employs a multimodal fusion encoder, temporal feature encoder, and MLP-based regression heads to perform real-time HPE on a standalone HMD.

  • The experimental results demonstrate the soundness of our multimodal setting and the effectiveness of EMHI in addressing egocentric HPE.

2 Related Work

Table 1: Comparison of existing egocentric motion datasets and EMHI (Ours).

| Dataset | Real/Synth. | Egocentric Vision | Inertial | SMPL(-X) | Actions | Subjects | Frames |
| Mo2Cap2 | Synth. | Monocular, Downward-Facing | - | - | 3K | 700 | 530K |
| EgoPW | Real | Monocular, Downward-Facing | - | - | 20 | 10 | 318K |
| EgoCap | Real | Binocular, Downward-Facing | - | - | - | 8 | 30K |
| UnrealEgo | Synth. | Binocular, Downward-Facing | - | - | 30 | 17 | 450K |
| DIP-IMU | Real | - | Full-Body 3DoF×6 | ✓ | 15 | 10 | 330K |
| FreeDancing | Real | - | Full-Body 6DoF×3, 3DoF×3 | ✓ | - | 8 | 532.8K |
| Nymeria | Real | Binocular, Forward-Facing | Upper-Body 6DoF×3 | ✓ | 20 | 264 | 260M |
| Ego-Exo4D (Ego Pose) | Real | Binocular, Forward-Facing | Head 6DoF×1 | - | - | - | 9.6M |
| Ours | Real | Binocular, Downward-Sloping | Full-Body 6DoF×3, 3DoF×2 | ✓ | 39 | 58 | 3.07M |

2.1 Egocentric Motion Dataset

As shown in Tab. 1, existing egocentric motion datasets can be divided into vision-based, IMU-based, and multimodal datasets depending on the input modality.

Unlike previous HPE datasets captured by third-view cameras[16, 3, 19, 50, 5], vision-based egocentric motion datasets provide first-person-perspective images from head-mounted monocular or binocular cameras, along with annotations of the wearer's poses. Mo2Cap2[47] and xR-EgoPose[38] made early efforts to build synthetic monocular datasets with a downward-facing fisheye camera. EgoPW[43] proposed the first in-the-wild dataset and was followed by EgoGTA[44] and ECHA[23] with the same camera setting. EgoWholeBody[45] is the latest synthetic dataset, providing high-quality images and SMPL-X annotations. EgoCap[35] is a pioneering binocular dataset captured by helmet-mounted stereo cameras, containing 30K frames recorded in a lab environment. EgoGlass[51] optimized the binocular setup with two front-facing cameras mounted on the glasses frames. To relieve the dataset scale limitation, UnrealEgo[1] proposed a large-scale and highly realistic stereo synthetic dataset with 450K stereo views, which was extended to 1.25M views in UnrealEgo2[2]. SynthEgo[9] extended synthetic datasets with more identities and environments, annotated with SMPL-H for better body shape descriptions.

Sparse-IMU-based datasets provide an alternative for this problem. AMASS[28] can be used to generate large-scale synthetic IMU data. TotalCapture[39] and DIP-IMU[15] offered real IMU data captured by Xsens, with SMPL pose annotations obtained by a marker-based optical mocap system and an IMU-based method[41], respectively. PICO-FreeDancing[10] provided sparse IMU data with SMPL-format ground truth fitted using OptiTrack data. However, real IMU-based datasets are generally limited in scale.

Multimodal datasets[11, 13, 8, 34] have attracted significant attention in recent years due to the complementarity of different data modalities. Ego-Exo4D[14], Nymeria[27], and SimXR[26] captured real-world images with Project Aria glasses[36], along with upper-body IMU data. Ego-Exo4D provides up to 9.6M image frames with annotations of body and hand joint positions. Nymeria further offers SMPL-format data derived from Xsens mocap suits, though with limited clothing diversity of the captured bodies. However, in these datasets, either the forward-facing perspective restricts the perception range of the wearer's body, or downward-sloping perspectives and sparse full-body IMU signals have not been integrated on actual VR/AR devices.

[Figure 2]

2.2 Egocentric Human Pose Estimation Methods

Existing egocentric HPE methods primarily utilize single-modal input. Vision-based methods have been widely investigated[35, 43, 44, 22, 23, 20, 30, 29, 9]. Wang et al.[42] introduced a spatiotemporal optimization method for single-view egocentric sequences to obtain 3D skeleton results. Recently, Wang et al.[45] proposed an egocentric motion capture method that combines a vision transformer for feature extraction from undistorted image patches with diffusion-based motion priors for pose refinement. However, 3D pose estimation from a single image remains challenging due to the lack of depth information. To address this, UnrealEgo[1] introduced stereo egocentric skeleton tracking by integrating a weight-shared encoder for stereo heatmap generation and a multi-branch autoencoder for 3D pose prediction. Akada et al.[2] further enhanced this with a transformer-based model utilizing 3D scene information and temporal features. Despite these advancements, challenges with invisible body parts due to self-occlusion and out-of-view joints persist.

Methods using sparse tracking signals from body-worn IMUs have also garnered significant attention[46, 40, 15, 48, 49, 18, 12, 7]. In egocentric VR and AR scenarios, there are inherently three 6DoF tracking points for the head and hands, with the option to add two additional 3DoF IMUs on the legs. AvatarPoser[17] proposed a global pose prediction framework combining transformer structures with inverse kinematics (IK) optimization, while AvatarJLM[52] introduced a two-stage approach that models joint-level features and uses them as spatiotemporal transformer tokens to achieve smooth motion capture. HMD-Poser[10] integrated these inputs, presenting a lightweight temporal-spatial learning method for full-body global 6DoF motion recovery. However, IMU data faces challenges such as drift and sparsity. Many current methods rely on synthetic IMU data from AMASS[28], which often fails to capture real-world noise and drift accurately, leading to overfitting as models are not exposed to the complexities and imperfections of real-world conditions.

3 EMHI Dataset

EMHI is a multimodal egocentric motion dataset that contains 3.07M synchronized data pairs organized into 885 sequences recorded at 30 FPS. Each data pair contains stereo egocentric images (640×480), data from five IMUs, and the corresponding 3D SMPL pose and 2D keypoints. The dataset is captured from 58 subjects, evenly split into 29 males and 29 females, with a diverse range of body shapes. Each subject wears their daily clothing during data collection to ensure a wide variety of natural appearances. We record 39 common actions of users experiencing games and social applications in VR scenarios and categorize them into upper-body, lower-body, and full-body motions. Additionally, the dataset is captured under three different environmental lighting conditions, dim light, natural light, and bright light, for environment diversity.

3.1 Data Capture System

3.1.1 Hardware

As shown in Fig. 2, the overall hardware consists of three subsystems: the EgoSensorKit system to collect sensor data, with a PICO4 headset, two hand controllers, and two leg trackers; the Kinect system to obtain SMPL annotations, with 8 cameras recording simultaneously from outside-in viewpoints; and the OptiTrack system for spatiotemporal synchronization between the above two systems, with Optical Rigid Bodies (ORBs) mounted on the VR headset and all Kinect cameras, allowing all cameras to move during capture.

3.1.2 Temporal Synchronization

The Kinect and OptiTrack systems rely on a signal transmitter device to trigger simultaneously, ensuring their inter-frame alignment. The EgoSensorKit and OptiTrack can also be synchronized offline using the angular velocities of the headset IMU and its ORB, following the motion correlation method in[32]. Finally, the data frames of the Kinect and EgoSensorKit systems are aligned via OptiTrack as a bridge. As the recorded frame rate is 30 Hz, the maximum synchronization deviation can reach 16.5 ms, which may be noticeable during fast motion. The annotations are therefore post-processed with linear interpolation to better align with the EgoSensorKit's timestamps.
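
As a concrete illustration of this offline step, the sketch below estimates the clock offset between the headset IMU and its ORB by maximizing the cross-correlation of their angular-velocity magnitudes. The function name, resampling rate, and search window are our own assumptions rather than the exact procedure of [32].

```python
import numpy as np

def estimate_time_offset(gyro_imu, gyro_orb, rate_hz=500.0, max_lag_s=1.0):
    """Estimate the clock offset between two angular-velocity streams by
    maximizing their cross-correlation (motion-correlation alignment).

    gyro_imu, gyro_orb: (N, 3) angular velocities resampled to a common rate.
    Returns the offset in seconds to add to the second stream's timestamps.
    """
    a = np.linalg.norm(gyro_imu, axis=1)
    b = np.linalg.norm(gyro_orb, axis=1)
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    max_lag = int(max_lag_s * rate_hz)
    lags = np.arange(-max_lag, max_lag + 1)
    scores = [np.dot(a[max(0, l):len(a) + min(0, l)],
                     b[max(0, -l):len(b) + min(0, -l)]) for l in lags]
    return lags[int(np.argmax(scores))] / rate_hz
```

The recovered offset can then be applied before interpolating the annotations onto the EgoSensorKit timestamps (e.g., with np.interp per channel).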

3.1.3 Spatial Alignment

The 6DoF pose of the headset (IMU) in the OptiTrack coordinate system, $T_h^o$, can be obtained as $T_h^o = T_{rb}^o T_h^{rb}$, where $T_h^{rb}$ is a pre-calibrated rigid transformation between the IMU sensor in the headset and its ORB, and the ORB's 6DoF pose $T_{rb}^o$ is tracked by OptiTrack. Similarly, the extrinsic parameters of each Kinect RGB camera in the OptiTrack coordinate system, $T_k^o$, can be determined using the same method. Then, the spatial transformation between the Kinect cameras and the headset can be obtained as $T_k^h = (T_h^o)^{-1} T_k^o$. Finally, the transformation matrix between the Kinect cameras and the egocentric cameras can be computed as $T_k^c = T_h^c T_k^h$, where $T_h^c$ is likewise a constant rigid transformation between the headset (IMU) and its egocentric cameras.
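
The chain of transforms can be written out directly; the sketch below assumes each $T_a^b$ is stored as a 4×4 homogeneous matrix and that composition follows the right-to-left order of the equations above.

```python
import numpy as np

def kinect_to_egocam(T_rb_o, T_h_rb, T_k_o, T_h_c):
    """Compose the calibrated transforms described above.
    All inputs are 4x4 homogeneous matrices; T_a_b corresponds to the paper's
    T_a^b (a = subscript frame, b = superscript frame)."""
    T_h_o = T_rb_o @ T_h_rb               # headset IMU in OptiTrack coordinates
    T_k_h = np.linalg.inv(T_h_o) @ T_k_o  # Kinect camera in headset coordinates
    T_k_c = T_h_c @ T_k_h                 # Kinect camera in egocentric camera coordinates
    return T_k_c
```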

3.2 Ground-Truth Acquisition

3.2.1 Keypoints Annotation

We use HRNet[37] to detect 2D keypoints in the multi-view Kinect RGB images in the Body25 format[6]. Then, we follow HuMMan[5] to derive 3D keypoint annotations $P_{3D}$ by triangulation with the camera parameters obtained in the spatial alignment, in which we also impose smoothness and bone-length constraints on $P_{3D}$ to reduce temporal jitter and improve human shape consistency.
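
For reference, a plain DLT triangulation of a single joint from the calibrated Kinect views could look as follows; the smoothness and bone-length constraints described above are left out, and the projection-matrix convention is an assumption.

```python
import numpy as np

def triangulate_point(proj_mats, points_2d):
    """Linear (DLT) triangulation of one joint from multiple calibrated views.

    proj_mats: list of 3x4 projection matrices (K [R|t]) for each Kinect view.
    points_2d: list of (x, y) pixel detections of the same joint.
    Returns the 3D point in world coordinates.
    """
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```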

3.2.2 SMPL Fitting

Multi-view SMPL fitting is a well-studied problem, typically solved by combining 3D joint, prior, smoothness, and shape regularization errors. However, due to the occlusion of the facial and hand areas by the EgoSensorKit (HMD and controllers) and the limited resolution of the Kinect cameras, it is challenging to ensure accurate detection of the corresponding joints. This leads to implausible SMPL fits for the wrist and head joint rotations. To tackle this problem, we incorporate the global rotations of the head $R_{head}$ and wrist joints $R_{wrist}$, which are transferred from the collected IMU rotations of the hand controllers $R_{controller}$ and the headset $R_{headset}$ via $R_{head} = R_{headset}^{head} R_{headset}$ and $R_{wrist} = R_{controller}^{wrist} R_{controller}$, where $R_{controller}^{wrist}$ and $R_{headset}^{head}$ are constant transformation matrices obtained statistically from a large amount of data collected in the standard sensor-wearing configuration. Moreover, we leverage the calibrated leg motion tracker data, which represents the knee joint rotation $R_{knee}$, to constrain the lower-leg pose. With the keypoint annotations and the five joint rotations obtained above, we fit the SMPL parameters by minimizing the following energy function:

E(\theta, \beta) = \lambda_{rot} E_{rot} + \lambda_{joint} E_{joint} + \lambda_{prior} E_{prior} + \lambda_{smooth} E_{smooth} + \lambda_{reg} E_{reg},    (1)

where $\theta \in \mathbb{R}^{75}$ and $\beta \in \mathbb{R}^{10}$ are the optimized SMPL pose and shape parameters, and $\lambda_{*}$ are balance weights (see Appendix). For occluded joints, we introduce the rotation term to encourage pose consistency with the transferred IMU data as follows:

E_{rot} = \sum_{j} \| \mathcal{F}(\theta)_{j} - R_{j} \|,    (2)

where $j \in \{head, wrist, knee\}$ and $\mathcal{F}$ denotes the forward kinematics (FK) function that returns the global rotation of joint $j$. The other energy terms follow previous works[4, 5, 31]: $E_{joint}$ minimizes the 3D distance between $P_{3D}$ and the regressed SMPL joints, $E_{prior}(\theta)$ is the VPoser prior from SMPLify-X, $E_{smooth}$ keeps the pose tracking smooth, and the shape regularization term $E_{reg}$ penalizes large shape variance. With the spatial alignment results, the fitted SMPL bodies can be transformed from world space to the egocentric camera coordinate system to obtain 2D pose annotations on the egocentric images.
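
To make the objective concrete, a minimal PyTorch sketch of Eq. (1) is given below. Here `smpl_forward`, `fk_global_rot`, and `vposer_encode` are placeholders for the SMPL layer, the FK routine, and a VPoser encoder, and the pose-vector layout (root transform followed by body joint angles) is an assumption.

```python
import torch

def fitting_energy(theta, beta, p3d, imu_rot, smpl_forward, fk_global_rot,
                   vposer_encode, lambdas):
    """Weighted SMPL fitting objective (Eq. 1), sketched for one sequence.

    theta: (T, 75) pose + root transform, beta: (10,) shape,
    p3d: (T, J, 3) triangulated joints, imu_rot: dict of (T, 3, 3) target rotations.
    The callables wrap an SMPL layer, forward kinematics, and a VPoser encoder.
    """
    joints = smpl_forward(theta, beta)                   # (T, J, 3) regressed joints
    e_joint = ((joints - p3d) ** 2).sum(-1).mean()

    e_rot = 0.0
    for name, target in imu_rot.items():                 # head, wrists, knees
        e_rot = e_rot + ((fk_global_rot(theta, beta, name) - target) ** 2).mean()

    z = vposer_encode(theta[:, 3:66])                    # body-pose latent (assumed layout)
    e_prior = (z ** 2).mean()

    e_smooth = ((theta[1:] - theta[:-1]) ** 2).mean()    # temporal smoothness
    e_reg = (beta ** 2).mean()                           # shape regularization

    return (lambdas["rot"] * e_rot + lambdas["joint"] * e_joint
            + lambdas["prior"] * e_prior + lambdas["smooth"] * e_smooth
            + lambdas["reg"] * e_reg)
```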

4 A New Baseline Method: MEPoser

[Figure 3]

To demonstrate the significance of the EMHI dataset and to inspire new designs for multimodal egocentric HPE, we introduce a new baseline method called Multi-modal Egocentric Pose Estimator (MEPoser). MEPoser takes multimodal inputs, including stereo egocentric images and inertial measurements, to extract multimodal representations and perform real-time HPE on a standalone HMD. As shown in Fig. 3, MEPoser consists of three components. (1) A multimodal fusion encoder extracts per-frame representations from the multimodal input data. (2) A temporal feature encoder composed of long short-term memory (LSTM) modules and feed-forward networks generates latent variables that incorporate temporal information from past frames. (3) With the temporally aggregated multimodal features, two MLP-based (multi-layer perceptron) heads regress the pose and shape parameters of the SMPL model, respectively.

4.1 Multimodal Fusion Encoder

The multimodal encoder first applies separate feature encoders to the different modalities, i.e., two weight-sharing CNN backbones for the images and an MLP network for the IMU data. To make MEPoser run in real time on the HMD, we use a lightweight RegNetY-400MF[33] backbone, which takes the stereo images $\{Img_{left}^{t_i}, Img_{right}^{t_i}\} \in \mathbb{R}^{640 \times 480 \times 1}$ as inputs and generates 2D image features $\{F_{left}^{t_i}, F_{right}^{t_i}\} \in \mathbb{R}^{80 \times 60 \times 256}$. These features are then concatenated and forwarded to a few convolution layers to infer a set of heatmaps $\{H_{left}^{t_i}, H_{right}^{t_i}\} \in \mathbb{R}^{80 \times 60 \times J}$. Here we predict the 22 body joints of SMPL, i.e., $J = 22$. To train the RegNetY-400MF backbone, we compute the binary cross-entropy with logits loss (BCEWithLogitsLoss) $\mathcal{L}_{heatmap}^{2D}$ between the GT heatmaps and the estimated 2D heatmaps. The predicted heatmaps are then flattened and forwarded to an MLP network to obtain the image feature. Having obtained the IMU and image features, we additionally introduce a 3D module that estimates the 3D joint positions in both the local camera coordinate system and the global world coordinate system to further boost pose estimation performance.
Specifically, given the image features from the stereo heatmaps, an MLP network first estimates the 3D joint positions in the local camera coordinate system, $\hat{P}_{local} \in \mathbb{R}^{J \times 3}$. These joints are then transformed to the global SMPL coordinate system, $\hat{P}_{global} \in \mathbb{R}^{J \times 3}$, using the offline calibration results and the headset's online 6DoF data. $\hat{P}_{local}$ and $\hat{P}_{global}$ are used to compute the 3D joint losses $\mathcal{L}_{joints}^{Local}$ and $\mathcal{L}_{joints}^{Global}$, respectively. Next, $\hat{P}_{global}$ is flattened and forwarded to an MLP network to obtain the 3D joint features. Finally, the IMU, image, and 3D joint features are concatenated to form the multimodal fused feature $f^{t_i}$.
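
A condensed PyTorch sketch of this encoder is given below. The convolutional backbone is a small stand-in for RegNetY-400MF, the IMU feature dimension and the `local_to_global` callable are assumptions, and layer sizes are illustrative rather than the exact configuration.

```python
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    """Per-frame fusion of stereo-image, IMU, and 3D-joint features (Sec. 4.1)."""

    def __init__(self, num_joints=22, imu_dim=60, feat_dim=256):
        super().__init__()
        self.num_joints = num_joints
        # Stand-in for the weight-sharing RegNetY-400MF backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.heatmap_head = nn.Conv2d(2 * 256, 2 * num_joints, 1)   # left + right heatmaps
        self.img_mlp = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.imu_mlp = nn.Sequential(nn.Linear(imu_dim, feat_dim), nn.ReLU())
        self.local3d_head = nn.Sequential(nn.Flatten(), nn.LazyLinear(512), nn.ReLU(),
                                          nn.Linear(512, num_joints * 3))
        self.joint_mlp = nn.Sequential(nn.Linear(num_joints * 3, feat_dim), nn.ReLU())

    def forward(self, img_left, img_right, imu, local_to_global):
        # img_*: (B, 1, 480, 640), imu: (B, imu_dim)
        f = torch.cat([self.backbone(img_left), self.backbone(img_right)], dim=1)
        heatmaps = self.heatmap_head(f)                     # stereo joint heatmaps
        img_feat = self.img_mlp(heatmaps)
        p_local = self.local3d_head(heatmaps).view(-1, self.num_joints, 3)
        p_global = local_to_global(p_local)                 # calibration + headset 6DoF
        joint_feat = self.joint_mlp(p_global.flatten(1))
        imu_feat = self.imu_mlp(imu)
        fused = torch.cat([imu_feat, img_feat, joint_feat], dim=-1)
        return fused, heatmaps, p_local, p_global
```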

4.2 Temporal Feature Encoder

As demonstrated in HMD-Poser[10], temporal correlation is key to tracking accurate human motion. However, the multimodal fused features $\{f^{t_i}\}$ are still temporally isolated. To solve this problem, Transformers and RNNs are adopted in existing methods. Although Transformer-based methods[52] have achieved state-of-the-art results in HPE, their computational costs are much higher than those of RNN-based methods. To ensure our method runs in real time on HMDs, we introduce a lightweight LSTM-based temporal feature encoder. Specifically, the encoder is composed of a stack of $N = 3$ identical blocks, each with two sub-layers: the first is an LSTM module that learns the temporal representation, and the second is a simple fully connected feed-forward network. We employ a residual connection followed by layer normalization.
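
A minimal sketch of one such block and the three-block stack is shown below; the feature dimension and the exact placement of the residual connections and layer normalization are assumptions based on the description above.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """One LSTM + feed-forward block with residual connections and layer norm."""
    def __init__(self, dim=768, hidden=768):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x, state=None):
        # x: (B, T, dim) fused per-frame features
        h, state = self.lstm(x, state)
        x = self.norm1(x + h)            # residual + layer norm
        x = self.norm2(x + self.ffn(x))
        return x, state

class TemporalEncoder(nn.Module):
    """Stack of N = 3 identical temporal blocks."""
    def __init__(self, dim=768, num_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(TemporalBlock(dim) for _ in range(num_blocks))

    def forward(self, x):
        for blk in self.blocks:
            x, _ = blk(x)
        return x
```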

4.3 SMPL Decoder

The SMPL decoder first adopts two regression heads to estimate the local pose parameters $\theta^{t_i}$ and the shape parameters $\beta^{t_i}$ of SMPL. Both regression heads are designed as 2-layer MLPs. It then uses an FK module to compute all joint positions $\hat{P}_{SMPL} \in \mathbb{R}^{J \times 3}$ from $\theta^{t_i}$, $\beta^{t_i}$, and the headset's online 6DoF data. We define the SMPL loss $\mathcal{L}_{SMPL}$ as a combination of a root orientation loss $\mathcal{L}_{ori}$, a local pose loss $\mathcal{L}_{lrot}$, a global pose loss $\mathcal{L}_{grot}$, and a joint position loss $\mathcal{L}_{joint}$. All of these losses are computed as the mean absolute error (L1 norm) between the predicted results and the ground-truth values.
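
Sketched in PyTorch, the decoder might look as follows; the 6D rotation parameterization, the hidden width, and the external `fk` callable are assumptions, as the text only specifies two 2-layer MLP heads followed by forward kinematics.

```python
import torch
import torch.nn as nn

class SMPLDecoder(nn.Module):
    """Regress SMPL pose/shape from the temporal feature, then run FK."""
    def __init__(self, dim=768, num_pose=22 * 6, num_shape=10):
        super().__init__()
        # A 6D rotation per joint is an assumption; the paper only states that
        # each head is a 2-layer MLP predicting SMPL pose/shape parameters.
        self.pose_head = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(),
                                       nn.Linear(512, num_pose))
        self.shape_head = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(),
                                        nn.Linear(512, num_shape))

    def forward(self, feat, fk, head_6dof):
        theta = self.pose_head(feat)         # local joint rotations
        beta = self.shape_head(feat)         # body shape
        joints = fk(theta, beta, head_6dof)  # forward kinematics -> (B, J, 3)
        return theta, beta, joints
```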

4.4 Training MEPoser

For the overall training loss, we combine a smoothness loss $\mathcal{L}_{smooth}$ with the above losses, i.e., the 2D heatmap loss $\mathcal{L}_{heatmap}^{2D}$, the 3D joint losses $\mathcal{L}_{joints}^{Local}$ and $\mathcal{L}_{joints}^{Global}$, and the SMPL loss $\mathcal{L}_{SMPL}$. The smoothness loss from HMD-Poser[10] is adopted to further enhance temporal smoothness:

\mathcal{L} = \lambda_{hp} \mathcal{L}_{heatmap}^{2D} + \lambda_{ljoints} \mathcal{L}_{joints}^{Local} + \lambda_{gjoints} \mathcal{L}_{joints}^{Global} + \lambda_{smpl} \mathcal{L}_{SMPL} + \lambda_{smooth} \mathcal{L}_{smooth},    (3)

where the $\lambda_{*}$ are balance weights (see Appendix).
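
A minimal sketch of how the total loss could be assembled is shown below, using the weight values reported in Appendix B. The tensor layout, the use of L1 for the joint terms, and the simplified stand-ins for the SMPL and smoothness losses are assumptions.

```python
import torch.nn.functional as F

def training_loss(pred, gt, w):
    """Overall MEPoser training loss (Eq. 3). `pred` / `gt` are dicts of batched
    sequence tensors; the SMPL and smooth terms are simplified stand-ins for the
    combination of orientation/rotation/joint terms and the HMD-Poser smoothness loss.
    """
    l_hp = F.binary_cross_entropy_with_logits(pred["heatmaps"], gt["heatmaps"])
    l_local = F.l1_loss(pred["p_local"], gt["p_local"])
    l_global = F.l1_loss(pred["p_global"], gt["p_global"])
    l_smpl = (F.l1_loss(pred["theta"], gt["theta"])
              + F.l1_loss(pred["joints"], gt["joints"]))
    # frame-to-frame difference of predicted joints, shape (B, T, J, 3)
    l_smooth = F.l1_loss(pred["joints"][:, 1:], pred["joints"][:, :-1])
    return (w["hp"] * l_hp + w["ljoints"] * l_local + w["gjoints"] * l_global
            + w["smpl"] * l_smpl + w["smooth"] * l_smooth)

# Weight values from Appendix B.
loss_weights = {"hp": 1.0, "ljoints": 1.0, "gjoints": 1.0, "smpl": 5.0, "smooth": 0.5}
```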

Table 2: Quantitative comparison with single-modal methods on the two EMHI test sets.

| Test Set | Method | MPJRE↓ | MPJPE↓ | PA-MPJPE↓ | UpperPE↓ | LowerPE↓ | RootPE↓ | Jitter↓ |
| Protocol 1 | UnrealEgo | - | 5.5 | 3.9 | 4.0 | 7.7 | 4.2 | 592.5 |
| Protocol 1 | HMD-Poser | 4.6 | 5.8 | 2.8 | 4.8 | 7.1 | 5.8 | 114.9 |
| Protocol 1 | MEPoser (Ours) | 4.1 | 3.7 | 2.5 | 2.7 | 5.1 | 3.2 | 161.8 |
| Protocol 2 | UnrealEgo | - | 6.4 | 4.3 | 4.6 | 8.9 | 5.0 | 610.5 |
| Protocol 2 | HMD-Poser | 4.9 | 7.0 | 3.4 | 5.2 | 9.7 | 7.2 | 165.7 |
| Protocol 2 | MEPoser (Ours) | 4.7 | 4.8 | 2.9 | 3.2 | 7.0 | 3.8 | 204.9 |

5 Experiment

5.1 Dataset Splitting

We split the dataset into three parts: one for training (70%) and two separate testing sets based on different protocols, as follows. The training set comprises 615 sequences captured by 38 individuals, covering 20 daily actions involving upper-body, lower-body, and full-body movements. For testing, Protocol 1 (16%) contains 141 sequences with the same set of actions but performed by 8 subjects who do not appear in the training set, to evaluate cross-subject generalization. Protocol 2 (14%) is designed to assess the model's effectiveness and robustness in more general scenarios, consisting of 129 sequences involving 19 unseen actions and 20 unseen subjects not present in the training set.

5.2 Comparison

Our multimodal dataset fills a gap in egocentric VR scenarios. To validate the dataset and the corresponding baseline method, we conducted the following comparison on our dataset between MEPoser and the latest single-modal methods: UnrealEgo[1], which takes stereo egocentric images as inputs, and HMD-Poser[10], which uses IMU observations from the HMD and two leg motion trackers.

The quantitative results in Tab. 2 show that MEPoser outperforms existing single-modal methods for egocentric HPE. Compared to UnrealEgo, our method reduces the MPJPE (Mean Per Joint Position Error, cm) by 32.7% and 25% under Protocol 1 and Protocol 2, respectively. Notably, MEPoser significantly enhances the smoothness of the estimation results by using a temporal LSTM structure. In comparison with HMD-Poser, MEPoser shows 36.2% and 31.4% reductions in MPJPE on the two test sets by incorporating the egocentric image features, along with a slight improvement in joint rotation accuracy according to the MPJRE (Mean Per Joint Rotation Error, in degrees) results. The results also demonstrate the improved generalizability of MEPoser across subjects and actions, validating the effectiveness of the multimodal setting and the value of our dataset in solving egocentric HPE.

Qualitative comparisons on various test sequences, featuring different actions and environments, are shown in Fig. 4. MEPoser relieves the limitations of the single-modal methods by exploiting the complementarity between the vision and IMU signals. Specifically, our method can deal with issues like self-occlusion and out-of-FOV body parts in egocentric images by utilizing IMU features and temporal information to obtain more accurate pose results. Additionally, our approach mitigates the sparsity, drift, and ambiguity of IMU signals during slow motions by incorporating the visible body joints in egocentric images.

[Figure 4]

5.3 Annotation Cross-validation

To further assess the accuracy of our dataset annotations, we randomly captured 9 motion sequences for cross-validation by recording simultaneously with both our system and an optical marker-based motion capture system. The subject wears a suit fitted with reflective markers tracked by the OptiTrack system. Marker-based SMPL parameters are then derived from the MoCap data using MoSh++[24], with a temporal filter applied to reduce jitter. As shown in Tab. 3, the low error metrics and variance demonstrate the robustness of our annotation pipeline and the high quality of the dataset. Beyond this validation, every sequence in our dataset has been manually inspected to eliminate data with erroneous annotations.

Table 3: Cross-validation MPJPE (cm) of our annotations against marker-based SMPL fitting on the 9 captured sequences.

| Action | MPJPE | Action | MPJPE |
| walking | 2.61 | kicking | 2.12 |
| hand-waving | 2.72 | lunge | 2.23 |
| taichi | 2.98 | boxing | 2.73 |
| shuttlecock-kicking | 2.65 | dancing | 2.49 |
| marching-in-place | 2.08 | Average | 2.51 |

6 Conclusion

In this paper, we introduce EMHI, a novel multimodal human motion dataset designed for egocentric HPE. It includes synchronized egocentric images and IMU signals from a real VR product suite, with SMPL annotations in the same world coordinate system. To enhance generalization in real-world applications, we collected a diverse range of data across various actions and individuals. We also present MEPoser, a new baseline HPE method that combines image and IMU inputs for real-time HPE on a standalone HMD. MEPoser effectively demonstrates the benefits of multimodal fusion, improving accuracy and addressing the limitations of previous single-modal methods. This approach serves as an initial exploration, inviting further research into egocentric HPE with multimodal data. We believe releasing this dataset and method will accelerate the practical implementation of HPE with body-worn sensors in future VR/AR products.

Table 4: Actions in the training set / Testing Protocol 1 and in Testing Protocol 2 (number of sequences in parentheses).

| Category | Training / Testing Protocol 1 | Testing Protocol 2 |
| Upper-body | chest expansion(39), T/A pose(40), cutting(39), clicking(39) | hand waving(12), arm swing(2), raising hand(5), random waving(3) |
| Lower-body | standing(40), high stepping(42), sitting(15), side kicking(41), forward kicking(38), shuttlecock kicking(37), lunge(36), squat(38) | backward kicking(6), backward stepping(6), leapfrog(12), standing up(7), kneeling(9), random kicking(4), leg lifting(6) |
| Whole-body | marching in place(38), playing basketball(40), walking(33), running(40), boxing(39), dancing(41), fencing(42), taichi(39) | f/b cross jumping(6), l/r cross jumping(6), playing golf(3), squat and stand(7), dancing on dance pad(5), free-style(6), playing table tennis(12), playing football(12) |

Appendix

Appendix A Details of GT Fitting

A.1 Energy Term Details.

To provide a detailed understanding of the SMPL fitting process outlined in the main manuscript, we elaborate on the energy terms minimized during optimization in this supplementary material. The fitting process involves minimizing a weighted sum of various energy terms [4, 5, 31], each contributing to the accuracy and plausibility of the final SMPL parameters. Below, we define each energy term included in this equation:

E(\theta, \beta) = \lambda_{rot} E_{rot} + \lambda_{joint} E_{joint} + \lambda_{prior} E_{prior} + \lambda_{smooth} E_{smooth} + \lambda_{reg} E_{reg}.    (4)

1) Rotation Alignment Term: This term enforces the alignment between the SMPL model's joint rotations and the rotations obtained from external sensors (IMU data). It alleviates the HMD-occlusion problem and encourages pose consistency with the IMU data:

E_{rot} = \sum_{j} \| \mathcal{F}(\theta)_{j} - R_{j} \|^{2},    (5)

where $j \in \{head, wrist, knee\}$ and $\mathcal{F}$ denotes the forward kinematics (FK) function that returns the global rotation of joint $j$.

2) Joint Term: This term measures the discrepancy between the 3D joint locations of the SMPL model and those detected in the multi-view setup. It ensures that the SMPL model closely follows the observed joint positions and is formulated as:

E_{joint} = \sum_{i} \| \Phi(\theta, \beta)_{i} - P_{3D,i} \|^{2},    (6)

where $\Phi(\cdot)$ returns the SMPL joints, $\Phi(\theta, \beta)_{i}$ is the 3D position of the $i$-th joint, and $P_{3D,i}$ is the corresponding joint detected from the multi-view setup.

3) Prior Term:

E_{prior} = \| \mathbf{z}(\theta) - \mathbf{z}_{prior} \|^{2},    (7)

where $\mathbf{z}(\theta)$ is the pose latent vector generated by the VPoser model[31] and $\mathbf{z}_{prior}$ denotes the center of the Gaussian distribution in the latent space.

4) Smooth Term: This term enforces temporal smoothness in the estimated joint rotations to prevent abrupt changes in pose, which could lead to unrealistic motion artifacts. It can be expressed as:

E_{smooth} = \sum_{t} \sum_{i} \| \theta_{t,i} - \theta_{t-1,i} \|^{2},    (8)

where $\theta_{t,i}$ represents the SMPL pose parameters of joint $i$ at time $t$.

5) Shape Regularization Term: This term regularizes the SMPL shape parameters to ensure that the estimated body shape remains within plausible human body shapes. It penalizes deviations from the mean shape and large shape variance:

E_{reg} = \| \beta \|^{2},    (9)

where $\beta$ are the SMPL shape parameters representing variations in body shape.

A.2 Optimization Process

The optimization process begins after triangulating all frames. We start by loading the SMPL model and initializing the SMPL parameters. Then we fit the SMPL shape using 3D limb data and initialize body rotation and translation parameters. Next, we refine the poses by optimizing pose parameters, including global rotation and translation. This optimization minimizes the above energy function. Finally, we get the estimated SMPL parameters.
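
A condensed sketch of this two-stage loop, reusing the hypothetical `fitting_energy` sketched earlier for Eq. (1), is shown below; the optimizer choice, learning rates, and iteration counts are assumptions rather than the exact settings used for the dataset.

```python
import torch

def fit_sequence(p3d, imu_rot, smpl_forward, fk_global_rot, vposer_encode,
                 lambdas, num_frames, iters=(200, 400)):
    """Two-stage SMPL fitting: shape initialization, then full pose refinement."""
    theta = torch.zeros(num_frames, 75, requires_grad=True)  # per-frame pose + root
    beta = torch.zeros(10, requires_grad=True)               # shared shape

    # Stage 1: fit the body shape from the triangulated 3D keypoints.
    opt = torch.optim.Adam([beta], lr=1e-2)
    for _ in range(iters[0]):
        opt.zero_grad()
        loss = fitting_energy(theta, beta, p3d, imu_rot,
                              smpl_forward, fk_global_rot, vposer_encode, lambdas)
        loss.backward()
        opt.step()

    # Stage 2: refine pose parameters (global rotation, translation, joint angles).
    opt = torch.optim.Adam([theta, beta], lr=1e-2)
    for _ in range(iters[1]):
        opt.zero_grad()
        loss = fitting_energy(theta, beta, p3d, imu_rot,
                              smpl_forward, fk_global_rot, vposer_encode, lambdas)
        loss.backward()
        opt.step()
    return theta.detach(), beta.detach()
```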

A.3 Balance Weights

The balance weights ($\lambda$) used in the optimization process are crucial for controlling the influence of each energy term. The specific weights applied are as follows:

- $\lambda_{rot} = 1.0$ for the rotation alignment term.

- $\lambda_{joint} = 5.0$ for the 3D joint distance term.

- $\lambda_{prior} = 0.01$ for regularizing the pose parameters.

- $\lambda_{smooth} = 1.0$ for the pose smoothness term.

- $\lambda_{reg} = 0.01$ for regularizing the shape parameters.
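
For convenience, these are the same values carried by the `lambdas` dictionary consumed by the fitting sketches above (the key names are ours):

```python
# Balance weights for the SMPL fitting energy (Sec. A.3).
fitting_lambdas = {
    "rot": 1.0,      # rotation alignment term
    "joint": 5.0,    # 3D joint distance term
    "prior": 0.01,   # pose prior term
    "smooth": 1.0,   # pose smoothness term
    "reg": 0.01,     # shape regularization term
}
```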

Appendix B Training Details of the Baseline Method

In our training process, the MEPoser model was trained for a total of 20 epochs using the Adam optimizer[21], with an initial learning rate of $1 \times 10^{-3}$. The learning rate was decayed by a factor of 0.1 after 7 and 14 epochs to ensure stable convergence. We used a batch size of 32 during training. The training was conducted on a machine with NVIDIA V100 GPUs, and the entire training process took approximately 72 hours.

The loss function weights were carefully tuned based on validation performance. Specifically, we set the weight of the 2D heatmap loss $\lambda_{hp}$ to 1.0, the weight of the local 3D joint loss $\lambda_{ljoints}$ to 1.0, and the weight of the global 3D joint loss $\lambda_{gjoints}$ to 1.0. The SMPL loss weight $\lambda_{smpl}$ was set to 5.0, and the temporal smoothness loss weight $\lambda_{smooth}$ was set to 0.5 to maintain the temporal coherence of the predicted poses. Our approach represents an initial attempt at multimodal egocentric pose estimation, and we will open-source our baseline method to inspire better designs in this area.
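
The schedule translates directly into standard PyTorch components; the sketch below reflects the reported settings, with the helper name being our own.

```python
import torch

def build_training_schedule(model):
    """Adam with lr 1e-3, decayed by 0.1 after epochs 7 and 14 (Appendix B)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[7, 14], gamma=0.1)
    return optimizer, scheduler

# Training runs for 20 epochs with batch size 32; scheduler.step() is called once per epoch.
```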

Appendix C Hardware Details

As illustrated in Fig. 5, the OptiTrack system and the multi-view Kinect system are physically connected to the host PC via cables, while the EgoSensorKit operates wirelessly, offering the subject greater flexibility to perform a range of actions. During recording, two clock systems are employed: the Host-PC-Clock for Kinect and OptiTrack data, and the Headset-Clock for EgoSensorKit data. Synchronization between these two systems is achieved using the method described in the Temporal Synchronization section of the main paper.

[Figure 5]

Eight Azure Kinect cameras are used to capture multiple third-view RGB images. These cameras are connected in a daisy-chain configuration using audio cables to ensure temporal synchronization, with our tests showing a time offset of less than 5 microseconds between the trigger signals of the first and last cameras. To ensure full-body visibility of the subject, the cameras are positioned evenly in eight directions around the subject, ensuring that any given point is visible from at least two cameras. Given the bandwidth demands of transmitting data from eight cameras, we set the resolution to 1280×720 and the frame rate to 30 Hz, ensuring stable data transmission during acquisition. The activity area for the subject is approximately 1 meter in diameter, with the Kinect cameras positioned around 2 meters from the center of this area.

Appendix D Dataset

D.1 Dataset Splitting Details

For dataset splitting, we provide the specific actions under the different protocols, as shown in Tab. 4.

D.2 Dataset Visualization

Fig. 6 presents additional visualization results from our dataset, illustrating the qualitative accuracy of our annotations. The alignment of the projected SMPL mesh with the subject in the third-view images highlights the precision of our SMPL fitting. Furthermore, the 2D skeleton annotations on egocentric images confirm both the spatiotemporal accuracy and the reliability of the 2D annotations.

[Figure 6]

Appendix E Qualitative Results for Cross-validation

In addition to the quantitative cross-validation in the main manuscript, we also show qualitative results to further validate our dataset annotations. As shown in Fig. 7, we visually compare the poses produced by our annotation pipeline (red skeletons) against the poses obtained from the optical marker-based motion capture system (green skeletons). The visual comparisons across multiple frames demonstrate that our annotation pipeline accurately captures the subject's motion, with minimal deviations from the marker-based results. These results provide qualitative evidence supporting the robustness and high quality of our dataset annotations.

[Figure 7]
