Zhen Fan, Peng Dai, Zhuo Su, Xu Gao, Zheng Lv,
Jiarui Zhang, Tianyuan Du, Guidong Wang, Yang Zhang
Equal contribution Corresponding author
PICO
Abstract
Egocentric human pose estimation (HPE) using wearable sensors is essential for VR/AR applications. Most methods rely solely on either egocentric-view images or sparse Inertial Measurement Unit (IMU) signals, leading to inaccuracies due to self-occlusion in images or the sparseness and drift of inertial sensors. Most importantly, the lack of real-world datasets containing both modalities is a major obstacle to progress in this field. To overcome this barrier, we propose EMHI, a multimodal Egocentric human Motion dataset with Head-Mounted Display (HMD) and body-worn IMUs, with all data collected using a real VR product suite. Specifically, EMHI provides synchronized stereo images from downward-sloping cameras on the headset and IMU data from body-worn sensors, along with pose annotations in SMPL format. The dataset consists of 885 sequences captured by 58 subjects performing 39 actions, totaling about 28.5 hours of recording. We evaluate the annotations by comparing them with optical marker-based SMPL fitting results. To substantiate the reliability of our dataset, we introduce MEPoser, a new baseline method for multimodal egocentric HPE, which employs a multimodal fusion encoder, a temporal feature encoder, and MLP-based regression heads. The experiments on EMHI show that MEPoser outperforms existing single-modal methods and demonstrates the value of our dataset in solving the problem of egocentric HPE. We believe the release of EMHI and the method could advance research on egocentric HPE and expedite the practical implementation of this technology in VR/AR products.
1 Introduction
Egocentric human pose estimation (HPE) has gained significant attention in computer vision, driven by the demand for accurate motion tracking in immersive VR/AR environments. Unlike traditional exocentric HPE which relies on external sensors, egocentric HPE employs body-worn sensors such as egocentric cameras or sparse IMUs. Although rapid progress has been made in this field, there remain challenges in obtaining accurate full-body poses from single-modal data due to issues like 1) self-occlusion and viewpoint variations in egocentric vision; and 2) sparsity and drifting of IMU data. Most importantly, the lack of real-world multimodal training data poses the most significant challenge.
Previous works[35, 47] introduced egocentric datasets using experimental fisheye camera setups to capture images and annotate 3D joints. However, these setups are impractical for real VR/AR products, which need compact, lightweight designs. Synthetic datasets[38, 2, 9] use physics engines for egocentric image rendering but suffer from a domain gap with real images due to the complexity of human motion and environments. Meanwhile, the lower body may be occluded in a vision-based setting, and some body parts may fall outside the field of view (FOV) depending on the body pose. IMU-based datasets avoid occlusion but suffer from drift over time and ill-posed problems from sparse observations. Besides, existing methods[17, 52, 10] typically use synthetic IMU data from AMASS[28], which may not accurately reflect real-world noise and drift. Some datasets[39, 15] provide real 3-Degrees-of-Freedom (3DoF, rotation-only) data from Xsens, while others[10] include 6DoF (rotation and position) data for the head and hands and 3DoF data for the lower legs, but these are small-scale and primarily used for evaluation. Recently, several large-scale multimodal datasets[27, 14] have been released, offering RGB images, upper-body IMU data, and motion narrations. However, their forward-facing cameras limit the egocentric view of the body, and the missing lower-body IMU signals can cause ambiguity.
Combining egocentric cameras and body-worn IMUs offers a promising multimodal solution due to their lightweight and flexible design. This configuration is also common in VR scenarios. Our proposed EMHI dataset, as shown in Fig. 1, features a VR headset with two downward-sloping cameras for egocentric image capture, 6DoF head and hand tracking, and additional IMUs for lower-leg 3DoF tracking, all on an actual VR device. We use a markerless multi-view camera system to acquire SMPL[25] ground truth, refined for accuracy and consistency using the IMU data and synchronized via OptiTrack. Furthermore, we propose a new baseline method, MEPoser, which integrates egocentric images and IMU data to perform real-time HPE on a standalone VR headset. The method employs a multimodal fusion encoder, a temporal feature encoder, and MLP-based regression heads to estimate SMPL body model parameters, effectively demonstrating the advantage of multimodal data fusion in enhancing pose accuracy and the value of our dataset. This approach paves the way for further research on egocentric HPE with multimodal inputs.
In summary, our work makes the following contributions:
- •
We introduce EMHI, the first large-scale multimodal egocentric motion dataset captured on a real VR device, including stereo downward-sloping egocentric images, full-body IMU signals, and accurate human pose annotations.
- •
We propose a baseline method MEPoser, which employs a multimodal fusion encoder, temporal feature encoder, and MLP-based regression heads to perform real-time HPE on a standalone HMD.
- •
The experimental results demonstrate the soundness of our multimodal setting and the effectiveness of EMHI in addressing egocentric HPE.
2 Related Work
| Dataset | Real/Synth. | Egocentric Vision | Inertial | SMPL(-X) | Actions | Subjects | Frames |
|---|---|---|---|---|---|---|---|
| Mo2Cap2 | Synth. | Monocular, Downward-Facing | - | - | 3K | 700 | 530K |
| EgoPW | Real | Monocular, Downward-Facing | - | - | 20 | 10 | 318K |
| EgoCap | Real | Binocular, Downward-Facing | - | - | - | 8 | 30K |
| UnrealEgo | Synth. | Binocular, Downward-Facing | - | - | 30 | 17 | 450K |
| DIP-IMU | Real | - | Full-Body, 3DoF×6 | ✓ | 15 | 10 | 330K |
| FreeDancing | Real | - | Full-Body, 6DoF×3, 3DoF×3 | ✓ | - | 8 | 532.8K |
| Nymeria | Real | Binocular, Forward-Facing | Upper-Body, 6DoF×3 | ✓ | 20 | 264 | 260M |
| Ego-Exo4D (Ego Pose) | Real | Binocular, Forward-Facing | Head, 6DoF×1 | - | - | - | 9.6M |
| Ours | Real | Binocular, Downward-Sloping | Full-Body, 6DoF×3, 3DoF×2 | ✓ | 39 | 58 | 3.07M |
2.1 Egocentric Motion Dataset
As shown in Tab. 1, existing egocentric motion datasets can be divided into vision-based, IMU-based, and multimodal datasets depending on the input modality.
Unlike previous HPE datasets captured by third-view cameras[16, 3, 19, 50, 5], vision-based egocentric motion datasets provide first-person perspective images using head-mounted monocular or binocular cameras, with corresponding annotations of the wearer’s poses. Mo2Cap2[47] and xR-EgoPose[38] made early efforts to build synthetic monocular datasets with a downward-facing fisheye camera. EgoPW[43] proposed the first in-the-wild dataset and was followed by EgoGTA[44] and ECHA[23] with the same camera setting. EgoWholeBody[45] is the latest synthetic dataset providing high-quality images and SMPL-X annotations. EgoCap[35] is a pioneering binocular dataset captured by helmet-mounted stereo cameras, containing 30K frames recorded in a lab environment. EgoGlass[51] optimized the binocular setup with two front-facing cameras mounted on the glasses frames. To relieve the limitation of dataset scale, UnrealEgo[1] proposed a large-scale and highly realistic stereo synthetic dataset with 450K stereo views, which was extended to 1.25M in UnrealEgo2[2]. SynthEgo[9] extended synthetic datasets with more identities and environments, annotated with SMPL-H for better body shape descriptions.
Sparse IMU-based datasets provide an alternative for this problem. AMASS[28] can be used to synthesize large-scale IMU data. TotalCapture[39] and DIP-IMU[15] offered real IMU data captured by Xsens, with SMPL pose annotations obtained by a marker-based optical mocap system and an IMU-based method[41], respectively. PICO-FreeDancing[10] provided sparse IMU data with SMPL-format GT fitted using OptiTrack data. However, real IMU-based datasets are generally limited in scale.
Multimodal datasets[11, 13, 8, 34] have attracted significant attention in recent years due to the complementarity of different data modalities. Ego-Exo4D[14], Nymeria[27], and SimXR[26] captured real-world images with Project Aria glasses[36], along with upper-body IMU data. Ego-Exo4D provides up to 9.6M image frames with annotations of body and hand joint positions. Nymeria further offers SMPL-format data derived from Xsens mocap suits, though with limited clothing diversity of the captured bodies. However, in these datasets, either the forward-facing perspective restricts the perception range of the wearer’s body, or they do not integrate downward-sloping perspectives and sparse full-body IMU signals on actual VR/AR devices.
2.2 Egocentric Human Pose Estimation Methods
Existing egocentric HPE methods primarily utilize single-modal input. Vision-based methods have been widely investigated[35, 43, 44, 22, 23, 20, 30, 29, 9]. Wang et al.[42] introduced a spatiotemporal optimization method for single-view egocentric sequences to obtain 3D skeleton results. Recently, Wang et al.[45] proposed an egocentric motion capture method that combines a vision transformer for extracting features from undistorted image patches with diffusion-based motion priors for pose refinement. However, 3D pose estimation from a single image remains challenging due to the lack of depth information. To address this, UnrealEgo[1] introduced stereo egocentric skeleton tracking methods by integrating a weight-shared encoder for stereo heatmap generation and a multi-branch autoencoder for 3D pose prediction. Akada et al.[2] further enhanced this with a transformer-based model utilizing 3D scene information and temporal features. Despite these advancements, challenges with invisible body parts due to self-occlusion and out-of-view joints persist.
Methods using sparse tracking signals from body-worn IMUs have also garnered significant attention[46, 40, 15, 48, 49, 18, 12, 7]. In egocentric VR and AR scenarios, there are inherently three 6DoF tracking points for the head and hands, with the option to add two additional 3DoF IMUs on the legs. AvatarPoser[17] proposed a global pose prediction framework combining transformer structures with inverse kinematics (IK) optimization, while AvatarJLM[52] introduced a two-stage approach that models joint-level features and uses them as spatiotemporal transformer tokens to achieve smooth motion capture. HMD-Poser[10] integrated these inputs, presenting a lightweight temporal-spatial learning method for full-body global 6DoF motion recovery. However, IMU-based data faces challenges such as drift and sparsity. Many current methods rely on synthetic IMU data from AMASS[28], which often fails to capture real-world noise and drift accurately, leading to overfitting as models are not exposed to the complexities and imperfections of real-world conditions.
3 EMHI Dataset
EMHI is a multimodal egocentric motion dataset that contains 3.07M synchronized data pairs organized into 885 sequences recorded at 30 FPS. Each data pair contains stereo egocentric images (640×480), data from five IMUs, and the corresponding 3D SMPL pose and 2D keypoints. It is captured by 58 subjects, evenly split into 29 males and 29 females, with a diverse range of body shapes. Each subject wears their daily clothing during data collection to ensure a wide variety of natural looks. We record 39 common actions of users experiencing games and social applications in VR scenarios and categorize them into upper-body, lower-body, and full-body motions. Additionally, the dataset is captured under three different environmental lighting conditions, dim light, natural light, and bright light, for environment diversity.
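For illustration, one synchronized data pair could be represented as below; the class and field names, array shapes beyond the stated image size, and the quaternion layout are assumptions rather than the released data format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EMHIFrame:
    """Sketch of one synchronized data pair (illustrative layout, not the released format)."""
    left_image: np.ndarray     # (480, 640, 3) egocentric image from the left headset camera
    right_image: np.ndarray    # (480, 640, 3) egocentric image from the right headset camera
    imu_6dof: np.ndarray       # (3, 7) head + two hands: quaternion rotation + 3D position
    imu_3dof: np.ndarray       # (2, 4) two leg trackers: quaternion rotation only
    smpl_pose: np.ndarray      # (24, 3) SMPL pose parameters (axis-angle per joint)
    smpl_shape: np.ndarray     # (10,)  SMPL shape parameters
    keypoints_2d: np.ndarray   # (2, 22, 2) 2D keypoints projected onto the stereo images
    timestamp: float           # headset-clock timestamp in seconds
```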
3.1 Data Capture System
3.1.1 Hardware
As shown in Fig. 2, the overall hardware consists of three subsystems: the EgoSensorKit system to collect sensor data, comprising a PICO4 headset, two hand controllers, and two leg trackers; the Kinect system to obtain SMPL annotations, with 8 cameras recording simultaneously from outside-in viewpoints; and the Optitrack system for spatiotemporal synchronization between the above two systems, with Optical Rigid Bodies (ORBs) mounted on the VR headset and all Kinect cameras, allowing all cameras to move.
3.1.2 Temporal Synchronization
The Kinect and Optitrack systems rely on a signal transmitter device to trigger simultaneously, ensuring their inter-frame alignment. The EgoSensorKit and Optitrack can also be synchronized offline using the angular velocities of the headset IMU and its ORB, following the motion correlation method in [32]. Finally, the data frames of the Kinect and EgoSensorKit systems are aligned using Optitrack as a bridge. Since the recorded frame rate is 30 Hz, the maximum synchronization deviation can reach up to 16.5 ms, which might be noticeable during fast motion, so the annotations are further post-processed with linear interpolation to better align with the EgoSensorKit's timestamps.
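As a concrete illustration of this resampling step, the sketch below linearly interpolates per-frame annotation values onto the headset timestamps; the function name and array layout are assumptions, and joint rotations would need spherical rather than linear interpolation.

```python
import numpy as np

def resample_annotations(anno_ts, anno_vals, target_ts):
    """Linearly interpolate annotations (e.g., flattened 3D joints per frame)
    from the Kinect/OptiTrack timeline onto the EgoSensorKit timestamps.
    anno_ts: (N,), anno_vals: (N, ...), target_ts: (M,)."""
    flat = anno_vals.reshape(len(anno_ts), -1)                 # (N, D)
    resampled = np.stack(
        [np.interp(target_ts, anno_ts, flat[:, d]) for d in range(flat.shape[1])],
        axis=-1)                                               # (M, D)
    return resampled.reshape((len(target_ts),) + anno_vals.shape[1:])
```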
3.1.3 Spatial Alignment
The 6DoF pose of the headset (IMU) in the Optitrack coordinate system can be obtained as $T^{Opti}_{IMU} = T^{Opti}_{ORB}\,T^{ORB}_{IMU}$, where $T^{ORB}_{IMU}$ is a pre-calibrated rigid transformation between the IMU sensor in the headset and its ORB, and the ORB's 6DoF pose $T^{Opti}_{ORB}$ is tracked by Optitrack. Similarly, the extrinsic parameters $T^{Opti}_{Kinect}$ of each Kinect RGB camera in the Optitrack coordinate system can be determined using the same method. Then, the spatial transformation between the Kinect cameras and the headset can be obtained as $T^{Kinect}_{IMU} = (T^{Opti}_{Kinect})^{-1}\,T^{Opti}_{IMU}$. Finally, the transformation matrix between the Kinects and the egocentric cameras can be further calculated as $T^{Kinect}_{Cam} = T^{Kinect}_{IMU}\,T^{IMU}_{Cam}$, where $T^{IMU}_{Cam}$ is a constant spatial relationship between the headset (IMU) and its egocentric cameras.
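The transformation chain above can be composed directly from 4×4 homogeneous matrices; the sketch below mirrors the notation of this section, and the matrix names are assumptions about how the calibration data are stored.

```python
import numpy as np

def headset_in_optitrack(T_opti_orb, T_orb_imu):
    """T_opti_imu = T_opti_orb @ T_orb_imu: ORB pose tracked online,
    IMU-to-ORB transform pre-calibrated."""
    return T_opti_orb @ T_orb_imu

def headset_in_kinect(T_opti_kinect, T_opti_imu):
    """T_kinect_imu = inv(T_opti_kinect) @ T_opti_imu."""
    return np.linalg.inv(T_opti_kinect) @ T_opti_imu

def egocam_in_kinect(T_kinect_imu, T_imu_cam):
    """T_kinect_cam = T_kinect_imu @ T_imu_cam: the IMU-to-egocentric-camera
    transform is a fixed offline calibration."""
    return T_kinect_imu @ T_imu_cam
```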
3.2 Ground-Truth Acquisition
3.2.1 Keypoints Annotation
We use HRNet[37] to detect 2D keypoints in multi-view Kinect RGB images in the body25 format[6]. Then, we follow HuMMan[5] to derive 3D keypoint annotations by triangulation with the camera parameters obtained in the spatial alignment, in which we also impose smoothness and bone-length constraints to reduce temporal jitter and improve human shape consistency.
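For reference, a minimal sketch of the triangulation step is given below, using plain confidence-weighted DLT without the smoothness and bone-length constraints mentioned above; the projection matrices and per-view detections are assumed inputs.

```python
import numpy as np

def triangulate_joint(proj_mats, points_2d, confidences):
    """Confidence-weighted DLT triangulation of a single joint.
    proj_mats: (V, 3, 4) per-view projection matrices,
    points_2d: (V, 2) detected pixel coordinates, confidences: (V,)."""
    rows = []
    for P, (u, v), w in zip(proj_mats, points_2d, confidences):
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)               # (2V, 4) linear system
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                       # homogeneous solution (null space of A)
    return X[:3] / X[3]              # 3D joint position in world coordinates
```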
3.2.2 SMPL Fitting
Multi-view SMPL fitting is a well-studied problem, typically formulated with 3D joint, prior, smoothness, and shape regularization errors. However, due to the occlusion of the facial and hand areas by the EgoSensorKit (HMD and controllers) and the Kinect cameras' limited resolution, it is challenging to ensure the accuracy of the corresponding joint detections. This results in unreasonable SMPL fits for the wrist and head joint rotations. To tackle this problem, we incorporate the global rotations of the head and wrist joints, $R_{head}$ and $R_{wrist}$, which are transferred from the collected IMU rotations of the headset and hand controllers by $R_{head} = R^{IMU}_{head}\,R^{off}_{head}$ and $R_{wrist} = R^{IMU}_{hand}\,R^{off}_{hand}$, where $R^{off}_{head}$ and $R^{off}_{hand}$ are constant transformation matrices obtained by statistical methods from a large amount of data collected in the standard sensor-wearing setting. Moreover, we leverage the calibrated leg motion tracker data, which represent the knee joint rotations $R_{knee}$, to constrain the lower-leg pose. With the keypoint annotations and the five joint rotations obtained above, we fit the SMPL parameters by minimizing the following energy function:
$$E(\theta,\beta) = \lambda_{rot}E_{rot} + \lambda_{J}E_{J} + \lambda_{prior}E_{prior} + \lambda_{smooth}E_{smooth} + \lambda_{shape}E_{shape} \quad (1)$$
where $\theta$ and $\beta$ are the optimized SMPL pose and shape parameters, and $\lambda_{rot}$, $\lambda_{J}$, $\lambda_{prior}$, $\lambda_{smooth}$, and $\lambda_{shape}$ are balance weights (see Appendix). For occluded joints, we introduce the rotation term $E_{rot}$ to encourage pose consistency with the transferred IMU data as follows:
$$E_{rot} = \sum_{j \in \mathcal{J}_{IMU}} \left\| \mathrm{FK}(\theta)_{j} - R_{j} \right\|^{2} \quad (2)$$
where $\mathrm{FK}(\theta)_{j}$ denotes the global rotation of joint $j$ obtained by forward kinematics (FK), $R_{j}$ is the corresponding rotation transferred from the IMU data, and $\mathcal{J}_{IMU}$ covers the head, wrists, and knees. The other energy terms follow previous works [4, 5, 31]: $E_{J}$ minimizes the 3D distance between the triangulated keypoints and the regressed SMPL joints, $E_{prior}$ is the VPoser prior from SMPLify-X, $E_{smooth}$ keeps the pose tracking smooth, and the shape regularization term $E_{shape}$ penalizes large shape variance. With the spatial alignment results, the SMPL results can be transferred from world space to the egocentric camera coordinates to obtain 2D pose annotations on the egocentric images.
4 A New Baseline Method: MEPoser
To demonstrate the significance of the EMHI dataset and to inspire new designs for multimodal egocentric HPE, we introduce a new baseline method called Multi-modal Egocentric Pose Estimator (MEPoser). MEPoser takes multimodal inputs, including stereo egocentric images and inertial measurements, to extract multimodal representations and perform real-time HPE on a standalone HMD. As shown in Fig. 3, MEPoser consists of three components. (1) A multimodal fusion encoder extracts per-frame representations from the multimodal input data. (2) A temporal feature encoder composed of long short-term memory (LSTM) modules and feed-forward networks generates latent variables that incorporate temporal information from past frames. (3) Given the temporally aggregated multimodal features, two MLP-based (multi-layer perceptron) heads regress the pose and shape parameters of the SMPL model, respectively.
4.1 Multimodal Fusion Encoder
The multimodal fusion encoder first uses separate feature encoders for the different modalities, i.e., two weight-sharing CNN backbones for the stereo images and an MLP network for the IMU data. To make MEPoser run in real time on an HMD, we use a lightweight RegNetY-400MF[33] backbone, which takes the stereo images as inputs and generates 2D image features for each view. These features are then concatenated and forwarded to a few convolution layers to infer a set of heatmaps $H$. Here we predict 22 joints of the SMPL model, i.e., $H \in \mathbb{R}^{22 \times h \times w}$. To train the RegNetY-400MF backbone, we calculate the binary cross-entropy with logits loss (BCEWithLogitsLoss) between the GT heatmaps and the estimated 2D heatmaps. The predicted heatmaps are then flattened and forwarded to an MLP network to obtain the image feature. Having obtained the IMU and image features, we further add a 3D module that estimates the 3D joint positions in both the local camera coordinate and the global world coordinate to boost the pose estimation performance. Specifically, given the image features from the stereo heatmaps, an MLP network first estimates the 3D joint positions $J^{local}$ in the local camera coordinate. These joints are then transferred to the global SMPL coordinate as $J^{global}$ using the offline calibration results and the online 6DoF data of the headset. $J^{local}$ and $J^{global}$ are used to calculate the 3D joint losses $L^{local}_{3D}$ and $L^{global}_{3D}$, respectively. Next, the joint positions are flattened and forwarded to an MLP network to obtain the 3D joint feature. Finally, the IMU, image, and 3D joint features are concatenated to output the multimodal fused feature.
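The sketch below outlines this fusion encoder in PyTorch following the structure described above; the feature dimensions, the simple stand-in CNN used in place of the RegNetY-400MF backbone, and the IMU input size are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class MultimodalFusionEncoder(nn.Module):
    """Per-frame fusion encoder sketch: weight-shared CNN for the stereo views,
    heatmap prediction, local 3D joint regression, and an IMU MLP branch."""
    def __init__(self, n_joints=22, imu_dim=60, feat_dim=256):
        super().__init__()
        # Stand-in for the lightweight RegNetY-400MF backbone (shared for both views).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.heatmap_head = nn.Conv2d(256, n_joints, kernel_size=1)   # stereo features -> 22 heatmaps
        self.img_mlp = nn.Sequential(nn.LazyLinear(feat_dim), nn.ReLU())
        self.joint3d_mlp = nn.Linear(feat_dim, n_joints * 3)          # local-camera 3D joints
        self.joint_feat_mlp = nn.Sequential(nn.Linear(n_joints * 3, feat_dim), nn.ReLU())
        self.imu_mlp = nn.Sequential(nn.Linear(imu_dim, feat_dim), nn.ReLU())

    def forward(self, img_left, img_right, imu):
        f_l, f_r = self.backbone(img_left), self.backbone(img_right)
        heatmaps = self.heatmap_head(torch.cat([f_l, f_r], dim=1))    # (B, 22, h, w)
        img_feat = self.img_mlp(heatmaps.flatten(1))                  # image feature
        joints_local = self.joint3d_mlp(img_feat).view(-1, 22, 3)     # camera-frame joints
        joint_feat = self.joint_feat_mlp(joints_local.flatten(1))
        imu_feat = self.imu_mlp(imu)
        fused = torch.cat([imu_feat, img_feat, joint_feat], dim=-1)   # multimodal fused feature
        return fused, heatmaps, joints_local
```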
4.2 Temporal Feature Encoder
As demonstrated in HMD-Poser[10], temporal correlation is key to tracking accurate human motions. However, the multimodal fused features are still temporally isolated. Existing methods adopt Transformers or RNNs to solve this problem. Although Transformer-based methods[52] have achieved state-of-the-art results in HPE, their computational costs are much higher than those of RNN-based methods. To ensure our method runs in real time on HMDs, we introduce a lightweight LSTM-based temporal feature encoder. Specifically, the encoder is composed of a stack of identical blocks, each with two sub-layers. The first is an LSTM module that learns the temporal representation, and the second is a simple fully connected feed-forward network. We employ a residual connection around each sub-layer, followed by layer normalization.
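A minimal PyTorch sketch of one such block is shown below; the hidden sizes are assumptions, and the residual and normalization placement follows the description above.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """One LSTM + feed-forward block with residual connections and layer norm."""
    def __init__(self, dim=768, ff_dim=1024):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, state=None):
        """x: (B, T, dim) fused features; state carries (h, c) across streaming chunks."""
        y, state = self.lstm(x, state)
        x = self.norm1(x + y)                 # residual connection followed by layer norm
        x = self.norm2(x + self.ff(x))
        return x, state
```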
4.3 SMPL Decoder
The SMPL decoder first adopts two regression heads to estimate the local pose parameters $\theta$ and the shape parameters $\beta$ of SMPL. Both regression heads are designed as 2-layer MLPs. It then uses an FK module to calculate all joint positions from $\theta$, $\beta$, and the online 6DoF data of the headset. We define the SMPL loss function as a combination of a root orientation loss, a local pose loss, a global pose loss, and a joint position loss. All these losses are calculated as the mean absolute error (L1 norm) between the predicted results and the ground-truth values.
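A sketch of the two regression heads is given below; the hidden width and the 6D rotation output parameterization are assumptions, and the FK step is omitted.

```python
import torch
import torch.nn as nn

class SMPLDecoder(nn.Module):
    """Two 2-layer MLP heads regressing SMPL pose and shape from the temporal features."""
    def __init__(self, feat_dim=768, n_joints=22, hidden=512):
        super().__init__()
        self.pose_head = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_joints * 6))   # local joint rotations (6D)
        self.shape_head = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, 10))            # SMPL shape (betas)

    def forward(self, feat):
        return self.pose_head(feat), self.shape_head(feat)
```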
4.4 Training MEPoser
For the overall training loss, we combine a smoothness loss $L_{smooth}$ with the above losses, including the 2D heatmap loss $L_{hm}$, the 3D joint losses $L^{local}_{3D}$ and $L^{global}_{3D}$, and the SMPL loss $L_{SMPL}$. The smoothness loss from HMD-Poser[10] is adopted to further enhance temporal smoothness.
$$L = \lambda_{hm}L_{hm} + \lambda_{local}L^{local}_{3D} + \lambda_{global}L^{global}_{3D} + \lambda_{SMPL}L_{SMPL} + \lambda_{smooth}L_{smooth} \quad (3)$$
where $\lambda_{hm}$, $\lambda_{local}$, $\lambda_{global}$, $\lambda_{SMPL}$, and $\lambda_{smooth}$ are the balance weights (see Appendix).
| Protocol | Method | MPJRE (°) | MPJPE (cm) | PA-MPJPE (cm) | UpperPE (cm) | LowerPE (cm) | RootPE (cm) | Jitter |
|---|---|---|---|---|---|---|---|---|
| Protocol 1 | UnrealEgo | - | 5.5 | 3.9 | 4.0 | 7.7 | 4.2 | 592.5 |
| Protocol 1 | HMD-Poser | 4.6 | 5.8 | 2.8 | 4.8 | 7.1 | 5.8 | 114.9 |
| Protocol 1 | MEPoser (Ours) | 4.1 | 3.7 | 2.5 | 2.7 | 5.1 | 3.2 | 161.8 |
| Protocol 2 | UnrealEgo | - | 6.4 | 4.3 | 4.6 | 8.9 | 5.0 | 610.5 |
| Protocol 2 | HMD-Poser | 4.9 | 7.0 | 3.4 | 5.2 | 9.7 | 7.2 | 165.7 |
| Protocol 2 | MEPoser (Ours) | 4.7 | 4.8 | 2.9 | 3.2 | 7.0 | 3.8 | 204.9 |
5 Experiment
5.1 Dataset Splitting
We split the dataset into three parts: one for training (70%) and two separate testing sets based on different protocols, as follows. The training set comprises 615 sequences captured by 38 subjects, covering 20 daily actions involving upper-body, lower-body, and full-body movements. For testing, Protocol 1 (16%) contains 141 sequences with the same set of actions but performed by 8 subjects not present in the training set, to evaluate cross-subject generalization. Protocol 2 (14%) is designed to assess the model's effectiveness and robustness in more general scenarios, consisting of 129 sequences involving 19 unseen actions and 20 subjects not present in the training set.
5.2 Comparison
Our multimodal dataset fills a gap in egocentric VR scenarios. To validate the dataset and the corresponding baseline method, we conducted the following comparison on our dataset between MEPoser and the latest single-modal methods: UnrealEgo[1], which takes stereo egocentric images as input, and HMD-Poser[10], which uses IMU observations from the HMD and two leg motion trackers.
The quantitative results in Tab. 2 show that MEPoser outperforms existing single-modal methods for egocentric HPE. Compared to UnrealEgo, our method reduces the MPJPE (Mean Per-Joint Position Error, cm) by 32.7% and 25% on Protocol 1 and Protocol 2, respectively. Notably, MEPoser significantly enhances the smoothness of the estimation results by using a temporal LSTM structure. Compared with HMD-Poser, MEPoser shows 36.2% and 31.4% reductions in MPJPE on the two test sets by incorporating egocentric image features, along with a slight improvement in joint rotation accuracy according to the MPJRE (Mean Per-Joint Rotation Error, °) results. The results also demonstrate the improved generalizability of MEPoser across subjects and actions, validating the effectiveness of the multimodal setting and the value of our dataset for egocentric HPE.
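For reference, the two position metrics reported in Tab. 2 can be computed per frame as in the following sketch (errors in the same units as the inputs); the Procrustes alignment follows the standard similarity-transform formulation.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error; pred, gt: (J, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after similarity (Procrustes) alignment of the prediction to the ground truth."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    u, s, vt = np.linalg.svd(p.T @ g)
    if np.linalg.det((u @ vt).T) < 0:          # avoid reflections
        vt[-1] *= -1
        s[-1] *= -1
    r = (u @ vt).T                             # optimal rotation
    scale = s.sum() / (p ** 2).sum()           # optimal scale
    aligned = scale * p @ r.T + mu_g
    return mpjpe(aligned, gt)
```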
Qualitative comparisons are shown in Fig. 4 using various test sequences featuring different actions and environments. MEPoser alleviates the limitations of the single-modal methods by exploiting the complementarity between the vision and IMU signals. Specifically, our method can deal with self-occlusion and out-of-FOV problems in egocentric images by utilizing IMU features and temporal information to obtain more accurate pose results. Additionally, our approach mitigates the sparsity, drifting, and ambiguous measurements of IMU signals during slow motions by incorporating the visible body joints in egocentric images.
5.3 Annotation Cross-validation
To further assess the accuracy of our dataset annotations, we randomly capture 9 motion sequences for cross-validation by simultaneously using both our system and an optical marker-based motion capture system. The subject wears a suit with reflective markers tracked by the OptiTrack system. Marker-based SMPL parameters are then derived from the MoCap data using MoSh++[24], with a temporal filter applied to reduce jitter. As shown in Tab. 3, the low error metrics and variance demonstrate the robustness of our annotation pipeline and the high quality of the dataset. Beyond this validation, every sequence of our dataset has been inspected manually to eliminate data with erroneous annotations.
| Action | MPJPE (cm) | Action | MPJPE (cm) |
|---|---|---|---|
| walking | 2.61 | kicking | 2.12 |
| hand-waving | 2.72 | lunge | 2.23 |
| taichi | 2.98 | boxing | 2.73 |
| shuttlecock-kicking | 2.65 | dancing | 2.49 |
| marching-in-place | 2.08 | Average | 2.51 |
6 Conclusion
In this paper, we introduce EMHI, a novel multimodal human motion dataset designed for egocentric HPE. It includes synchronized egocentric images and IMU signals from a real VR product suite, with SMPL annotations in the same world coordinate system. To enhance generalization in real-world applications, we collected a diverse range of data across various actions and individuals. We also present MEPoser, a new baseline HPE method that combines image and IMU inputs for real-time HPE on a standalone HMD. MEPoser effectively demonstrates the benefits of multimodal fusion, improving accuracy and addressing the limitations of previous single-modal methods. This approach serves as an initial exploration, inviting further research of egocentric HPE with multimodal data. We believe releasing this dataset and method will accelerate the practical implementation of HPE with body-worn sensors in future VR/AR products.
| Action Type | Training / Testing Protocol 1 | Testing Protocol 2 |
|---|---|---|
| Upper-body | chest expansion (39), T/A pose (40), cutting (39), clicking (39) | hand waving (12), arm swing (2), raising hand (5), random waving (3) |
| Lower-body | standing (40), high stepping (42), sitting (15), side kicking (41), forward kicking (38), shuttlecock kicking (37), lunge (36), squat (38) | backward kicking (6), backward stepping (6), leapfrog (12), standing up (7), kneeling (9), random kicking (4), leg lifting (6) |
| Whole-body | marching in place (38), playing basketball (40), walking (33), running (40), boxing (39), dancing (41), fencing (42), taichi (39) | f/b cross jumping (6), l/r cross jumping (6), playing golf (3), squat and stand (7), dancing on dance pad (5), free-style (6), playing table tennis (12), playing football (12) |
Appendix
Appendix A Details of GT Fitting
A.1 Energy Term Details.
To provide a detailed understanding of the SMPL fitting process outlined in the main manuscript, we elaborate on the energy terms minimized during optimization in this supplementary material. The fitting process involves minimizing a weighted sum of various energy terms [4, 5, 31], each contributing to the accuracy and plausibility of the final SMPL parameters. Below, we define each energy term included in this equation:
$$E(\theta,\beta) = \lambda_{rot}E_{rot} + \lambda_{J}E_{J} + \lambda_{prior}E_{prior} + \lambda_{smooth}E_{smooth} + \lambda_{shape}E_{shape} \quad (4)$$
1) Rotation Alignment Term: This term enforces the alignment between the SMPL model's joint rotations and the rotations obtained from the body-worn sensors (IMU data). It alleviates the HMD-occlusion problem and encourages pose consistency with the IMU data:
$$E_{rot} = \sum_{j \in \mathcal{J}_{IMU}} \left\| \mathrm{FK}(\theta)_{j} - R_{j} \right\|^{2} \quad (5)$$
where $\mathrm{FK}(\theta)_{j}$ denotes the global rotation of joint $j$ obtained by forward kinematics (FK), $R_{j}$ is the corresponding rotation transferred from the IMU data, and $\mathcal{J}_{IMU}$ is the set of IMU-constrained joints (head, wrists, and knees).
2) Joint Term: This term measures the discrepancy between the 3D joint locations of the SMPL model and those detected in the multi-view setup. It ensures that the SMPL model closely follows the observed joint positions and is formulated as:
$$E_{J} = \sum_{j} \left\| J(\theta,\beta)_{j} - P_{j} \right\|^{2} \quad (6)$$
where $J(\theta,\beta)$ returns the SMPL joints, $J(\theta,\beta)_{j}$ is the 3D position of the $j$-th joint, and $P_{j}$ is the corresponding joint detected from the multi-view setup.
3) Prior Term:
$$E_{prior} = \left\| z - \mu \right\|^{2} \quad (7)$$
The term $z$ represents the pose latent vector generated by the VPoser model [31], while $\mu$ denotes the center of the Gaussian distribution in the latent space.
4) Smooth Term: This term enforces temporal smoothness in the estimated joint rotations to prevent abrupt changes in pose, which could lead to unrealistic motion artifacts. It can be expressed as:
$$E_{smooth} = \sum_{j} \left\| \theta_{j,t} - \theta_{j,t-1} \right\|^{2} \quad (8)$$
where $\theta_{j,t}$ represents the SMPL pose parameters of joint $j$ at time $t$.
5) Shape Regularization Term: This term regularizes the SMPL shape parameters to ensure that the estimated body shape remains plausible. The regularization penalizes deviations from the mean shape and large shape variance:
$$E_{shape} = \left\| \beta \right\|^{2} \quad (9)$$
where $\beta$ denotes the SMPL shape parameters representing variations in body shape.
A.2 Optimization Process
The optimization process begins after triangulating all frames. We start by loading the SMPL model and initializing the SMPL parameters. Then we fit the SMPL shape using 3D limb data and initialize body rotation and translation parameters. Next, we refine the poses by optimizing pose parameters, including global rotation and translation. This optimization minimizes the above energy function. Finally, we get the estimated SMPL parameters.
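A condensed sketch of this loop is given below, combining the five energy terms from A.1; `smpl_forward`, `fk_global_rot`, and `vposer_encode` are assumed stand-ins for the SMPL layer, forward kinematics, and VPoser encoder, and the prior is written assuming a zero-centered latent.

```python
import torch

def fit_smpl(kp3d, imu_rot, weights, smpl_forward, fk_global_rot, vposer_encode,
             n_iters=200, lr=0.05):
    """Minimize the weighted fitting energy over a sequence.
    kp3d: (T, J, 3) triangulated joints; imu_rot: {joint_name: (T, 3, 3)} target rotations."""
    n_frames = kp3d.shape[0]
    pose = torch.zeros(n_frames, 72, requires_grad=True)   # per-frame SMPL pose (axis-angle)
    betas = torch.zeros(10, requires_grad=True)            # one shape shared over the sequence
    opt = torch.optim.Adam([pose, betas], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        joints = smpl_forward(pose, betas)                              # (T, J, 3)
        e_joint = ((joints - kp3d) ** 2).sum(-1).mean()                 # E_J
        e_rot = sum(((fk_global_rot(pose, j) - R) ** 2).mean()          # E_rot
                    for j, R in imu_rot.items())
        e_prior = (vposer_encode(pose) ** 2).mean()                     # E_prior (zero-centered latent)
        e_smooth = ((pose[1:] - pose[:-1]) ** 2).mean()                 # E_smooth
        e_shape = (betas ** 2).mean()                                   # E_shape
        loss = (weights["rot"] * e_rot + weights["joint"] * e_joint +
                weights["prior"] * e_prior + weights["smooth"] * e_smooth +
                weights["shape"] * e_shape)
        loss.backward()
        opt.step()
    return pose.detach(), betas.detach()
```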
A.3 Balance Weights
The balance weights ($\lambda_{rot}$, $\lambda_{J}$, $\lambda_{prior}$, $\lambda_{smooth}$, $\lambda_{shape}$) used in the optimization process are crucial for controlling the influence of each energy term. The weights are applied as follows:
- $\lambda_{rot}$ for the rotation alignment term.
- $\lambda_{J}$ for the 3D joint distance term.
- $\lambda_{prior}$ for regularizing the pose parameters.
- $\lambda_{smooth}$ for the pose smoothness term.
- $\lambda_{shape}$ for regularizing the shape parameters.
Appendix B Training Details of the Baseline Method
In our training process, the MEPoser model was trained for a total of 20 epochs using the Adam optimizer[21], with an initial learning rate of . The learning rate was decayed by a factor of 0.1 after 7 and 14 epochs to ensure stable convergence.We used a batch size of 32 during training. The training was conducted on a machine with NVIDIA V100 GPUs, and the entire training process took approximately 72 hours.
The loss function weights were carefully tuned based on validation performance. Specifically, we set the weight for the 2D heatmap loss to 1.0, the weight for the local 3D joint loss to 1.0, and the weight for the global 3D joint loss to 1.0. The weight for the SMPL loss was set to 5.0, and the weight for the temporal smoothness loss to 0.5 to maintain the temporal coherence of the predicted poses. Our approach represents an initial attempt at multimodal egocentric pose estimation, and we will open-source our baseline method to inspire better designs in this area.
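Putting the optimizer schedule and loss weights together, a minimal training-loop sketch could look as follows; `model.compute_losses`, the data loader, and the learning-rate variable are placeholders, since the exact interfaces and the initial learning-rate value are not specified here.

```python
import torch

def train_meposer(model, train_loader, initial_lr, epochs=20, device="cuda"):
    """Adam with step decay (x0.1 after epochs 7 and 14) and the weighted loss terms."""
    w = {"hm": 1.0, "joint_local": 1.0, "joint_global": 1.0, "smpl": 5.0, "smooth": 0.5}
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=initial_lr)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[7, 14], gamma=0.1)
    for _ in range(epochs):
        for batch in train_loader:                       # batch size 32 in our setup
            losses = model.compute_losses(batch)         # assumed to return a dict of loss terms
            loss = sum(w[k] * losses[k] for k in w)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
```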
Appendix C Hardware Details
As illustrated in Fig. 5, the Optitrack system and the multi-view Kinect system are physically connected to the host PC via cables, while the EgoSensorKit operates wirelessly, offering the subject greater flexibility to perform a range of actions. During recording, two clock systems are employed: the Host-PC-Clock for Kinect and Optitrack data, and the Headset-Clock for EgoSensorKit data. Synchronization between these two systems is achieved using the method described in the Temporal Synchronization section of the main paper.
Eight Azure Kinect cameras are used to capture multi-view third-person RGB images. These cameras are connected in a daisy-chain configuration using audio cables to ensure temporal synchronization, with our tests showing a time offset of less than 5 microseconds between the trigger signals of the first and last cameras. To ensure full-body visibility of the subject, the cameras are positioned evenly in eight directions around the subject, ensuring that any given point is visible from at least two cameras. Given the bandwidth demands of transmitting data from eight cameras, we limit the capture resolution and set the frame rate to 30 Hz, ensuring stable data transmission during acquisition. The activity area for the subject is approximately 1 meter in diameter, with the Kinect cameras positioned around 2 meters from the center of this area.
Appendix D Dataset
D.1 Dataset Splitting Details
For dataset splitting, we provide the specific actions under different protocols as shown in Tab. 4.
D.2 Dataset Visualization
Fig. 6 presents additional visualization results from our dataset, illustrating the qualitative accuracy of our annotations. The alignment of the projected SMPL mesh with the subject in the third-view images highlights the precision of our SMPL fitting. Furthermore, the 2D skeleton annotations on egocentric images confirm both the spatiotemporal accuracy and the reliability of the 2D annotations.
Appendix E Qualitative Results for Cross-validation
In addition to the quantitative cross-validation in the main manuscript, we also show qualitative results to further validate our dataset annotations. As shown in Fig. 7, we visually compared the predicted poses from our system (red skeletons) against the poses obtained from the optical marker-based motion capture system (green skeletons). The visual comparisons across multiple frames demonstrate that our annotation pipeline accurately captures the subject’s motion, with minimal deviations from the marker-based results. These results provide qualitative evidence supporting the robustness and high quality of our dataset annotations.
References
- Akada et al. [2022] Hiroyasu Akada, Jian Wang, Soshi Shimada, Masaki Takahashi, Christian Theobalt, and Vladislav Golyanik. Unrealego: A new dataset for robust egocentric 3d human motion capture. In European Conference on Computer Vision, pages 1–17. Springer, 2022.
- Akada et al. [2024] Hiroyasu Akada, Jian Wang, Vladislav Golyanik, and Christian Theobalt. 3d human pose perception from egocentric stereo videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 767–776, 2024.
- Andriluka et al. [2014] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3686–3693, 2014.
- Bogo et al. [2016] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3d human pose and shape from a single image. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pages 561–578. Springer, 2016.
- Cai et al. [2022] Zhongang Cai, Daxuan Ren, Ailing Zeng, Zhengyu Lin, Tao Yu, Wenjia Wang, Xiangyu Fan, Yang Gao, Yifan Yu, Liang Pan, et al. Humman: Multi-modal 4d human dataset for versatile sensing and modeling. In European Conference on Computer Vision, pages 557–577. Springer, 2022.
- Cao et al. [2017] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7291–7299, 2017.
- Castillo et al. [2023] Angela Castillo, Maria Escobar, Guillaume Jeanneret, Albert Pumarola, Pablo Arbeláez, Ali Thabet, and Artsiom Sanakoyeu. Bodiffusion: Diffusing sparse observations for full-body human motion synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 4221–4231, 2023.
- Cha et al. [2021] Young-Woon Cha, Husam Shaik, Qian Zhang, Fan Feng, Andrei State, Adrian Ilie, and Henry Fuchs. Mobile. egocentric human body motion reconstruction using only eyeglasses-mounted cameras and a few body-worn inertial sensors. In 2021 IEEE Virtual Reality and 3D User Interfaces (VR), pages 616–625. IEEE, 2021.
- Cuevas-Velasquez et al. [2024] Hanz Cuevas-Velasquez, Charlie Hewitt, Sadegh Aliakbarian, and Tadas Baltrušaitis. Simpleego: Predicting probabilistic body pose from egocentric cameras. In 2024 International Conference on 3D Vision, pages 1446–1455. IEEE, 2024.
- Dai et al. [2024] Peng Dai, Yang Zhang, Tao Liu, Zhen Fan, Tianyuan Du, Zhuo Su, Xiaozheng Zheng, and Zeming Li. Hmd-poser: On-device real-time human motion tracking from scalable sparse observations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 874–884, 2024.
- Damen et al. [2022] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, pages 1–23, 2022.
- Du et al. [2023] Yuming Du, Robin Kips, Albert Pumarola, Sebastian Starke, Ali K. Thabet, and Artsiom Sanakoyeu. Avatars grow legs: Generating smooth human motion from sparse tracking inputs with diffusion model. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 481–490, 2023.
- Gong et al. [2023] Xinyu Gong, Sreyas Mohan, Naina Dhingra, Jean-Charles Bazin, Yilei Li, Zhangyang Wang, and Rakesh Ranjan. Mmg-ego4d: Multimodal generalization in egocentric action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6481–6491, 2023.
- Grauman et al. [2024] Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–19400, 2024.
- Huang et al. [2018] Yinghao Huang, Manuel Kaufmann, Emre Aksan, Michael J. Black, Otmar Hilliges, and Gerard Pons-Moll. Deep inertial poser: Learning to reconstruct human pose from sparse inertial measurements in real time. ACM Transactions on Graphics, 37(6):1–15, 2018.
- Ionescu et al. [2013] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2013.
- Jiang et al. [2022a] Jiaxi Jiang, Paul Streli, Huajian Qiu, Andreas Fender, Larissa Laich, Patrick Snape, and Christian Holz. Avatarposer: Articulated full-body pose tracking from sparse motion sensing. In European Conference on Computer Vision, pages 443–460. Springer, 2022a.
- Jiang et al. [2022b] Yifeng Jiang, Yuting Ye, Deepak Gopinath, Jungdam Won, Alexander W. Winkler, and C. Karen Liu. Transformer inertial poser: Real-time human motion reconstruction from sparse imus with simultaneous terrain generation. In SIGGRAPH Asia 2022 Conference Papers, pages 3:1–3:9. ACM, 2022b.
- Joo et al. [2015] Hanbyul Joo, Hao Liu, Lei Tan, Lin Gui, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. Panoptic studio: A massively multiview system for social motion capture. In Proceedings of the IEEE International Conference on Computer Vision, pages 3334–3342, 2015.
- Kang et al. [2023] Taeho Kang, Kyungjin Lee, Jinrui Zhang, and Youngki Lee. Ego3dpose: Capturing 3d cues from binocular egocentric views. In SIGGRAPH Asia 2023 Conference Papers, pages 1–10, 2023.
- Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
- Liu et al. [2022] Yuxuan Liu, Jianxin Yang, Xiao Gu, Yao Guo, and Guang-Zhong Yang. Ego+X: An egocentric vision system for global 3d human pose estimation and social interaction characterization. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5271–5277. IEEE, 2022.
- Liu et al. [2023] Yuxuan Liu, Jianxin Yang, Xiao Gu, Yijun Chen, Yao Guo, and Guang-Zhong Yang. Egofish3d: Egocentric 3d pose estimation from a fisheye camera via self-supervised learning. IEEE Transactions on Multimedia, 25:8880–8891, 2023.
- Loper et al. [2014] Matthew Loper, Naureen Mahmood, and Michael J. Black. Mosh: Motion and shape capture from sparse markers. ACM Transactions on Graphics, 33(6):220–1, 2014.
- Loper et al. [2015] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, 2015.
- Luo et al. [2024] Zhengyi Luo, Jinkun Cao, Rawal Khirodkar, Alexander Winkler, Kris Kitani, and Weipeng Xu. Real-time simulated avatar from head-mounted sensors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 571–581, 2024.
- Ma et al. [2024] Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, et al. Nymeria: A massive collection of multimodal egocentric daily motion in the wild. arXiv preprint arXiv:2406.09905, 2024.
- Mahmood et al. [2019] Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5442–5451, 2019.
- Millerdurai et al. [2024] Christen Millerdurai, Hiroyasu Akada, Jian Wang, Diogo Luvizon, Christian Theobalt, and Vladislav Golyanik. Eventego3d: 3d human motion capture from egocentric event streams. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1186–1195, 2024.
- Park et al. [2023] Jinman Park, Kimathi Kaai, Saad Hossain, Norikatsu Sumi, Sirisha Rambhatla, and Paul Fieguth. Domain-guided spatio-temporal self-attention for egocentric 3d pose estimation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1837–1849, 2023.
- Pavlakos et al. [2019] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10975–10985, 2019.
- Qiu et al. [2020] Kejie Qiu, Tong Qin, Jie Pan, Siqi Liu, and Shaojie Shen. Real-time temporal and rotational calibration of heterogeneous sensors using motion correlation analysis. IEEE Transactions on Robotics, 37(2):587–602, 2020.
- Radosavovic et al. [2020] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10428–10436, 2020.
- Rai et al. [2021] Nishant Rai, Haofeng Chen, Jingwei Ji, Rishi Desai, Kazuki Kozuka, Shun Ishizaka, Ehsan Adeli, and Juan Carlos Niebles. Home action genome: Cooperative compositional action understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11184–11193, 2021.
- Rhodin et al. [2016] Helge Rhodin, Christian Richardt, Dan Casas, Eldar Insafutdinov, Mohammad Shafiei, Hans-Peter Seidel, Bernt Schiele, and Christian Theobalt. Egocap: Egocentric marker-less motion capture with two fisheye cameras. ACM Transactions on Graphics, 35(6):1–11, 2016.
- Somasundaram et al. [2023] Kiran Somasundaram, Jing Dong, Huixuan Tang, Julian Straub, Mingfei Yan, Michael Goesele, Jakob Julian Engel, Renzo De Nardi, and Richard Newcombe. Project aria: A new tool for egocentric multi-modal ai research. arXiv preprint arXiv:2308.13561, 2023.
- Sun et al. [2019] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5693–5703, 2019.
- Tome et al. [2019] Denis Tome, Patrick Peluse, Lourdes Agapito, and Hernan Badino. xr-egopose: Egocentric 3d human pose from an hmd camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7728–7738, 2019.
- Trumble et al. [2017] Matthew Trumble, Andrew Gilbert, Charles Malleson, Adrian Hilton, and John P. Collomosse. Total capture: 3d human pose estimation fusing video and inertial sensors. In BMVC, pages 1–13. London, UK, 2017.
- von Marcard et al. [2017] Timo von Marcard, Bodo Rosenhahn, Michael J. Black, and Gerard Pons-Moll. Sparse inertial poser: Automatic 3d human pose estimation from sparse imus. Computer Graphics Forum, 36(2):349–360, 2017.
- Von Marcard et al. [2017] Timo Von Marcard, Bodo Rosenhahn, Michael J. Black, and Gerard Pons-Moll. Sparse inertial poser: Automatic 3d human pose estimation from sparse imus. In Computer Graphics Forum, pages 349–360. Wiley Online Library, 2017.
- Wang et al. [2021] Jian Wang, Lingjie Liu, Weipeng Xu, Kripasindhu Sarkar, and Christian Theobalt. Estimating egocentric 3d human pose in global space. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11500–11509, 2021.
- Wang et al. [2022] Jian Wang, Lingjie Liu, Weipeng Xu, Kripasindhu Sarkar, Diogo Luvizon, and Christian Theobalt. Estimating egocentric 3d human pose in the wild with external weak supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13157–13166, 2022.
- Wang et al. [2023] Jian Wang, Diogo Luvizon, Weipeng Xu, Lingjie Liu, Kripasindhu Sarkar, and Christian Theobalt. Scene-aware egocentric 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13031–13040, 2023.
- Wang et al. [2024] Jian Wang, Zhe Cao, Diogo Luvizon, Lingjie Liu, Kripasindhu Sarkar, Danhang Tang, Thabo Beeler, and Christian Theobalt. Egocentric whole-body motion capture with fisheyevit and diffusion-based motion refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 777–787, 2024.
- Winkler et al. [2022] Alexander Winkler, Jungdam Won, and Yuting Ye. Questsim: Human motion tracking from sparse sensors with simulated avatars. In SIGGRAPH Asia 2022 Conference Papers, pages 1–8, 2022.
- Xu et al. [2019] Weipeng Xu, Avishek Chatterjee, Michael Zollhoefer, Helge Rhodin, Pascal Fua, Hans-Peter Seidel, and Christian Theobalt. Mo2cap2: Real-time mobile 3d motion capture with a cap-mounted fisheye camera. IEEE Transactions on Visualization and Computer Graphics, 25(5):2093–2101, 2019.
- Yi et al. [2021] Xinyu Yi, Yuxiao Zhou, and Feng Xu. Transpose: Real-time 3d human translation and pose estimation with six inertial sensors. ACM Transactions on Graphics, 40(4):1–13, 2021.
- Yi et al. [2022] Xinyu Yi, Yuxiao Zhou, Marc Habermann, Soshi Shimada, Vladislav Golyanik, Christian Theobalt, and Feng Xu. Physical inertial poser (pip): Physics-aware real-time human motion tracking from sparse inertial sensors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13167–13178, 2022.
- Zhang et al. [2022] Siwei Zhang, Qianli Ma, Yan Zhang, Zhiyin Qian, Taein Kwon, Marc Pollefeys, Federica Bogo, and Siyu Tang. Egobody: Human body shape and motion of interacting people from head-mounted devices. In European Conference on Computer Vision, pages 180–200. Springer, 2022.
- Zhao et al. [2021] Dongxu Zhao, Zhen Wei, Jisan Mahmud, and Jan-Michael Frahm. Egoglass: Egocentric-view human pose estimation from an eyeglass frame. In 2021 International Conference on 3D Vision, pages 32–41. IEEE, 2021.
- Zheng et al. [2023] Xiaozheng Zheng, Zhuo Su, Chao Wen, Zhou Xue, and Xiaojie Jin. Realistic full-body tracking from sparse observations via joint-level modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14678–14688, 2023.