Dynamic Human-Scene Cooperative Novel View Synthesis Method Based on 3D Gaussian Splatting
Keywords:
3D reconstruction, natural scene, parametric model, 3D Gaussian splatting, scene decoupling
Abstract
Dynamic human-scene cooperative novel view synthesis holds significant application value in fields such as Virtual Reality (VR), Augmented Reality (AR), film production, and digital humans. In Chapter 4, we achieved high-fidelity novel view synthesis of real human body surface details based on Neural Radiance Fields (NeRF). Although the synthesized dynamic human surface details were promising, NeRF's slow inference and its implicit modeling of continuous space, which lacks explicit geometric structure, make it difficult to decouple the human body from the scene. Consequently, NeRF cannot meet the requirements of dynamic human-scene cooperative novel view synthesis. Moreover, the absence of accurate semantic segmentation of humans and scenes in three-dimensional space makes it challenging to accurately separate dynamic human Gaussians from static scene Gaussians. To address these issues, this chapter proposes an efficient dynamic human-scene cooperative novel view synthesis framework based on 3D Gaussian Splatting (3DGS). The framework standardizes the spatial coordinate systems of the human body and the scene to ensure geometric consistency, employs a triplane representation to reconstruct the human Gaussians, and finally adopts a joint training strategy to optimize the human and scene models simultaneously. Comparative experiments on publicly available datasets demonstrate that the proposed method effectively corrects the Gaussian misalignment caused by geometric coupling between the human body and the scene, yielding a more accurate decoupling of the human body and the scene. This enables flexible recombination of human and scene elements without additional training and achieves high-quality dynamic human-scene cooperative novel view synthesis.
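To make the coordinate-standardization step concrete, the following is a minimal Python sketch, not the authors' implementation, of the general idea of mapping dynamic human Gaussians from a local (e.g. SMPL/canonical) frame into the scene's world frame and composing them with static scene Gaussians before a single splatting pass. All names, the similarity transform, and the random toy data are assumptions made for illustration.

import numpy as np

def human_to_world(mu_human, R, t, s):
    """Map human Gaussian centers (N, 3) from the human's local frame
    into the scene's world frame with a similarity transform."""
    return s * mu_human @ R.T + t

def compose_gaussians(mu_scene, mu_human_world):
    """Concatenate static scene Gaussians and transformed human Gaussians
    so one rasterization pass can render the composite."""
    return np.concatenate([mu_scene, mu_human_world], axis=0)

# Toy usage with random centers; a real pipeline would also carry covariances,
# opacities, and spherical-harmonic colors through the same transform.
mu_scene = np.random.rand(1000, 3) * 5.0   # static scene Gaussian centers (world frame)
mu_human = np.random.rand(200, 3) * 0.5    # human Gaussian centers (local frame)
R = np.eye(3)                              # rotation, human frame -> world frame
t = np.array([1.0, 0.0, 2.0])              # translation, human frame -> world frame
s = 1.0                                    # global scale factor
mu_all = compose_gaussians(mu_scene, human_to_world(mu_human, R, t, s))
print(mu_all.shape)                        # (1200, 3)

Because both sets of Gaussians live in the same world frame after this step, the human and scene models can be recombined freely at render time, which is the property the joint training strategy described above relies on.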
License
Copyright (c) 2025 Jinghan Wang (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.