- Weng, Zhenzhen
- Bravo-Sánchez, Laura
- Wang, Zeyu
- Howard, Christopher
- Xenochristou, Maria
- Meister, Nicole
- Kanazawa, Angjoo
- Milstein, Arnold
- Bergelson, Elika
- Humphreys, Kathryn
- Sanders, Lee
- Yeung-Levy, Serena
We introduce HARMONI, a three-dimensional (3D) computer vision and audio processing method for analyzing caregiver-child behavior and interaction from observational videos. HARMONI operates at subsecond resolution, estimating 3D mesh representations of humans and their spatial interactions, and adapts to challenging natural environments using an environment-targeted synthetic data generation module. Deployed on 500 hours of recordings from the SEEDLingS dataset, HARMONI generates detailed quantitative measurements of 3D human behavior previously unattainable through manual efforts or 2D methods. HARMONI identifies longitudinal trends in caregiver-child interaction, including child movement, body pose, dyadic touch, visibility, and conversational turns. The integrated visual and audio analysis further reveals multimodal trends, including associations between child conversational turns and movement. HARMONI achieves 63% to 80% consistency with human annotators on key attributes on SEEDLingS and 84% to 93% consistency on videos taken in a laboratory setting, while reducing the time required by a factor of more than 100. Open-sourced for large-scale analysis, HARMONI facilitates advancements in human development research.