Abstract
Real-time engagement estimation holds significant potential across various research areas, particularly in human-computer interaction. It enables artificial agents to adjust their responses dynamically based on user engagement levels, fostering more intuitive and immersive interactions. Despite strides in automating real-time engagement estimation, the task remains challenging in real-world settings, especially when handling multi-modal human social signals. Capitalizing on human body and audio signals, this paper explores appropriate feature representations for different modalities and effective modelling of dual conversations, resulting in a novel and efficient multi-modal engagement detection model. We thoroughly evaluated our method in the MultiMediate'23 grand challenge. It performs consistently, with a notable improvement over the baseline model: while the baseline achieves a concordance correlation coefficient (CCC) of 0.59, our approach yields a CCC of 0.70, suggesting its promising efficacy in real-life engagement detection.
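The abstract reports performance as a concordance correlation coefficient (CCC). For readers unfamiliar with the metric, below is a minimal sketch of how CCC is typically computed (Lin's formulation); the function name and use of NumPy are our illustration, not code from the paper:

```python
import numpy as np

def concordance_ccc(y_true, y_pred):
    """Concordance correlation coefficient (Lin, 1989).

    CCC = 2*cov(t, p) / (var(t) + var(p) + (mean(t) - mean(p))**2)
    Ranges from -1 to 1; 1 indicates perfect agreement.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()  # population variance (ddof=0)
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)
```

Unlike Pearson correlation, CCC penalizes both scale and location shifts between predictions and ground truth, which is why it is a common choice for continuous engagement labels.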
Original language | English |
---|---|
Title of host publication | Proceedings of the 31st ACM International Conference on Multimedia, MM '23 |
Place of Publication | USA |
Publisher | Association for Computing Machinery |
Pages | 9601-9605 |
ISBN (Electronic) | 9798400701085 |
DOIs | https://doi.org/10.1145/3581783.3612873 |
Publication status | Published - 2023 |
Citation
Yang, C., Wang, K., Chen, P. Q., Cheung, M. K. M., Zhang, Y., Fu, E. Y., & Ngai, G. (2023). MultiMediate 2023: Engagement level detection using audio and video features. In Proceedings of the 31st ACM International Conference on Multimedia, MM '23 (pp. 9601-9605). Association for Computing Machinery. https://doi.org/10.1145/3581783.3612873

Keywords
- Engagement
- Machine learning
- Neural networks