Conference paper, 2021

Audio-Video detection of the active speaker in meetings

Abstract

Meetings are a common activity that poses certain challenges for systems designed to assist them. Such is the case for speaker recognition, which can provide useful information for modeling human interaction or for human-robot interaction. Speaker recognition is mostly done using speech; however, visual and contextual information can provide additional insights. In this paper we propose a speaker detection framework that integrates audiovisual features with social information from the meeting context. The visual cue is processed using a Convolutional Neural Network (CNN) that captures spatio-temporal relationships. We analyse several CNN architectures with two cues: raw pixels (RGB images) and motion (estimated with optical flow). Contextual reasoning is done with an original methodology based on the gaze of all participants. We evaluate our proposal on a public state-of-the-art benchmark, the AMI corpus, and show how adding visual and contextual information improves speaker recognition performance.
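As an illustration of how a gaze-based contextual cue could work, the sketch below estimates the active speaker as the participant receiving the most gazes from the others. This is a hypothetical simplification for intuition only, not the paper's actual methodology; the function name `gaze_vote` and the dictionary representation of gaze targets are assumptions.

```python
from collections import Counter

def gaze_vote(gaze_targets):
    """Return the participant most others are looking at.

    gaze_targets maps each participant to the participant they are
    currently looking at (None if their gaze target is undetermined).
    This majority-vote rule is a hypothetical stand-in for the
    gaze-based contextual reasoning described in the abstract.
    """
    votes = Counter(t for t in gaze_targets.values() if t is not None)
    if not votes:
        return None  # no usable gaze information in this frame
    return votes.most_common(1)[0][0]

# Example frame: three of four participants look at "B",
# so "B" is the most likely active speaker under this rule.
frame = {"A": "B", "B": "C", "C": "B", "D": "B"}
print(gaze_vote(frame))  # → B
```

In practice such a vote would be computed per frame and fused with the audio and visual CNN scores rather than used alone.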
Main file: ICPR.pdf (5.33 MB). Origin: files produced by the author(s).

Dates and versions

hal-03125600, version 1 (29-01-2021)

Identifiers

Cite

Jorge Francisco Madrigal Diaz, Frédéric Lerasle, Lionel Pibre, Isabelle Ferrané. Audio-Video detection of the active speaker in meetings. IEEE 25th International Conference on Pattern Recognition (ICPR 2020), IAPR: International Association for Pattern Recognition, Jan 2021, Milan (virtual), Italy. ⟨10.1109/ICPR48806.2021.9412681⟩. ⟨hal-03125600⟩