[Meta has developed an ambitious project and resource to enhance the potential for medium-as-social-actor presence responses to human avatars, as described in this story from App Developer Magazine. –Matthew]

Modeling natural conversational dynamics
As Meta researchers push the boundaries of virtual interaction, modeling natural conversational dynamics promises to redefine how humans and AI connect through lifelike gestures, expressions, and seamless turn-taking.
By Austin Harris
July 25, 2025
Meta’s Fundamental AI Research (FAIR) team has introduced a family of audiovisual behavioral motion models designed to advance human connection technology. These models generate facial expressions and body gestures based on audio-visual inputs from two people, aiming to create more natural and interactive virtual agents. The models enable fully embodied avatars in both 2D video and 3D Codec Avatars, which could transform telepresence technologies in virtual and augmented reality settings.
Modeling natural conversational dynamics: The Seamless Interaction Dataset
To support this work, Meta is releasing the Seamless Interaction Dataset, a large-scale collection comprising over 4,000 hours of two-person interactions with over 4,000 unique participants. This dataset captures diverse, in-person conversational dynamics, providing a foundation for audiovisual behavioral models to understand and generate human-like social behaviors.
Communication between individuals involves continuous adjustment of speech, intonation, and gesture, a process often described as a conversational dance. Modeling dyadic (two-party) conversation dynamics requires understanding the interplay of vocal, verbal, and visual signals, as well as interpersonal behaviors such as listening, visual synchrony, and turn-taking. Meta’s Dyadic Motion Models aim to render speech, whether human-generated or produced by language models, into full-body gestures and active listening behaviors. These capabilities have the potential to create virtual agents that engage in social interactions with human-like expressiveness across immersive environments.
Technical details and innovations
The Audio-Visual (AV) Dyadic Motion Models introduced by Meta can jointly generate facial expressions and body gestures based on audio inputs from two individuals or from speech generated by large language models (LLMs). These models visualize the emotions, gestures, and movements implied by conversations, producing both speaking and listening behaviors, as well as turn-taking cues.
By incorporating visual inputs alongside audio, the models learn synchrony cues such as mirrored smiles or joint gaze, enriching the realism of generated interactions. The models also include controllability parameters, enabling users or designers to adjust avatar expressivity. This flexibility can be guided implicitly by LLM speech output, providing visual direction to the motion model.
Furthermore, the models output intermediate face and body motion codes, allowing adaptation to a range of applications, including 2D video generation and the animation of 3D Codec Avatars. Meta’s Codec Avatars lab has contributed baseline reference implementations and datasets to assist the research community in advancing metric telepresence.
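To make that pipeline concrete, the sketch below shows how such a model might be wrapped in code. The class and method names are hypothetical, not Meta's released API: dyadic audio goes in, intermediate face and body motion codes come out, and the same codes can then drive either a 2D video renderer or a 3D Codec Avatar.

```python
# Illustrative sketch only: names and shapes are assumptions, not Meta's API.
from dataclasses import dataclass

import numpy as np


@dataclass
class MotionCodes:
    """Intermediate representation: per-frame face and body motion codes."""
    face: np.ndarray   # shape (frames, face_code_dim)
    body: np.ndarray   # shape (frames, body_code_dim)


class DyadicMotionModel:
    """Hypothetical wrapper around an audio-driven dyadic motion model."""

    def generate(self,
                 audio_self: np.ndarray,
                 audio_partner: np.ndarray,
                 expressivity: float = 0.5) -> MotionCodes:
        """Generate motion codes for one participant.

        audio_self:    waveform of the avatar being animated (human speech or LLM TTS)
        audio_partner: waveform of the conversation partner, so the model can
                       produce listening behaviors and turn-taking cues
        expressivity:  controllability knob in [0, 1] for gesture intensity
        """
        # ~30 motion frames per second of 16 kHz audio; a real model would run
        # a learned network here, so we only return shape-correct placeholders.
        frames = max(len(audio_self), len(audio_partner)) * 30 // 16000
        rng = np.random.default_rng(0)
        return MotionCodes(face=rng.normal(size=(frames, 128)) * expressivity,
                           body=rng.normal(size=(frames, 64)) * expressivity)


def render_2d_video(codes: MotionCodes) -> None:
    """Placeholder: drive a 2D video generator from the motion codes."""


def drive_codec_avatar(codes: MotionCodes) -> None:
    """Placeholder: animate a 3D Codec Avatar from the same codes."""
```

Because the motion codes sit between the generative model and the renderer, the same generated behavior can, in principle, be reused across output formats.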
Building the dataset
The Seamless Interaction Dataset is the largest known collection of high-quality, in-person two-person interactions, capturing simultaneous facial and body signals. Grounded in contemporary psychological theory, the dataset encompasses over 4,000 hours of audiovisual interactions with more than 4,000 participants. It features approximately 1,300 conversational and activity-based prompts, including naturalistic, improvised, and scripted content to cover a wide emotional spectrum, ranging from surprise and disagreement to determination and regret.
About one-third of the recordings involve familiar pairs (family, friends, colleagues), allowing exploration of relationship-driven behaviors, while another third involves professional actors portraying diverse roles and emotions. All sessions were recorded in person to preserve embodied interaction qualities, avoiding the limitations of remote, video-based communication.
In addition to the raw recordings, the dataset offers rich contextualization with participant-level relationships, personality metadata, and nearly 5,000 video annotations.
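As a rough illustration of how that metadata could be organized, the sketch below defines a hypothetical session record; the field names are assumptions for illustration and do not reflect the dataset's actual release format.

```python
# Illustrative sketch only: a hypothetical schema mirroring the fields
# described above (prompt, pair relationship, actor vs. naturalistic session,
# personality metadata, per-video annotations).
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class InteractionSession:
    """Hypothetical record for one two-person session."""
    session_id: str
    duration_s: float
    prompt: str                   # one of ~1,300 conversational/activity prompts
    relationship: str | None      # e.g. "family", "friends", "colleagues"; None if strangers
    professional_actors: bool     # roughly one-third of sessions
    personality: dict[str, float] = field(default_factory=dict)
    annotations: list[str] = field(default_factory=list)


def familiar_pairs(sessions: list[InteractionSession]) -> list[InteractionSession]:
    """Select sessions between people who already know each other,
    e.g. to study relationship-driven behaviors."""
    return [s for s in sessions if s.relationship is not None]
```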
Evaluation methodology
Alongside the dataset, Meta has published a technical report detailing the methodology and findings of the research. The report proposes an evaluation methodology that includes both subjective and objective metrics, helping to assess the progress of audiovisual behavioral models. The evaluation protocol focuses on speaking, listening, and turn-taking behaviors, offering a blueprint for future research in this emerging field.
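For a flavor of what an objective metric in this space can look like, the toy example below compares how gesture energy is split between speaking and listening segments in generated versus reference motion. It is purely illustrative and is not one of the metrics proposed in Meta's report.

```python
# Illustrative sketch only: a toy objective check in the spirit of
# speaking/listening evaluation, not a metric from the technical report.
import numpy as np


def gesture_energy(motion: np.ndarray) -> np.ndarray:
    """Per-frame gesture energy: norm of frame-to-frame motion differences."""
    return np.linalg.norm(np.diff(motion, axis=0), axis=1)


def speaking_listening_ratio(motion: np.ndarray, is_speaking: np.ndarray) -> float:
    """Ratio of mean gesture energy while speaking vs. while listening.

    motion:      (frames, dims) body motion codes or keypoints
    is_speaking: (frames,) boolean voice-activity mask for the same person
    """
    energy = gesture_energy(motion)
    speak = energy[is_speaking[1:]]
    listen = energy[~is_speaking[1:]]
    return float(speak.mean() / (listen.mean() + 1e-8))


def ratio_gap(generated: np.ndarray, reference: np.ndarray,
              is_speaking: np.ndarray) -> float:
    """Absolute gap between generated and reference ratios; lower is better."""
    return abs(speaking_listening_ratio(generated, is_speaking)
               - speaking_listening_ratio(reference, is_speaking))
```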
Privacy, ethics, and safeguards
Meta emphasizes privacy, ethics, and data quality throughout the research process. Participants consented to the recorded interactions, were advised to avoid sharing personal information, and took part in scripted sessions when needed to minimize disclosure risks. A multi-stage quality assurance process, combining human review, transcript analysis, and video-language-model inspection, was employed to detect and remove sensitive material or personally identifiable information.
Additionally, Meta employs AudioSeal and VideoSeal watermarking technologies to embed hidden signals in generated content, ensuring traceability and authenticity even after post-processing.
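For reference, the snippet below sketches how audio watermarking with the open-source AudioSeal library typically works; the model names and call pattern follow the library's public documentation at the time of writing and may change between releases.

```python
# Minimal sketch of audio watermarking with the open-source AudioSeal library
# (pip install audioseal); call pattern based on its published documentation.
import torch
from audioseal import AudioSeal

# 16 kHz mono audio as a (batch, channels, samples) tensor
wav = torch.randn(1, 1, 16000)
sr = 16000

# Embed an imperceptible watermark in the waveform
generator = AudioSeal.load_generator("audioseal_wm_16bits")
watermark = generator.get_watermark(wav, sr)
watermarked = wav + watermark

# Later, check whether a piece of audio carries the watermark
detector = AudioSeal.load_detector("audioseal_detector_16bits")
score, message = detector.detect_watermark(watermarked, sr)
print(score)    # probability that the audio is watermarked
print(message)  # the embedded multi-bit message
```

VideoSeal applies the same idea to video frames, so generated clips remain traceable even after common post-processing.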
Future outlook
The Dyadic Motion Models and Seamless Interaction Dataset represent major steps toward the development of social technologies that enhance daily life, provide entertainment, and foster connection. Meta’s commitment to responsible AI practices aims to build trust and deliver technology that benefits all. The company looks forward to seeing how the research community leverages the dataset and technical report to push the boundaries of social AI.