Call: International Conference on Multimodal Interaction (ICMI 2012)

2012 International Conference on Multimodal Interaction (ICMI 2012)
October 22nd-26th, 2012
Doubletree Suites, Santa Monica, California, USA


The International Conference on Multimodal Interaction, ICMI 2012, will take place in Santa Monica, California (USA), October 22nd-26th, 2012. ICMI is the premier international forum for multidisciplinary research on multimodal human-human, human-robot, and human-computer interaction, interfaces, and system development. The conference focuses on theoretical and empirical foundations, component technologies, and combined multimodal processing techniques that define the field of multimodal interaction analysis, interface design, and system development. ICMI 2012 will feature a single-track main conference including keynote speakers, technical full and short papers (with oral and poster presentations), special sessions, demonstrations, exhibits, and doctoral spotlight papers. The conference will be followed by workshops. The proceedings of ICMI 2012 will be published by ACM as part of its International Conference Proceedings Series.


Main conference paper submission:  ** May 4th, 2012 **
Demo proposals due:  June 22nd, 2012
Main conference author notification:  July 23rd, 2012
Camera-ready copy (long, short, demo):  August 20th, 2012
Exhibit proposals due:  September 26th, 2012
Challenge Meetings:  October 22nd, 2012
Main Conference:  October 23rd-25th, 2012
Workshops:  October 26th, 2012


Instructions for authors:

Submission Website:


Infusing Physical World into User Interfaces

Ivan Poupyrev
Senior Research Scientist, Disney Research, The Walt Disney Company

Ivan Poupyrev will talk about various approaches to infusing the physical world into user interfaces.

Dr. Ivan Poupyrev directs an Interaction Technology group in Disney Research’s Pittsburgh Lab, a unit of Walt Disney Imagineering tasked with dreaming up and developing future technologies for Disney parks, resorts, and cruises. Dr. Poupyrev’s group focuses on inventing new interactive technologies for the seamless blending of digital and physical properties in devices, everyday objects, and living environments, a direction he broadly refers to as physical computing.

His work spans a broad range of research fields, including haptic user interfaces, tangible interfaces, shape-changing and flexible computers, augmented and virtual reality, and spatial 3D interaction. His research has been widely published, has received awards at prestigious academic conferences, has been exhibited worldwide and extensively reported in the popular media, and has been released on the market in various consumer products and applications. Prior to Disney, Ivan worked as a researcher at Sony Computer Science Laboratories and the Advanced Telecommunications Research Institute International, both in Japan. He was also a Visiting Scientist at the Human Interface Technology Laboratory at the University of Washington while working on his Ph.D. dissertation at Hiroshima University, Japan.

Using Psychophysical Techniques to Design and Evaluate Multi-modal Interfaces

Roberta Klatzky
Professor of Psychology, Carnegie Mellon University

“Psychophysics” is an approach to evaluating human perception and action capabilities that emphasizes control over the stimulus environment. Virtual environments provide an ideal setting for psychophysical research, as they facilitate not only stimulus control but precise measurement of performance. In my research I have used the psychophysical approach to inform the design and evaluation of multi-modal interfaces that enable action in remote or virtual worlds or that compensate for sensori-motor impairment in the physical environment of the user. This talk will describe such projects, emphasizing the value of behavioral science to interface engineering.

Roberta Klatzky is Professor of Psychology at Carnegie Mellon University, where she is also on the faculty of the Center for the Neural Basis of Cognition and the Human-Computer Interaction Institute. She received a B.S. in mathematics from the University of Michigan and a Ph.D. in cognitive psychology from Stanford University. She is the author of over 200 articles and chapters, and she has authored or edited six books. Her research investigates perception, spatial thinking, and action from the perspective of multiple modalities, sensory and symbolic, in real and virtual environments. Klatzky’s basic research has been applied to tele-manipulation, image-guided surgery, navigation aids for the blind, and neural rehabilitation. Klatzky is a fellow of the American Association for the Advancement of Science, the American Psychological Association, and the Association for Psychological Science, and a member of the Society of Experimental Psychologists (honorary). For her research on perception and action, she received an Alexander von Humboldt Research Award and the Kurt Koffka Medaille from Justus-Liebig-University of Giessen, Germany. Her professional service includes governance roles in several societies and membership on the National Research Council’s Committees on International Psychology, Human Factors, and Techniques for Enhancing Human Performance. She has served on research review panels for the National Institutes of Health, the National Science Foundation, and the European Commission. She has been a member of many editorial boards and is currently an associate editor of ACM Transactions on Applied Perception and IEEE Transactions on Haptics.

The Combinatorial Structure of Human Action

Charles Goodwin
Social Sciences Division, University of California, Los Angeles, USA

Charles Goodwin is Professor of Applied Linguistics at UCLA. He received his Ph.D. from the Annenberg School of Communications, University of Pennsylvania, in 1977, and a Doctor of Philosophy Honoris Causa from the Faculty of Arts and Sciences, Linköping University, in 2009. His interests include video analysis of talk-in-interaction (including study of the discursive practices used by hearers and speakers to construct utterances, stories, and other forms of talk), grammar in context, cognition in the lived social world, gesture, gaze and embodiment as interactively organized social practices, aphasia in discourse, language in the professions, and the ethnography of science. He has been involved in numerous studies, including family interaction in the United States, the work of oceanographers at the mouth of the Amazon, archaeologists in the United States and Argentina, geologists in the back country of Yellowstone, the organization of talk, vision, and embodied action in the midst of surgery, and how a man with severe aphasia is able to function as a powerful speaker in conversation. As part of the Workplace Project at Xerox PARC he investigated cognition and talk-in-interaction in complex work settings. His publications include Embodied Interaction: Language and Body in the Material World (edited with Jürgen Streeck and Curtis LeBaron; Cambridge U. Press), Rethinking Context (co-edited with Alessandro Duranti; Cambridge U. Press), Conversation and Brain Damage (Oxford U. Press), Conversational Organization (Academic Press), Il Senso del Vedere: Pratiche Sociali della Significazione (Meltemi Editore), “Professional Vision” (American Anthropologist), “Action and Embodiment” (Journal of Pragmatics), “Constructing Meaning Through Prosody in Aphasia” (in Prosody in Interaction, ed. Barth-Weingarten et al.), “Ensembles of Emotion” (with M.H. Goodwin et al., in Emotion in Interaction, ed. Sorjonen and Peräkylä), “How geoscientists think and learn” (with Kastens et al., EOS Transactions, American Geophysical Union), and “Can you see the cystic artery yet?” (with Koschmann et al., Journal of Pragmatics).


Human action is built by actively and simultaneously combining materials with intrinsically different properties into situated contextual configurations where they can mutually elaborate each other to create a whole that is both different from, and greater than, any of its constitutive parts. These resources include many different kinds of lexical and syntactic structures, prosody, gesture, embodied participation frameworks, sequential organization, and different kinds of materials in the environment, including tools created by others that structure local perception. The simultaneous use of different kinds of resources to build single actions has a number of consequences. First, different actors can contribute different kinds of materials that are implicated in the construction of a single action. For example, embodied visual displays by hearers operate simultaneously on emerging talk by a speaker, so that both the utterance and the turn have intrinsic organization that is both multi-party and multimodal. Someone with aphasia who is unable to produce lexical and syntactic structure can nonetheless contribute crucial prosodic and sequential materials to a local action, while appropriating the lexical contributions of others, and thus become a powerful speaker in conversation, despite catastrophically impoverished language. One effect of this simultaneous, distributed heterogeneity is that frequently the organization of action cannot be easily equated with the activities of single individuals, such as the person speaking at the moment, or with phenomena within a single medium such as talk. Second, subsequent action is frequently built through systematic transformations of the different kinds of materials provided by a prior action. In this process some elements of the prior contextual configuration, such as the encompassing participation framework, may remain unchanged, while others undergo significant modification.
A punctual perspective on action, in which separate actions discretely follow one another, thus becomes more complex when action is seen to emerge within an unfolding mosaic of disparate materials and time frames which make possible not only systematic change, but also more enduring frameworks that provide crucial continuity. Third, the distributed, compositional structure of action provides a framework for developing the skills of newcomers within structured collaborative action. Fourth, human tools differ from the tools of other animals in that, like actions in talk, they are built by combining unlike materials into a whole not found in any of the individual parts (for example using a stone, a piece of wood and leather thongs to make an ax). This same combinatorial heterogeneity sits at the heart of human action in interaction, including language use. It creates within the unfolding organization of situated activity itself the distinctive forms of transformative collaborative action in the world, including socially organized perceptual and cognitive structures and the mutual alignment of bodies to each other, which constitutes us as humans.


Developing systems that can robustly understand human-human communication or respond to human input requires identifying the best algorithms and their failure modes. In fields such as computer vision, speech recognition, and computational linguistics, the availability of datasets and common tasks has led to great progress. This year we invited the ICMI community to collectively define and tackle scientific Grand Challenges in multimodal interaction for the next five years. We received a good response to the call, and we are hosting four Challenge events at the ICMI 2012 conference. Multimodal Grand Challenges are driven by ideas that are bold, innovative, and inclusive. We hope they will inspire new ideas in the ICMI community and create momentum for future collaborative work.

Grand Challenge Chairs

Daniel Gatica-Perez
Idiap Research Institute

Stefanie Tellex
Massachusetts Institute of Technology

AVEC 2012: 2nd International Audio/Visual Emotion Challenge and Workshop

The Audio/Visual Emotion Challenge and Workshop (AVEC 2012) will be the second competition event aimed at comparing multimedia processing and machine learning methods for automatic audio, visual, and audiovisual emotion analysis, with all participants competing under strictly the same conditions. The goal of the challenge is to provide a common benchmark test set for multimodal information processing and to bring together the audio and video emotion recognition communities, in order to compare the relative merits of the two approaches under well-defined and strictly comparable conditions, and to establish to what extent fusion of the approaches is possible and beneficial. A second motivation is the need for emotion recognition systems that can deal with naturalistic behavior in large volumes of un-segmented, non-prototypical, and non-preselected data, as this is exactly the type of data that both multimedia retrieval and human-machine/human-robot communication interfaces face in the real world.

We are calling for teams to participate in emotion recognition from acoustic audio analysis, linguistic audio analysis, video analysis, or any combination of these. The SEMAINE database of naturalistic dialogues will be used as the benchmark database. Emotion will have to be recognized as continuous-time, continuous-valued dimensional affect in four dimensions: arousal, expectation, power, and valence. Besides participation in the Challenge, we are calling for papers addressing the overall topics of this workshop, in particular work that addresses the differences between audio and video processing of emotive data and the issues concerning combined audio-visual emotion recognition.

Please visit our website for more information:


Björn Schuller
Institute for Man-Machine Communication, Technische Universität München, Germany

Michel Valstar
Intelligent Behaviour Understanding Group, Imperial College London, U.K.

Roddy Cowie
School of Psychology, Queen’s University Belfast, U.K.

Maja Pantic
Intelligent Behaviour Understanding Group, Imperial College London, U.K.

Important dates:

Paper submission:  July 1, 2012
Notification of acceptance:  July 16, 2012
Camera ready paper:  July 31, 2012

Haptic Voice Recognition Grand Challenge 2012

The Haptic Voice Recognition (HVR) Grand Challenge 2012 is a research-oriented competition designed to bring together researchers across multiple disciplines to work on Haptic Voice Recognition, a novel multimodal text-entry method for modern mobile devices. HVR combines voice and touch input to achieve better efficiency and robustness. Since modern portable devices are now commonly equipped with both microphones and a touchscreen display, it is interesting to explore ways of enhancing text entry on these devices by combining information from both sensors. The purpose of this grand challenge is to define a set of common challenge tasks for researchers to work on, in order to address the challenges faced and to bring the technology to the next frontier. Basic tools and setups are also provided to lower the entry barrier, so that research teams can participate without having to work on all aspects of the system. This grand challenge will be accompanied by a workshop held at the International Conference on Multimodal Interaction (ICMI) 2012, where participants will have the opportunity to share their innovative findings and engage in discussions concerning current and future research directions. Participants will also submit papers to the workshop to report their research findings. Accepted papers will be included in the proceedings of ICMI 2012.

Please visit our website for more information:


Dr. Khe Chai SIM
School of Computing, National University of Singapore

Dr. Shengdong Zhao
School of Computing, National University of Singapore

Dr. Kai Yu
Engineering Department, Cambridge University

Dr. Hank Liao
Google Inc.

Important dates:

Release of development data:  March 1, 2012
Release of challenge data:  July 1, 2012
Paper submission:  July 31, 2012
Notification of acceptance:  August 13, 2012
Camera ready paper:  August 20, 2012

D-META Grand Challenge: Data sets for Multimodal Evaluation of Tasks and Annotations

The D-META Grand Challenge sets up a basis for the comparison, analysis, and further improvement of multimodal data annotations and multimodal interactive systems. Machine learning-based challenges of this kind have not previously existed in the multimodal interaction community. The main goal of this Grand Challenge is to foster research and development in multimodal communication and to further elaborate algorithms and techniques for building multimodal applications. Built on two coupled pillars, method benchmarking and annotation evaluation, the D-META challenge envisions a starting point for transparent and publicly available evaluation of applications and annotations on multimodal data sets.

Please visit our website for more information:


Xavier Alameda-Pineda
Perception Team, INRIA Rhône-Alpes, University of Grenoble, France

Dirk Heylen
Human Media Interaction, University of Twente, The Netherlands

Kristiina Jokinen
Department of Behavioural Sciences, University of Helsinki, Finland

Important dates:

Paper submission:  June 15, 2012
Notification of acceptance:  August 24, 2012
Camera ready paper:  September 24, 2012

BCI Grand Challenge: Brain-Computer Interfaces as Intelligent Sensors for Enhancing Human-Computer Interaction

The field of physiological computing comprises systems that use data from the human nervous system as control input to a technological system. Traditionally these systems have been grouped into two categories: those where physiological data is used as a form of input control, and those where spontaneous changes in physiology are used to monitor the psychological state of the user. Brain-Computer Interfaces (BCIs) are traditionally conceived as a control channel for interfaces, a device that allows the user to “act on” external devices as a form of input control. However, most BCIs do not provide a reliable and efficient means of input control and are difficult to learn and use relative to other available modes. We propose to change the conceptual use of “BCI as an actor” (input control) into “BCI as an intelligent sensor” (monitor). This shift of emphasis promotes the capacity of BCIs to represent spontaneous changes in the state of the user in order to induce intelligent adaptation at the interface. BCIs can increasingly be used as intelligent sensors that “read” passive signals from the nervous system and infer user states to adapt human-computer, human-robot, or human-human interaction (HCI, HRI, and HHI, respectively). This perspective challenges researchers to understand how information about the user state should support different types of interaction dynamics, from supporting the goals and needs of the user to conveying state information to other users. What adaptation to which user state constitutes opportune support? How does the feedback of the changing HCI and HRI affect brain signals? Many research challenges remain to be tackled here.

Please visit the BCI Grand Challenge website for more information:


Femke Nijboer
University of Twente, The Netherlands

Mannes Poel
University of Twente, The Netherlands

Anton Nijholt
University of Twente, The Netherlands

Egon L. van den Broek
TNO Technical Sciences, The Netherlands

Stephen Fairclough
Liverpool John Moores University, United Kingdom

Important dates:

Deadline for submission:  June 15, 2012
Notification of acceptance:  July 7, 2012
Final papers due:  August 15, 2012
