VoiceXML Italian User Group. Interview with Dr. Deborah Dahl by Paolo Baggia

Interview with Dr. Deborah Dahl by Paolo Baggia (Loquendo) - July 2003.

Considering the increasing interest in multimodal applications, I asked Dr. Deborah Dahl, chair of the W3C Multimodal Interaction Working Group, to answer a few questions in this interview.
I hope it will stimulate some thinking about the new challenges that voice will face in the near future.

Paolo Baggia, representative of Loquendo at W3C MMIWG and VBWG

[Paolo Baggia] Why is there an increased interest in multimodal applications?

[Dr. Deborah Dahl] Although multimodality has been of academic interest for many years, it is becoming of increased commercial interest because three important aspects of the infrastructure that supports multimodal applications are now coming into place. First, since multimodal applications depend on speech recognition, technical improvements in the accuracy and robustness of speech recognition over the last few years are making commercial multimodal applications much more feasible. The second aspect of multimodal infrastructure that's coming into place is the explosive development of mobile telephony. Although mobile devices are becoming increasingly capable, they are also becoming smaller, which makes keypad or keyboard input often slow and awkward. Speech input is natural in this context. Finally, and most recently, the voice and web infrastructure and the availability of development tools are making application development much easier and more efficient.

[P.B.] What proposals are on the table?

[Dr. Deborah Dahl] Older multimodal applications relied primarily on proprietary technologies, although open speech APIs like SAPI and JSAPI have significantly helped to reduce the complexity of application development with speech. However, as I mentioned earlier, the integration of voice and the web and the associated standards and tools that are becoming available are reducing the complexity of application development even further. There are two important open web-based multimodal specifications currently available. First, IBM and Opera Software have defined an approach to integrating XHTML with VoiceXML (called X+V), and second, the SALT Forum has also defined a multimodal specification, SALT 1.0. Many of the ideas in these approaches are quite similar, but SALT programming is generally at a lower level than X+V programming. The World Wide Web Consortium's Multimodal Interaction Working Group, which I chair, is working on defining a single standard for multimodal interaction in a web environment. Both the X+V and the SALT specifications have been offered to the W3C on a royalty-free basis as contributions to this activity.
Other emerging standards such as the W3C's SRGS and SSML for speech recognition and text to speech are also applicable to multimodal as well as to voice-only applications.

[P.B.] When do you think multimodal applications will emerge?

[Dr. Deborah Dahl] There are many specialized multimodal applications in existence already, such as language learning systems and tools for the disabled, but they reach relatively few users. Mobile multimodal applications show the most promise for making multimodality truly mainstream. Many companies, particularly carriers, are working internally on multimodal applications and trials, but they aren't ready to release information publicly about what they're doing. However, I believe that we're likely to see some announcements of trials and deployments during the fourth quarter of this year.

[P.B.] Are there examples of multimodal applications today?

[Dr. Deborah Dahl] There aren't too many deployed multimodal applications in the mainstream marketplace yet. As one example, LogicTree has deployed systems that provide public transportation information for local government customers. There are many more applications in trial stages; for example, trials of multimodal applications have been publicly announced by vendors such as Kirusa (voice portal applications). Other interesting work being explored, but not yet deployed, includes SpeechWorks's multimodal automobile interface for the Ford Model U Concept SUV. This interface allows users to perform functions such as navigation and adjusting the heating and air conditioning. Multimodal speech therapy systems are also being used with stroke patients to rehabilitate their language abilities.

[P.B.] How will the voice market be affected by multimodal application development?

[Dr. Deborah Dahl] Multimodal applications will help expand the voice market by making possible applications that either can't be done at all with voice-only interfaces (for example, applications where a visual display of an image or video is an integral part of the application) or that involve tasks where interaction through a voice-only interface is very time-consuming, such as selection from a long list. I don't think it's likely that the market for voice-only applications will be reduced, because the prevalence of telephones without displays means that voice-only applications will continue to be needed for those customers.

[P.B.] Do you think there will be some differences between the US and EU markets for these technologies?

[Dr. Deborah Dahl] European users have a reputation for being more sophisticated mobile users than people in the US, so I expect that multimodal applications will become widespread in Europe earlier than they will in the US. However, I don't expect that the applications will be different in kind.

[P.B.] What are good applications for multimodality?

[Dr. Deborah Dahl] I don't think anyone understands yet what the most compelling multimodal applications will turn out to be. At a minimum, speech absolutely needs to be perceived as adding value to the application; it doesn't work to just include speech in an application because it's a cool technology. Speech clearly adds value to applications that run on small devices with cumbersome keypads as well as to applications where the user's hands or eyes are busy. Most multimodal demonstrations I've seen have been demonstrations of form-filling applications. However, I think that voice-based navigation may actually be a more compelling task than form-filling, since working your way through many levels of menus in a GUI is very tedious on small devices. As developers become more familiar with the possibility of multimodality and with using multimodal tools, I think we'll start to see many innovative applications. I would encourage anyone with interest in this area to try developing multimodal applications with some of the available platforms such as X+V from IBM or SALT from Microsoft. Although it's often claimed that future multimodal applications will be developed primarily by web developers, I think that developers with expertise in speech applications will actually be able to develop better applications than GUI web developers because of their familiarity with the speech interface.
