VoiceXML Italian User Group. Interview with T.V. Raman by Paolo Baggia

Interview with Dr. T.V. Raman by Paolo Baggia (Loquendo) - July 2004.

This time we put a few questions to Dr. T.V. Raman, who works at IBM Research in Almaden. Dr. Raman is one of the inventors of the multimodal X+V language and an expert in integrating speech into Web applications.

Paolo Baggia, representative of Loquendo at W3C MMIWG and VBWG

[Paolo Baggia] What is the contribution of X+V to multimodal applications?

[Dr. T.V. Raman] I would like to place XHTML+VoiceXML, called X+V, in its context. The W3C produces a lot of standards: XHTML, SVG, SMIL, etc. Multimodal means bringing together different modalities, such as speech, touch, gesture, etc. We want the W3C standards to cooperate in doing that. X+V is an example: it is a formula, a way to make XHTML and VoiceXML work together by means of a glue layer. You do not create a new multimodal language; you take the best of breed for visual applications (XHTML) and the best of breed for voice applications (VoiceXML) and you combine them. Anyone who likes vector graphics can follow the same formula and come up with XHTML plus VoiceXML plus SVG. That is the key contribution of X+V to multimodality.

[P.B.] Can you clarify what you mean by 'glue layer'?

[Dr. T.V. Raman] The glue that you need is at the user interaction level. Let me give an example: when the user clicks on a button, that is an event; if you use the voice modality, saying something raises another event. So you need a common event model that allows both.
In X+V, we use XML Events, which is the authoring syntax for creating DOM event handlers and event bindings. What happens is that a mouse click can fire VoiceXML dialogs as event handlers.
VoiceXML can collect data from the user by voice and then raise an event to XHTML to update the GUI.
If you have a menu you can say: "I want to go to Torino", and that raises an event to XHTML that selects Torino in the visual menu. You can do this in the other direction too: a mouse click on Torino raises an event to VoiceXML, which can ask: "Do you want to go to Torino?". This is the multimodal framework created by X+V.

Therefore there are two main contributions:
1. Bring the W3C standards to work together.
2. Use XML Events as the glue that brings the standards together at runtime.
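
The two-way event flow described above can be sketched in a few lines of Python. This is only an illustration of the idea of a shared event model; it is not X+V markup and it does not use the XML Events syntax. The `EventBus`, `VisualMenu` and `VoiceDialog` classes, the event names `"click"` and `"voice-filled"`, and the simple keyword matching that stands in for speech recognition are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Optional


class EventBus:
    """Hypothetical shared event model standing in for the 'glue layer'."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def listen(self, event: str, handler: Callable[[str], None]) -> None:
        # Bind a handler to a named event.
        self._handlers[event].append(handler)

    def dispatch(self, event: str, value: str) -> None:
        # Call every handler bound to this event, in registration order.
        for handler in self._handlers[event]:
            handler(value)


class VisualMenu:
    """Stands in for the XHTML side: a menu with one selected option."""

    def __init__(self, bus: EventBus, options: List[str]) -> None:
        self.bus = bus
        self.options = options
        self.selected: Optional[str] = None
        # A result from the voice side updates the visual selection.
        bus.listen("voice-filled", self.select)

    def select(self, value: str) -> None:
        if value in self.options:
            self.selected = value

    def click(self, value: str) -> None:
        # A mouse click selects the option and raises an event for the voice side.
        self.select(value)
        self.bus.dispatch("click", value)


class VoiceDialog:
    """Stands in for the VoiceXML side: hears utterances, speaks prompts."""

    def __init__(self, bus: EventBus) -> None:
        self.bus = bus
        self.prompts: List[str] = []
        # A mouse click fires a voice dialog as its event handler.
        bus.listen("click", self.confirm)

    def confirm(self, value: str) -> None:
        self.prompts.append(f"Do you want to go to {value}?")

    def hear(self, utterance: str, options: List[str]) -> None:
        # Keyword matching is a placeholder for real speech recognition.
        for option in options:
            if option.lower() in utterance.lower():
                self.bus.dispatch("voice-filled", option)
                return
```

With a menu built as `VisualMenu(bus, ["Torino", "Milano"])` (the second city is just a made-up extra option) and a `VoiceDialog(bus)` on the same bus, calling `voice.hear("I want to go to Torino", menu.options)` sets `menu.selected` to `"Torino"`, and calling `menu.click("Milano")` makes the voice side queue the prompt "Do you want to go to Milano?". Neither side knows about the other; they only share the event model.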

[P.B.] Do you think there is an interest in multimodal applications today?

[Dr. T.V. Raman] This is an emerging space. You can use multimodality wherever you already have a GUI. This is very interesting for car applications (hands-free), PDAs, and home entertainment. All of them have a display, and speech may be the second thing to come along.

[P.B.] Why are there so few multimodal applications active today?

[Dr. T.V. Raman] Some of the hype came earlier; even today voice-only applications are very important. This is an important reason to build MMI based on VoiceXML. This is scalable: you invest in VoiceXML today and at the same time you test multimodal applications with X+V.
Leverage the investment in VoiceXML and use it to do experiments.

[P.B.] Do you think that there is competition between voice-only and multimodal applications?

[Dr. T.V. Raman] They can cooperate and coexist. The number of telephones is not going to change; phones will not all go away. They will complement each other.

[P.B.] Are there differences between the US market and the European market or other regions?

[Dr. T.V. Raman] Europe can be further ahead, because the cellphone network is more developed than in the US. Europe can escape the PC-only model. It has more opportunities today to develop new kinds of applications.

[P.B.] And what about other markets?

[Dr. T.V. Raman] In Asia it is very important because it is difficult to use the keyboard. In India it is very relevant for literacy reasons. The "Simputer" (a kind of kiosk) is used in the villages and it is mainly multimodal. It is used to show the prices of products, which can be updated on the fly. For someone who cannot read, being able to see the pictures and understand the prices is very important, so using voice and pictures matters a great deal in that context. All over the world, in the emerging markets, the multimodal market is much bigger; people are ready to use these things.

[P.B.] But isn't it likely to be very expensive?

[Dr. T.V. Raman] The "Simputer" in India costs $150 for the device, but each villager only has to buy a $1 smartcard to use it. Multimodal devices are much better suited to this than monomodal computers.

[P.B.] Who made the Simputer?

[Dr. T.V. Raman] An open-source group did the hardware spec, and they are ready to do it in other countries too.