Cisco IOS VoiceXML Technology
As Internet technologies have advanced, access to Internet content and applications is evolving beyond a Web browser on a PC in your home or office. Mobile phones, traditional wireline phones, Cisco IP Phones, pagers, personal digital assistants (PDAs), and other devices now also provide access to the Internet, enabling people to communicate and to receive information more effectively anywhere and at any time.
This technology advancement is driving productivity in the workplace by improving the quality and consistency of business and personal communication. As more and more Internet-enabled voice applications are developed, the interface between the telephone and Internet content will become more widespread, allowing businesses to support customers more efficiently and to reduce the cost and complexity of managing multiple customer communication channels.
An open technology standard, VoiceXML enables the extension of World Wide Web applications to telephones. Cisco is enabling the advance of new communication services with VoiceXML, promoting the convergence of customer communication channels into standard IP networks and application platforms.
With intellectual resources that span the disciplines of voice and data networking, Cisco has integrated support for VoiceXML into Cisco IOS® Software for Cisco AS5000 Series universal gateways and for Cisco 3600 Series multiservice routers in partnership with Nuance and Speechworks for automated speech recognition (ASR) and text-to-speech (TTS) services. These new voice-browsing capabilities enable service providers and enterprises to expand their portfolio of revenue-generating services without large capital and operating expenses.
This white paper describes Cisco IOS VoiceXML technology and gives service providers, enterprises, and IT managers practical information about VoiceXML applications in IP networks.
VoiceXML is a standard markup language, similar to HTML, used to create next-generation Web-enabled interactive-voice-response (IVR) applications. New services with interactive voice menus and spoken voice responses require increased application capabilities, and VoiceXML provides flexibility and adaptability in application development. With VoiceXML, existing Internet technologies such as Hypertext Transfer Protocol (HTTP), Web applications, and streaming media servers can be extended for telephony and speech-enabled interaction. This means that existing Web design skills are easily adapted to voice application development. Many developers and systems integrators already have the skills necessary to support the integration of voice, Web applications, and data services using VoiceXML. By adding speech recognition and synthesis to telephony user interfaces (TUIs), enterprises and service providers can improve customer satisfaction, derive new sources of revenue, and reduce capital and operating expenses.
The VoiceXML Forum, founded in 1999 and backed by more than 500 companies, created an open specification with the "aim to drive the market for voice- and phone-enabled Internet access by promoting a standard specification for VoiceXML, a computer language used to create Web content and services that can be accessed by phone." VoiceXML builds on earlier technologies such as Phone Markup Language (developed by AT&T and Lucent), VoXML (Motorola), and SpeechML (IBM). These early technologies developed out of the concept of allowing Web content to be accessible from a telephone.
The VoiceXML Forum submitted the VoiceXML 1.0 specification to the World Wide Web Consortium (W3C). The W3C Voice Browser Working Group now manages new developments and standardization of VoiceXML 2.0 as part of the W3C Speech Interface Framework. Cisco is committed to the standardization and interoperability of VoiceXML platforms and actively participates in the W3C Voice Browser and Multimodal Interaction Working Groups and the IETF SPEECHSC Working Group. Cisco is also a co-founder of the Speech Application Language Tags (SALT) Forum, which has contributed the SALT specification to the W3C as input for future revisions of the VoiceXML standard.
A voice application requires interaction among several network components, including voice-over-IP (VoIP) gateways, Web and application servers, streaming media and database servers, as well as the VoiceXML voice browser platforms that deliver IVR presentation services. Cisco IOS Software integrates the VoiceXML voice browser directly into the Cisco voice gateways. Figure 1 illustrates the way information is transferred in a VoiceXML-enabled network.
Figure 1: Web Architecture and VoiceXML Application Components
Table 1 shows the components in a VoiceXML architecture and describes their interactions and interfaces.
Table 1: VoiceXML Components
VoiceXML document: A VoiceXML document specifies dialogs that are executed by a VoiceXML browser, such as the one in the Cisco AS5000 and the Cisco 3600 Series voice gateways. A set of VoiceXML documents makes up the script for the TUI for a voice application. For example, a voice-mail service may consist of separate VoiceXML documents for sending messages, checking and replying to messages, and menus to manage a unified messaging mailbox.
Web server: VoiceXML documents are retrieved from a Web server. VoiceXML browsers request VoiceXML documents from the Web server, which responds by providing static or dynamically generated VoiceXML documents. A Web server runs the application logic and may interface to external databases or application servers. The process for generating VoiceXML documents is the same as for visual Web pages and can use server-side scripting such as PHP: Hypertext Preprocessor (PHP), JavaServer Pages (JSP), Active Server Pages (ASP), Perl, or other site-creation and management tools.
VoiceXML browser: A VoiceXML browser provides the IVR dialog for the caller by interpreting and executing VoiceXML documents. The browser acts as an HTTP client to the Web server. The browser also supports interfaces to speech and telephony resources, including ASR, TTS, audio play and record functions, and collection of dual-tone multifrequency (DTMF) digits for telephone touch-tone input. Cisco provides VoiceXML browsers integrated with Cisco AS5000 and Cisco 3600 Series gateways and also as the IP IVR software product for Cisco CallManager networks.
Streaming media server: A server that supports the IETF Real-Time Streaming Protocol (RTSP) and Real-Time Transport Protocol (RTP) for audio playback, recording, or live audio content streaming. ASR and speech-synthesis media server software products are available from Cisco partners Nuance and Speechworks. The Cisco IOS VoiceXML browser interfaces to speech servers using open extensions of the standard RTSP protocol, developed jointly by Cisco, Nuance, and Speechworks.
Traditional speech recognition and IVR platforms have typically required specialized hardware and proprietary interfaces. Complex integration requirements, limited interoperability between vendors or technologies, and the complexity and scale of the traditional telephone system have made voice application development difficult and slow.
Because VoiceXML takes advantage of existing Web infrastructure and Web development resources, service providers and enterprises are able to build and integrate voice applications and enhanced services on top of the existing data infrastructures and realize significant cost savings. As an open W3C standard, VoiceXML provides the benefits of code portability, interoperability, and reuse across platforms and tools. Open standards give service providers options to select among the best solutions.
Based on Extensible Markup Language (XML), VoiceXML is easy to learn and easy to use. Web developers who are already familiar with HTML, Wireless Markup Language (WML), or Extensible Hypertext Markup Language (XHTML) can easily learn VoiceXML without extensive telephony experience or knowledge of telephony hardware. Following a client-server paradigm, VoiceXML browsers allow a separation of application and presentation logic. This separation simplifies application development. Application logic resides in Web servers, databases, and legacy systems, and presentation logic resides in VoiceXML gateways that interpret VoiceXML documents and interface with telephony networks.
Traditional visual Web applications use technologies like PHP, JSP, Perl, ASP, or WebSphere to generate HTML Web pages dynamically by executing database queries and other application logic on a Web server. VoiceXML brings this paradigm to the phone, giving developers an easy, integrated way to deliver a telephone- and voice-enabled version of their applications.
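As a rough illustration of this server-side pattern, the following Python sketch builds a small VoiceXML document from application data, much as a script would build an HTML page. The function name `render_balance_vxml`, the form id, and the prompt wording are hypothetical and chosen only for this example; a real application would obtain the data from its own database or back-end systems.

```python
import xml.etree.ElementTree as ET

# Namespace URI defined by the W3C for VoiceXML 2.0 documents.
VXML_NS = "http://www.w3.org/2001/vxml"


def render_balance_vxml(customer_name: str, balance: float) -> str:
    """Return a VoiceXML document that reads an account balance to the caller.

    Hypothetical helper: the arguments stand in for the result of a
    database query or other application logic on the Web server.
    """
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "balance"})
    block = ET.SubElement(form, "block")
    prompt = ET.SubElement(block, "prompt")
    # ElementTree escapes characters such as "&" and "<" when serializing.
    prompt.text = f"Hello {customer_name}. Your balance is {balance:.2f} dollars."
    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(render_balance_vxml("Pat", 12.5))
```

A Web server would return this string as the body of the HTTP response to the voice browser, in the same way it returns HTML to a visual browser.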
VoiceXML is suitable for developing a variety of voice applications, including:
Unified messaging services allow a user to access e-mail, voice mail, and fax from one interface using multimedia applications. VoiceXML, TTS, and ASR applications can be used to translate among different media, such as reading e-mail messages over a phone. VoiceXML applications are also used to record e-mail messages.
Other telephone services such as Find Me, Follow Me and voice-paging services for meeting reminders and wake-up calls can use VoiceXML applications for voice-activated dialing, accessing directories or address books to route calls, or for delivering services. Personal-assistant applications can also be developed in VoiceXML.
VoiceXML is well suited for information-retrieval applications. Often referred to as voice-enabled Web services or voice portals, these applications allow users to dial a specific number to access consumer or enterprise information such as stock quotes, account balances, telephone directory information, weather forecasts, news, flight information, driving directions, or movie listings. Users typically navigate through menus with touch-tone (DTMF) inputs or by speaking a menu option, which is interpreted by an ASR engine. Outputs to the caller can be prerecorded audio or TTS-generated audio. Portals often provide the initial call treatment before a call is routed to call center customer service representatives, reducing the agent time per call and operating costs.
VoiceXML can also be used to provide next-generation, transaction-oriented IVR services that enable e-commerce, banking, airline and travel reservations, catalog browsing, order processing, and package tracking. Again, users dial a specific number and navigate through menus and forms, responding to audio prompts by speaking. They may be transferred to a customer service representative as part of the IVR dialog.
VoiceXML documents are similar to HTML Web pages in their simplicity and in the way they present information. HTML documents provide instructions for the display of text, images, and user interactions with Web browsers. Similarly, VoiceXML documents provide IVR dialogs: recorded or TTS synthesized audio outputs with choices among responses using spoken commands or telephone keypad entry. The dialogs are interpreted by a VoiceXML browser so that information can be accessed over a telephone.
As an XML-based markup language, VoiceXML syntax defines a set of element and attribute tags. A VoiceXML document specifies one or more dialog items that are executed by a VoiceXML browser. The VoiceXML browser processes one dialog item at a time before transitioning to the next one within a VoiceXML document or in another VoiceXML document.
Two kinds of dialog items, forms and menus, are used to create the presentation logic of a VoiceXML application:
- A form is a dialog for collecting information, such as a credit card number. Within a VoiceXML document, <form> tags group sections of related input or output together.
- A menu offers a user a choice of options and transitions to another dialog based on selecting one of the choices.
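The two dialog types can be seen in a small example. The sketch below embeds an illustrative VoiceXML 2.0 document containing one menu and two forms, then uses Python's standard XML parser to list the dialogs and the menu's touch-tone choices. The dialog ids, prompts, and the `http://example.com/pay` URL are hypothetical, and the helper functions only inspect the document; they are not a VoiceXML browser.

```python
import xml.etree.ElementTree as ET

# Illustrative VoiceXML 2.0 document: one menu that routes to two forms.
VXML_DOC = """<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu id="main">
    <prompt>Say balance or payment, or press 1 or 2.</prompt>
    <choice dtmf="1" next="#balance">balance</choice>
    <choice dtmf="2" next="#payment">payment</choice>
  </menu>
  <form id="balance">
    <block><prompt>Your balance is not available in this demo.</prompt></block>
  </form>
  <form id="payment">
    <field name="card" type="digits">
      <prompt>Please enter your card number.</prompt>
    </field>
    <block>
      <submit next="http://example.com/pay" namelist="card"/>
    </block>
  </form>
</vxml>
"""

NS = "{http://www.w3.org/2001/vxml}"


def list_dialogs(doc: str) -> list:
    """Return (kind, id) pairs for the top-level forms and menus, in order."""
    root = ET.fromstring(doc)
    return [
        (child.tag[len(NS):], child.get("id", ""))
        for child in root
        if child.tag in (NS + "form", NS + "menu")
    ]


def menu_choices(doc: str, menu_id: str) -> dict:
    """Map each DTMF key of the named menu to the dialog it transitions to."""
    root = ET.fromstring(doc)
    for menu in root.findall(NS + "menu"):
        if menu.get("id") == menu_id:
            return {c.get("dtmf"): c.get("next") for c in menu.findall(NS + "choice")}
    return {}


if __name__ == "__main__":
    print(list_dialogs(VXML_DOC))
    print(menu_choices(VXML_DOC, "main"))
```

In this document the menu transitions to a form in the same document through a fragment reference such as `#balance`, while the payment form collects digits in a field and then submits them to the server.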
Cisco IOS voice gateways (Cisco AS5000 universal access servers and Cisco 3600 Series multiservice routers) provide VoiceXML browsing capabilities to support voice services such as unified messaging and call center IVR applications.
The VoiceXML feature in Cisco IOS Software provides a VoiceXML browser integrated in a Cisco voice gateway.
Table 2 indicates what happens when a voice application is triggered by a call.
Table 2: Sequence When Caller Activates a Voice Application
A caller dials a number and is connected to a Cisco VoiceXML-enabled voice gateway.
The gateway looks up the dialed number in the Dialed Number Identification Service (DNIS) map and retrieves the associated URL for the VoiceXML document.
The gateway retrieves the VoiceXML document from a local cache or from the Web server by sending an HTTP GET request. Alternatively, the gateway may retrieve the VoiceXML document from local Flash memory or a file server using Trivial File Transfer Protocol (TFTP).
The gateway's voice browser interprets the VoiceXML document and provides interaction dialogs to the user. Typically these interaction dialogs play prerecorded or synthesized audio prompts and request user input in the form of spoken commands or telephone DTMF touch-tone inputs.
Based on input received from the user, the gateway either processes more interaction dialogs, as directed by the VoiceXML document, or submits the input collected from the caller to the Web server using an HTTP POST or GET request; the server responds with another VoiceXML document.
The interaction between the gateway and server is illustrated in Figure 2.
Figure 2: Call Scenario Using VoiceXML
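The lookup, retrieval, and submission steps in this sequence can be sketched in a few lines of Python. This is only a conceptual model and not the Cisco IOS implementation: the class `VoiceGatewaySketch`, the helper `submit_url`, the dialed number, and the URLs are all hypothetical, and the cache here never expires entries.

```python
from typing import Callable, Dict
from urllib.parse import urlencode
from urllib.request import urlopen


def http_get(url: str) -> str:
    """Fetch a document with an HTTP GET request (default fetcher)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


class VoiceGatewaySketch:
    """Conceptual model of the DNIS lookup and document retrieval steps."""

    def __init__(self, dnis_map: Dict[str, str],
                 fetch: Callable[[str], str] = http_get) -> None:
        self.dnis_map = dnis_map          # dialed number -> document URL
        self.fetch = fetch                # how documents are retrieved
        self.cache: Dict[str, str] = {}   # simplistic cache, no expiry

    def document_for_call(self, dialed_number: str) -> str:
        """Return the VoiceXML document mapped to the dialed number."""
        url = self.dnis_map.get(dialed_number)
        if url is None:
            raise LookupError(f"no application mapped to {dialed_number}")
        if url not in self.cache:         # fetch only on a cache miss
            self.cache[url] = self.fetch(url)
        return self.cache[url]


def submit_url(base: str, fields: Dict[str, str]) -> str:
    """Build the URL for submitting collected caller input with HTTP GET."""
    return f"{base}?{urlencode(fields)}"


if __name__ == "__main__":
    # A stand-in fetcher keeps the demo independent of any real server.
    gateway = VoiceGatewaySketch(
        {"5551000": "http://example.com/vxml/main.vxml"},
        fetch=lambda url: '<vxml version="2.0"/>',
    )
    print(gateway.document_for_call("5551000"))
    print(submit_url("http://example.com/pay", {"card": "1234"}))
```

Passing the fetcher as a parameter makes it easy to substitute another retrieval method, which mirrors the way the gateway can obtain documents from a Web server, a cache, or a file server.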
The VoiceXML feature in Cisco IOS Software is cost-effective, secure, and scalable, enabling a high-performance voice-application-hosting environment. Because the VoiceXML interpreter is integrated directly into Cisco voice gateways, service providers can deliver productivity-enhancing services without investing in costly additional telephony servers. In addition, Cisco voice gateways terminate media streams, reducing WAN usage and lowering bandwidth consumption by eliminating an additional call leg from the gateway to a telephony server. This efficiency also results in faster response time, which is critical for all voice applications.
Cisco IOS Software supports the following VoiceXML features (Table 3):
Table 3: VoiceXML Features in Cisco IOS Software
- Document server interface
- Voice signaling protocols
- MGCP scripting support
- Call origination and call control
VoiceXML is an international standard defined by the W3C that enables the extension of multimedia Web applications to telephones. VoiceXML builds on existing Internet technologies such as HTTP, XML, and Multipurpose Internet Mail Extensions (MIME), reusing the principles of these protocols and languages for voice applications so that existing skills in Web server application development can be easily applied to create speech-enabled telephony applications.
Using VoiceXML is as easy as building a Web page, and the technology is enabling productivity-enhancing service features such as speech recognition and text-to-speech output. With support for VoiceXML integrated into Cisco IOS Software, Cisco is enabling the advance of new communications services and increasing the value of your investment in Cisco voice gateways and networks.
For more information about packet voice technology, Cisco IOS VoiceXML and Cisco voice gateways, visit:
For more detailed information about VoiceXML, see the World Wide Web Consortium's Voice Browser Working Group at:
For information about Nuance and Speechworks, visit: