SPEECHSC                                                     S. Maes
Internet Draft                                                   IBM
Document: draft-maes-speechsc-web-services-00            A. Sakrajda
Category: Informational                                          IBM
Expires: December, 2002                                June 23, 2002


         Speech Engine Remote Control Protocols by treating
       Speech Engines and Audio Sub-systems as Web Services


Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026 [1].

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Discussion of this and related documents is on the MRCP list. To subscribe, send the message "subscribe mrcp" to majordomo@snowshore.com. The public archive is at http://flyingfox.snowshore.com/mrcp_archive/maillist.html. NOTE: This mailing list will be superseded by an official working group mailing list, cats@ietf.org, once the WG is formally chartered.

1. Abstract

This document proposes the use of the web service framework based on XML protocols to implement speech engine remote control protocols (SERCP). This document is informational. It illustrates how web services could be used; it is not a detailed specification. Such a specification is expected to be the output of the SPEECHSC activity, if it is decided to go in this direction. The document also enumerates the requirements that have led to selecting a web service framework.

Speech engines (speech recognition, speaker recognition, speech synthesis, recorders and playback, NL parsers, and any other speech processing engines such as speech detection or barge-in detection) as well as audio sub-systems (audio input and output sub-systems) can be considered as web services that can be described and asynchronously programmed via WSDL (on top of SOAP), combined in a flow described via WSFL, discovered via UDDI, and asynchronously controlled via SOAP, which also enables asynchronous exchanges between the engines. This solution has the advantage of providing flexibility, scalability and extensibility while reusing an existing framework that fits the evolution of the web: web services and XML protocols [15].

This document proposes using web services as a framework for SERCP. The proposed framework enables speech applications to control remote speech engines using the standardized mechanism of web services. The control messages may be tuned to the controlled speech engines.

2. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [2].
3. Introduction

This document uses the terminology SERCP (Speech Engine Remote Control Protocols) to be consistent with the terminology used in other documents exchanged at ETSI, 3GPP and OMA, while distinguishing it from the detailed specification proposed by MRCP. SERCP addresses the same set of high-level "SPEECHSC" objectives: the capability to distribute the automatic processing of speech away from the audio sub-system and the associated controlling speech application.

The need for SERCP has been identified in different forums. Originally, the need for SERCP was formulated in the context of the multimodal architecture proposal at ETSI Aurora STQ [3] and followed by explicit SERCP requirements in the context of Distributed Speech Recognition (DSR) [4]. This was followed by two concrete proposals that suggested relying on web services [5,6]. Later, IETF initiated the SPEECHSC BOF activity [7] around the MRCP proposals:
- draft-shanmugham-mrcp-01.txt
- draft-robinson-mrcp-sip-00.txt
which provided additional justifications and requirements for a SERCP framework. A preliminary requirements document [16] and use cases [17] have also been published.

In general, SERCP will support two classes of usage scenarios where speech processing is distributed away from the audio sub-systems and the speech engines are controlled:
- By the source of the audio. A typical scenario is a voice-enabled application running on a wireless terminal but using server-side speech recognition. In [3] and [13], this is exemplified by a fat client MVC multi-modal browser configuration with use of remote engines.
- By a third party controller (i.e. application). A typical scenario is a server-side application (e.g. a VoiceXML browser) that relies on speech recognition performed elsewhere in the network. Numerous voice portal or IVR (Interactive Voice Response) systems rely on such concepts of distribution of the speech processing resources.
This is consistent with the framework described in [17].

4. Design Requirements

At a high level, a distributed speech recognition framework should aim at enabling the application developer or service provider to seamlessly use remote engines:
- The location of the engine SHOULD NOT be important: the system behaves as if the engine was local to the application runtime.
- The performance of the speech engines SHOULD NOT be affected by the distribution of the engines and the presence of the network.
- The functionality achievable by the speech engines MUST be at least equivalent to what can be achieved with local engines.

The rest of this section summarizes and expands on the requirements identified so far that drive the proposal to rely on web services.

4.1 General considerations

There are numerous challenges to the specification of an appropriate SERCP framework. In addition to the MRCP internet drafts, numerous proprietary or standardized fixed engine APIs have been proposed (e.g. SRAPI, SVAPI, SAPI, JSAPI, etc.). None have been significantly adopted so far. Besides making strong assumptions about the underlying platform, such APIs typically provide overly constrained functions: only very limited common denominator engine operations are defined. In particular, it is often difficult to manipulate results and intermediate results (usually exchanged in proprietary formats).
On the other hand, it would not have been practical to add more capabilities to these APIs. Therefore, we propose that:
- SERCP SHOULD NOT be designed as a fixed speech engine API, but
- SERCP MUST be designed as a rich, flexible and extensible framework that allows the use of numerous engines with numerous levels of capabilities.

4.2 Speech engine interoperability: replaceable engines or common protocols?

The considerations made above raise fundamental issues in terms of standardization and interoperability. What is the objective of SPEECHSC?
- (target-1): to enable the replacement of a speech engine provided by one speech vendor by an engine provided by another, and still be able to immediately run the same speech application without any other change, or
- (target-2): to enable speech applications to control remote speech engines using a standardized mechanism, but with messages tuned to the controlled speech engines?

(target-1) is very difficult to achieve. Today, speech engines are adapted to particular tasks. Speech data files (acoustic models, engine configurations and settings, front-end features, internal algorithms, grammars, etc.) differ significantly from vendor to vendor. Even for a single vendor, the deployment of well-performing conversational applications requires numerous engine settings and data file tunings from task to task. In addition, conversational applications and engines still constitute an emerging field, where numerous changes of behavior, interfaces and capabilities must be supported to enable rapid introduction of new conversational capabilities (e.g. support of free flow dialogs, NL parsing, etc.). Ultimately, in the usage scenarios [17] where SERCP would be used by a terminal to drive remote engines or by a voice portal to perform efficient and scalable load balancing, the application / controller knows exactly the engine that it needs to control. The value of SPEECHSC is to rely on a standardized way to implement this remote control.

It may be possible to define a framework where the same application can directly drive engines from different vendors. We prefer to consider this as a particular case of the (target-2) framework rather than as (target-1), which would introduce unnecessary usage limitations on the output of the SPEECHSC activity. Wireless deployments like 3G will require end-to-end specification of such a standard framework. At this stage, it is more valuable to start with an extensible framework (target-2) and, when appropriate, provide a framework that addresses (target-1). Therefore, SERCP is designed to focus on (target-2), while providing mechanisms to achieve (target-1) when it makes sense.

This translates into the following key design requirements for SERCP:
- SERCP MUST provide a standard framework for an application to remotely control speech engines and audio sub-systems. The associated SERCP messages MAY be tuned to the particular speech engine.
- SERCP MUST NOT aim at supporting application interoperability across different speech engines with no changes of the SERCP messages.
- SERCP SHOULD aim at distinguishing and defining messages that are invariant across engine changes from messages that are engine specific.
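To illustrate this last distinction, consider the following purely hypothetical sketch: the namespace URIs, the recognize and grammar elements and the vendor-specific beamwidth parameter are illustrative assumptions made for this document, not part of any proposal. A (target-2) control message could carry invariant settings next to an engine-specific extension:

   <sercp:recognize xmlns:sercp="urn:example:sercp"
                    xmlns:acme="urn:example:acme-asr">
     <!-- invariant across engine changes: which grammar to use -->
     <sercp:grammar src="http://example.com/grammars/date.grxml"/>
     <!-- engine-specific tuning carried in a vendor-qualified
          extension; replacing the engine only affects this part -->
     <acme:beamwidth>250</acme:beamwidth>
   </sercp:recognize>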
As a result, adding support for speech engines from another vendor MAY require changes of the SERCP messages and therefore changes of the application or dialog manager to support these new messages. In the web service framework proposed below, this amounts to changing the WSDL (XML) instructions exchanged with the engines. However, it does not imply any changes other than adaptation of the XML files exchanged with the engines (and possibly new speech engine data files).

4.3 Requirements identified in the context of SPEECHSC

In [16], the following requirements have been proposed:
- SERCP SHOULD reuse existing protocols.
- SERCP MUST maintain integrity of existing protocols.
- SERCP SHOULD avoid duplication of existing protocols.
- SERCP SHOULD satisfy the TTS requirements as described in draft-burger-mrcp-reqts-00.txt, section 6 (expressed according to the terminology defined in RFC-2119 [2]).
- SERCP SHOULD satisfy the ASR requirements as described in draft-burger-mrcp-reqts-00.txt, section 7 (expressed according to the terminology defined in RFC-2119 [2]).

[7] provides additional considerations in terms of security, dual-mode usage (speech recognition and synthesis provided by the same system), etc. Following [17], we assume from the outset that SERCP will drive engines that act on uplink (audio sub-system to engine) and downlink (engine to audio sub-system) speech.

4.4 Requirements identified in the context of ETSI Aurora DSR

In the context of the ETSI Aurora distributed speech recognition framework, the following requirements have been considered. These have also driven the design of SERCP. Note that the DSR framework is not limited to the use of DSR optimized codecs; it can be used in general to distribute speech recognition functions over packet switched networks with any encoding scheme.
- SERCP MUST control the different speech engines involved to carry a dialog with the user. As such:
   - SERCP SHOULD NOT distinguish between controlling a single engine or several engines responsible for processing speech input and generating speech or audio output.
   - SERCP SHOULD NOT be limited to ASR or TTS engines.
   - SERCP SHOULD enable control of the audio sub-systems and additional processors (e.g. control of settings of codecs, acoustic front-end, handling of voice activity detection, barge-in, noise subtraction, etc.).
   - Audio sub-systems and speech processors MAY be considered as "engines" that may be controlled by the application using SERCP messages.
- SERCP MUST support control of speech engines and audio sub-systems by:
   - An application located on the component where the audio-system functions are located (e.g. a wireless terminal).
   - An application located elsewhere on the network (i.e. not collocated with the speech engines or the audio input or output sub-systems).
- SERCP SHOULD NOT specify call-control and session control (re-direction, etc.) and other platform/network specific functions based on dialog, load balancing or resource considerations.
   - However, SERCP MUST support the request to expect or establish streaming sessions between target addresses of speech engines and audio sub-systems.
   - Session establishment and control MUST rely on existing protocols.
- SERCP MUST NOT address the transport of audio.
- SERCP MAY address the exchange of result messages between speech engines.
- SERCP MUST support the combination (serial or parallel) of different engines that will process the incoming audio stream or post-process recognition results. For example, it should be possible to specify an ASR system able to provide an N-best list followed by another engine able to complete the recognition via detailed match, or to pass raw recognition results to an NL parser that will tag them before passing the results to the application dialog manager. More details are provided in [17].
- The framework SHOULD enable engines to advertise their capabilities, their state or the state of their local system. This is especially important when the framework is used for resource management purposes.
- SERCP SHOULD NOT constrain the format, commands or interface that an engine can or should support.
- SERCP MUST be vendor neutral:
   - SERCP MUST support any engine technology and capability.
   - SERCP MUST provide efficient extensibility mechanisms to support any type of engine functionality: existing and future.
   - SERCP MUST support vendor specific commands, results and engine combination through a well specified extensible framework.
- SERCP MUST be asynchronous.
- SERCP MUST be able to stop, suspend, resume and reset the engines.
- SERCP MUST NOT be subject to race conditions. This requirement is extremely important. It is often difficult, from a specification or a deployment point of view, to efficiently handle the race conditions that may occur when hand-holding the engine to load appropriate speech data files (e.g. grammars, language models, acoustic models, etc.) and to report / handle error conditions while simultaneously racing against the incoming audio stream.

It should be noted that if the requirements described above are satisfied, it would be possible to support the use cases identified in [17].

4.5 Additional design considerations

Finally, the following requirements have also driven the design:
- Scalability and robustness of the solution.
- Simplicity of deployment.
- Transmission across firewalls, gateways and wireless networks.
   - This implies that the end-to-end specification of SERCP and the protocols that it may use for transport MUST be supported by the target deployment infrastructure. This is especially important for 3G deployments.
- Need to support the exchange of additional meta-information useful to the application or the speech engines (e.g. speech activity (speech-no-speech), barge-in messages, end of utterance, possible DTMF exchanges, front-end settings and noise compensation parameters, client messages -- settings of the audio sub-system, client events, externally acquired parameters --, annotations (e.g. partial results), application specific messages).

5. Speech engines and audio sub-systems as web services

We propose the framework of web services as an efficient, extensible and scalable way to implement SERCP that satisfies the different requirements enumerated in section 4 and supports the use cases identified in [17]. According to the proposed framework, speech engines (audio sub-systems, engines, speech processors) are defined as web services that are characterized by an interface that consists of some of the following ports:
- "control in" port(s): It sets the engine context, i.e. all the settings required for a speech engine to run.
It may include addresses where to get or send the streamed audio or results.
- "control out" port(s): It produces the non-audio engine output (i.e. results and events). It may also involve some session control exchanges.
- "audio in" port(s): It receives streamed input data.
- "audio out" port(s): It produces streamed output data.

Audio sub-systems can also be treated as web services that can produce streamed data or play incoming streamed data as specified by the control parameters. The "control in" or "control out" messages can be out-of-band, or sent or received interleaved with "audio in or out" data. This can be determined in the context (setup) of the web services.

Speech engines and audio sub-systems are pre-programmed as web services and composed into more advanced services. Once programmed by the application / controller, audio sub-systems and engines await an incoming event (established audio session, etc.) to execute the speech processing that they have been programmed to do and to send the results as programmed. Speech engines as web services are typically programmed to handle a particular speech processing task completely, including the handling of possible errors. For example, a speech engine is programmed to perform recognition of the next incoming utterance with a particular grammar, to send the result to an NL parser, and to contact a particular error recovery process if particular errors occur.

5.1 Examples of SERCP web services

The following list of services and control types is not exhaustive. It is provided purely as an illustration. These examples assume that all control messages are sent as "control in" and "control out". As explained above, the framework could also support such exchanges implemented by interleaving with the streamed audio, etc.

The following are examples of SERCP web services:
- Audio input sub-system - uplink signal processing:
   - control in: silence detection / barge-in configuration, codec context (i.e. setup parameters), asynchronous stop
   - control out: indication of begin and end of speech, barge-in, client events, ...
   - audio in: bound to platform
   - audio out: encoded audio to be streamed to remote speech engines
- Audio output sub-system - downlink signal processing:
   - control in: codec / play context, barge-in configuration, play, ...
   - control out: done playing, barge-in events
   - audio in: from speech engines (e.g. TTS)
   - audio out: to platform
- Speech recognizer (ASR):
   - control in: recognition context, asynchronous stop
   - control out: recognition result, barge-in events
   - audio in: from input sub-system source
   - audio out: none
- Speech synthesizer (TTS) or pre-recorded prompt player:
   - control in: annotated text to synthesize, asynchronous stop
   - control out: status (what has been synthesized so far)
   - audio in: none
   - audio out: audio streamed to the audio output sub-system (or other processor)
- Speaker recognizer (identifier/verifier):
   - control in: claimed user id (for verification) and context
   - control out: identification/verification result, enrollment data
   - audio in: from audio input sub-system
   - audio out: none
- DTMF transceiver.
Note that this example illustrates how web services can also handle DTMF in a consistent manner.
   - control in: how to process (DTMF grammar), expected output format, ...
   - control out: appropriately encoded DTMF key or string (e.g. RFC2833)
   - audio in: bound to platform events (possibly programmed by control in)
   - audio out: none
- Natural language parser:
   - control in: combined recognition and DTMF detector results
   - control out: natural language results
   - audio in: none
   - audio out: none

Variations and additional examples of speech engines as web services can be considered. Pre- and post-processing can also be considered as other web services.

5.2 Advantages of a web service framework for SERCP

The use of web services enables pre-allocating and pre-programming the speech engines. This way, the web service framework automatically handles the race condition issues that may otherwise occur, especially between the streamed audio and the setting up of the engines. This is especially critical when engines are remotely controlled across wireless networks, where the control and stream transport layers may be treated in significantly different manners. This approach also allows decoupling the handling of streamed audio from configuration, control and application level exchanges. This simplifies deployment and increases scalability. By using the same framework as web services, it is possible to rely on the numerous tools and services that have been developed to support authoring, deployment, debugging and management (load balancing, routing, etc.) of web services.

6. Controlling Speech Engines and Audio Sub-Systems

With such a web service view, the specification of SERCP can directly re-use protocols like SOAP [8], WSDL [9], WSFL [10] and UDDI [11]. Contexts can be queried via WSDL [9] or advertised via UDDI [11]. Detailed specifications will be provided as this document evolves.

6.1 WSDL

Using WSDL [9], it is possible to asynchronously program each speech engine and audio sub-system. To illustrate the proposal, let us consider the case where speech engines are allocated via an external routing / load balancing mechanism. A particular engine can be allocated to a particular terminal, telephony port and task on an utterance or session basis. Upon allocation, the application sets the context via WSDL. This includes the addresses of the source or target control and audio ports.

As an example, consider a speech recognition engine allocated to a particular application and telephony port. WSDL instructions program the web service to recognize any incoming audio stream from that telephony port address with a particular grammar, what to do in case of error (what event to throw where), how to notify of barge-in detection, and what to do upon completion of the recognition (where to send results and end-of-recognition events). Similarly, the telephony port is programmed via WSDL to stream incoming audio to the audio port of the allocated ASR web service. When the user speaks, audio is streamed by the port to the ASR engine, which performs the pre-programmed recognition task and sends recognition results to the pre-programmed port, for example that of the application (e.g. a VoiceXML browser [12]).
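A minimal sketch of what the WSDL description of such an ASR web service could look like is given below. All names, namespace URIs and message parts are assumptions made for illustration only; the actual interface would be the output of the SPEECHSC activity.

   <definitions name="AsrEngine"
       targetNamespace="urn:example:sercp:asr"
       xmlns="http://schemas.xmlsoap.org/wsdl/"
       xmlns:xsd="http://www.w3.org/2001/XMLSchema"
       xmlns:tns="urn:example:sercp:asr">

     <message name="SetContextRequest">
       <!-- grammar to load, source audio address, result target -->
       <part name="context" type="xsd:anyType"/>
     </message>
     <message name="RecognitionResult">
       <part name="result" type="xsd:anyType"/>
     </message>

     <!-- "control in" port: the controller programs the engine
          asynchronously (one-way operation) -->
     <portType name="ControlIn">
       <operation name="setContext">
         <input message="tns:SetContextRequest"/>
       </operation>
     </portType>

     <!-- "control out" port: the engine pushes results and events
          to the pre-programmed target (notification operation) -->
     <portType name="ControlOut">
       <operation name="reportResult">
         <output message="tns:RecognitionResult"/>
       </operation>
     </portType>
   </definitions>

The streamed audio itself remains outside this description, in line with the requirement of section 4.4 that SERCP not address the transport of audio.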
The VoiceXML browser generates a particular prompt and programs its allocated TTS engine to start generating audio and streaming it to the telephony port. The cycle can continue.

6.2 WSFL

WSFL [10] provides a generic framework for combining web services through flow composition. We recommend using WSFL to define the flow of the speech engines as web services and to configure the overall system. Accordingly, the sources and targets of web services and the overall flow can be specified with WSFL. The use of web services in general, and WSFL in particular, greatly simplifies the remote configuration and control of chained engines that process the result of a previous engine, or of engines that process the same audio stream.

6.3 UDDI

UDDI [11] is a possible way to enable discovery of speech engines. Other web services approaches can be considered. Speech engines advertise their capability (context) and availability. Applications or resource allocation servers interrogate the UDDI repository to discover available engines that can be allocated for the next utterance or session.

6.4 SOAP

SERCP transports WSDL and WSFL on top of SOAP [8]. SOAP is also particularly attractive because events and other messages between controllers and web services, as well as among speech engine / audio sub-system web services, can also be transported via SOAP. Exchanges of results and events (including stop, resume, reset, etc.) among speech engine and audio sub-system web services, and between web services and the controller or application, can be done via SOAP. In the future, more advanced coordination mechanisms can be used, for example following frameworks such as the one proposed in [14].

SOAP presents the following advantages:
- SOAP is a distributed protocol that is independent of the platform or language.
- SOAP is a lightweight protocol, requiring a minimal amount of overhead.
- SOAP runs over HTTP. This allows access through firewalls.
- SOAP can run over multiple transport protocols such as HTTP, SMTP, and FTP. This should simplify its transport through wireless networks and gateways.
- SOAP is based on XML, which is a highly recognized language within the Web community.
- SOAP/XML is gaining increasing popularity in B2B transactions and other non-telephony applications.
- SOAP/XML is appealing to the Web and IT development community because it is a current technology that they are familiar with.
- SOAP can carry XML documents.

7. Syntax

7.1 Introduction

The SERCP syntax and semantics should be extensible to satisfy (target-2). For these reasons, we propose an XML-based syntax with clear extensibility guidelines. The web service framework is inherently extensible and enables the introduction of additional parameters and capabilities.

The SERCP syntax and semantics are designed to support the widest possible interoperability between engines by relying on messages that are invariant across engine changes, as discussed in section 4.2. This should minimize the need for extensions in as many situations as possible. Existing speech APIs, [5], and the MRCP syntax have been considered as starting points.
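As a sketch of the transport framing only (the sercp namespace URI is an assumption; the body content is outlined in the following sub-sections), a SERCP request can travel as an ordinary SOAP 1.1 envelope, for example carried in an HTTP POST:

   <soap:Envelope
       xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
     <soap:Body>
       <!-- one or more sercp request elements (prompt, listen,
            stop), as sketched in sections 7.2 to 7.4 -->
     </soap:Body>
   </soap:Envelope>

The same envelope structure can carry control-out messages (results, events, stop/resume/reset) from the engines back to the controller or to other engines.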
Speech engines as web services are considered to come with internal contexts that typically consist of the context beyond the scope of the invariant-based SERCP syntax and semantics. As much as possible, the semantics and syntax rely on the W3C Voice Activity specifications [12] to describe the different speech data files required by the engines.

The application software requests from a broker a reference to a SERCP channel and, after obtaining one, all interaction between SERCP and the user consists of XML requests posted to the SERCP server followed by result responses. The interface to the broker is used by the user only once per lifetime of the user process, to bind to a SERCP channel. All SERCP channels are created equal; they become qualified as of a certain type only when the user attaches itself to them (i.e. they assume the application name). The connection to the broker is used by the SERCP channel to acquire speech services as needed -- this is hidden from the user.

The following sub-sections present a sketch of possible content of the Body element of the SOAP messages. It is intended for illustrative purposes, not as an actual specification.

7.2 sercp Namespace

All messages are defined in the sercp namespace. The content of the SOAP message body consists of a set of tags defining the actions to be performed in the control and/or audio channels. The sequencing and grouping by functionality differ for the different categories:
- audio out -- the tag prompt is used with speak, text or audio elements; each one can be repeated more than once in one message but they cannot be mixed in the same request, i.e. one request consists of only one type; speak implies use of the TTS server and play from socket in the audio sink server; audio without speak implies play of pre-recorded audio and tones from a URI resolved in the audio server
- audio in -- the tag listen activates recording in the audio server and defines the set of speech services and the context of execution for them; it also contains a prescription for digit collection and treatment
- audio in and out -- the listen and prompt tags can be submitted in one request; this implies a single connection to the audio server and sequencing of play/record/collect dtmf based on attributes (e.g. barge in)
- stop -- an asynchronous stop can be issued at any time and propagates to all components involved; some requests are atomic (e.g. playing dtmf) and the asynchronous stop is then just tolerated but has no real impact

The following assumptions are made in the remainder of this section:
- audio server (aud-s) -- there is a reachable point where audio stream(s) are present and available for processing, constant for the duration of the session (i.e. the location can change by means external to SERCP); an example would be the first host receiving/sending audio from/to a telephony channel
- speech server (asr-s, tts-s, siv-s) -- all represent reachable points capable of establishing a connection to aud-s, allocated and aggregated by means external to SERCP
- the XML requests and responses can be passed with MIME-like attachments

7.3 prompt--Audio Out Element

The prompt element may carry a number of attributes (the tone, dtmf and mf attributes are described below) and may take one of the following forms: a speak element (synthesized speech), a text element (OEM-annotated text), or one or more audio elements (pre-recorded audio). The prompt element defines the content and method of obtaining and handling the audio to be generated in the outbound channel.
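As an illustration only (the namespace URI and the exact nesting are assumptions; the element names are those used in this section), a prompt request for synthesized speech, and an alternative one for pre-recorded audio, could be conveyed in the SOAP body as:

   <prompt xmlns="urn:example:sercp">
     <speak>
       <!-- SSML content rendered by the TTS server -->
       Welcome back. You have three new messages.
     </speak>
   </prompt>

   <prompt xmlns="urn:example:sercp">
     <!-- pre-recorded audio resolved and streamed by the audio
          server; a cid: src passes the segment as an attachment -->
     <audio src="http://example.com/prompts/welcome.wav"/>
     <audio src="cid:new-message-beep"/>
   </prompt>

As stated in section 7.2, speak, text and audio elements may be repeated within one request but are not mixed in the same request.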
The content following speak MUST contain an SSML message (which implies a mix of synthesized speech and pre-recorded audio streamed by the TTS server). The content following text contains text annotated using OEM specific notations. The omission of the speak or text tag implies that it is just pre-recorded audio retrieved and streamed by the audio server. The src attribute of the audio element may use the "cid:" scheme, in which case the audio segment is passed by value as an attachment.

Sending DTMF digits or call progress tones is a special case of audio generation dealt with by the audio server. For prompt, the tone, dtmf and mf attributes are defined with the following syntax:

   tone="(dialtone|ringback|busy|reorder|f1[-f2])"

where f1[-f2] is used to specify a single or dual frequency, and may be accompanied by additional attributes:

   duration="NNms"   // total duration in ms
   timeon="NNms"     // time on for pulsed tones in ms
   timeoff="NNms"    // time off for pulsed tones -- the on, off
                     // pattern is fitted into duration
   level="-NNdBm"    // level at which the tone should be played

The digits can be sent using the attributes

   dtmf="555*#"   or   mf="01KS"   // K=KP, S=ST

The nicknamed tones have timeon/timeoff/duration/level pre-defined; any of these tones can be redefined by supplying a full tone specification with frequencies and all attributes.

7.4 listen--Audio In Element

The listen element defines the context and method of handling audio in the inbound channel. It implies activating "recording", i.e. reading from the audio stream. Handling of the digits embedded in the input audio is defined through an element digits with a set of attributes. The speech services to be activated and applied to the input stream need a context -- this is supplied within blocks tagged by service type: asr, siv, ... (extensible -- new services can be added).

The listen element uses attributes to provide a detailed prescription for the control of recording and signal processing. The meaning of the attributes is as follows:
- bargein--none disables barge in, speech turns on energy based barge in, asr turns on recognition based barge in (i.e. only a successful recognition stops the prompt); default: none
- beep--0 disables the beep indicating the beginning of recording, 1 turns it on; if barge in is enabled, the setting is ignored; default: 0
- echocancel--0 disables echo cancellation, 1 turns it on; it is applied to firmware in the telephony driver and the control is provided primarily for testing / data collection; if the underlying hardware is not capable of echo cancellation, the setting is silently ignored; default: 1
- endprompt--the "pacifier" prompt to be played immediately after the end of recording; it will be stopped on receipt of a subsequent request; it may be used to cover the time used e.g. for backend queries; default: "".
- endsilence--end silence duration; silence of this duration after speech has been seen triggers the end of recording; -1 or 0 is reserved for infinite; default: infinite
- initialsilence--initial silence duration; expiration of this timeout stops recording and terminates the request with the final result "silence"; -1 or 0 is reserved for infinite; default: infinite
- maxspeech--guard timer; expiration of this timer stops recording; -1 or 0 is reserved for infinite; default: infinite
- minspeech--minimum amount of speech triggering speech detection; 0 means silence detection is disabled and the "start record" request submitted to the audio server starts streaming; default: 0
- retrieve--requests that the recorded audio in the specified MIME type be returned in the response by value; if omitted, audio is not retrieved
- save--requests that the recorded audio be saved on the audio server; the response will contain the URI of the saved audio; the value of the attribute is a MIME type, including x-ep... subtypes marking endpointed PCM; if omitted, audio is not saved
- source--URI of the audio source, the location from which the recipient of the message retrieves audio chunks; it is mandatory
- steponbeep--on/off flag requesting that detection of speech in the very first samples of recording (20-50 ms) triggers the end of recording, reporting the "step-on-beep" result; if barge in is on, the setting is ignored and the value 0 is used; default: 0
- stopondtmf--requests that the recording stops when a dtmf tone is detected; default: 1

7.4.1 asr--Speech Recognition listen

The asr tag content implies use of the ASR service and defines the context of execution for recognition. The context tag attaches a name to a collection of grammars and/or vocabularies. Its attributes are:
- nbest--maximum number of results to retrieve
- completetimeout--amount of silence triggering the ASR response when the ASR engine has a complete "in-grammar" result
- incompletetimeout--amount of silence triggering the ASR response when the ASR engine has a partial result

A vocabulary is just a flat list of entries with optional pronunciations and sounds-likes; it can be embedded in a grammar or it can be standalone. Any of the entries (context, grammar, vocabulary) can be passed by reference by supplying a URI. The scheme "cid:" is used to specify attachments.

7.4.2 digits--Digits In

The digits element uses attributes to provide a detailed prescription for digit collection. The meaning of the attributes is as follows:
- length--number of digits to collect (if omitted, 1 is assumed)
- firstdigit--how long to wait for the first digit (ms); the timer starts on completion of play (if omitted, a configuration defined value is assumed)
- nextdigit--defines the interdigit interval (if omitted, a configuration defined value is assumed)
- termdigit--optional string constructed from 0123456789*#ABCDabcd; any digit in the string detected terminates digit collection; the default value is ""

7.4.3 prompt Response Tags

The response to a prompt request reports a final result of done|hangup|error.

7.4.4 listen Response Tags

The response to a listen request reports a final result of done|dtmf|hangup|error, together with the recognition results (e.g. the recognized text and its spelling).
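As an illustration only (the namespace URI, the nesting and the exact value formats are assumptions; attribute and element names are those defined in sections 7.4 to 7.4.2, and timer values are assumed to be in milliseconds), a listen request activating recognition with a referenced grammar and digit collection could look like:

   <listen xmlns="urn:example:sercp"
           source="http://aud-s.example.com/session42/audio"
           bargein="speech"
           initialsilence="5000"
           endsilence="800"
           maxspeech="20000"
           stopondtmf="1"
           save="audio/x-wav">
     <asr>
       <context name="main-menu" nbest="3" completetimeout="500">
         <grammar src="http://example.com/grammars/menu.grxml"/>
       </context>
     </asr>
     <digits length="4" termdigit="#"/>
   </listen>

The corresponding response would report a final result of done, dtmf, hangup or error as described in section 7.4.4, together with the recognition output and, because save was specified, the URI of the recorded audio on the audio server.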
8. Usage examples

The elements prompt and listen can be used separately over two distinct channels -- in this case synchronization of the channels needs to be managed by the user (e.g. stopping play on speech detected when barge in is enabled). It can be assisted by bits of control information passed in the audio channel (e.g. the audio server should be capable of stopping the TTS server through the audio channel). The server responses carry the results of the requested operations.

8.1 Prompt

This sub-section illustrates the use of the prompt element.

8.1.1 Play

Playing pre-recorded audio is possible by delivering the message directly to the audio server; there is no need to involve TTS. The audio server can determine when to stop (e.g. on digit or speech). Playing synthesized audio is possible by delivering a message that contains a speak element.

8.2 Listen

The listen tag implies recording. The speech servers, if any, receive the URI of the audio server defining the scheme and location used to retrieve audio.

8.2.1 recording

The listen element can be used to specify just recording of the audio. The recording can be returned as an attachment to the response.

8.2.2 asr

The recognition request can contain an explicit context (e.g. an inline grammar with entries such as "stop", "start", "go back") or just a reference to one. The "cid:" scheme can be used to pass a context by value as an attachment.

8.2.3 asr and siv

Two speech servers can be attached to a single audio source, e.g. a recognizer and a speaker identifier. The mechanism of delivering the end-pointed audio to both servers is up to the audio server.

8.3 Prompt and Listen

The presence of both tags (prompt and listen) in a message implies the dispatch of one turn consisting of play/record and an optional end play. The actions are either parallel or sequential -- it depends on the barge in setting -- but from the user perspective it is a single request-response sequence.

8.3.1 Play and Collect

Typical play-and-collect turns combine a prompt with the digits element:
- Asking for a choice: prompt "please press 1 for yes, 0 for no".
- Variable length input: prompt "please enter pin number followed by a pound".
- Asking for a selection: prompt "please press any digit when you hear your choice; one, two, ...". The response contains the number of samples played which, together with marker offsets, allows the application to determine the choice made.

8.3.2 Play and Recognize

Typical play-and-recognize turns combine a prompt with the asr block:
- Asking for a choice: prompt "please say yes or no".
- Variable length input: prompt "please say pin number".
- Asking for a selection: prompt "please say stop when you hear your choice; one, two, ...".

The dual mode input -- speech or dtmf -- can be handled with a prompt such as "please say or enter from the keypad pin number". A digit pressed before speech is detected stops the recording and the prompt (if the asr server is allocated on speech detection, it is never used); the dtmf timers then take over control of the digit collection.

The asr barge in can be used to ignore out-of-grammar speech, which is likely during playback of a long text (e.g. synthesis of a long e-mail). Detected speech does not stop the prompt; a successful recognition of a command (e.g. "stop") is needed to trigger the stop. The recording stops at the end of the prompt.

9. Security Considerations

SERCP may raise several security issues that are to be considered as OPEN ISSUES:
- Engine remote control may come from non-authorized sources that may request unauthorized processing (e.g. extraction of voice prints, modification of the played back text, recording of a dialog, corruption of the request, re-routing of recognition results, corrupting recognition), with significant security, privacy or IP / copyright issues. The SPEECHSC activity SHOULD address these issues. Web services are confronted with the same issues, and the same approaches (encryption, request authentication, content integrity checks, secure architectures, etc.) can be used with SERCP.
- Engine remote control may enable a third party to request speech data files (e.g. a grammar or vocabulary) that are considered proprietary (e.g. a hand-crafted complex grammar) or that contain private information (e.g. the list of names of the customers of a bank). The SPEECHSC activity SHOULD address how to maintain control over the distribution of the speech data files needed by web services, and therefore not only the authentication of SERCP exchanges but also of the target speech engine web services.

The exchange of encoded audio streams may also raise important security issues. However, these are not different from those of conventional voice and VoIP exchanges. This SHOULD be considered as beyond the scope of the SPEECHSC activity.

10. References

[1] Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9, RFC 2026, October 1996.
[2] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[3] Maes, S. H., Muthusamy, Y. and Wajda, W., "Multi-modal Browser Architecture Recommendation", "Clayman proposal" to the ETSI DSR Application and Protocol Working Group, ETSI, January 16, 2000.
[4] Maes, S. H. and Reng, R., "Requirements and Recommendations for Conversational Distributed Protocols and Conversational Engine Remote Control; Version 0.5", AU/310/01, May 3, 2001.
[5] Maes, S. H. and Sakrajda, A., "Conversational Engine Remote Control Protocols", Proposal to the ETSI DSR STQ Application and Protocol Working Group, June 26, 2001.
[6] Coles, A., "Use of SIP and SOAP as Basis for a Speech Engine Control Protocol", ETSI STQ Aurora DSR Applications and Protocols working group, June 28, 2001.
[7] Burger, E. and Oran, D., "Control of ASR and TTS Servers BOF (cats)", http://www.ietf.org/ietf/02mar/cats.txt
[8] Simple Object Access Protocol (SOAP), http://www.w3c.org/2002/ws/
[9] Web Services Description Language (WSDL 1.1), W3C Note 15 March 2001, http://www.w3.org/TR/wsdl
[10] Leymann, F., "Web Services Flow Language, WSFL 1.0", May 2001, http://www-4.ibm.com/software/solutions/webservices/pdf/WSFL.pdf
[11] UDDI, http://www.uddi.org/specification.html
[12] W3C Voice Activity, http://www.w3c.org/Voice/
[13] Maes, S., "Multi-modal and Multi-device Interaction", Input document to 3GPP T2 and W3C MMI, http://www.w3.org/2002/mmi/2002/MM-Arch-Maes-20010820.pdf
[14] WSXL - Web Service eXperience Language, submitted to OASIS WSIA and WSRP.
[15] W3C Web Services, http://www.w3c.org/2002/ws/
[16] Burger, E. and Oran, D., "Requirements for Distributed Control of ASR, SV and TTS Resources", draft-burger-speechsc-reqts-00, June 13, 2002.
[17] Maes, S. and Sakrajda, A., "Usage Scenarios for Speech Service Control", draft-maes-speechsc-use-cases-00.txt, June 23, 2002.

11. Authors' Addresses

Stéphane H. Maes
IBM T.J. Watson Research Center
PO Box 218, Yorktown Heights, NY 10598
Phone: +1-914-945-2908
Email: smaes@us.ibm.com

Andrzej Sakrajda
IBM T.J. Watson Research Center
PO Box 218, Yorktown Heights, NY 10598
Phone: +1-914-945-4362
Email: ansa@us.ibm.com