SPEECHSC                                                     S. Maes
Internet Draft                                                   IBM
Document: draft-maes-speechsc-web-services-00            A. Sakrajda
Category: Informational                                          IBM
Expires: December, 2002                                June 23, 2002


         Speech Engine Remote Control Protocols by treating
       Speech Engines and Audio Sub-systems as Web Services


Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026 [1].

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Discussion of this and related documents is on the MRCP list. To subscribe, send the message "subscribe mrcp" to majordomo@snowshore.com. The public archive is at http://flyingfox.snowshore.com/mrcp_archive/maillist.html. NOTE: This mailing list will be superseded by an official working group mailing list, cats@ietf.org, once the WG is formally chartered.

1. Abstract

This document proposes the use of the web service framework based on XML protocols to implement speech engine remote control protocols (SERCP). This document is informational. It illustrates how web services could be used; it is not a detailed specification. Such a specification is expected to be the output of the SPEECHSC activity, if it is decided to go in this direction. The document also enumerates the requirements that have led to selecting a web service framework.

Speech engines (speech recognition, speaker recognition, speech synthesis, recorders and playback, NL parsers, and any other speech processing engines such as speech detection or barge-in detection) as well as audio sub-systems (audio input and output sub-systems) can be considered as web services that can be described and asynchronously programmed via WSDL (on top of SOAP), combined in a flow described via WSFL, discovered via UDDI, and asynchronously controlled via SOAP, which also enables asynchronous exchanges between the engines. This solution has the advantage of providing flexibility, scalability and extensibility while reusing an existing framework that fits the evolution of the web: web services and XML protocols [15].

This document proposes using web services as a framework for SERCP. The proposed framework enables speech applications to control remote speech engines using the standardized mechanism of web services. The control messages may be tuned to the controlled speech engines.

2. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [2].
3. Introduction

This document uses the terminology SERCP (Speech Engine Remote Control Protocols) to be consistent with the terminology used in other documents exchanged at ETSI, 3GPP and OMA, while distinguishing it from the detailed specification proposed by MRCP. SERCP addresses the same set of high-level "SPEECHSC" objectives: the capability to distribute the automatic processing of speech away from the audio sub-system and the associated controlling speech application.

The need for SERCP has been identified in different forums. Originally, the need for SERCP was formulated in the context of the multimodal architecture proposal at ETSI Aurora STQ [3] and followed by explicit SERCP requirements in the context of Distributed Speech Recognition (DSR) [4]. This was followed by two concrete proposals that suggested relying on web services [5,6]. Later, IETF initiated the SPEECHSC BOF activity [7] around the MRCP proposals:
- draft-shanmugham-mrcp-01.txt
- draft-robinson-mrcp-sip-00.txt
which provided additional justifications and requirements for a SERCP framework. A preliminary requirements document [16] and use cases [17] have also been published.

In general, SERCP will support two classes of usage scenarios where speech processing is distributed away from the audio sub-systems and the speech engines are controlled:
- By the source of the audio. A typical scenario is a voice-enabled application running on a wireless terminal but using server-side speech recognition. In [3] and [13], this is exemplified by a fat client MVC multi-modal browser configuration with use of remote engines.
- By a third party controller (i.e. application). A typical scenario is a server-side application (e.g. a VoiceXML browser) that relies on speech recognition performed elsewhere in the network. Numerous voice portal or IVR (Interactive Voice Response) systems rely on such concepts of distribution of the speech processing resources.
This is consistent with the framework described in [17].

4. Design Requirements

At a high level, a distributed speech recognition framework should aim at enabling the application developer or service provider to seamlessly use remote engines:
- The location of the engine SHOULD NOT be important: the system behaves as if the engine was local to the application runtime.
- The performance of the speech engines SHOULD NOT be affected by the distribution of the engines and the presence of the network.
- The functionality achievable by the speech engines MUST be at least equivalent to what can be achieved with local engines.

The rest of this section summarizes and expands on the requirements identified so far that drive the proposal to rely on web services.

4.1 General considerations

There are numerous challenges to the specification of an appropriate SERCP framework. In addition to the MRCP internet drafts, numerous proprietary or standardized fixed engine APIs have been proposed (e.g. SRAPI, SVAPI, SAPI, JSAPI, etc.). None have been significantly adopted so far. Besides making strong assumptions about the underlying platform, such APIs typically provide overly constrained functions: only very limited common denominator engine operations are defined. In particular, it is often difficult to manipulate results and intermediate results (usually exchanged in proprietary formats).
On the other hand, it would not have been practical to add more capabilities to these APIs. Therefore, we propose that:
- SERCP SHOULD NOT be designed as a fixed speech engine API, but
- SERCP MUST be designed as a rich, flexible and extensible framework that allows the use of numerous engines with numerous levels of capabilities.

4.2 Speech engine interoperability: replaceable engines or common protocols?

The considerations made above raise fundamental issues in terms of standardization and interoperability. What is the objective of SPEECHSC?
- (target-1): to enable the replacement of a speech engine provided by one speech vendor by an engine provided by another, and still be able to immediately run the same speech application without any other change, or
- (target-2): to enable speech applications to control remote speech engines using a standardized mechanism, but with messages tuned to the controlled speech engines?

(target-1) is very difficult to achieve. Today, speech engines are adapted to particular tasks. Speech data files (acoustic models, engine configurations and settings, front-end features, internal algorithms, grammars, etc.) differ significantly from vendor to vendor. Even for a single vendor, the deployment of well-performing conversational applications requires numerous engine settings and data file tunings from task to task. In addition, conversational applications and engines still constitute an emerging field, where numerous changes of behavior, interfaces and capabilities must be supported to enable rapid introduction of new conversational capabilities (e.g. support of free flow dialogs, NL parsing, etc.). Ultimately, in the usage scenarios [17] where SERCP would be used by a terminal to drive remote engines or by a voice portal to perform efficient and scalable load balancing, the application / controller knows exactly the engine that it needs to control. The value of SPEECHSC is to rely on a standardized way to implement this remote control.

It may be possible to define a framework where the same application can directly drive engines from different vendors. We prefer to consider this as a particular case of the (target-2) framework rather than as (target-1), which would introduce unnecessary usage limitations on the output of the SPEECHSC activity. Wireless deployments like 3G will require end-to-end specification of such a standard framework. At this stage, it is more valuable to start with an extensible framework (target-2) and, when appropriate, provide a framework that addresses (target-1). Therefore, SERCP is designed to focus on (target-2), while providing mechanisms to achieve (target-1) when it makes sense.

This translates into the following key design requirements for SERCP:
- SERCP MUST provide a standard framework for an application to remotely control speech engines and audio sub-systems. The associated SERCP messages MAY be tuned to the particular speech engine.
- SERCP MUST NOT aim at supporting application interoperability across different speech engines with no changes of the SERCP messages.
- SERCP SHOULD aim at distinguishing and defining messages that are invariant across engine changes from messages that are engine specific.
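To illustrate this last distinction, consider the following purely hypothetical sketch: the namespace URIs, the recognize and grammar elements and the vendor-specific beamwidth parameter are illustrative assumptions made for this document, not part of any proposal. A (target-2) control message could carry invariant settings next to an engine-specific extension:

   <sercp:recognize xmlns:sercp="urn:example:sercp"
                    xmlns:acme="urn:example:acme-asr">
     <!-- invariant across engine changes: which grammar to use -->
     <sercp:grammar src="http://example.com/grammars/date.grxml"/>
     <!-- engine-specific tuning carried in a vendor-qualified
          extension; replacing the engine only affects this part -->
     <acme:beamwidth>250</acme:beamwidth>
   </sercp:recognize>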
As a result, adding support for speech engines from another vendor MAY require changes of the SERCP messages and therefore changes of the application or dialog manager to support these new messages. In the web service framework proposed below, this amounts to changing the WSDL (XML) instructions exchanged with the engines. However, it does not imply any changes other than adaptation of the XML files exchanged with the engines (and possibly new speech engine data files).

4.3 Requirements identified in the context of SPEECHSC

In [16], the following requirements have been proposed:
- SERCP SHOULD reuse existing protocols.
- SERCP MUST maintain integrity of existing protocols.
- SERCP SHOULD avoid duplication of existing protocols.
- SERCP SHOULD satisfy the TTS requirements as described in draft-burger-mrcp-reqts-00.txt, section 6 (expressed according to the terminology defined in RFC-2119 [2]).
- SERCP SHOULD satisfy the ASR requirements as described in draft-burger-mrcp-reqts-00.txt, section 7 (expressed according to the terminology defined in RFC-2119 [2]).

[7] provides additional considerations in terms of security, dual-mode usage (speech recognition and synthesis provided by the same system), etc. Following [17], we assume from the outset that SERCP will drive engines that act on uplink (audio sub-system to engine) and downlink (engine to audio sub-system) speech.

4.4 Requirements identified in the context of ETSI Aurora DSR

In the context of the ETSI Aurora distributed speech recognition framework, the following requirements have been considered. These have also driven the design of SERCP. Note that the DSR framework is not limited to the use of DSR optimized codecs; it can be used in general to distribute speech recognition functions over packet switched networks with any encoding scheme.
- SERCP MUST control the different speech engines involved to carry a dialog with the user. As such:
   - SERCP SHOULD NOT distinguish between controlling a single engine or several engines responsible for processing speech input and generating speech or audio output.
   - SERCP SHOULD NOT be limited to ASR or TTS engines.
   - SERCP SHOULD enable control of the audio sub-systems and additional processors (e.g. control of settings of codecs, acoustic front-end, handling of voice activity detection, barge-in, noise subtraction, etc.).
   - Audio sub-systems and speech processors MAY be considered as "engines" that may be controlled by the application using SERCP messages.
- SERCP MUST support control of speech engines and audio sub-systems by:
   - An application located on the component where the audio-system functions are located (e.g. a wireless terminal).
   - An application located elsewhere on the network (i.e. not collocated with the speech engines or the audio input or output sub-systems).
- SERCP SHOULD NOT specify call-control and session control (re-direction, etc.) and other platform/network specific functions based on dialog, load balancing or resource considerations.
   - However, SERCP MUST support the request to expect or establish streaming sessions between target addresses of speech engines and audio sub-systems.
   - Session establishment and control MUST rely on existing protocols.
- SERCP MUST NOT address the transport of audio.
- SERCP MAY address the exchange of result messages between speech engines.
- SERCP MUST support the combination (serial or parallel) of different engines that will process the incoming audio stream or post-process recognition results. For example, it should be possible to specify an ASR system able to provide an N-best list followed by another engine able to complete the recognition via detailed match, or to pass raw recognition results to an NL parser that will tag them before passing the results to the application dialog manager. More details are provided in [17].
- The framework SHOULD enable engines to advertise their capabilities, their state or the state of their local system. This is especially important when the framework is used for resource management purposes.
- SERCP SHOULD NOT constrain the format, commands or interface that an engine can or should support.
- SERCP MUST be vendor neutral:
   - SERCP MUST support any engine technology and capability.
   - SERCP MUST provide efficient extensibility mechanisms to support any type of engine functionality: existing and future.
   - SERCP MUST support vendor specific commands, results and engine combination through a well specified extensible framework.
- SERCP MUST be asynchronous.
- SERCP MUST be able to stop, suspend, resume and reset the engines.
- SERCP MUST NOT be subject to race conditions. This requirement is extremely important. It is often difficult, from a specification or a deployment point of view, to efficiently handle the race conditions that may occur when hand-holding the engine to load appropriate speech data files (e.g. grammars, language models, acoustic models, etc.) and to report / handle error conditions while simultaneously racing against the incoming audio stream.

It should be noted that if the requirements described above are satisfied, it would be possible to support the use cases identified in [17].

4.5 Additional design considerations

Finally, the following requirements have also driven the design:
- Scalability and robustness of the solution.
- Simplicity of deployment.
- Transmission across firewalls, gateways and wireless networks.
   - This implies that the end-to-end specification of SERCP and the protocols that it may use for transport MUST be supported by the target deployment infrastructure. This is especially important for 3G deployments.
- Need to support the exchange of additional meta-information useful to the application or the speech engines (e.g. speech activity (speech-no-speech), barge-in messages, end of utterance, possible DTMF exchanges, front-end settings and noise compensation parameters, client messages -- settings of the audio sub-system, client events, externally acquired parameters --, annotations (e.g. partial results), application specific messages).

5. Speech engines and audio sub-systems as web services

We propose the framework of web services as an efficient, extensible and scalable way to implement SERCP that satisfies the different requirements enumerated in section 4 and supports the use cases identified in [17]. According to the proposed framework, speech engines (audio sub-systems, engines, speech processors) are defined as web services that are characterized by an interface that consists of some of the following ports:
- "control in" port(s): It sets the engine context, i.e. all the settings required for a speech engine to run.
It may include addresses where to get or send the streamed audio or results.
- "control out" port(s): It produces the non-audio engine output (i.e. results and events). It may also involve some session control exchanges.
- "audio in" port(s): It receives streamed input data.
- "audio out" port(s): It produces streamed output data.

Audio sub-systems can also be treated as web services that can produce streamed data or play incoming streamed data as specified by the control parameters. The "control in" or "control out" messages can be out-of-band, or sent or received interleaved with "audio in or out" data. This can be determined in the context (setup) of the web services.

Speech engines and audio sub-systems are pre-programmed as web services and composed into more advanced services. Once programmed by the application / controller, audio sub-systems and engines await an incoming event (established audio session, etc.) to execute the speech processing that they have been programmed to do and to send the results as programmed. Speech engines as web services are typically programmed to handle a particular speech processing task completely, including the handling of possible errors. For example, a speech engine is programmed to perform recognition of the next incoming utterance with a particular grammar, to send the result to an NL parser, and to contact a particular error recovery process if particular errors occur.

5.1 Examples of SERCP web services

The following list of services and control types is not exhaustive. It is provided purely as an illustration. These examples assume that all control messages are sent as "control in" and "control out". As explained above, the framework could also support such exchanges implemented by interleaving with the streamed audio, etc.

The following are examples of SERCP web services:
- Audio input sub-system - uplink signal processing:
   - control in: silence detection / barge-in configuration, codec context (i.e. setup parameters), asynchronous stop
   - control out: indication of begin and end of speech, barge-in, client events, ...
   - audio in: bound to platform
   - audio out: encoded audio to be streamed to remote speech engines
- Audio output sub-system - downlink signal processing:
   - control in: codec / play context, barge-in configuration, play, ...
   - control out: done playing, barge-in events
   - audio in: from speech engines (e.g. TTS)
   - audio out: to platform
- Speech recognizer (ASR):
   - control in: recognition context, asynchronous stop
   - control out: recognition result, barge-in events
   - audio in: from input sub-system source
   - audio out: none
- Speech synthesizer (TTS) or pre-recorded prompt player:
   - control in: annotated text to synthesize, asynchronous stop
   - control out: status (what has been synthesized so far)
   - audio in: none
   - audio out: audio streamed to the audio output sub-system (or other processor)
- Speaker recognizer (identifier/verifier):
   - control in: claimed user id (for verification) and context
   - control out: identification/verification result, enrollment data
   - audio in: from audio input sub-system
   - audio out: none
- DTMF transceiver.
Note that this example illustrates how web services can also handle DTMF in a consistent manner.
   - control in: how to process (DTMF grammar), expected output format, ...
   - control out: appropriately encoded DTMF key or string (e.g. RFC2833)
   - audio in: bound to platform events (possibly programmed by control in)
   - audio out: none
- Natural language parser:
   - control in: combined recognition and DTMF detector results
   - control out: natural language results
   - audio in: none
   - audio out: none

Variations and additional examples of speech engines as web services can be considered. Pre- and post-processing can also be considered as other web services.

5.2 Advantages of a web service framework for SERCP

The use of web services enables pre-allocating and pre-programming the speech engines. This way, the web service framework automatically handles the race condition issues that may otherwise occur, especially between the streamed audio and the setting up of the engines. This is especially critical when engines are remotely controlled across wireless networks, where the control and stream transport layers may be treated in significantly different manners. This approach also allows decoupling the handling of streamed audio from configuration, control and application level exchanges. This simplifies deployment and increases scalability. By using the same framework as web services, it is possible to rely on the numerous tools and services that have been developed to support authoring, deployment, debugging and management (load balancing, routing, etc.) of web services.

6. Controlling Speech Engines and Audio Sub-Systems

With such a web service view, the specification of SERCP can directly re-use protocols like SOAP [8], WSDL [9], WSFL [10] and UDDI [11]. Contexts can be queried via WSDL [9] or advertised via UDDI [11]. Detailed specifications will be provided as this document evolves.

6.1 WSDL

Using WSDL [9], it is possible to asynchronously program each speech engine and audio sub-system. To illustrate the proposal, let us consider the case where speech engines are allocated via an external routing / load balancing mechanism. A particular engine can be allocated to a particular terminal, telephony port and task on an utterance or session basis. Upon allocation, the application sets the context via WSDL. This includes the addresses of the source or target control and audio ports.

As an example, consider a speech recognition engine allocated to a particular application and telephony port. WSDL instructions program the web service to recognize any incoming audio stream from that telephony port address with a particular grammar, what to do in case of error (what event to throw where), how to notify of barge-in detection, and what to do upon completion of the recognition (where to send results and end-of-recognition events). Similarly, the telephony port is programmed via WSDL to stream incoming audio to the audio port of the allocated ASR web service. When the user speaks, audio is streamed by the port to the ASR engine, which performs the pre-programmed recognition task and sends recognition results to the pre-programmed port, for example that of the application (e.g. a VoiceXML browser [12]).
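A minimal sketch of what the WSDL description of such an ASR web service could look like is given below. All names, namespace URIs and message parts are assumptions made for illustration only; the actual interface would be the output of the SPEECHSC activity.

   <definitions name="AsrEngine"
       targetNamespace="urn:example:sercp:asr"
       xmlns="http://schemas.xmlsoap.org/wsdl/"
       xmlns:xsd="http://www.w3.org/2001/XMLSchema"
       xmlns:tns="urn:example:sercp:asr">

     <message name="SetContextRequest">
       <!-- grammar to load, source audio address, result target -->
       <part name="context" type="xsd:anyType"/>
     </message>
     <message name="RecognitionResult">
       <part name="result" type="xsd:anyType"/>
     </message>

     <!-- "control in" port: the controller programs the engine
          asynchronously (one-way operation) -->
     <portType name="ControlIn">
       <operation name="setContext">
         <input message="tns:SetContextRequest"/>
       </operation>
     </portType>

     <!-- "control out" port: the engine pushes results and events
          to the pre-programmed target (notification operation) -->
     <portType name="ControlOut">
       <operation name="reportResult">
         <output message="tns:RecognitionResult"/>
       </operation>
     </portType>
   </definitions>

The streamed audio itself remains outside this description, in line with the requirement of section 4.4 that SERCP not address the transport of audio.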
The VoiceXML browser generates a particular prompt and programs its allocated TTS engine to start generating audio and streaming it to the telephony port. The cycle can continue.

6.2 WSFL

WSFL [10] provides a generic framework for combining web services through flow composition. We recommend using WSFL to define the flow of the speech engines as web services and to configure the overall system. Accordingly, the sources and targets of web services and the overall flow can be specified with WSFL. The use of web services in general, and WSFL in particular, greatly simplifies the remote configuration and control of chained engines that process the result of a previous engine, or of engines that process the same audio stream.

6.3 UDDI

UDDI [11] is a possible way to enable discovery of speech engines. Other web services approaches can be considered. Speech engines advertise their capability (context) and availability. Applications or resource allocation servers interrogate the UDDI repository to discover available engines that can be allocated for the next utterance or session.

6.4 SOAP

SERCP transports WSDL and WSFL on top of SOAP [8]. SOAP is also particularly attractive because events and other messages between controllers and web services, as well as among speech engine / audio sub-system web services, can also be transported via SOAP. Exchanges of results and events (including stop, resume, reset, etc.) among speech engine and audio sub-system web services, and between web services and the controller or application, can be done via SOAP. In the future, more advanced coordination mechanisms can be used, for example following frameworks such as the one proposed in [14].

SOAP presents the following advantages:
- SOAP is a distributed protocol that is independent of the platform or language.
- SOAP is a lightweight protocol, requiring a minimal amount of overhead.
- SOAP runs over HTTP. This allows access through firewalls.
- SOAP can run over multiple transport protocols such as HTTP, SMTP, and FTP. This should simplify its transport through wireless networks and gateways.
- SOAP is based on XML, which is a highly recognized language within the Web community.
- SOAP/XML is gaining increasing popularity in B2B transactions and other non-telephony applications.
- SOAP/XML is appealing to the Web and IT development community because it is a current technology that they are familiar with.
- SOAP can carry XML documents.

7. Syntax

7.1 Introduction

The SERCP syntax and semantics should be extensible to satisfy (target-2). For these reasons, we propose an XML-based syntax with clear extensibility guidelines. The web service framework is inherently extensible and enables the introduction of additional parameters and capabilities.

The SERCP syntax and semantics are designed to support the widest possible interoperability between engines by relying on messages that are invariant across engine changes, as discussed in section 4.2. This should minimize the need for extensions in as many situations as possible. Existing speech APIs, [5], and the MRCP syntax have been considered as starting points.
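As a sketch of the transport framing only (the sercp namespace URI is an assumption; the body content is outlined in the following sub-sections), a SERCP request can travel as an ordinary SOAP 1.1 envelope, for example carried in an HTTP POST:

   <soap:Envelope
       xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
     <soap:Body>
       <!-- one or more sercp request elements (prompt, listen,
            stop), as sketched in sections 7.2 to 7.4 -->
     </soap:Body>
   </soap:Envelope>

The same envelope structure can carry control-out messages (results, events, stop/resume/reset) from the engines back to the controller or to other engines.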
Speech engines as web services are considered to come with internal contexts that typically consist of the context beyond the scope of the invariant-based SERCP syntax and semantics. As much as possible, the semantics and syntax rely on the W3C Voice Activity specifications [12] to describe the different speech data files required by the engines.

The application software requests from a broker a reference to a SERCP channel and, after obtaining one, all interaction between SERCP and the user consists of XML requests posted to the SERCP server followed by result responses. The interface to the broker is used by the user only once per lifetime of the user process, to bind to a SERCP channel. All SERCP channels are created equal; they become qualified as of a certain type only when the user attaches itself to them (i.e. they assume the application name). The connection to the broker is used by the SERCP channel to acquire speech services as needed -- this is hidden from the user.

The following sub-sections present a sketch of possible content of the Body element of the SOAP messages. It is intended for illustrative purposes, not as an actual specification.

7.2 sercp Namespace

All messages are defined in the sercp namespace. The content of the SOAP message body consists of a set of tags defining the actions to be performed in the control and/or audio channels. The sequencing and grouping by functionality differ for the different categories:
- audio out -- the tag prompt is used with speak, text or audio elements; each one can be repeated more than once in one message but they cannot be mixed in the same request, i.e. one request consists of only one type; speak implies use of the TTS server and play from socket in the audio sink server; audio without speak implies play of pre-recorded audio and tones from a URI resolved in the audio server
- audio in -- the tag listen activates recording in the audio server and defines the set of speech services and the context of execution for them; it also contains a prescription for digit collection and treatment
- audio in and out -- the listen and prompt tags can be submitted in one request; this implies a single connection to the audio server and sequencing of play/record/collect dtmf based on attributes (e.g. barge in)
- stop -- an asynchronous stop can be issued at any time and propagates to all components involved; some requests are atomic (e.g. playing dtmf) and the asynchronous stop is then just tolerated but has no real impact

The following assumptions are made in the remainder of this section:
- audio server (aud-s) -- there is a reachable point where audio stream(s) are present and available for processing, constant for the duration of the session (i.e. the location can change by means external to SERCP); an example would be the first host receiving/sending audio from/to a telephony channel
- speech server (asr-s, tts-s, siv-s) -- all represent reachable points capable of establishing a connection to aud-s, allocated and aggregated by means external to SERCP
- the XML requests and responses can be passed with MIME-like attachments

7.3 prompt--Audio Out Element

The prompt element may carry a number of attributes (the tone, dtmf and mf attributes are described below) and may take one of the following forms: a speak element (synthesized speech), a text element (OEM-annotated text), or one or more audio elements (pre-recorded audio). The prompt element defines the content and method of obtaining and handling the audio to be generated in the outbound channel.
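As an illustration only (the namespace URI and the exact nesting are assumptions; the element names are those used in this section), a prompt request for synthesized speech, and an alternative one for pre-recorded audio, could be conveyed in the SOAP body as:

   <prompt xmlns="urn:example:sercp">
     <speak>
       <!-- SSML content rendered by the TTS server -->
       Welcome back. You have three new messages.
     </speak>
   </prompt>

   <prompt xmlns="urn:example:sercp">
     <!-- pre-recorded audio resolved and streamed by the audio
          server; a cid: src passes the segment as an attachment -->
     <audio src="http://example.com/prompts/welcome.wav"/>
     <audio src="cid:new-message-beep"/>
   </prompt>

As stated in section 7.2, speak, text and audio elements may be repeated within one request but are not mixed in the same request.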
The content following speak MUST contain an SSML message (which implies a mix of synthesized speech and pre-recorded audio streamed by the TTS server). The content following text contains text annotated using OEM specific notations. The omission of the speak or text tag implies that it is just pre-recorded audio retrieved and streamed by the audio server. The src attribute of the audio element may use the "cid:" scheme, in which case the audio segment is passed by value as an attachment.

Sending DTMF digits or call progress tones is a special case of audio generation dealt with by the audio server. For prompt, the tone, dtmf and mf attributes are defined with the following syntax:

   tone="(dialtone|ringback|busy|reorder|f1[-f2])"

where f1[-f2] is used to specify a single or dual frequency, and may be accompanied by additional attributes:

   duration="NNms"   // total duration in ms
   timeon="NNms"     // time on for pulsed tones in ms
   timeoff="NNms"    // time off for pulsed tones -- the on, off
                     // pattern is fitted into duration
   level="-NNdBm"    // level at which the tone should be played

The digits can be sent using the attributes

   dtmf="555*#"   or   mf="01KS"   // K=KP, S=ST

The nicknamed tones have timeon/timeoff/duration/level pre-defined; any of these tones can be redefined by supplying a full tone specification with frequencies and all attributes.

7.4 listen--Audio In Element

The listen element defines the context and method of handling audio in the inbound channel. It implies activating "recording", i.e. reading from the audio stream. Handling of the digits embedded in the input audio is defined through an element digits with a set of attributes. The speech services to be activated and applied to the input stream need a context -- this is supplied within blocks tagged by service type: asr, siv, ... (extensible -- new services can be added).

The listen element uses attributes to provide a detailed prescription for the control of recording and signal processing. The meaning of the attributes is as follows:
- bargein--none disables barge in, speech turns on energy based barge in, asr turns on recognition based barge in (i.e. only a successful recognition stops the prompt); default: none
- beep--0 disables the beep indicating the beginning of recording, 1 turns it on; if barge in is enabled, the setting is ignored; default: 0
- echocancel--0 disables echo cancellation, 1 turns it on; it is applied to firmware in the telephony driver and the control is provided primarily for testing / data collection; if the underlying hardware is not capable of echo cancellation, the setting is silently ignored; default: 1
- endprompt--the "pacifier" prompt to be played immediately after the end of recording; it will be stopped on receipt of a subsequent request; it may be used to cover the time used e.g. for backend queries; default: "".
- endsilence--end silence duration; silence of this duration after speech has been seen triggers the end of recording; -1 or 0 is reserved for infinite; default: infinite
- initialsilence--initial silence duration; expiration of this timeout stops recording and terminates the request with the final result "silence"; -1 or 0 is reserved for infinite; default: infinite
- maxspeech--guard timer; expiration of this timer stops recording; -1 or 0 is reserved for infinite; default: infinite
- minspeech--minimum amount of speech triggering speech detection; 0 means silence detection is disabled and the "start record" request submitted to the audio server starts streaming; default: 0
- retrieve--requests that the recorded audio in the specified MIME type be returned in the response by value; if omitted, audio is not retrieved
- save--requests that the recorded audio be saved on the audio server; the response will contain the URI of the saved audio; the value of the attribute is a MIME type, including x-ep... subtypes marking endpointed PCM; if omitted, audio is not saved
- source--URI of the audio source, the location from which the recipient of the message retrieves audio chunks; it is mandatory
- steponbeep--on/off flag requesting that detection of speech in the very first samples of recording (20-50 ms) triggers the end of recording, reporting the "step-on-beep" result; if barge in is on, the setting is ignored and the value 0 is used; default: 0
- stopondtmf--requests that the recording stops when a dtmf tone is detected; default: 1

7.4.1 asr--Speech Recognition listen

The asr tag content implies use of the ASR service and defines the context of execution for recognition. The context tag attaches a name to a collection of grammars and/or vocabularies. Its attributes are:
- nbest--maximum number of results to retrieve
- completetimeout--amount of silence triggering the ASR response when the ASR engine has a complete "in-grammar" result
- incompletetimeout--amount of silence triggering the ASR response when the ASR engine has a partial result

A vocabulary is just a flat list of entries with optional pronunciations and sounds-likes; it can be embedded in a grammar or it can be standalone. Any of the entries (context, grammar, vocabulary) can be passed by reference by supplying a URI. The scheme "cid:" is used to specify attachments.

7.4.2 digits--Digits In

The digits element uses attributes to provide a detailed prescription for digit collection. The meaning of the attributes is as follows:
- length--number of digits to collect (if omitted, 1 is assumed)
- firstdigit--how long to wait for the first digit (ms); the timer starts on completion of play (if omitted, a configuration defined value is assumed)
- nextdigit--defines the interdigit interval (if omitted, a configuration defined value is assumed)
- termdigit--optional string constructed from 0123456789*#ABCDabcd; any digit in the string detected terminates digit collection; the default value is ""

7.4.3 prompt Response Tags

The response to a prompt request reports a final result of done|hangup|error.

7.4.4 listen Response Tags

The response to a listen request reports a final result of done|dtmf|hangup|error, together with the recognition results (e.g. the recognized text and its spelling).
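As an illustration only (the namespace URI, the nesting and the exact value formats are assumptions; attribute and element names are those defined in sections 7.4 to 7.4.2, and timer values are assumed to be in milliseconds), a listen request activating recognition with a referenced grammar and digit collection could look like:

   <listen xmlns="urn:example:sercp"
           source="http://aud-s.example.com/session42/audio"
           bargein="speech"
           initialsilence="5000"
           endsilence="800"
           maxspeech="20000"
           stopondtmf="1"
           save="audio/x-wav">
     <asr>
       <context name="main-menu" nbest="3" completetimeout="500">
         <grammar src="http://example.com/grammars/menu.grxml"/>
       </context>
     </asr>
     <digits length="4" termdigit="#"/>
   </listen>

The corresponding response would report a final result of done, dtmf, hangup or error as described in section 7.4.4, together with the recognition output and, because save was specified, the URI of the recorded audio on the audio server.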
8. Usage examples

The elements prompt and listen can be used separately over two distinct channels -- in this case synchronization of the channels needs to be managed by the user (e.g. stopping play on speech detected when barge in is enabled). It can be assisted by bits of control information passed in the audio channel (e.g. the audio server should be capable of stopping the TTS server through the audio channel). The server responses carry the results of the requested operations.

8.1 Prompt

This sub-section illustrates the use of the prompt element.

8.1.1 Play

Playing pre-recorded audio is possible by delivering the message directly to the audio server; there is no need to involve TTS. The audio server can determine when to stop (e.g. on digit or speech). Playing synthesized audio is possible by delivering a message that contains a speak element.

8.2 Listen

The listen tag implies recording. The speech servers, if any, receive the URI of the audio server defining the scheme and location used to retrieve audio.

8.2.1 recording

The listen element can be used to specify just recording of the audio. The recording can be returned as an attachment to the response.

8.2.2 asr

The recognition request can contain an explicit context (e.g. an inline grammar with entries such as "stop", "start", "go back") or just a reference to one. The "cid:" scheme can be used to pass a context by value as an attachment.

8.2.3 asr and siv

Two speech servers can be attached to a single audio source, e.g. a recognizer and a speaker identifier. The mechanism of delivering the end-pointed audio to both servers is up to the audio server.

8.3 Prompt and Listen

The presence of both tags (prompt and listen) in a message implies the dispatch of one turn consisting of play/record and an optional end play. The actions are either parallel or sequential -- it depends on the barge in setting -- but from the user perspective it is a single request-response sequence.

8.3.1 Play and Collect

Typical play-and-collect turns combine a prompt with the digits element:
- Asking for a choice: prompt "please press 1 for yes, 0 for no".
- Variable length input: prompt "please enter pin number followed by a pound".
- Asking for a selection: prompt "please press any digit when you hear your choice; one, two, ...". The response contains the number of samples played which, together with marker offsets, allows the application to determine the choice made.

8.3.2 Play and Recognize

Typical play-and-recognize turns combine a prompt with the asr block:
- Asking for a choice: prompt "please say yes or no".
- Variable length input: prompt "please say pin number".
- Asking for a selection: prompt "please say stop when you hear your choice; one, two, ...".

The dual mode input -- speech or dtmf -- can be handled with a prompt such as "please say or enter from the keypad pin number". A digit pressed before speech is detected stops the recording and the prompt (if the asr server is allocated on speech detection, it is never used); the dtmf timers then take over control of the digit collection.

The asr barge in can be used to ignore out-of-grammar speech, which is likely during playback of a long text (e.g. synthesis of a long e-mail). Detected speech does not stop the prompt; a successful recognition of a command (e.g. "stop") is needed to trigger the stop. The recording stops at the end of the prompt.

9. Security Considerations

SERCP may raise several security issues that are to be considered as OPEN ISSUES:
- Engine remote control may come from non-authorized sources that may request unauthorized processing (e.g. extraction of voice prints, modification of the played back text, recording of a dialog, corruption of the request, re-routing of recognition results, corrupting recognition), with significant security, privacy or IP / copyright issues. The SPEECHSC activity SHOULD address these issues. Web services are confronted with the same issues, and the same approaches (encryption, request authentication, content integrity checks, secure architectures, etc.) can be used with SERCP.
- Engine remote control may enable a third party to request speech data files (e.g. a grammar or vocabulary) that are considered proprietary (e.g. a hand-crafted complex grammar) or that contain private information (e.g. the list of names of the customers of a bank). The SPEECHSC activity SHOULD address how to maintain control over the distribution of the speech data files needed by web services, and therefore not only the authentication of SERCP exchanges but also of the target speech engine web services.

The exchange of encoded audio streams may also raise important security issues. However, these are not different from those of conventional voice and VoIP exchanges. This SHOULD be considered as beyond the scope of the SPEECHSC activity.

10. References

[1] Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9, RFC 2026, October 1996.
[2] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[3] Maes, S. H., Muthusamy, Y. and Wajda, W., "Multi-modal Browser Architecture Recommendation", "Clayman proposal" to the ETSI DSR Application and Protocol Working Group, ETSI, January 16, 2000.
[4] Maes, S. H. and Reng, R., "Requirements and Recommendations for Conversational Distributed Protocols and Conversational Engine Remote Control; Version 0.5", AU/310/01, May 3, 2001.
[5] Maes, S. H. and Sakrajda, A., "Conversational Engine Remote Control Protocols", Proposal to the ETSI DSR STQ Application and Protocol Working Group, June 26, 2001.
[6] Coles, A., "Use of SIP and SOAP as Basis for a Speech Engine Control Protocol", ETSI STQ Aurora DSR Applications and Protocols working group, June 28, 2001.
[7] Burger, E. and Oran, D., "Control of ASR and TTS Servers BOF (cats)", http://www.ietf.org/ietf/02mar/cats.txt
[8] Simple Object Access Protocol (SOAP), http://www.w3c.org/2002/ws/
[9] Web Services Description Language (WSDL 1.1), W3C Note 15 March 2001, http://www.w3.org/TR/wsdl
[10] Leymann, F., "Web Services Flow Language, WSFL 1.0", May 2001, http://www-4.ibm.com/software/solutions/webservices/pdf/WSFL.pdf
[11] UDDI, http://www.uddi.org/specification.html
[12] W3C Voice Activity, http://www.w3c.org/Voice/
[13] Maes, S., "Multi-modal and Multi-device Interaction", Input document to 3GPP T2 and W3C MMI, http://www.w3.org/2002/mmi/2002/MM-Arch-Maes-20010820.pdf
[14] WSXL - Web Service eXperience Language, submitted to OASIS WSIA and WSRP.
[15] W3C Web Services, http://www.w3c.org/2002/ws/
[16] Burger, E. and Oran, D., "Requirements for Distributed Control of ASR, SV and TTS Resources", draft-burger-speechsc-reqts-00, June 13, 2002.
[17] Maes, S. and Sakrajda, A., "Usage Scenarios for Speech Service Control", draft-maes-speechsc-use-cases-00.txt, June 23, 2002.

11. Authors' Addresses

Stéphane H. Maes
IBM T.J. Watson Research Center
PO Box 218, Yorktown Heights, NY 10598
Phone: +1-914-945-2908
Email: smaes@us.ibm.com

Andrzej Sakrajda
IBM T.J. Watson Research Center
PO Box 218, Yorktown Heights, NY 10598
Phone: +1-914-945-4362
Email: ansa@us.ibm.com