Internet Engineering Task Force Saravanan Shanmugham
Internet-Draft Cisco Systems Inc.
draft-shanmugham-mrcp-01 Peter Monaco
Expires: May 20, 2002 Nuance Communications
Brian Eberman
Speechworks Inc.
November 20, 2001
MRCP: Media Resource Control Protocol
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Copyright Notice
Copyright (C) The Internet Society (1999). All Rights Reserved.
Abstract
The Media Resource Control Protocol (MRCP) is an application-level
protocol to control media service resources like Speech
Synthesizers, Recognizers, Signal Generators, Signal Detectors, Fax
Servers etc. over a network. This protocol is designed to work with
streaming protocols like RTSP (Real Time Streaming Protocol) or
SIP (Session Initiation Protocol), which help establish control
connections to external media streaming devices, and media delivery
mechanisms like RTP (Real-time Transport Protocol).
Table of Contents
Status of this Memo.................................................1
Copyright Notice....................................................1
Abstract............................................................1
S. Shanmugham, et. al. Page 1
Media Resource Control Protocol November 2001
Table of Contents...................................................1
1. Introduction:................................................4
2. Architecture:................................................4
2.1. Resources and Services:.....................................5
2.2. Server and Resource Addressing..............................5
3. MRCP Protocol Basics.........................................5
3.1. Establishing Control Session and Media Streams..............6
3.2. MRCP over RTSP..............................................6
3.3. Media Streams and RTP Ports................................10
4. Notational Conventions......................................10
5. MRCP Specification..........................................10
6. MRCP Message................................................11
6.1. Message Types..............................................11
6.2. Request....................................................12
6.3. Response...................................................12
6.3.1. Status Codes.............................................13
6.4. Event......................................................13
6.5. Generic Headers............................................14
6.5.1. Active-Request-Id-List...................................14
6.5.2. Proxy-Sync-Id............................................15
6.5.3. Content-Type.............................................15
6.5.4. Content-Id...............................................15
6.5.5. Content-Base.............................................15
6.5.6. Content-Encoding.........................................16
6.5.7. Content-Location.........................................16
6.5.8. Content-Length...........................................16
6.5.9. Cache-Control............................................17
6.5.10. Logging-Tag............................................18
7. Media Server................................................18
7.1. Media Server Session.......................................18
8. Speech Synthesizer Resource.................................21
8.1. Synthesizer State Machine..................................21
8.2. Synthesizer Methods........................................21
8.3. Synthesizer Events.........................................22
8.4. Synthesizer Header Fields..................................22
8.4.1. Jump-Target..............................................23
8.4.2. Kill-On-Barge-In.........................................23
8.4.3. Speaker Profile..........................................24
8.4.4. Completion Cause.........................................24
8.4.5. Voice-Parameters.........................................24
8.4.6. Prosody-Parameters.......................................25
8.4.7. Vendor Specific Parameters...............................25
8.4.8. Speech Marker............................................25
8.4.9. Speech Language..........................................26
8.4.10. Fetch Hint.............................................26
8.4.11. Audio Fetch Hint.......................................26
8.4.12. Fetch Timeout..........................................26
8.4.13. Failed URI.............................................27
8.4.14. Failed URI Cause.......................................27
8.4.15. Speak Restart..........................................27
8.4.16. Speak Length...........................................27
8.5. Synthesizer Message Body...................................28
8.5.1. Synthesizer Speech Data..................................28
8.6. SET-PARAMS.................................................29
8.7. GET-PARAMS.................................................30
8.8. SPEAK......................................................31
8.9. STOP.......................................................32
8.10. BARGE-IN-OCCURRED........................................34
8.11. PAUSE....................................................36
8.12. RESUME...................................................37
8.13. CONTROL..................................................38
8.14. SPEAK-COMPLETE...........................................40
8.15. SPEECH-MARKER............................................41
9. Speech Recognizer Resource..................................43
9.1. Recognizer State Machine...................................43
9.2. Recognizer Methods.........................................43
9.3. Recognizer Events..........................................43
9.4. Recognizer Header Fields...................................44
9.4.1. Confidence Threshold.....................................45
9.4.2. Sensitivity Level........................................45
9.4.3. Speed Vs Accuracy........................................45
9.4.4. N Best List Length.......................................45
9.4.5. No Input Timeout.........................................46
9.4.6. Recognition Timeout......................................46
9.4.7. Waveform URL.............................................46
9.4.8. Completion Cause.........................................46
9.4.9. Recognizer Context Block.................................47
9.4.10. Recognition Start Timers...............................48
9.4.11. Vendor Specific Parameters.............................48
9.4.12. Speech Complete Timeout................................48
9.4.13. Speech Incomplete Timeout..............................49
9.4.14. DTMF Interdigit Timeout................................49
9.4.15. DTMF Term Timeout......................................50
9.4.16. DTMF-Term-Char.........................................50
9.4.17. Fetch Timeout..........................................50
9.4.18. Failed URI.............................................50
9.4.19. Failed URI Cause.......................................50
9.4.20. Save Waveform..........................................50
9.4.21. Reset Audio Channel....................................51
9.5. Recognizer Message Body....................................51
9.5.1. Recognizer Grammar Data..................................51
9.5.2. Recognizer Result Data...................................54
9.5.3. Recognizer Context Block.................................54
9.6. SET-PARAMS.................................................55
9.7. GET-PARAMS.................................................56
9.8. DEFINE-GRAMMAR.............................................56
9.9. RECOGNIZE..................................................60
9.10. STOP.....................................................62
9.11. GET-RESULT...............................................64
9.12. START-OF-SPEECH..........................................65
9.13. RECOGNITION-START-TIMERS.................................65
9.14. RECOGNITION-COMPLETE.....................................65
9.15. DTMF Detection...........................................67
10. Future Study................................................67
11. RTSP based Examples:........................................67
12. Reference Documents.........................................73
13. Full Copyright Statement....................................73
14. Acknowledgements............................................74
15. Authors' Addresses..........................................74
1. Introduction:
The Media Resource Control Protocol (MRCP) is designed to provide a
mechanism for a client device requiring audio/video stream
processing to control processing resources on the network. These
media processing resources may be Speech Recognizers, Speech
Synthesizers, FAX, Signal Detectors, etc. MRCP allows for
implementation of distributed Interactive Voice Response platforms,
for example VoiceXML [8] interpreters.
The MRCP protocol defines the requests, responses and events needed
to control the media processing resources. The MRCP protocol defines
the state machine for each resource and the required state
transitions for each request and server-generated event.
The MRCP protocol does not address how the control session is
established with the server and relies on the Real Time Streaming
Protocol (RTSP) [2] to establish and maintain the session. The
session control protocol is also responsible for establishing the
media connection from the client to the network server. The MRCP
protocol and its messages are designed to be carried over RTSP or
another protocol as a MIME-type, similar to the Session Description
Protocol (SDP).
2. Architecture:
The system consists of a client that requires media streams
generated or needs media streams processed and a server that has the
resources or devices to process or generate the streams. The client
establishes a control session with the server for media processing
using a protocol such as RTSP. This protocol also sets up
the RTP stream between the client and the server or another RTP
endpoint. Each resource needed in processing or generating the
stream is addressed or referred to by a URL. The client can then use
MRCP messages to control the media resources and affect how they
process or generate the media stream.
  |--------------------|
  ||------------------||                   |----------------------|
  || Application Layer||                   ||--------------------||
  ||------------------||                   || TTS  | ASR  | FAX  ||
  ||  ASR/TTS API     ||                   ||Plugin|Plugin|Plugin||
  ||------------------||                   ||  on  |  on  |  on  ||
  ||    MRCP Core     ||                   || MRCP | MRCP | MRCP ||
  ||  Protocol Stack  ||                   ||--------------------||
  ||------------------||                   ||   RTSP Stack       ||
  ||   RTSP Stack     ||                   ||                    ||
  ||------------------||                   ||--------------------||
  ||   TCP/IP Stack   ||========IP=========||   TCP/IP Stack     ||
  ||------------------||                   ||--------------------||
  |--------------------|                   |----------------------|

       MRCP client                           Real-time Streaming
                                              MRCP media server
2.1. Resources and Services:
The server is set up to offer a certain set of resources and
services to the client. These resources are of 3 types.
Transmission Resources
These are resources that are capable of generating real-time
streams, like signal generators that generate tones and sounds of
certain frequencies and patterns, Speech Synthesizers that generate
spoken audio streams etc.
Reception Resources
These are resources that receive and process streaming data like
Signal Detectors and Speech Recognizers.
Dual Mode Resources
These are resources that both send and receive data like a fax
resource, capable of sending or receiving fax through a two-way RTP
stream.
2.2. Server and Resource Addressing
The server as a whole is addressed using a container URL, and the
individual resources the server has to offer are reached by
individual resource URLs within the container URL.
RTSP Example:
A media server or container URL like,
rtsp://mediaserver.com/media/
may contain one or more resource URLs of the form,
rtsp://mediaserver.com/media/speechrecognizer/
rtsp://mediaserver.com/media/speechsynthesizer/
rtsp://mediaserver.com/media/fax/
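The container and resource URL relationship above can be sketched in a few lines of Python. This is only an illustration: the helper names resource_url and is_within are hypothetical and are not defined by MRCP or RTSP.

```python
from urllib.parse import urlsplit


def resource_url(container_url: str, resource: str) -> str:
    """Build a resource URL that lives inside the container URL."""
    return container_url.rstrip("/") + "/" + resource.strip("/") + "/"


def is_within(container_url: str, url: str) -> bool:
    """Return True if url names a resource under the container URL."""
    c, u = urlsplit(container_url), urlsplit(url)
    same_server = (c.scheme, c.netloc) == (u.scheme, u.netloc)
    return same_server and u.path.startswith(c.path.rstrip("/") + "/")


container = "rtsp://mediaserver.com/media/"
recognizer = resource_url(container, "speechrecognizer")
```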
3. MRCP Protocol Basics
The message format for MRCP is text based with mechanisms to carry
embedded binary data. This allows data like recognition grammars,
recognition results, synthesizer speech markup etc to be carried in
the MRCP message between the client and the server resource. The
protocol does not address session control management, media
management, reliable sequencing and delivery or server or resource
addressing. These are left to a protocol like SIP or RTSP.
MRCP addresses the issue of controlling and communicating with the
resource processing the stream, and defines the requests, responses
and events needed to do that.
3.1. Establishing Control Session and Media Streams
The control session between the client and the server is established
using a protocol like RTSP. This protocol will also set up the
appropriate RTP streams between the server and the client,
allocating ports and setting up transport parameters as needed. Each
control session is identified by a unique session-id. The format,
usage and life cycle of the session-id are in accordance with the
RTSP protocol. The resources within the session are addressed by
the individual resource URLs.
The MRCP protocol is designed to work with and tunnel through
another protocol like RTSP, and augment its capabilities. MRCP
relies on RTSP headers for sequencing, reliability and addressing to
make sure that messages get delivered reliably and in the correct
order and to the right resource. The MRCP messages are carried in
the RTSP message body. The media server delivers the MRCP message
to the appropriate resource or device by looking at the session
level message headers and URL information. Another protocol, such as
SIP [4], could be used for tunneling MRCP messages [7].
3.2. MRCP over RTSP
RTSP supports both TCP and UDP transports for the client to talk to
the server; the choice is indicated by the RTSP URL. All media servers
providing support for MRCP and its resources MUST support TCP
transport for the RTSP protocol. Support for UDP transport is
OPTIONAL. In RTSP, the ANNOUNCE method and its response MUST be used
to carry MRCP requests and responses between the client and the
server. MRCP events MUST be carried in ANNOUNCE
messages from the server to the client. MRCP messages MUST NOT be
communicated in the RTSP SETUP or TEARDOWN messages. Currently all
RTSP messages are request/response pairs and there is no support for
asynchronous messages. This is because RTSP was designed to work
over TCP or UDP and hence could not assume reliability in the
underlying protocol. An RTSP extension to send asynchronous events
from the server to the client would provide an alternate vehicle to
carry MRCP events from the server, but no such extension exists today.
An RTSP session is created when an RTSP SETUP message is sent from
the client to a server and is addressed to a server URL or any one
of its resource URLs without specifying a session-id. The server
will establish a session context and will respond with a session-id
to the client. This sequence will also set up the RTP transport
parameters between the client and the server and the server is ready
to receive or send media streams. If the client wants to attach an
additional resource to an existing session, the client should send
that session's ID in the subsequent SETUP message.
When a media server implementing MRCP over RTSP receives a PLAY,
RECORD or PAUSE RTSP method addressed to an MRCP resource URL, it should
respond with an RTSP 405 "Method Not Allowed" response. For these
resources, the only allowed RTSP methods are SETUP, TEARDOWN,
DESCRIBE and ANNOUNCE.
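A minimal sketch of this method check, assuming a server that dispatches on the RTSP method name before looking at the message body. The function name rtsp_status_for is hypothetical; only the method list and the 405 status come from the text above.

```python
from typing import Tuple

# RTSP methods that an MRCP resource URL accepts, per the paragraph above.
ALLOWED_RTSP_METHODS = frozenset({"SETUP", "TEARDOWN", "DESCRIBE", "ANNOUNCE"})


def rtsp_status_for(method: str) -> Tuple[int, str]:
    """Return the RTSP status a server would give for a method on an MRCP resource."""
    if method in ALLOWED_RTSP_METHODS:
        return 200, "OK"
    return 405, "Method Not Allowed"
```

A real server would return 200 only after actually processing the request; the sketch shows just the gating decision.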
C->S:  SETUP rtsp://media.server.com/media/synthesizer RTSP/1.0
       CSeq: 2
       Transport: RTP/AVP;unicast;client_port=8000-8001
       Content-Type: application/sdp
       Content-Length: 190

       v=0
       o=- 123 456 IN IP4 10.0.0.1
       s=Media Server
       p=+1-888-555-1212
       c=IN IP4 0.0.0.0
       t=0 0
       m=audio 8000 RTP/AVP 0 96
       a=rtpmap:0 pcmu/8000
       a=rtpmap:96 telephone-event/8000
       a=fmtp:96 0-15

S->C:  RTSP/1.0 200 OK
       CSeq: 2
       Transport: RTP/AVP;unicast;client_port=8000-8001;
                  server_port=9000-9001
       Session: 12345678
       Content-Length: 190
       Content-Type: application/sdp

       v=0
       o=- 3211724219 3211724219 IN IP4 10.3.2.88
       s=Media Server
       c=IN IP4 0.0.0.0
       t=0 0
       m=audio 9000 RTP/AVP 0 96
       a=rtpmap:0 pcmu/8000
       a=rtpmap:96 telephone-event/8000
       a=fmtp:96 0-15

C->S:  SETUP rtsp://media.server.com/media/recognizer RTSP/1.0
       CSeq: 3
       Transport: RTP/AVP;unicast;client_port=8000-8001;
                  mode=record
       Session: 12345678

S->C:  RTSP/1.0 200 OK
       CSeq: 3
       Transport: RTP/AVP;unicast;client_port=8000-8001;
                  server_port=9000-9001;mode=record
       Session: 12345678
       Content-Length: 193
       Content-Type: application/sdp

       v=0
       o=- 3211724947 3211724947 IN IP4 10.3.2.88
       s=Media Server
       c=IN IP4 0.0.0.0
       t=0 0
       m=audio 9000 RTP/AVP 0 101
       a=rtpmap:0 pcmu/8000
       a=rtpmap:101 telephone-event/8000
       a=fmtp:101 0-15

C->S:  ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
       CSeq: 4
       Session: 12345678
       Content-Type: application/mrcp
       Content-Length: 223

       SPEAK 543257 MRCP/1.0
       Voice-gender: neutral
       Voice-category: teenager
       Prosody-volume: medium
       Content-Type: application/synthesis+ssml
       Content-Length: 104

       You have 4 new messages. The first is from Stephanie Williams
       and arrived at 3:45pm. The subject is ski trip

S->C:  RTSP/1.0 200 OK
       CSeq: 4
       Session: 12345678
       RTP-Info: url=rtsp://media.server.com/media/synthesizer;
                 seq=9810092;rtptime=3450012
       Content-Type: application/mrcp
       Content-Length: 52

       MRCP/1.0 543257 200 IN-PROGRESS

C->S:  ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
       CSeq: 93
       Session: 12345678
       Content-Type: application/mrcp
       Content-Length: 190

       RECOGNIZE 543257 MRCP/1.0
       Confidence-Threshold: 90
       Content-Type: application/grammar+xml
       Content-Id: request1@form-level.store
       Content-Length: 104

       oui yes
       may I speak to
       Michel Tremblay Andre Roy

S->C:  RTSP/1.0 200 OK
       CSeq: 93
       Content-Type: application/mrcp
       Content-Length: 87

       MRCP/1.0 543257 200 IN-PROGRESS

S->C:  ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
       CSeq: 217
       Session: 12345678
       Content-Type: application/mrcp
       Content-Length: 733

       RECOGNITION-COMPLETE 543257 COMPLETE MRCP/1.0
       Completion-Cause: 000 success
       Waveform-URL: http://web.media.com/session123/audio.wav
       Content-Type: application/x-nlsml
       Content-Length: 276

       Andre Roy
       may I speak to Andre Roy

C->S:  RTSP/1.0 200 OK
       CSeq: 217
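The tunneling shown in these exchanges can be sketched as a small helper that wraps an MRCP message in an RTSP ANNOUNCE request. This is an illustrative sketch, not part of the protocol definition: the function name build_announce and its parameters are assumptions, and it computes Content-Length from the encoded body rather than reusing the figures in the examples.

```python
def build_announce(url: str, cseq: int, session: str, mrcp_message: str) -> bytes:
    """Wrap an MRCP message in the body of an RTSP ANNOUNCE request."""
    body = mrcp_message.encode("utf-8")
    head = (
        f"ANNOUNCE {url} RTSP/1.0\r\n"
        f"CSeq: {cseq}\r\n"
        f"Session: {session}\r\n"
        "Content-Type: application/mrcp\r\n"
        f"Content-Length: {len(body)}\r\n"  # octets in the MRCP body, not characters
        "\r\n"
    )
    return head.encode("ascii") + body


# Hypothetical STOP request with no headers and no body.
packet = build_announce(
    "rtsp://media.server.com/media/synthesizer", 5, "12345678",
    "STOP 1 MRCP/1.0\r\n\r\n",
)
```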
3.3. Media Streams and RTP Ports
A single set of RTP/RTCP ports is negotiated and shared between the
MRCP client and server when multiple media processing resources,
such as ASR engines and TTS engines, are used for a single session.
The individual resource instances allocated on the server under a
common session identifier will feed from/to that single RTP stream.
The client can send multiple media streams towards the server
differentiated by using different sync sources or SSRC values.
Similarly, the server can use multiple SSRC values to differentiate
media streams originating from the individual transmission resource
URLs if more than one exists. Alternatively, the individual resources
may work together to send just one stream to the client.
This is up to the implementation of the media server.
4. Notational Conventions
Since many of the definitions and syntax are identical to
HTTP/1.1, this specification only points to the section where they
are defined rather than copying it. For brevity, [HX.Y] is to be
taken to refer to Section X.Y of the current HTTP/1.1 specification
(RFC 2616 [1]).
All the mechanisms specified in this document are described in
both prose and an augmented Backus-Naur form (BNF) similar to that
used in [H2.1]. It is described in detail in RFC 2234 [3], with the
difference that this MRCP specification maintains the "1#" notation
for comma-separated lists.
5. MRCP Specification
The MRCP PDU is textual using an ISO 10646 character set in the
UTF-8 encoding (RFC 2044) to allow many different languages to be
represented. However, to assist in compact representations, MRCP
also allows other character sets such as ISO 8859-1 to be used when
desired. The MRCP protocol headers and field names use only the
US-ASCII subset of UTF-8. Internationalization only applies to certain
fields like grammar, results, speech markup etc., and not to MRCP as
a whole. Lines are terminated by CRLF, but receivers should be
prepared to also interpret CR and LF by themselves as line
terminators. Also, some parameters in the PDU may contain binary
data or a record spanning multiple lines. Such fields have a length
value associated with the parameter, which indicates the number of
octets immediately following the parameter.
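The tolerant line handling described above might look like the following sketch. It assumes the receiver applies it only to the header portion of a PDU, since length-delimited data must be read as raw octets; the name split_lines is hypothetical.

```python
import re

# CRLF is tried first so that it is not split into two separate terminators.
_LINE_END = re.compile(r"\r\n|\r|\n")


def split_lines(text: str) -> list:
    """Split header text on CRLF, bare CR, or bare LF."""
    return _LINE_END.split(text)
```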
The whole MRCP PDU is encoded in the body of the session level
message as a MIME entity of type application/mrcp. The individual
MRCP messages do not have addressing information as to the resource
the request/response are to/from. Instead the MRCP message relies on
the header of the session level message carrying it to deliver the
request to the appropriate resource, or to figure out who the
response or event is from.
6. MRCP Message
6.1. Message Types
The MRCP message set consists of requests from the client to the
server, responses from the server to the client and events from the
server to the client. All these messages consist of a start-line,
zero or more header fields (also known as "headers"), an empty line
(i.e. a line with nothing preceding the CRLF) indicating the end of
the header fields, and an optional message body.
    generic-message  =    start-line
                          *message-header
                          CRLF
                          [ message-body ]

    start-line       =    request-line | response-line | event-line

    message-header   =    *(generic-header | resource-header)

    resource-header  =    recognizer-header
                     |    synthesizer-header
The message-body contains resource-specific and message-specific
data that needs to be carried between the client and server as a
MIME entity. The information contained here and the actual MIME-
types used to carry the data are specified later when addressing the
specific messages.
If a message contains data in the message body, the header fields
will contain content-headers indicating the MIME-type and encoding
of the data in the message body.
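As a rough sketch of this layout, the following Python splits a raw MRCP message into its start-line, headers and body. The function name parse_message is hypothetical, the text/plain body in the usage is illustrative only, and the sketch assumes the body length is given by a Content-Length header.

```python
import re


def parse_message(raw: bytes):
    """Split an MRCP message into (start_line, headers, body)."""
    # The first empty line ends the header section.
    sep = re.search(rb"\r\n\r\n|\n\n|\r\r", raw)
    head, rest = (raw[:sep.start()], raw[sep.end():]) if sep else (raw, b"")
    lines = re.split(r"\r\n|\r|\n", head.decode("utf-8"))
    headers = {}
    for line in lines[1:]:
        if line:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
    # Content-Length counts octets, so slice the undecoded bytes.
    length = int(headers.get("content-length", 0))
    return lines[0], headers, rest[:length]


raw = (b"SPEAK 543257 MRCP/1.0\r\n"
       b"Voice-gender: neutral\r\n"
       b"Content-Type: text/plain\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"hello")
start_line, headers, body = parse_message(raw)
```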
6.2. Request
An MRCP request consists of a Request line followed by zero or more
parameters as part of the message headers and an optional message
body containing data specific to the request message.
The Request message from a client to the server includes, within the
first line, the method to be applied, a request identifier for that
request, and the version of the protocol in use.
    request-line = method-name SP request-id SP mrcp-version CRLF
The request-id field is a unique identifier created by the client
and sent to the server. The server resource should use this
identifier in its response to this request. If the request does not
complete with the response, future asynchronous events associated
with this request MUST carry the request-id.
request-id = 1*DIGIT
The method-name field identifies the specific request that the
client is making to the server. Each resource supports a certain
list of requests or methods that can be issued to it, and will be
addressed in later sections.
method-name = synthesizer-method
| recognizer-method
The mrcp-version field is the MRCP protocol version that is being
used by the client.
mrcp-version = "MRCP" "/" 1*DIGIT "." 1*DIGIT
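A client might generate request lines along these lines. This is a sketch under assumptions: a simple incrementing counter is one way (not the required way) to keep request-ids unique, and the names next_request_id and format_request_line are hypothetical.

```python
import itertools

_request_ids = itertools.count(1)


def next_request_id() -> int:
    """Hand out request-ids that are unique within this client process."""
    return next(_request_ids)


def format_request_line(method: str, request_id: int, version: str = "1.0") -> str:
    """Render: method-name SP request-id SP mrcp-version CRLF."""
    if request_id < 0:
        raise ValueError("request-id must be 1*DIGIT")
    return f"{method} {request_id} MRCP/{version}\r\n"
```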
6.3. Response
After receiving and interpreting the request message, the server
resource responds with an MRCP response message. It consists of a
response line optionally followed by a message body.
    response-line = mrcp-version SP request-id SP status-code SP
                    request-state CRLF
The mrcp-version field used here is similar to the one used in the
Request Line and indicates the version of MRCP protocol running on
the server.
The request-id used in the response MUST match the one sent in the
corresponding request message.
The status-code field is a 3-digit code representing the success or
failure or other status of the request.
The request-state field indicates if the job initiated by the
Request is PENDING, IN-PROGRESS or COMPLETE. The COMPLETE status
means that the Request was processed to completion and that there
will be no more events from that resource to the client with
that request-id. The PENDING status means that the job has been
placed on a queue and will be processed in first-in-first-out order.
The IN-PROGRESS status means that the request is being processed and
is not yet complete. A PENDING or IN-PROGRESS status indicates that
further Event messages will be delivered with that request-id.
request-state = "COMPLETE"
| "IN-PROGRESS"
| "PENDING"
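The response-line grammar above can be checked with a regular expression, as in this sketch. The function name parse_response_line and the returned dictionary keys are assumptions made for illustration.

```python
import re

_RESPONSE_LINE = re.compile(
    r"^MRCP/(\d+)\.(\d+) (\d+) (\d{3}) (COMPLETE|IN-PROGRESS|PENDING)$"
)


def parse_response_line(line: str) -> dict:
    """Parse: mrcp-version SP request-id SP status-code SP request-state."""
    m = _RESPONSE_LINE.match(line.rstrip("\r\n"))
    if m is None:
        raise ValueError(f"not an MRCP response line: {line!r}")
    state = m.group(5)
    return {
        "version": (int(m.group(1)), int(m.group(2))),
        "request_id": m.group(3),
        "status_code": int(m.group(4)),
        "request_state": state,
        # PENDING or IN-PROGRESS means events with this request-id will follow.
        "expects_events": state != "COMPLETE",
    }
```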
6.3.1. Status Codes
The status codes are classified under the Success(2XX) codes and the
Failure(4XX) codes.
6.3.1.1. Success 2xx
200 Success
201 Success with some optional parameters ignored.
6.3.1.2. Failure 4xx
401 Method not allowed
402 Method not valid in this state
403 Unsupported Parameter
404 Illegal Value for Parameter
405 Not found (e.g. Resource URI not initialized
    or doesn't exist)
406 Mandatory Parameter Missing
407 Method or Operation Failed (e.g. Grammar compilation
    failed in the recognizer. Detailed cause codes may be
    available through a resource specific header field.)
408 Unrecognized or unsupported message entity
421-499 Resource specific Failure codes
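For illustration, the status code table can be captured as a lookup. The names STATUS_TEXT, describe_status and is_success are hypothetical; the codes and descriptions are taken from the lists above.

```python
STATUS_TEXT = {
    200: "Success",
    201: "Success with some optional parameters ignored",
    401: "Method not allowed",
    402: "Method not valid in this state",
    403: "Unsupported Parameter",
    404: "Illegal Value for Parameter",
    405: "Not found",
    406: "Mandatory Parameter Missing",
    407: "Method or Operation Failed",
    408: "Unrecognized or unsupported message entity",
}


def is_success(code: int) -> bool:
    """2xx codes indicate success; 4xx codes indicate failure."""
    return 200 <= code <= 299


def describe_status(code: int) -> str:
    """Map a status code to its description, including the resource specific range."""
    if code in STATUS_TEXT:
        return STATUS_TEXT[code]
    if 421 <= code <= 499:
        return "Resource specific failure code"
    raise ValueError(f"status code {code} is not defined in this table")
```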
6.4. Event
The server resource may need to communicate a change in state or the
occurrence of a certain event to the client. These messages are used
when a request does not complete immediately and the response
returns a status of PENDING or IN-PROGRESS. The intermediate results
and events of the request are indicated to the client through the
event message from the server. Events carry the request-id of the
request that is in progress and generating them, together with a
status value. The status value is COMPLETE if the request is done
and this was the last event; otherwise it is IN-PROGRESS.
event-line = event-name SP request-id SP request-state SP
mrcp-version CRLF
The mrcp-version used here is identical to the one used in the
Request/Response Line and indicates the version of MRCP protocol
running on the server.
The request-id used in the event should match the one sent in the
request that caused this event.
The request-state indicates if the Request/Command causing this
event is complete or still in progress, and is the same as the one
described in section 6.3. The final event will contain a COMPLETE
status indicating the completion of the request.
The event-name identifies the nature of the event generated by the
media resource. The set of valid event names is dependent on the
resource generating it, and will be addressed in later sections.
event-name = synthesizer-event
| recognizer-event
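One way a client could follow the lifetime of a request across its response and events is sketched below. The class RequestTracker and its method names are hypothetical; the sketch only encodes the rule that a request stays active until a response or event reports COMPLETE.

```python
class RequestTracker:
    """Track which request-ids may still produce events."""

    def __init__(self):
        self.active = set()

    def on_response(self, request_id: str, request_state: str) -> None:
        """Record the request-state carried in a response line."""
        if request_state in ("PENDING", "IN-PROGRESS"):
            self.active.add(request_id)
        else:
            self.active.discard(request_id)

    def on_event_line(self, line: str) -> str:
        """Handle: event-name SP request-id SP request-state SP mrcp-version."""
        event_name, request_id, request_state, _version = line.strip().split(" ")
        if request_id not in self.active:
            raise KeyError(f"event for unknown request-id {request_id}")
        if request_state == "COMPLETE":
            # This was the last event for the request.
            self.active.discard(request_id)
        return event_name
```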
6.5. Generic Headers
    generic-header   =    active-request-id-list
                     |    proxy-sync-id
                     |    speak-restart
                     |    content-id
                     |    content-type
                     |    content-length
                     |    content-base
                     |    content-location
                     |    content-encoding
                     |    cache-control
                     |    logging-tag
All headers in MRCP are case-insensitive, consistent with the HTTP
and RTSP protocol header definitions.
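A receiver can honour this by normalizing header names on storage and lookup, as in the following sketch. The Headers class is a hypothetical helper, not something MRCP defines.

```python
class Headers:
    """Header collection whose names compare without regard to case."""

    def __init__(self):
        self._items = {}

    def __setitem__(self, name: str, value: str) -> None:
        # Keep the original spelling for output, index by the lowercase name.
        self._items[name.lower()] = (name, value)

    def __getitem__(self, name: str) -> str:
        return self._items[name.lower()][1]

    def __contains__(self, name: str) -> bool:
        return name.lower() in self._items


h = Headers()
h["Content-Length"] = "104"
```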
6.5.1. Active-Request-Id-List
In a request, this field indicates the list of request-ids that it
should apply to. This is useful when there are multiple Requests
that are PENDING or IN-PROGRESS and you want this request to apply
to one or more of these specifically.
In a response, this field returns the list of request-ids that the
operation modified or were in progress or just completed. There
could be one or more requests that returned a request-state of
PENDING or IN-PROGRESS. When a method affecting one or more PENDING
or IN-PROGRESS requests is sent from the client to the server, the
response MUST contain the list of request-ids that were affected in
this header field.
The active-request-id-list is only used in requests and responses,
not in events.
For example, if a STOP request with no active-request-id-list is sent
to a synthesizer resource (a wildcard STOP) that has one or more SPEAK
requests in the PENDING or IN-PROGRESS state, all SPEAK requests MUST
be cancelled, including the one IN-PROGRESS, and the response to the
STOP request would contain the request-ids of all the SPEAK requests
that were terminated in the active-request-id-list. In this case, no
SPEAK-COMPLETE or RECOGNITION-COMPLETE events will be sent for these
terminated requests.
active-request-id-list = "Active-Request-Id-List" ":"
request-id *("," request-id) CRLF
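The header grammar above can be read and written with a pair of small helpers. This is a sketch; the function names are hypothetical, and tolerating whitespace around the commas when parsing is an assumption rather than something the grammar states.

```python
def format_active_request_id_list(request_ids) -> str:
    """Render the Active-Request-Id-List header line for one or more request-ids."""
    ids = [str(r) for r in request_ids]
    if not ids:
        raise ValueError("the list needs at least one request-id")
    return "Active-Request-Id-List: " + ",".join(ids) + "\r\n"


def parse_active_request_id_list(value: str) -> list:
    """Parse the header value into a list of request-id strings."""
    ids = [part.strip() for part in value.split(",")]
    if not all(p.isascii() and p.isdigit() for p in ids):
        raise ValueError(f"malformed request-id list: {value!r}")
    return ids
```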
6.5.2. Proxy-Sync-Id
When any server resource generates a barge-in-able event, it will
generate a unique Tag and send it as a header field in an event to
the client. The client then acts as a proxy to the server resource
and sends a BARGE-IN-OCCURRED method to the Synthesizer server
resource with the Proxy-Sync-Id it received from the server
resource. When the recognizer and synthesizer resources are part of
the same session, they may choose to work together to achieve
quicker interaction and response. Here the proxy-sync-id helps the
resource receiving the event, proxied by the client, to decide if
this event has been processed through a direct interaction of the
resources.
proxy-sync-id = "Proxy-Sync-Id" ":" 1*ALPHA CRLF
6.5.3. Content-Type
See [H14.17]. Note that the content types suitable for MRCP are
restricted to speech markup, grammar, recognition results etc. and
are specified later in this document.
6.5.4. Content-Id
This field contains an ID or name for the content, by which it can
be referred to. The definition of this field is available in RFC
2111 and is needed in multi-part messages. In MRCP whenever the
content needs to be stored, by either the client or the server, it
is stored associated with this ID. Such content can be referenced
during the session in URI form using the session: URI scheme
described in a later section.
6.5.5. Content-Base
The content-base entity-header field may be used to specify the base
URI for resolving relative URLs within the entity.
content-base = "Content-Base" ":" absoluteURI
Note, however, that the base URI of the contents within the entity-
body may be redefined within that entity-body. An example of this
would be a multi-part MIME entity, which in turn can have multiple
entities within it.
6.5.6. Content-Encoding
The content-encoding entity-header field is used as a modifier to
the media-type. When present, its value indicates what additional
content codings have been applied to the entity-body, and thus what
decoding mechanisms must be applied in order to obtain the media-
type referenced by the content-type header field. Content-encoding
is primarily used to allow a document to be compressed without
losing the identity of its underlying media type.
content-encoding = "Content-Encoding" ":" 1#content-coding
Content codings are defined in [H3.5]. An example of its use is
Content-Encoding: gzip
If multiple encodings have been applied to an entity, the content
codings MUST be listed in the order in which they were applied.
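As an illustration of the ordering rule, a receiver can undo the codings by walking the listed values in reverse. The following is only a sketch: the name decode_body and the table of supported codings are assumptions made for this example and are not defined by MRCP.

```python
import gzip
import zlib

# Hypothetical decoder table; a real media server may support other codings.
DECODERS = {
    "gzip": gzip.decompress,
    "deflate": zlib.decompress,
    "identity": lambda data: data,
}


def decode_body(content_encoding: str, body: bytes) -> bytes:
    """Undo the content codings named in a Content-Encoding value.

    Codings are listed in the order in which they were applied, so they
    are removed in the reverse order.
    """
    codings = [c.strip().lower() for c in content_encoding.split(",") if c.strip()]
    for coding in reversed(codings):
        try:
            decoder = DECODERS[coding]
        except KeyError:
            raise ValueError(f"unsupported content coding: {coding}") from None
        body = decoder(body)
    return body
```

For a body that was first deflated and then gzipped, the header value would read "deflate, gzip" and the sketch removes gzip first.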
6.5.7. Content-Location
The content-location entity-header field MAY BE used to supply the
resource location for the entity enclosed in the message when that
entity is accessible from a location separate from the requested
resource's URI.
content-location = "Content-Location" ":"
( absoluteURI | relativeURI )
The content-location value is a statement of the location of the
resource corresponding to this particular entity at the time of the
request. The media server MAY use this header field to optimize
certain operations. When this header field is provided, the entity
being sent should not have been modified from what was retrieved
from the content-location URI.
For example, if the client provided a grammar markup inline, and it
had previously retrieved it from a certain URI, that URI can be
provided as part of the entity, using the content-location header
field. This allows a resource like the recognizer to look into its
cache to see if this grammar was previously retrieved, compiled and
cached. If so, it might optimize by using the previously
compiled grammar object.
If the content-location is a relative URI, the relative URI is
interpreted relative to the content-base URI.
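The resolution of a relative content-location against the content-base can be sketched with the Python standard library. The function name and the example URIs used with it are hypothetical and serve only to illustrate the rule.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_content_location(content_location: str,
                             content_base: Optional[str] = None) -> str:
    """Return the URI named by Content-Location.

    A relative Content-Location is resolved against Content-Base; an
    absolute one is returned unchanged.
    """
    if content_base is None:
        return content_location
    return urljoin(content_base, content_location)
```

A recognizer could use the resolved URI as the key when it looks up a previously compiled grammar in its cache.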
6.5.8. Content-Length
This field contains the length of the content of the message body
(i.e. after the double CRLF following the last header field).
Unlike HTTP, it MUST be included in all messages that carry content
beyond the header portion of the message. If it is missing, a
default value of zero is assumed. It is interpreted according to
[H14.13].
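A receiver might apply these rules as in the following sketch, which splits a message at the double CRLF and treats a missing Content-Length as zero. The function name split_message and the simplified header handling (no folded header lines) are assumptions for illustration.

```python
from typing import Dict, Tuple


def split_message(raw: bytes) -> Tuple[str, Dict[str, str], bytes]:
    """Split a message into (start_line, headers, body)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("ascii").split("\r\n")
    start_line = lines[0]
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # A missing Content-Length is treated as a length of zero.
    length = int(headers.get("content-length", "0"))
    if len(rest) < length:
        raise ValueError("message body is shorter than Content-Length")
    return start_line, headers, rest[:length]
```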
6.5.9. Cache-Control
If the media server plans on implementing caching it MUST adhere to
the cache correctness rules of HTTP 1.1 (RFC2616). In particular,
the expires and cache-control headers must be honored. The cache-
control directives are used to define the default caching algorithms
on the media server for the session or request. The scope of the
directive is based on the method it is sent on. If the directives
are sent on a SET-PARAMS method, they SHOULD apply to all requests
for documents the media server may make in that session. If the
directives are sent on any other message, they MUST apply only to
document requests the media server needs to make for that method. An
empty cache-control header on the GET-PARAMS method is a request for
the media server to return the current cache-control directives
setting on the server.
cache-control = "Cache-Control" ":" 1#cache-directive
cache-directive = "max-age" "=" delta-seconds
| "max-stale" "=" delta-seconds
| "min-fresh" "=" delta-seconds
delta-seconds = 1*DIGIT
Here delta-seconds is a time value to be specified as an integer
number of seconds, represented in decimal, after the time that the
message response or data was received by the media server.
These directives allow the media server to override the basic
expiration mechanism.
max-age
Indicates that the client is willing to accept the media server
using a response whose age is no greater than the specified time in
seconds.
Unless a max-stale directive is also included, the client is not
willing to accept the media server using a stale response.
min-fresh
Indicates that the client is willing to accept the media
server using a response whose freshness lifetime is no less than its
current age plus the specified time in seconds. That is, the client
wants the media server to use a response that will still be fresh
for at least the specified number of seconds.
max-stale
Indicates that the client is willing to accept the media
server using a response that has exceeded its expiration time. If
max-stale is assigned a value, then the client is willing to accept
the media server using a response that has exceeded its expiration
time by no more than the specified number of seconds. If no value is
assigned to max-stale, then the client is willing to accept the
media server using a stale response of any age.
The media server cache MAY BE requested to use stale response/data
without validation, but only if this does not conflict with any
"MUST"-level requirements concerning cache validation (e.g., a
"must-revalidate" cache-control directive) in the HTTP 1.1
specification pertaining to the URI.
If both the MRCP cache-control directive and the cached entry on the
media server include "max-age" directives, then the lesser of the
two values is used for determining the freshness of the cached entry
for that request.
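The freshness rules above can be sketched as follows. This is a simplified illustration, not a complete HTTP 1.1 cache: the function names are hypothetical, every directive is required to carry a value as in the grammar above, and a response is accepted either when it satisfies min-fresh or when it is stale by no more than max-stale.

```python
from typing import Dict

_KNOWN = ("max-age", "max-stale", "min-fresh")


def parse_cache_control(value: str) -> Dict[str, int]:
    """Parse a value such as 'max-age=60, max-stale=10' into a dict."""
    directives: Dict[str, int] = {}
    for item in value.split(","):
        name, sep, seconds = item.strip().partition("=")
        name, seconds = name.strip(), seconds.strip()
        if name not in _KNOWN or not sep:
            raise ValueError(f"unsupported cache directive: {item.strip()!r}")
        if not (seconds.isascii() and seconds.isdigit()):
            raise ValueError(f"delta-seconds must be decimal digits: {seconds!r}")
        directives[name] = int(seconds)
    return directives


def may_use_cached(age: int, entry_max_age: int, directives: Dict[str, int]) -> bool:
    """Decide whether a cached entry of the given age (seconds) may be used."""
    # The lesser of the request max-age and the cached entry max-age applies.
    lifetime = min(entry_max_age, directives.get("max-age", entry_max_age))
    if age + directives.get("min-fresh", 0) <= lifetime:
        return True
    # A stale entry is acceptable only within the max-stale allowance.
    return "max-stale" in directives and age <= lifetime + directives["max-stale"]
```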
6.5.10. Logging-Tag
This header field MAY BE sent as part of a SET-PARAMS/GET-PARAMS
method to set the logging tag for logs generated by the media
server. Once set, the value persists until a new value is set or the
session is ended. The MRCP server should provide a mechanism to
subset its output logs so that system administrators can examine or
extract only the log file portion during which the logging tag was
set to a certain value.
MRCP clients using this feature should take care to ensure that no
two clients specify the same logging tag. In the event that two
clients specify the same logging tag, the effect on the MRCP
server's output logs is undefined.
logging-tag = "Logging-Tag" ":" 1*ALPHA CRLF
7. Media Server
The capability of media server resources can be found using the RTSP
DESCRIBE mechanism. When a client issues an RTSP DESCRIBE method for
a media resource URI, the media server response MUST contain an SDP
description in its body describing the capabilities of the media
server resource. The SDP description MUST contain at a minimum the
media header (m-line) describing the codec and other media related
features it supports. It MAY contain other SDP headers as well, but
support for them is optional.
The usage of SDP messages in the RTSP message body and its
application follows the SIP RFC 2543 but is limited to media related
negotiation and description.
7.1. Media Server Session
As discussed in Section 3.2, a client/server should share one RTSP
session-id for the different resources it may use under the same
session. The client MUST allocate a set of client RTP/RTCP ports for
a new session and MUST NOT send a Session-ID in the SETUP message
for the first resource. The server then creates a Session-ID and
allocates a set of server RTP/RTCP ports and responds to the SETUP
message.
If the client wants to open more resources with the same server
under the same session, it will send the session-id it got in the
earlier SETUP response, in the SETUP for the new resource. A setup
with an existing session-id tells the server that this new resource
will feed from/into the same RTP/RTCP stream of that existing
session.
If the client wants to open a resource from a media server different
from where the first resource came from, it will send separate SETUP
requests with no session-id header field in them. Each server will
allocate its own session-id and return it in the response. Each of
them will also come back with their own set of RTP/RTCP ports. This
would be the case when the Synthesizer engine and the recognition
engine are on different servers.
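The session-id rules above could be tracked on the client side roughly as follows. This is a sketch only: the class SessionTable and its method names are hypothetical, and it assumes one session per media server.

```python
from typing import Dict


class SessionTable:
    """Remembers the RTSP session-id handed out by each media server."""

    def __init__(self) -> None:
        self._by_server: Dict[str, str] = {}

    def setup_headers(self, server: str, cseq: int, client_port: int) -> Dict[str, str]:
        """Build the header fields for a SETUP sent to the given server."""
        headers = {
            "CSeq": str(cseq),
            "Transport": f"RTP/AVP;unicast;client_port={client_port}-{client_port + 1}",
        }
        session = self._by_server.get(server)
        if session is not None:
            # Later resources on the same server reuse the existing session.
            headers["Session"] = session
        return headers

    def record_response(self, server: str, session_id: str) -> None:
        """Store the Session value returned in a SETUP response."""
        self._by_server[server] = session_id
```

The first SETUP to a server carries no Session header field; once the response is recorded, a SETUP for another resource on that server carries the same session-id, while a SETUP to a different server again carries none.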
The RTSP SETUP method SHOULD contain an SDP description of the media
stream being set up. The RTSP SETUP response MUST contain an SDP
description of the media stream that it expects to receive and send
on that session.
The SDP description in the SETUP method from the client SHOULD
describe the required media parameters like codec, NSE payload types
etc. This could have multiple media headers (i.e., m-lines) to allow
the client to provide the media server with more than one option to
choose from.
The SDP description in the SETUP response should reflect the media
parameters that the media server will be using for the stream. It
should be within the choices that were specified in the SDP of the
SETUP method if one was provided.
Example:
C->S:
SETUP rtsp://media.server.com/recognizer/ RTSP/1.0
CSeq: 1
Transport: RTP/AVP;unicast;client_port=46456-46457
Content-Type: application/sdp
Content-Length: 190
v=0
o=- 123 456 IN IP4 10.0.0.1
s=Media Server
p=+1-888-555-1212
c=IN IP4 0.0.0.0
t=0 0
m=audio 46456 RTP/AVP 0 96
a=rtpmap:0 pcmu/8000
a=rtpmap:96 telephone-event/8000
a=fmtp:96 0-15
S->C:
RTSP/1.0 200 OK
CSeq: 1
Session: 0a030258_00003815_3bc4873a_0001_0000
Transport: RTP/AVP;unicast;client_port=46456-46457;
server_port=46460-46461
Content-Length: 190
Content-Type: application/sdp
v=0
o=- 3211724219 3211724219 IN IP4 10.3.2.88
s=Media Server
c=IN IP4 0.0.0.0
t=0 0
m=audio 46460 RTP/AVP 0 96
a=rtpmap:0 pcmu/8000
a=rtpmap:96 telephone-event/8000
a=fmtp:96 0-15
If an SDP description was not provided in the RTSP SETUP method,
then the media server may decide on parameters of the stream but
MUST specify what it chooses in the SETUP response. An SDP
announcement is only returned in response to a SETUP which does not
specify a Session, i.e. the server will not return an SDP
announcement for the synthesizer SETUP of a session already
established with a recognizer.
C->S:
SETUP rtsp://media.server.com/recognizer/ RTSP/1.0
CSeq: 1
Transport: RTP/AVP;unicast;client_port=46498
S->C:
RTSP/1.0 200 OK
CSeq: 1
Session: 0a030258_000039dc_3bc48a13_0001_0000
Transport: RTP/AVP;unicast; client_port=46498;
server_port=46502-46503
Content-Length: 193
Content-Type: application/sdp
v=0
o=- 3211724947 3211724947 IN IP4 10.3.2.88
s=Media Server
c=IN IP4 0.0.0.0
t=0 0
m=audio 46502 RTP/AVP 0 101
a=rtpmap:0 pcmu/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-15
8. Speech Synthesizer Resource
This resource is capable of converting text provided by the client
and generating a speech stream in real-time. Depending on the
implementation and capability of this resource, the client can
control parameters like voice characteristics, speaker speed, etc.
The synthesizer resource is controlled by MRCP requests from the
client. Similarly the resource can respond to these requests or
generate asynchronous events to the client to indicate certain
conditions during the processing of the stream.
8.1. Synthesizer State Machine
The synthesizer maintains states as it needs to correlate MRCP
requests from the client. The state transitions shown below describe
the states of the synthesizer and reflect the request at the head of
the queue. A SPEAK request in the PENDING state can be deleted or
stopped by a STOP request and does not affect the state of the
resource.
Idle                   Speaking                    Paused
State                   State                      State
  |                       |                          |
  |----------SPEAK------->|                 |--------|
  |<------STOP------------|             CONTROL      |
  |<----SPEAK-COMPLETE----|                 |------->|
  |<----BARGE-IN-OCCURRED-|                          |
  |              |--------|                          |
  |          CONTROL      |-----------PAUSE--------->|
  |              |------->|<----------RESUME---------|
  |                       |               |----------|
  |                       |              PAUSE       |
  |                       |               |--------->|
  |              |--------|----------|               |
  |     BARGE-IN-OCCURRED |      SPEECH-MARKER       |
  |              |------->|<---------|               |
  |----------|            |             |------------|
  |         STOP          |            SPEAK         |
  |          |            |             |----------->|
  |<---------|                                       |
  |<-------------------STOP--------------------------|
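The transitions drawn in the diagram can also be written as a lookup table. The sketch below is one reading of the diagram, not a normative definition: the lowercase state names and the function next_states are assumptions, and BARGE-IN-OCCURRED in the speaking state maps to two possible states because the diagram shows both an arrow to idle and a self-loop.

```python
from typing import Dict, FrozenSet, Tuple

# (current state, method or event) -> states the diagram allows next.
TRANSITIONS: Dict[Tuple[str, str], FrozenSet[str]] = {
    ("idle", "SPEAK"): frozenset({"speaking"}),
    ("idle", "STOP"): frozenset({"idle"}),
    ("speaking", "STOP"): frozenset({"idle"}),
    ("speaking", "SPEAK-COMPLETE"): frozenset({"idle"}),
    ("speaking", "BARGE-IN-OCCURRED"): frozenset({"idle", "speaking"}),
    ("speaking", "CONTROL"): frozenset({"speaking"}),
    ("speaking", "SPEECH-MARKER"): frozenset({"speaking"}),
    ("speaking", "PAUSE"): frozenset({"paused"}),
    ("paused", "CONTROL"): frozenset({"paused"}),
    ("paused", "PAUSE"): frozenset({"paused"}),
    ("paused", "SPEAK"): frozenset({"paused"}),
    ("paused", "RESUME"): frozenset({"speaking"}),
    ("paused", "STOP"): frozenset({"idle"}),
}


def next_states(state: str, trigger: str) -> FrozenSet[str]:
    """Return the states reachable from `state` on `trigger` (empty if none)."""
    return TRANSITIONS.get((state, trigger), frozenset())
```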
8.2. Synthesizer Methods
The synthesizer supports the following methods.
synthesizer-method = "SET-PARAMS"
| "GET-PARAMS"
| "SPEAK"
| "STOP"
| "PAUSE"
| "RESUME"
| "BARGE-IN-OCCURRED"
| "CONTROL"
8.3. Synthesizer Events
The synthesizer may generate the following events.
synthesizer-event = "SPEECH-MARKER"
| "SPEAK-COMPLETE"
8.4. Synthesizer Header Fields
A synthesizer message may contain header fields carrying request
options and information to augment the Request, Response or Event
message with which they are associated.
synthesizer-header = jump-target ; Section 8.4.1
| kill-on-barge-in ; Section 8.4.2
| speaker-profile ; Section 8.4.3
| completion-cause ; Section 8.4.4
| voice-parameter ; Section 8.4.5
| prosody-parameter ; Section 8.4.6
| vendor-specific ; Section 8.4.7
| speech-marker ; Section 8.4.8
| speech-language ; Section 8.4.9
| fetch-hint ; Section 8.4.10
| audio-fetch-hint ; Section 8.4.11
| fetch-timeout ; Section 8.4.12
| failed-uri ; Section 8.4.13
| failed-uri-cause ; Section 8.4.14
| speak-restart ; Section 8.4.15
| speak-length ; Section 8.4.16
Parameter           Support     Methods/Events/Response
jump-target         MANDATORY   SPEAK, CONTROL
logging-tag         MANDATORY   SET-PARAMS, GET-PARAMS
kill-on-barge-in    MANDATORY   SPEAK
speaker-profile     OPTIONAL    SET-PARAMS, GET-PARAMS,
                                SPEAK, CONTROL
completion-cause    MANDATORY   SPEAK-COMPLETE
voice-parameter     MANDATORY   SET-PARAMS, GET-PARAMS,
                                SPEAK, CONTROL
prosody-parameter   MANDATORY   SET-PARAMS, GET-PARAMS,
                                SPEAK, CONTROL
vendor-specific     MANDATORY   SET-PARAMS, GET-PARAMS
speech-marker       MANDATORY   SPEECH-MARKER
speech-language     MANDATORY   SET-PARAMS, GET-PARAMS, SPEAK
fetch-hint          MANDATORY   SET-PARAMS, GET-PARAMS, SPEAK
audio-fetch-hint    MANDATORY   SET-PARAMS, GET-PARAMS, SPEAK
fetch-timeout       MANDATORY   SET-PARAMS, GET-PARAMS, SPEAK
failed-uri          MANDATORY   Any
failed-uri-cause    MANDATORY   Any
speak-restart       MANDATORY   CONTROL
speak-length        MANDATORY   SPEAK, CONTROL
8.4.1. Jump-Target
This parameter MAY BE specified in a CONTROL method and controls the
jump size to move forward or rewind backward on an active SPEAK
request. A + or - indicates a relative value to what is being
currently played. This MAY BE specified in a SPEAK request to
indicate an offset into the speech markup that the SPEAK request
should start speaking from. The different speech length units
supported are dependent on the synthesizer implementation. If it
does not support a unit or the operation the resource SHOULD respond
with a status code of 404 "Illegal or Unsupported value for
parameter".
jump-target = "Jump-Size" ":" speech-length-value CRLF
speech-length-value = numeric-speech-length
| text-speech-length
text-speech-length = 1*ALPHA SP "Tag"
numeric-speech-length= ("+" | "-") 1*DIGIT SP
numeric-speech-unit
numeric-speech-unit = "Second"
| "Word"
| "Sentence"
| "Paragraph"
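A parser for the speech-length-value grammar above might look like the following sketch. The names SpeechLength and parse_speech_length are hypothetical; the sketch follows the grammar literally, so a numeric value without a leading + or - is rejected.

```python
import re
from typing import NamedTuple, Optional

_NUMERIC = re.compile(r"([+-])([0-9]+) (Second|Word|Sentence|Paragraph)")
_TAG = re.compile(r"([A-Za-z]+) Tag")


class SpeechLength(NamedTuple):
    unit: str               # "Second", "Word", "Sentence", "Paragraph" or "Tag"
    amount: Optional[int]   # signed count for numeric units, else None
    tag: Optional[str]      # marker name for the "Tag" unit, else None


def parse_speech_length(value: str) -> SpeechLength:
    """Parse a speech-length-value such as '+15 Word' or 'intro Tag'."""
    m = _NUMERIC.fullmatch(value)
    if m:
        sign, digits, unit = m.groups()
        amount = int(digits)
        return SpeechLength(unit, -amount if sign == "-" else amount, None)
    m = _TAG.fullmatch(value)
    if m:
        return SpeechLength("Tag", None, m.group(1))
    raise ValueError(f"not a valid speech-length-value: {value!r}")
```

A resource that does not support the parsed unit would then answer with the 404 status code described above.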
8.4.2. Kill-On-Barge-In
This parameter MAY BE sent as part of the SPEAK method to enable
kill-on-barge-in support. If enabled, the SPEAK method is
interrupted by DTMF input detected by a Signal Detector resource or
by the start of speech sensed or recognized by the Speech Recognizer
resource.
kill-on-barge-in = "Kill-On-Barge-In" ":" boolean-value CRLF
boolean-value = "true" | "false"
If the recognizer or signal detector resource is on the same server
as the synthesizer, the server should be intelligent enough to
recognize their interactions by their common RTSP session-id and
work with each other to provide kill-on-barge-in support.
The client needs to send a BARGE-IN-OCCURRED method to the
synthesizer resource when it receives a barge-in-able event from
the recognizer resource or signal detector resource. These
resources MAY BE local or distributed. If this field is not
specified, the value defaults to "true".
8.4.3. Speaker Profile
This parameter MAY BE part of the SET-PARAMS/GET-PARAMS or SPEAK
request from the client to the server and specifies the profile of
the speaker by a uri, which may be a set of voice parameters like
gender, accent etc.
speaker-profile = "Speaker-Profile" ":" uri CRLF
8.4.4. Completion Cause
This header field MUST be specified in a SPEAK-COMPLETE event coming
from the synthesizer resource to the client. This indicates the
reason behind the SPEAK request completion.
completion-cause = "Completion-Cause" ":" 1*DIGIT SP
1*ALPHA CRLF
Cause-Code   Cause-Name      Description
000          normal          SPEAK completed normally.
001          barge-in        SPEAK request was terminated because
                             of barge-in.
002          parse-failure   SPEAK request terminated because of a
                             failure to parse the speech markup text.
003          uri-failure     SPEAK request terminated because access
                             to one of the URIs failed.
004          error           SPEAK request terminated prematurely due
                             to synthesizer error.
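A client could decode the Completion-Cause value along these lines. The function name and the dictionary are assumptions for illustration; the cause names are checked against the table rather than against the 1*ALPHA rule, since names such as barge-in contain a hyphen.

```python
from typing import Tuple

# Cause codes and names as listed in the table above.
COMPLETION_CAUSES = {
    0: "normal",
    1: "barge-in",
    2: "parse-failure",
    3: "uri-failure",
    4: "error",
}


def parse_completion_cause(value: str) -> Tuple[int, str]:
    """Parse a value such as '001 barge-in' into (code, name)."""
    code_text, _, name = value.strip().partition(" ")
    if not (code_text.isascii() and code_text.isdigit()):
        raise ValueError(f"cause code must be decimal digits: {code_text!r}")
    code = int(code_text)
    expected = COMPLETION_CAUSES.get(code)
    if expected is None:
        raise ValueError(f"unknown cause code: {code_text}")
    if name != expected:
        raise ValueError(f"cause name {name!r} does not match code {code_text}")
    return code, name
```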
8.4.5. Voice-Parameters
This set of parameters defines the voice of the speaker.
voice-parameter = "Voice-" voice-param-name ":"
voice-param-value CRLF
voice-param-name is any one of the attribute names under the voice
element specified in W3C's Speech Synthesis Markup Language
Specification, W3C Working Draft, 3 January 2001. The voice-param-
value is any one of the value choices of the corresponding voice
element attribute specified in the above section.
These header fields MAY BE sent in SET-PARAMS/GET-PARAMS request to
define/get default values for the entire session or MAY BE sent in
the SPEAK request to define default values for that speak request.
Furthermore these attributes can be part of the speech text marked
up in SML.
These voice parameter header fields can also be sent in a CONTROL
method to affect a SPEAK request in progress and change its behavior
on the fly. If the synthesizer resource does not support this
operation, it should respond back to the client with a status of
unsupported.
8.4.6. Prosody-Parameters
This set of parameters defines the prosody of the speech.
prosody-parameter = "Prosody-" prosody-param-name ":"
prosody-param-value CRLF
prosody-param-name is any one of the attribute names under the
prosody element specified in W3C's Speech Synthesis Markup Language
Specification, W3C Working Draft, 3 January 2001. The prosody-param-
value is any one of the value choices of the corresponding prosody
element attribute specified in the above section.
These header fields MAY BE sent in SET-PARAMS/GET-PARAMS request to
define/get default values for the entire session or MAY BE sent in
the SPEAK request to define default values for that speak request.
Furthermore these attributes can be part of the speech text marked
up in SML.
The prosody parameter header fields in the SET-PARAMS or SPEAK
request only apply if the speech data is of type text/plain and does
not use a speech markup format.
These prosody parameter header fields MAY also be sent in a CONTROL
method to affect a SPEAK request in progress and change its behavior
on the fly. If the synthesizer resource does not support this
operation, it should respond back to the client with a status of
unsupported.
8.4.7. Vendor Specific Parameters
This set of headers allows for the client to set Vendor Specific
parameters.
vendor-specific = "Vendor-Specific-Parameters" ":"
vendor-specific-av-pair
*[";" vendor-specific-av-pair] CRLF
vendor-specific-av-pair = vendor-av-pair-name "="
vendor-av-pair-value
This header MAY BE sent in the SET-PARAMS/GET-PARAMS method and is
used to set vendor-specific parameters on the server side. The
vendor-av-pair-name can be any Vendor specific field name and
conforms to the XML vendor-specific attribute naming convention. The
vendor-av-pair-value is the value to set the attribute to and needs
to be quoted.
When asking the server for the current value of these parameters,
this header can be sent in the GET-PARAMS method with a
semicolon-separated list of the vendor-specific attribute names to
get.
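Both forms of the header value, name="value" pairs for SET-PARAMS and bare names for GET-PARAMS, can be read with a small parser such as the sketch below. The function name is hypothetical, and the sketch assumes that quoted values contain no embedded double quotes.

```python
import re
from typing import Dict, Optional

# One attribute name, an optional ="quoted value", then ";" or end of input.
_PAIR = re.compile(r'\s*([^=;"\s]+)\s*(?:=\s*"([^"]*)")?\s*(?:;|$)')


def parse_vendor_specific(value: str) -> Dict[str, Optional[str]]:
    """Parse a Vendor-Specific-Parameters value into {name: value or None}."""
    params: Dict[str, Optional[str]] = {}
    pos = 0
    while pos < len(value):
        m = _PAIR.match(value, pos)
        if m is None:
            raise ValueError(f"malformed vendor-specific parameter at offset {pos}")
        params[m.group(1)] = m.group(2)  # None when only a name was given
        pos = m.end()
    return params
```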
8.4.8. Speech Marker
This header field contains a marker tag that may be embedded in the
speech data. Most speech markup formats provide mechanisms to embed
marker fields between speech texts. The synthesizer will generate
SPEECH-MARKER events when it reaches these marker fields. This field
SHOULD be part of the SPEECH-MARKER event and will contain the
marker tag values.
speech-marker = "Speech-Marker" ":" 1*ALPHA CRLF
8.4.9. Speech Language
This header field specifies the default language of the speech data
if it is not specified in it. The value of this header field should
follow RFC 1766 for its values. This MAY occur in SPEAK, SET-PARAMS
or GET-PARAMS request.
speech-language = "Speech-Language" ":" 1*ALPHA CRLF
8.4.10. Fetch Hint
When the synthesizer needs to fetch documents or other resources
like speech markup or audio files, etc., this header field controls
URI access properties. This defines when the synthesizer should
retrieve content from the server. A value of "prefetch" indicates a
file may be downloaded when the request is received, whereas "safe"
indicates a file that should only be downloaded when actually
needed. The default value is "prefetch". This header field MAY occur
in SPEAK, SET-PARAMS or GET-PARAMS requests.
fetch-hint = "Fetch-Hint" ":" 1*ALPHA CRLF
8.4.11. Audio Fetch Hint
When the synthesizer needs to fetch documents or other resources
like speech audio files, etc., this header field controls URI access
properties. This defines whether or not the synthesizer can attempt
to optimize speech by pre-fetching audio. The value is either "safe"
to say that audio is only fetched when it is needed, never before;
"prefetch" to permit, but not require the platform to pre-fetch the
audio; or "stream" to allow it to stream the audio fetches. The
default value is "prefetch". This header field MAY occur in SPEAK,
SET-PARAMS or GET-PARAMS requests.
audio-fetch-hint = "Audio-Fetch-Hint" ":" 1*ALPHA CRLF
8.4.12. Fetch Timeout
When the synthesizer needs to fetch documents or other resources
like speech audio files, etc., this header field controls URI access
properties. This defines the synthesizer timeout for resources the
media server may need to fetch from the network. This is specified
in milliseconds. The default value is platform-dependent. This
header field MAY occur in SPEAK, SET-PARAMS or GET-PARAMS.
fetch-timeout = "Fetch-Timeout" ":" 1*DIGIT CRLF
8.4.13. Failed URI
When a synthesizer method needs a synthesizer to fetch or access a
URI and the access fails the media server SHOULD provide the failed
URI in this header field in the method response.
failed-uri = "Failed-URI" ":" Url CRLF
8.4.14. Failed URI Cause
When a synthesizer method needs a synthesizer to fetch or access a
URI and the access fails the media server SHOULD provide the URI
specific or protocol specific response code through this header
field in the method response. This field has been defined as
alphanumeric to accommodate all protocols, some of which might have
a response string instead of a numeric response code.
failed-uri-cause = "Failed-URI-Cause" ":" 1*ALPHA CRLF
8.4.15. Speak Restart
When a CONTROL jump backward request is issued to a currently
speaking synthesizer resource and the jump goes beyond the start of
the speech, the current SPEAK request restarts from the beginning of
its speech data and the response to the CONTROL request contains
this header indicating a restart. This header MAY occur in
the CONTROL response.
speak-restart = "Speak-Restart" ":" boolean-value CRLF
8.4.16. Speak Length
This parameter MAY BE specified in a CONTROL method to control the
length of speech to speak, relative to the current speaking point in
the currently active SPEAK request. A - value is illegal in this
field. If a field with a Tag unit is specified, then the media
server must speak until the tag is reached or the SPEAK request
completes, whichever comes first. This MAY BE specified in a SPEAK
request to
indicate the length to speak in the speech data and is relative to
the point in speech the SPEAK request starts. The different speech
length units supported are dependent on the synthesizer
implementation. If it does not support a unit or the operation the
resource SHOULD respond with a status code of 404 "Illegal or
Unsupported value for parameter".
speak-length = "Speak-Length" ":" speech-length-value
CRLF
speech-length-value = numeric-speech-length
| text-speech-length
text-speech-length = 1*ALPHA SP "Tag"
numeric-speech-length= ("+" | "-") 1*DIGIT SP
numeric-speech-unit
numeric-speech-unit = "Second"
| "Word"
| "Sentence"
| "Paragraph"
8.5. Synthesizer Message Body
A synthesizer message may contain additional information associated
with the Method, Response or Event in its message body.
8.5.1. Synthesizer Speech Data
Marked-up text for the synthesizer to speak is specified as a MIME
entity in the message body. The message to be spoken by the
synthesizer can be specified inline by embedding the data in the
message body or by reference by providing the URI to the data. In
either case the data and the format used to markup the speech needs
to be supported by the media server.
All media servers MUST support plain text speech data and W3C's
Speech Synthesis Markup Language as a minimum and hence MUST support
the MIME types text/plain and application/synthesis+ssml at a
minimum.
If the speech data needs to be specified by URI reference, the MIME
type text/uri-list is used to specify the one or more URIs that
list what needs to be spoken. If a list of speech URIs is specified,
speech data provided by each URI must be spoken in the order in
which the URIs are specified.
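A text/uri-list entity for a SPEAK request could be assembled as sketched below. The function name and the example.com URIs are placeholders; the sketch assumes CRLF-terminated URI lines and computes Content-Length from the encoded body rather than hard-coding it.

```python
from typing import Iterable


def uri_list_entity(uris: Iterable[str]) -> bytes:
    """Build the entity headers and body for a text/uri-list part.

    The URIs are written in the order given, which is the order in
    which the synthesizer is expected to speak them.
    """
    body = "".join(f"{uri}\r\n" for uri in uris).encode("ascii")
    headers = (
        "Content-Type: text/uri-list\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return headers.encode("ascii") + body
```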
If the data to be spoken consists of a mix of URI and inline speech
data the multipart/mixed MIME-type is used and embedded with the
MIME-blocks for text/uri-list, application/synthesis+ssml or
text/plain. The character set and encoding used in the speech data
may be specified according to standard MIME-type definitions. The
multi-part MIME-block can also contain actual audio data in .wav or
sun audio format. This is used when the client has recorded audio
clips stored in memory or on a local device and needs to play them
as part of the SPEAK request. The audio MIME-parts can be sent by
the client as part of the multi-part MIME-block. This audio is
referenced in the speech markup data, which is another part of the
multi-part MIME-block, according to the multipart/mixed MIME-type
specification.
Example 1:
Content-Type: text/uri-list
Content-Length: 176
http://www.cisco.com/ASR-Introduction.sml
http://www.cisco.com/ASR-Document-Part1.sml
http://www.cisco.com/ASR-Document-Part2.sml
http://www.cisco.com/ASR-Conclusion.sml
Example 2:
Content-Type: application/synthesis+ssml
Content-Length: 104
You have 4 new messages.The first is from Stephanie Williams
and arrived at 3:45pm.The subject is ski trip
Example 3:
Content-Type: multipart/mixed; boundary="--break"
--break
Content-Type: text/uri-list
Content-Length: 176
http://www.cisco.com/ASR-Introduction.sml
http://www.cisco.com/ASR-Document-Part1.sml
http://www.cisco.com/ASR-Document-Part2.sml
http://www.cisco.com/ASR-Conclusion.sml
--break
Content-Type: application/synthesis+ssml
Content-Length: 104
You have 4 new messages.The first is from Stephanie Williams
and arrived at 3:45pm.The subject is ski trip
--break
8.6. SET-PARAMS
The SET-PARAMS method, from the client to server, tells the
synthesizer resource to define default synthesizer context
parameters, like voice characteristics and prosody etc. If the
server resource does not recognize certain OPTIONAL parameters it
should just ignore those fields.
If some of the parameters being set are not recognized or have
illegal values, the remaining parameters will still be set. In that
case, the SET-PARAMS response MUST have a Response-Status of 403 or
404, and MUST include the header fields that could not be set.
Example:
C->S:ANNOUNCE rtsp://media.server.com/media/synthesizer
RTSP/1.0
Cseq: 312
Session: 4123456
Content-Type: application/mrcp
Content-Length: 333
SET-PARAMS 543256 MRCP/1.0
Voice-gender: female
Voice-category: adult
Voice-variant: 3
S->C:RTSP/1.0 200 OK
Cseq: 312
Content-Type: application/mrcp
Content-Length: 87
MRCP/1.0 543256 200 COMPLETE
8.7. GET-PARAMS
The GET-PARAMS method, from the client to server, asks the
synthesizer resource for its current synthesizer context parameters,
like voice characteristics and prosody etc. The client SHOULD send
the list of parameters it wants to read from the server by listing a
set of empty parameter header fields. If a specific list is not
specified then the server SHOULD return all the settable parameters
including vendor-specific parameters and their current values. Such
a wildcard request can be very expensive, as the number of settable
parameters can be large depending on the vendor. Hence it is
RECOMMENDED that the client does not use the wildcard GET-PARAMS
operation very often.
Example:
C->S: ANNOUNCE rtsp://media.server.com/serv/synthesizer RTSP/1.0
Cseq: 312
Session: 4123456
Content-Type: application/mrcp
Content-Length: 89
GET-PARAMS 543256 MRCP/1.0
Voice-gender:
Voice-category:
Voice-variant:
Vendor-Specific-Parameters:com.mycorp.param1;
com.mycorp.param2
S->C: RTSP/1.0 200 OK
Cseq: 312
Content-Type: application/mrcp
Content-Length: 198
MRCP/1.0 543256 200 COMPLETE
Voice-gender:female
Voice-category: adult
Voice-variant: 3
Vendor-Specific-Parameters:com.mycorp.param1="Company Name";
com.mycorp.param2="124324234@mycorp.com"
8.8. SPEAK
The SPEAK method from the client to the server provides the
synthesizer resource with the speech text and initiates speech
synthesis and streaming. The SPEAK method can carry voice and
prosody header fields that define the behavior of the voice being
synthesized, as well as the actual marked-up text to be spoken. If
specific voice and prosody parameters are specified as part of the
speech markup text, they take precedence over the values
specified in the header fields and those set using a previous
SET-PARAMS request.
When applying voice parameters there are 3 levels of scope. The
highest precedence goes to those specified within the speech markup
text, followed by those specified in the header fields of the SPEAK
request, which apply for that SPEAK request only, followed by the
session default values, which can be set using the SET-PARAMS
request and apply for the whole session moving forward.
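The three levels of scope can be modelled as a chained lookup in which the first mapping that defines a parameter wins. This is an illustrative sketch; the function name and the three argument names are assumptions.

```python
from collections import ChainMap
from typing import Dict, Mapping


def effective_voice_params(markup_params: Mapping[str, str],
                           speak_headers: Mapping[str, str],
                           session_defaults: Mapping[str, str]) -> Dict[str, str]:
    """Merge parameters, with markup first, SPEAK headers next, defaults last."""
    # ChainMap searches its maps left to right, so earlier maps take precedence.
    return dict(ChainMap(markup_params, speak_headers, session_defaults))
```

For example, with session defaults of Voice-gender female and Voice-category adult, a SPEAK header of Voice-gender neutral, and markup that sets the category to child, the result is Voice-gender neutral and Voice-category child.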
If the resource is idle and the SPEAK request is being actively
processed the resource will respond with a success status code and a
request-state of IN-PROGRESS.
If the resource is in the speaking or paused states, i.e. it is in
the middle of processing a previous SPEAK request, the status
returns success and a request-state of PENDING. This means that this
SPEAK request is in queue and will be processed after the currently
active SPEAK request is completed.
For the Synthesizer resource, this is the only request that can
return a request-state of IN-PROGRESS or PENDING.
When the text to be synthesized is complete, the resource will issue
a SPEAK-COMPLETE event with the request-id of the SPEAK message and
a request-state of COMPLETE.
Example:
C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 313
Session: 4123456
Content-Type: application/mrcp
Content-Length: 733
SPEAK 543257 MRCP/1.0
Voice-gender: neutral
Voice-category: teenager
Prosody-volume: medium
Content-Type: application/synthesis+ssml
Content-Length: 104
You have 4 new messages.The first is from Stephanie Williams
and arrived at 3:45pm.The subject is ski trip
S->C: RTSP/1.0 200 OK
Cseq: 313
Content-Type: application/mrcp
Content-Length: 86
MRCP/1.0 543257 200 IN-PROGRESS
S->C: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 314
Session: 4123456
Content-Type: application/mrcp
Content-Length: 73
SPEAK-COMPLETE 543257 COMPLETE MRCP/1.0
Completion-Cause: 000 normal
C->S: RTSP/1.0 200 OK
Cseq: 314
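The exchange above nests an MRCP message inside the body of an RTSP
ANNOUNCE, so two Content-Length values must be kept consistent. The
Python sketch below shows one way a client might compute them. The
function names are hypothetical, and the use of CRLF line endings,
UTF-8 encoding, and a blank line before each body are assumptions
made for illustration.

```python
CRLF = "\r\n"


def build_mrcp_request(method, request_id, headers, body=""):
    """Return an MRCP request as bytes; Content-Length is derived from the body."""
    body_bytes = body.encode("utf-8")
    lines = [f"{method} {request_id} MRCP/1.0"]
    lines += [f"{name}: {value}" for name, value in headers]
    if body_bytes:
        lines.append(f"Content-Length: {len(body_bytes)}")
    head = CRLF.join(lines) + CRLF + CRLF
    return head.encode("utf-8") + body_bytes


def wrap_in_announce(url, cseq, session, mrcp_message):
    """Wrap an MRCP message (bytes) in an RTSP ANNOUNCE request."""
    head = CRLF.join([
        f"ANNOUNCE {url} RTSP/1.0",
        f"Cseq: {cseq}",
        f"Session: {session}",
        "Content-Type: application/mrcp",
        f"Content-Length: {len(mrcp_message)}",
    ]) + CRLF + CRLF
    return head.encode("utf-8") + mrcp_message


speak = build_mrcp_request(
    "SPEAK", 543257,
    [("Voice-gender", "neutral"),
     ("Content-Type", "application/synthesis+ssml")],
    "You have 4 new messages.")
packet = wrap_in_announce(
    "rtsp://media.server.com/media/synthesizer", 313, 4123456, speak)
```

Deriving both lengths from the encoded bytes avoids hand-maintained
values drifting out of step with the actual message.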
8.9. STOP
The STOP method from the client to the server tells the resource to
stop speaking, if it is currently speaking.
The STOP request can be sent with an active-request-id-list header
field to stop zero or more specific SPEAK requests that may be in
the queue; the server returns a response code of 200 (Success). If
no active-request-id-list header field is sent in the STOP request,
it terminates all outstanding SPEAK requests.
If a STOP request successfully terminated one or more PENDING or IN-
PROGRESS SPEAK requests, then the response message body contains an
active-request-id-list header field listing the SPEAK request-ids
that were terminated. Otherwise there will be no active-request-id-
list header field in the response. No SPEAK-COMPLETE events will be
sent for these terminated requests.
If a SPEAK request that was IN-PROGRESS and speaking is stopped, the
next pending SPEAK request, if any, becomes IN-PROGRESS and moves to
the speaking state.
If a SPEAK request that was IN-PROGRESS and in the paused state is
stopped, the next pending SPEAK request, if any, becomes IN-PROGRESS
and moves to the paused state.
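The STOP rules can be summarized in a short sketch. It is not a
definitive implementation: the function name, the state names
"idle", "speaking" and "paused", and the comma-separated rendering of
the active-request-id-list value are assumptions made for
illustration.

```python
def handle_stop(state, active, pending, ids_to_stop=None):
    """Apply STOP to a toy synthesizer queue.

    state: 'idle', 'speaking' or 'paused'.
    active: request-id that is IN-PROGRESS, or None.
    pending: list of PENDING request-ids, oldest first.
    ids_to_stop: ids named in the request's active-request-id-list,
        or None when the header field is absent (stop everything).
    Returns (response_headers, new_state, new_active, new_pending).
    """
    outstanding = ([active] if active is not None else []) + list(pending)
    if ids_to_stop is None:
        targets = outstanding
    else:
        targets = [rid for rid in outstanding if rid in ids_to_stop]
    remaining = [rid for rid in pending if rid not in targets]
    if active is not None and active in targets:
        # The next pending request is promoted and keeps the current
        # speaking or paused state; with nothing left, go idle.
        active = remaining.pop(0) if remaining else None
        if active is None:
            state = "idle"
    headers = {}
    if targets:
        headers["Active-Request-Id-List"] = ",".join(str(rid) for rid in targets)
    return headers, state, active, remaining
```

For example, stopping request 1 while paused with requests 2 and 3
pending promotes request 2 and leaves the resource paused.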
Example:
C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 314
Session: 4123456
Content-Type: application/mrcp
Content-Length: 733
SPEAK 543258 MRCP/1.0
Content-Type: application/synthesis+ssml
Content-Length: 104
You have 4 new messages.The first is from Stephanie Williams
and arrived at 3:45pm.The subject is ski trip
S->C: RTSP/1.0 200 OK
Cseq: 314
Content-Type: application/mrcp
Content-Length: 67
MRCP/1.0 543258 200 IN-PROGRESS
C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 315
Session: 4123456
Content-Type: application/mrcp
Content-Length: 87
STOP 543259 MRCP/1.0
S->C: RTSP/1.0 200 OK
Cseq: 315
Content-Type: application/mrcp
Content-Length: 134
MRCP/1.0 543259 200 COMPLETE
Active-Request-Id-List: 543258
8.10. BARGE-IN-OCCURRED
The BARGE-IN-OCCURRED method is a mechanism for the client to
communicate a barge-in-able event it detects to the speech resource.
This method is useful in two scenarios:
1. The client has detected some events like DTMF digits or other
barge-in-able events and wants to communicate that to the
synthesizer.
2. The recognizer resource and the synthesizer resource are in
different servers. In this case the client MUST act as a proxy: it
receives the event from the recognition resource and then sends a
BARGE-IN-OCCURRED method to the synthesizer. In such cases, the
BARGE-IN-OCCURRED method would also carry a proxy-sync-id header
field received from the resource generating the original event.
If a SPEAK request is active with kill-on-barge-in enabled, and the
BARGE-IN-OCCURRED method is received, the synthesizer should stop
streaming out audio. It should also terminate any speech requests
queued behind the current active one, irrespective of whether they
have barge-in enabled or not. If a barge-in-able prompt was playing
and it was terminated, the response MUST contain the request-ids of
all SPEAK requests that were terminated in its active-request-id-list
header field. There will be no SPEAK-COMPLETE events generated for
these requests.
If the synthesizer and the recognizer are on the same server, they
can be optimized for a quicker kill-on-barge-in response by
interacting directly, based on a common RTSP session-id. In these
cases, the client MUST still proxy the recognition event through a
BARGE-IN-OCCURRED method, but the synthesizer resource may have
already stopped and sent a SPEAK-COMPLETE event with a barge-in
completion cause code. If there were no SPEAK
requests terminated as a result of the BARGE-IN-OCCURRED
method, the response would still be a 200 success but MUST NOT
contain an active-request-id-list header field.
Example:
C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 314
Session: 4123456
Content-Type: application/mrcp
Content-Length: 733
SPEAK 543258 MRCP/1.0
Voice-gender: neutral
Voice-category: teenager
Prosody-volume: medium
Content-Type: application/synthesis+ssml
Content-Length: 104
You have 4 new messages.The first is from Stephanie Williams
and arrived at 3:45pm.The subject is ski trip
S->C: RTSP/1.0 200 OK
Cseq: 314
Content-Type: application/mrcp
Content-Length: 87
MRCP/1.0 543258 200 IN-PROGRESS
C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 315
Session: 4123456
Content-Type: application/mrcp
Content-Length: 533
BARGE-IN-OCCURRED 543259 MRCP/1.0
Proxy-Sync-Id: 987654321
S->C: RTSP/1.0 200 OK
Cseq: 315
Content-Type: application/mrcp
Content-Length: 165
MRCP/1.0 543259 200 COMPLETE
Active-Request-Id-List: 543258
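The synthesizer-side handling of BARGE-IN-OCCURRED described in this
section might be sketched as follows. The function name, the tuple
representation of a SPEAK request, and the comma-separated list
format are assumptions; the sketch covers only the case where the
synthesizer has not already stopped on its own.

```python
def handle_barge_in(active, pending):
    """Apply BARGE-IN-OCCURRED to a toy synthesizer queue.

    active: (request_id, kill_on_barge_in) for the active SPEAK, or None.
    pending: list of (request_id, kill_on_barge_in) tuples queued behind it.
    Returns (response_headers, new_active, new_pending); the status is
    200 in every case, so it is not returned.
    """
    if active is None or not active[1]:
        # Nothing barge-in-able is playing: succeed without the header.
        return {}, active, pending
    # Terminate the active prompt and everything queued behind it,
    # regardless of the queued requests' own barge-in setting.
    terminated = [active[0]] + [rid for rid, _ in pending]
    headers = {"Active-Request-Id-List": ",".join(str(r) for r in terminated)}
    return headers, None, []
```

No SPEAK-COMPLETE events would be generated for the request-ids
listed in the returned header.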
8.11. PAUSE
The PAUSE method from the client to the server tells the resource to
pause speech, if it is currently speaking. If a PAUSE method is
issued on a session when a SPEAK is not active, the server SHOULD
respond with a status of 402, "Method not valid in this state". If
a PAUSE method is issued on a session when a SPEAK is active and
already paused, the server SHOULD respond with a status of 200,
"Success". If a SPEAK request was active, the server MUST return an
active-request-id-list header field with the request-id of the SPEAK
request that was paused.
Example:
C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 314
Session: 4123456
Content-Type: application/mrcp
Content-Length: 733
SPEAK 543258 MRCP/1.0
Voice-gender: neutral
Voice-category: teenager
Prosody-volume: medium
Content-Type: application/synthesis+ssml
Content-Length: 104
You have 4 new messages.The first is from Stephanie Williams
and arrived at 3:45pm.The subject is ski trip
S->C: RTSP/1.0 200 OK
Cseq: 314
Content-Type: application/mrcp
Content-Length: 57
MRCP/1.0 543258 200 IN-PROGRESS
C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 315
Session: 4123456
Content-Type: application/mrcp
Content-Length: 53
PAUSE 543259 MRCP/1.0
S->C: RTSP/1.0 200 OK
Cseq: 315
Content-Type: application/mrcp
Content-Length: 223
MRCP/1.0 543259 200 COMPLETE
Active-Request-Id-List: 543258
8.12. RESUME
The RESUME method from the client to the server tells a paused
synthesizer resource to continue speaking. If a RESUME method is
issued on a session when a SPEAK is not active, the server SHOULD
respond with a status of 402, "Method not valid in this state". If
a RESUME method is issued on a session when a SPEAK is active and
speaking (i.e. not paused), the server SHOULD respond with a status
of 200, "Success". If a SPEAK request was active, the server MUST
return an active-request-id-list header field with the request-id of
the SPEAK request that was resumed.
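The status rules for PAUSE (Section 8.11) and RESUME are symmetric
and can be sketched in a single function. The function name and the
state names "idle", "speaking" and "paused" are hypothetical; this is
an illustration of the rules, not a server implementation.

```python
def pause_or_resume(method, state, active_request_id):
    """Return (status_code, response_headers, new_state).

    method: 'PAUSE' or 'RESUME'.
    state: 'idle', 'speaking' or 'paused'.
    """
    if state == "idle" or active_request_id is None:
        # No SPEAK is active: 402, Method not valid in this state.
        return 402, {}, state
    # Repeating PAUSE while paused, or RESUME while speaking, still
    # succeeds and leaves the state unchanged.
    new_state = "paused" if method == "PAUSE" else "speaking"
    headers = {"Active-Request-Id-List": str(active_request_id)}
    return 200, headers, new_state
```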
Example:
C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 314
Session: 4123456
Content-Type: application/mrcp
Content-Length: 733
SPEAK 543258 MRCP/1.0
Voice-gender: neutral
Voice-category: teenager
Prosody-volume: medium
Content-Type: application/synthesis+ssml
Content-Length: 104
You have 4 new messages.The first is from Stephanie Williams
and arrived at 3:45pm.The subject is ski trip
S->C: RTSP/1.0 200 OK
Cseq: 314
Content-Type: application/mrcp
Content-Length: 54
MRCP/1.0 543258 200 IN-PROGRESS
C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 315
Session: 4123456
Content-Type: application/mrcp
Content-Length: 53
PAUSE 543259 MRCP/1.0
S->C: RTSP/1.0 200 OK
Cseq: 315
MRCP/1.0 543259 200 COMPLETE
Active-Request-Id-List: 543258
C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 316
Session: 4123456
Content-Type: application/mrcp
Content-Length: 533
RESUME 543260 MRCP/1.0
S->C: RTSP/1.0 200 OK
Cseq: 316
Content-Type: application/mrcp
Content-Length: 97
MRCP/1.0 543260 200 COMPLETE
Active-Request-Id-List: 543258
8.13. CONTROL
The CONTROL method from the client to the server tells a synthesizer
that is speaking to modify what it is speaking on the fly. This
method is used to make the synthesizer jump forward or backward in
what it is speaking, change the speaking rate and other speaker
parameters, etc. It affects the active (IN-PROGRESS) SPEAK request.
Depending on its implementation and capability, a synthesizer
resource may support this operation in full or only for some of its
parameters.
When a CONTROL to jump forward is issued and the operation goes
beyond the end of the active SPEAK method's text, the request
succeeds. A SPEAK-COMPLETE event follows the response to the CONTROL
method. If there are more SPEAK requests in the queue, the
synthesizer resource will continue to process the next SPEAK method.
When a CONTROL to jump backwards is issued and the operation jumps
to the beginning of the speech data of the active SPEAK request, the
response to the CONTROL request contains the speak-restart header.
These two behaviors can be used to rewind or fast-forward across
multiple speech requests, if the client wants to break up a speech
markup text into multiple SPEAK requests.
If a SPEAK request was active when the CONTROL method was received,
the server MUST return an active-request-id-list header field with
the request-id of the SPEAK request that was active.
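The two jump boundary cases can be illustrated with a toy playback
position counted in words. This is a sketch under stated assumptions:
the function name is hypothetical, only word-based jumps are
modelled, and clamping a backward jump at the first word is an
interpretation of the text above.

```python
def apply_jump(position, total_words, jump_words):
    """Apply a word-based jump to a toy playback position.

    position: index of the next word to be spoken (0-based).
    total_words: number of words in the active SPEAK request.
    jump_words: positive to jump forward, negative to jump backward.
    Returns (new_position, speak_restart, speak_complete).
    """
    target = position + jump_words
    if target >= total_words:
        # Past the end: the CONTROL succeeds and SPEAK-COMPLETE follows.
        return total_words, False, True
    if target <= 0:
        # Back at the beginning: the response carries speak-restart.
        return 0, True, False
    return target, False, False
```

A client that splits one text into several SPEAK requests could use
the two flags to decide when to move to the next or previous request.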
Example:
C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 314
Session: 4123456
Content-Type: application/mrcp
Content-Length: 733
SPEAK 543258 MRCP/1.0
Voice-gender: neutral
Voice-category: teenager
Prosody-volume: medium
Content-Type: application/synthesis+ssml
Content-Length: 104
You have 4 new messages.The first is from Stephanie Williams
and arrived at 3:45pm.The subject is ski trip
S->C: RTSP/1.0 200 OK
Cseq: 314
Content-Type: application/mrcp
Content-Length: 45
MRCP/1.0 543258 200 IN-PROGRESS
C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 315
Session: 4123456
Content-Type: application/mrcp
Content-Length: 104
CONTROL 543259 MRCP/1.0
Prosody-rate: fast
S->C: RTSP/1.0 200 OK
Cseq: 315
Content-Type: application/mrcp
Content-Length: 99
MRCP/1.0 543259 200 COMPLETE
Active-Request-Id-List: 543258
C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 316
Session: 4123456
Content-Type: application/mrcp
Content-Length: 533
CONTROL 543260 MRCP/1.0
Jump-Size: -15 Words
S->C: RTSP/1.0 200 OK
Cseq: 316
Content-Type: application/mrcp
Content-Length: 98
MRCP/1.0 543260 200 COMPLETE
Active-Request-Id-List: 543258
8.14. SPEAK-COMPLETE
This is an Event message from the Synthesizer Resource to the client
indicating that the SPEAK request was completed. The request-id
header field matches the request-id of the SPEAK request that
initiated the speech that just completed. The request-state field
should be COMPLETE, indicating that this is the last Event with that
request-id and that the request with that request-id is now
complete. The completion-cause header field specifies the cause code
describing the status and reason of request completion, for example
whether the SPEAK completed normally, because of an error, or
because of kill-on-barge-in.
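A client has to tell responses and events apart and match each to its
request-id. The sketch below classifies the two server-to-client
start-line shapes that appear in the examples of this document. The
function name and the returned dictionary layout are assumptions, and
whitespace-separated fields are assumed.

```python
def parse_start_line(line):
    """Classify a server-to-client MRCP start-line (sketch only)."""
    parts = line.split()
    if parts and parts[0].startswith("MRCP/"):
        # Response: version request-id status-code request-state
        version, request_id, status, state = parts
        return {"kind": "response", "request_id": int(request_id),
                "status": int(status), "state": state}
    if len(parts) == 4:
        # Event: event-name request-id request-state version
        name, request_id, state, version = parts
        return {"kind": "event", "name": name,
                "request_id": int(request_id), "state": state}
    raise ValueError(f"unrecognized start-line: {line!r}")
```

An event whose state is COMPLETE tells the client that no further
events will arrive for that request-id.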
Example:
C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 316
Session: 4123456
Content-Type: application/mrcp
Content-Length: 733
SPEAK 543260 MRCP/1.0
Voice-gender: neutral
Voice-category: teenager
Prosody-volume: medium
Content-Type: application/synthesis+ssml
Content-Length: 104
You have 4 new messages.The first is from Stephanie Williams
and arrived at 3:45pm.The subject is ski trip
S->C: RTSP/1.0 200 OK
Cseq: 316
Content-Type: application/mrcp
Content-Length: 22
MRCP/1.0 543260 200 IN-PROGRESS
S->C: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 317
Session: 4123456
Content-Type: application/mrcp
Content-Length: 73
SPEAK-COMPLETE 543260 COMPLETE MRCP/1.0
Completion-Cause: 000 normal
C->S: RTSP/1.0 200 OK
Cseq: 317
8.15. SPEECH-MARKER
This is an event sent by the Synthesizer Resource to the client
when it reaches a marker tag in the speech markup it is currently
processing. The request-id field in the header matches the
request-id of the SPEAK request that initiated the speech. The
request-state field should be IN-PROGRESS, as the speech is not yet
complete and there is more to be spoken. The speech marker tag that
was reached, describing where the synthesizer is in the speech
markup, is returned in the speech-marker header field.
Example:
C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 318
Session: 4123456
Content-Type: application/mrcp
Content-Length: 733
SPEAK 543261 MRCP/1.0
Voice-gender: neutral
Voice-category: teenager
Prosody-volume: medium
Content-Type: application/synthesis+ssml
Content-Length: 104
You have 4 new messages.The first is from Stephanie Williams
and arrived at 3:45pm.The subject is
ski trip
S->C: RTSP/1.0 200 OK
Cseq: 318
Content-Type: application/mrcp
Content-Length: 45
MRCP/1.0 543261 200 IN-PROGRESS
S->C: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 319
Session: 4123456
Content-Type: application/mrcp
Content-Length: 73
SPEECH-MARKER 543261 IN-PROGRESS MRCP/1.0
Speech-Marker: here
C->S: RTSP/1.0 200 OK
Cseq: 319
S->C: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 320
Session: 4123456
Content-Type: application/mrcp
Content-Length: 73
SPEECH-MARKER 543261 IN-PROGRESS MRCP/1.0
Speech-Marker: ANSWER
C->S: RTSP/1.0 200 OK
Cseq: 320
S->C: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
Cseq: 321
Session: 4123456
Content-Type: application/mrcp
Content-Length: 73
SPEAK-COMPLETE 543261 COMPLETE MRCP/1.0
Completion-Cause: 000 normal
C->S: RTSP/1.0 200 OK
Cseq: 321
9. Speech Recognizer Resource
The Speech Recognizer resource is capable of receiving an incoming
voice stream and providing the client with an interpretation of what
was spoken in textual form.
9.1. Recognizer State Machine
The recognizer resource is controlled by MRCP requests from the
client. The resource can respond to these requests or generate
asynchronous events to the client to indicate certain conditions
during the processing of the stream. Hence the recognizer maintains
state to correlate MRCP requests from the client. The state
transitions are described below.
Idle                Recognizing               Recognized
State                 State                      State
|                       |                          |
|---------RECOGNIZE---->|---RECOGNITION-COMPLETE-->|
|<------STOP------------|<-----RECOGNIZE-----------|
|                       |                          |
|                       |              |-----------|
|              |--------|       GET-RESULT         |
|       START-OF-SPEECH |              |---------->|
|------------| |------->|                          |
|            |          |----------|               |
|      DEFINE-GRAMMAR   | RECOGNITION-START-TIMERS |
|<-----------|          |<---------|               |
|                       |<---DEFINE-GRAMMAR--------|
|                       |                          |
|-------|               |                          |
|      STOP             |                          |
|<------|               |                          |
|                                                  |
|<-------------------STOP--------------------------|
|<-------------------DEFINE-GRAMMAR----------------|
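The diagram can be expressed as a transition table. The sketch below
is one reading of it, not a normative definition: the lowercase state
names and the function name are hypothetical. The diagram draws
DEFINE-GRAMMAR leaving the Recognized state twice; the table follows
the bottom arrow, which returns to Idle, and that choice is an
assumption.

```python
# (current state, method or event) -> next state
TRANSITIONS = {
    ("idle", "RECOGNIZE"): "recognizing",
    ("idle", "DEFINE-GRAMMAR"): "idle",
    ("idle", "STOP"): "idle",
    ("recognizing", "START-OF-SPEECH"): "recognizing",
    ("recognizing", "RECOGNITION-START-TIMERS"): "recognizing",
    ("recognizing", "RECOGNITION-COMPLETE"): "recognized",
    ("recognizing", "STOP"): "idle",
    ("recognized", "GET-RESULT"): "recognized",
    ("recognized", "RECOGNIZE"): "recognizing",
    ("recognized", "DEFINE-GRAMMAR"): "idle",
    ("recognized", "STOP"): "idle",
}


def next_state(state, message):
    """Return the state reached from `state` on `message`."""
    try:
        return TRANSITIONS[(state, message)]
    except KeyError:
        raise ValueError(f"{message} is not valid in state {state!r}") from None
```

A lookup that fails corresponds to a method the diagram does not
allow in that state, such as GET-RESULT while Idle.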
9.2. Recognizer Methods
The recognizer supports the following methods.
Recognizer-Method = SET-PARAMS
| GET-PARAMS
| DEFINE-GRAMMAR
| RECOGNIZE
| GET-RESULT
| RECOGNITION-START-TIMERS
| STOP
9.3. Recognizer Events
The recognizer may generate the following events.
Recognizer-Event = START-OF-SPEECH
| RECOGNITION-COMPLETE
9.4. Recognizer Header Fields
A recognizer message may contain header fields containing request
options and information to augment the Method, Response or Event
message it is associated with.
recognizer-header = confidence-threshold ; Section 9.4.1
| sensitivity-level ; Section 9.4.2
| speed-vs-accuracy ; Section 9.4.3
| n-best-list-length ; Section 9.4.4
| no-input-timeout ; Section 9.4.5
| recognition-timeout ; Section 9.4.6
| waveform-url ; Section 9.4.7
| completion-cause ; Section 9.4.8
| recognizer-context-block ; Section 9.4.9
| recognizer-start-timers ; Section 9.4.10
| vendor-specific ; Section 9.4.11
| speech-complete-timeout ; Section 9.4.12
| speech-incomplete-timeout; Section 9.4.13
| dtmf-interdigit-timeout ; Section 9.4.14
| dtmf-term-timeout ; Section 9.4.15
| dtmf-term-char ; Section 9.4.16
| fetch-timeout ; Section 9.4.17
| failed-uri ; Section 9.4.18
| failed-uri-cause ; Section 9.4.19
| save-waveform ; Section 9.4.20
| new-audio-channel ; Section 9.4.21
Parameter                   Support    Methods/Events
confidence-threshold        MANDATORY  SET-PARAMS, RECOGNIZE,
                                       GET-RESULT
sensitivity-level           Optional   SET-PARAMS, GET-PARAMS,
                                       RECOGNIZE
speed-vs-accuracy           Optional   SET-PARAMS, GET-PARAMS,
                                       RECOGNIZE
n-best-list-length          Optional   SET-PARAMS, GET-PARAMS,
                                       RECOGNIZE, GET-RESULT
no-input-timeout            MANDATORY  SET-PARAMS, GET-PARAMS,
                                       RECOGNIZE
recognition-timeout         MANDATORY  SET-PARAMS, GET-PARAMS,
                                       RECOGNIZE
waveform-url                MANDATORY  RECOGNITION-COMPLETE
completion-cause            MANDATORY  DEFINE-GRAMMAR, RECOGNIZE,
                                       RECOGNITION-COMPLETE
recognizer-context-block    Optional   SET-PARAMS, GET-PARAMS
recognizer-start-timers     MANDATORY  RECOGNIZE
vendor-specific             MANDATORY  SET-PARAMS, GET-PARAMS
speech-complete-timeout     MANDATORY  SET-PARAMS, GET-PARAMS,
                                       RECOGNIZE
speech-incomplete-timeout   MANDATORY  SET-PARAMS, GET-PARAMS,
                                       RECOGNIZE
dtmf-interdigit-timeout     MANDATORY  SET-PARAMS, GET-PARAMS,
                                       RECOGNIZE
dtmf-term-timeout           MANDATORY  SET-PARAMS, GET-PARAMS,
                                       RECOGNIZE
dtmf-term-char              MANDATORY  SET-PARAMS, GET-PARAMS,
                                       RECOGNIZE
fetch-timeout               MANDATORY  SET-PARAMS, GET-PARAMS,
                                       RECOGNIZE, DEFINE-GRAMMAR
failed-uri                  MANDATORY  Any
failed-uri-cause            MANDATORY  Any
save-waveform               MANDATORY  SET-PARAMS, GET-PARAMS,
                                       RECOGNIZE
new-audio-channel           MANDATORY  RECOGNIZE
9.4.1. Confidence Threshold
When a recognition resource recognizes or matches a spoken phrase
with some portion of the grammar, it associates a confidence level
with that conclusion. The confidence-threshold parameter tells the
recognizer resource what confidence level should be considered a
successful match. This is an integer from 0-100 indicating the
recognizer's confidence in the recognition. If the recognizer
determines that its confidence in all its recognition results is
less than the confidence threshold, then it MUST return no-match as
the recognition result. This header field MAY occur in RECOGNIZE,
SET-PARAMS or GET-PARAMS.
confidence-threshold= "Confidence-Threshold" ":" 1*DIGIT CRLF
9.4.2. Sensitivity Level
To filter out background noise and not mistake it for speech, the
recognizer may support a variable level of sound sensitivity. The
sensitivity-level parameter allows the client to set this value on
the recognizer. This header field MAY occur in RECOGNIZE, SET-PARAMS
or GET-PARAMS.
sensitivity-level = "Sensitivity-Level" ":" 1*DIGIT CRLF
9.4.3. Speed Vs Accuracy
Depending on its implementation and capability, the recognizer
resource may be tunable towards performance or accuracy. Higher
accuracy may mean more processing and higher CPU utilization,
and therefore fewer calls per media server, and vice versa. This
parameter on the resource can be tuned by the speed-vs-accuracy
header field. This header field MAY occur in RECOGNIZE, SET-PARAMS or
GET-PARAMS.
speed-vs-accuracy = "Speed-Vs-Accuracy" ":" 1*DIGIT CRLF
9.4.4. N Best List Length
When the recognizer matches an incoming stream with the grammar, it
may come up with more than one alternative match because of
confidence levels in certain words or conversation paths. If this
header field is not specified, by default, the recognition resource
will only return the best match above the confidence threshold. The
client, by setting this parameter, could ask the recognition
resource to send it more than one alternative. All alternatives must
still be above the confidence-threshold. A value greater than one
does not guarantee that the recognizer will send the requested number
of alternatives. This header field MAY occur in RECOGNIZE, SET-PARAMS
or GET-PARAMS.
n-best-list-length = "N-Best-List-Length" ":" 1*DIGIT CRLF
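The interaction between confidence-threshold and n-best-list-length
can be sketched as a filter followed by a truncation. The function
name and the (text, confidence) tuple format are hypothetical, and
treating a confidence equal to the threshold as acceptable is an
assumption. An empty result stands for the no-match case.

```python
def select_results(hypotheses, confidence_threshold, n_best_list_length=1):
    """Pick the alternatives to return to the client.

    hypotheses: list of (text, confidence) pairs, confidence in 0-100.
    Returns at most n_best_list_length pairs, best first; an empty
    list means the recognizer would report no-match.
    """
    accepted = [h for h in hypotheses if h[1] >= confidence_threshold]
    accepted.sort(key=lambda h: h[1], reverse=True)
    return accepted[:n_best_list_length]
```

Asking for two alternatives with only one hypothesis above the
threshold returns a single entry, which mirrors the statement that
the requested number of alternatives is not guaranteed.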
9.4.5. No Input Timeout
When recognition is started and there is no speech detected for a
certain period of time, the recognizer can send a RECOGNITION-
COMPLETE event to the client and terminate the recognition
operation. The no-input-timeout header field can set this timeout
value. The value is in milliseconds. This header field MAY occur in
RECOGNIZE, SET-PARAMS or GET-PARAMS.
no-input-timeout = "No-Input-Timeout" ":" 1*DIGIT CRLF
9.4.6. Recognition Timeout
When recognition is started and there is no match for a certain
period of time, the recognizer can send a RECOGNITION-COMPLETE event
to the client and terminate the recognition operation. The
recognition-timeout header field sets this timeout value. The
value is in milliseconds. The default value is 10 seconds. This
header field MAY occur in RECOGNIZE, SET-PARAMS or GET-PARAMS.
recognition-timeout = "Recognition-Timeout" ":" 1*DIGIT CRLF
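The no-input-timeout and recognition-timeout checks can be sketched
together. This is only an illustration: the function name and the
returned reason strings are hypothetical labels rather than
completion-cause values, and measuring both timers from the start of
recognition is an assumption.

```python
def check_timeouts(elapsed_ms, speech_detected, matched,
                   no_input_timeout_ms, recognition_timeout_ms=10000):
    """Return why recognition should end, or None to keep listening.

    elapsed_ms: milliseconds since recognition started.
    speech_detected: whether any speech has been detected so far.
    matched: whether a grammar match has been found so far.
    """
    if not speech_detected and elapsed_ms >= no_input_timeout_ms:
        return "no-input-timeout"
    if not matched and elapsed_ms >= recognition_timeout_ms:
        return "recognition-timeout"
    return None
```

In either case the recognizer would send a RECOGNITION-COMPLETE event
and terminate the recognition operation.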
9.4.7. Waveform URL
If the save-waveform header field is set to true, the recognizer
MUST record the incoming audio stream of the recognition into a file
and provide a URI for the client to access it. This header MUST be
present in the RECOGNITION-COMPLETE event if the save-waveform
header field was set to true. The URL value of the header MUST be
NULL if there was some error condition preventing the server from
recording. Otherwise, the URL generated by the server SHOULD be
globally unique across the server and all its recognition sessions.
The URL SHOULD be available until the next RECOGNIZE request is
issued on that session, or the session is torn down, whichever
happens first.
waveform-url = "Waveform-URL" ":" Url CRLF
9.4.8. Completion Cause
This header field MUST be part of a RECOGNITION-COMPLETE event
coming from the recognizer resource to the client. It indicates
the reason behind the RECOGNIZE method completion. This header field
MUST be sent in the DEFINE-GRAMMAR and RECOGNIZE responses if they
return with a failure status and a COMPLETE state.
completion-cause = "Completion-Cause" ":" 1*DIGIT SP
1*ALPHA CRLF
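A client might split this header field into its code and name as
sketched below. The function name is hypothetical. The pattern
accepts hyphens in the cause name, which is an assumption: a strict
reading of 1*ALPHA would not admit a name such as "no-match". The
code is kept as a string so that leading zeros are preserved.

```python
import re

# Digits, one space, then a cause name; hyphens allowed by assumption.
_COMPLETION_CAUSE = re.compile(r"^Completion-Cause:\s*(\d+) ([A-Za-z-]+)$")


def parse_completion_cause(header_line):
    """Return (cause_code, cause_name) from a Completion-Cause line."""
    match = _COMPLETION_CAUSE.match(header_line.strip())
    if match is None:
        raise ValueError(f"malformed completion-cause: {header_line!r}")
    return match.group(1), match.group(2)
```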
Cause-Code Cause-Name Description
000 success RECOGNIZE completed with a match or
DEFINE-GRAMMAR succeeded in
downloading and compiling the
grammar
001 no-match RECOGNIZE completed, but no match