The Epos Speech System: Text-To-Speech Control Protocol (version 0)

5. Text-To-Speech Control Protocol (version 0)

TTSCP is a client-server connection-oriented, both human- and machine-readable communication protocol, remotely similar to the File Transfer Protocol in spirit. TTSCP is offered as a standard interface for controlling generic speech processing applications, not only Text-To-Speech ones. It is primarily designed to run atop TCP, but any reliable connection-oriented underlying protocol should theoretically work as well.

The server awaits new connections on a single TCP port. There are two types of connections: control connections used to issue commands by the client and to return status information, such as completion messages by the server, and data connections used to transfer the actual data. Immediately after the underlying connection is opened, the server transmits a session header (see below) and treats the connection as a control connection, until the data command is issued by the client, causing it to become a data connection.

Every TTSCP connection (both a control one and a data one) obtains a connection handle from the server inside the session header. This handle is a string of alphanumeric characters which uniquely identifies the connection and which also serves as an access token for it. Other connections can use such a handle to interrupt a control connection's task in progress, to disconnect any connection, to process data received from a data connection etc.

A TTSCP session is a sequence of commands, their results and referenced data lasting from setting up the control connection until its disconnection or the data command. Any party may quit the session at any time, but must advise the other one either by the done command (the client) or by a 600 response code or higher (the server). If a done command is sent before a preceding command has completed, the server will proceed with the preceding commands. If a 600 or higher error code is received as a response to a command and subsequent commands have already been sent by the client, they will not be executed.

A data connection may be silently disconnected by the client at any time. To allow reliable disconnection detection by the server, every data connection is attached to an already existing control connection (as specified with the data command) and it will be automatically disconnected when the control connection is disconnected. This attachment relation doesn't prevent other control connections from referencing this data connection using its handle, it only limits its lifetime.

The session header (as sent before a TTSCP session starts) is a sequence of lines. The first line shall exactly match the string TTSCP spoken here; the clients are strongly encouraged to use this string to identify the protocol. Each of the following lines contains a TTSCP header keyword terminated by a colon and a single space and the value associated with the keyword. The client may choose not to use these values at all, or to scan only for some header keywords. The last line in the header shall contain the handle keyword.

A typical TTSCP session looks like this, with client commands unindented and server responses indented.


        TTSCP spoken here
        protocol: 0
        extensions:
        server: Epos
        release: 2.4.6
        handle: O29-m2UZ
user user@host.domain.net
        452 user not found
setl some_option on
        200 OK
strm $zC-4EEl0:raw:rules:diphs:synth:/dev/dsp
        200 OK
appl 34
        112 started
        122 total bytes
         3622
        123 written bytes
         3622
        200 OK
done
        600 goodbye

The "user" and "done" commands may become mandatory; the rest may be freely used between them. For interaction with a human, the "help" command is available.

It is legal to use "anonymous" instead of the address in the user command: "user anonymous". It is also legal to switch users with additional user commands. This may cause context switches.

Clients are advised to check that the greeting string received begins with "TTSCP ". If it doesn't, the client or possibly the server may be obsolete, or an unrelated protocol may be in use at the port.

In this document, a "newline" produced by the server or the client should be a CR LF character sequence. It is allowed for both parties to accept a LF character without a preceding CR character as a valid line separator, but it is never legal to rely on this practice.

5.1 Session Header Keywords

The set of session header keywords and their sequence may vary between TTSCP implementations. Some lower case keywords are defined by this document; in addition, any implementation may supply its own keywords provided their first two characters are lower case x and dash, respectively, or they consist solely of upper case letters. Both standard and implementation specific keywords are limited to upper and lower case letters (case sensitive), digits, dashes and underlines; however, the values associated with some keywords may contain any printable ISO 8859 characters. There are three mandatory keywords (protocol, extensions and handle, in order of appearance in the session header).

extensions: The value is a whitespace separated list of semi-standard and non-standard extensions supported by this TTSCP server. Only extensions defined by this document or a future version of this document should be advertised; custom or experimental extensions may be advertised provided their first two characters are lower case x and dash, respectively. At present, there are no extensions defined, so the list should be empty, but this keyword is nevertheless mandatory.
handle: The value is a connection handle for this control connection. The handle stays valid when the connection is turned into a data connection. Only lower and upper case letters, digits, dashes and underlines may occur in the handle. This keyword is mandatory and must appear last in the session header.
protocol: The value is a decimal number identifying the major TTSCP protocol version. The current protocol version number is 0 (previous versions had no session header). It is likely that protocol versions unknown to the client will be fundamentally incompatible. It is mandatory to begin the session header with this keyword. It is recommended to check it on the client side.
release: Server release. The formatting and interpretation is implementation dependent.
server: Server name. Different versions of the same implementation should typically use an identical value for this keyword.
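
The header rules above can be sketched as a small client-side parser. This is a minimal illustration, not part of Epos: the function name parse_session_header is hypothetical, and the input is assumed to be a list of header lines already split at newlines, with the terminators removed.

```python
GREETING_PREFIX = "TTSCP "
MANDATORY_KEYWORDS = ("protocol", "extensions", "handle")


def parse_session_header(lines):
    """Parse session header lines into a dict of keyword -> value.

    Raises ValueError if the greeting, a header line, a mandatory
    keyword or the protocol version is not as expected.
    """
    it = iter(lines)
    greeting = next(it, "")
    if not greeting.startswith(GREETING_PREFIX):
        raise ValueError("not a TTSCP server: %r" % greeting)

    header = {}
    for line in it:
        keyword, sep, value = line.partition(":")
        if not sep:
            raise ValueError("malformed header line: %r" % line)
        keyword = keyword.strip()
        # The value follows a colon and a single space; an empty value
        # (such as an empty extensions list) may arrive without the space.
        header[keyword] = value[1:] if value.startswith(" ") else value
        if keyword == "handle":
            break  # the handle keyword closes the header

    missing = [k for k in MANDATORY_KEYWORDS if k not in header]
    if missing:
        raise ValueError("missing mandatory keywords: " + ", ".join(missing))
    if header["protocol"] != "0":
        raise ValueError("unknown protocol version: %r" % header["protocol"])
    return header
```

Applied to the header of the sample session shown earlier, this returns a dictionary whose "handle" entry is O29-m2UZ and whose "extensions" entry is the empty string. Unknown keywords are kept rather than rejected, since implementations may supply their own.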

5.2 Data Formats

The data is passed between modules in one of the following formats:

plain text
phonetic structure of the text (TSR)
Speech Synthesizer Input Format (SSIF)
sequence of segments
waveform file

Plain Text

This is what the "text" in "text-to-speech" stands for.

Text Structure Representation

Internal text representation, suitable for arbitrary processing, but unsuitable for input or output. Before output, it must be converted to another format first. For a description, see the text structure representation overview.

Conversion to plain text dismisses prosody.

Conversion to plain text dismisses segment layer if any.

Speech Synthesizer Input Format

This format has been introduced by the MBROLA synthesizer development team. Together with the "sequence of segments" it is one of the two possible input formats to a speech synthesizer in Epos. With the MBROLA synthesizer you have to use SSIF.

SSIF is line oriented, each line corresponding to a single phone; the line contains several whitespace separated components.

The first component is the SAMPA notation of the phone; the second component is its duration in milliseconds.

Subsequent components are prosody points. Each prosody point is enclosed in parentheses and consists of two or three integers separated by commas. The first value locates the prosody point within the phone as a percentage of the phone's duration (e.g. the value of 99 corresponds to just before the end of the phone); the second value indicates the desired pitch at that prosody point (the value of 100 indicates the default pitch); the third value, which is optional and not currently supported by MBROLA, indicates the intensity at that point. It is the responsibility of the synthesizer to do piecewise linear interpolation between prosody points.
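
A line in the format just described could be parsed roughly as follows. This is a sketch based only on the description above: the function name parse_ssif_line is hypothetical, and it assumes that a prosody point contains no whitespace inside its parentheses.

```python
import re

# One prosody point: (position, pitch) or (position, pitch, intensity).
_PROSODY_POINT = re.compile(r"\((-?\d+),(-?\d+)(?:,(-?\d+))?\)")


def parse_ssif_line(line):
    """Return (phone, duration_ms, points) for one SSIF line.

    Each point is a (position_percent, pitch, intensity) tuple in which
    intensity is None when the optional third value is absent.
    """
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("phone and duration are required: %r" % line)
    phone = fields[0]
    duration_ms = int(fields[1])
    points = []
    for field in fields[2:]:
        match = _PROSODY_POINT.fullmatch(field)
        if match is None:
            raise ValueError("bad prosody point: %r" % field)
        position, pitch, intensity = match.groups()
        points.append((
            int(position),
            int(pitch),
            None if intensity is None else int(intensity),
        ))
    return phone, duration_ms, points
```

For instance, the made-up line "a 120 (0,100) (99,110,80)" yields the phone "a", a duration of 120 ms and two prosody points, the second of which carries an intensity of 80.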

Sequence of Segments

Every segment is a quadruple of segment number, assigned frequency (pitch), intensity (volume) and time factor (speed). The initial segment is a dummy (to be skipped); its segment number contains the total number of segments in this sequence. Its prosodic parameters are undefined; they should preferably be zero.

The integer values should be encoded as 32-bit little endian integers.
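
As an illustration of this encoding, the sketch below packs and unpacks a segment sequence with Python's struct module. It rests on two assumptions that the text above does not settle: that each of the four fields of a segment is one signed 32-bit little endian integer, and that the count stored in the dummy segment does not include the dummy segment itself. The function names are hypothetical.

```python
import struct

# Assumed layout: segment number, pitch, intensity, time factor,
# each a signed 32-bit little endian integer.
_SEGMENT = struct.Struct("<4i")


def pack_segments(segments):
    """Encode a list of (number, pitch, intensity, time_factor) tuples."""
    # Dummy initial segment: its number holds the segment count and its
    # prosodic parameters are zero.
    parts = [_SEGMENT.pack(len(segments), 0, 0, 0)]
    parts.extend(_SEGMENT.pack(*segment) for segment in segments)
    return b"".join(parts)


def unpack_segments(data):
    """Decode a segment sequence, skipping the dummy initial segment."""
    count = _SEGMENT.unpack_from(data, 0)[0]
    if count < 0 or len(data) < _SEGMENT.size * (count + 1):
        raise ValueError("truncated or malformed segment sequence")
    return [
        _SEGMENT.unpack_from(data, _SEGMENT.size * index)
        for index in range(1, count + 1)
    ]
```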

This format is currently being replaced by SSIF, although it will remain supported for some additional period of time.

Waveform

The traditional MS Windows RIFF .wav file header and data. Two liberties may be taken when waveform data in this format is sent via a data connection.

First, the total length of the RIFF form field may contain a negative number. In this case, the length of the form shall be determined from the data length as indicated in the corresponding TTSCP control connection. Also, if this field contains a positive number, which conflicts with the data length indicated in the corresponding TTSCP control connection, the recipient may choose any one of them or return a 435 error.

Second, if only fmt and data chunks are present in the RIFF form being sent, and the length of the data chunk is negative, the length of the data chunk shall be determined from the total length of the RIFF form. (Epos never actually takes advantage of this rule.)
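
The first liberty might be handled by a recipient along the following lines. This is a sketch under stated assumptions: ttscp_length stands for the byte count announced on the control connection, it is assumed to cover the complete RIFF file (so the form length is 8 bytes smaller, as in an ordinary RIFF file), and on a conflict this sketch chooses to report an error, which is only one of the options permitted above. The function name is hypothetical.

```python
import struct


def riff_form_length(header, ttscp_length):
    """Return the RIFF form length to use for a received waveform.

    header       -- at least the first 12 bytes of the received data
    ttscp_length -- byte count announced on the control connection
                    (assumed to cover the whole RIFF file)
    """
    tag, declared, form_type = struct.unpack_from("<4si4s", header)
    if tag != b"RIFF" or form_type != b"WAVE":
        raise ValueError("not a RIFF WAVE header")
    # The RIFF length field excludes the 8 bytes of tag and length field.
    implied = ttscp_length - 8
    if declared < 0:
        return implied  # negative length: fall back to the TTSCP byte count
    if declared != implied:
        # The recipient could also pick either value; this sketch refuses.
        raise ValueError("435 received bad waveform")
    return declared
```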

This format allows storing labels (i.e. pointers to specific positions within the waveform); Epos does use this feature if enabled e.g. using the label_phones option to label phone and/or segment boundaries within the waveform.

5.3 TTSCP Commands

TTSCP commands are newline-terminated strings. Each of them begins with a command identifier; some may continue with optional or mandatory parameters, depending on the particular command. Each command generates one or more "replies", the last reply indicating completion and sometimes also carrying some command-specific information.

`appl`

Apply the current data processing stream (see the strm command) to some data. The parameter is a decimal number specifying the number of bytes to be processed.

Before the completion reply, zero or more 122 replies are received by the client, every one followed by a decimal number on a line by itself, preceded with a single space. This is the number of bytes written by the output module per task. Usually, if the appl command generates a single successful task only, there shall be exactly one such reply, but if e.g. the chunk module has split the input text into more independent parts, multiple outputs and multiple 122 replies may appear; if e.g. the join module has been employed, there may be no 122 reply at all if the text being processed is considered unterminated. Such an intermediate reply should be sent as soon as the number of bytes to be sent is known to the TTSCP server to avoid certain deadlock scenarios caused by an insufficient buffer capacity between the server and the client. The number of bytes actually sent may be even smaller in case of a user break or another unexpected situation; it shall never be larger and it shall be exactly the number of bytes sent by the server upon a successful completion reply.

Before the completion reply, one or more 123 replies for every 122 reply are received by the client. Every 123 reply is followed by a decimal number on a line by itself, preceded with a single space. This is the number of bytes actually successfully written by the output module. This intermediate reply should be sent as soon as the data is sent. When the client eventually receives a successful completion reply, the sum of byte counts received with 123 replies shall match the number of bytes sent by the server.

For every 122 reply, there shall be a corresponding sequence of 123 replies such that no unrelated 122 or 123 replies intervene. The sum of byte counts received with these 123 replies shall match the byte count received with the 122 reply. In other words, the replies relating to different subtasks must preserve the time ordering. If an error condition prematurely terminates the appl command processing, this behavior is not required for the last subtask whose processing has begun, independent of whether its 122 reply has been received by the client.

The relative ordering of 122 and 123 replies for the same subtask is not specified by the TTSCP.
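
A client can check the byte count rules above without depending on the unspecified ordering of 122 and 123 replies by comparing overall sums only. The following sketch does that; the function name summarize_appl_replies is hypothetical, and the input is assumed to be the reply lines for a single appl command with line terminators removed.

```python
def summarize_appl_replies(lines):
    """Return (completion_code, announced_bytes, written_bytes).

    announced_bytes sums the counts following 122 replies and
    written_bytes sums the counts following 123 replies.
    """
    announced = 0
    written = 0
    pending = None  # 122 or 123 while its byte count line is awaited
    for line in lines:
        if pending is not None:
            if not line.startswith(" "):
                raise ValueError("expected a byte count after a %d reply" % pending)
            count = int(line.strip())
            if pending == 122:
                announced += count
            else:
                written += count
            pending = None
            continue
        if line.startswith(" "):
            continue  # a data line not tied to a 122 or 123 reply
        code = int(line[:3])
        if code in (122, 123):
            pending = code
        elif code >= 200:
            # On successful completion the two sums must agree.
            if code < 300 and announced != written:
                raise ValueError(
                    "byte counts disagree: %d announced, %d written"
                    % (announced, written)
                )
            return code, announced, written
    raise ValueError("no completion reply received")
```

For the replies to "appl 34" in the sample session, this returns (200, 3622, 3622). After an interruption the two sums may legitimately differ, so they are compared only for 2xx completion codes.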

The completion response code is received when all the modules have finished processing and data has been output by the output module. Some of the data may however still be being processed by hardware, e.g. a sound card, or may be delayed by the network.

Using appl before the first strm command is forbidden.

`intr`

Interrupt a single appl command in progress. The parameter is a control connection handle and specifies the connection which issued the command to be interrupted.

The server should try to discard as much pending data as possible, including e.g. waveform data already written to a sound card. If however multiple appl commands have been sent simultaneously, only the one in progress will be interrupted.

The server will reply with a 401 completion code to the interrupted connection, whereas a 200 completion code will acknowledge a successful intr command.

If there is no stream associated with the connection to be interrupted or there is no apply command in progress on it, a 423 reply will be issued to the interrupting connection and the interrupting connection will not be affected.

`data`

Turn this control connection into a data connection. The parameter is the handle of an existing control connection to attach this connection to. The sole consequence of this attachment relation is a disconnect of the data connection when the specified control connection is disconnected. (It is therefore common for a client to open two connections, to get their connection handles, to turn one into a data connection and to attach it to the other connection. That way the client obtains a control and a data connection which will gracefully shut down even after the client abruptly disconnects.)

The server sends a completion reply for this command (response code 200 if successful). After the first newline character following the 200 response code is received, no more control information will arrive. Likewise, the client may not send any TTSCP commands after the newline-terminated data command. If the data command is not successful because of capacity or other reasons, the connection stays in a valid TTSCP control connection state and more commands may be submitted.

The data connection becomes valid at receipt of a 200 response code to this command.
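
The switch from control to data mode could be sketched on the client side as below. The function name send_data_command is hypothetical; wfile and rfile are assumed to be binary file-like objects for the connection that is to become the data connection, with the session header already consumed.

```python
def send_data_command(wfile, rfile, control_handle):
    """Send "data <handle>" and return the completion code as an int.

    A return value of 200 means the connection is now a data connection
    and everything still unread from rfile is raw data. Any other code
    below 500 means it is still a usable control connection.
    """
    wfile.write(b"data " + control_handle.encode("ascii") + b"\r\n")
    wfile.flush()
    while True:
        # Read exactly one line at a time so that no raw data following
        # the completion reply is mistaken for control information.
        line = rfile.readline()
        if not line:
            raise ConnectionError("connection closed before completion reply")
        code = line[:3]
        if code.isdigit() and int(code) >= 200:
            return int(code)
        # Anything else (1xx replies, data lines) is skipped.
```

With a TCP socket, both file objects might be obtained from a single sock.makefile("rwb") call; later data should then be read through the same object, because it may already hold buffered bytes that follow the completion reply.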

`delh`

Terminate a specified data connection. The parameter is the data connection handle to be terminated, as returned by a former data command on that connection. If successful, the connection is disconnected by the server and the data connection handle is forgotten.

`done`

Issued as the last command in a session. The client may exit just after sending this command. The server should reply with error code 600.

`down`

Stop the server. Quit pending sessions. May disappear in the future.

`help`

Request for TTSCP syntax help. The server response is undefined except for the proper error code termination (class 2 or 4).

Suggested behavior is to reply with "441 help yourself" or a brief list of commands with explanation. If a parameter is given, the server may supply more specific information, such as verbose description of a single command.

The usual completion reply rules apply. Specifically, care must be taken lest the help text contain a line beginning with a digit.

`pass`

Attempts to validate an account, as given by a previous "user" command. If no valid "user" command was ever received, the internal server password may be used. This may enable some internal commands such as "down" or "setg". (Epos stores this internal password in /var/run/epos.pwd while it is listening on the standard TTSCP port.)

The password is a string of alphanumeric characters, dashes and underlines, no more than 250 bytes long.

`setg`

Globally set a server configuration parameter. The parameter is a whitespace-separated "option value" pair. The server may ignore this command altogether with an error code 442. The server will reply with an error code 412 if the value assigned is illegal, or with 451 if the server is configured not to allow this parameter to be changed (may depend on the current authentication status).

The settings, if successful, will apply to all future connections. They will typically not affect existing connections, unless specified otherwise. For this reason, this command should be available only to authenticated and trusted users.

If the option name is "language", the command will attempt to switch the default language. The same goes for "voice".

The standardization status of this command is still unclear. It is definitely reasonable to use compatible option names between server implementations where applicable, but the set of useful configuration parameters seems to be impossible to specify in advance. Any comment on this issue is welcome.

`setl`

Set a server configuration parameter. The parameter is a whitespace-separated "option value" pair. The server may ignore this command altogether with an error code 442. In any case, this setting should never alter the execution environment of other existing and/or future sessions. (In Epos, setting a static option using setl, which is rarely allowed and even less often actually done, affects all connections in the same way; but all voice, language and global options fully follow this standard.) The server will reply with an error code 412 if the value assigned is illegal, or with 451 if the server is configured not to allow this parameter to be changed (may depend on the current authentication status).

The settings apply to the current session; use setg for more permanent settings. Note also that setting some options can have arbitrary side-effects.

If the option name is "language", the command will attempt to switch the language. The same goes for "voice".

`show`

Show a configuration parameter value. The parameter is an option name. The server may reply with the value of the option requested, preceded by a single space character, or it may ignore this command with error code 442.

show languages and show voices may be used for listing available languages, as well as available voices for the current language. The language or voice names are given on separate lines each.

`strm`

Prepare a data flow stream. The parameter is a colon-separated sequence of data processing modules; commands such as appl cause specified data to be run through the modules from left to right. Any two adjacent modules must be compatible, that is the type of output produced by the one to the left must match the type of input processed by the one to the right. The leftmost module must designate a source (input) module for the whole stream, the rightmost one must designate a destination for the data produced by the stream. Information on specific data formats accepted or produced by the modules can be found above.

The stream is not automatically active. It processes data only when requested by the appl command.

The stream lasts until the next strm command or termination of the TTSCP connection, then it is deleted.

`user`

The user command is still not implemented properly and its semantics may change. Epos is currently configured not to need and not to use it. The tentative theory of TTSCP authentication goes as follows:

The user command should precede all TTSCP exchanges. Its parameter is "anonymous" or a local or configured user account name. Some other user names may acquire special meaning. We'll see.

Unless the account requires no authentication, this command should be immediately followed by a proper pass command; otherwise the session may be refused to issue most or all other commands.

If no user command is issued, "user anonymous" is assumed.

If the user doesn't exist, anonymous access is granted.

5.4 TTSCP Modules

Input and Output Modules

The input and output modules follow the same syntax conventions. If the module name begins with a $, the rest of the name is a data connection handle. If it begins with a slash, it is an absolute file name. Such absolute file names however form a name space distinct from that of the underlying operating system. In Epos, the name space is a single directory defined by the pseudo_root_dir option. It must be impossible to escape from such name space by inserting parent directory references in a file name or otherwise. A TTSCP implementation can decide to reject some or all file input and output modules with a 454 reply.
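
One way a server might confine file module names to such a name space is sketched below. It is not how Epos implements this: the function name resolve_in_pseudo_root is hypothetical, POSIX path syntax is assumed, escaping names are rejected rather than clamped, and symbolic links inside the directory are not examined, which a real server would also have to consider.

```python
import posixpath


def resolve_in_pseudo_root(pseudo_root, module_name):
    """Map a file module name such as "/dev/dsp" into the pseudo-root.

    Returns the normalized path below pseudo_root, or raises ValueError
    if the name is not a file module name or would escape the directory.
    """
    if not module_name.startswith("/"):
        raise ValueError("not a file module name: %r" % module_name)
    root = posixpath.normpath(pseudo_root)
    # Drop the leading slash so that join() keeps the root, then
    # collapse any "." and ".." components.
    candidate = posixpath.normpath(
        posixpath.join(root, module_name.lstrip("/"))
    )
    prefix = root if root.endswith("/") else root + "/"
    if candidate != root and not candidate.startswith(prefix):
        raise ValueError("file name escapes the pseudo-root: %r" % module_name)
    return candidate
```

With a pseudo-root of /srv/epos (an arbitrary example path), the name /dev/dsp maps to /srv/epos/dev/dsp, while /../etc/passwd is rejected.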

If the module name begins with a #, the rest of the name is a special input/output module identifier. The only identifier generally supported is localsound, which can only be used as an output module with the waveform type. Any waveform passed to this module should be played using the local soundcard. The 453 or 445 reply may be issued if the user is not allowed to use the soundcard, or no local soundcard exists, respectively.

The output data type of an input module and the input data type of an output module are determined by the respective adjacent modules. If input and output modules are directly connected, it is assumed that the data is a plain text.

The TSR data type can not be sent or received, and may thus be totally implementation and architecture dependent.

Processing Modules

At the moment there are only a few modules implemented that do real processing. All of them have fixed names and types.

name	input format	output format	notes
chunk	plain text	plain text	splits text
join	plain text	plain text	joins texts
raw	plain text	TSR	parses text
stml	STML text	TSR	parses STML
rules	TSR	TSR	apply rules
print	TSR	plain text
dump	TSR	SSIF	extract SSIF
diphs	TSR	segments	extract segments
syn	SSIF	waveform	speech synthesis
synth	segments	waveform	speech synthesis

Available processing modules
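
The compatibility rule for adjacent modules can be checked mechanically from the table above. The sketch below is illustrative only: the function name check_stream and the short type labels are hypothetical, explicit data type specifiers are not handled, and file names containing colons are not supported.

```python
# (input format, output format) for each processing module in the table.
MODULES = {
    "chunk": ("text", "text"),
    "join": ("text", "text"),
    "raw": ("text", "tsr"),
    "stml": ("stml", "tsr"),
    "rules": ("tsr", "tsr"),
    "print": ("tsr", "text"),
    "dump": ("tsr", "ssif"),
    "diphs": ("tsr", "segments"),
    "syn": ("ssif", "waveform"),
    "synth": ("segments", "waveform"),
}


def check_stream(stream):
    """Return (input type, output type) of a strm parameter.

    Raises ValueError if the stream lacks an input or output module,
    names an unknown processing module, or connects two modules whose
    data types do not match.
    """
    names = stream.split(":")
    if len(names) < 2:
        raise ValueError("a stream needs an input and an output module")
    source, processing, sink = names[0], names[1:-1], names[-1]
    for end in (source, sink):
        if end[:1] not in ("$", "/", "#"):
            raise ValueError("not an input/output module: %r" % end)
    if not processing:
        return ("text", "text")  # directly connected: plain text assumed
    for name in processing:
        if name not in MODULES:
            raise ValueError("unknown processing module: %r" % name)
    first_input = MODULES[processing[0]][0]
    current = first_input
    for name in processing:
        wanted, produced = MODULES[name]
        if wanted != current:
            raise ValueError(
                "%s expects %s but receives %s" % (name, wanted, current)
            )
        current = produced
    return (first_input, current)
```

The stream from the sample session, $zC-4EEl0:raw:rules:diphs:synth:/dev/dsp, passes this check and is reported as turning plain text into a waveform.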

`chunk`

The text is split into parts convenient for later processing. These parts usually correspond at least to whole utterances; it is correct not to split the text at all, but care must be taken not to cause a split which significantly alters the final rendering of the text.

`join`

It is customary to use the join module just after a chunk module. If this module receives two consecutive texts such that the chunk module would not split their concatenation between them, the join module may merge them to a single text, that is, it may silently drop the first subtask and prepend the text to the text acquired later. This delay may cross the boundary of an appl command.

`raw`

The input text is converted in a language dependent way to the TSR, assuming it is a plain text without any specific TTS escape sequences or other special formatting conventions. Except for tokenization and whitespace reduction the conversion should not try to process the text, especially not in a language dependent way; this goal doesn't seem to be always feasible.

`rules`

The voice dependent TTS or other rules are applied to a TSR.

`print`

The TSR is converted to a plain text representation, suitable as a user-readable output. The conversion should be as straightforward as possible and should not emit any special formatting character sequences. Ideally, the successive application of the raw and print modules should not significantly alter the text.

`dump`

...

`diphs`

This module extracts the segment layer from the input TSR into the linear segment stream format; the rest of TSR is discarded.

There may be an implementation-dependent limit on the size of the segment stream produced. If more segments should be produced, the module may emit more subtasks; see the appl command for discussion concerning subtask reporting.

`synth`

The input segment stream is synthesized in a voice dependent way.

Explicit Data Type Specifiers

Sometimes an ambiguity concerning the type of data passed at a certain point within the stream may occur. This is currently the case with streams consisting of input and output modules only (such as a stream to play out an audio icon from a waveform file to a sound card device); in the future, ambiguously typed versatile processing modules may be introduced, too. Sometimes the data type is semantically irrelevant (for example, a socket-to-socket forwarding stream), sometimes the default data type, that is, a plain text, is a reasonable choice. There are however instances where the type matters, like copying a waveform file to a sound card device: the waveform header must be stripped off and the appropriate ioctls must be issued to replay the raw waveform data with the appropriate sampling frequency, sample size and so on.

The data types can be expressed explicitly by inserting a pseudo-module into the stream at the ambiguous position. Failing that, the output data type of the preceding module and/or the input data type of the following module decides the data type at this point. Failing even that, the server will assume plain text data.

The pseudo-module name consists of a single letter enclosed in square brackets. The available data types are indicated by letters listed in the table of explicit data type specifiers.

name	data format
t	plain text
s	STML text
i	the server-internal text structure representation
p	SSIF
d	segments
w	waveform

Explicit data type specifiers

The data formats are described in the data formats subsection.
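
Recognizing such a pseudo-module amounts to a table lookup, as in the sketch below. The function name explicit_type is hypothetical, and the stream in the usage note is an invented example that assumes specifiers are separated from their neighbours by colons like any other module.

```python
import re

# Letters from the table of explicit data type specifiers.
TYPE_LETTERS = {
    "t": "plain text",
    "s": "STML text",
    "i": "TSR",
    "p": "SSIF",
    "d": "segments",
    "w": "waveform",
}


def explicit_type(module_name):
    """Return the data format named by a pseudo-module such as "[w]".

    Returns None if module_name is not a type specifier at all, and
    raises ValueError for a specifier with an unknown letter.
    """
    match = re.fullmatch(r"\[(.)\]", module_name)
    if match is None:
        return None
    letter = match.group(1)
    if letter not in TYPE_LETTERS:
        raise ValueError("unknown data type specifier: %r" % module_name)
    return TYPE_LETTERS[letter]
```

Splitting the hypothetical stream "/icon.wav:[w]:#localsound" at colons and applying explicit_type to each part identifies the middle element as a waveform specifier and returns None for the other two.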

5.5 Response Codes

Any server reply contains a numeric code, a single space, and some arbitrary newline-terminated text. The numeric code (three decimal digits) allows interfacing with simple to trivial clients, whereas the text (which is optional) is meant for possible user interaction.

The response codes are defined by the protocol, while the accompanying text is not, but it should rarely exceed 20 characters (clients should tolerate at least 76 characters plus the response code).

Every response code consists of the response class, the subclass and an extra digit. The response class drives the protocol states and reports errors. The subclass is interpreted depending on the response class; it can specify which component has reported an error or generated this particular response. Trivial clients may ignore this digit altogether. The third digit is merely used for distinguishing between messages of the same class and subclass and most clients are likely to ignore it in most situations.
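
Splitting a reply into these parts is straightforward, as the following sketch shows. The names Reply and parse_reply are hypothetical; lines that do not start with a three-digit code followed by a space or the end of the line are treated as data lines and returned with a code of None.

```python
import re
from typing import NamedTuple, Optional

_REPLY = re.compile(r"([0-9]{3})(?: (.*))?")


class Reply(NamedTuple):
    code: Optional[int]  # None for a data line without a response code
    text: str

    @property
    def response_class(self):
        """First digit of the code, or None for a data line."""
        return None if self.code is None else self.code // 100

    @property
    def subclass(self):
        """Middle digit of the code, or None for a data line."""
        return None if self.code is None else (self.code // 10) % 10


def parse_reply(line):
    """Parse one server line (without its line terminator)."""
    match = _REPLY.fullmatch(line)
    if match is None:
        return Reply(None, line)
    return Reply(int(match.group(1)), match.group(2) or "")
```

For "452 user not found" this gives code 452, response class 4 and subclass 5; a byte count line such as " 3622" comes back as a data line.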

Response Classes

Within TTSCP, nine response classes have been defined. Out of these, one is used for in-progress communication, four indicate the results of commands, and the remaining four are reserved for future extensions.

code	error type	suggested action
0xx	reserved	(server queries client?)
1xx	still OK	informative only
2xx	OK, command completed	transmit another command
3xx	reserved	notify user / ignore
4xx	command failed	transmit another command
5xx	reserved	notify user
6xx	connection terminated	notify user if unexpected
7xx	reserved	notify user
8xx	server crash or shutdown	notify user

TTSCP response classes

The client is expected to send another command whenever it receives a 2xx or a 4xx response, and not to send otherwise. The client should treat the connection as terminated whenever it receives any response with code 5xx or higher. It may also quit at any time just after sending a "done" command to the server; the server will however confirm that command with a reply of 600 before disconnecting.
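
These rules reduce to a three-way decision on the response class. A minimal sketch, with the function name next_action and the returned labels chosen only for illustration:

```python
def next_action(code):
    """Decide what a client does after receiving a response code.

    Returns "terminated" for 5xx and higher, "send" for 2xx and 4xx
    (the client may transmit another command), and "wait" otherwise
    (more replies are expected before the client may send).
    """
    response_class = code // 100
    if response_class >= 5:
        return "terminated"
    if response_class in (2, 4):
        return "send"
    return "wait"
```

In the sample session, 112 leads to "wait", 200 and 452 lead to "send", and 600 leads to "terminated".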

Replies of 8xx except 80x are reserved for cases of severe server misconfiguration, or detected programming bugs. Their meanings are very implementation dependent (implementations are encouraged not to issue them except in emergency). If such a reply is ever received, the server has abnormally terminated.

The messages accompanying 3xx and higher response codes are likely to be interesting to the user if there is one. Any message without an error code is a data flow primarily meant for the user if any; a sequence of these may occur only after some 1xx response, except for debugging messages if on.

At the moment, some error codes contain letters. Later, all of them will consist of digits only and will be space-terminated.

Response Subclasses

The subclass depends on the response class. The most interesting classes are 3xx and 4xx, i.e. errors, where the subclass indicates both the nature of the problem, and the suggested way of dealing with it (especially in the case of 4xx responses). The same meaning is attached to these subclasses also in case of 6xx and 8xx responses.

The middle digit of 1xx and 2xx responses still has no meaning attached (there are only a few such responses).

code	error type	suggested action
0	none	relax; assume user-initiated interruption
1	syntax	notify user
2	busy or timing	wait and retry
3	bad data	notify user
4	not found	notify user
5	access denied	notify user
6	server error	have user notify server author
7	network error	wait and retry

TTSCP error subclasses

Currently Defined Messages

111	daemon talks
112	apply task started
122	apply task total bytes count follows
123	apply task chunk bytes count follows
141	option value follows
142	data connection handle follows
200	daemon is happy and ready
211	access granted
212	anonymous access granted

TTSCP success codes as issued by Epos

401	intr command received
411	command not recognized
412	option passed illegal value
413	command too long, ignored
414	parameter should be a positive integer
415	no or bad stream
416	no parameter allowed
417	parameter missing
418	bad format or encoding
421	output voice busy
422	out of memory
423	nothing to interrupt
431	unknown character in text
432	received bad segments
435	received bad waveform (unused)
436	data connection disconnected
437	permanent read error
438	end of file
439	hw cannot handle this waveform
441	help not available
442	no such option
443	no such language or voice
444	invalid connection handle
445	could not open file
446	out of range (never issued)
447	invalid option value
448	cannot send woven pointery
451	not authorized to do this
452	no such user or bad password
453	not allowed to use localsound
454	not allowed to use filesystem input/output modules
456	input too long
461	input triggered server bug
462	unimplemented feature
463	input triggered configuration bug
464	input triggered OS incompatibility
465	i/o problem : error on close()
466	command stuck
467	fatal signal
471	tcpsyn received invalid waveform
472	unresolved remote tcpsyn server
473	unreachable remote tcpsyn server
474	remote tcpsyn server uses an unknown protocol
475	remote tcpsyn server returned error
476	remote tcpsyn server timed out

TTSCP error codes as issued by Epos

The 8xx class of responses (fatal errors) is still very unsettled and many of the codes listed there will later be removed or merged together. Applications should not try to decode them except possibly for the middle digit. The same goes for all x6x subclasses of errors (internal errors).

600	session ended normally
601	server reinitializing as requested
800	server shutting down as requested by client
801	error explicitly reported in config files
811	rules or dictionary file syntax
812	generic configuration file syntax
813	impossibilia referenced in config files
814	bad command line
841	cannot open necessary configuration file
842	no voices configured
843	up-to-date configuration files not found
844	no unicode or SAMPA maps found
861	internal error: impossible branch of execution
862	internal error: invariance violation
863	internal error: buffer overflow
864	insufficient capacity
865	server crashed, reason unspecified
869	double fault
871	network unreachable
872	server already running
881	too many syntax errors in rule files
882	infinite include cycle

TTSCP session termination codes as issued by Epos
