Speech Dispatcher roadmap discussion.


From: Trevor Saunders
Subject: Speech Dispatcher roadmap discussion.
Date: Tue, 14 Oct 2014 21:40:39 -0400

On Mon, Oct 13, 2014 at 10:45:05AM +0200, Bohdan R. Rau wrote:
> On 2014-10-10 02:13, Luke Yelavich wrote:
> >On Thu, Oct 09, 2014 at 10:50:35PM AEDT, Bohdan R. Rau wrote:
> >>
> >>I also have another suggestion, but that's a topic for the next mail :)
> >
> >Looking forward to hearing about it.
> 
> PART ONE
> 
> At first:
> 
> as some of the SSIP responses may change in the new version, we should provide
> compatibility with current versions of speech-dispatcher (both in the SSIP
> protocol and libspeechd). So the first step would be a new SSIP command,
> something like:
> 
> COMPAT_MODE On|Off

I don't really like On and Off, since it assumes we'll only change the
protocol once. Taking a version number as an argument, defaulting to 0, might
work OK. However, it seems like it might be simpler to just add new commands.

> (the default is On - meaning the server emulates the current version of the
> SSIP protocol, even with known bugs)
> 
> It should be safe - old applications will still work with the new version, and
> new applications can decide whether to continue if the COMPAT_MODE command is
> not implemented (i.e. we are connected to an old version of speech-dispatcher)
> or die (because some vital functions are not implemented).
> 
> Also, the new libspeechd should have a function:
> 
> int spd_compat_mode(SPDConnection *conn, int compatible);
> 
> Function should return 0 at success or -1 if error occurs.
> 
> Alternatively - we could provide spd_open_new and spd_open2_new functions.
> Future applications would use only the _new functions.
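The negotiation above can be pictured as a hypothetical SSIP exchange. The
response code (226) and its wording are placeholders invented for illustration,
not part of the existing protocol:

  COMPAT_MODE Off
  226 OK COMPAT MODE SET

A client that receives an error code instead would know it is talking to an
old server and could fall back or bail out, as described above.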
> 
> 1. Extending current event notifications
> 
> There is nothing new in the protocol. Simply, after the commands CHAR, KEY and
> SOUND_ICON the server should answer identically as after SPEAK, i.e.:
> 
> >CHAR x
> < 225-msg_id
> < 225 OK MESSAGE_QUEUED
> 
> Analogously, the library functions spd_key, spd_char, spd_wchar and
> spd_sound_icon should return a message id - but in compatibility mode they
> must return zero on success, because of possible code like:

or we can add functions like spd_char_msgid etc., which seems simpler to
explain.

btw why is spd_wchar a thing at all :( it seems like spd_char should
handle UTF-8 fine.

> if (spd_char(conn,character)) {
>     error();
> }
> 
> 2. New events
> 
> SYNC:
> 
> 706-msg_id
> 706-client_id
> 706-start_index
> 706-end_index
> 706 SYNCHRONIZED
> 
> The event is fired in SSML mode, when the module has SYNC mode enabled. It is
> similar to INDEX_MARK, but returns a pair of index mark names (or an empty
> string at the start or end of the text). It may be useful for applications
> highlighting the currently spoken text (book readers or applications for
> people with dyslexia). Both index names are used, because the module may
> ignore some marks. If the module's SYNC mode is disabled (for example the
> module has no SYNC capability), the event must be fired with empty
> start_index and end_index names.
> 
> Alternatively, there may be reserved mark names - for example "__begin__"
> and "__end__".
> 
> 
> AUTOSYNC:
> 
> 707-msg_id
> 707-client_id
> 707-start_offset-end_offset
> 707 SYNCHRONIZED
> 
> The event is fired in TEXT mode when the module has AUTOSYNC mode enabled. It
> is similar to the SYNC event, but relies on the module's capability of
> splitting the spoken text into smaller parts. It returns offsets (in bytes)
> of the beginning/end of the spoken text from the start of the message given
> in the SPEAK command.
> If the module's AUTOSYNC mode is disabled (for example the module has no
> AUTOSYNC capability), the event must be fired with 0 as both offsets.
> 
> AUTOPAUSE:
> 
> 708-msg_id
> 708-client_id
> 708-offset
> 708 STOPPED
> 
> The event is fired in TEXT mode when the module has AUTOPAUSE mode enabled
> and we explicitly request the autopause response from the server with:
> 
> SET self AUTOPAUSE On
> 
> It returns the length of the spoken part of the text (in bytes).
> If we don't request AUTOPAUSE, the server should automatically use the
> module's AUTOPAUSE response, store the remaining part of the text internally
> and respond 704 PAUSED to the client.
> 
> MOUTH:
> 
> 709-msg_id
> 709-client_id
> 709-width-height
> 709 MOUTH
> 
> The event is fired when, for example, a graphical application should redraw
> the mouth of a displayed face. Width and height are given in the range 0..100.
> Today's modules have no idea about mouth shapes, but as far as I know it's
> possible. The module must have MOUTH mode enabled.
> 
> In libspeechd it's necessary to rewrite callback system. My suggestion:
> 
> typedef void (*SPD_Callback)(int msg, int id, int event, void
> *user_data,...);
> 
> and retrieve the values with varargs.
> 
> For example:
> INDEX_MARK - one value of char *
> SYNC - two values of char *
> AUTOSYNC - two integers
> MOUTH - two integers
> AUTOPAUSE - one integer
> 
> Also, there must be functions like:
> 
> SPD_Callback *spd_register_callback(SPDConnection *conn, int event,
> SPD_Callback *callback, void *user_data);
> SPD_Callback *spd_unregister_callback(SPDConnection *conn,int event);
> 
> Of course these functions are valid only in non-compatibility mode!

Well, you can only call it if you assume a newer libspeechd than we have
today, so I'm not sure what the point of caring about compatibility on
vs off is.

> 3. Module output capabilities
> 
> SPEAK - module can speak
> FETCH - module can return synthesized wave to server
> FILE - module can save synthesized wave to file

the second two are basically indistinguishable, so why have both?

> 4. Module input capabilities
> 
> SSML - module can fully handle SSML and index marks;
> FLAT - module internally translates SSML into plain text. Index marks are
> lost, pause/resume is not implemented.
> PLAIN - module understands plain text (no SSML). Extra features (like
> AUTOPAUSE and AUTOSYNC) are possible only in this mode.

I'm not sure what the point of distinguishing between FLAT and PLAIN is;
any module can rip out all the SSML bits.  Anyway, this is more an
implementation detail than something exposed to clients.  Though maybe
it makes sense to tell clients whether a module can deal with SSML or not;
I'm not really sure.

> FLAT and SSML capabilities are mutually exclusive.
> The server should never send SSML data to a module reporting only the PLAIN
> capability.
> The server should always send SSML data when the module does not report PLAIN.
> The server should never internally encode plain text into SSML if the module
> reports PLAIN and any of the extra features (AUTOPAUSE, AUTOSYNC etc.) is
> enabled. Also, the server should never accept SSML data from an application
> if extra features are enabled (it's an application bug).

why?

> 5. Module extended capabilities:
> 
> SYNC - valid only in SSML mode. 706 SYNCHRONIZED events will be fired only
> if SYNC mode is enabled.
> 
> AUTOSYNC - valid only in PLAIN mode. 707 SYNCHRONIZED event will be fired
> only if AUTOSYNC mode is enabled. Requires simple NLP in module.

these events are different how?

> WORDSYNC - valid only in PLAIN mode. 707 SYNCHRONIZED event will be fired at
> word boundaries instead of phrase/sentence boundaries if WORDSYNC mode is
> enabled.
> 
> AUTOPAUSE - valid only in PLAIN mode. 708 STOPPED event will be fired only
> if AUTOPAUSE mode is enabled. This mode may be turned on automatically by
> server responding 704 PAUSED to client. Requires simple NLP in module.
> 
> SPEECH_MOUTH - module can fire 709 MOUTH events during speaking.
> FETCH_MOUTH - module can return mouth shapes to the server together with the
> speech wave.
> FILE_MOUTH - module can save mouth shapes together with the speech wave.
> 
> The MOUTH capabilities are separate, because it's relatively simple to add
> MOUTH events to an Mbrola-based module in FETCH/FILE mode, but real-time
> synchronization may be difficult.
> 
> A simple NLP (Natural Language Processing) component must be able to
> automatically split the given text into sentences (or - if the synthesizer
> can also speak parts of sentences - phrases). It may be trivial (splitting
> after each dot, exclamation mark or question mark followed by a space) or
> more sophisticated (like the phraser in the Milena NLP, which understands
> the context of dots and won't split after an abbreviation or ordinal number
> in Polish text). A trivial NLP should be part of the projected library for
> speech-dispatcher modules.

I'm unconvinced; it seems like that's a problem the synthesizer should
already be solving, so why should we duplicate that?

Trev

> 
> More about FETCH/FILE modes in next mail :)
> 
> ethanak
> -- 
> http://milena.polip.com/ - Pa pa, Ivonko!
> 