speechd-discuss

Speech Dispatcher roadmap discussion.


From: Trevor Saunders
Subject: Speech Dispatcher roadmap discussion.
Date: Wed, 15 Oct 2014 16:21:50 -0400

On Wed, Oct 15, 2014 at 12:33:39PM +0200, Bohdan R. Rau wrote:
> On 2014-10-15 03:40, Trevor Saunders wrote:
> >On Mon, Oct 13, 2014 at 10:45:05AM +0200, Bohdan R. Rau wrote:
> >>
> >>COMPAT_MODE On|Off
> >
> >I don't really like on and off since it assumes we'll only change the
> >protocol once.
> 
> 
> It was only a suggestion - for example there may be a command like:
> 
> PROTOCOL <number>
> 
> But I think there will be only one protocol change; further changes and
> protocols would be discovered with a CAPABILITY command.

I'd bet good money we'll want to change the protocol again some day.
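
For concreteness, a version negotiation might look something like this
(the commands and status codes here are purely illustrative, not current
SSIP):

  client: PROTOCOL 2
  server: 203 OK PROTOCOL SET
  client: LIST CAPABILITIES
  server: 251-SSML
  server: 251-SYNC
  server: 251 OK CAPABILITY LIST SENT

A client that never sends PROTOCOL keeps today's behaviour, and anything
newer gets negotiated per capability rather than as a single on/off switch.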

> >we can add functions like spd_char_msgid etc., which seems simpler to
> >explain.
> 
> If we assume the new protocol will be used in new applications - I see no
> reason to add a new function if the application (and library) always knows
> which version of the protocol we use.

I don't see any harm in adding new functions, especially when it makes it
clearer what they do.
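
For instance, the new call could just hand back the server-assigned message
id next to the old behaviour. A sketch (this function and its signature are
an assumption, not existing libspeechd API):

  /* Hypothetical: like spd_char(), but also reports the message id the
   * server assigned, so the client can match later events to this request. */
  int spd_char_msgid(SPDConnection *conn, SPDPriority priority,
                     const char *character, int *msg_id);

Old clients keep calling spd_char() and never notice the difference.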

> >btw why is spd_wchar a thing at all :( it seems like spd_char should
> >handle UTF-8 fine.
> 
> Of course - spd_char works fine (with some exceptions). But spd_wchar has
> nothing to do with UTF-8; it's used for raw Unicode code points, not for
> encoded strings. As for me - the spd_char function should be implemented as
> a wrapper around spd_wchar, something like:

I'd prefer we kept things as they are, with the protocol just being UTF-8.

> 
> int spd_char(SPDConnection *conn, char *str)
> {
>     int chr = get_unicode_character(str);
>     if (chr < 0) return -1;
>     return spd_wchar(conn, chr);
> }
> 
> Why? Because some modules may be inconsistent with the documentation. In
> theory we could pass a string of any length to spd_char and only the first
> character would be spoken. In fact, the espeak module says "null" if the
> string is longer than one UTF-8 character.

imo it's an error to pass more than one character to the char command, and
it's a bug in the espeak module to speak more, though really that should
just be handled by checking on the server side.
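
A minimal sketch of that server-side check, assuming the argument of the
CHAR command arrives as a UTF-8 string (the helper name is made up):

  #include <stdbool.h>
  #include <string.h>

  /* True iff s encodes exactly one UTF-8 character. */
  static bool is_single_utf8_char(const char *s)
  {
      size_t len = strlen(s);
      if (len == 0 || len > 4)
          return false;
      unsigned char c = (unsigned char)s[0];
      size_t expect = c < 0x80 ? 1
                    : (c & 0xE0) == 0xC0 ? 2
                    : (c & 0xF0) == 0xE0 ? 3
                    : (c & 0xF8) == 0xF0 ? 4 : 0;
      if (expect != len)
          return false;
      for (size_t i = 1; i < len; i++)   /* continuation bytes: 10xxxxxx */
          if (((unsigned char)s[i] & 0xC0) != 0x80)
              return false;
      return true;
  }

The server could then reject the request with an error response instead of
passing junk through to whatever module happens to be loaded.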

> But as spd_wchar seems to be completely broken today - it's a topic for
> future discussion.
> 
> 
> >>Also, there must be functions like:
> >>
> >>SPD_Callback *spd_register_callback(SPDConnection *conn, int event,
> >>SPD_Callback *callback, void *user_data);
> >>SPD_Callback *spd_unregister_callback(SPDConnection *conn, int event);
> >>
> >>Of course these functions are valid only in non-compatibility mode!
> >
> >Well, you can only call it if you assume a newer libspeechd than we have
> >today, so I'm not sure what the point of caring about compatibility on
> >vs. off is.
> 
> Have you ever seen an application without bugs? :)

I'm not sure what your point is.

> >
> >>3. Module output capabilities
> >>
> >>SPEAK - module can speak
> >>FETCH - module can return synthesized wave to server
> >>FILE - module can save synthesized wave to file
> >
> >the last two are basically indistinguishable, so why have both?
> 
> Please be patient and wait for the second part - I'll explain why in detail.

sure, though I'm not really convinced a temp file isn't the reasonable
way to implement sending audio back to the server.
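
Roughly what I have in mind for the temp-file route - the module entry point
and the surrounding plumbing are assumptions, just to show the shape:

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* Assumed module entry point: synthesize text into the given wav path. */
  extern int module_synthesize_to_file(const char *text, const char *path);

  /* Server side: get audio back from a module via a temporary file. */
  static int fetch_audio(const char *text, char *path_out, size_t path_len)
  {
      char tmpl[] = "/tmp/speechd-wave-XXXXXX";
      int fd = mkstemp(tmpl);
      if (fd < 0)
          return -1;
      close(fd);                /* the module reopens the file itself */
      if (module_synthesize_to_file(text, tmpl) < 0) {
          unlink(tmpl);
          return -1;
      }
      snprintf(path_out, path_len, "%s", tmpl);
      return 0;                 /* caller reads and then unlinks path_out */
  }

With that, FETCH is just FILE plus the server reading the file back, which
is why I'm not sure we need both capabilities.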

> 
> >
> >>4. Module input capabilities
> >>
> >>SSML - module can fully handle SSML and index marks;
> >>FLAT - module internally translates SSML into plain text. Index marks are
> >>lost, pause/resume are not implemented;
> >>PLAIN - module understands plain text (no SSML). Extra features (like
> >>AUTOPAUSE and AUTOSYNC) are possible only in this mode.
> >
> >I'm not sure what the point of distinguishing between FLAT and PLAIN is;
> >any module can rip out all the SSML bits.
> 
> Because in FLAT mode the string sent to the module may be different from
> the string sent to speech-dispatcher by the application, so the offsets
> returned by AUTOPAUSE and AUTOSYNC will be completely unusable.

We could translate back from offsets in the plain text to positions in the
SSML, so the client doesn't need to know whether the synth can deal with
SSML.
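
A sketch of that translation: while stripping tags, record which SSML offset
each plain-text byte came from. Naive - it ignores character entities - but
it shows the idea:

  #include <stddef.h>

  /* Fill plain with the tag-stripped text of ssml; map[i] is the offset in
   * ssml that plain[i] came from.  Returns the plain-text length.  Caller
   * provides buffers of at least strlen(ssml) + 1 elements. */
  static size_t strip_ssml(const char *ssml, char *plain, size_t *map)
  {
      size_t n = 0;
      for (size_t i = 0; ssml[i]; i++) {
          if (ssml[i] == '<') {            /* skip the whole tag */
              while (ssml[i] && ssml[i] != '>')
                  i++;
              if (!ssml[i])
                  break;
          } else {
              map[n] = i;
              plain[n++] = ssml[i];
          }
      }
      plain[n] = '\0';
      return n;
  }

An AUTOSYNC offset k in the plain text then maps back to map[k] in the
original document, so the client always gets positions in what it sent.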

> >Though maybe it makes sense to tell clients whether a module can deal
> >with SSML or not, I'm not really sure.
> 
> Yes. But if a module has extra features usable only in PLAIN mode, the
> application should have this information.

It seems weird to me that a module would only support something with plain
text, but maybe such a thing exists.

> 
> >>The server should never internally encode plain text into SSML if the
> >>module reports PLAIN and any of the extra features (AUTOPAUSE, AUTOSYNC,
> >>etc.) are enabled. Also, the server should never accept SSML data from
> >>the application if extra features are enabled (it's an application bug).
> >
> >why?
> 
> Because requesting features which are known to be impossible is a bug - or
> we have different ideas of what a bug is :)

I'm not convinced it is a bug; a simple client might not want to worry
about what synth it is using and what it supports.  I imagine simple
clients just want to spew SSML / plain text at speechd and let it
provide features as best it can.

> >
> >>5. Module extended capabilities:
> >>
> >>SYNC - valid only in SSML mode. 706 SYNCHRONIZED events will be fired
> >>only if SYNC mode is enabled.
> >>
> >>AUTOSYNC - valid only in PLAIN mode. 707 SYNCHRONIZED events will be
> >>fired only if AUTOSYNC mode is enabled. Requires a simple NLP in the
> >>module.
> >
> >these events are different how?
> 
> Both are intended for applications which need to know which part of the
> text is currently being spoken. SYNC works in SSML mode and uses index
> marks predefined by the application. AUTOSYNC works in PLAIN mode and
> returns offsets, which may be used, for example, to highlight the spoken
> text.

I'll agree it's useful to know the speaking position, but I wonder if we
need to expose two different ways of dealing with it to clients.
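
To make the two shapes concrete, the events might look something like this
on the wire (the payload formats are illustrative, nothing here is settled):

  706-chapter2_start     SYNC: name of a client-defined SSML index mark
  706 SYNCHRONIZED

  707-128 164            AUTOSYNC: byte offsets of the fragment being spoken
  707 SYNCHRONIZED

If the server kept a text-to-SSML offset map it could probably synthesize
one kind of event from the other, and clients would only need to learn one.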

> 
> Example of an application: a multi-language epub reader. The application
> has only a vague idea where a sentence ends, and if the module (specialized
> for a particular language) knows better - why not use its knowledge?
> 
> >>A simple NLP (Natural Language Processor) must be able to automatically
> >>split the given text into sentences (or - if the synthesizer can also
> >>speak parts of sentences - phrases).
> 
> >I'm unconvinced; it seems like that's a problem the synthesizer should
> >already be solving, so why should we duplicate it?
> 
> Because synthesizers are for synthesis, not for dealing with grammatical
> problems.

that may be the technical definition, but in practice I think most of
the synthesis packages out there do both; I'm pretty sure espeak does,
and I think pico / ibmtts / festival variants all do too.

> Example: Mbrola is a synthesizer (i.e. Mbrola implements the DSP phase of
> TTS).
> 
> Of course - most synthesizers have some internal NLP, but it's used only
> for the synthesizer's internal purposes. My Milena is an exception; it
> uses something like:
> 
> while (*input_string) {
>     char *sentence = get_sentence(&input_string);
>     say(sentence);
>     free(sentence);
> }
> 
> So it's possible to get the position of the currently spoken sentence from
> Milena, and we can use it to highlight the spoken text or to determine the
> byte offset where speech was paused.
> 
> But Milena is not a synthesizer - in fact it's a text-to-speech system
> with sophisticated NLP specialized for a single language, and the backend
> synthesizers may differ (currently Mbrola and Ivona are implemented).
> 
> I know my suggestions may be a little strange, but you have to take into
> account that I want to change the way of thinking about speech-dispatcher.
> 
> Currently:
> speech-dispatcher is used by visually impaired users and as a speech
> backend for screen readers.
> 
> My dream:
> 
> Visually impaired users are very important to speech-dispatcher developers,
> but speech-dispatcher should also be usable as a general-purpose speech
> synthesis backend for different applications (like SAPI on Windows). A
> screen reader is one example of a very important application, but it's not
> the only application using speech-dispatcher.

I think the difference is more in what synthesizers we care about; I agree
speech dispatcher has many uses.
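
For what it's worth, a naive get_sentence() in the spirit of the loop quoted
above might look like this - purely illustrative, certainly not Milena's
actual splitter, and a real NLP needs abbreviation handling at the very
least:

  #include <ctype.h>
  #include <string.h>

  /* Copy the first sentence out of *input and advance *input past it.
   * Sentences are assumed to end at '.', '!' or '?' followed by
   * whitespace or the end of the string.  Caller frees the result. */
  static char *get_sentence(char **input)
  {
      char *s = *input, *end = s;
      while (*end) {
          if ((*end == '.' || *end == '!' || *end == '?') &&
              (end[1] == '\0' || isspace((unsigned char)end[1]))) {
              end++;
              break;
          }
          end++;
      }
      char *sentence = strndup(s, (size_t)(end - s));
      while (isspace((unsigned char)*end))
          end++;
      *input = end;
      return sentence;
  }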

Trev

> 
> Example:
> Imagine a well-sighted eighteen-wheeler driver who carries several cases
> of beer from Dallas to New Orleans, listening to the long email sent to
> him by his fiancée :)
> 
> >Trev
> 
> ethanak

