
From: Hynek Hanke
Subject: [Speechd] KTTS and SpeechD integration
Date: Mon Sep 4 09:59:48 2006

Hello Gary,

thank you for your detailed description of the goals and remaining issues.
I'll try to comment on the latter.

1) Synchronization with client applications

> If a synth cannot return a wav file, the next ideal 
> plugin asynchronously speaks a message, sending directly to audio device, and 
> notifies KTTS when the speech output is finished.  

I think this is the way we can go. Speech Dispatcher currently doesn't support
notification, but many things are already prepared so that this can be
implemented. Speech Dispatcher has all the necessary information internally, so
the only remaining issue is how to communicate this information to the client
application. Once that is solved and implemented, Speech Dispatcher will notify
client applications not only about the beginning and the end of speaking a
message, but also about reaching index marks, which you can insert into SSML
messages (when the end synthesizer allows it).

I'll work on it now. I'm going to post a separate email about this soon.
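
To make the idea concrete, here is a minimal sketch in Python of what the
client side of such notification could look like. All names here are invented
for illustration -- the real interface is exactly the thing that remains to be
designed:

    from enum import Enum

    class Event(Enum):
        BEGIN = 1       # speaking of the message has started
        END = 2         # speaking of the message has finished
        INDEX_MARK = 3  # an index mark inside the message was reached

    class Client:
        def __init__(self):
            self.callbacks = {}  # message id -> callback(event, mark_name)

        def register(self, msg_id, callback):
            self.callbacks[msg_id] = callback

        def on_server_event(self, msg_id, event, mark_name=None):
            # Called by the connection code whenever the server reports
            # an asynchronous event for one of our messages.
            callback = self.callbacks.get(msg_id)
            if callback:
                callback(event, mark_name)

    # Example: a client tracking its reading position in an SSML message.
    client = Client()
    client.register(42, lambda event, mark: print(event.name, mark))
    client.on_server_event(42, Event.INDEX_MARK, "chapter-2")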

2) Sentence boundary detection

> KTTS parses text into individual sentences and sends them one at a time to 
> the 
> synth plugin.  This is key in order to provide:

We were also doing this in Speech Dispatcher some time ago, but it turned out
not to be the best approach, for several reasons I'll explain below. I believe
festival-freebsoft-utils now offers a better solution which, in connection
with index marking, could address these goals too.

There are a number of issues we found with cutting text into sentences in
Speech Dispatcher and sending just the sentences to the output modules and
synthesizers. (A toy illustration of some of them follows the list.)

1) The TTS has no possibility to do more advanced syntactic analysis, because
it is only allowed to operate on one sentence at a time.

2) We need to handle language dependent issues in a project (Speech Dispatcher,
KTTSD) that should be language independent.

3) How to cut SSML or other markup into sentences?

4) How to cut data that are not ordinary text (program source code, ...)?

5) It makes the output module much more complicated if good performance is a
concern. The next sentence must already have been sent for synthesis before
the previous one has finished playing in the speakers, so that the TTS doesn't
sit idle, and sentences of different lengths may cause unnecessary delays.
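
Here is a toy illustration in Python of points (2), (3) and (4); the regex
splitter is deliberately simplistic, but it is roughly what a
language-independent component outside the TTS can do:

    import re

    def naive_split(text):
        # Split after '.', '!' or '?' followed by whitespace.
        return re.split(r'(?<=[.!?])\s+', text.strip())

    # (2) Abbreviations are language dependent: "Dr." is no sentence end.
    print(naive_split('Dr. Smith arrived. He sat down.'))
    # -> ['Dr.', 'Smith arrived.', 'He sat down.']

    # (3) Cutting SSML at sentence boundaries leaves unbalanced markup.
    print(naive_split('<speak>Wait. <mark name="m1"/> Go now.</speak>'))
    # -> ['<speak>Wait.', '<mark name="m1"/> Go now.</speak>']

    # (4) "Sentences" in program source code are meaningless fragments.
    print(naive_split('printf("Done. OK."); return 0;'))
    # -> ['printf("Done.', 'OK."); return 0;']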

We found it would be much better, and also more natural, to pass the whole
text to the TTS and let the TTS do this job instead. The TTS needs to cut the
text into smaller chunks anyway, because of index marks and for performance
reasons, but it can do so only after the necessary SSML parsing and syntactic
analysis have been done, it can do so in a language-dependent manner, and it
can do so according to its own taste (which allows TTS programmers to
implement algorithms for better performance etc.).

It should not be a problem to pass the whole text into the TTS at once as long
as:

1) The TTS is able to return partial results as soon as they are available (so
that we don't have to wait until the whole text is synthesized).

2) The TTS provides synchronization information (index marks).

3) The TTS is able to start synthesizing the text from an arbitrary index mark
in the given (complete) text.

Currently, (1) and (2) are provided by festival-freebsoft-utils. Milan, what
do you think about (3)?
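
To make the contract explicit, here is a minimal sketch in Python of a TTS
interface providing (1)-(3). All names are invented; this is not the actual
festival-freebsoft-utils interface:

    from dataclasses import dataclass
    from typing import Iterator, Optional, Union

    @dataclass
    class AudioChunk:   # (1) a partial result, returned as soon as available
        samples: bytes

    @dataclass
    class MarkReached:  # (2) synchronization information
        name: str

    def synthesize(ssml_text: str,
                   start_mark: Optional[str] = None,
                   ) -> Iterator[Union[AudioChunk, MarkReached]]:
        """Yield audio chunks and index-mark events incrementally.

        (3) When start_mark is given, synthesis starts at that index
        mark within the complete text instead of at the beginning.
        """
        raise NotImplementedError  # the backend-specific part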

Let me explain how these three would address your needs.

>   1.  Ability to advance or rewind.

The output module knows the position in the text from the index marks it
receives (2). It can skip sentences forwards or backwards by sending the whole
text again together with the identification of the index mark it wants to
start from, according to (3).

>   2.  Ability to intermix with higher-priority messages.

When a higher-priority message comes, the output module knows the position
from the last received index mark (2), so it can instruct the TTS to start
playing from there again (3) once the higher-priority message has been spoken.

>   3.  Ability to change voice or synth in the middle of a long job.

This is very similar to the previous point.
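
Building on the synthesize() sketch above (play_audio stands for an assumed
audio backend), all three points reduce to remembering the last index mark and
calling synthesize() again:

    class OutputModule:
        def __init__(self, ssml_text):
            self.text = ssml_text
            self.last_mark = None  # position, updated from (2)

        def play(self, start_mark=None):
            for event in synthesize(self.text, start_mark):  # (1) and (3)
                if isinstance(event, MarkReached):
                    self.last_mark = event.name  # track the position
                else:
                    play_audio(event.samples)    # assumed audio backend

        def resume(self):
            # After a higher-priority message, or after a voice or synth
            # change: send the whole text again, starting from the last
            # index mark that was reached.
            self.play(start_mark=self.last_mark)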

>   4.  Notification to apps of start/end of each sentence as well as text job 
> as a whole.

Yes, this is very desirable and (2) ensures it's possible.

Currently, this is implemented between Speech Dispatcher and Festival in the
following way. Speech Dispatcher still does some basic sentence boundary
detection, but only to insert a kind of private index marks (in addition to
the index marks inserted by the client application). Ideally, this too should
somehow be handled by Festival, or whatever other TTS is in use, so that in
the future Speech Dispatcher doesn't have to do anything with the text itself.
The whole text (in SSML) is then sent to Festival for synthesis, and the
Festival function (speechd-next) is repeatedly called to retrieve the newly
synthesized audio or the information that an index mark has been reached.
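
In rough Python-style pseudocode (festival_send_text, festival_call,
play_audio and report_mark are assumed helpers; only (speechd-next) is the
real function name), the retrieval loop looks like this:

    def speak_ssml(ssml_text):
        festival_send_text(ssml_text)  # assumed: submit the whole SSML text
        while True:
            result = festival_call('(speechd-next)')
            if result.kind == 'audio':
                play_audio(result.data)    # newly synthesized samples
            elif result.kind == 'index_mark':
                report_mark(result.name)   # notify clients, remember position
            elif result.kind == 'end':
                break                      # the whole text has been spoken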

3) Priority models

> KTTS Type                SpeechD Type
> ----------------         ---------------------
> Text                     Message
> Message                  Important
> Warning                  Important

This mapping would go strongly against the philosophy of Speech Dispatcher,
because the priority important is really only to be used for short important
messages, not for ordinary messages. This is to prevent ``pollution'' of the
queues with important messages, which can be neither discarded nor postponed
by any other priority.
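
A toy model in Python of just this property (the real Speech Dispatcher queue
semantics are richer than this) shows the problem with the proposed mapping:

    from collections import deque

    # Simplified rule from the paragraph above: nothing may discard or
    # postpone priority important; lower priorities can be displaced.
    queues = {'important': deque(), 'message': deque(), 'text': deque()}

    def next_to_speak():
        for priority in ('important', 'message', 'text'):
            if queues[priority]:
                return queues[priority].popleft()

    # Mapping ordinary KTTS text jobs to important floods the one queue
    # that can never be interrupted, postponed or cleared:
    for chunk in ('a long', 'ordinary', 'text job...'):
        queues['important'].append(chunk)
    queues['important'].append('Battery critically low!')  # waits behind it all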

Let me stop for a moment and explain our view of message management in
accessibility. We don't imagine an accessible computer for the blind as just a
screen reader; we also think about the concept of what we currently call an
application reader. I'll explain the difference.

A screen reader is a general purpose tool which you can use to do most of your
work, and it should provide a reasonable level of access to all applications
that don't have some kind of barrier in themselves.

A screen reader typically sees only the surface of what is actually contained
in the application. It might be customized to some degree for a particular
application, but it still doesn't have access to some information that the
application knows.

So it might make sense to build accessibility into some applications directly,
so that these accessibility solutions can take advantage of the particular
details of the application and of the information that is available when
operating *inside* the application.

I'll give two examples. 

One of them is what we are currently doing with Emacs in speechd-el. speechd-el
knows the origin of each message, so it can decide which information is and
which isn't important for the user (according to the user's configuration, of
course) and take advantage of the advanced priority model. It knows the
language and character encoding of buffers, so it can switch languages
automatically. It can perform other actions that a general purpose screen
reader couldn't. Maybe Milan Zamazal can give better examples.

Another example is the GNU Typist package. If you are not familiar with it,
it's a very good tool to help people learn touch typing. While in the previous
case of Emacs, building accessibility into the application directly instead of
relying on a general purpose screen reader was a convenience, for GNU Typist
it is an absolute need. I can't imagine how you would learn touch typing just
by reading the application with a general purpose screen reader -- you need
dictation, you need notification about errors, and so on, something that a
general purpose screen reader without any understanding of the application
can't provide.

Of course these specialized solutions, which we call application readers,
should not be hardwired into the main program. Rather, they should be an
extension or an additional layer built on top of it.

Let's return to the priority model. Now you understand that once we get past
this earliest stage in the GUI, we will arrive at the situation we already
have in text mode: there will typically be more clients connected to the
high-level TTS API in use (KTTSD, Speech Dispatcher), not just the screen
reader. Some of them might be able to make good use of the whole priority
system, some not.

Now you can see why we don't strictly separate the screen reader from other
clients in Speech Dispatcher, and why we don't have any priority that an
application could use to take full and complete control over everything said.
Rather, applications are supposed to use priorities like text and message, and
when a situation arises where some more important message must be spoken
(whether originating in these applications or somewhere else), then this
message *should* pass through and the original messages will be postponed or
discarded.

Thinking in these terms should, I believe, be the basis for our future work.
Obviously, the priority system in Speech Dispatcher is not optimal and we will
have to continue working on it.

Milan explained the ideas behind the different priorities in a separate email;
I have tried to explain the motivation for them. I very much welcome your
comments on all of this.

I really like the ``long text'' idea. Maybe the nicest behavior would be if
every incoming higher-priority message just paused the reading of the ``long
text'' and resumed it again after the higher-priority message was spoken. This
way, a user could be reading an ebook while still being able to listen to time
and new-email notifications, or to jump into the mixer to adjust the sound
volume, etc. What do you think of it?
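
With the OutputModule sketch from section 2 above, this behavior is only a few
lines (speak_immediately is an assumed helper):

    def interrupt_long_text(long_text_module, message):
        # Pause the book, speak the notification, then resume the book
        # from the last index mark that was reached.
        speak_immediately(message)  # assumed: speaks the priority message
        long_text_module.resume()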

4) Conclusions

> Given these issues, I cannot presently move forward with integrating SpeechD 
> with KTTS.  If I had my way, SpeechD would offer callbacks to notify when 
> messages have been spoken.  

It will.


What to do with KTTSD ``Screen Reader'' priority is still an open issue.


Thank you & please send your comments,
Hynek


