speechd-discuss
eSpeak - added punctuation and capitals indications


From: Gary Cramblitt
Subject: eSpeak - added punctuation and capitals indications
Date: Mon Sep 4 09:59:51 2006

On Saturday 22 April 2006 06:14, Jonathan Duddington wrote:
> eSpeak is a compact open-source text-to-speech engine which I hope to
> adapt to work well with Speech Dispatcher.
>
>
> I've added the ability to indicate punctuation and capital letters in
> the text by either speaking their name, or by playing a sound.
>
> I'm not sure whether what I've done is exactly what's wanted, so if
> anyone wants to try it and let me know of suggestions for improvements,
> please do so.
>
> eSpeak text-to-speech is at:
>    http://espeak.sourceforge.net
>
> download the file:  test-1.09d-linux.zip

I downloaded 1.10, which I see you posted just yesterday.  I will try to 
answer some of your questions, but frankly, as a newcomer to Speech 
Dispatcher, I'm a bit fuzzy on the punctuation, spelling, and capitalization 
features myself.

>
> The ReadMe file inside gives the details.  It can speak either all
> punctuation or just a specified set of punctuation characters.  It can
> speak their names, or you can set up sound files (sound icons) to be
> played instead.
>
> Capital letters can be indicated by a sound, or by the word "capital",
> or by raising the pitch of the capitalized word.

As you are probably aware, at the moment Speech Dispatcher runs espeak via its 
generic output module.  There is currently no mechanism in SD for driving your 
new -k and --punct command line options to control speaking of 
capitalization/punctuation, or for modifying the espeak data files on the fly.  
A -k or --punct option can be configured in the generic-espeak.conf file, but 
that setting applies to all speech; it does not change from one message to the 
next as the application switches modes via the

SET SELF PUNCTUATION
SET SELF SPELLING
SET SELF CAP_LET_RECOGN

SSIP commands.  We could probably enhance the generic module to support this 
better, but see below.
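
To make that concrete, here is roughly what the relevant part of 
generic-espeak.conf looks like.  I am quoting loosely from memory, so take the 
directive name, the substitution variables, and the shell quoting as 
approximate rather than authoritative:

    GenericExecuteSynth \
      "echo '$DATA' | espeak --stdin -v $VOICE -s $RATE -k 20 --punct \
       -w $TMPDIR/espeak.wav && play $TMPDIR/espeak.wav"

Whatever -k or --punct values are written into that one command template are 
applied to every message; nothing in it changes when a client issues, say, 
SET SELF PUNCTUATION all for one message and SET SELF PUNCTUATION none for 
the next.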

>
> Also the feature of embedding commands within the text has been updated.
>
> Questions which I'm unsure about:
>
> 1.  Should end-of-line be indicated?

Good question.  AFAIK, festival does not currently speak an EOL, so my guess 
is "no", but perhaps others on this list have a better answer.

>
> 2.  What about apostrophes within words. Currently these are not
> indicated when speaking text since that would disrupt the pronunciation
> of the word.

That seems correct to me, but again perhaps others have a better answer for 
you.

>
> 3.  The punctuation name is spoken in a slightly different tone from
> the main text, to differentiate it.  Is that OK?

Excellent.

>
> 4.  The actual names for punctuation characters are defined in the
> data/english_list file, so these can be changed if needed (then do
> speak --compile).
>
> 5.  If the text is spoken at a fast rate, should the sound icons also
> be shortened in duration?

That's something I've wondered about myself.  Anybody?  And see my comments 
below.

>
> 6.  What is the best value for the pitch raise which indicates
> capitals?  This is currently adjustable with the -k option to allow
> experimentation.
>
> 7.  How should multiple capitals in a word be indicated?  Or a capital
> which is not the first character of a word?  Or does that only need to
> be considered when speaking letters individually (spelling)?
>
> 8.  Have I misunderstood the whole point of this, and punctuation and
> capital indications are only needed when spelling out individual
> characters?

Capitalization, punctuation, and spelling modes are independent, although I 
myself am fuzzy on just how these modes are supposed to interact when 
combined.

I downloaded espeak yesterday because I wanted to see what would be involved 
in writing an SD output module for it.  The goal would be to take full 
advantage of espeak's capabilities.  The main problem I noticed is that you do 
not provide a library or api for interfacing directly with espeak.  
Everything is done through the command line and manipulation of configuration 
files.  This creates two problems:

1.  The command line interface is too crude for controlling speech.

2.  The espeak program must be loaded (and must go through its 
initialization) for each message spoken.  Fortunately, espeak is very fast 
and lightweight, but this is not the most efficient mechanism.

You are aware, I believe, that we are currently discussing a new TTS Engine 
API on address@hidden.  Hynek and I are hoping that this new 
API will be approved soon, whereupon we will begin a major refactoring of 
Speech Dispatcher to use this api.  Therefore, it would be best if you could 
design an api for espeak that would be aligned with that specification.  You 
do not have to implement that exact api, but you do need to provide as much 
of the functionality it requires as possible.  The more you implement in your 
api, the less emulation we will have to layer on top, and the better the user 
experience.

Some of the major changes I would like to see in espeak are:

1.  Support for SSML.  I noticed that you now support embedded commands for 
controlling rate, pitch, volume, etc.  In order to use these, SD would have 
to parse the SSML itself and translate it into your embedded command syntax.  
That would be inefficient and probably imperfect.  It would be better if 
espeak supported SSML directly (the sketch after this list assumes SSML 
input).

2.  Espeak needs to return audio directly to SD.  Writing a .wav file to disk 
is inefficient.  A more direct method, such as callbacks or socket I/O, would 
be better.

3.  Support for index marking.  Ideally, espeak should provide callbacks 
and/or index mark information for the following:

a.  Begin/end of entire message.
b.  Begin/end of sentence.
c.  Begin/end of word.
d.  Custom index marks as <mark> tags in SSML.

The index mark information needs to be synchronized with the audio.  If you 
play audio yourself, then you would emit callbacks at the appropriate times 
just before or after playing the corresponding word, sentence, or message.  
If you provide audio directly back to SD, then you supply the index mark 
positions and timings along with the audio data.  See the TTS Engine API for 
a suggested format.  Again, you don't have to do it exactly as given in the 
TTS Engine API, but looking at the spec will tell you what we need to know.  
For example, you could implement the index mark events as separate callbacks 
(see the sketch after this list).  As long as those callbacks are 
synchronized with the audio callbacks, i.e., audio - index mark - audio - 
index mark, etc., we can construct the information we need.

4.  Support for stop, pause, and resume from a specified index mark position.
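
To make items 1-3 a little more concrete, here is a very rough sketch, in C, 
of the shape of interface I have in mind.  Every name and signature below is 
invented for illustration; nothing like this exists in espeak today, and the 
TTS Engine API spec, not this sketch, is what to align with:

    /* Hypothetical sketch only: one synth call that streams audio and
     * index-mark events back through callbacks, interleaved in the
     * order they occur in the audio stream. */

    typedef enum {
        EVT_MSG_BEGIN, EVT_MSG_END,
        EVT_SENTENCE_BEGIN, EVT_SENTENCE_END,
        EVT_WORD_BEGIN, EVT_WORD_END,
        EVT_MARK                    /* <mark name="..."/> from SSML */
    } espeak_event_type;

    /* Called with each chunk of synthesized audio (e.g. 16-bit PCM). */
    typedef int (*espeak_audio_cb)(const short *samples, int n_samples,
                                   void *user_data);

    /* Called when an index mark is reached; sample_position is the
     * offset from the start of the message, so SD can line the mark
     * up with the audio it has already received. */
    typedef int (*espeak_event_cb)(espeak_event_type type,
                                   const char *mark_name,
                                   unsigned int sample_position,
                                   void *user_data);

    /* Synthesize one (SSML) message; callbacks fire in stream order:
     * audio, event, audio, event, ... */
    int espeak_synth_text(const char *ssml_text,
                          espeak_audio_cb on_audio,
                          espeak_event_cb on_event,
                          void *user_data);

With something like that, a <mark name="m1"/> in the SSML input would arrive 
at SD as an EVT_MARK event carrying "m1" and a sample offset, and SD could 
implement stop, pause, and resume (item 4) by remembering the last mark or 
word boundary it had actually played.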

BTW, sorry about the previous reply.  Accidentally clicked on the Send 
button. :)

Thank you for espeak and thanks for listening.

-- 
Gary Cramblitt (aka PhantomsDad)
_______________________________________________
Speechd mailing list
address@hidden
http://lists.freebsoft.org/mailman/listinfo/speechd


