speechd-discuss

[Speechd] KTTS and Sentence Boundary Detection


From: Gary Cramblitt
Subject: [Speechd] KTTS and Sentence Boundary Detection
Date: Mon Sep 4 09:59:48 2006

As I mentioned in my other email, I'd like to comment on the following and 
point out how KTTS addressed these issues.  This is largely academic, since 
the plan is to remove Sentence Boundary Detection (SBD) from KTTS once Speech 
Dispatcher provides the functionality we need.  It may still be of interest 
to you, however.

On Wednesday 04 May 2005 05:18 pm, Hynek Hanke wrote:
> There is a number of issues we found with cutting text into sentences in
> Speech Dispatcher and sending just sentences to the output modules and
> synthesizers.
>
> 1) The TTS has no possibility to do more advanced syntactic analysis. It is
> only allowed to operate on one sentence.

In practice, none of the current synths alter the speaking attributes of one 
sentence based on surrounding sentences.  It is true that syntactic analysis 
does help Festival decide where sentence boundaries are.  If we assume 
that KTTS puts sentence boundaries at the same places as Festival (and for 
the most part it does), then the end result is the same.

>
> 2) We need to handle language dependent issues in a project (Speech
> Dispatcher, KTTSD) that should be language independent.

KTTS addresses differences in SBD for different languages by using a modular 
plugin architecture.  In theory, a language with its own sentence punctuation 
conventions can be handled by a separate SBD filter.  In 
practice, so far, only the Polish language has needed a separate SBD filter, 
mostly because Polish Festival incorrectly "speaks" punctuation characters.  
("This is a sentence." is spoken as "This is a sentence period" -- in Polish 
of course.)  Since SBD is implemented using regular expressions, the Polish 
SBD filter was a simple matter of changing the regular expression to remove 
the sentence punctuation while simultaneously breaking the input into 
sentences.
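
To illustrate the idea of regex-based SBD with a punctuation-stripping 
variant, here is a minimal sketch in Python.  It is not the KTTS code: the 
pattern, the function name split_sentences and the keep_punctuation flag are 
all assumptions made for this example, and the real KTTS regular expressions 
may differ.

```python
import re

# Hypothetical pattern (not taken from KTTS): one or more sentence-ending
# punctuation marks, followed by whitespace or the end of the text.
SENTENCE_END = re.compile(r'([.?!]+)(?:\s+|$)')

def split_sentences(text, keep_punctuation=True):
    """Split text into sentences at SENTENCE_END matches.

    With keep_punctuation=False the terminating punctuation is dropped,
    which mimics a filter for a synth that would otherwise speak it aloud.
    """
    sentences = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        # Cut either after or before the punctuation run.
        end = match.end(1) if keep_punctuation else match.start(1)
        sentence = text[start:end].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # text with no terminating punctuation
    return sentences
```

The same loop serves both cases; only the cut position changes, which is 
roughly why a language-specific variant can be a small change.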

>
> 3) How to cut SSML or other markup into sentences?

In KTTS, we addressed this with an eye towards the ability to advance and 
rewind by sentence.  To achieve this, the SSML input is parsed using an XML 
parser.  <p> and <s> tags are obviously interpreted as sentence boundaries.  
Within text and CDATA Section nodes, the same regular expression as is used 
for plain text is used to decide where sentence boundaries are.  Once the 
position of sentence boundaries is determined, each sentence is output with a 
complete set of SSML tags.  In this way, each sentence gets a complete SSML 
context, so that when rewinding and advancing, no information is lost.  For 
example, the following input SSML

<speak lang="en">
This is a sentence.  So is this.  This <prosody rate="fast">is spoken
fast</prosody>.  <p>This is the fourth sentence.</p>
</speak>

becomes

<speak lang="en"><voice gender="neutral" age="40"><prosody pitch="medium" 
range="medium" rate="medium" volume="medium"> This is a 
sentence.</prosody></voice></speak><speak lang="en"><voice gender="neutral" 
age="40"><prosody pitch="medium" range="medium" rate="medium" 
volume="medium">So is this.</prosody></voice></speak><speak lang="en"><voice 
gender="neutral" age="40"><prosody pitch="medium" range="medium" 
rate="medium" volume="medium">This </prosody></voice><voice gender="neutral" 
age="40"><prosody pitch="medium" range="medium" rate="fast" 
volume="medium">is spoken fast</prosody></voice><voice gender="neutral" 
age="40"><prosody pitch="medium" range="medium" rate="medium" 
volume="medium">.</prosody></voice></speak><speak lang="en"><voice 
gender="neutral" age="40"><prosody pitch="medium" range="medium" 
rate="medium" volume="medium">This is the fourth 
sentence.</prosody></voice></speak>

In the case of Festival, SSML is then converted into SABLE tags using an XSLT 
conversion.

This works out pretty well.  Some of the current limitations are 1) we don't 
handle SSML "relative" attributes (<prosody rate="+10">), 2) Festival seems 
to have trouble with voice attributes, so we strip them out when converting 
to SABLE, and 3) we don't handle the <say-as> tag, which isn't fully defined 
in the SSML spec anyway.
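
The per-sentence SSML splitting described above can be sketched roughly as 
follows.  This is a simplified illustration in Python, not the KTTS 
implementation: it handles only <speak>, <prosody>, <p> and <s>, omits the 
<voice> wrapper, applies the boundary regex to each text run separately (so 
boundaries that straddle tags are only approximated), and the names flatten, 
render and split_ssml are invented for this example.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Assumed default prosody context, mirroring the "medium" values above.
DEFAULTS = {"pitch": "medium", "range": "medium",
            "rate": "medium", "volume": "medium"}
SENTENCE_END = re.compile(r'[.?!]+(?:\s+|$)')  # hypothetical pattern
BREAK = None  # marker for a boundary forced by <p> or <s>

def flatten(elem, ctx, runs):
    """Collect (text, prosody context) runs in document order."""
    if elem.tag == "prosody":
        ctx = {**ctx, **elem.attrib}  # inner attributes override outer ones
    forced = elem.tag in ("p", "s")
    if forced:
        runs.append(BREAK)
    if elem.text:
        runs.append((re.sub(r'\s+', ' ', elem.text), ctx))
    for child in elem:
        flatten(child, ctx, runs)
        if child.tail:
            runs.append((re.sub(r'\s+', ' ', child.tail), ctx))
    if forced:
        runs.append(BREAK)

def render(sentence, lang):
    """Wrap one sentence (a list of runs) in a complete SSML context."""
    parts = []
    last = len(sentence) - 1
    for i, (text, ctx) in enumerate(sentence):
        if i == 0:
            text = text.lstrip()
        if i == last:
            text = text.rstrip()
        if text:
            attrs = " ".join(f"{k}={quoteattr(v)}" for k, v in sorted(ctx.items()))
            parts.append(f"<prosody {attrs}>{escape(text)}</prosody>")
    return f"<speak lang={quoteattr(lang)}>{''.join(parts)}</speak>"

def split_ssml(ssml):
    root = ET.fromstring(ssml)
    lang = root.get("lang", "en")  # default "en" is an assumption
    runs = []
    flatten(root, dict(DEFAULTS), runs)
    sentences, current = [], []

    def flush():
        # Emit the pending runs as a sentence unless they are only whitespace.
        if any(text.strip() for text, _ in current):
            sentences.append(list(current))
        current.clear()

    for run in runs:
        if run is BREAK:
            flush()
            continue
        text, ctx = run
        pos = 0
        for match in SENTENCE_END.finditer(text):
            current.append((text[pos:match.end()], ctx))
            flush()
            pos = match.end()
        if text[pos:]:
            current.append((text[pos:], ctx))  # sentence continues in next run
    flush()
    return [render(s, lang) for s in sentences]
```

Because every returned string is a complete <speak> document, any sentence 
can be sent to the synth on its own, which is what makes rewinding and 
advancing by sentence straightforward.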

We handle HTML by first making sure it is valid XHTML and then using XSLT to 
convert it to SSML.

>
> 4) How to cut data that are not an ordinary text (program source code, ...)

Once again, the plugin filter architecture of KTTS can address this.  We 
currently handle C/C++ text (we assume each EOL is a sentence boundary).  In 
practice, speaking code has lots of other problems because of punctuation and 
"words" that are not in the lexicon, which tend to confuse Festival quite a 
bit, so there is a lot of work that still needs to be done on this.
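
For comparison with the plain-text case, the "each EOL is a sentence 
boundary" rule amounts to something like the following Python sketch.  The 
function name is made up, and skipping blank lines is an assumption of this 
example rather than documented KTTS behaviour.

```python
def split_code_lines(source):
    """Treat every non-blank line of source code as one 'sentence'."""
    return [line.strip() for line in source.splitlines() if line.strip()]
```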

>
> 5) It makes the output module much more complicated if good performance is
> of concern. It's necessary to already have sent for synthesis the next
> sentence before the previous one is spoken in the speakers so that the TTS
> doesn't sit idle. Sentences of different length may cause unnecessary
> delays.

Since KTTS is designed for each synth to return a wav file, the synths can be 
kept busy working 3 or 4 sentences ahead while KTTS simultaneously outputs a 
sentence to the audio device.  Hence, sentences of different lengths are not 
a problem.  The first sentence begins speaking quickly because only a single 
sentence is sent to the synth for parsing and synthesis.  In practice, the 
synths sit idle most of the time because KTTS builds up a queue of 3 or 4 
sentences that have already been synthesized while the first sentence is 
still being heard on the audio device.  As you know, synthesizing a sentence 
takes much less time than playing it back.
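
The work-ahead arrangement described above is essentially a bounded 
producer/consumer queue.  The Python sketch below shows the general shape 
only; it is not how KTTS is written, and fake_synthesize, speak and the 
lookahead parameter are hypothetical stand-ins for a synth plugin that 
returns audio data and an audio player.

```python
import queue
import threading

def fake_synthesize(sentence):
    """Stand-in for a synth plugin; returns fake 'audio' bytes."""
    return f"<audio for: {sentence}>".encode()

def speak(sentences, synthesize=fake_synthesize, play=print, lookahead=4):
    """Synthesize up to `lookahead` sentences ahead of the one being played."""
    ready = queue.Queue(maxsize=lookahead)
    done = object()  # sentinel marking the end of the input

    def producer():
        try:
            for sentence in sentences:
                # put() blocks once `lookahead` results are waiting, so the
                # synth idles instead of racing arbitrarily far ahead.
                ready.put(synthesize(sentence))
        finally:
            ready.put(done)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    while True:
        audio = ready.get()
        if audio is done:
            break
        play(audio)  # playback overlaps with synthesis of later sentences
    worker.join()
```

Only the first sentence has to be synthesized before playback can begin, and 
the bounded queue keeps the synth from working further ahead than needed.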

That said, SBD does add a small delay before the first sentence is spoken, and 
the larger the input, the longer the delay.  SSML input adds additional 
delay.  The ideal synth would not perform SBD on the entire input before it 
begins speaking the first sentence, but I don't believe Festival is so 
optimized.  Does Speech Dispatcher do something to solve this problem?  If 
the synth were so optimized, I imagine it would be problematic to provide 
advance/rewind capability within the synth.

One of the things I'm looking forward to when we integrate KTTS with Speech 
Dispatcher is improved performance, since I know you've worked hard to 
address that.

Thanks for listening.

-- 
Gary Cramblitt (aka PhantomsDad)
KDE Text-to-Speech Maintainer
http://accessibility.kde.org/developer/kttsd/index.php

