speechd-discuss

symbolic voice-types versus synthesis voices


From: Tomas Cerha
Subject: symbolic voice-types versus synthesis voices
Date: Mon, 08 Nov 2010 21:01:00 +0100

On 8.11.2010 13:39, Andrei Kholodnyi wrote:
>> But does this diversity matter?  If these diverse names are exposed to the
>> end user, I think it is still better than exposing nicely aligned symbolic
>> names, which carry no information (except for the gender).  The client can
>> also expose voice properties to the user if this is implemented (and
>> available).
> 
> each synth has its own convention for the voices naming, e.g.
> espeak:
>                      NAME                 LANGUAGE                  VARIANT
>                   default                       en                     none
>               en-scottish                       en                       sc
>                   english                       en                       uk
>                lancashire                       en                 uk-north
>                english_rp                       en                    uk-rp
>             english_wmids                       en                 uk-wmids
>                english-us                       en                       us
>             en-westindies                       en                       wi

Well, Espeak is a very special beast here.  It in fact has just one voice (one
set of recorded data).  The voices listed by espeak are actually different rule
sets applied to this one basic voice.  Such specifics should be handled within a
particular output module and reported to Speech Dispatcher in a manner
consistent with other synths.
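As a rough illustration of handling this inside an output module, the sketch below turns a three-column NAME/LANGUAGE/VARIANT listing like the eSpeak one quoted above into uniform records, mapping the literal `none` to a missing value. `VoiceRecord` and `parse_listing` are hypothetical names, not part of Speech Dispatcher.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VoiceRecord:
    """One voice as an output module might report it (hypothetical type)."""
    name: str
    language: str
    variant: Optional[str]  # None when the synth reports no variant


def parse_listing(text: str) -> List[VoiceRecord]:
    """Parse a whitespace-separated NAME LANGUAGE VARIANT listing."""
    records = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not exactly three columns.
        if len(fields) != 3 or fields[0] == "NAME":
            continue
        name, language, variant = fields
        records.append(
            VoiceRecord(name, language, None if variant == "none" else variant)
        )
    return records


listing = """
    NAME        LANGUAGE  VARIANT
    default     en        none
    en-scottish en        sc
    english-us  en        us
"""

for record in parse_listing(listing):
    print(record)
```

Whatever the synth's native listing looks like, the rest of the system then only ever sees `VoiceRecord` values.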

> pico:
>                  samantha                       en                    en-US
>                    serena                       en                    en-GB

Yes, this is a more typical example.

> as you can see VARIANT differs between them, e.g. you have
>                  english-us                       en                       us
>                  samantha                       en                    en-US
> which is the same variant, but written differently.
> This means that if an app wants to search for "US English", it doesn't
> know what to search for.
> 
> LANGUAGE also differs; you might have e.g. 3-letter codes:
>             greek-ancient                      grc                     none
> 
> Now my question is: do we want to introduce a consistent voice naming
> convention for SD?
> We could leave e.g. language names as is (though there is a chance of
> name clashes between synths),

I don't think naming must be consistent, but voice properties must definitely
be reported consistently.  By a name, I mean a unique human-readable voice
identifier.  It doesn't need to be unique across synthesizers, as it can always
be exposed to the user in combination with the synth name; that is quite
natural.  We can't avoid a situation where two synths provide a voice of the
same name, such as "Samantha".  To me it seems OK for the user to have choices
like "Pico/Samantha", "Pico/Serena", "Festival/Samantha".  IMO this is still
better than having to select from "Pico/female-1", "Pico/female-2",
"Festival/female-1".  If someone likes Pico's Samantha voice, he would
recommend it to a friend by that name rather than by some normalized
identifier.
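A minimal sketch of this naming scheme, assuming voices are keyed by the pair (synthesizer, native voice name): two synths may both offer a "Samantha", and the label shown to the user stays unambiguous. The function `qualified_labels` is hypothetical, and "Festival/Samantha" is the illustrative example from the paragraph above, not a real Festival voice.

```python
from typing import Iterable, List, Tuple


def qualified_labels(voices: Iterable[Tuple[str, str]]) -> List[str]:
    """Turn (synth, native voice name) pairs into labels like "Pico/Samantha".

    A native name only has to be unique within one synthesizer, so the same
    name may appear under two synths; only an exact (synth, name) repeat is
    rejected.
    """
    seen = set()
    labels = []
    for synth, name in voices:
        if (synth, name) in seen:
            raise ValueError(f"duplicate voice {synth}/{name}")
        seen.add((synth, name))
        labels.append(f"{synth}/{name}")
    return labels


print(qualified_labels([("Pico", "Samantha"), ("Pico", "Serena"),
                        ("Festival", "Samantha")]))
# ['Pico/Samantha', 'Pico/Serena', 'Festival/Samantha']
```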

> but IMO it would be good to "normalize" LANGUAGE and VARIANT at least.
> That would allow proper searching.

Sure.  All voice properties must have normalized meanings and values.  Output
modules must map the synth-specific properties to the normalized ones.  Some
synths will not support all the properties (for example, the module will not
be able to determine the age of a particular voice), so this must also be
considered.
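One possible mapping, sketched under the assumption that the normalized form is a BCP 47-style tag (language plus uppercase region), makes the two spellings of US English from Andrei's example compare equal. The `ESPEAK_REGIONS` table and `normalize_tag` function are hypothetical and cover only the variant values quoted in this thread; unknown eSpeak variants are kept as a private-use subtag rather than dropped.

```python
from typing import Optional

# Hypothetical table mapping eSpeak variant strings to region codes.
# Only values quoted in this thread are covered; anything else falls back to
# a private-use subtag ("x-...") so that no information is lost.
ESPEAK_REGIONS = {"us": "US", "uk": "GB"}


def normalize_tag(synth: str, language: str, variant: Optional[str]) -> str:
    """Combine synth-specific LANGUAGE and VARIANT into one tag like "en-US"."""
    language = language.lower()
    if variant is None or variant == "none":
        return language
    if synth == "espeak":
        region = ESPEAK_REGIONS.get(variant)
        return f"{language}-{region}" if region else f"{language}-x-{variant}"
    # Pico-style variants such as "en-US" already carry language and region.
    lang_part, _, region = variant.partition("-")
    return f"{lang_part.lower()}-{region.upper()}" if region else language


print(normalize_tag("espeak", "en", "us"))     # en-US
print(normalize_tag("pico", "en", "en-US"))    # en-US
print(normalize_tag("espeak", "grc", "none"))  # grc
```

A search for "US English" then becomes a plain comparison against `"en-US"`, regardless of which synth reported the voice.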

> I just thought that we might probably map names to something like
> "spd-voice-NN" or "male-en-NN",
> which is not much worse than e.g. "english-us" :D,

I'm not sure whether you mean this for some sort of internal identifiers or for
names exposed to the user.

I am a little confused about whether we actually agree here :-)  But IMO we
need something like this:

Client:
  LIST VOICES
Server:
  1 Samantha
  2 Serena
Client:
  VOICE PROPERTIES 1
Server:
  LANG: en
  VARIANT: US
  GENDER: female
  AGE: 25

So the user can see the native voice name and its properties, and select based
on either the name or the properties.  Both of them may be important for the
user.
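The exchange above is a proposal rather than an existing command set. As a sketch, assuming the server answers with `<number> <name>` lines and `KEY: value` lines exactly as shown, a client could parse both replies and let the user pick by name or by properties. All function names are hypothetical, and Serena's property reply is invented for illustration, with AGE left out to mimic a module that cannot determine it.

```python
from typing import Dict


def parse_voice_list(reply: str) -> Dict[int, str]:
    """Parse lines such as "1 Samantha" into {1: "Samantha"}."""
    voices = {}
    for line in reply.strip().splitlines():
        number, name = line.split(maxsplit=1)
        voices[int(number)] = name.strip()
    return voices


def parse_properties(reply: str) -> Dict[str, str]:
    """Parse lines such as "LANG: en" into {"LANG": "en"}."""
    props = {}
    for line in reply.strip().splitlines():
        key, _, value = line.partition(":")
        props[key.strip()] = value.strip()
    return props


def matches(props: Dict[str, str], **wanted: str) -> bool:
    """True if every requested property equals the reported one.

    A property the module did not report (e.g. AGE) is treated as
    unknown and does not exclude the voice.
    """
    return all(props.get(key, value) == value for key, value in wanted.items())


voices = parse_voice_list("1 Samantha\n2 Serena")
properties = {
    1: parse_properties("LANG: en\nVARIANT: US\nGENDER: female\nAGE: 25"),
    2: parse_properties("LANG: en\nVARIANT: GB\nGENDER: female"),
}

by_name = [n for n, name in voices.items() if name == "Samantha"]
by_props = [n for n, p in properties.items() if matches(p, LANG="en", VARIANT="US")]
print(by_name, by_props)  # [1] [1]
```

Treating a missing property as "unknown" rather than "no match" is a design choice; a stricter client could exclude such voices instead.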

Hope it is clear what I mean now.

Best regards, Tomas


