
[gnuspeech-contact] Participating in gnuspeech


From: David Hill
Subject: [gnuspeech-contact] Participating in gnuspeech
Date: Sat, 10 Nov 2007 13:37:27 -0800

Hi Ravi,

On Nov 6, 2007, at 1:05 PM, Ravi Ahluwalia wrote:

Hi David,

I'm thinking about working with gnuspeech for my bachelor thesis, but I
don't know how much time I would have to spend porting it to GNUstep.

Much of the work of porting Monet has already been done.  The source on the gnuspeech Savannah web site is set up for conditional compilation for either the Mac or GNU/Linux using GNUstep.  The big hurdle is providing a component in GNUstep that replaces the CoreAudio components used on the Mac, so that output can be produced on-line in real time, instead of having to save to a file and use another app to produce the sound output.
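
To give a concrete (if trivial) picture of what that conditional compilation looks like, here is a minimal sketch.  It is illustrative only: GNUstep-make conventionally defines GNUSTEP and Apple's compilers predefine __APPLE__, but the actual gnuspeech sources may key off different symbols.

  /* cond_sketch.c -- illustrative only; the real sources may use other symbols. */
  #include <stdio.h>

  int main(void)
  {
  #if defined(GNUSTEP)
      printf("GNUstep build: audio would go to a GNUstep back end\n");
  #elif defined(__APPLE__)
      printf("Mac OS/X build: audio would go through Core Audio\n");
  #else
      printf("Unknown platform: fall back to writing a sound file\n");
  #endif
      return 0;
  }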

Greg Casamento is now the lead on the GNUstep project, and would be the best person to ask about this, and you would be contributing to the GNUstep project for this part.  It may prove to be the major resource consumer in the work you would like to do and I don't have a good idea of the time involved.  A wild guess might be 6 weeks full time, but it depends how long it might take to be "inducted" into GNUstep.

I have copied this to Greg so he knows about your interest and you have his email address (in the header).


How many days do you suppose it would take, if you work 8 hours a
day on it?  For example, in order to get
-Monet

See above.  Once the CoreAudio equivalent is done, there shouldn't be much left other than getting the bugs out of the code additions placed to make the dual compilation possible.  Greg did the original work and would again be the person to ask.  However, as you may realise from earlier correspondence, Greg was planning to refactor the Monet code because the current organisation leaves something to be desired (as might be expected with a first cut at the port).  How much do you know about refactoring?  There's a good book by Martin Fowler, "Refactoring: Improving the Design of Existing Code" (Addison-Wesley, 2000, ISBN 0-201-48567-2).  That would add some time to the exercise if it were done prior to any GNUstep port (as would be wise), and it would depend on who did it.  My guess is two months, but see below at "say".

-Tube-Model

The tube model is written in 'C' and really doesn't need porting.  It is simply included as an element in the Objective-C source.  No time.

-Synthesizer and

I have currently got Synthesizer somewhat ported to the Mac (say 60% -- I think my earlier estimate of 70% was quite optimistic and I'd rather be pessimistic).  The GNUstep port should not take more than two weeks once Synthesizer is working on the Mac, except that the same problem arises with Synthesizer as with Monet.  On-line, real-time output is needed, so the part-CoreAudio development raises its slightly ugly head again.  Assuming that had been done, and the Mac version was working, two weeks.  The Mac version probably needs another three months to finish, but it could be less.  Total time 3.5 months.

-say

This requires the speech daemon to be completed, callable using "Services" or the command line, which means a stripped-out
version of Monet and a means of invoking it and passing text.  Dealing with this would, IMO, best proceed in concert with the refactoring of Monet, since a similar understanding of Monet is required for both tasks, and the requirements of the daemon would help guide the refactoring.  Then additions would have to be made to allow it to compile under GNUstep.  My guess is 5-6 months, but some of this overlaps with the refactoring and port of Monet, and I am not the best person to say how much.  If the overlap were 50%, which is again a guess, it would mean the two tasks would take 6-7 months.

to run.
I wonder why gnuspeech isn't already ported to GNUstep by other people.
Do you have an idea why there is so little work on it?

(a) Speech output is thought to be a "solved problem" (it isn't); (b) dealing with text-to-speech requires a whole range of skills (computing, phonetics, phonology, linguistics, ...) not often found in one person; (c) computer sound output tends to be something of a Cinderella.  How many computers have fast 3-D graphics built in compared to how many have excellent hi-fi audio and speech built in?

Then there's the question of what people are interested in doing.  Getting really high-quality speech output, or excellent speech recognition, requires solving some AI-hard problems in computer science (knowledge representation and use), and some really hard problems in linguistics (speech articulation dynamics, rhythm, intonation, and the relation of these to meaning and intention).  If people realised the problems that need to be solved, they might take a lot more interest in speech input and output for computers.

If you want to work on gnuspeech, you don't have to work on everything, of course.  I don't know how much time you need for your bachelor's thesis, but I would imagine it would be weeks rather than months.

All good wishes.

david


Thank you for the information.

Regards,

Ravi
-----Original Message-----
From: David Hill [mailto:address@hidden]
Sent: Friday, 2 November 2007 23:12
To: Ravi Ahluwalia
Subject: Re: AW: [gnuspeech-contact] Participating gnuspeech


Hi Ravi,




On Nov 2, 2007, at 12:13 PM, Ravi Ahluwalia wrote:


Hi David,


I think articulatory speech synthesis is a great approach to synthesizing
natural-sounding speech with natural formant transitions and durations
of phonemes.  Articulatory synthesis needs more CPU power, due to the
frequent calculations of the continuously changing transfer function of
the vocal tract.  Please correct me if I am wrong.  Therefore Trillium
TTS makes strong use of the DSP in the NeXTstation, which runs at only
20 MHz.  I am wondering why 20 MHz is enough, because the latest
approaches to articulatory synthesis do not synthesize in real time and
run slowly.  For example, the synthesizer from ... does not
work in real time; you have to wait several seconds until the
synthesizer is finished.  Do you have an idea what the reason for that is?


Well, if you want to drive a vehicle powered by an internal combustion
engine, you can get some (e.g. farm tractors) that go maybe 15 miles an
hour, or others (e.g. a Lamborghini Reventon) with a top speed of over
200 miles an hour (funnily enough, Lamborghini started by building
tractors!! ;-)


To answer your question.  First, the CPUs back in those days were not
equipped with the kinds of instructions useful for signal processing
operations, whereas the DSP processors were built specifically to be good
at those kinds of operations.  These days, CPUs have the additional
instructions.  The original tube model was written in optimised DSP code
which ran in real time for tube lengths greater than about 15 cm.  The
C-code version, which was used to develop the algorithms, did not run in
real time on the old 25 MHz 68040 CPU in the NeXT cube.


When the original Trillium software was developed, we not only used the
DSP and its specialised instruction set, we also spent a lot of effort
figuring out efficient ways of controlling the tube model and optimising
the code.  Even now the underlying tube model in the current Macintosh
port is the original C-code version, driven from Objective-C by the
text-to-parameter conversion portion (see below), and this whole system
will run in real-time on the normal (host) CPU, rather than needing a
specialised DSP, because modern CPUs typically run at over 2GHz as well
as having the necessary instructions.


The version that has been ported to the Macintosh runs in real time even
on the older PowerPC machines at perhaps 800 MHz or less.




Today DSPs, e.g. from Texas Instruments, run at a minimum of 300 MHz.
That would probably be enough to let gnuspeech synthesize in real time?
Do you think it would be possible to let gnuspeech run on an embedded
device, with a modern DSP, and work as a TTS?


As noted above, you really don't need a DSP these days, and there's more
to creating speech from text than running the articulatory model.  You
have to turn the punctuated text into phonetic strings, and add some
reasonable rhythm and intonation (all done according to rules) and then
create the parameter tracks needed to drive the articulatory model using
more rules that figure out the phonetic (posture) targets, and how one
posture links dynamically to the next posture, in the overall context of
other postures, timing and pitch variation.  It would not make much
sense to run this part of the process on a DSP.
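
If it helps to see the shape of that process, here is a schematic sketch
in C of the stages just described.  Every type and function name in it is
a placeholder I have invented for illustration -- none of them are actual
gnuspeech/Monet interfaces -- and the stubs do no real work.

  /* tts_pipeline_sketch.c -- schematic only; all names are invented. */
  #include <stddef.h>
  #include <stdio.h>

  typedef struct {           /* phonetic symbols plus stress/foot/tonic marks */
      const char *postures;
  } PhoneticString;

  typedef struct {           /* time-stamped targets: tube radii, pitch, velum... */
      float  *frames;
      size_t  frame_count;
  } ParameterTracks;

  typedef struct {           /* the digital speech waveform */
      short  *samples;
      size_t  sample_count;
  } Waveform;

  /* The stages described above, in order (stubbed out here). */
  static PhoneticString text_to_phonetics(const char *text)   /* dictionary + rules */
  { PhoneticString p = { text }; return p; }

  static void add_rhythm_and_intonation(PhoneticString *p)    /* foot/tonic/pitch rules */
  { (void)p; }

  static ParameterTracks build_parameter_tracks(const PhoneticString *p)
  { (void)p; ParameterTracks t = { NULL, 0 }; return t; }     /* posture targets + transitions */

  static Waveform run_tube_model(const ParameterTracks *t)    /* the articulatory model */
  { (void)t; Waveform w = { NULL, 0 }; return w; }

  int main(void)
  {
      PhoneticString  phon   = text_to_phonetics("Hello, world.");
      add_rhythm_and_intonation(&phon);
      ParameterTracks tracks = build_parameter_tracks(&phon);
      Waveform        speech = run_tube_model(&tracks);
      printf("%zu samples produced (stubs, so zero)\n", speech.sample_count);
      return 0;
  }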


Have you checked out the gnuspeech web site?




All good wishes.


david




Thank you for any information,


Ravi Ahluwalia

-----Original Message-----
From: David Hill [mailto:address@hidden]
Sent: Saturday, 29 September 2007 22:04
To: Ravi Ahluwalia
Subject: Re: [gnuspeech-contact] Participating gnuspeech

Hi Ravi,

On Sep 29, 2007, at 12:08 PM, Ravi Ahluwalia wrote:

Dear members,

I'd like to know what I need to get Monet to run under Linux and to
participate in gnuspeech.  I suppose I should first start with a working
GNUstep environment.

That would be necessary.

What is the ideal Linux distribution, so that I
can go straight forward in getting acquainted with Monet and the software
synthesizer SoftwareTRM?  I know there are ready-made packages for Debian;
can I set up a GNUstep development environment with them?  Should I
compile all the stuff by myself?

That question would probably be best answered by Greg Casamento -- a
GNUstep guru and project leader.  He is now the GNUstep maintainer.
Check:





However, the GNU/Linux port of gnuspeech is stalled because there is no
appropriate method of dealing with the sound output as part of the app.
At present, you would have to dump the output in a file and then "play"
the file.

I attach a copy of a reply I sent to Ken Beesley a while back when he
enquired about the status of gnuspeech because it gives you a good idea
of where we are and what needs doing.

The Monet system, which speaks text according to the gnuspeech system
rules & data, currently runs on the Macintosh under OS/X, having been
ported by Steve Nygard.  To get it running under GNU/Linux and GNUstep will
require creating some equivalent of the parts of Core Audio that are
used in the Mac version.  Audio output from GNUstep is currently
extremely limited.  I have discussed this with Greg Casamento (a project
member), who is the GNUstep guru, but he has not got the spare time to
work on it.  He passed the buck to Robert Slover (who has suitable
experience and is also a project member) but, like all of us, he is also
pressed for time and I fear no progress has yet been made.
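
For what it is worth, the missing piece on the GNU/Linux side boils down
to being able to push a stream of synthesized samples to the sound device
as they are produced.  The following is a minimal sketch of that idea
using ALSA as one possible back end.  It is my illustration only, not the
project's plan: a GNUstep-level replacement for the Core Audio pieces
would need to wrap something like this (or another audio API) behind a
proper streaming interface.

  /* alsa_stream_sketch.c -- illustration only: plays one second of a
   * 220 Hz tone standing in for synthesized speech samples.
   * Build with: cc alsa_stream_sketch.c -lasound -lm */
  #include <alsa/asoundlib.h>
  #include <math.h>

  int main(void)
  {
      const unsigned int rate = 22050;     /* arbitrary sample rate for the demo */
      static short buffer[22050];
      snd_pcm_t *pcm;

      for (unsigned int i = 0; i < rate; i++)        /* fake "speech" samples */
          buffer[i] = (short)(3000 * sin(2.0 * 3.14159265 * 220.0 * i / rate));

      if (snd_pcm_open(&pcm, "default", SND_PCM_STREAM_PLAYBACK, 0) < 0)
          return 1;

      /* Mono, 16-bit little-endian, interleaved, ~100 ms device latency. */
      if (snd_pcm_set_params(pcm, SND_PCM_FORMAT_S16_LE,
                             SND_PCM_ACCESS_RW_INTERLEAVED,
                             1, rate, 1, 100000) < 0)
          return 1;

      /* In a real port this write would sit in a loop, fed block by block
       * as the tube model produces output. */
      snd_pcm_writei(pcm, buffer, rate);
      snd_pcm_drain(pcm);
      snd_pcm_close(pcm);
      return 0;
  }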

Apart from Monet, which is highly interactive and really intended as a
language-creation tool for the gnuspeech database (so it is therefore
far more complex than necessary for a speaking daemon), we need
"real-time Monet", which would be a stripped-down version of Monet that
would simply provide the speech service as the required daemon.  I
think the attached email copy gives you a good account of all the parts,
and there is a nice overview diagram on the gnuspeech project home page
accessed via:


All the sources (new -- i.e. the Mac OS/X/GNUstep port; and ancient --
i.e. from the original NeXT implementation) are in the gnuspeech CVS
repository, and the intention is to have a single source that will
compile for either Mac OS/X using Xcode, or GNU/Linux under GNUstep.
Some of the work has already been done but, without a means of handling
an audio output stream elegantly, progress on the GNU/Linux version
under GNUstep is, as I say, stalled.

Please get back with any questions or suggestions you may have.  Working
on the audio output problem, and stripping Monet to a daemon are the two
subprojects that I see as most relevant to making gnuspeech available as
a service.

I have been working on a port of the "Synthesizer" app, which is a
useful tool for understanding the articulatory (tube) model, and
essential for creating new language databases and improving the existing
database (though not essential for the speech service), but I too have
been distracted by other things for nearly a year.  I expect to get back
to that shortly, having returned from an offshore trip.

I hope this helps.

--------
David Hill, Prof. Emeritus,
CS Dept, U. Calgary,
Calgary, AB T2N 1N4 Canada

OR

Imagination is more important than knowledge (Albert Einstein)

Kill your television!

----------------

On Apr 10, 2007, at 2:39 PM, Kenneth Reid Beesley wrote:

Dear Gnuspeech team,

What's the status of the Gnuspeech port?

Hi Ken,

I hope that others on this list will point out any errors I have made in
what follows (which is my first real attempt to document what is going
on with respect to porting all the original Trillium TextToSpeech
software efforts!).  All of the original NeXT sources for the software
are available on the gnuspeech project site at:


by going to "-Browse Sources Repository" and clicking down to the
gnuspeech/gnuspeech/trillium directory.

gnuspeech is being ported both to GNU/Linux and to the Macintosh under
OS/X.  There are a number of components/apps/modules which have to be
ported.  Some have been ported.  The current state is as follows:

Monet:
the interactive language database and testing tool used to create the
original databases for English Text-To-Speech using the new articulatory
model of the vocal tract (the "tube model" -- basically a wave-guide or
lattice filter that emulates the properties of the acoustic tube
directly rather than through the use of formant filters etc).  Monet
translates its symbolic input into a digital waveform representing the
"spoken" version of the input.  Monet was originally designed developed
by Craig Schock (based on an original specification by David Hill), with
testing and suggestions for improvements by David Hill and Leonard
Manzara) as proprietary research software used in-house for the
development of  the Trillium Sound Research "TextToSpeech package
offered on the NeXT computer.  It also was available as part of the
Trillium "Experimenter" kit.  On the demise of NeXT (whose remains were
bought by Apple Computer), Monet, and all other Trillium software was
reconfigured as a GNU project (gnuspeech) and made available to the
community under a General Public Licence (http://www.gnu.org/copyleft/)
and can be found at the web site
Monet and other components, click on "-Browse Sources Repository" under
"Development Tools".  Monet is there under ".../current/Applications"
but requires the tube model and other components to compile (see
".../Frameworks and ".../Tools"  under "current".  At present, complete
compilation is only possible under Macintosh OS/X 4.3 or later, though
the sources are being modified to compile under GNUstep as well and this
may introduce certain minor glitches in the Mac OS/X compilation from
time to time.  The big hold-up in getting full compilation under GNUstep
is the lack of suitable audio output facilities under GNUstep.
Compilation under Mac OS/X uses Core Audio and the plan is to implement
the needed components of Core Audio for GNUstep.  Two people concerned
with the ongoing GNUstep development (Greg Casamento -- the Chief
GNUstep maintainer -- and Robert Slover) have been considering the
problem.  Both have been extremely busy -- especially Greg after taking
over as Chief on the GNUstep project.  The implementation is on Robert's
"to-do" list.  Until then, those wishing to try out Monet and do further
development will have to work on the Mac using the source which is
designed to compile under either OS/X Xcode/Interface Builder or under
GNUstep.  The Mac port is pretty well complete except for a few items
such as modifying the intonation patterns for the automatically
generated speech, and was done by Steve Nygard following his experience
at OmniGroup.  Steve had worked on the original NeXT implementation for
Trillium whilst he was at the University of Calgary.  Monet's emulation
of the human vocal tract depends on research carried out by Fant and his
colleagues at the Speech Technology Lab at KTH, Stockholm on formant
sensitivity analysis, and by René Carré at the ENST Dept. of Signals in
Paris on the "Distinctive Region Model" for controlling the artificial
vocal tract.

The tube model:
This was originally a 'C' implementation of the tube model that forms the
core of the synthesis system, and was created by Leonard Manzara who
also ported it to the DSP56001 signal processor and made it run in
real-time.  It is based on work by Perry Cook and Julius Smith at the
Stanford University Center for Computer Research in Music and Acoustics
(CCRMA).  The version required to compile Monet is available in the same
repository as Monet, but under "...current/Tools/softwareTRM".  A copy
of the original 'C' version is available in the repository under
"gnuspeech/gnuspeech/trillium/src/softwareTRM/tube.c"

Synthesizer:
This is not, in fact, a complete synthesizer!  It is an interactive
application that allows a user (usually a language developer or someone
interested in the behaviour of the tube model) to interact directly with
the tube model, listen to the output under different static conditions,
and analyse the output.  It was an important tool used in developing the
databases for the original British English TextToSpeech system because
it allowed the tube configurations needed to define the speech
"postures" (of the vocal tract) to be explored and finalised.  Although
it has built-in analysis and display features, it was also used in
conjunction with a Kay Sonagraf spectrum analyser that was used to
analyse the spectrum of natural speech in order to compare the spectral
analyses of putative "postures" with what was seen in natural speech in
a form that was the same for both.  The Sonagraf was also used to check
the output of Monet against the same utterances in natural speech.
"Synthesizer" is 70% ported to the Mac under OS/X but none of the new
sources is yet available.  I (David Hill) am the one working on this,
but I keep getting diverted.  It should have been finished 6 months ago!
Real soon now!  The original version of "Synthesizer" was created (for
the NeXT) by Leonard Manzara.

PrEditor:
This was an application to allow users to create and maintain their own
dictionaries.  The original TextToSpeech kit looked up several
dictionaries in the order User, Application and Main.  PrEditor allows
the User and Application dictionaries to be created and maintained.  An
initial port was begun by Eric Zoerner and is in a sub-subdirectory
under the same subdirectory as Monet.  It is not yet functional.  The
original PrEditor on the NeXT was written by Vince DeMarco and David
Marwood, documented by Leonard Manzara and later upgraded by Michael
Forbes.
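
In case it is useful, the lookup order mentioned above is simply
first-match-wins across the three dictionaries.  A toy sketch follows;
the structures and "pronunciation" strings are inventions for the
sketch, not the real dictionary format.

  /* dict_order_sketch.c -- toy illustration of User -> Application -> Main. */
  #include <stdio.h>
  #include <string.h>

  struct entry { const char *word, *pron; };

  /* First-match-wins lookup within one dictionary. */
  static const char *lookup(const struct entry *dict, int n, const char *word)
  {
      for (int i = 0; i < n; i++)
          if (strcmp(dict[i].word, word) == 0)
              return dict[i].pron;
      return NULL;
  }

  int main(void)
  {
      static const struct entry user_dict[] = { { "monet",  "fake-user-pron" } };
      static const struct entry app_dict[]  = { { "daemon", "fake-app-pron"  } };
      static const struct entry main_dict[] = { { "daemon", "fake-main-pron" },
                                                { "cat",    "fake-main-pron-2" } };
      const char *word = "daemon";
      const char *p;

      /* User entries override Application entries, which override Main. */
      if ((p = lookup(user_dict, 1, word)) != NULL ||
          (p = lookup(app_dict,  1, word)) != NULL ||
          (p = lookup(main_dict, 2, word)) != NULL)
          printf("%s -> %s\n", word, p);   /* finds the Application entry here */
      else
          printf("%s: not found\n", word);
      return 0;
  }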

The Main dictionary:
This has not really changed since the original NeXT implementation and
is incorporated as a module in the source code for Monet.  It is a
hybrid pronunciation between British (RP) English -- mainly the vowels
and related sounds -- and General American -- especially the rhotic "r"
sound.  It includes around 70,000 words, plus facilities for
creating/checking derivatives such as plurals, adverbs ..., and
information concerning word stress and part-of-speech.  The
part-of-speech information is still not used.  The main dictionary was
compiled mainly by me, David Hill, after a preliminary version, as well
as the creation tools, had been set up by Craig Schock.

BigMouth:
(Not to be confused with a different app of a similar name by a
different company).  This was an application that allowed text-to-speech
to be tried out without reference to any particular application on the
NeXT and also drove the speech service.  It used the "TextToSpeech Server",
which ran as a daemon started at boot time.  It has yet to be ported
(see also the next item on Real-time Monet).  The original source for
BigMouth was created by Leonard Manzara.

Real-time Monet and the TextToSpeech Server (TTS Server):
Monet incorporates all kinds of interactive interfaces for creating and
modifying the databases relating to the language being created or
managed.  It also has the means to use these databases to create the
output speech waveform.  The original NeXT-based TextToSpeech kit came
in three versions.  The User kit which simply provided speech output as
a service available to any application; the Developer kit which provided
the means to incorporate speech into applications directly; and the
Experimenter kit which allowed full access to all the tools used by
Trillium in developing language databases including dictionaries.  All
of these used the TextToSpeech Server for the actual conversion of text
to speech output.  The task was made easier on the NeXT, which was
relatively slow, by using the built-in DSP (a Motorola DSP-56001).  In
the Mac implementation of Monet and Synthesizer, the host computer
performs all the computation -- as CPU speeds are two orders of
magnitude or more faster than the old NeXT.  This also gave a certain
absolute separation between the tasks associated with creating the event
framework for synthesis, and the tasks associated with transforming the
event framework into the digital speech waveform (Real-time Monet) and
outputting it -- the latter tasks being carried out by the tube model.
Thus the tube model ran on the DSP in real-time and communicated by DMA
access.  There was also a 'C' version of the tube model which could not
run in real-time.  It was useful for producing a slightly higher quality
of speech since it did not have to be squeezed into the DSP and
rigorously optimised because of the marginal ability (even on the DSP)
to run in real-time.  The 'C' version of the tube model is what forms
the basis of the current port -- possible because of the greatly
increased processor speeds these days.
      Real-time Monet is a stripped-down version of Monet.  All the
database creation and manipulation components are absent, as are all
interactive interfaces.  On the NeXT version, the defaults database was
used to hold the parameters for controlling static aspects of the
synthesis (tube length, mean pitch, and so on -- the so-called
"utterance-rate parameters") and Real-time Monet computed the event
framework from the input text via an intermediate input syntax which
resulted from pre-processing the text.  This pre-processing included
dictionary look-up to get the correct pronunciation (deficient in the
sense there was no grammatical parsing or attempt to determine meaning,
so that different pronunciations of words with the same spelling could
not be disambiguated).  The word stress information from the dictionary
was used to determine the rhythmic framework according to the
Jones/Abercrombie/Halliday (British) "tendency-towards-isochrony" theory
of British English speech by placing "foot" boundaries before the word
stress in words having word-stressed syllables.  The punctuation was
also used in this process, and allowed a distinction to be made between
statements, emphatic statements, questions, and questions expecting a
yes/no answer for purposes of selecting different intonation contours
(not ever really done totally satisfactorily).  Without meaning, it was
hard to decide where the tonic (information point) of the phrase or
sentence should be marked, which means that the tonic foot was generally
placed in phrase/sentence final position by default.  This causes some
degradation of the speech rhythm and intonation and is the first
deficiency that should be corrected.
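
As a toy illustration of that foot-placement rule (my notation, not the
real Monet input syntax): given syllables marked for word stress, a foot
boundary is opened immediately before each stressed syllable, and
unstressed syllables attach to the foot already open.

  /* foot_sketch.c -- illustration only; '*' marks a word-stressed syllable
   * and '/' marks a foot boundary, both invented for this sketch. */
  #include <stdio.h>

  int main(void)
  {
      const char *syllables[] = { "the", "*cat", "*sat", "on", "the", "*mat" };
      const int count = sizeof syllables / sizeof syllables[0];

      for (int i = 0; i < count; i++) {
          const char *s = syllables[i];
          if (s[0] == '*')              /* stressed: open a new foot before it */
              printf("/ %s ", s + 1);
          else                          /* unstressed: stays in the current foot */
              printf("%s ", s);
      }
      printf("\n");   /* prints: the / cat / sat on the / mat */
      return 0;
  }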
     That said, Real-time Monet and the TextToSpeech server have yet to
be ported.  The current Monet port, like the original Monet,
incorporates the tube model to generate output and expects the output of
the text pre-processor as input.  A new applet (unfortunately named
"GnuSpeech" and presently residing in the "gnuspeech/current/Frameworks"
folder) allows plain text to be converted into the syntax needed for the
current version of Monet.  Steve Nygard recently "tidied things up",
following comments from people on the list, and I haven't checked out
the resulting new arrangements to see if I can still understand the
relationships well enough to compile it all, having many balls in the
air.  Any time I spend will be on finishing "Synthesizer".  Knowing Steve,
I am sure there's no problem with compiling Monet and associated modules
in their re-arranged form.
     There's a diagram of the relationships between the various TTS
components of the complete system if you go to:


ServerTest and ServerTestPlus:
These were interactive modules to allow the functioning of the
TextToSpeech Server to be tested as it was running.  There were
originally two versions (plain and Plus), the latter having a number of
"hidden" methods that were restricted to Trillium's "in-house" use.  Now
that the whole system is available under a GPL, the restricted
"ServerTest" version is obsolete.  One of the 18 hidden methods allowed
plain text to be converted into the intermediate (Real-time) Monet input
syntax.  It was hidden to keep the main dictionary material proprietary,
as it could have been used to decode the encoded dictionary.  This
particular function is currently provided by the misleadingly-named
GnuSpeech applet (see above).  ServerTest will be needed once the
TextToSpeech Server has been re-implemented -- something that has not
yet been done.  The original versions were written by Leonard Manzara.

WhosOnFirst:
WhosOnFirst was the first publicly available software associated with
the Trillium TTS system and was designed as a bit of a teaser.  As
issued, it provided indication on the console of remote logins.  It also
told the user that if they had the Trillium TTS system, they could get
voice alerts not only to remote logins, but other system activity such
as application launches.  The App was written by Craig Schock and was
instrumental in catching and identifying a hacker trying to break into
our system soon after it was set up.  WhosOnFirst has not yet been
ported and for real utility must await a ported version of the
TextToSpeech Server.

say:
A command-line interface to the TextToSpeech Server that can be used
from a terminal or in shell scripts.  It was written by Craig Schock and
has not been ported yet.
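
Purely to show the command-line shape such a tool has, here is a
hypothetical sketch: it gathers its arguments and hands them to an
assumed local speech daemon over a Unix-domain socket.  The socket path
and the "protocol" are inventions for the sketch; the original say
talked to the TextToSpeech Server through the kit's own interface.

  /* say_sketch.c -- hypothetical illustration only, not the original say. */
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/un.h>
  #include <unistd.h>

  int main(int argc, char *argv[])
  {
      if (argc < 2) {
          fprintf(stderr, "usage: %s \"text to speak\"\n", argv[0]);
          return 1;
      }

      /* Assumed daemon rendezvous point -- purely illustrative. */
      struct sockaddr_un addr = { .sun_family = AF_UNIX };
      strncpy(addr.sun_path, "/tmp/gnuspeech-tts.sock", sizeof(addr.sun_path) - 1);

      int fd = socket(AF_UNIX, SOCK_STREAM, 0);
      if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
          perror("speech daemon not reachable");
          return 1;
      }

      /* Send the arguments separated by spaces; closing signals end of text. */
      for (int i = 1; i < argc; i++) {
          write(fd, argv[i], strlen(argv[i]));
          write(fd, i + 1 < argc ? " " : "\n", 1);
      }
      close(fd);
      return 0;
  }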

SpeechManager:
The SpeechManager was provided to allow the TextToSpeech Server
parameters to be optimised for different systems since no particular
setting of priorities, initial silence fill, and so on could be right
for all systems.  In particular, in networked systems, or systems with a
high compute load from other tasks, the speech would sometimes crackle
due to interference from other tasks.  The App, which could only be run
as root, allowed the TextToSpeech Server to be restarted, and the
various parameters controlling priority and so on to be set to new
values to avoid crackling whilst minimising the use of system resources.
It may be that these functions are obsolete these days, given the
increased compute power available.  Some functions (such as reporting
the version of the main dictionary in use, or restarting the
TextToSpeech Server) may still be required when the TTS Server is
reimplemented.  The original App was written by Craig Schock.  It has
not been ported.

SpeechRegistrar:
An applet that was provided to allow a TextToSpeech kit to be
registered, using a password, and run under the root account.  The
function is now obsolete.  It was written by Craig Schock.  It has not
been ported.

TrilliumSoundEditor:
This was a speech editor and analysis program intended to provide a more
versatile replacement for the publicly available "Sonagram" program
written by Hiroshi Momose.  Although TrilliumSoundEditor was never
finished, it provided the basic functionality required and could be
finished/upgraded/ported at some point in the future.  The program was
written by Craig Schock.  None of the App has yet been ported.

I hope this is helpful to everyone.

As a summary, porting anything that "speaks" is blocked from completion
under GNU/Linux by lack of adequate audio output facilities, but much of
the core software has been or is being ported to the Mac under OS/X.
Monet has been ported to the Mac under OS/X using
Xcode/Interface Builder, and the source will also compile, more or less,
under GNUstep within the GNU/Linux environment.  The sources are in the
gnuspeech repository.  Synthesizer is in the process of being ported to
Mac OS/X using Xcode/Interface Builder and is about 70% complete.  Sources
are not yet publicly available.  PrEditor is in the process of being
ported and the sources are in the gnuspeech repository.  Some accessory
tools are available.  There is an immediate need to port the TTS Server
(the daemon -- a stripped version of Monet; stripping Monet is
likely a better approach than porting the original TTSServer) to both
the Mac and to GNU/Linux.  Other items are as noted in the text above.
Robert Slover has undertaken to solve the audio output requirement for
GNU/Linux; he just needs time beyond that devoted to the work that earns
his living!  Greg Casamento, the Chief man for GNU/Linux, has simply run
out of resources for taking on this task.  [He is now the GNUstep
maintainer.]

All good wishes.

david

------



