Re: [gnuspeech-contact] Ping: status of Gnuspeech?

On Apr 10, 2007, at 2:39 PM, Kenneth Reid Beesley wrote:

Dear Gnuspeech team,

What's the status of the Gnuspeech port?

I hope that others on this list will point out any errors I have made in what follows (which is my first real attempt to document what is going on with respect to porting all the original Trillium TextToSpeech software efforts!). All of the original NeXT sources for the software are available on the gnuspeech project site at:

http://www.gnu.org/software/gnuspeech/

by going to "-Browse Sources Repository" and clicking down to the gnuspeech/gnuspeech/trillium directory.

gnuspeech is being ported both to GNU/Linux and to the Macintosh under OS/X. There are a number of components/apps/modules which have to be ported. Some have been ported. The current state is as follows:

Monet:

the interactive language database and testing tool used to create the original databases for English Text-To-Speech using the new articulatory model of the vocal tract (the "tube model" -- basically a wave-guide or lattice filter that emulates the properties of the acoustic tube directly rather than through the use of formant filters etc). Monet translates its symbolic input into a digital waveform representing the "spoken" version of the input. Monet was originally designed and developed by Craig Schock (based on an original specification by David Hill, with testing and suggestions for improvements by David Hill and Leonard Manzara) as proprietary research software used in-house for the development of the Trillium Sound Research "TextToSpeech" package offered on the NeXT computer. It also was available as part of the Trillium "Experimenter" kit. On the demise of NeXT (whose remains were bought by Apple Computer), Monet and all other Trillium software were reconfigured as a GNU project (gnuspeech) and made available to the community under a General Public Licence (http://www.gnu.org/copyleft/). The project can be found at the web site http://savannah.gnu.org/projects/gnuspeech/ where the sources for Monet and other components are reached by clicking on "-Browse Sources Repository" under "Development Tools". Monet is there under ".../current/Applications" but requires the tube model and other components to compile (see ".../Frameworks" and ".../Tools" under "current"). At present, complete compilation is only possible under Macintosh OS/X 4.3 or later, though the sources are being modified to compile under GNUstep as well, and this may introduce certain minor glitches in the Mac OS/X compilation from time to time. The big hold-up in getting full compilation under GNUstep is the lack of suitable audio output facilities under GNUstep. Compilation under Mac OS/X uses Core Audio and the plan is to implement the needed components of Core Audio for GNUstep.
Two people concerned with the ongoing GNUstep development (Greg Casamento -- the Chief GNUstep maintainer -- and Robert Slover) have been considering the problem. Both have been extremely busy -- especially Greg after taking over as Chief on the GNUstep project. The implementation is on Robert's "to-do" list. Until then, those wishing to try out Monet and do further development will have to work on the Mac using the source, which is designed to compile under either OS/X xcode/interface builder, or under GNUstep. The Mac port, done by Steve Nygard following his experience at OmniGroup, is pretty well complete except for a few items such as modifying the intonation patterns for the automatically generated speech. Steve had worked on the original NeXT implementation for Trillium whilst he was at the University of Calgary. Monet's emulation of the human vocal tract depends on research carried out by Fant and his colleagues at the Speech Technology Lab at KTH, Stockholm on formant sensitivity analysis, and by René Carré at the ENST Dept. of Signals in Paris on the "Distinctive Region Model" for controlling the artificial vocal tract.

The tube model:

This was originally a 'C' implementation of the tube model that forms the core of the synthesis system, and was created by Leonard Manzara who also ported it to the DSP56001 signal processor and made it run in real-time. It is based on work by Perry Cook and Julius Smith at the Stanford University Center for Computer Research in Music and Acoustics (CCRMA). The version required to compile Monet is available in the same repository as Monet, but under "...current/Tools/softwareTRM". A copy of the original 'C' version is available in the repository under "gnuspeech/gnuspeech/trillium/src/softwareTRM/tube.c"
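
As an illustration only of what "emulating the acoustic tube directly" means: a wave-guide model treats the vocal tract as a chain of short tube sections and scatters the travelling waves at each junction according to the change in cross-sectional area. The sketch below computes junction reflection coefficients using one common textbook (Kelly-Lochbaum style) convention. It is not taken from tube.c; the function name and the example areas are hypothetical, and the sign convention differs between texts depending on whether pressure or volume velocity is modelled.

```python
def reflection_coefficients(areas):
    """Reflection coefficient at each junction between adjacent tube sections.

    `areas` lists the cross-sectional areas of the sections, in order along
    the tube. Convention assumed here (one of several in the literature):
    k = (A_left - A_right) / (A_left + A_right).
    """
    if any(a <= 0 for a in areas):
        raise ValueError("section areas must be positive")
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]


# Hypothetical areas (arbitrary units): a uniform tube has no reflections
# at its internal junctions, while a constriction produces non-zero ones.
uniform = reflection_coefficients([2.0, 2.0, 2.0])
constricted = reflection_coefficients([2.0, 1.0, 2.0])
```

With positive areas every coefficient lies strictly between -1 and 1, so a junction never reflects more energy than arrives at it.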

Synthesizer:

This is not, in fact, a complete synthesizer! It is an interactive application that allows a user (usually a language developer or someone interested in the behaviour of the tube model) to interact directly with the tube model, listen to the output under different static conditions, and analyse the output. It was an important tool used in developing the databases for the original British English TextToSpeech system because it allowed the tube configurations needed to define the speech "postures" (of the vocal tract) to be explored and finalised. Although it has built-in analysis and display features, it was also used in conjunction with a Kay Sonagraf spectrum analyser that was used to analyse the spectrum of natural speech in order to compare the spectral analyses of putative "postures" with what was seen in natural speech in a form that was the same for both. The Sonagraf was also used to check the output of Monet against the same utterances in natural speech. "Synthesizer" is 70% ported to the Mac under OS/X but none of the new sources is yet available. I (David Hill) am the one working on this, but I keep getting diverted. It should have been finished 6 months ago! Real soon now! The original version of "Synthesizer" was created (for the NeXT) by Leonard Manzara.

PrEditor:

This was an application to allow users to create and maintain their own dictionaries. The original TextToSpeech kit looked up several dictionaries in the order User, Application and Main. PrEditor allows the User and Application dictionaries to be created and maintained. An initial port was begun by Eric Zoerner and is in a sub-subdirectory under the same subdirectory as Monet. It is not yet functional. The original PrEditor on the NeXT was written by Vince DeMarco and David Marwood, documented by Leonard Manzara and later upgraded by Michael Forbes.
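
The lookup order described above (User, then Application, then Main) can be sketched roughly as follows. This is only an illustration, not the TextToSpeech kit's code: the function name, the use of plain Python dicts, and the placeholder pronunciation strings are all assumptions.

```python
def lookup_pronunciation(word, user, application, main):
    """Return (source, pronunciation) from the first dictionary containing word.

    The dictionaries are consulted in the order User, Application, Main,
    so a User entry overrides the other two. Returns None if no dictionary
    has the word.
    """
    ordered = (("user", user), ("application", application), ("main", main))
    for source, dictionary in ordered:
        if word in dictionary:
            return source, dictionary[word]
    return None


# Hypothetical entries; the values are placeholders, not real
# Trillium/gnuspeech phonetic notation.
user_dict = {"gnu": "USER-PRONUNCIATION"}
app_dict = {"gnu": "APP-PRONUNCIATION", "monet": "APP-MONET"}
main_dict = {"gnu": "MAIN-PRONUNCIATION", "speech": "MAIN-SPEECH"}

result = lookup_pronunciation("gnu", user_dict, app_dict, main_dict)
```

A word missing from all three dictionaries returns None here; how the original kit handled such words is not covered by this sketch.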

The Main dictionary:

This has not really changed since the original NeXT implementation and is incorporated as a module in the source code for Monet. It is a hybrid pronunciation between British (RP) English -- mainly the vowels and related, and General American -- especially the rhotic "r" sound. It includes around 70,000 words, plus facilities for creating/checking derivatives such as plurals, adverbs ..., and information concerning word stress, and part-of-speech. The part-of-speech information is still not used. The main dictionary was compiled mainly by me, David Hill, after a preliminary version, as well as creation tools, were set up by Craig Schock.

BigMouth:

(Not to be confused with a different app of a similar name by a different company). This was an application that allowed text-to-speech to be tried out without reference to any particular application on the NeXT and also drove the speech service. It used the "TextToSpeech Server", which ran as a daemon started at boot time. It has yet to be ported (see also the next item on Real-time Monet). The original source for BigMouth was created by Leonard Manzara.

Real-time Monet and the TextToSpeech Server (TTS Server):

Monet incorporates all kinds of interactive interfaces for creating and modifying the databases relating to the language being created or managed. It also has the means to use these databases to create the output speech waveform. The original NeXT-based TextToSpeech kit came in three versions: the User kit, which simply provided speech output as a service available to any application; the Developer kit, which provided the means to incorporate speech into applications directly; and the Experimenter kit, which allowed full access to all the tools used by Trillium in developing language databases including dictionaries. All of these used the TextToSpeech Server for the actual conversion of text to speech output. The task was made easier on the NeXT, which was relatively slow, by using the built-in DSP (a Motorola DSP-56001). In the Mac implementation of Monet and Synthesizer, the host computer performs all the computation -- as CPU speeds are two orders of magnitude or more faster than the old NeXT. On the NeXT, using the DSP also gave a certain absolute separation between the tasks associated with creating the event framework for synthesis, and the tasks associated with transforming the event framework into the digital speech waveform (Real-time Monet) and outputting it -- the latter tasks being carried out by the tube model. Thus the tube model ran on the DSP in real-time and communicated by DMA access. There was also a 'C' version of the tube model which could not run in real-time. It was useful for producing a slightly higher quality of speech since it did not have to be squeezed into the DSP and rigorously optimised because of the marginal ability (even on the DSP) to run in real-time. The 'C' version of the tube model is what forms the basis of the current port -- possible because of the greatly increased processor speeds these days.

Real-time Monet is a stripped-down version of Monet. All the database creation and manipulation components are absent, as are all interactive interfaces. On the NeXT version, the defaults database was used to hold the parameters for controlling static aspects of the synthesis (tube length, mean pitch, and so on -- the so-called "utterance-rate parameters") and Real-time Monet computed the event framework from the input text via an intermediate input syntax which resulted from pre-processing the text. This pre-processing included dictionary look-up to get the correct pronunciation (deficient in the sense that there was no grammatical parsing or attempt to determine meaning, so that different pronunciations of words with the same spelling could not be disambiguated). The word stress information from the dictionary was used to determine the rhythmic framework according to the Jones/Abercrombie/Halliday (British) "tendency-towards-isochrony" theory of British English speech by placing "foot" boundaries before the word stress in words having word-stressed syllables. The punctuation was also used in this process, and allowed a distinction to be made between statements, emphatic statements, questions, and questions expecting a yes/no answer for purposes of selecting different intonation contours (not ever really done totally satisfactorily). Without meaning, it was hard to decide where the tonic (information point) of the phrase or sentence should be marked, which means that the tonic foot was generally placed in phrase/sentence final position by default. This causes some degradation of the speech rhythm and intonation and is the first deficiency that should be corrected.
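
The foot-boundary rule described above (a new "foot" begins at each word-stressed syllable) can be sketched as below. This is an illustration under stated assumptions, not Monet's code: the function name and the input shape (each word as a list of (syllable, stressed) pairs) are hypothetical, and grouping any leading unstressed syllables into an initial partial foot is a choice made for this sketch.

```python
def split_into_feet(words):
    """Group syllables into feet, starting a new foot at each stressed syllable.

    `words` is a list of words; each word is a list of (syllable, stressed)
    pairs, where `stressed` marks the word-stressed syllable taken from the
    dictionary. Words with no stressed syllable simply join the current foot.
    """
    feet = []
    current = []
    for word in words:
        for syllable, stressed in word:
            # Place the boundary before the stressed syllable, but never
            # emit an empty foot at the very start.
            if stressed and current:
                feet.append(current)
                current = []
            current.append(syllable)
    if current:
        feet.append(current)
    return feet


# Hypothetical syllabification and stress marks for "the synthesis of speech".
phrase = [
    [("the", False)],
    [("syn", True), ("the", False), ("sis", False)],
    [("of", False)],
    [("speech", True)],
]
feet = split_into_feet(phrase)
# feet == [["the"], ["syn", "the", "sis", "of"], ["speech"]]
```

Following the default described above, the last foot in the result would be the one treated as the tonic foot.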

That said, Real-time Monet and the TextToSpeech server have yet to be ported. The current Monet port, like the original Monet, incorporates the tube model to generate output and expects the output of the text pre-processor as input. A new applet (unfortunately named "GnuSpeech" and presently residing in the "gnuspeech/current/Frameworks" folder) allows plain text to be converted into the syntax needed for the current version of Monet. Steve Nygard recently "tidied things up", following comments from people on the list, and I haven't checked out the resulting new arrangements to see if I can still understand the relationships well enough to compile it all, having many balls in the air. Any time I spend will be finishing "Synthesizer". Knowing Steve, I am sure there's no problem with compiling Monet and associated modules in their re-arranged form.

There's a diagram of the relationships between the various TTS components of the complete system if you go to:

http://www.gnu.org/software/gnuspeech/

and click on "Project Home Page" under "Quick Overview".

ServerTest and ServerTestPlus:

This was an interactive module to allow the functioning of the TextToSpeech Server to be tested as it was running. There were originally two versions (plain and Plus), the latter having a number of "hidden" methods that were restricted to Trillium's "in-house" use. Now that the whole system is available under a GPL, the restricted "ServerTest" version is obsolete. One of the 18 hidden methods allowed plain text to be converted into the intermediate (Real-time) Monet input syntax. It was hidden to keep the main dictionary material proprietary, as it could have been used to decode the encoded dictionary. This particular function is currently provided by the misleadingly-named GnuSpeech applet (see above). ServerTest will be needed once the TextToSpeech Server has been re-implemented -- something that has not yet been done. The original versions were written by Leonard Manzara.

WhosOnFirst:

WhosOnFirst was the first publicly available software associated with the Trillium TTS system and was designed as a bit of a teaser. As issued, it provided indication on the console of remote logins. It also told the user that if they had the Trillium TTS system, they could get voice alerts not only to remote logins, but other system activity such as application launches. The App was written by Craig Schock and was instrumental in catching and identifying a hacker trying to break into our system soon after it was set up. WhosOnFirst has not yet been ported and for real utility must await a ported version of the TextToSpeech Server.

say:

a command line interface to the TextToSpeech Server that can be used from a terminal or in shell scripts. It was written by Craig Schock and has not been ported yet.

SpeechManager:

The SpeechManager was provided to allow the TextToSpeech Server parameters to be optimised for different systems since no particular setting of priorities, initial silence fill, and so on could be right for all systems. In particular, in networked systems, or systems with a high compute load from other tasks, the speech would sometimes crackle due to interference from other tasks. The App, which could only be run as root, allowed the TextToSpeech Server to be restarted, and the various parameters controlling priority and so on to be set to new values to avoid crackling whilst minimising the use of system resources. It may be that these functions are obsolete these days, given the increased compute power available. Some functions (such as reporting the version of the main dictionary in use, or restarting the TextToSpeech Server) may still be required when the TTS Server is reimplemented. The original App was written by Craig Schock. It has not been ported.

SpeechRegistrar:

An applet that was provided to allow a TextToSpeech kit to be registered, using a password, and run under the root account. The function is now obsolete. It was written by Craig Schock. It has not been ported.

TrilliumSoundEditor:

This was a speech editor and analysis program intended to provide a more versatile replacement for the publicly available "Sonagram" program written by Hiroshi Momose. Although TrilliumSoundEditor was never finished, it provided the basic functionality required and could be finished/upgraded/ported at some point in the future. The program was written by Craig Schock. None of the App has yet been ported.

I hope this is helpful to everyone.

As a summary, porting anything that "speaks" is blocked from completion under GNU/Linux by lack of adequate audio output facilities but much of the core software has been or is being ported to the Mac under OS/X. Monet has been ported to the Mac under OS/X using xcode/InterfaceBuilder, and the source will also compile, more or less, under GNUstep within the GNU/Linux environment. The sources are in the gnuspeech repository. Synthesizer is in the process of being ported to Mac OS/X using xcode/InterfaceBuilder and is about 70% complete. Sources are not yet publicly available. PrEditor is in the process of being ported and the sources are in the gnuspeech repository. Some accessory tools are available. There is an immediate need to port the TTS Server to both the Mac and to GNU/Linux. Other items are as noted in the main text of this email. Robert Slover has undertaken to solve the audio output requirement for GNU/Linux; he just needs time beyond that devoted to the work that earns his living! Greg Casamento, the Chief maintainer for GNUstep, has simply run out of resources for taking on this task.

All good wishes.

david

------

Just wondering.

Thanks,

Ken


From:	David Hill
Subject:	Re: [gnuspeech-contact] Ping: status of Gnuspeech?
Date:	Fri, 13 Apr 2007 16:55:38 -0700