Differences between version 3 and revision by previous author of HowToSpeechRecognitionHOWTO.
Newer page: | version 3 | Last edited on Monday, October 25, 2004 5:08:08 am | by AristotlePagaltzis | Revert |
Older page: | version 2 | Last edited on Saturday, July 20, 2002 12:04:56 am | by !PerryLorier | Revert |
@@ -1,1260 +1 @@
-!!!Speech Recognition HOWTO
-!Stephen Cook
-
- scook@gear21.com
-
-
-
-__Revision History__
-* Revision v2.0, April 19, 2002, revised by scc: Changed license information (now GFDL) and added a new publication.
-* Revision v1.2, February 5, 2002, revised by scc: Added more commercial software listings (sent by Mayur Patel).
-* Revision v1.1, October 5, 2001, revised by scc: Added info for Vocalis Speechware. Fixed/updated various other items.
-* Revision v1.0, November 20, 2000, revised by scc: Added info on L&H and HTK.
-* Revision v0.5, September 13, 2000, revised by scc: Initial HOWTO submission.
-
-
-
-
-
-Automatic Speech Recognition (ASR) on Linux is becoming easier.
-Several packages are available for users as well as developers.
-This document describes the basics of speech recognition and
-surveys some of the available software.
-
-
-
-
-
-
-----
-__Table of Contents__
-* 1. Legal Notices
-** 1.1. Copyright/License
-** 1.2. Disclaimer
-** 1.3. Trademarks
-* 2. Foreword
-** 2.1. About This Document
-** 2.2. Acknowledgements
-** 2.3. Comments/Updates/Feedback
-** 2.4. !ToDo
-** 2.5. Revision History
-* 3. Introduction
-** 3.1. Speech Recognition Basics
-** 3.2. Types of Speech Recognition
-** 3.3. Uses and Applications
-* 4. Hardware
-** 4.1. Sound Cards
-** 4.2. Microphones
-** 4.3. Computers/Processors
-* 5. Speech Recognition Software
-** 5.1. Free Software
-** 5.2. Commercial Software
-* 6. Inside Speech Recognition
-** 6.1. How Recognizers Work
-** 6.2. Digital Audio Basics
-* 7. Publications
-** 7.1. Books
-** 7.2. Internet
-!!!1. Legal Notices
-!!1.1. Copyright/License
-
-Copyright (c) 2000-2002 Stephen C. Cook.
-Permission is granted to copy, distribute, and/or modify this document under the
-terms of the GNU Free Documentation License, Version 1.1 or any later version published
-by the Free Software Foundation.
-
-
-
-This document is made available under the terms of the GNU Free Documentation License (GFDL), which is hereby
-incorporated by reference.
-
-----
-!!1.2. Disclaimer
-
-The author disclaims all warranties with regard to this document,
-including all implied warranties of merchantability and fitness for a
-particular purpose; in no event shall the author be liable for any
-special, indirect or consequential damages or any damages whatsoever
-resulting from loss of use, data or profits, whether in an action of
-contract, negligence or other tortious action, arising out of or in
-connection with the use of this document.
-
-----
-!!1.3. Trademarks
-
-All trademarks contained in this document are the property
-of their respective owners.
-
-----
-!!!2. Foreword
-!!2.1. About This Document
-
-This document is targeted at beginner to intermediate level Linux
-users interested in learning about speech recognition and trying it out.
-It may also help interested developers by explaining the basics of
-speech recognition programming.
-
-
-
-I started this document when I began researching what speech
-recognition software and development libraries were available for Linux.
-Automated Speech Recognition (ASR or just SR) on Linux is just starting
-to come into its own, and I hope this document gives it a push in the
-right direction - by supporting both users and developers of ASR
-technology.
-
-
-
-I have left a variety of SR techniques out of this document, and
-instead I have focused on the "HOWTO" aspect (since this is a howto...).
-I have included a Publications section so the interested reader can
-find books and articles on anything not covered
here. This is not
-meant to be a definitive statement of ASR on Linux.
-
-
-
-For the most recent version of this document, check the LDP archive,
-or go to:
-http://www.gear21.com/speech/index.html.
-
-----
-!!2.2. Acknowledgements
-
-I would like to thank the following people for their help with, review of,
-and support of this document:
-
-* Jessica Perry Hekman
-* Geoff Wexler
-
-----
-!!2.3. Comments/Updates/Feedback
-
-If you have any comments, suggestions, revisions, updates, or just
-want to chat about ASR, please send an email to me at
-scook@gear21.com.
-
-----
-!!2.4. !ToDo
-
-The following things are left "to do":
-
-* Add descriptions in the Publications section.
-* Add more books to the Publications section.
-* Add more links with descriptions.
-* Enhance the description of the ASR system steps.
-* Include descriptions of FFTs and Filters.
-* Include descriptions of DSP principles.
-
-----
-!!2.5. Revision History
-
-v0.1 first rough draft - August 2000
-
-
-
-v0.5 final draft - September 2000
-
-----
-!!!3. Introduction
-!!3.1. Speech Recognition Basics
-
-
-Speech recognition is the process by which a computer (or
-other type of machine) identifies spoken words. Basically, it means
-talking to your computer, AND having it correctly recognize what you
-are saying.
-
-
-
-The following definitions are the basics needed for understanding
-speech recognition technology.
-
-
-
-
-
-
-
-; Utterance:
-
- An utterance is the vocalization (speaking) of a word or words that
-represent a single meaning to the computer. Utterances can be a
-single word, a few words, a sentence, or even multiple sentences.
-
-
-; Speaker Dependence:
-
- Speaker dependent systems are designed around a specific speaker.
-They generally are more accurate for the correct speaker, but much
-less accurate for other speakers. They assume the speaker will
-speak in a consistent voice and tempo. Speaker independent systems
-are designed for a variety of speakers. Adaptive systems usually start
-as speaker independent systems and utilize training techniques to
-adapt to the speaker to increase their recognition accuracy.
-
-
-; Vocabularies:
-
- Vocabularies (or dictionaries) are lists of words or utterances that
-can be recognized by the SR system. Generally, smaller vocabularies
-are easier for a computer to recognize, while larger vocabularies
-are more difficult. Unlike normal dictionaries, each entry doesn't
-have to be a single word. They can be as long as a sentence or two.
-Smaller vocabularies can have as few as 1 or 2 recognized utterances
-(e.g."Wake Up"), while very large vocabularies can have a hundred
-thousand or more!
-
-
-; Accuracy:
-
- The ability of a recognizer can be examined by measuring its
-accuracy - or how well it recognizes utterances. This includes not
-only correctly identifying an utterance but also recognizing when a
-spoken utterance is not in its vocabulary. Good ASR systems have an
-accuracy of 98% or more! The acceptable accuracy of a system
-really depends on the application (a rough scoring sketch is shown
-after these definitions).
-
-
-; Training:
-
- Some speech recognizers have the ability to adapt to a speaker.
-When the system has this ability, it may allow training to take
-place. An ASR system is trained by having the speaker repeat
-standard or common phrases and adjusting its comparison algorithms
-to match that particular speaker. Training a recognizer usually
-improves its accuracy.
-
-
-
-
- Training can also be used by speakers that have difficulty
-speaking, or pronouncing certain words. As long as the speaker
-can consistently repeat an utterance, ASR systems with training
-should be able to adapt.
-
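-To put a number on the accuracy described above, here is a minimal
-Python sketch (mine, not part of the original HOWTO) that scores a
-recognizer's output against reference transcripts for a batch of test
-utterances. The "REJECT" label and the sample data are hypothetical;
-real systems usually report finer-grained measures such as word error
-rate.
-
-  # Minimal utterance-accuracy scoring sketch (illustrative only).
-  # "REJECT" marks utterances the recognizer should refuse because they
-  # fall outside its vocabulary; the data below is made up.
-  def accuracy(references, hypotheses):
-      # assumes both lists have the same length
-      correct = sum(1 for ref, hyp in zip(references, hypotheses) if ref == hyp)
-      return correct / len(references)
-
-  refs = ["wake up", "open netscape", "REJECT", "start a new xterm"]
-  hyps = ["wake up", "open netscape", "REJECT", "start a new term"]
-  print("accuracy: %.1f%%" % (100 * accuracy(refs, hyps)))   # prints 75.0%
-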
-
-
-
-----
-!!3.2. Types of Speech Recognition
-
-
-Speech recognition systems can be separated in several different
-classes by describing what types of utterances they have the ability
-to recognize. These classes are based on the fact that one of the
-difficulties of ASR is the ability to determine when a speaker starts
-and finishes an utterance. Most packages can fit into more than one
-class, depending on which mode they're using.
-
-
-
-
-
-
-
-; Isolated Words:
-
- Isolated word recognizers usually require each utterance to have
-quiet (lack of an audio signal) on BOTH sides of the sample window.
-This doesn't mean that the system accepts only single words, but it does
-require a single utterance at a time. Often, these systems have
-"Listen/Not-Listen" states, where they require the speaker to wait
-between utterances (usually doing processing during the pauses).
-Isolated Utterance might be a better name for this class.
-
-
-; Connected Words:
-
- Connected word systems (or more correctly 'connected utterances')
-are similar to isolated word systems, but allow separate utterances to be
-'run together' with a minimal pause between them.
-
-
-; Continuous Speech:
-
- Continuous recognition is the next step. Recognizers with continuous
-speech capabilities are some of the most difficult to create because
-they must utilize special methods to determine utterance boundaries.
-Continuous speech recognizers allow users to speak almost naturally,
-while the computer determines the content. Basically, it's computer
-dictation.
-
-
-; Spontaneous Speech:
-
- There appears to be a variety of definitions for what spontaneous
-speech actually is. At a basic level, it can be thought of as
-speech that is natural sounding and not rehearsed. An ASR system
-with spontaneous speech ability should be able to handle a variety
-of natural speech features such as words being run together, "ums"
-and "ahs", and even slight stutters.
-
-
-; Voice Verification/Identification:
-
- Some ASR systems have the ability to identify specific users. This
-document doesn't cover verification or security systems.
-
-
-
-
-----
-!!3.3. Uses and Applications
-
-
-Although any task that involves interfacing with a computer can
-potentially use ASR, the following applications are the most
-common right now.
-
-
-
-
-
-
-
-; Dictation:
-
- Dictation is the most common use for ASR systems today. This
-includes medical transcriptions, legal and business dictation, as
-well as general word processing. In some cases special vocabularies
-are used to increase the accuracy of the system.
-
-
-; Command and Control:
-
- ASR systems that are designed to perform functions and actions on the
-system are defined as Command and Control systems. Utterances like
-"Open Netscape" and "Start a new xterm" will do just that.
-
-
-; Telephony:
-
- Some PBX/Voice Mail systems allow callers to speak commands instead of
-pressing buttons to send specific tones.
-
-
-; Wearables:
-
- Because inputs are limited for wearable devices, speaking is a
-natural possibility.
-
-
-; Medical/Disabilities:
-
- Many people have difficulty typing due to physical limitations such
-as repetitive strain injuries (RSI), muscular dystrophy, and
-many others. For example, people with difficulty hearing could use
-a system connected to their telephone to convert the caller's speech
-to text.
-
-
-; Embedded Applications:
-
- Some newer cellular phones include C&C (command and control) speech
-recognition that allows utterances such as "Call Home". This could be a
-major factor in the
-future of ASR and Linux. Why can't I talk to my television yet?
-
-
-
-
-----
-!!!4. Hardware
-!!4.1. Sound Cards
-
-
-Because speech requires relatively low bandwidth, just about any
-medium-to-high quality 16-bit sound card will get the job done. You must
-have sound enabled in your kernel, and you must have the correct drivers
-installed. For more information on sound cards, please see "The Linux
-Sound HOWTO" available at: http://www.!LinuxDoc.org/. Sound card
-quality often sparks heated discussion about its impact on accuracy
-and noise.
-
-
-
-Sound cards with the 'cleanest' A/D (analog to digital) conversions
-are recommended, but most often the clarity of the digital sample is
-more dependent on the microphone quality and even more dependent on the
-environmental noise. Electrical "noise" from monitors, PCI slots,
-hard drives, etc. is usually nothing compared to audible noise
-from computer fans, squeaking chairs, or heavy breathing.
-
-
-
-Some ASR software packages may require a specific sound card. It's
-usually a good idea to stay away from specific hardware requirements,
-because it limits many of your possible future options and decisions.
-You'll have to weigh the benefits and costs if you are considering
-packages that require specific hardware to function properly.
-
-----
-!!4.2. Microphones
-
-A quality microphone is key when utilizing ASR. In most cases, a
-desktop microphone just won't do the job; it tends to pick up more
-ambient noise, which gives ASR programs a hard time.
-
-
-
-Handheld microphones are also not the best choice, as they can be
-cumbersome to pick up all the time. While they do limit the amount
-of ambient noise, they are most useful in applications that require
-changing speakers often, or when speaking to the recognizer isn't
-done frequently (when wearing a headset isn't an option).
-
-
-
-
-The best choice, and by far the most common, is the headset style.
-It allows the ambient noise to be minimized, while allowing you to
-have the microphone at the tip of your tongue all the time. Headsets
-are available without earphones and with earphones (mono or stereo).
-I recommend the stereo headphones, but it's just a matter of personal
-taste.
-
-
-
-You can get excellent quality microphone headsets for between $25
-and $100. A good place to start looking is http://www.headphones.com or
-http://www.speechcontrol.com.
-
-
-
-
-A quick note about levels: Don't forget to turn up your microphone
-volume. This can be done with a program such as XMixer or OSS Mixer
-and care should be taken to avoid feedback noise. If the ASR software
-includes auto-adjustment programs, use them instead, as they are
-optimized for their particular recognition system.
-
-----
-!!4.3. Computers/Processors
-
-ASR applications can be heavily dependent on processing speed. This
-is because a large amount of digital filtering and signal processing
-can take place in ASR.
-
-
-
-As with just about any CPU-intensive software, the faster the better.
-Also, the more memory the better. It's possible to do some SR with a
-100MHz processor and 16M RAM, but for fast processing (large dictionaries,
-complex recognition schemes, or high sample rates), you should shoot for a
-minimum of a 400MHz processor and 128M RAM. Because of the processing required,
-most software packages list their minimum requirements.
-
-
-
-Using a cluster (Beowulf or otherwise) to perform massive recognition
-efforts hasn't yet been undertaken. If you know of any project underway
-or in development, please send me a note! scook@gear21.com
-
-----
-!!!5. Speech Recognition Software
-!!5.1. Free Software
-
-Much of the free software listed here is available for download at:
-http://sunsite.uio.no/pub/Linux/sound/apps/speech/
-
-
-----
-!5.1.1. XVoice
-
-XVoice is a dictation/continuous speech recognizer that can be used
-with a variety of XWindow applications. It allows user-defined macros.
-This is a fine program with a definite future. Once set up, it
-performs with adequate accuracy.
-
-
-
-XVoice requires that you download and install IBM's (free) !ViaVoice
-for Linux (See Commercial Section). It also requires the configuration
-of !ViaVoice to work correctly. Additionally, Lesstif/Motif (libXm) is
-required. It is also important to note that because this program
-interacts with X windows, you must leave X resources open on your
-machine, so caution should be used if you use this on a networked or
-multi-user machine.
-
-
-
-This software is primarily for users. An RPM is available.
-
-
-
-Homepage: http://www.compapp.dcu.ie/~tdoris/Xvoice/
-http://www.zachary.com/creemer/xvoice.html
-
-
-
-Project: http://xvoice.sourceforge.net
-
-
-
-Community: http://www.onelist.com/community/xvoice
-
-----
-!5.1.2. CVoiceControl/kVoiceControl
-
-CVoiceControl (which stands for Console Voice Control) started its
-life as KVoiceControl (KDE Voice Control). It is a basic speech
-recognition system that allows a user to execute Linux commands by
-using spoken commands. CVoiceControl replaces KVoiceControl.
-
-
-
-The software includes a microphone level configuration utility,
-a vocabulary "model editor" for adding new commands and utterances,
-and the speech recognition system.
-
-
-
-CVoiceControl is an excellent starting point for experienced users
-looking to get started in ASR. It is not the most user friendly,
-but once it has been trained correctly, it can be very helpful. Be
-sure to read the documentation while setting up.
-
-
-
-This software is primarily for users.
-
-
-
-Homepage: http://www.kiecza.de/daniel/linux/index.html
-
-
-
-Documents: http://www.kiecza.de/daniel/linux/cvoicecontrol/index.html
-
-----
-!5.1.3. Open Mind Speech
-
-Started in late 1999, Open Mind Speech has changed names several times
-(was !VoiceControl, then !SpeechInput, and then !FreeSpeech), and is now
-part of the "Open Mind Initiative". This is an open source project.
-Currently it isn't completely operational and is primarily for developers.
-
-
-
-This software is primarily for developers.
-
-
-
-Homepage: http://freespeech.sourceforge.net
-
-----
-!5.1.4. GVoice
-
-GVoice is an ASR library that uses IBM's (free) !ViaVoice SDK
-to control Gtk/GNOME applications. It includes libraries for
-initialization, recognition engine, vocabulary manipulation, and panel
-control. Development on this has been idle for over a year.
-
-
-
-This software is primarily for developers.
-
-
-
-Homepage: http://www.cse.ogi.edu/~omega/gnome/gvoice/
-
-----
-!5.1.5. ISIP
-
-The Institute for Signal and Information Processing at Mississippi
-State University has made its speech recognition engine available. The
-toolkit includes a front-end, a decoder, and a training module. It's a
-functional toolkit.
-
-
-
-This software is primarily for developers.
-
-
-
-The toolkit (and more information about ISIP) is available at:
-http://www.isip.msstate.edu/project/speech/
-
-----
-!5.1.6. CMU Sphinx
-
-Sphinx originally started at CMU and has recently been released as
-open source. This is a fairly large program that includes a lot of
-tools and information. It is still "in development", but includes
-trainers, recognizers, acoustic models, language models, and some
-limited documentation.
-
-
-
-This software is primarily for developers.
-
-
-
-Homepage: http://www.speech.cs.cmu.edu/sphinx/Sphinx.html
-
-
-
-Source: http://download.sourceforge.net/cmusphinx/sphinx2-.1a.tar.gz
-
-----
-!5.1.7. Ears
-
-Although Ears isn't fully developed, it is a good starting
-point for programmers wishing to start in ASR.
-
-
-
-This software is primarily for developers.
-
-
-
-FTP site: ftp://svr-ftp.eng.cam.ac.uk/comp.speech/recognition/
-
-----
-!5.1.8. NICO ANN Toolkit
-
-The NICO Artificial Neural Network toolkit is a flexible back
-propagation neural network toolkit optimized for speech recognition
-applications.
-
-
-
-This software is primarily for developers.
-
-
-
-Its homepage: http://www.speech.kth.se/NICO/index.html
-
-----
-!5.1.9. Myers' Hidden Markov Model Software
-
-This software by Richard Myers implements HMM algorithms in C++.
-It provides an example and learning tool for the HMM models described in
-L. Rabiner's book "Fundamentals of Speech Recognition".
-
-
-
-This software is primarily for developers.
-
-
-
-Information is available at:
-http://www.itl.atr.co.jp/comp.speech/Section6/Recognition/myers.hmm.html
-
-----
-!5.1.10. Jialong He's Speech Recognition Research Tool
-
-Although not originally written for Linux, this research tool can be
-compiled on Linux. It contains three different types of recognizers:
-DTW, Dynamic Hidden Markov Model, and a Continuous Density Hidden
-Markov Model. This is for research and development uses, as it is
-not a fully functional ASR system. The toolkit contains some very
-useful tools.
-
-
-
-This software is primarily for developers.
-
-
-
-More information is available at:
-http://www.itl.atr.co.jp/comp.speech/Section6/Recognition/jialong.html
-
-----
-!5.1.11. More Free Software?
-
-If you know of free software that isn't included in the above list,
-please send me a note at: scook@gear21.com. If you're in the mood,
-you can also send me where to get a copy of the software, and any
-impressions you may have about it. Thanks!
-
-----
-!!5.2. Commercial Software
-!5.2.1. IBM !ViaVoice
-
-IBM has made good on their promise to support Linux with their series
-of !ViaVoice products for Linux, though the future of their SDKs isn't
-set in stone (their licensing agreement for developers isn't officially
-released as of this date - more to come).
-
-
-
-Their commercial (not-free) product, IBM !ViaVoice Dictation for Linux
-(available at http://www-4.ibm.com/software/speech/linux/dictation.html)
-performs very well, but has some sizeable system requirements compared
-to the more basic ASR systems (64M RAM and 233MHz Pentium). For the
-$59.95US price tag you also get an Andrea NC-8 microphone. It also
-allows multiple users (but I haven't tried it with multiple users, so
-if anyone has any experience please give me a shout). The package
-includes: documentation (PDF), Trainer, dictation system, and
-installation scripts. Support for additional Linux Distributions based
-on 2.2 kernels is also available in the latest release.
-
-
-
- The ASR SDK is available for free, and includes IBM's SMAPI, grammar
-API, documentation, and a variety of sample programs. The !ViaVoice
-Run Time Kit provides an ASR engine and data files for dictation
-functions, and user utilities. The !ViaVoice Command & Control Run Time
-Kit includes the ASR engine and data files for command and control
-functions, and user utilities. The SDK and Kits require 128M RAM and
-a Linux 2.2 or later kernel.
-
-
-
-The SDKs and Kits are available for free at:
-http://www-4.ibm.com/software/speech/dev/sdk_linux.html
-
-----
-!5.2.2. Vocalis Speechware
-
-More information on Vocalis and Vocalis Speechware is available at:
-http://www.vocalisspeechware.com and
-http://www.vocalis.com.
-
-----
-!5.2.3. Babel Technologies
-
-Babel Technologies has a Linux SDK available called Babear. It is a speaker-independent
-system based on Hybrid Markov Models and Artificial Neural Networks technology. They also
-have a variety of products for Text-to-speech, speaker verification, and phoneme analysis.
-More information is available at: http://www.babeltech.com.
-
-----
-!5.2.4. !SpeechWorks
-
-I didn't see anything on their website that specifically mentioned Linux, but their
-"!OpenSpeech Recognizer" uses VoiceXML, which is an open standard.
-More information is available at: http://www.speechworks.com.
-
-----
-!5.2.5. Nuance
-
-Nuance offers a speech recognition/natural language product (currently Nuance 8.) for
-a variety of *nix platforms. It can handle very large vocabularies and uses a unique
-distributed architecture for scalability and fault tolerance.
-More information is available at: http://www.nuance.com.
-
-----
-!5.2.6. Abbot/!AbbotDemo
-
-Abbot is a very large vocabulary, speaker independent ASR system.
-It was originally developed by the Connectionist Speech Group at
-Cambridge University. It was transferred (commercialized) to
-!SoftSound. More information is available at:
-http://www.softsound.com.
-
-
-
-!AbbotDemo is a demonstration package of Abbot. This demo system
-has a vocabulary of about 5000 words and uses the connectionist/HMM
-continuous speech algorithm. This is a demonstration program with no
-source code.
-
-----
-!5.2.7. Entropic
-
-The fine people over at Entropic have been bought out by Microsoft...
-Their products and support services have all but disappeared. Their
-support for HTK and ESPS/waves+ is gone, and their future is in the
-hands of Microsoft. Their old website at http://www.entropic.com has more
-information.
-
-
-
-K.K. Chin advised me that the original developers of the HTK (the
-Speech, Vision and Robotics Group at Cambridge) are still
-providing support for it. There is also a "free" version
-available at: http://htk.eng.cam.ac.uk.
-Also note that Microsoft still owns the copyright to the current
-HTK code...
-
-
-----
-!5.2.8. More Commercial Products
-
-There are rumors of more commercial ASR products becoming available
-in the near future (including L&H). I talked with a couple of
-L&H representatives at Comdex 2000 (Vegas) and none of them could give
-me any information on a Linux release, or even if they planned on releasing
-any products for Linux. If you have any further information, please send
-any details to me at scook@gear21.com.
-
-----
-!!!6. Inside Speech Recognition
-!!6.1. How Recognizers Work
-
-
-Recognition systems can be broken down into two main types. Pattern
-Recognition systems compare patterns to known/trained patterns to
-determine a match. Acoustic Phonetic systems use knowledge of the
-human body (speech production, and hearing) to compare speech features
-(phonetics such as vowel sounds). Most modern systems focus on the
-pattern recognition approach because it combines nicely with current
-computing techniques and tends to have higher accuracy.
-
-
-
-Most recognizers can be broken down into the following steps:
-
-
-
-
-
-
-
-
-# Audio recording and utterance detection
-# Pre-filtering (pre-emphasis, normalization, banding, etc.)
-# Framing and windowing (chopping the data into a usable format)
-# Filtering (further filtering of each window/frame/frequency band)
-# Comparison and matching (recognizing the utterance)
-# Action (perform the function associated with the recognized pattern)
-
-
-
-
-Although each step seems simple, each one can involve a multitude of
-different (and sometimes completely opposite) techniques.
-
-
-
-(1) Audio/Utterance Recording: can be accomplished in a number of ways.
-Starting points can be found by comparing ambient audio levels (acoustic
-energy in some cases) with the sample just recorded. Endpoint detection
-is harder because speakers tend to leave "artifacts" including
-breathing/sighing, teeth chatter, and echoes.
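-
-As a rough illustration of step (1), the following Python sketch (mine,
-not taken from any particular recognizer) finds an utterance by comparing
-short-frame energy against an estimate of the ambient level. The frame
-length, the threshold factor, and the assumption that the first few
-frames are silence are all illustrative choices.
-
-  # Energy-based utterance detection sketch (illustrative only).
-  # samples: a list of PCM sample values; returns (start, end) sample
-  # indices of the region whose energy rises above the ambient level.
-  def detect_utterance(samples, frame_len=160, threshold=4.0):
-      frames = [samples[i:i + frame_len]
-                for i in range(0, len(samples) - frame_len + 1, frame_len)]
-      energies = [sum(s * s for s in f) / frame_len for f in frames]
-      if len(energies) < 10:
-          return None                      # too short to estimate ambient noise
-      ambient = sum(energies[:10]) / 10    # assume the first 10 frames are silence
-      active = [i for i, e in enumerate(energies) if e > threshold * ambient]
-      if not active:
-          return None                      # nothing louder than the ambient level
-      return active[0] * frame_len, (active[-1] + 1) * frame_len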
-
-
-
-(2) Pre-Filtering: is accomplished in a variety of ways, depending on
-other features of the recognition system. The most common methods are
-the "Bank-of-Filters" method which utilizes a series of audio filters to
-prepare the sample, and the Linear Predictive Coding method which uses
-a prediction function to calculate differences (errors). Different
-forms of spectral analysis are also used.
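-
-As one concrete example of the pre-filtering in step (2), the classic
-first-order pre-emphasis filter boosts the high frequencies of the
-sample before further analysis. The sketch below is a generic
-implementation; the coefficient 0.97 is a commonly used value, not a
-requirement of any particular package.
-
-  # Pre-emphasis filter: y[n] = x[n] - a * x[n-1], with a typically 0.95-0.97.
-  def pre_emphasis(samples, a=0.97):
-      out = list(samples[:1])              # first sample passes through unchanged
-      for n in range(1, len(samples)):
-          out.append(samples[n] - a * samples[n - 1])
-      return out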
-
-
-
-(3) Framing/Windowing involves separating the sample data into
-specific sizes. This is often rolled into step 2 or step 4. This step
-also involves preparing the sample boundaries for analysis (removing
-edge clicks, etc.)
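-
-A minimal sketch of step (3): split the signal into overlapping frames
-and apply a Hamming window to each so the frame edges taper smoothly
-(avoiding the edge clicks mentioned above). The 400-sample frame and
-160-sample step (25ms and 10ms at 16kHz) are typical values assumed
-here for illustration.
-
-  import math
-
-  # Split a signal into overlapping frames and apply a Hamming window to each.
-  def frame_and_window(samples, frame_len=400, frame_step=160):
-      window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
-                for n in range(frame_len)]
-      frames = []
-      for start in range(0, len(samples) - frame_len + 1, frame_step):
-          frame = samples[start:start + frame_len]
-          frames.append([s * w for s, w in zip(frame, window)])
-      return frames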
-
-
-
-(4) Additional Filtering is not always present. It is the final
-preparation for each window before comparison and matching. Often this
-consists of time alignment and normalization.
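-
-As one example of the normalization mentioned in step (4), the sketch
-below scales a frame to a fixed peak amplitude so that loud and quiet
-utterances can be compared on an equal footing. This is only one of
-many possible normalization schemes and is shown purely for illustration.
-
-  # Peak-normalize a frame so its largest absolute sample equals `target`.
-  def normalize_frame(frame, target=1.0):
-      if not frame:
-          return []
-      peak = max(abs(s) for s in frame)
-      if peak == 0:
-          return list(frame)               # silent frame: leave it unchanged
-      return [s * target / peak for s in frame]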
-
-
-
-There are a huge number of techniques available for (5), Comparison
-and Matching. Most involve comparing the current window with known
-samples. There are methods that use Hidden Markov Models (HMM),
-frequency analysis, differential analysis, linear algebra
-techniques/shortcuts, spectral distortion, and time distortion methods.
-All these methods are used to generate a probability and accuracy score for the match.
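-
-To give a flavor of the template-matching and time-distortion methods
-mentioned above, here is a small dynamic time warping (DTW) sketch that
-computes a distance between two sequences of feature vectors; an
-utterance is matched against whichever stored template gives the
-smallest distance. This is only an illustration: HMM-based recognizers
-replace this kind of direct comparison with probabilistic models.
-
-  # Dynamic time warping distance between two sequences of feature vectors.
-  # A smaller distance means a better match.
-  def dtw_distance(seq_a, seq_b):
-      def dist(u, v):                      # Euclidean distance between two frames
-          return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5
-
-      inf = float("inf")
-      cost = [[inf] * (len(seq_b) + 1) for _ in range(len(seq_a) + 1)]
-      cost[0][0] = 0.0
-      for i in range(1, len(seq_a) + 1):
-          for j in range(1, len(seq_b) + 1):
-              step = dist(seq_a[i - 1], seq_b[j - 1])
-              cost[i][j] = step + min(cost[i - 1][j],      # advance seq_a only
-                                      cost[i][j - 1],      # advance seq_b only
-                                      cost[i - 1][j - 1])  # advance both
-      return cost[-1][-1]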
-
-
-
-(6) Actions can be just about anything the developer wants. *GRIN*
-
-----
-!!6.2. Digital Audio Basics
-
-Audio is inherently an analog phenomenon. Recording a digital sample
-is done by converting the analog signal from the microphone to a
-digital signal through the A/D converter in the sound card. When a
-microphone is operating, sound waves vibrate the magnetic element in
-the microphone, sending an electrical current to the sound card (think
-of a speaker working in reverse). Basically, the A/D converter records
-the value of the electrical voltage at specific intervals.
-
-
-
-There are two important factors during this process. First is the
-"sample rate", or how often to record the voltage values. Second, is
-the "bits per sample", or how accurate the value is recorded. A third
-item is the number of channels (mono or stereo), but for most ASR
-applications mono is sufficient. Most applications use pre-set values
-for these parameters and user's shouldn't change them unless the
-documentation suggests it. Developers should experiment with different
-values to determine what works best with their algorithms.
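-
-If you want to check these parameters on an actual recording, Python's
-standard wave module will report them. The file name below is just a
-placeholder for any WAV file you have recorded.
-
-  import wave
-
-  # Print the recording parameters discussed above for a WAV file.
-  with wave.open("recording.wav", "rb") as wav:
-      print("channels:       ", wav.getnchannels())
-      print("bits per sample:", 8 * wav.getsampwidth())
-      print("sample rate:    ", wav.getframerate(), "Hz")
-      print("length:         ", wav.getnframes() / wav.getframerate(), "seconds")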
-
-
-
-So what is a good sample rate for ASR? Because speech is relatively
-low bandwidth (mostly between 100Hz-8kHz), 8000 samples/sec (8kHz) is
-sufficient for most basic ASR (by the Nyquist theorem, an 8kHz sampling
-rate captures frequencies up to 4kHz, the range most important for
-intelligibility). But some people prefer 16000
-samples/sec (16kHz) because it provides more accurate high frequency
-information. If you have the processing power, use 16kHz. For most
-ASR applications, sampling rates higher than about 22kHz are a waste.
-
-
-
-And what is a good value for "bits per sample"? 8 bits per sample
-will record values between 0 and 255, which means that the position
-of the microphone element is quantized to one of 256 levels. 16 bits per
-sample divides the element position into 65536 possible values.
-Similar to sample rate, if you have enough processing power and
-memory, go with 16 bits per sample. For comparison, an audio
-Compact Disc is encoded with 16 bits per sample at about 44kHz.
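-
-To get a feel for the data involved, here is a quick worked example
-(plain arithmetic, nothing package-specific) of the raw, uncompressed
-data rate for the settings discussed above:
-
-  # Raw data rate = sample rate x bytes per sample x channels.
-  def data_rate(sample_rate, bits_per_sample, channels=1):
-      return sample_rate * (bits_per_sample // 8) * channels   # bytes per second
-
-  print(data_rate(8000, 16))      # 16000 bytes/s  - telephone-quality mono
-  print(data_rate(16000, 16))     # 32000 bytes/s  - a common ASR setting
-  print(data_rate(44100, 16, 2))  # 176400 bytes/s - audio CD, stereo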
-
-
-
-The encoding format used should be simple - linear signed or
-unsigned. Using a U-Law/A-Law algorithm or some other compression
-scheme is usually not worth it, as it will cost you in computing power,
-and not gain you much.
-
-----
-!!!7. Publications
-
-If there is a publication that is not on this list, that you think
-should be, please send the information to me at: scook@gear21.com.
-
-----
-!!7.1. Books
-
-
-
-
-
-
-
-* "Fundamentals of Speech Recognition". L. Rabiner & B. Juang. 1993. ISBN: 0130151572.
-* "How to Build a Speech Recognition Application". B. Balentine, D. Morgan, and W. Meisel. 1999. ISBN: 0967127815.
-* "Speech Recognition: Theory and C++ Implementation". C. Becchetti and L.P. Ricotti. 1999. ISBN: 0471977306.
-* "Applied Speech Technology". A. Syrdal, R. Bennett, S. Greenspan. 1994. ISBN: 0849394562.
-* "Speech Recognition: The Complete Practical Reference Guide". P. Foster, T. Schalk. 1993. ISBN: 0936648392.
-* "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition". D. Jurafsky, J. Martin. 2000. ISBN: 0130950696.
-* "Discrete-Time Processing of Speech Signals (IEEE Press Classic Reissue)". J. Deller, J. Hansen, J. Proakis. 1999. ISBN: 0780353862.
-* "Statistical Methods for Speech Recognition (Language, Speech, and Communication)". F. Jelinek. 1999. ISBN: 0262100665.
-* "Digital Processing of Speech Signals". L. Rabiner, R. Schafer. 1978. ISBN: 0132136031.
-* "Foundations of Statistical Natural Language Processing". C. Manning, H. Schutze. 1999. ISBN: 0262133601.
-* "Designing Effective Speech Interfaces". S. Weinschenk, D. T. Barker. 2000. ISBN: 0471375454.
-
-
- For a very LARGE online bibliography, check the Institut Fur Phonetik:
-http://www.informatik.uni-frankfurt.de/~ifb/bib_engl.html
-
-----
-!!7.2. Internet
-
-
-
-
-
-; news:comp.speech:
-
- Newsgroup dedicated to computers and speech.
-
-* US: http://www.speech.cs.cmu.edu/comp.speech/
-* UK: http://svr-www.eng.cam.ac.uk/comp.speech/
-* Aus: http://www.speech.su.oz.au/comp.speech/
-
-
-; news:comp.speech.users:
-
- Newsgroup dedicated to users of speech software.
-
-* http://www.speechtechnology.com/users/comp.speech.users.html
-
-
-; news:comp.speech.research:
-
- Newsgroup dedicated to speech software and hardware research.
-
-
-; news:comp.dsp:
-
- Newsgroup dedicated to digital signal processing.
-
-
-; news:alt.sci.physics.acoustics:
-
- Newsgroup dedicated to the physics of sound.
-
-
-; DDLinux Email List:
-
- Speech Recognition on Linux Mailing List.
-
-* Homepage: http://leb.net/ddlinux/
-* Archives: http://leb.net/pipermail/ddlinux/
-
-
-; Linux Software Repository for speech applications:
-
- http://sunsite.uio.no/pub/linux/sound/apps/speech/
-
-
-; Russ Wilcox's List of Speech Recognition Links:
-
- (excellent) http://www.tiac.net/users/rwilcox/speech.html
-
-
-; Online Bibliography:
-
- Online Bibliography of Phonetics and Speech Technology Publications.
-http://www.informatik.uni-frankfurt.de/~ifb/bib_engl.html
-
-
-; MIT's Spoken Language Systems Homepage:
-
- http://www.sls.lcs.mit.edu/sls/
-
-
-; Oregon Graduate Institute:
-
- Center for Spoken Language Understanding at Oregon Graduate
-Institute. An excellent location for developers and researchers.
-http://cslu.cse.ogi.edu/
-
-
-; IBM's !ViaVoice Linux SDK:
-
- http://www-4.ibm.com/software/speech/dev/sdk_linux.html
-
-
-; Mississippi State:
-
- Mississippi State Institute for Signal and Information Processing
-homepage with a large amount of useful information for developers.
-http://www.isip.msstate.edu/projects/speech/
-
-
-; Speech Technology:
-
- ASR software and accessories.
-http://www.speechtechnology.com
-
-
-; Speech Control:
-
- Speech Controlled Computer Systems. Microphones, headsets, and
-wireless products for ASR.
-http://www.speechcontrol.com
-
-
-; Microphones.com:
-
- Microphones and accessories for ASR.
-http://www.microphones.com
-
-
-; 21st Century Eloquence:
-
- "Speech Recognition Specialists."
-http://voicerecognition.com
-
-
-; Computing Out Loud:
-
- Primarily for Windows users, but good info.
-http://www.out-loud.com
-
-
-; Say I Can.com:
-
- "The Speech Recognition Information Source."
-http://www.sayican.com
+Describe [HowToSpeechRecognitionHOWTO]
here.