
Differences between current version and predecessor to the previous major change of HowToSpeechRecognitionHOWTO.

Newer page: version 3 Last edited on Monday, October 25, 2004 5:08:08 am by AristotlePagaltzis
Older page: version 2 Last edited on Saturday, July 20, 2002 12:04:56 am by !PerryLorier
@@ -1,1260 +1 @@
-!!!Speech Recognition HOWTO  
-!Stephen Cook  
-  
- scook@gear21.com  
-  
-  
-  
-__Revision History__  
-  
-Revision v2.0, April 19, 2002. Revised by: scc. Changed license information (now GFDL) and added a new publication.  
-  
-Revision v1.2, February 5, 2002. Revised by: scc. Added more commercial software listings (sent by Mayur Patel).  
-  
-Revision v1.1, October 5, 2001. Revised by: scc. Added info for Vocalis Speechware. Fixed/Updated various other items.  
-  
-Revision v1.0, November 20, 2000. Revised by: scc. Added info on L&H and HTK.  
-  
-Revision v0.5, September 13, 2000. Revised by: scc. Initial HOWTO Submission.  
-  
-  
-  
-  
-  
-Automatic Speech Recognition (ASR) on Linux is becoming easier.  
-Several packages are available for users as well as developers.  
-This document describes the basics of speech recognition and  
-describes some of the available software.  
-  
-  
-  
-  
-  
-  
-----; __Table of Contents__; 1. Legal Notices: ; 1.1. Copyright/License; 1.2. Disclaimer; 1.3. Trademarks; 2. Foreword: ; 2.1. About This Document; 2.2. Acknowledgements; 2.3. Comments/Updates/Feedback; 2.4. !ToDo; 2.5. Revision History; 3. Introduction: ; 3.1. Speech Recognition Basics; 3.2. Types of Speech Recognition; 3.3. Uses and Applications; 4. Hardware: ; 4.1. Sound Cards; 4.2. Microphones; 4.3. Computers/Processors; 5. Speech Recognition Software: ; 5.1. Free Software; 5.2. Commercial Software; 6. Inside Speech Recognition: ; 6.1. How Recognizers Work; 6.2. Digital Audio Basics; 7. Publications: ; 7.1. Books; 7.2. Internet  
-!!!1. Legal Notices  
-!!1.1. Copyright/License  
-  
-Copyright (c) 2000-2002 Stephen C. Cook.  
-Permission is granted to copy, distribute, and/or modify this document under the  
-terms of the GNU Free Documentation License, Version 1.1 or any later version published  
-by the Free Software Foundation.  
-  
-  
-  
-This document is made available under the terms of the GNU Free Documentation License (GFDL), which is hereby  
-incorporated by reference.  
-  
-----  
-!!1.2. Disclaimer  
-  
-The author disclaims all warranties with regard to this document,  
-including all implied warranties of merchantability and fitness for a  
-certain purpose; in no event shall the author be liable for any  
-special, indirect or consequential damages or any damages whatsoever  
-resulting from loss of use, data or profits, whether in an action of  
-contract, negligence or other tortious action, arising out of or in  
-connection with the use of this document.  
-  
-----  
-!!1.3. Trademarks  
-  
-All trademarks contained in this document are copyright/trademark  
-of their respective owners.  
-  
-----  
-!!!2. Foreword  
-!!2.1. About This Document  
-  
-This document is targeted at the beginner to intermediate level Linux  
-user interested in learning about Speech Recognition and trying it out.  
-It may also help the interested developer in explaining the basics of  
-speech recognition programming.  
-  
-  
-  
-I started this document when I began researching what speech  
-recognition software and development libraries were available for Linux.  
-Automated Speech Recognition (ASR or just SR) on Linux is just starting  
-to come into its own, and I hope this document gives it a push in the  
-right direction - by supporting both users and developers of ASR  
-technology.  
-  
-  
-  
-I have left a variety of SR techniques out of this document, and  
-instead I have focused on the "HOWTO" aspect (since this is a howto...).  
-I have included a Publications section so the interested reader can  
-find books and articles on anything not covered here. This is not  
-meant to be a definitive statement of ASR on Linux.  
-  
-  
-  
-For the most recent version of this document, check the LDP archive,  
-or go to:  
-http://www.gear21.com/speech/index.html.  
-  
-----  
-!!2.2. Acknowledgements  
-  
-I would like to thank the following people for their help, reviews,  
-and support of this document:  
-  
-  
-  
-  
-  
-  
-  
-  
-****  
-  
- Jessica Perry Hekman  
-  
-  
-  
-****  
-****  
-  
- Geoff Wexler  
-  
-  
-  
-****  
-  
-----  
-!!2.3. Comments/Updates/Feedback  
-  
-If you have any comments, suggestions, revisions, updates, or just  
-want to chat about ASR, please send an email to me at  
-scook@gear21.com.  
-  
-----  
-!!2.4. !ToDo  
-  
-The following things are left "to do":  
-  
-  
-  
-  
-  
-  
-  
-  
-****  
-  
- Add descriptions in the Publications section.  
-  
-  
-  
-****  
-****  
-  
- Add more books to the Publications section.  
-  
-  
-  
-****  
-****  
-  
- Add more links with descriptions.  
-  
-  
-  
-****  
-****  
-  
- Enhance the description of the ASR system steps  
-  
-  
-  
-****  
-****  
-  
- Include descriptions of FFTs and Filters.  
-  
-  
-  
-****  
-****  
-  
- Include descriptions of DSP principles.  
-  
-  
-  
-****  
-  
-----  
-!!2.5. Revision History  
-  
-v0.1 first rough draft - August 2000  
-  
-  
-  
-v0.5 final draft - September 2000  
-  
-----  
-!!!3. Introduction  
-!!3.1. Speech Recognition Basics  
-  
-  
-Speech recognition is the process by which a computer (or  
-other type of machine) identifies spoken words. Basically, it means  
-talking to your computer, AND having it correctly recognize what you  
-are saying.  
-  
-  
-  
-The following definitions are the basics needed for understanding  
-speech recognition technology.  
-  
-  
-  
-  
-  
-  
-  
-; Utterance:  
-  
- An utterance is the vocalization (speaking) of a word or words that  
-represent a single meaning to the computer. Utterances can be a  
-single word, a few words, a sentence, or even multiple sentences.  
-  
-  
-; Speaker Dependence:  
-  
- Speaker dependent systems are designed around a specific speaker.  
-They generally are more accurate for the correct speaker, but much  
-less accurate for other speakers. They assume the speaker will  
-speak in a consistent voice and tempo. Speaker independent systems  
-are designed for a variety of speakers. Adaptive systems usually start  
-as speaker independent systems and utilize training techniques to  
-adapt to the speaker to increase their recognition accuracy.  
-  
-  
-; Vocabularies:  
-  
- Vocabularies (or dictionaries) are lists of words or utterances that  
-can be recognized by the SR system. Generally, smaller vocabularies  
-are easier for a computer to recognize, while larger vocabularies  
-are more difficult. Unlike normal dictionaries, each entry doesn't  
-have to be a single word. They can be as long as a sentence or two.  
-Smaller vocabularies can have as few as 1 or 2 recognized utterances  
-(e.g. "Wake Up"), while very large vocabularies can have a hundred  
-thousand or more!  
-  
-  
-; Accuracy:  
-  
- The ability of a recognizer can be examined by measuring its  
-accuracy - or how well it recognizes utterances. This includes not  
-only correctly identifying an utterance but also identifying if the  
-spoken utterance is not in its vocabulary. Good ASR systems have an  
-accuracy of 98% or more! The acceptable accuracy of a system  
-really depends on the application.  
-  
-  
-; Training:  
-  
- Some speech recognizers have the ability to adapt to a speaker.  
-When the system has this ability, it may allow training to take  
-place. An ASR system is trained by having the speaker repeat  
-standard or common phrases and adjusting its comparison algorithms  
-to match that particular speaker. Training a recognizer usually  
-improves its accuracy.  
-  
-  
-  
-  
- Training can also be used by speakers that have difficulty  
-speaking, or pronouncing certain words. As long as the speaker  
-can consistently repeat an utterance, ASR systems with training  
-should be able to adapt.  
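
As a rough illustration of the accuracy definition above, the following Python sketch scores a recognizer against a list of test trials. The trial format and the function name `accuracy` are assumptions made for this example; they do not come from any package listed in this document. An utterance outside the vocabulary is represented as `None`, so correctly rejecting it counts as a hit.

```python
def accuracy(trials):
    """Return the fraction of trials the recognizer got right.

    Each trial is a (spoken, recognized) pair. Use None for `spoken`
    when the utterance is not in the vocabulary, and None for
    `recognized` when the recognizer rejected the input.
    """
    if not trials:
        raise ValueError("no trials to score")
    correct = sum(1 for spoken, recognized in trials if spoken == recognized)
    return correct / len(trials)


if __name__ == "__main__":
    trials = [
        ("wake up", "wake up"),   # correct recognition
        ("open", "open"),         # correct recognition
        (None, None),             # out-of-vocabulary, correctly rejected
        (None, "open"),           # out-of-vocabulary, wrongly accepted
    ]
    print(accuracy(trials))
```

Whether a given score is acceptable still depends on the application, as noted above.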
-  
-  
-  
-  
-----  
-!!3.2. Types of Speech Recognition  
-  
-  
-Speech recognition systems can be separated into several different  
-classes by describing what types of utterances they have the ability  
-to recognize. These classes are based on the fact that one of the  
-difficulties of ASR is the ability to determine when a speaker starts  
-and finishes an utterance. Most packages can fit into more than one  
-class, depending on which mode they're using.  
-  
-  
-  
-  
-  
-  
-  
-; Isolated Words:  
-  
- Isolated word recognizers usually require each utterance to have  
-quiet (lack of an audio signal) on BOTH sides of the sample window.  
-This doesn't mean that it accepts only single words, but it does require  
-a single utterance at a time. Often, these systems have  
-"Listen/Not-Listen" states, where they require the speaker to wait  
-between utterances (usually doing processing during the pauses).  
-Isolated Utterance might be a better name for this class.  
-  
-  
-; Connected Words:  
-  
- Connected word systems (or more correctly 'connected utterances')  
-are similar to Isolated words, but allow separate utterances to be  
-'run-together' with a minimal pause between them.  
-  
-  
-; Continuous Speech:  
-  
- Continuous recognition is the next step. Recognizers with continuous  
-speech capabilities are some of the most difficult to create because  
-they must utilize special methods to determine utterance boundaries.  
-Continuous speech recognizers allow users to speak almost naturally,  
-while the computer determines the content. Basically, it's computer  
-dictation.  
-  
-  
-; Spontaneous Speech:  
-  
- There appears to be a variety of definitions for what spontaneous  
-speech actually is. At a basic level, it can be thought of as  
-speech that is natural sounding and not rehearsed. An ASR system  
-with spontaneous speech ability should be able to handle a variety  
-of natural speech features such as words being run together, "ums"  
-and "ahs", and even slight stutters.  
-  
-  
-; Voice Verification/Identification:  
-  
- Some ASR systems have the ability to identify specific users. This  
-document doesn't cover verification or security systems.  
-  
-  
-  
-  
-----  
-!!3.3. Uses and Applications  
-  
-  
-Although any task that involves interfacing with a computer can  
-potentially use ASR, the following applications are the most  
-common right now.  
-  
-  
-  
-  
-  
-  
-  
-; Dictation:  
-  
- Dictation is the most common use for ASR systems today. This  
-includes medical transcriptions, legal and business dictation, as  
-well as general word processing. In some cases special vocabularies  
-are used to increase the accuracy of the system.  
-  
-  
-; Command and Control:  
-  
- ASR systems that are designed to perform functions and actions on the  
-system are defined as Command and Control systems. Utterances like  
-"Open Netscape" and "Start a new xterm" will do just that.  
-  
-  
-; Telephony:  
-  
- Some PBX/Voice Mail systems allow callers to speak commands instead of  
-pressing buttons to send specific tones.  
-  
-  
-; Wearables:  
-  
- Because inputs are limited for wearable devices, speaking is a  
-natural possibility.  
-  
-  
-; Medical/Disabilities:  
-  
- Many people have difficulty typing due to physical limitations such  
-as repetitive strain injuries (RSI), muscular dystrophy, and  
-many others. For example, people with difficulty hearing could use  
-a system connected to their telephone to convert the caller's speech  
-to text.  
-  
-  
-; Embedded Applications:  
-  
- Some newer cellular phones include C&C speech recognition that allows  
-utterances such as "Call Home". This could be a major factor in the  
-future of ASR and Linux. Why can't I talk to my television yet?  
-  
-  
-  
-  
-----  
-!!!4. Hardware  
-!!4.1. Sound Cards  
-  
-  
-Because speech requires a relatively low bandwidth, just about any  
-medium-high quality 16 bit sound card will get the job done. You must  
-have sound enabled in your kernel, and you must have correct drivers  
-installed. For more information on sound cards, please see "The Linux  
-Sound HOWTO" available at: http://www.!LinuxDoc.org/. Sound card  
-quality often starts a heated discussion about their impact on accuracy  
-and noise.  
-  
-  
-  
-Sound cards with the 'cleanest' A/D (analog to digital) conversions  
-are recommended, but most often the clarity of the digital sample is  
-more dependent on the microphone quality and even more dependent on the  
-environmental noise. Electrical "noise" from monitors, PCI slots,  
-hard-drives, etc. is usually nothing compared to audible noise  
-from the computer fans, squeaking chairs, or heavy breathing.  
-  
-  
-  
-Some ASR software packages may require a specific sound card. It's  
-usually a good idea to stay away from specific hardware requirements,  
-because it limits many of your possible future options and decisions.  
-You'll have to weigh the benefits and costs if you are considering  
-packages that require specific hardware to function properly.  
-  
-----  
-!!4.2. Microphones  
-  
-A quality microphone is key when utilizing ASR. In most cases, a  
-desktop microphone just won't do the job. They tend to pick up more  
-ambient noise that gives ASR programs a hard time.  
-  
-  
-  
-Hand held microphones are also not the best choice as they can be  
-cumbersome to pick up all the time. While they do limit the amount  
-of ambient noise, they are most useful in applications that require  
-changing speakers often, or when speaking to the recognizer isn't  
-done frequently (when wearing a headset isn't an option).  
-  
-  
-  
-  
-The best choice, and by far the most common is the headset style.  
-It allows the ambient noise to be minimized, while allowing you to  
-have the microphone at the tip of your tongue all the time. Headsets  
-are available without earphones and with earphones (mono or stereo).  
-I recommend the stereo headphones, but it's just a matter of personal  
-taste.  
-  
-  
-  
-You can get excellent quality microphone headsets for between $25  
-and $100. A good place to start looking is http://www.headphones.com or  
-http://www.speechcontrol.com.  
-  
-  
-  
-  
-A quick note about levels: Don't forget to turn up your microphone  
-volume. This can be done with a program such as XMixer or OSS Mixer  
-and care should be used to avoid feedback noise. If the ASR software  
-includes auto-adjustment programs, use them instead, as they are  
-optimized for their particular recognition system.  
-  
-----  
-!!4.3. Computers/Processors  
-  
-ASR applications can be heavily dependent on processing speed. This  
-is because a large amount of digital filtering and signal processing  
-can take place in ASR.  
-  
-  
-  
-As with just about any CPU intensive software, the faster the better.  
-Also, the more memory the better. It's possible to do some SR with 100MHz  
-and 16M RAM, but for fast processing (large dictionaries, complex  
-recognition schemes, or high sample rates), you should shoot for a  
-minimum of 400MHz and 128M RAM. Because of the processing required,  
-most software packages list their minimum requirements.  
-  
-  
-  
-Using a cluster (Beowulf or otherwise) to perform massive recognition  
-efforts hasn't yet been undertaken. If you know of any project underway,  
-or in development please send me a note! scook@gear21.com  
-  
-----  
-!!!5. Speech Recognition Software  
-!!5.1. Free Software  
-  
-Much of the free software listed here is available for download at:  
-http://sunsite.uio.no/pub/Linux/sound/apps/speech/  
-  
-  
-----  
-!5.1.1. XVoice  
-  
-XVoice is a dictation/continuous speech recognizer that can be used  
-with a variety of XWindow applications. It allows user-defined macros.  
-This is a fine program with a definite future. Once set up, it  
-performs with adequate accuracy.  
-  
-  
-  
-XVoice requires that you download and install IBM's (free) !ViaVoice  
-for Linux (See Commercial Section). It also requires the configuration  
-of !ViaVoice to work correctly. Additionally, Lesstif/Motif (libXm) is  
-required. It is also important to note that because this program  
-interacts with X windows, you must leave X resources open on your  
-machine, so caution should be used if you use this on a networked or  
-multi-user machine.  
-  
-  
-  
-This software is primarily for users. An RPM is available.  
-  
-  
-  
-Homepage: http://www.compapp.dcu.ie/~tdoris/Xvoice/  
-http://www.zachary.com/creemer/xvoice.html  
-  
-  
-  
-Project: http://xvoice.sourceforge.net  
-  
-  
-  
-Community: http://www.onelist.com/community/xvoice  
-  
-----  
-!5.1.2. CVoiceControl/KVoiceControl  
-  
-CVoiceControl (which stands for Console Voice Control) started its  
-life as KVoiceControl (KDE Voice Control). It is a basic speech  
-recognition system that allows a user to execute Linux commands by  
-using spoken commands. CVoiceControl replaces KVoiceControl.  
-  
-  
-  
-The software includes a microphone level configuration utility,  
-a vocabulary "model editor" for adding new commands and utterances,  
-and the speech recognition system.  
-  
-  
-  
-CVoiceControl is an excellent starting point for experienced users  
-looking to get started in ASR. It is not the most user friendly,  
-but once it has been trained correctly, it can be very helpful. Be  
-sure to read the documentation while setting up.  
-  
-  
-  
-This software is primarily for users.  
-  
-  
-  
-Homepage: http://www.kiecza.de/daniel/linux/index.html  
-  
-  
-  
-Documents: http://www.kiecza.de/daniel/linux/cvoicecontrol/index.html  
-  
-----  
-!5.1.3. Open Mind Speech  
-  
-Started in late 1999, Open Mind Speech has changed names several times  
-(was !VoiceControl, then !SpeechInput, and then !FreeSpeech), and is now  
-part of the "Open Mind Initiative". This is an open source project.  
-Currently it isn't completely operational and is primarily for developers.  
-  
-  
-  
-This software is primarily for developers.  
-  
-  
-  
-Homepage: http://freespeech.sourceforge.net  
-  
-----  
-!5.1.4. GVoice  
-  
-GVoice is an ASR library that uses IBM's !ViaVoice (free) SDK  
-to control Gtk/GNOME applications. It includes libraries for  
-initialization, recognition engine, vocabulary manipulation, and panel  
-control. Development on this has been idle for over a year.  
-  
-  
-  
-This software is primarily for developers.  
-  
-  
-  
-Homepage: http://www.cse.ogi.edu/~omega/gnome/gvoice/  
-  
-----  
-!5.1.5. ISIP  
-  
-The Institute for Signal and Information Processing at Mississippi  
-State University has made its speech recognition engine available. The  
-toolkit includes a front-end, a decoder, and a training module. It's a  
-functional toolkit.  
-  
-  
-  
-This software is primarily for developers.  
-  
-  
-  
-The toolkit (and more information about ISIP) is available at:  
-http://www.isip.msstate.edu/project/speech/  
-  
-----  
-!5.1.6. CMU Sphinx  
-  
-Sphinx originally started at CMU and has recently been released as  
-open source. This is a fairly large program that includes a lot of  
-tools and information. It is still "in development", but includes  
-trainers, recognizers, acoustic models, language models, and some  
-limited documentation.  
-  
-  
-  
-This software is primarily for developers.  
-  
-  
-  
-Homepage: http://www.speech.cs.cmu.edu/sphinx/Sphinx.html  
-  
-  
-  
-Source: http://download.sourceforge.net/cmusphinx/sphinx2-.1a.tar.gz  
-  
-----  
-!5.1.7. Ears  
-  
-Although Ears isn't fully developed, it is a good starting  
-point for programmers wishing to start in ASR.  
-  
-  
-  
-This software is primarily for developers.  
-  
-  
-  
-FTP site: ftp://svr-ftp.eng.cam.ac.uk/comp.speech/recognition/  
-  
-----  
-!5.1.8. NICO ANN Toolkit  
-  
-The NICO Artificial Neural Network toolkit is a flexible back  
-propagation neural network toolkit optimized for speech recognition  
-applications.  
-  
-  
-  
-This software is primarily for developers.  
-  
-  
-  
-Its homepage: http://www.speech.kth.se/NICO/index.html  
-  
-----  
-!5.1.9. Myers' Hidden Markov Model Software  
-  
-This software by Richard Myers implements HMM algorithms in C++.  
-It provides an example and learning tool for HMM models described in  
-the L. Rabiner book "Fundamentals of Speech Recognition".  
-  
-  
-  
-This software is primarily for developers.  
-  
-  
-  
-Information is available at:  
-http://www.itl.atr.co.jp/comp.speech/Section6/Recognition/myers.hmm.html  
-  
-----  
-!5.1.10. Jialong He's Speech Recognition Research Tool  
-  
-Although not originally written for Linux, this research tool can be  
-compiled on Linux. It contains three different types of recognizers:  
-DTW, Dynamic Hidden Markov Model, and a Continuous Density Hidden  
-Markov Model. This is for research and development uses, as it is  
-not a fully functional ASR system. The toolkit contains some very  
-useful tools.  
-  
-  
-  
-This software is primarily for developers.  
-  
-  
-  
-More information is available at:  
-http://www.itl.atr.co.jp/comp.speech/Section6/Recognition/jialong.html  
-  
-----  
-!5.1.11. More Free Software?  
-  
-If you know of free software that isn't included in the above list,  
-please send me a note at: scook@gear21.com. If you're in the mood,  
-you can also send me where to get a copy of the software, and any  
-impressions you may have about it. Thanks!  
-  
-----  
-!!5.2. Commercial Software  
-!5.2.1. IBM !ViaVoice  
-  
-IBM has made good on their promise to support Linux with their series  
-of !ViaVoice products for Linux, though the future of their SDKs isn't  
-set in stone (their licensing agreement for developers isn't officially  
-released as of this date - more to come).  
-  
-  
-  
-Their commercial (not-free) product, IBM !ViaVoice Dictation for Linux  
-(available at http://www-4.ibm.com/software/speech/linux/dictation.html)  
-performs very well, but has some sizeable system requirements compared  
-to the more basic ASR systems (64M RAM and 233MHz Pentium). For the  
-$59.95US price tag you also get an Andrea NC-8 microphone. It also  
-allows multiple users (but I haven't tried it with multiple users, so  
-if anyone has any experience please give me a shout). The package  
-includes: documentation (PDF), Trainer, dictation system, and  
-installation scripts. Support for additional Linux Distributions based  
-on 2.2 kernels is also available in the latest release.  
-  
-  
-  
- The ASR SDK is available for free, and includes IBM's SMAPI, grammar  
-API, documentation, and a variety of sample programs. The !ViaVoice  
-Run Time Kit provides an ASR engine and data files for dictation  
-functions, and user utilities. The !ViaVoice Command & Control Run Time  
-Kit includes the ASR engine and data files for command and control  
-functions, and user utilities. The SDK and Kits require 128M RAM and  
-a Linux 2.2 or better kernel.  
-  
-  
-  
-The SDKs and Kits are available for free at:  
-http://www-4.ibm.com/software/speech/dev/sdk_linux.html  
-  
-----  
-!5.2.2. Vocalis Speechware  
-  
-More information on Vocalis and Vocalis Speechware is available at:  
-http://www.vocalisspeechware.com and  
-http://www.vocalis.com.  
-  
-----  
-!5.2.3. Babel Technologies  
-  
-Babel Technologies has a Linux SDK available called Babear. It is a speaker-independent  
-system based on Hybrid Markov Models and Artificial Neural Networks technology. They also  
-have a variety of products for Text-to-speech, speaker verification, and phoneme analysis.  
-More information is available at: http://www.babeltech.com.  
-  
-----  
-!5.2.4. !SpeechWorks  
-  
-I didn't see anything on their website that specifically mentioned Linux, but their  
-"!OpenSpeech Recognizer" uses VoiceXML, which is an open standard.  
-More information is available at: http://www.speechworks.com.  
-  
-----  
-!5.2.5. Nuance  
-  
-Nuance offers a speech recognition/natural language product (currently Nuance 8.) for  
-a variety of *nix platforms. It can handle very large vocabularies and uses a unique  
-distributed architecture for scalability and fault tolerance.  
-More information is available at: http://www.nuance.com.  
-  
-----  
-!5.2.6. Abbot/!AbbotDemo  
-  
-Abbot is a very large vocabulary, speaker independent ASR system.  
-It was originally developed by the Connectionist Speech Group at  
-Cambridge University. It was transferred (commercialized) to  
-!SoftSound. More information is available at:  
-http://www.softsound.com.  
-  
-  
-  
-!AbbotDemo is a demonstration package of Abbot. This demo system  
-has a vocabulary of about 5000 words and uses the connectionist/HMM  
-continuous speech algorithm. This is a demonstration program with no  
-source code.  
-  
-----  
-!5.2.7. Entropic  
-  
-The fine people over at Entropic have been bought out by Micro$oft...  
-Their products and support services have all but disappeared. Their  
-support for HTK and ESPS/waves+ is gone, and their future is in the  
-hands of M$. Their old website at http://www.entropic.com has more  
-information.  
-  
-  
-  
-K.K. Chin advised me that the original developers of the HTK (the  
-Speech, Vision and Robotics Group at Cambridge) are still  
-providing support for it. There is also a "free" version  
-available at: http://htk.eng.cam.ac.uk.  
-Also note that Microsoft still owns the copyright to the current  
-HTK code...  
-  
-  
-----  
-!5.2.8. More Commercial Products  
-  
-There are rumors of more commercial ASR products becoming available  
-in the near future (including L&H). I talked with a couple of  
-L&H representatives at Comdex 2000 (Vegas) and none of them could give  
-me any information on a Linux release, or even if they planned on releasing  
-any products for Linux. If you have any further information, please send  
-any details to me at scook@gear21.com.  
-  
-----  
-!!!6. Inside Speech Recognition  
-!!6.1. How Recognizers Work  
-  
-  
-Recognition systems can be broken down into two main types. Pattern  
-Recognition systems compare patterns to known/trained patterns to  
-determine a match. Acoustic Phonetic systems use knowledge of the  
-human body (speech production, and hearing) to compare speech features  
-(phonetics such as vowel sounds). Most modern systems focus on the  
-pattern recognition approach because it combines nicely with current  
-computing techniques and tends to have higher accuracy.  
-  
-  
-  
-Most recognizers can be broken down into the following steps:  
-  
-  
-  
-  
-  
-  
-  
-  
-***#  
-  
- Audio recording and Utterance detection  
-  
-  
-  
-***#  
-***#  
-  
- Pre-Filtering (pre-emphasis, normalization, banding, etc.)  
-  
-  
-  
-***#  
-***#  
-  
- Framing and Windowing (chopping the data into a usable format)  
-  
-  
-  
-***#  
-***#  
-  
- Filtering (further filtering of each window/frame/freq. band)  
-  
-  
-  
-***#  
-***#  
-  
- Comparison and Matching (recognizing the utterance)  
-  
-  
-  
-***#  
-***#  
-  
- Action (Perform function associated with the recognized pattern)  
-  
-  
-  
-***#  
-  
-  
-  
-  
-Although each step seems simple, each one can involve a multitude of  
-different (and sometimes completely opposite) techniques.  
-  
-  
-  
-(1) Audio/Utterance Recording: can be accomplished in a number of ways.  
-Starting points can be found by comparing ambient audio levels (acoustic  
-energy in some cases) with the sample just recorded. Endpoint detection  
-is harder because speakers tend to leave "artifacts" including  
-breathing/sighing, teeth chatters, and echoes.  
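
One way to picture the level comparison described in step (1) is the Python sketch below. It is only an illustration under stated assumptions: the function names, the frame length, and the threshold ratio are all invented for this example, and it assumes the recording opens with a few frames of ambient noise only. Real recognizers use considerably more robust endpoint detection.

```python
def frame_energies(samples, frame_len):
    """Return the mean squared amplitude of each full frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_utterance(samples, frame_len=80, noise_frames=5, ratio=4.0):
    """Return (start, end) sample indexes spanning the loud frames, or None.

    The first `noise_frames` frames are assumed to contain only ambient
    noise; a frame counts as speech when its energy exceeds the ambient
    energy by the factor `ratio`.
    """
    energies = frame_energies(samples, frame_len)
    ambient = sum(energies[:noise_frames]) / noise_frames
    threshold = max(ambient * ratio, 1e-9)
    active = [i for i, energy in enumerate(energies) if energy > threshold]
    if not active:
        return None
    # Span from the first loud frame to the end of the last loud frame.
    return active[0] * frame_len, (active[-1] + 1) * frame_len


if __name__ == "__main__":
    quiet = [0.01 if n % 2 else -0.01 for n in range(800)]
    loud = [0.5 if n % 2 else -0.5 for n in range(800)]
    print(find_utterance(quiet + loud + quiet))
```

Because the end point is simply the last loud frame, a breath or echo after the utterance would stretch the detected span, which is the difficulty the paragraph above describes.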
-  
-  
-  
-(2) Pre-Filtering: is accomplished in a variety of ways, depending on  
-other features of the recognition system. The most common methods are  
-the "Bank-of-Filters" method which utilizes a series of audio filters to  
-prepare the sample, and the Linear Predictive Coding method which uses  
-a prediction function to calculate differences (errors). Different  
-forms of spectral analysis are also used.  
-  
-  
-  
-(3) Framing/Windowing involves separating the sample data into  
-specific sizes. This is often rolled into step 2 or step 4. This step  
-also involves preparing the sample boundaries for analysis (removing  
-edge clicks, etc.)  
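
Steps (2) and (3) can be sketched in a few lines of Python. This is a minimal sketch, not the method of any particular recognizer: the pre-emphasis coefficient of 0.97, the choice of a Hamming window, and all function names are assumptions picked for illustration.

```python
import math


def pre_emphasize(samples, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1]."""
    if not samples:
        return []
    return [samples[0]] + [
        samples[n] - coeff * samples[n - 1] for n in range(1, len(samples))
    ]


def hamming(length):
    """Hamming window; it tapers the frame edges to soften edge clicks."""
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def frames(samples, frame_len, step):
    """Split samples into overlapping, windowed frames (partial tail dropped)."""
    window = hamming(frame_len)
    return [
        [s * w for s, w in zip(samples[start:start + frame_len], window)]
        for start in range(0, len(samples) - frame_len + 1, step)
    ]


if __name__ == "__main__":
    signal = pre_emphasize([1.0] * 400)
    windowed = frames(signal, frame_len=200, step=100)
    print(len(windowed), "frames of", len(windowed[0]), "samples")
```

A step smaller than the frame length makes neighbouring frames overlap, so detail that is attenuated at the tapered edge of one frame still appears near the middle of the next.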
-  
-  
-  
-(4) Additional Filtering is not always present. It is the final  
-preparation for each window before comparison and matching. Often this  
-consists of time alignment and normalization.  
-  
-  
-  
-There are a huge number of techniques available for (5), Comparison  
-and Matching. Most involve comparing the current window with known  
-samples. There are methods that use Hidden Markov Models (HMM),  
-frequency analysis, differential analysis, linear algebra  
-techniques/shortcuts, spectral distortion, and time distortion methods.  
-All these methods are used to generate a probability and accuracy match.  
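
As one concrete example of a time distortion method, the Python sketch below uses dynamic time warping (DTW) to compare a sample against stored templates. It is a simplified illustration: each frame is reduced to a single number, whereas a real recognizer would compare feature vectors, and the template data and function names are made up for this example.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    inf = float("inf")
    # cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[len(a)][len(b)]


def best_match(sample, templates):
    """Return the name of the template with the smallest DTW distance."""
    return min(templates, key=lambda name: dtw_distance(sample, templates[name]))


if __name__ == "__main__":
    templates = {"yes": [1, 3, 4, 3, 1], "no": [5, 5, 1, 1, 1]}
    slow_yes = [1, 1, 3, 4, 4, 3, 1]  # same shape as "yes", spoken more slowly
    print(best_match(slow_yes, templates))
```

The warping lets a slowly spoken utterance line up with a faster template, which is why the speaker's tempo matters less than the shape of the signal.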
-  
-  
-  
-(6) Actions can be just about anything the developer wants. *GRIN*  
-  
-----  
-!!6.2. Digital Audio Basics  
-  
-Audio is inherently an analog phenomenon. Recording a digital sample  
-is done by converting the analog signal from the microphone to a  
-digital signal through the A/D converter in the sound card. When a  
-microphone is operating, sound waves vibrate the magnetic element in  
-the microphone, causing an electrical current to flow to the sound card (think  
-of a speaker working in reverse). Basically, the A/D converter records  
-the value of the electrical voltage at specific intervals.  
-  
-  
-  
-There are two important factors during this process. First is the  
-"sample rate", or how often to record the voltage values. Second, is  
-the "bits per sample", or how accurately the value is recorded. A third  
-item is the number of channels (mono or stereo), but for most ASR  
-applications mono is sufficient. Most applications use pre-set values  
-for these parameters and users shouldn't change them unless the  
-documentation suggests it. Developers should experiment with different  
-values to determine what works best with their algorithms.  
-  
-  
-  
-So what is a good sample rate for ASR? A sample rate can only capture  
-frequencies up to half its value, and speech is relatively low bandwidth  
-(mostly between 100Hz-8kHz). 8000 samples/sec (8kHz) captures speech up  
-to 4kHz (telephone quality), which is sufficient for most basic ASR. But,  
-some people prefer 16000 samples/sec (16kHz) because it provides more  
-accurate high frequency information. If you have the processing power,  
-use 16kHz. For most ASR applications, sampling rates higher than about  
-22kHz are a waste.  
-  
-  
-  
-And what is a good value for "bits per sample"? 8 bits per sample  
-will record values between 0 and 255, which means that the position  
-of the microphone element is in one of 256 positions. 16 bits per  
-sample divides the element position into 65536 possible values.  
-Similar to sample rate, if you have enough processing power and  
-memory, go with 16 bits per sample. For comparison, an audio  
-Compact Disc is encoded with 16 bits per sample at 44.1kHz.  
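
The arithmetic behind these choices is easy to check. The helper names in this Python sketch are invented for illustration; the formulas are the standard ones: a sample rate can represent frequencies up to half its value, n bits give 2^n levels, and the data rate is the product of rate, sample size, and channel count.

```python
def highest_frequency_hz(sample_rate):
    """Highest frequency a sample rate can represent (half the rate)."""
    return sample_rate / 2


def levels(bits_per_sample):
    """Number of distinct values a sample can take."""
    return 2 ** bits_per_sample


def bytes_per_second(sample_rate, bits_per_sample, channels=1):
    """Uncompressed data rate of linear audio."""
    return sample_rate * bits_per_sample * channels // 8


if __name__ == "__main__":
    for rate, bits, channels in [(8000, 8, 1), (16000, 16, 1), (44100, 16, 2)]:
        print(rate, bits, channels,
              highest_frequency_hz(rate),
              levels(bits),
              bytes_per_second(rate, bits, channels))
```

Moving from 8kHz/8-bit to 16kHz/16-bit mono quadruples the amount of data to process, which is why the advice above depends on available processing power and memory.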
-  
-  
-  
-The encoding format used should be simple - linear signed or  
-unsigned. Using a U-Law/A-Law algorithm or some other compression  
-scheme is usually not worth it, as it will cost you in computing power,  
-and not gain you much.  
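
For developers who want to experiment with a simple linear encoding, the sketch below writes and reads mono, 16-bit signed linear samples with Python's standard `wave` and `struct` modules. The function names, the 440Hz test tone, and the 16kHz rate are assumptions chosen for the example; WAV files store 16-bit samples as little-endian signed integers, hence the `<h` format.

```python
import io
import math
import struct
import wave


def write_tone(fileobj, freq_hz=440.0, count=1600, rate=16000):
    """Write `count` samples of a sine tone as mono 16-bit linear WAV data."""
    samples = [
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * n / rate))
        for n in range(count)
    ]
    with wave.open(fileobj, "wb") as out:
        out.setnchannels(1)    # mono
        out.setsampwidth(2)    # 2 bytes = 16 bits per sample
        out.setframerate(rate)
        out.writeframes(struct.pack("<%dh" % count, *samples))


def read_samples(fileobj):
    """Return (rate, samples) from a mono 16-bit WAV file."""
    with wave.open(fileobj, "rb") as src:
        if src.getnchannels() != 1 or src.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit audio")
        data = src.readframes(src.getnframes())
        count = len(data) // 2
        return src.getframerate(), list(struct.unpack("<%dh" % count, data))


if __name__ == "__main__":
    buffer = io.BytesIO()          # stands in for a file on disk
    write_tone(buffer)
    buffer.seek(0)
    rate, samples = read_samples(buffer)
    print(rate, len(samples), min(samples), max(samples))
```

Because the samples are stored without companding or compression, they can be passed straight to filtering code with no decoding step.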
-  
-----  
-!!!7. Publications  
-  
-If there is a publication that is not on this list, that you think  
-should be, please send the information to me at: scook@gear21.com.  
-  
-----  
-!!7.1. Books  
-  
-  
-  
-  
-  
-  
-  
-****  
-  
- "Fundamentals of Speech Recognition". L. Rabiner and B. Juang. 1993.  
-ISBN: 0130151572.  
-  
-  
-  
-****  
-****  
-  
- "How to Build a Speech Recognition Application". B. Balentine,  
-D. Morgan, and W. Meisel. 1999. ISBN: 0967127815.  
-  
-  
-  
-****  
-****  
-  
-  
-"Speech Recognition : Theory and C++ Implementation". C. Becchetti  
-and L.P. Ricotti. 1999. ISBN: 0471977306.  
-  
-  
-  
-****  
-****  
-  
- "Applied Speech Technology". A. Syrdal, R. Bennett, S. Greenspan.  
-1994. ISBN: 0849394562.  
-  
-  
-  
-****  
-****  
-  
- "Speech Recognition : The Complete Practical Reference Guide".  
-P. Foster, T. Schalk. 1993. ISBN: 0936648392.  
-  
-  
-  
-****  
-****  
-  
- "Speech and Language Processing: An Introduction to Natural Language  
-Processing, Computational Linguistics and Speech Recognition".  
-D. Jurafsky, J. Martin. 2000. ISBN: 0130950696.  
-  
-  
-  
-****  
-****  
-  
- "Discrete-Time Processing of Speech Signals (IEEE Press Classic  
-Reissue)". J. Deller, J. Hansen, J. Proakis. 1999.  
-ISBN: 0780353862.  
-  
-  
-  
-****  
-****  
-  
- "Statistical Methods for Speech Recognition (Language, Speech, and  
-Communication)". F. Jelinek. 1999. ISBN: 0262100665.  
-  
-  
-  
-****  
-****  
-  
- "Digital Processing of Speech Signals". L. Rabiner, R. Schafer. 1978.  
-ISBN: 0132136031.  
-  
-  
-  
-****  
-****  
-  
- "Foundations of Statistical Natural Language Processing".  
-C. Manning, H. Schutze. 1999. ISBN: 0262133601.  
-  
-  
-  
-****  
-****  
-  
- "Designing Effective Speech Interfaces".  
-S. Weinschenk, D. T. Barker. 2000. ISBN: 0471375454.  
-  
-  
-  
-****  
-  
-  
-  
- For a very LARGE online bibliography, check the Institut Fur Phonetik:  
-http://www.informatik.uni-frankfurt.de/~ifb/bib_engl.html  
-  
-----  
-!!7.2. Internet  
-  
-  
-  
-  
-  
-; news:comp.speech:  
-  
- Newsgroup dedicated to computers and speech.  
-  
-  
-  
-  
-  
-****  
-  
- US: http://www.speech.cs.cmu.edu/comp.speech/  
-  
-  
-  
-****  
-****  
-  
- UK: http://svr-www.eng.cam.ac.uk/comp.speech/  
-  
-  
-  
-****  
-****  
-  
- Aus: http://www.speech.su.oz.au/comp.speech/  
-  
-  
-  
-****  
-  
-  
-; news:comp.speech.users:  
-  
- Newsgroup dedicated to users of speech software.  
-  
-  
-  
-  
-  
-  
-  
-  
-  
-****  
-  
- http://www.speechtechnology.com/users/comp.speech.users.html  
-  
-  
-  
-****  
-  
-  
-; news:comp.speech.research:  
-  
- Newsgroup dedicated to speech software and hardware research.  
-  
-  
-; news:comp.dsp:  
-  
- Newsgroup dedicated to digital signal processing.  
-  
-  
-; news:alt.sci.physics.acoustics:  
-  
- Newsgroup dedicated to the physics of sound.  
-  
-  
-; DDLinux Email List:  
-  
- Speech Recognition on Linux Mailing List.  
-  
-  
-  
-  
-  
-****  
-  
- Homepage: http://leb.net/ddlinux/  
-  
-  
-  
-****  
-****  
-  
- Archives: http://leb.net/pipermail/ddlinux/  
-  
-  
-  
-****  
-  
-  
-; Linux Software Repository for speech applications:  
-  
- http://sunsite.uio.no/pub/linux/sound/apps/speech/  
-  
-  
-; Russ Wilcox's List of Speech Recognition Links:  
-  
- (excellent) http://www.tiac.net/users/rwilcox/speech.html  
-  
-  
-; Online Bibliography:  
-  
- Online Bibliography of Phonetics and Speech Technology Publications.  
-http://www.informatik.uni-frankfurt.de/~ifb/bib_engl.html  
-  
-  
-; MIT's Spoken Language Systems Homepage:  
-  
- http://www.sls.lcs.mit.edu/sls/  
-  
-  
-; Oregon Graduate Institute:  
-  
- Center for Spoken Language Understanding at Oregon Graduate  
-Institute. An excellent location for developers and researchers.  
-http://cslu.cse.ogi.edu/  
-  
-  
-; IBM's !ViaVoice Linux SDK:  
-  
- http://www-4.ibm.com/software/speech/dev/sdk_linux.html  
-  
-  
-; Mississippi State:  
-  
- Mississippi State Institute for Signal and Information Processing  
-homepage with a large amount of useful information for developers.  
-http://www.isip.msstate.edu/projects/speech/  
-  
-  
-; Speech Technology:  
-  
- ASR software and accessories.  
-http://www.speechtechnology.com  
-  
-  
-; Speech Control:  
-  
- Speech Controlled Computer Systems. Microphones, headsets, and  
-wireless products for ASR.  
-http://www.speechcontrol.com  
-  
-  
-; Microphones.com:  
-  
- Microphones and accessories for ASR.  
-http://www.microphones.com  
-  
-  
-; 21st Century Eloquence:  
-  
- "Speech Recognition Specialists."  
-http://voicerecognition.com  
-  
-  
-; Computing Out Loud:  
-  
- Primarily for Windows users, but good info.  
-http://www.out-loud.com  
-  
-  
-; Say I Can.com:  
-  
- "The Speech Recognition Information Source."  
-http://www.sayican.com  
+Describe [HowToSpeechRecognitionHOWTO] here.