Speech recognition is increasingly being deployed as a user interface in new applications. As this technology progresses, it enables new ways for humans to interact with machines and information. Performance in many domains has approached users’ expectations. Although abundant technology challenges remain, speech recognition has reached a level of maturity that warrants considering its deployment in complex systems and environments. It is in this vein that we discuss a systems approach to the successful execution of speech recognition within the Force XXI Land Warrior program.
We discuss the System Voice Control component as it fits within the overall program. We address the requirements for robustness, recognition performance, and computational complexity. We explicitly cover the system aspects, show how they influence the user interface, and identify the parameters for actual use. Finally, we consider the implementation of a polynomial-based classifier for speech recognition, and we provide the final system performance measures on a large domain-specific database.
The Land Warrior Engineering Manufacturing Development (EMD) program is the Army’s revolutionary program to develop and field a totally integrated Soldier Fighting System. This system uses advanced technologies to render unparalleled effectiveness by providing an improved capability to detect, acquire, locate and engage targets at greater ranges, day or night. The system links the individual soldier to the digitized battlefield for improved communications and situational awareness.
The purpose of the Force XXI Land Warrior program is to accelerate the fielding of advanced technology upgrades to the Land Warrior EMD platform. This ensures a global technology advantage for dismounted warrior combat systems.
The System Voice Control (SVC) component of Force XXI Land Warrior provides a speech interface to the existing soldier computer from the Land Warrior EMD program. The intent is to provide the dismounted soldier with an efficient method of hands-busy, eyes-busy control of the soldier system. Figure 1 illustrates the SVC concept.
The application of speech recognition in a combat environment imposes challenging performance requirements. Both recognition and out-of-vocabulary (OOV) rejection must maintain usable performance levels in adverse noise conditions. In addition, the system must respond to a wide dynamic range of voice levels – low levels for covert operations, and high levels for noisy situations. Voice stress is another concern for performance, as the Lombard effect is often encountered. Finally, the algorithms must be computationally efficient so as not to drain the system battery, and the word models must be small enough to fit into the available memory.
The design of the overall system and the user interface is discussed in Section 2. The structure of the actual fielded speech recognition algorithm, which is based on a polynomial classifier, is explained in Section 3. In Sections 4 and 5, solutions to the technical problems of noise robustness, stressed speech, and out-of-vocabulary rejection are discussed. Finally, the validation and results of the final system are given in Section 6.
2. SYSTEM DESIGN & USER INTERFACE
The basic block diagram of the SVC component is shown in Figure 2. The soldier depresses a button on his weapon to initiate the recognition system. A close-talking noise-canceling microphone captures the spoken command, which is digitized by an A/D converter. The recognizer processes the sampled speech, and the soldier computer generates the appropriate response.
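The signal flow just described (push-to-talk button, microphone, A/D conversion, recognizer, soldier computer response) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the fielded SVC software: the names `quantize`, `VoiceControl`, and `toy_recognizer`, the 16-bit sample width, and the "radio on" command are all hypothetical, and the placeholder recognizer stands in for the polynomial classifier of Section 3.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical type aliases; none of these names come from the SVC system.
Recognizer = Callable[[List[int]], Optional[str]]  # None = command rejected
Action = Callable[[], str]


def quantize(analog: Sequence[float], bits: int = 16) -> List[int]:
    """Stand-in for the A/D converter: clip to [-1, 1], scale to signed ints."""
    full_scale = 2 ** (bits - 1) - 1
    return [round(max(-1.0, min(1.0, x)) * full_scale) for x in analog]


class VoiceControl:
    """Push-to-talk pipeline: button -> A/D -> recognizer -> response."""

    def __init__(self, recognizer: Recognizer, actions: Dict[str, Action]) -> None:
        self.recognizer = recognizer
        self.actions = actions

    def handle(self, button_pressed: bool, analog: Sequence[float]) -> Optional[str]:
        # Recognition runs only while the soldier holds the button.
        if not button_pressed:
            return None
        samples = quantize(analog)
        command = self.recognizer(samples)
        # A rejected or unknown command produces no response.
        if command is None or command not in self.actions:
            return None
        return self.actions[command]()


def toy_recognizer(samples: List[int]) -> Optional[str]:
    """Placeholder: treats any non-silent input as one fixed command."""
    return "radio on" if any(samples) else None


# Assumed example command table; the real SVC vocabulary is not given here.
svc = VoiceControl(toy_recognizer, {"radio on": lambda: "radio enabled"})
```

Gating recognition on the button press means the recognizer never has to decide on its own when a command starts, which is one way the user interface can reduce the burden on the back-end classifier.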
The successful deployment of speech recognition is credited to the systems approach taken in the design and implementation of SVC. Each component, from the user interface to the back-end classifier, is designed to work together in order to maximize performance for this application and for the target users.
Understanding of the users’ environment and true use scenarios is gained through extensive user experience studies and actual involvement in live-fire (and other) exercises. Based on the collected data, one extracts detailed user interface specifications as well as the expected environmental conditions in which the speech recognition algorithms must operate.