$$
\begin{aligned}
&= \arg\max_C \left[\prod_{t=1}^{T} P(x_t \mid x_1^{t-1}, c_1^T)\right] \cdot \left[\prod_{t=1}^{T} P(c_t \mid c_1, c_2, \ldots, c_{t-1})\right] \\
&\approx \arg\max_C \left[\prod_{t=1}^{T} P(x_t \mid c_t)\right] \cdot \left[\prod_{t=1}^{T} P(c_t \mid c_{t-1})\right]
\end{aligned}
\tag{1}
$$

with $X = x_1^T = \{x_1, x_2, \ldots, x_T\}$ being the input feature vector sequence (frames) calculated from the observed waveform (in practical systems, around 100 40-dimensional vectors per second of input speech), with $C = c_1^T = \{c_1, c_2, \ldots, c_T\}$ being any valid symbol sequence, and with $C^*$ being the recognized symbol sequence with the highest probability among all possible sequences. In word recognition, the valid symbol classes $c_t$ are the words listed in a pronunciation dictionary, which contains all words to be recognized as phoneme sequences; in phoneme recognition, the valid symbol classes $c_t$ are all possible phonemes, of which there are usually around 50, depending on the language to be recognized.
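
The maximization over the product of the terms P(x_t | c_t) and P(c_t | c_{t-1}) can be carried out with dynamic programming (the Viterbi algorithm) rather than by enumerating all symbol sequences. The following is a minimal sketch under stated assumptions, not the system described in the text: the function name `viterbi`, the table layout, and the toy numbers are illustrative, and the first transition term is replaced by an assumed initial distribution P(c_1).

```python
import math


def viterbi(log_emit, log_trans, log_init):
    """Return the most probable class sequence and its log score.

    log_emit[t][c]  : log P(x_t | c_t = c)           (frame likelihoods)
    log_trans[p][c] : log P(c_t = c | c_{t-1} = p)   (class transitions)
    log_init[c]     : log P(c_1 = c)                 (assumed initial term)
    """
    num_frames = len(log_emit)
    num_classes = len(log_init)

    # Best log score of any partial sequence ending in class c at frame 0.
    score = [log_init[c] + log_emit[0][c] for c in range(num_classes)]
    back = []  # back[t-1][c] = best predecessor of class c at frame t

    for t in range(1, num_frames):
        new_score, pointers = [], []
        for c in range(num_classes):
            best_prev = max(range(num_classes),
                            key=lambda p: score[p] + log_trans[p][c])
            new_score.append(score[best_prev] + log_trans[best_prev][c]
                             + log_emit[t][c])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final class to recover the full sequence.
    last = max(range(num_classes), key=lambda c: score[c])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


def to_log(table):
    """Convert a nested list of probabilities to log probabilities."""
    return [[math.log(p) for p in row] for row in table]


# Toy problem (made-up numbers): 2 classes, 3 frames.
emit = to_log([[0.9, 0.1], [0.6, 0.4], [0.1, 0.9]])
trans = to_log([[0.8, 0.2], [0.2, 0.8]])
init = [math.log(0.5), math.log(0.5)]

best_path, best_log_score = viterbi(emit, trans, init)
print(best_path, math.exp(best_log_score))
```

Working in log probabilities turns the products into sums, which avoids numerical underflow when T is large; the cost is proportional to T times the square of the number of classes.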

These systems are often referred to as hybrid systems, because NNs are used in combination with other techniques such as Hidden Markov Models (HMMs). Neural networks are not the only way to approach the subproblem discussed above, and in speech recognition they have to compete with other techniques, for example unconditional mixture density estimation with Gaussian kernels. Although most current state-of-the-art speech recognition systems do not use neural networks, NN-based systems have a number of advantages: (1) for similar recognition rates, NN systems are often faster than systems using the traditional techniques, often by a factor of 2 to 5; (2) for similar recognition rates, NN systems use fewer model parameters because of implicit parameter sharing, often by a factor of 5 to 10, which in turn lowers memory requirements; and (3) because of their generally very regular structure, NNs are believed to be easier to build in hardware than other model types.

