Speaker recognition techniques can identify whether a speaker belongs to a set of known people.
Speaker recognition is the computational problem of establishing a speaker's identity from the characteristics of their voice. It differs from speech recognition, where the goal is to identify the words being spoken. One application of speaker recognition is building security, where a door opens only when a given person speaks into the microphone. Several methods can be used to accomplish this task.
Frequency Estimation
The spoken signal contains an unknown noise component, such as background noise and noise from the audio equipment. Frequency estimation methods work in three steps: they estimate the noise component, using techniques such as solving for eigenvectors (a computation from linear algebra that is important in physics and engineering); they subtract the noise from the input to obtain an approximation of the signal of interest; and they decompose that signal into a sum of complex frequency components. The most important point about this method is that the noise-free voice of a given speaker is reduced to a more manageable representation: the voice's intensity at a few frequency components, namely the most intense ones.
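The final reduction step, keeping only the most intense frequency components, can be sketched in Python as follows. This is a minimal illustration and not a full frequency estimation method: it assumes the noise has already been removed, uses a naive discrete Fourier transform in place of any eigenvector-based estimator, and the function names and test signal are invented for the example.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitudes of the first N//2 DFT bins (naive O(N^2) transform)."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2):
        acc = 0j
        for t, x in enumerate(samples):
            acc += x * cmath.exp(-2j * cmath.pi * k * t / n)
        # Scale so that a sine of amplitude A shows up with magnitude A.
        scale = 1 / n if k == 0 else 2 / n
        magnitudes.append(abs(acc) * scale)
    return magnitudes

def strongest_components(samples, sample_rate, count):
    """Return the `count` most intense (frequency_hz, magnitude) pairs."""
    n = len(samples)
    magnitudes = dft_magnitudes(samples)
    ranked = sorted(range(len(magnitudes)),
                    key=lambda k: magnitudes[k], reverse=True)
    return [(k * sample_rate / n, magnitudes[k]) for k in ranked[:count]]

# Hypothetical "voice": two pure tones at 50 Hz and 120 Hz, sampled at 1 kHz.
sample_rate = 1000
samples = [
    math.sin(2 * math.pi * 50 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 120 * t / sample_rate)
    for t in range(200)
]
features = strongest_components(samples, sample_rate, 2)
```

Here `features` holds one (frequency, intensity) pair per retained component, which is the kind of compact representation the paragraph above describes.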
Hidden Markov Models
A hidden Markov model is always in one of a set of states, but the current state is not visible to the observer. The model repeatedly makes transitions from the current state to the next, at rates and with probabilities determined by the model's parameters. When making a transition, the model may emit an output with a known probability. The same output can be generated by transitions from several different states, with different probabilities. In the particular case of speaker recognition, a hidden Markov model emits outputs representing phonemes, with probabilities that depend on the model's current state and so, indirectly, on the states visited before it. A speaker uttering a sequence of phonemes (i.e., talking) corresponds to the model visiting a sequence of states and emitting outputs corresponding to the same phonemes.
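As a rough sketch of how such a model can score an utterance, the Python code below computes the probability that a model emits a given phoneme sequence, summing over every hidden state path (the forward algorithm). Several things here are assumptions made for illustration: outputs are attached to states rather than to transitions, which is a common simplification; the states, phoneme labels, and probabilities are invented; and the idea of keeping one model per known speaker and choosing the most likely one is a typical setup rather than something the text above specifies.

```python
def sequence_likelihood(observations, start, trans, emit):
    """Forward algorithm: P(observations | model) over all hidden paths."""
    states = list(start)
    # alpha[s] = probability of the outputs so far, ending in state s.
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit[s][obs] * sum(alpha[p] * trans[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical per-speaker models: (start, transition, emission) tables.
models = {
    "alice": (
        {"A": 0.6, "B": 0.4},
        {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}},
        {"A": {"ah": 0.9, "ee": 0.1}, "B": {"ah": 0.2, "ee": 0.8}},
    ),
    "bob": (
        {"A": 0.5, "B": 0.5},
        {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.5, "B": 0.5}},
        {"A": {"ah": 0.3, "ee": 0.7}, "B": {"ah": 0.4, "ee": 0.6}},
    ),
}

utterance = ["ah", "ah", "ah"]
scores = {name: sequence_likelihood(utterance, *model)
          for name, model in models.items()}
best_match = max(scores, key=scores.get)
```

For long utterances the products become very small, so practical implementations usually work with log probabilities or rescale `alpha` at each step.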
Pattern Recognition
This technique, among the most complex used for speaker recognition, compares two voice streams: the one spoken by the authenticated speaker while training the system, and the one spoken by the unknown speaker who is attempting to gain access. The speaker utters the same words when training the system and, later, when trying to prove their identity. The computer aligns the training sound stream with the one just obtained, to account for small variations in rhythm and for delays in beginning to speak. Then the computer splits each of the two streams into a sequence of frames and computes, for each pair of aligned frames, the probability that both were spoken by the same speaker. It does so by running each pair through a multilayer perceptron, a particular type of neural network trained for this task.
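The align-then-compare procedure can be sketched in Python as follows. The text does not name an alignment method, so dynamic time warping is assumed here. Each frame is reduced to a single number for brevity, whereas a real system would use feature vectors. The function `frame_match_probability` is a hypothetical placeholder, not a trained multilayer perceptron; it only marks where the network's output would be used.

```python
import math

def align(reference, attempt):
    """Dynamic time warping; returns aligned (reference, attempt) index pairs."""
    n, m = len(reference), len(attempt)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            distance = abs(reference[i - 1] - attempt[j - 1])
            cost[i][j] = distance + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    # Walk back from the end along the cheapest predecessors.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path

def frame_match_probability(frame_a, frame_b):
    """Placeholder for the trained network: closer frames score higher."""
    return math.exp(-abs(frame_a - frame_b))

def same_speaker_score(reference, attempt):
    """Average per-pair score over the aligned frames."""
    pairs = align(reference, attempt)
    return sum(
        frame_match_probability(reference[i], attempt[j]) for i, j in pairs
    ) / len(pairs)

# Hypothetical frame features: the attempt starts one frame late.
training = [0.0, 1.0, 2.0, 1.0]
delayed = [0.0, 0.0, 1.0, 2.0, 1.0]
impostor = [2.0, 2.0, 0.0, 0.0, 2.0]

alignment = align(training, delayed)
genuine_score = same_speaker_score(training, delayed)
impostor_score = same_speaker_score(training, impostor)
```

In this toy data the delayed attempt aligns perfectly with the training stream despite the late start, so its score is higher than the impostor's. An access decision would then compare the score against a threshold chosen for the application.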