The vulnerability of voice biometrics to spoofing attacks is now widely acknowledged. If exposed, these vulnerabilities not only threaten reliability but also have the potential to undermine user confidence and hence hinder future exploitation of voice biometrics technology. It is thus critical that countermeasures against spoofing are integrated into the OCTAVE platform or trusted biometric authentication service (TBAS). A possible taxonomy of spoofing attacks and their characteristics is shown in the following schema.
The inclusion of countermeasures maintains the high performance of a voice biometrics system under malicious attack, as the performance comparison demonstrates. With the introduction of advanced countermeasure modules, OCTAVE aims to almost completely eliminate the degradation caused by malicious attacks, bringing the voice biometrics system to a state-of-the-art level of performance. The following figure compares the performance of a voice biometrics system under attack with no countermeasures (red line), with standard countermeasures (blue line) and with advanced countermeasures (green line).
In the OCTAVE platform, spoofing detection is performed in parallel with automatic speaker verification. Innovative speech features based on perceptual characteristics have been introduced, with very promising results. Specific algorithms have been considered for the different spoofing technologies.
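As an illustration of running spoofing detection alongside speaker verification, the following minimal sketch scores an utterance with both subsystems concurrently and accepts it only if both agree. It is not the OCTAVE implementation: `asv_scorer`, `cm_scorer`, the zero thresholds and the "accept only if both pass" rule are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Decision:
    accepted: bool
    asv_score: float
    cm_score: float


def verify(utterance, asv_scorer, cm_scorer, asv_threshold=0.0, cm_threshold=0.0):
    """Score an utterance with ASV and a spoofing countermeasure side by side.

    asv_scorer and cm_scorer are hypothetical callables mapping an utterance
    to a score (higher = more likely the target speaker / more likely genuine).
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        asv_future = pool.submit(asv_scorer, utterance)
        cm_future = pool.submit(cm_scorer, utterance)
        asv_score = asv_future.result()
        cm_score = cm_future.result()
    # Assumed decision rule: the speaker must match AND the speech must look genuine.
    accepted = asv_score >= asv_threshold and cm_score >= cm_threshold
    return Decision(accepted, asv_score, cm_score)


# Usage with stand-in scorers: a good speaker match that the countermeasure rejects.
replayed = verify("utterance.wav", lambda u: 1.5, lambda u: -0.3)
genuine = verify("utterance.wav", lambda u: 1.5, lambda u: 0.3)
```

With these stand-in scorers, the first trial is rejected despite its high speaker score, which is the behaviour a countermeasure is meant to add.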
In OCTAVE we introduce the hybrid mode of operation, which combines the three modes listed below. Specifically, the user provides a fixed-passphrase, a text-dependent and a text-independent voice input. Each input is processed by mode-dependent speaker acoustic models, and the corresponding scores are then fused by a machine learning model.
Voice biometrics interfaces have three main modes of operation:
- In the fixed-passphrase mode, the user is asked to say a fixed passphrase or password, which they have to know in advance and remember. The speaker acoustic models used in this mode are trained with enrolment recordings of the same fixed passphrase. This mode of operation offers high speaker verification accuracy; however, it is vulnerable to spoofing attacks such as audio replay.
- In the text-dependent mode, the user is asked to read a prompted message, which is usually selected at random from a list of utterances. The speaker acoustic models used in this mode are trained with enrolment recordings of the same (i.e. text-dependent) utterances. This mode of operation offers good speaker verification accuracy and is less vulnerable to spoofing attacks.
- In the text-independent mode, the user is asked to read a prompted message, which can be produced by a random word sequence generator. The speaker acoustic models used in this mode are trained with enrolment recordings of different (i.e. text-independent) utterances. This mode of operation offers lower speaker verification accuracy; however, it is robust to spoofing attacks such as synthetic speech and voice conversion (audio replay attacks are practically not applicable in this case).
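The score fusion step of the hybrid mode can be sketched as follows. The text only says that the per-mode scores are fused by a machine learning model; logistic regression is assumed here as one common choice, and the function names, learning rate and toy scores are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_fusion(scores, labels, lr=0.1, epochs=2000):
    """Fit logistic-regression fusion weights by batch gradient descent.

    scores: list of [fixed_passphrase, text_dependent, text_independent] scores
    labels: 1 for a target trial, 0 for a non-target trial
    """
    n_modes = len(scores[0])
    weights = [0.0] * n_modes
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_modes
        grad_b = 0.0
        for x, y in zip(scores, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / len(scores) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(scores)
    return weights, bias


def fuse(score_vector, weights, bias):
    """Return the fused probability that the trial comes from the target speaker."""
    return sigmoid(bias + sum(w * s for w, s in zip(weights, score_vector)))


# Toy development scores (made up for illustration): two target, two non-target trials.
dev_scores = [[2.0, 1.5, 1.0], [1.8, 1.2, 0.6], [-1.5, -1.0, -0.5], [-2.0, -0.8, -0.2]]
dev_labels = [1, 1, 0, 0]
weights, bias = train_fusion(dev_scores, dev_labels)
fused = fuse([1.9, 1.3, 0.8], weights, bias)
```

A learned fusion lets the system weight the more accurate fixed-passphrase score against the more spoofing-robust text-independent score, rather than fixing that trade-off by hand.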
One of the main purposes of the project is to provide algorithmic modules that enhance the performance of the OCTAVE platform TBAS in real-world, often adverse, environments where noise and other distortions are likely to be encountered.
The main focus has been on front-end processing: robust voice activity detection (VAD), robust feature extraction, speech enhancement and noise characterization. In addition, model-domain acoustic normalization, score normalization and data collection using a throat microphone have been investigated. To evaluate the performance of these different approaches, a code framework is used which consists of software modules, including code and wrapper scripts.
A GMM-UBM is chosen as the back-end automatic speaker verification system, which is used to evaluate the performance of each module.
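For readers unfamiliar with GMM-UBM scoring, the sketch below computes the usual verification score: the average per-frame log-likelihood ratio between a speaker model and the universal background model (UBM). It assumes diagonal-covariance Gaussians and uses tiny hand-made models; the actual features, model sizes and adaptation procedure of the OCTAVE back-end are not given in the text.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """gmm is a list of (weight, mean, var) components; uses log-sum-exp."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def llr_score(frames, speaker_gmm, ubm):
    """Average log-likelihood ratio over frames (higher = more like the speaker)."""
    total = sum(
        gmm_log_likelihood(f, speaker_gmm) - gmm_log_likelihood(f, ubm)
        for f in frames
    )
    return total / len(frames)


# Toy 2-component models: the speaker model is the UBM with one mean shifted,
# mimicking mean-only adaptation (values are made up for illustration).
ubm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [3.0, 3.0], [1.0, 1.0])]
speaker = [(0.5, [0.5, 0.5], [1.0, 1.0]), (0.5, [3.0, 3.0], [1.0, 1.0])]
target_score = llr_score([[0.5, 0.5], [0.6, 0.4]], speaker, ubm)
impostor_score = llr_score([[-0.5, -0.5], [-0.4, -0.6]], speaker, ubm)
```

Frames close to the adapted mean yield a positive score and frames on the far side of the UBM mean yield a negative one; the score is then compared with a threshold to accept or reject the claimed identity.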
A large array of noise-robustness algorithms and an optimised end-to-end system have been evaluated using the standard RSR2015 database, with a variety of manually added noise types in addition to a speech codec. Several speech enhancement algorithms have also been evaluated through subjective listening tests.
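One common way to add noise to clean recordings for such an evaluation is to scale the noise to a target signal-to-noise ratio (SNR) before mixing. The sketch below shows that idea only; the text does not state how noise was added to RSR2015, so the SNR-based mixing and the function name `mix_at_snr` are assumptions.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech power / noise power equals snr_db, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise is silent")
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


# Usage with synthetic signals standing in for a clean utterance and a noise clip.
speech = [math.sin(2.0 * math.pi * i / 20.0) for i in range(400)]
noise = [((i * 37) % 11 - 5) / 5.0 for i in range(400)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

Repeating the mix over several noise types and SNR levels gives a controlled set of test conditions against which each robustness module can be compared.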
The research data acquisition equipment merges the audio and throat microphone signals into a stereo signal in which both inputs are always kept in sync. This improves speaker verification in noisy environments, because verification is based on two input streams.
The audio signal of a voice utterance recorded with the normal audio microphone (top) and the throat microphone (bottom).
A throat microphone can be used for voice activity detection, which also makes it possible to remove ambient noise from the audio signal.
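A minimal sketch of this idea, assuming a synchronised two-channel recording: frame energy is measured on the throat channel, which picks up little ambient noise, and only the audio frames marked as speech are kept. The energy-threshold detector, the 160-sample frame length and the -40 dB threshold are illustrative assumptions, not the methods developed in the project.

```python
import math


def frame_energies_db(signal, frame_len):
    """Mean power of each non-overlapping frame, in dB."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start : start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def throat_vad(throat, frame_len=160, threshold_db=-40.0):
    """Mark a frame as speech when the throat channel energy exceeds the threshold."""
    return [e > threshold_db for e in frame_energies_db(throat, frame_len)]


def keep_speech(audio, throat, frame_len=160, threshold_db=-40.0):
    """Keep only the audio frames where the throat channel shows speech."""
    kept = []
    for i, active in enumerate(throat_vad(throat, frame_len, threshold_db)):
        if active:
            kept.extend(audio[i * frame_len : (i + 1) * frame_len])
    return kept


# Usage: the throat channel is silent, then active, then silent again.
silence = [0.0] * 160
voiced = [0.5 * math.sin(2.0 * math.pi * i / 16.0) for i in range(160)]
throat = silence + voiced + silence
audio = [float(i) for i in range(480)]  # stand-in for the noisy audio channel
speech_only = keep_speech(audio, throat)
```

Because the decision is taken on the throat channel, noise-only stretches of the audio channel are discarded even when the ambient noise there is loud enough to fool an energy detector applied to the audio microphone alone.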
APLcomp provided research equipment to AAU and UEF, who used it to study and develop methods for throat-microphone-based speaker verification.
This research showed that dual-microphone technology offers considerable potential for improvement.
Building on this experience, APLcomp developed a pre-commercial hardware product called DualMic, which integrates the voice capture tasks on a single board. It can use commodity-quality audio and throat microphones to output high-quality stereo voice.
DualMic can be used for speaker verification and possibly for other speech enhancement needs.
Research data acquisition equipment (left) and single-board DualMic prototype (right).