This document accompanies the program code of the OCTAVE deliverable D11 ‘Single mode biometric engines’. Here, an engine is understood as a collection of pre-configured wrappers combined with existing open-source speaker verification systems. The main intention of D11 is to provide a simple baseline recognizer to enable the study of relevant data engineering questions in later stages of OCTAVE. By a ‘simple baseline’ we mean a system that (1) requires as little additional hyper-parameter training data or parameter tuning as possible, and (2) is representative of the state of the art for short-duration speech utterances, the core of the OCTAVE use cases. The adopted technical solution that fulfils these constraints is a standard Gaussian mixture model – universal background model (GMM-UBM) system, whose configuration mainly involves selecting the UBM training data and optimizing the number of Gaussians. It is known, however, that GMM-UBM systems are sensitive to changes in channel, noise and intersession variability, calling for enhancements to the feature extraction part. To this end, the provided package contains a few alternative feature extraction configurations whose benefits are demonstrated by experiments.

The engine is designed as a complete package covering feature extraction, UBM training, speaker adaptation, scoring and error rate computation. It contains two alternative reference implementations of the back-end processing modules, based on two popular open-source speaker recognition toolkits: the MSR Identity Toolbox and ALIZE. OCTAVE partners can adopt either baseline system. The engine supports Windows and Linux environments and has been independently set up and executed at all sites contributing to D11.

This document provides a tentative set of results on the recent RSR2015 corpus, intended for benchmarking text-dependent automatic speaker verification. In particular, we provide training and trial lists (definitions of the enrolment–test trial pairs) to simulate system evaluation under three different configurations of speech content: (a) fixed pass-phrase, (b) text-prompted phrases, and (c) text-independent operation. The first case uses a fixed phrase shared by all users; the second refers to a scenario in which the system prompts a randomly selected phrase from a closed set of pass-phrases; the last is essentially text-independent, with arbitrary enrolment and test phrases.
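To illustrate the back-end processing described above (UBM training, speaker adaptation, log-likelihood-ratio scoring and error rate computation), the following Python sketch outlines a minimal GMM-UBM verification flow. It is not the D11 engine itself, which is built on the MSR Identity Toolbox and ALIZE; the library (scikit-learn), the hyper-parameters (64 Gaussians, relevance factor 16) and the synthetic data in the usage example are illustrative assumptions only.

    # Minimal GMM-UBM sketch: UBM training, MAP mean adaptation, LLR scoring, EER.
    # Illustrative only; hyper-parameters and data below are placeholders.
    import numpy as np
    from sklearn.mixture import GaussianMixture


    def train_ubm(features, n_components=64, seed=0):
        """Fit a diagonal-covariance UBM on pooled background frames (frames x dims)."""
        ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              max_iter=200, random_state=seed)
        ubm.fit(features)
        return ubm


    def map_adapt_means(ubm, enrol_features, relevance=16.0):
        """Classical MAP adaptation of the UBM means to one speaker's enrolment data."""
        resp = ubm.predict_proba(enrol_features)          # (frames, components)
        n_k = resp.sum(axis=0) + 1e-10                    # soft frame counts per component
        e_k = resp.T @ enrol_features / n_k[:, None]      # per-component data means
        alpha = (n_k / (n_k + relevance))[:, None]        # adaptation coefficients
        speaker = GaussianMixture(n_components=ubm.n_components, covariance_type="diag")
        speaker.weights_ = ubm.weights_                   # weights and covariances kept from UBM
        speaker.means_ = alpha * e_k + (1.0 - alpha) * ubm.means_
        speaker.covariances_ = ubm.covariances_
        speaker.precisions_cholesky_ = ubm.precisions_cholesky_
        return speaker


    def llr_score(speaker_model, ubm, test_features):
        """Average per-frame log-likelihood ratio of the speaker model against the UBM."""
        return float(np.mean(speaker_model.score_samples(test_features)
                             - ubm.score_samples(test_features)))


    def equal_error_rate(target_scores, nontarget_scores):
        """Approximate EER: threshold where false-accept and false-reject rates cross."""
        target_scores = np.asarray(target_scores)
        nontarget_scores = np.asarray(nontarget_scores)
        thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
        frr = np.array([np.mean(target_scores < t) for t in thresholds])
        far = np.array([np.mean(nontarget_scores >= t) for t in thresholds])
        idx = np.argmin(np.abs(far - frr))
        return 0.5 * (far[idx] + frr[idx])


    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        background = rng.normal(size=(5000, 20))          # stand-in for pooled feature frames
        ubm = train_ubm(background)
        speaker = map_adapt_means(ubm, rng.normal(0.5, 1.0, size=(300, 20)))
        print(llr_score(speaker, ubm, rng.normal(0.5, 1.0, size=(200, 20))))

In the real engine these steps are performed by the respective toolkit back-ends on the feature files produced by the feature extraction module; the sketch only shows the order and role of the operations.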

Source: WP 4 Hybrid Voice Biometrics

Dissemination level: Confidential
