The availability of spoken-language datasets has long been crucial to progress in speech-processing research and technology. With this in mind, we have designed and implemented a unified database in which the metadata of virtually any speech dataset can be represented. The project has defined and normalised a general format for the metadata. Open-source tools have been used to import specific datasets into the database, and tools for managing the data have also been developed (e.g. automatic transcription and quality measurement). In this way, speech signals from a specific dataset can be easily retrieved. The unified database can thus be seen as a “corpus of corpora”.

corpus of corpora