The impact of mismatched recordings on an automatic-speaker-recognition system and human listeners

Nechanský, Tomáš; Bořil, Tomáš; Houzar, Alžběta; Skarnitzl, Radek

AUC PHILOLOGICA

AUC Philologica (Acta Universitatis Carolinae Philologica) is an academic journal published by Charles University. It publishes scholarly articles in a large number of disciplines (English, German, Greek and Latin, Oriental, Romance and Slavonic studies, as well as in phonetics and translation studies), both on linguistic and on literary and cultural topics. Apart from articles it publishes reviews of new academic books or special issues of academic journals.

The journal is indexed in CEEOL, DOAJ, EBSCO, and ERIH PLUS.

AUC PHILOLOGICA, Vol 2022 No 1 (2022), 11–22

DOI: https://doi.org/10.14712/24646830.2022.25
published online: 17. 01. 2023

abstract

So-called ‘mismatch’ is a factor that experts in forensic voice comparison encounter regularly. We therefore explored to what extent the ability of an automatic-speaker-recognition system and of earwitnesses to identify speakers is affected when recordings are acquired in different languages and at different times. One hundred voices in a database of 300 recordings (100 speakers, each recorded in three mutually mismatched sessions) were compared using the automatic-speaker-recognition software VOCALISE, based on i-vectors and x-vectors, and by 39 respondents in simulated voice parades. The automatic and perceptual approaches appear to have yielded similar results: the less complex the mismatch type, the more successful the identification. The results point to the superiority of the x-vector approach and to varying identification abilities among listeners.

keywords: forensic voice comparison; temporal mismatch; language mismatch; automatic speaker recognition; voice parade

The impact of mismatched recordings on an automatic-speaker-recognition system and human listeners is licensed under a Creative Commons Attribution 4.0 International License.

ISSN: 0567-8269
E-ISSN: 2464-6830
