AUC PHILOLOGICA

AUC Philologica (Acta Universitatis Carolinae Philologica) is an academic journal published by Charles University. It publishes scholarly articles across a wide range of disciplines (English, German, Greek and Latin, Oriental, Romance and Slavonic studies, as well as phonetics and translation studies), on linguistic as well as literary and cultural topics. In addition to articles, it publishes reviews of new academic books and special issues of academic journals.

The journal is indexed in CEEOL, DOAJ, EBSCO, and ERIH PLUS.

AUC PHILOLOGICA, Vol. 2025, No. 3 (2025), 43–60

Article

Customising Czech Phonetic Alignment using HuBERT and manual segmentation

Adléta Hanžlová, Václav Hanžl

DOI: https://doi.org/10.14712/24646830.2025.20
published online: 26. 01. 2026

Abstract

This paper presents Prak, a forced alignment tool developed for Czech, with a focus on transparent modular design and phonetic accuracy. In addition to a rule-based pronunciation module and exception handling, Prak introduces a novel application of non-deterministic, backward-processing FSTs to model complex regressive assimilation processes in Czech consonant clusters. We further describe the integration of a HuBERT-based transformer model and training that incorporates extensive manually time-aligned data, which enhance phone classification accuracy while maintaining ease of installation and use. Evaluation against a manually aligned test corpus demonstrates that the enhanced model significantly outperforms both our earlier Prak-CV model and the long-established forced alignment baseline. The new model reduces major boundary errors and mismatches, bringing alignment accuracy closer to manual phonetic segmentation standards for Czech. We emphasize both methodological transparency and practical usability, aiming to support phoneticians working with Czech as well as developers interested in extending the tool to other languages.
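
The backward processing mentioned above mirrors the direction of the phenomenon itself: Czech voicing assimilation is regressive, so the voicing of an obstruent cluster is decided by its rightmost obstruent (kde → [gde], v tom → [ftom]), and word-final obstruents devoice (led → [let]). The snippet below is a minimal Python sketch of that right-to-left logic only; it is not Prak's non-deterministic FST implementation, and the phone symbols, the voiced/voiceless pairing and the handling of /v/ are simplified illustrative assumptions.

# Minimal sketch (NOT Prak's FST): regressive voicing assimilation as a
# right-to-left pass over a phone string, with a simplified phone inventory.
VOICED_TO_VOICELESS = {"b": "p", "d": "t", "g": "k", "z": "s", "v": "f"}
VOICELESS_TO_VOICED = {v: k for k, v in VOICED_TO_VOICELESS.items()}
OBSTRUENTS = set(VOICED_TO_VOICELESS) | set(VOICELESS_TO_VOICED)

def assimilate(phones):
    """Right-to-left pass: each obstruent copies the voicing of the obstruent
    immediately to its right; word-final position counts as a devoicing context."""
    out = list(phones)
    context = "final"                      # right edge devoices word-final obstruents
    for i in range(len(out) - 1, -1, -1):
        p = out[i]
        if p in OBSTRUENTS:
            if context == "voiced" and p in VOICELESS_TO_VOICED:
                out[i] = VOICELESS_TO_VOICED[p]
            elif context in ("voiceless", "final") and p in VOICED_TO_VOICELESS:
                out[i] = VOICED_TO_VOICELESS[p]
            if p == "v":
                context = "none"           # /v/ undergoes but does not trigger assimilation
            else:
                context = "voiced" if out[i] in VOICED_TO_VOICELESS else "voiceless"
        else:
            context = "none"               # vowels and sonorants reset the assimilation context
    return out

print(assimilate(list("kde")))   # ['g', 'd', 'e']       kde   -> [gde]
print(assimilate(list("vtom")))  # ['f', 't', 'o', 'm']  v tom -> [ftom]
print(assimilate(list("led")))   # ['l', 'e', 't']       led   -> [let]

A pass like this is deterministic; the FSTs described in the paper additionally admit multiple pronunciation variants (non-determinism) and interact with the exception handling of the pronunciation module, which a single fixed sweep of this kind cannot express.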

keywords: forced alignment; phonetic segmentation; Czech; HuBERT; Prak; Praat

References (32)

1. Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., & Weber, G. (2019). Common Voice: A Massively-multilingual speech corpus. arXiv Preprint arXiv:1912.06670.

2. Bavarian Archive for Speech Signals. (2018). Terms of Usage. Version 4.0. BAS Web Services. https://clarin.phonetik.uni-muenchen.de/BASWebServices/help/termsOfUsage

3. Boersma, P., & Weenink, D. (2023). Praat: Doing phonetics by computer. [Computer program]. Version 6.3.14. http://www.praat.org

4. Boersma, P., Weenink, D., & collaborators. (2023). Praat: Doing phonetics by computer [Source code]. https://github.com/praat/praat

5. Boigne, J. (2021). HuBERT: How to Apply BERT to Speech, Visually Explained. https://jonathanbgn.com/2021/10/30/hubert-visually-explained.html

6. Chicco, D. (2017). Ten quick tips for machine learning in computational biology. BioData Mining, 10(1), 35.

7. Hanžl, V. (2023). Details of the Montreal FA. Prak Wiki. https://github.com/vaclavhanzl/prak/wiki/Details-of-the-Montreal-FA

8. Hanžl, V., & Hanžlová, A. (2023). Prak: An automatic phonetic alignment tool for Czech. In R. Skarnitzl & J. Volín (eds.), Proceedings of the 20th International Congress of Phonetic Sciences (pp. 3121-3125). Guarant International.

9. Hanžl, V., & Hanžlová, A. (2025). prak: Czech phonetic alignment tool. https://github.com/vaclavhanzl/prak

10. Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451-3460.

11. Kisler, T., Reichel, U., & Schiel, F. (2017). Multilingual processing of speech via web services. Computer Speech & Language, 45, 326-347.

12. Kuldanová, P., Hebal-Jezierska, M., & Petráš, P. (2022). Orthoepy of West Slavonic Languages (Czech, Slovak and Polish). Ostravská univerzita.

13. Lin, T., Wang, Y., Liu, X., & Qiu, X. (2022). A survey of transformers. AI Open, 3, 111-132.

14. Machač, P., & Skarnitzl, R. (2009). Principles of phonetic segmentation. Epocha.

15. McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. Proceedings of Interspeech 2017, 498-502.

16. Opensource.org. (2025). The MIT License. Open Source Initiative. https://opensource.org/licenses/MIT

17. Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech: An ASR corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206-5210.

18. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., … Chintala, S. (2019). PyTorch: An Imperative style, high-performance deep learning library. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 8026-8037.

19. Patc, Z., Mizera, P., & Pollak, P. (2015). Phonetic segmentation using KALDI and reduced pronunciation detection in casual Czech speech. Text, Speech, and Dialogue, 433-441.

20. Pavlík, R. (2009). A Typology of assimilations. SKASE Journal of Theoretical Linguistics, 6(1), 2-26.

21. Pettarin, A. (2018). A collection of links and notes on forced alignment tools. https://github.com/pettarin/forced-alignment-tools

22. Pollák, P., Volín, J., & Skarnitzl, R. (2005). Influence of HMM's parameters on the accuracy of phone segmentation-evaluation baseline. Proceedings of the 16th Conference Joined with the 15th Czech-German Workshop 'Speech Processing', 1, 302-309.

23. Pollák, P., Volín, J., & Skarnitzl, R. (2007). HMM-based phonetic segmentation in Praat environment. The XII International Conference Speech and Computer - SPECOM, 537-541.

24. Schiel, F. (1999). Automatic phonetic transcription of non-prompted speech. Proceedings of the XIVth International Congress of Phonetic Sciences, 607-610.

25. Skarnitzl, R. (2011). Znělostní kontrast nejen v češtině [Voicing contrast not only in Czech]. Epocha.

26. Torchaudio Contributors. (2024). HUBERT_BASE. https://docs.pytorch.org/audio/2.4.0/generated/torchaudio.pipelines.HUBERT_BASE.html

27. Van Rossum, G., & Drake, F. L. (2009). Python 3 Reference Manual. CreateSpace.

28. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

29. Volín, J. (2012). Jak se v Čechách 'rázuje' [How the glottal stop is used in Bohemia]. Naše řeč, 95(1), 51-54.

30. Volín, J., & Skarnitzl, R. (2018). Segmentální plán češtiny [The segmental plan of Czech]. Univerzita Karlova, Filozofická fakulta.

31. Volín, J., & Skarnitzl, R. (2022). The impact of prosodic position on post-stress rise in three genres of Czech. Speech Prosody 2022, 505-509.

32. Yang, Y.-Y., Hira, M., Ni, Z., Chourdia, A., Astafurov, A., Chen, C., Yeh, C.-F., Puhrsch, C., Pollack, D., Genzel, D., Greenberg, D., Yang, E. Z., Lian, J., Mahadeokar, J., Hwang, J., Chen, J., Goldsborough, P., Roy, P., Narenthiran, S., … Shi, Y. (2022). TorchAudio: Building blocks for audio and speech processing. ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 6982-6986.

Customising Czech Phonetic Alignment using HuBERT and manual segmentation is licensed under a Creative Commons Attribution 4.0 International License.

230 × 157 mm
periodicity: 3× per year
print price: 150 CZK
ISSN: 0567-8269
E-ISSN: 2464-6830
