AUC Philologica (Acta Universitatis Carolinae Philologica) is an academic journal publishing linguistic as well as literary-historical and theoretical studies. Reviews of scholarly books and reports from the academic community form an integral part of the journal.
The journal is indexed in the CEEOL, DOAJ, EBSCO and ERIH PLUS databases.
AUC PHILOLOGICA, Vol 2025 No 3 (2025), 43–60
Article: Customising Czech Phonetic Alignment using HuBERT and manual segmentation
DOI: https://doi.org/10.14712/24646830.2025.20
published: 26. 01. 2026
Abstract
This paper presents Prak, a forced alignment tool developed for Czech, with a focus on transparent modular design and phonetic accuracy. In addition to a rule-based pronunciation module and exception handling, Prak introduces a novel application of non-deterministic, backward-processing FSTs to model complex regressive assimilation processes in Czech consonant clusters. We further describe the integration of a HuBERT-based transformer model, trained with extensive manually time-aligned data, to enhance phone classification accuracy while maintaining ease of installation and use. Evaluation against a manually aligned test corpus demonstrates that the enhanced model significantly outperforms both our earlier Prak-CV model and the long-established forced alignment baseline. The new model reduces major boundary errors and mismatches, bringing alignment accuracy closer to manual phonetic segmentation standards for Czech. We emphasize both methodological transparency and practical usability, aiming to support phoneticians working with Czech as well as developers interested in extending the tool to other languages.
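The regressive voicing assimilation mentioned in the abstract can be illustrated with a single right-to-left pass over a phone list, as in the simplified Python sketch below. This is only an illustration of the phenomenon, not Prak's actual non-deterministic FST implementation; the phone symbols, the assimilate() function and the (omitted) treatment of exceptions such as /v/ are assumptions of this sketch.

# Simplified right-to-left (regressive) voicing assimilation for Czech.
# Prak models the process with non-deterministic, backward-processing FSTs
# and covers more exceptions (e.g. /v/ undergoes but does not trigger it).

# Rough SAMPA-like voiceless/voiced obstruent pairs (incomplete).
VOICELESS_TO_VOICED = {"p": "b", "t": "d", "k": "g", "f": "v",
                       "s": "z", "S": "Z", "x": "h", "tS": "dZ"}
VOICED_TO_VOICELESS = {v: k for k, v in VOICELESS_TO_VOICED.items()}
OBSTRUENTS = set(VOICELESS_TO_VOICED) | set(VOICED_TO_VOICELESS)


def assimilate(phones):
    """Spread the voicing of the rightmost obstruent in a cluster leftwards;
    devoice an utterance-final obstruent."""
    out = list(phones)
    trigger = None  # voicing trigger seen to the right within the cluster
    for i in range(len(out) - 1, -1, -1):
        ph = out[i]
        if ph in OBSTRUENTS:
            if trigger is None:
                # Rightmost obstruent of a cluster: devoice utterance-finally.
                if i == len(out) - 1:
                    out[i] = VOICED_TO_VOICELESS.get(ph, ph)
            elif trigger in VOICED_TO_VOICELESS:   # voiced trigger to the right
                out[i] = VOICELESS_TO_VOICED.get(ph, ph)
            else:                                  # voiceless trigger to the right
                out[i] = VOICED_TO_VOICELESS.get(ph, ph)
            trigger = out[i]
        else:
            trigger = None  # vowels and sonorants block the spread
    return out


# e.g. underlying "kde" -> [g d e], "led" -> [l e t], "vchod" -> [f x o t]
print(assimilate(["k", "d", "e"]))
print(assimilate(["l", "e", "d"]))
print(assimilate(["v", "x", "o", "d"]))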
keywords: forced alignment; phonetic segmentation; Czech; HuBERT; Prak; Praat
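The HuBERT representation referred to above is available off the shelf through torchaudio (Torchaudio Contributors, 2024). The fragment below is a minimal sketch of extracting frame-level HuBERT features on which a phone classifier could be trained; it assumes a recent torchaudio with the HUBERT_BASE pipeline and a hypothetical 16 kHz mono recording "speech.wav", and it is not Prak's actual training or alignment code.

import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE        # pre-trained HuBERT base model
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("speech.wav")     # hypothetical input file
if sr != bundle.sample_rate:                     # HuBERT expects 16 kHz audio
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model(waveform)                # shape (batch, frames, 768)

# One feature vector per ~20 ms frame; a classifier trained on manually
# time-aligned data can map each frame to a phone label for alignment.
print(features.shape)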
References (32)
1. Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., & Weber, G. (2019). Common Voice: A Massively-multilingual speech corpus. arXiv Preprint arXiv:1912.06670.
2. Bavarian Archive for Speech Signals. (2018). Terms of Usage. Version 4.0. BAS Web Services. https://clarin.phonetik.uni-muenchen.de/BASWebServices/help/termsOfUsage
3. Boersma, P., & Weenink, D. (2023). Praat: Doing phonetics by computer. [Computer program]. Version 6.3.14. http://www.praat.org
4. Boersma, P., Weenink, D., & collaborators. (2023). Praat: Doing phonetics by computer [Source code]. https://github.com/praat/praat
5. Boigne, J. (2021). HuBERT: How to Apply BERT to Speech, Visually Explained. https://jonathanbgn.com/2021/10/30/hubert-visually-explained.html
6. Chicco, D. (2017). Ten quick tips for machine learning in computational biology. BioData Mining, 10(1), 35. CrossRef
7. Hanžl, V. (2023). Details of the Montreal FA. Prak Wiki. https://github.com/vaclavhanzl/prak/wiki/Details-of-the-Montreal-FA
8. Hanžl, V., & Hanžlová, A. (2023). Prak: An automatic phonetic alignment tool for Czech. In R. Skarnitzl & J. Volín (eds.), Proceedings of the 20th International Congress of Phonetic Sciences (pp. 3121-3125). Guarant International.
9. Hanžl, V., & Hanžlová, A. (2025). prak: Czech phonetic alignment tool. https://github.com/vaclavhanzl/prak
10. Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451-3460. CrossRef
11. Kisler, T., Reichel, U., & Schiel, F. (2017). Multilingual processing of speech via web services. Computer Speech & Language, 45, 326-347. CrossRef
12. Kuldanová, P., Hebal-Jezierska, M., & Petráš, P. (2022). Orthoepy of West Slavonic Languages (Czech, Slovak and Polish). Ostravská univerzita. CrossRef
13. Lin, T., Wang, Y., Liu, X., & Qiu, X. (2022). A survey of transformers. AI Open, 3, 111-132. CrossRef
14. Machač, P., & Skarnitzl, R. (2009). Principles of phonetic segmentation. Epocha.
15. McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. Proceedings of Interspeech 2017, 498-502. CrossRef
16. Opensource.org. (2025). The MIT License. Open Source Initiative. https://opensource.org/licenses/MIT
17. Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech: An ASR corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206-5210. CrossRef
18. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., … Chintala, S. (2019). PyTorch: An Imperative style, high-performance deep learning library. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 8026-8037.
19. Patc, Z., Mizera, P., & Pollak, P. (2015). Phonetic segmentation using KALDI and reduced pronunciation detection in causal Czech speech. Text, Speech, and Dialogue, 433-441. CrossRef
20. Pavlík, R. (2009). A Typology of assimilations. SKASE Journal of Theoretical Linguistics, 6(1), 2-26.
21. Pettarin, A. (2018). A collection of links and notes on forced alignment tools. https://github.com/pettarin/forced-alignment-tools
22. Pollák, P., Volín, J., & Skarnitzl, R. (2005). Influence of HMM's parameters on the accuracy of phone segmentation - evaluation baseline. Proceedings of the 16th Conference Joined with the 15th Czech-German Workshop 'Speech Processing', 1, 302-309.
23. Pollák, P., Volín, J., & Skarnitzl, R. (2007). HMM-based phonetic segmentation in Praat environment. The XII International Conference Speech and Computer - SPECOM, 537-541.
24. Schiel, F. (1999). Automatic phonetic transcription of non-prompted speech. Proceedings of the XIVth International Congress of Phonetic Sciences, 607-610.
25. Skarnitzl, R. (2011). Znělostní kontrast nejen v češtině. Epocha.
26. Torchaudio Contributors. (2024). HUBERT_BASE. https://docs.pytorch.org/audio/2.4.0/generated/torchaudio.pipelines.HUBERT_BASE.html
27. Van Rossum, G., & Drake, F. L. (2009). Python 3 Reference Manual. CreateSpace.
28. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
29. Volín, J. (2012). Jak se v Čechách 'rázuje'. Naše řeč, 95(1), 51-54.
30. Volín, J., & Skarnitzl, R. (2018). Segmentální plán češtiny. Univerzita Karlova, Filozofická fakulta.
31. Volín, J., & Skarnitzl, R. (2022). The impact of prosodic position on post-stress rise in three genres of Czech. Speech Prosody 2022, 505-509. CrossRef
32. Yang, Y.-Y., Hira, M., Ni, Z., Chourdia, A., Astafurov, A., Chen, C., Yeh, C.-F., Puhrsch, C., Pollack, D., Genzel, D., Greenberg, D., Yang, E. Z., Lian, J., Mahadeokar, J., Hwang, J., Chen, J., Goldsborough, P., Roy, P., Narenthiran, S., … Shi, Y. (2022). TorchAudio: Building blocks for audio and speech processing. ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 6982-6986. CrossRef

Customising Czech Phonetic Alignment using HuBERT and manual segmentation is licensed under a Creative Commons Attribution 4.0 International License.
230 x 157 mm
published: 3 times a year
price of a printed issue: 150 CZK
ISSN: 0567-8269
E-ISSN: 2464-6830
