STATISTICAL SOFTWARE R IN CORPUS-DRIVEN RESEARCH AND MACHINE LEARNING
DOI: https://doi.org/10.33407/itlt.v86i6.4627
Keywords:
corpus linguistics, machine learning model, linguistic classifier, statistical software R, RStudio, grammatical construction, linguistic parameter, univariate analysis of variance (ANOVA), multivariate analysis of variance (MANOVA), the Tukey test, linear discriminant analysis, methodological aspects of interdisciplinary studies
Abstract
The rapid development of computer software and network technologies has facilitated the intensive application of specialized statistical software not only in traditional information technology fields (e.g., statistics, engineering, artificial intelligence) but also in linguistics. The statistical software R is one of the most popular analytical tools for the statistical processing of large arrays of digitized language data, especially in quantitative corpus-linguistic studies conducted in Western Europe and North America. This article discusses the functionality of R, focusing on its advantages in performing complex statistical analyses of linguistic data in corpus-driven studies and in creating linguistic classifiers in machine learning. With this in mind, a three-stage strategy for the computer-statistical analysis of linguistic corpus data is elaborated: 1) processing the data and preparing them for a statistical procedure, 2) applying statistical hypothesis testing methods (MANOVA, ANOVA) and the Tukey post-hoc test, and 3) developing a linguistic classifier model and analyzing its effectiveness. The strategy is implemented on 11,000 tokens of English detached nonfinite constructions with an explicit subject extracted from the BNC-BYU corpus. The statistical analysis indicates significant differences in the realization of the factors of the parameter “Part of speech of the subject”. The analyzed linguistic data are used to build a machine learning model for classifying the given constructions. Particular attention is devoted to the methodological perspectives of interdisciplinary research in linguistics and computer science. The potential application of the case study in training undergraduate, master's, and postgraduate students of Applied Linguistics is indicated. The article provides all the statistical data and the R scripts, with comprehensive descriptions and explanations.
The concluding part of the article summarizes the results and highlights issues for further research related to popularizing the R statistical software environment and raising specialists' awareness of this statistical analysis system.
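The three-stage strategy outlined in the abstract can be illustrated with a minimal R sketch. This is not the authors' script: the data are simulated, and the column names (`construction_type`, `clause_length`, `subject_length`), the three construction labels, and the 70/30 train/test split are hypothetical choices made only for illustration. The sketch assumes the `MASS` package, which is distributed with standard R installations, for linear discriminant analysis.

```r
# Illustrative sketch only: simulated data, hypothetical variable names.
library(MASS)  # provides lda()

set.seed(42)
n <- 60  # simulated tokens per (hypothetical) construction type

# Stage 1: assemble and prepare the data for the statistical procedures.
constructions <- data.frame(
  construction_type = factor(rep(c("type_a", "type_b", "type_c"), each = n)),
  clause_length  = rnorm(3 * n, mean = rep(c(6, 8, 10), each = n), sd = 1.5),
  subject_length = rnorm(3 * n, mean = rep(c(3, 1, 2), each = n), sd = 0.5)
)
constructions <- na.omit(constructions)  # drop incomplete rows, if any

# Stage 2: hypothesis testing with MANOVA, ANOVA, and the Tukey post-hoc test.
manova_fit <- manova(cbind(clause_length, subject_length) ~ construction_type,
                     data = constructions)
print(summary(manova_fit, test = "Pillai"))

anova_fit <- aov(clause_length ~ construction_type, data = constructions)
print(summary(anova_fit))

tukey <- TukeyHSD(anova_fit)  # pairwise comparisons between group means
print(tukey)

# Stage 3: build a linear discriminant classifier and evaluate it.
train_idx <- sample(nrow(constructions), size = round(0.7 * nrow(constructions)))
train_set <- constructions[train_idx, ]
test_set  <- constructions[-train_idx, ]

lda_fit   <- lda(construction_type ~ clause_length + subject_length,
                 data = train_set)
predicted <- predict(lda_fit, newdata = test_set)$class

# Confusion matrix and overall accuracy on the held-out tokens.
conf_matrix <- table(predicted = predicted, actual = test_set$construction_type)
accuracy    <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(conf_matrix)
print(accuracy)
```

With real corpus data, the simulated data frame would be replaced by the annotated tokens, and the choice of dependent variables and predictors would follow the linguistic parameters actually coded for each construction.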
Copyright (c) 2021 Олександр Олександрович Мосіюк, Вікторія Вікторівна Жуковська

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.