Publications

Displaying 1 - 8 of 8
  • Levshina, N. (2022). Frequency, informativity and word length: Insights from typologically diverse corpora. Entropy, 24(2): 280. doi:10.3390/e24020280.

    Abstract

    Zipf’s law of abbreviation, which posits a negative correlation between word frequency and length, is one of the most famous and robust cross-linguistic generalizations. At the same time, it has been shown that contextual informativity (average surprisal given previous context) is more strongly correlated with word length, although this tendency is not observed consistently, depending on several methodological choices. The present study examines a more diverse sample of languages than the previous studies (Arabic, Finnish, Hungarian, Indonesian, Russian, Spanish and Turkish). I use large web-based corpora from the Leipzig Corpora Collection to estimate word lengths in UTF-8 characters and in phonemes (for some of the languages), as well as word frequency, informativity given previous word and informativity given next word, applying different methods of bigrams processing. The results show different correlations between word length and the corpus-based measure for different languages. I argue that these differences can be explained by the properties of noun phrases in a language, most importantly, by the order of heads and modifiers and their relative morphological complexity, as well as by orthographic conventions

    Additional information

    datasets
  • Levshina, N., & Hawkins, J. A. (2022). Verb-argument lability and its correlations with other typological parameters. A quantitative corpus-based study. Linguistic Typology at the Crossroads, 2(1), 94-120. doi:10.6092/issn.2785-0943/13861.

    Abstract

    We investigate the correlations between A- and P-lability for verbal arguments with other typological parameters using large, syntactically annotated corpora of online news in 28 languages. To estimate how much lability is observed in a language, we measure associations between Verbs or Verb + Noun combinations and the alternating constructions in which they occur. Our correlational analyses show that high P-lability scores correlate strongly with the following parameters: little or no case marking; weaker associations between lexemes and the grammatical roles A and P; rigid order of Subject and Object; and a high proportion of verb-medial clauses (SVO). Low P-lability correlates with the presence of case marking, stronger associations between nouns and grammatical roles, relatively flexible ordering of Subject and Object, and verb-final order. As for A-lability, it is not correlated with any other parameters. A possible reason is that A-lability is a result of more universal discourse processes, such as deprofiling of the object, and also exhibits numerous lexical and semantic idiosyncrasies. The fact that P-lability is strongly correlated with other parameters can be interpreted as evidence for a more general typology of languages, in which some tend to have highly informative morphosyntactic and lexical cues, whereas others rely predominantly on contextual environment, which is possibly due to fixed word order. We also find that P-lability is more strongly correlated with the other parameters than any of these parameters are with each other, which means that it can be a very useful typological variable.
  • Levshina, N., & Lorenz, D. (2022). Communicative efficiency and the Principle of No Synonymy: Predictability effects and the variation of want to and wanna. Language and Cognition, 14(2), 249-274. doi:10.1017/langcog.2022.7.

    Abstract

    There is ample psycholinguistic evidence that speakers behave efficiently, using shorter and less effortful constructions when the meaning is more predictable, and longer and more effortful ones when it is less predictable. However, the Principle of No Synonymy requires that all formally distinct variants should also be functionally different. The question is how much two related constructions should overlap semantically and pragmatically in order to be used for the purposes of efficient communication. The case study focuses on want to + Infinitive and its reduced variant with wanna, which have different stylistic and sociolinguistic connotations. Bayesian mixed-effects regression modelling based on the spoken part of the British National Corpus reveals a very limited effect of efficiency: predictability increases the chances of the reduced variant only in fast speech. We conclude that efficient use of more and less effortful variants is restricted when two variants are associated with different registers or styles. This paper also pursues a methodological goal regarding missing values in speech corpora. We impute missing data based on the existing values. A comparison of regression models with and without imputed values reveals similar tendencies. This means that imputation is useful for dealing with missing values in corpora.

    Additional information

    supplementary materials
  • Levshina, N. (2022). Semantic maps of causation: New hybrid approaches based on corpora and grammar descriptions. Zeitschrift für Sprachwissenschaft, 41(1), 179-205. doi:10.1515/zfs-2021-2043.

    Abstract

    The present paper discusses connectivity and proximity maps of causative constructions and combines them with different types of typological data. In the first case study, I show how one can create a connectivity map based on a parallel corpus. This allows us to solve many problems, such as incomplete descriptions, inconsistent terminology and the problem of determining the semantic nodes. The second part focuses on proximity maps based on Multidimensional Scaling and compares the most important semantic distinctions, which are inferred from a parallel corpus of film subtitles and from grammar descriptions. The results suggest that corpus-based maps of tokens are more sensitive to cultural and genre-related differences in the prominence of specific causation scenarios than maps based on constructional types, which are described in reference grammars. The grammar-based maps also reveal a less clear structure, which can be due to incomplete semantic descriptions in grammars. Therefore, each approach has its shortcomings, which researchers need to be aware of.
  • Levshina, N. (2022). Corpus-based typology: Applications, challenges and some solutions. Linguistic Typology, 26(1), 129-160. doi:10.1515/lingty-2020-0118.

    Abstract

    Over the last few years, the number of corpora that can be used for language comparison has dramatically increased. The corpora are so diverse in their structure, size and annotation style, that a novice might not know where to start. The present paper charts this new and changing territory, providing a few landmarks, warning signs and safe paths. Although no corpora corpus at present can replace the traditional type of typological data based on language description in reference grammars, they corpora can help with diverse tasks, being particularly well suited for investigating probabilistic and gradient properties of languages and for discovering and interpreting cross-linguistic generalizations based on processing and communicative mechanisms. At the same time, the use of corpora for typological purposes has not only advantages and opportunities, but also numerous challenges. This paper also contains an empirical case study addressing two pertinent problems: the role of text types in language comparison and the problem of the word as a comparative concept.
  • Levshina, N. (2022). Comparing Bayesian and frequentist models of language variation: The case of help + (to) Infinitive. In O. Schützler, & J. Schlüter (Eds.), Data and methods in corpus linguistics – Comparative Approaches (pp. 224-258). Cambridge: Cambridge University Press.
  • Levshina, N. (2020). How tight is your language? A semantic typology based on Mutual Information. In K. Evang, L. Kallmeyer, R. Ehren, S. Petitjean, E. Seyffarth, & D. Seddah (Eds.), Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories (pp. 70-78). Düsseldorf, Germany: Association for Computational Linguistics. doi:10.18653/v1/2020.tlt-1.7.

    Abstract

    Languages differ in the degree of semantic flexibility of their syntactic roles. For example, Eng-
    lish and Indonesian are considered more flexible with regard to the semantics of subjects,
    whereas German and Japanese are less flexible. In Hawkins’ classification, more flexible lan-
    guages are said to have a loose fit, and less flexible ones are those that have a tight fit. This
    classification has been based on manual inspection of example sentences. The present paper
    proposes a new, quantitative approach to deriving the measures of looseness and tightness from
    corpora. We use corpora of online news from the Leipzig Corpora Collection in thirty typolog-
    ically and genealogically diverse languages and parse them syntactically with the help of the
    Universal Dependencies annotation software. Next, we compute Mutual Information scores for
    each language using the matrices of lexical lemmas and four syntactic dependencies (intransi-
    tive subjects, transitive subject, objects and obliques). The new approach allows us not only to
    reproduce the results of previous investigations, but also to extend the typology to new lan-
    guages. We also demonstrate that verb-final languages tend to have a tighter relationship be-
    tween lexemes and syntactic roles, which helps language users to recognize thematic roles early
    during comprehension.

    Additional information

    full text via ACL website
  • Levshina, N. (2020). Efficient trade-offs as explanations in functional linguistics: some problems and an alternative proposal. Revista da Abralin, 19(3), 50-78. doi:10.25189/rabralin.v19i3.1728.

    Abstract

    The notion of efficient trade-offs is frequently used in functional linguis-tics in order to explain language use and structure. In this paper I argue that this notion is more confusing than enlightening. Not every negative correlation between parameters represents a real trade-off. Moreover, trade-offs are usually reported between pairs of variables, without taking into account the role of other factors. These and other theoretical issues are illustrated in a case study of linguistic cues used in expressing “who did what to whom”: case marking, rigid word order and medial verb posi-tion. The data are taken from the Universal Dependencies corpora in 30 languages and annotated corpora of online news from the Leipzig Corpora collection. We find that not all cues are correlated negatively, which ques-tions the assumption of language as a zero-sum game. Moreover, the cor-relations between pairs of variables change when we incorporate the third variable. Finally, the relationships between the variables are not always bi-directional. The study also presents a causal model, which can serve as a more appropriate alternative to trade-offs.

Share this page