
UCI Phonotactic Calculator

Welcome to the UCI Phonotactic Calculator!

This is a work in progress! If you notice any bugs or have any comments, please contact Connor Mayer at cjmayer@uci.edu.

This is a research tool that allows users to calculate a variety of phonotactic acceptability metrics. These metrics are intended to capture how probable/acceptable a word is based on the sounds it contains and the order in which those sounds are sequenced. This is sometimes referred to as phonotactic probability, though we prefer the term acceptability because not all of the metrics employed here can be interpreted as probabilities. For example, a nonce word like [stik] 'steek' might have a relatively high phonotactic acceptability score in English even though it is not a real word, because there are many words that begin with [st], end with [ik], and so on. In Spanish, however, this word would have a low acceptability score because there are no Spanish words that begin with the sequence [st]. A sensitivity to the phonotactic constraints of one's language(s) is an important component of linguistic competence, and the various metrics computed by this tool instantiate different models of how this sensitivity is operationalized.

The general use case for this tool is as follows:

  1. Choose a training file. You can either upload your own or choose one of the default training files (see the About page for details on how these should be formatted and the Datasets page for a description of the default files). This file is intended to represent the input over which phonotactic generalizations are formed, and will typically be something like a dictionary (a large list of word types). The models used to calculate the phonotactic acceptability metrics will be fit to this data.
  2. Upload a test file. The trained models will assign scores for each metric to the words in this file. This file may duplicate data in the training file (if you are interested in the scores assigned to existing words) or not (if you are interested in the predictions the various models make about how speakers generalize to new forms).
  3. Choose which model to use. Currently the calculator supports two suites of models.
    1. Unigram/Bigram scores: this computes a suite of metrics that are based on unigram/bigram frequencies (that is, the frequencies of individual sounds and the frequencies of adjacent pairs of sounds). This includes type- and token-weighted variants of the positional unigram/bigram method from Jusczyk et al. (1994) and Vitevitch and Luce (2004), as well as type- and token-weighted variants of standard unigram/bigram probabilities.
    2. RNN Model: this computes an acceptability metric based on the recurrent neural network phonotactic model from Mayer and Nelson (2020). Unlike the unigram and bigram models, which compute probabilities based only on local dependencies, this model can capture long-distance restrictions such as vowel or consonant harmony.
    See the About page for a detailed description of how these models differ and how to interpret the scores.
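
To make the idea of a standard bigram score concrete, here is a minimal Python sketch. It is not the calculator's own code: the function names, the `#` word-boundary symbol, the toy training words, and the lack of smoothing are all assumptions made for illustration. Passing per-word `weights` gives a token-weighted variant; using raw counts as weights is likewise only an assumption here. The About page describes the actual definitions the calculator uses.

```python
import math
from collections import Counter

BOUNDARY = "#"  # hypothetical word-edge symbol, chosen for this sketch


def train_bigram(words, weights=None):
    """Count adjacent segment pairs over a list of words.

    words:   list of segment lists, e.g. [["s", "t", "i", "k"], ...]
    weights: optional per-word weights (e.g. token frequencies);
             defaults to 1 per word, i.e. type weighting.
    """
    if weights is None:
        weights = [1.0] * len(words)
    pair_counts = Counter()
    context_counts = Counter()
    for segments, weight in zip(words, weights):
        padded = [BOUNDARY] + list(segments) + [BOUNDARY]
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += weight
            context_counts[prev] += weight
    return pair_counts, context_counts


def bigram_log_prob(word, model):
    """Sum of log P(cur | prev) over the word, including both word edges.

    Returns -inf if any bigram was never seen in training (no smoothing).
    """
    pair_counts, context_counts = model
    padded = [BOUNDARY] + list(word) + [BOUNDARY]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        count = pair_counts.get((prev, cur), 0)
        if count == 0:
            return float("-inf")
        total += math.log(count / context_counts[prev])
    return total


# Toy training data (made up for illustration, one character per segment).
training_words = [list("stik"), list("stap"), list("tik"), list("pik")]
model = train_bigram(training_words)

for test_word in ["tap", "tsik"]:
    print(test_word, bigram_log_prob(list(test_word), model))
```

In this toy setup the unseen word "tap" receives a finite score because every bigram in it occurs somewhere in the training words, while "tsik" receives negative infinity because the sequence [ts] never occurs. A practical model would usually apply some form of smoothing so that unseen sequences are penalized rather than ruled out entirely.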

The UCI Phonotactic Calculator was developed by Connor Mayer (UCI), Arya Kondur (UCI), and Megha Sundara (UCLA). Please direct all inquiries to Connor Mayer (cjmayer@uci.edu).

Citing the UCI Phonotactic Calculator

If you publish work that uses the UCI Phonotactic Calculator, please cite the GitHub repository:

Mayer, C., Kondur, A., & Sundara, M. (2022). UCI Phonotactic Calculator (Version 0.1.0) [Computer software]. https://doi.org/10.5281/zenodo.7443706
