
UCI Phonotactic Calculator

About the UCI Phonotactic Calculator


Data format

The simplest way to understand the format of the input data is to look at examples on the Datasets page. Read below for more details:

  1. Both the training and the test file must be in comma-separated format (.csv).
  2. The training file should consist of one or two columns with no headers.
    1. The first column (mandatory) contains a word list, with each symbol (phoneme, orthographic letter, etc.) separated by spaces. For example, the word 'cat' represented in IPA would be "k æ t". You may use any transcription system or representation you like, so long as the individual symbols are separated by spaces. Because symbols are space-separated, they may be arbitrarily long: this allows the use of transcription systems like ARPABET, which use more than one character to represent individual sounds.
    2. The second column (optional) contains the corresponding frequencies for each word. These must be expressed as raw counts. These values are used in the token-weighted variants of the unigram and bigram models, which ascribe greater influence to the phonotactics of more frequent words. If this column is not provided, the token-weighted metrics will not be computed, but the other metrics will be returned.
  3. The test file should consist of a single column containing the test word list. The same format as the training file must be used.
  4. The output file will contain one column containing the test words, one column containing the number of symbols in the word, and one column for each of the metrics.
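As an illustration of the format described above, here is a minimal Python sketch that serializes a word list into headerless CSV. The helper name `write_word_list` and the in-memory buffers are assumptions made for this example, not part of the calculator; the two words and their counts are the ones used later in this document.

```python
import csv
import io

def write_word_list(handle, rows):
    """Write headerless CSV rows: space-separated symbols, then an optional raw count."""
    writer = csv.writer(handle, lineterminator="\n")
    for symbols, count in rows:
        row = [" ".join(symbols)]  # symbols within a word are separated by spaces
        if count is not None:
            row.append(count)      # optional second column: raw frequency count
        writer.writerow(row)

# Training data: two columns (word, raw count).
training_buffer = io.StringIO()
write_word_list(training_buffer, [(["k", "æ", "t"], 1000), (["t", "æ", "k"], 50)])
training_csv = training_buffer.getvalue()

# Test data: a single column in the same symbol format.
test_buffer = io.StringIO()
write_word_list(test_buffer, [(["k", "æ", "k"], None)])
test_csv = test_buffer.getvalue()
```

To write to disk instead, the same helper can be passed a file opened with `open(path, "w", newline="", encoding="utf-8")`.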

Unigram/bigram scores

The UCI Phonotactic Calculator currently supports a suite of unigram and bigram metrics that share the property of being sensitive only to the frequencies of individual sounds or adjacent pairs of sounds. Here is a summary of the columns in the output file produced under this model class.

Column name                           Description
word                                  The word
word_len                              The number of symbols in the word
uni_prob                              Unigram probability
uni_prob_freq_weighted                Frequency-weighted unigram probability
uni_prob_smoothed                     Add-one smoothed unigram probability
uni_prob_freq_weighted_smoothed       Add-one smoothed, frequency-weighted unigram probability
bi_prob                               Bigram probability
bi_prob_freq_weighted                 Frequency-weighted bigram probability
bi_prob_smoothed                      Add-one smoothed bigram probability
bi_prob_freq_weighted_smoothed        Add-one smoothed, frequency-weighted bigram probability
pos_uni_score                         Positional unigram score
pos_uni_score_freq_weighted           Frequency-weighted positional unigram score
pos_uni_score_smoothed                Add-one smoothed positional unigram score
pos_uni_score_freq_weighted_smoothed  Add-one smoothed, frequency-weighted positional unigram score
pos_bi_score                          Positional bigram score
pos_bi_score_freq_weighted            Frequency-weighted positional bigram score
pos_bi_score_smoothed                 Add-one smoothed positional bigram score
pos_bi_score_freq_weighted_smoothed   Add-one smoothed, frequency-weighted positional bigram score

These columns can be broken down into four broad classes:

  1. unigram probabilities (uni_prob, uni_prob_freq_weighted, uni_prob_smoothed, uni_prob_freq_weighted_smoothed)
  2. bigram probabilities (bi_prob, bi_prob_freq_weighted, bi_prob_smoothed, bi_prob_freq_weighted_smoothed)
  3. positional unigram scores (pos_uni_score, pos_uni_score_freq_weighted, pos_uni_score_smoothed, pos_uni_score_freq_weighted_smoothed)
  4. positional bigram scores (pos_bi_score, pos_bi_score_freq_weighted, pos_bi_score_smoothed, pos_bi_score_freq_weighted_smoothed)

Each of these classes has frequency-weighted and smoothed variants.

This document first describes the unweighted (or type-weighted) and unsmoothed variants of each metric. Frequency weighting and smoothing are described in more detail afterwards.

Unigram probability (uni_prob)

In the equations below, w = x_1 \dots x_n refers to a word w that consists of symbols x_1 through x_n (where a symbol might be a phoneme, a character, etc.).

This is the standard unigram probability

P(w=x_1 \dots x_n) \approx \prod_{i=1}^{n} P(x_i)

where

P(x) = \frac{C(x)}{\displaystyle\sum_{y \in \Sigma} C(y)}

Here C(x) is the number of times the symbol x occurs in the training data, and \Sigma is the set of symbols.

This metric reflects the probability of a word under a simple unigram model. The probability of a word is the product of the probabilities of its individual symbols. Note that the probability of each symbol is based only on its frequency of occurrence, not on the position in which it occurs.

If the test data contains symbols that do not occur in the training data, the words containing them will be assigned probabilities of 0.
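The unigram computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `train_unigram` and `uni_prob` are hypothetical, words are represented as lists of symbols, and no padding symbols are counted.

```python
from collections import Counter
from math import prod

def train_unigram(words):
    """Estimate P(x) = C(x) / sum of C(y) from a list of words (lists of symbols)."""
    counts = Counter(sym for word in words for sym in word)
    total = sum(counts.values())
    return {sym: count / total for sym, count in counts.items()}

def uni_prob(word, probs):
    """Product of the symbol probabilities; a symbol unseen in training gives 0."""
    return prod(probs.get(sym, 0.0) for sym in word)

training = [["k", "æ", "t"], ["t", "æ", "k"]]
probs = train_unigram(training)
known = uni_prob(["k", "æ", "t"], probs)    # each symbol has probability 2/6
unknown = uni_prob(["k", "a", "t"], probs)  # "a" never occurs in training
```

In this toy corpus each of the three symbols has probability 1/3, so `known` is (1/3)^3, and `unknown` is 0 because of the unseen symbol.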

Bigram probability (bi_prob)

This is the standard bigram probability

P(w=x_1 \dots x_n) \approx \prod_{i=2}^{n} P(x_i|x_{i-1})

where

P(x|y) = \frac{C(yx)}{C(y)}

Here C(y) is the number of times the symbol y occurs in the training data and C(yx) is the number of times the sequence yx occurs in the training data.

Each word is padded with a special start and end symbol, which allows us to calculate bigram probabilities for symbols that begin and end words.

This metric reflects the probability of words under a simple bigram model. The probability of a word is the product of the probabilities of all the bigrams it contains. Note that the probability of each bigram is based only on its frequency of occurrence, not on the position in which it occurs or its sequencing with respect to other bigrams.
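A minimal Python sketch of the bigram computation follows. The padding strings `<s>` and `</s>`, the function names, and the choice to count C(y) as the number of bigrams that begin with y are assumptions made for this example; the calculator's own implementation may differ in these details.

```python
from collections import Counter
from math import prod

START, END = "<s>", "</s>"  # hypothetical padding symbols

def pad(word):
    """Add the start and end symbols around a word (a list of symbols)."""
    return [START, *word, END]

def train_bigram(words):
    """Count bigrams C(yx) and their left contexts C(y) over padded words."""
    pair_counts, context_counts = Counter(), Counter()
    for word in words:
        padded = pad(word)
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts

def bi_prob(word, pair_counts, context_counts):
    """Product of P(x | y) = C(yx) / C(y) over the padded word."""
    padded = pad(word)
    factors = []
    for prev, cur in zip(padded, padded[1:]):
        if context_counts[prev] == 0:  # unseen context symbol: probability 0
            return 0.0
        factors.append(pair_counts[(prev, cur)] / context_counts[prev])
    return prod(factors)

pair_counts, context_counts = train_bigram([["k", "æ", "t"], ["t", "æ", "k"]])
attested = bi_prob(["k", "æ", "t"], pair_counts, context_counts)
unattested = bi_prob(["k", "k"], pair_counts, context_counts)
```

With this toy corpus every conditional probability along "k æ t" is 1/2, so `attested` is (1/2)^4 once the two padding bigrams are included, while `unattested` is 0 because the bigram "k k" never occurs.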

Positional unigram score (pos_uni_score)

This is a type-weighted variant of the unigram score from Vitevitch and Luce (2004).

PosUniScore(w=x_1 \dots x_n) = 1 + \sum_{i=1}^{n} P(w_i = x_i)

where

P(w_i = x) = \frac{C(w_i = x)}{\displaystyle\sum_{y \in \Sigma} C(w_i = y)}

Here w_i refers to the i^{\text{th}} position in a word and C(w_i = x) is the number of times in the training data the symbol x occurs in the i^{\text{th}} position of a word.

Vitevitch and Luce (2004) add 1 to the sum of the unigram probabilities "to aid in locating these values when you cut and paste the output in the right field to another program." They recommend subtracting 1 from these values before reporting them.

Under this metric, the score assigned to a word is based on the sum of the probabilities of its individual symbols occurring at their respective positions. Note that the ordering of the symbols with respect to one another does not affect the score, only their relative frequencies within their given positions. Higher scores represent words with more probable phonotactics, but note that this score cannot be interpreted as a probability.
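The positional unigram score can be sketched as follows. The function names are hypothetical, positions are indexed from 0 in the code, and the treatment of unseen symbols or positions (they simply contribute nothing to the sum) is an assumption of this sketch rather than documented behavior.

```python
from collections import Counter, defaultdict

def train_pos_unigram(words):
    """Count how often each symbol occurs at each position across the word list."""
    counts = defaultdict(Counter)
    for word in words:
        for i, sym in enumerate(word):
            counts[i][sym] += 1
    return counts

def pos_uni_score(word, counts):
    """1 plus the sum over positions of P(w_i = x_i)."""
    score = 1.0
    for i, sym in enumerate(word):
        at_position = counts.get(i, Counter())
        total = sum(at_position.values())
        if total:  # assumption: positions never seen in training add nothing
            score += at_position[sym] / total
    return score

counts = train_pos_unigram([["k", "æ", "t"], ["t", "æ", "k"]])
score = pos_uni_score(["k", "æ", "t"], counts)
```

Here "k" and "t" each have positional probability 1/2 and "æ" has probability 1 in the second position, so `score` is 1 + 0.5 + 1 + 0.5 = 3.0, or 2.0 after subtracting the added 1.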

Positional bigram score (pos_bi_score)

This is a type-weighted variant of the bigram score from Vitevitch and Luce (2004).

PosBiScore(w=x_1 \dots x_n) = 1 + \sum_{i=2}^{n} P(w_{i-1} = x_{i-1}, w_i = x_i)

where

P(w_{i-1} = y, w_i = x) = \frac{C(w_{i-1} = y, w_i = x)}{\displaystyle\sum_{z \in \Sigma}\sum_{v \in \Sigma} C(w_{i-1} = z, w_i = v)}

Here w_i refers to the i^{\text{th}} position in a word and C(w_{i-1} = y, w_i = x) is the number of times in the training data the sequence yx occurs at the (i-1)^{\text{th}} and i^{\text{th}} positions of a word.

Vitevitch and Luce (2004) add 1 to the sum of the bigram probabilities "to aid in locating these values when you cut and paste the output in the right field to another program." They recommend subtracting 1 from these values before reporting them.

Under this metric, the score assigned to a word is based on the sum of the probabilities of each contiguous pair of symbols occurring at their respective positions. Higher scores represent words with more probable phonotactics, but note that this score cannot be interpreted as a probability.
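A corresponding sketch for the positional bigram score is given below. As before, the function names are hypothetical, no padding symbols are used, and pairs or positions unseen in training are assumed to contribute nothing to the sum.

```python
from collections import Counter, defaultdict

def train_pos_bigram(words):
    """Count each adjacent symbol pair, keyed by the position of its second symbol."""
    counts = defaultdict(Counter)
    for word in words:
        for i in range(1, len(word)):
            counts[i][(word[i - 1], word[i])] += 1
    return counts

def pos_bi_score(word, counts):
    """1 plus the sum over adjacent positions of P(w_{i-1} = x_{i-1}, w_i = x_i)."""
    score = 1.0
    for i in range(1, len(word)):
        at_position = counts.get(i, Counter())
        total = sum(at_position.values())
        if total:  # assumption: positions never seen in training add nothing
            score += at_position[(word[i - 1], word[i])] / total
    return score

counts = train_pos_bigram([["k", "æ", "t"], ["t", "æ", "k"]])
score = pos_bi_score(["k", "æ", "t"], counts)
```

In this toy corpus each of the two bigrams in "k æ t" accounts for half of the pairs at its position, so `score` is 1 + 0.5 + 0.5 = 2.0.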

Token-weighted variants

Assuming that the training data consists of a list of word types (e.g., a dictionary), the above metrics can be described as type-weighted: the frequency of individual word types has no bearing on the scores assigned by the metrics.

The calculator also includes token-weighted variants of each of the above measures, where the phonotactic properties of frequent word types are weighted higher than those in less frequent word types. These are included under all the column names containing freq_weighted.

These measures are computed by changing the count function C such that it is the number of occurrences of the configuration in question multiplied by the natural log of the count of the word containing each occurrence.

For example, suppose we have a corpus containing two word types "kæt", which occurs 1000 times, and "tæk", which occurs 50 times. Under a token-weighted unigram model, C(æ) = ln(1000) + ln(50) \approx 10.82, while in a type-weighted unigram model C(æ) = 1 + 1 = 2.
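The arithmetic in this example can be checked with a short Python sketch. The function name `token_weighted_unigram_counts` is an assumption made for illustration; the weighting itself (each occurrence counted as the natural log of its word's frequency) follows the description above.

```python
from collections import Counter
from math import log

def token_weighted_unigram_counts(word_freqs):
    """Sum ln(word frequency) over every occurrence of each symbol."""
    counts = Counter()
    for symbols, freq in word_freqs:
        weight = log(freq)  # natural log of the word's raw count
        for sym in symbols:
            counts[sym] += weight
    return counts

corpus = [(["k", "æ", "t"], 1000), (["t", "æ", "k"], 50)]
weighted = token_weighted_unigram_counts(corpus)
weighted_ae = weighted["æ"]                                # ln(1000) + ln(50)
type_ae = sum(symbols.count("æ") for symbols, _ in corpus)  # type-weighted count
```

Rounded to two decimal places, `weighted_ae` is 10.82, and `type_ae` is 2, matching the values given in the example.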

The token-weighted positional unigram and bigram scores correspond to the metrics presented in Vitevitch and Luce (2004), though they use the base-10 logarithm rather than the natural logarithm.

Smoothing

The calculator also includes add-one smoothed (or Laplace-smoothed) variants of each measure.

Under add-one smoothing, each configuration we could observe (unigrams, bigrams, positional unigrams, positional bigrams) begins with a default count of 1, rather than 0. This means that configurations that are not observed in the training data (that is, where C(x) = 0 for some configuration x) are treated as though they have been observed once, which gives them a small, rather than zero, probability. This effectively spreads some of the probability mass from attested configurations onto unattested ones.

Smoothing in these models assigns non-zero probabilities to unattested sequences of known symbols, but not to unknown symbols (which is why there is no smoothing for unigram probabilities). Any words in the test data containing symbols not found in the training data are assigned probabilities of zero.

In the token-weighted versions of the metrics, smoothing is also done by adding one to the log-weighted counts.
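To make the effect of add-one smoothing concrete, here is a hedged Python sketch for the type-weighted bigram case. It assumes that every ordered pair of known symbols (including the hypothetical padding symbols `<s>` and `</s>`) counts as a possible configuration, and that the denominator is the sum of the smoothed counts sharing the same left context. The calculator may define the set of possible configurations differently, so the exact values are illustrative only.

```python
from collections import Counter

START, END = "<s>", "</s>"  # hypothetical padding symbols

def count_bigrams(words):
    """Count bigrams over padded words and collect the set of known symbols."""
    pair_counts, vocab = Counter(), set()
    for word in words:
        padded = [START, *word, END]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
    return pair_counts, vocab

def smoothed_bigram_prob(prev, cur, pair_counts, vocab):
    """Add-one smoothed P(cur | prev); unknown symbols still receive 0."""
    if prev not in vocab or cur not in vocab:
        return 0.0
    numerator = pair_counts[(prev, cur)] + 1
    # Assumption: normalize over all known symbols as possible continuations.
    denominator = sum(pair_counts[(prev, nxt)] + 1 for nxt in vocab)
    return numerator / denominator

pair_counts, vocab = count_bigrams([["k", "æ", "t"], ["t", "æ", "k"]])
unseen_pair = smoothed_bigram_prob("k", "k", pair_counts, vocab)
seen_pair = smoothed_bigram_prob("k", "æ", pair_counts, vocab)
unknown_symbol = smoothed_bigram_prob("k", "a", pair_counts, vocab)
```

Under these assumptions the unattested pair "k k" receives probability 1/7 instead of 0, the attested pair "k æ" receives 2/7, and a pair involving the unknown symbol "a" remains at 0.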


References

Vitevitch, M.S., & Luce, P.A. (2004). A web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers, 36(3), 481-487.