About the UCI Phonotactic Calculator


Data format

The simplest way to understand the format of the input data is to look at the examples on the Datasets page. Read below for more details; a small hypothetical example also follows this list:

  1. Both the training and the test file must be in comma-separated format (.csv).
  2. The training file should consist of one or two columns with no headers.
    1. The first column (mandatory) contains a word list, with each symbol (phoneme, orthographic letter, etc.) separated by spaces. For example, the word 'cat' represented in IPA would be "k æ t". You may use any transcription system or representation you like, so long as the individual symbols are separated by spaces. Because symbols are space-separated, they may be arbitrarily long: this allows the use of transcription systems like ARPABET, which use more than one character to represent individual sounds.
    2. The second column (optional) contains the corresponding frequencies for each word. These must be expressed as raw counts. These values are used in the token-weighted variants of the unigram and bigram models, which ascribe greater influence to the phonotactics of more frequent words. If this column is not provided, the token-weighted metrics will not be computed, but the other metrics will be returned.
  3. The test file should consist of a single column containing the test word list. The same format as the training file must be used.
  4. The output file contains one column with the test words, one column with the number of symbols in each word, and one column for each of the metrics.
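
For concreteness, here is a small hypothetical example; the words and counts below are invented purely for illustration. A training file with a frequency column might look like this:

    k æ t,1000
    t æ k,50
    h æ t,200

A corresponding test file contains a single column of words in the same space-separated format:

    k æ
    t æ t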

The metrics

The UCI Phonotactic Calculator currently supports two broad classes of models. We refrain here from any discussion of the appropriateness of each model, and discuss only how each score is computed.

In the equations below, w = x_1 \dots x_n refers to a word w that consists of symbols x_1 through x_n (where a symbol might be a phoneme, a character, etc.).

Unigram/bigram scores

This is a suite of metrics that share the property of being sensitive only to the frequencies of individual sounds or adjacent pairs of sounds. Here is a summary of the columns in the output file produced under this model class.

Column name Description
word The word
word_len The number of symbols in the word
uni_prob Unigram probability
uni_prob_freq_weighted Frequency-weighted unigram probability
uni_prob_smoothed Add-one smoothed unigram probability
uni_prob_freq_weighted_smoothed Add-one smoothed, frequency-weighted unigram probability
bi_prob Bigram probability
bi_prob_freq_weighted Frequency-weighted bigram probability
bi_prob_smoothed Add-one smoothed bigram probability
bi_prob_freq_weighted_smoothed Add-one smoothed, frequency-weighted bigram probability
pos_uni_score Positional unigram score
pos_uni_score_freq_weighted Frequency-weighted positional unigram score
pos_uni_score_smoothed Add-one smoothed positional unigram score
pos_uni_score_freq_weighted_smoothed Add-one smoothed, frequency-weighted positional unigram score
pos_bi_score Positional bigram score
pos_bi_score_freq_weighted Frequency-weighted positional bigram score
pos_bi_score_smoothed Add-one smoothed positional bigram score
pos_bi_score_freq_weighted_smoothed Add-one smoothed, frequency-weighted positional bigram score

These columns can be broken down into four broad classes:

  1. unigram probabilities (uni_prob, uni_prob_freq_weighted, uni_prob_smoothed, uni_prob_freq_weighted_smoothed)
  2. bigram probabilities (bi_prob, bi_prob_freq_weighted, bi_prob_smoothed, bi_prob_freq_weighted_smoothed)
  3. positional unigram scores (pos_uni_score, pos_uni_score_freq_weighted, pos_uni_score_smoothed, pos_uni_score_freq_weighted_smoothed)
  4. positional bigram scores (pos_bi_score, pos_bi_score_freq_weighted, pos_bi_score_smoothed, pos_bi_score_freq_weighted_smoothed)

Each of these classes has frequency-weighted and smoothed variants.

This document first describes the unweighted (or type-weighted) and unsmoothed variants of each metric. Frequency weighting and smoothing are described in more detail afterwards.

Unigram probability (uni_prob)

This is the standard unigram probability P(w=x_1 \dots x_n) \approx \prod_{i=1}^{n} P(x_i) where P(x) = \frac{C(x)}{\displaystyle\sum_{y \in \Sigma} C(y)} where C(x) is the number of times the symbol x occurs in the training data.

This metric reflects the probability of a word under a simple unigram model. The probability of a word is the product of the probability of its individual symbols. Note that the probability of the individual symbols is based only on their frequency of occurrence, not the position in which they occur.

If the test data contains symbols that do not occur in the training data, the words containing them will be assigned probabilities of 0.
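
As a rough illustration of this computation (a minimal sketch, not the calculator's actual implementation), the following Python snippet computes the unigram probability of a word from a list of training words, where each word is a list of symbols:

    from collections import Counter
    from math import prod

    def unigram_probability(word, training_words):
        # Count every symbol occurrence in the training data: C(x)
        counts = Counter(s for w in training_words for s in w)
        total = sum(counts.values())
        # Multiply the per-symbol probabilities; an unseen symbol yields 0
        return prod(counts[s] / total for s in word)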

Bigram probability (bi_prob)

This is the standard bigram probability P(w=x_1 \dots x_n) \approx \prod_{i=2}^{n} P(x_i|x_{i-1}) where P(x|y) = \frac{C(yx)}{C(y)} where C(y) is the number of times the symbol y occurs in the training data and C(yx) is the number of times the sequence yx occurs in the training data.

Each word is padded with a special start and end symbol, which allows us to calculate bigram probabilities for symbols that begin and end words.

This metric reflects the probability of words under a simple bigram model. The probability of a word is the product of the probability of all the bigrams it contains. Note that the probability of the bigrams is based only on their frequency of occurrence, not the position in which they occur or their sequencing with respect to one another.
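
The following Python sketch (again an illustration, using assumed boundary symbols "<s>" and "</s>" rather than the calculator's actual code) shows how the padded bigram probability can be computed:

    from collections import Counter
    from math import prod

    def bigram_probability(word, training_words, start="<s>", end="</s>"):
        context_counts = Counter()  # C(y): times y occurs as the left member of a bigram
        bigram_counts = Counter()   # C(yx)
        for w in training_words:
            padded = [start] + list(w) + [end]
            context_counts.update(padded[:-1])
            bigram_counts.update(zip(padded, padded[1:]))
        padded = [start] + list(word) + [end]
        # Multiply P(x | y) over all adjacent pairs; unseen contexts or bigrams yield 0
        return prod(
            bigram_counts[y, x] / context_counts[y] if context_counts[y] else 0.0
            for y, x in zip(padded, padded[1:])
        )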

Positional unigram score (pos_uni_score)

This is a type-weighted variant of the unigram score from Vitevitch and Luce (2004). PosUniScore(w=x_1 \dots x_n) = 1 + \sum_{i=1}^{n} P(w_i = x_i) where P(w_i = x) = \frac{C(w_i = x)}{\displaystyle\sum_{y \in \Sigma} C(w_i = y)} where w_i refers to the i^{\text{th}} position in a word and C(w_i = x) is the number of times in the training data the symbol x occurs in the i^{\text{th}} position of a word.

Vitevitch and Luce (2004) add 1 to the sum of the unigram probabilities "to aid in locating these values when you cut and paste the output in the right field to another program." They recommend subtracting 1 from these values before reporting them.

Under this metric, the score assigned to a word is based on the sum of the probabilities of its individual symbols occurring at their respective positions. Note that the ordering of the symbols with respect to one another does not affect the score, only their relative frequencies within their given positions. Higher scores represent words with more probable phonotactics, but note that this score cannot be interpreted as a probability.
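
A minimal Python sketch of this score (an illustration only; the helper below is not the calculator's actual code) might look like this:

    from collections import Counter, defaultdict

    def positional_unigram_score(word, training_words):
        position_counts = defaultdict(Counter)  # position i -> counts of symbols at i
        for w in training_words:
            for i, s in enumerate(w):
                position_counts[i][s] += 1
        score = 1.0  # the +1 from Vitevitch and Luce (2004)
        for i, s in enumerate(word):
            total = sum(position_counts[i].values())
            score += position_counts[i][s] / total if total else 0.0
        return score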

Positional bigram score (pos_bi_score)

This is a type-weighted variant of the bigram score from Vitevitch and Luce (2004). PosBiScore(w=x_1 \dots x_n) = 1 + \sum_{i=2}^{n} P(w_{i-1} = x_{i-1}, w_i = x_i) where P(w_{i-1} = y, w_i = x) = \frac{C(w_{i-1} = y, w_i = x)}{\displaystyle\sum_{z \in \Sigma}\sum_{v \in \Sigma} C(w_{i-1} = z, w_i = v)} where w_i refers to the i^{\text{th}} position in a word and C(w_{i-1} = y, w_i = x) is the number of times in the training data the sequence yx occurs at the (i-1)^{\text{th}} and i^{\text{th}} positions of a word.

Vitevitch and Luce (2004) add 1 to the sum of the bigram probabilities "to aid in locating these values when you cut and paste the output in the right field to another program." They recommend subtracting 1 from these values before reporting them.

Under this metric, the score assigned to a word is based on the sum of the probability of each contiguous pair of symbols occurring at its respective positions. Higher scores represent words with more probable phonotactics, but note that this score cannot be interpreted as a probability.
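
The corresponding Python sketch for the positional bigram score (again an illustration, not the calculator's actual code):

    from collections import Counter, defaultdict

    def positional_bigram_score(word, training_words):
        pair_counts = defaultdict(Counter)  # position i -> counts of (y, x) pairs ending at i
        for w in training_words:
            for i in range(1, len(w)):
                pair_counts[i][w[i - 1], w[i]] += 1
        score = 1.0  # the +1 from Vitevitch and Luce (2004)
        for i in range(1, len(word)):
            total = sum(pair_counts[i].values())
            score += pair_counts[i][word[i - 1], word[i]] / total if total else 0.0
        return score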

Token-weighted variants

Assuming that the training data consists of a list of word types (e.g., a dictionary), the above metrics can be described as type-weighted: the frequency of individual word types has no bearing on the scores assigned by the metrics.

The calculator also includes token-weighted variants of each of the above measures, in which the phonotactic properties of more frequent word types are weighted more heavily than those of less frequent word types. These appear in the columns whose names contain freq_weighted.

These measures are computed by redefining the count function C: each occurrence of the configuration in question contributes the natural log of the frequency of the word containing it, rather than 1.

For example, suppose we have a corpus containing two word types "kæt", which occurs 1000 times, and "tæk", which occurs 50 times. Under a token-weighted unigram model, C(æ) = ln(1000) + ln(50) \approx 10.82, while in a type-weighted unigram model C(æ) = 1 + 1 = 2.
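
The arithmetic in this example can be checked with a few lines of Python (the corpus here is the same hypothetical one as above):

    from math import log

    corpus = {("k", "æ", "t"): 1000, ("t", "æ", "k"): 50}  # word -> raw frequency

    token_weighted = sum(w.count("æ") * log(freq) for w, freq in corpus.items())
    type_weighted = sum(w.count("æ") for w in corpus)
    print(round(token_weighted, 2))  # 10.82
    print(type_weighted)             # 2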

The token-weighted positional unigram and bigram scores correspond to the metrics presented in Vitevitch and Luce (2004), though that work uses the base-10 logarithm rather than the natural logarithm.

Smoothing

The calculator also includes add-one smoothed (or Laplace-smoothed) variants of each measure.

Under add-one smoothing, each configuration that is counted (unigrams, bigrams, positional unigrams, positional bigrams) begins with a default count of 1, rather than 0. This means that configurations that are not observed in the training data (that is, where C(x) = 0 for some configuration x) are treated as though they have been observed once, which gives them a small, rather than zero, probability. This effectively spreads some of the probability mass from attested configurations onto unattested ones.

Smoothing in these models assigns non-zero probabilities to unattested configurations of known symbols, but not to unknown symbols: any words in the test data containing symbols not found in the training data are still assigned probabilities of zero.

In the token-weighted versions of the metrics, smoothing is also done by adding one to the log-weighted counts.
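
As a small illustration (a sketch with assumed counts, not the calculator's actual code), add-one smoothing for unigram probabilities can be written as follows; note that unknown symbols still receive zero probability because the alphabet is limited to symbols seen in training:

    from collections import Counter

    def smoothed_unigram_prob(symbol, counts):
        alphabet = set(counts)  # only symbols observed in training
        if symbol not in alphabet:
            return 0.0
        total = sum(counts[s] + 1 for s in alphabet)  # every known symbol starts at 1
        return (counts[symbol] + 1) / total

    counts = Counter({"k": 3, "æ": 2, "t": 3})
    print(smoothed_unigram_prob("æ", counts))  # (2 + 1) / (8 + 3) ≈ 0.273
    print(smoothed_unigram_prob("z", counts))  # 0.0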

RNN Model

The RNN model implements the simple recurrent neural network model of phonotactics presented in Mayer and Nelson (2020). The primary difference between this model and the unigram/bigram models above is that it is not restricted to a fixed context window: while the unigram/bigram measures are restricted to considering individual sounds or contiguous pairs of sounds, the RNN model can (in principle) learn phonotactic constraints that span arbitrarily long distances. We refer the reader to the original paper for details of how these scores are computed.

The scores computed by this model are perplexities rather than probabilities. Lower perplexities correspond to higher probabilities.

This model is run separately from the unigram/bigram models above because it is more time-consuming to train. The model is run with the default settings described in the paper. Readers who are interested in changing the parameters of this model are encouraged to use the command-line interface available on the project's GitHub repository.


References

Mayer, C., & Nelson, M. (2020). Phonotactic learning with neural language models. Proceedings of the Society for Computation in Linguistics, 3, Article 16.

Vitevitch, M.S., & Luce, P.A. (2004). A web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers, 36(3), 481-487.