The simplest way to understand the format of the input data is to look at examples on the Datasets page. Read on for more details.
The UCI Phonotactic Calculator currently supports a suite of unigram and bigram metrics that share the property of being sensitive only to the frequencies of individual sounds or adjacent pairs of sounds. Here is a summary of the columns in the output file produced under this model class.
Column name | Description |
---|---|
word | The word |
word_len | The number of symbols in the word |
uni_prob | Unigram probability |
uni_prob_freq_weighted | Frequency-weighted unigram probability |
uni_prob_smoothed | Add-one smoothed unigram probability |
uni_prob_freq_weighted_smoothed | Add-one smoothed, frequency-weighted unigram probability |
bi_prob | Bigram probability |
bi_prob_freq_weighted | Frequency-weighted bigram probability |
bi_prob_smoothed | Add-one smoothed bigram probability |
bi_prob_freq_weighted_smoothed | Add-one smoothed, frequency-weighted bigram probability |
pos_uni_score | Positional unigram score |
pos_uni_score_freq_weighted | Frequency-weighted positional unigram score |
pos_uni_score_smoothed | Add-one smoothed positional unigram score |
pos_uni_score_freq_weighted_smoothed | Add-one smoothed, frequency-weighted positional unigram score |
pos_bi_score | Positional bigram score |
pos_bi_score_freq_weighted | Frequency-weighted positional bigram score |
pos_bi_score_smoothed | Add-one smoothed positional bigram score |
pos_bi_score_freq_weighted_smoothed | Add-one smoothed, frequency-weighted positional bigram score |
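These columns can be broken down into four broad classes: unigram probabilities (uni_prob), bigram probabilities (bi_prob), positional unigram scores (pos_uni_score), and positional bigram scores (pos_bi_score).
Each of these classes has a frequency-weighted variant, an add-one smoothed variant, and a variant that is both frequency-weighted and smoothed.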
In the equations below, w = x_1 \dots x_n refers to a word w that consists of symbols x_1 through x_n (where a symbol might be a phoneme, a character, etc.).
This is the standard unigram probability P(w=x_1 \dots x_n) \approx \prod_{i=1}^{n} P(x_i), where P(x) = \frac{C(x)}{\displaystyle\sum_{y \in \Sigma} C(y)}, C(x) is the number of times the symbol x occurs in the training data, and \Sigma is the symbol inventory of the training data. This metric reflects the probability of a word under a simple unigram model. The probability of a word is the product of the probabilities of its individual symbols. Note that the probability of each symbol is based only on its frequency of occurrence, not on the position in which it occurs.
If the test data contains symbols that do not occur in the training data, the tokens containing them will be assigned probabilities of 0.
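The unigram calculation can be sketched in a few lines of Python. This is an illustrative sketch, not the calculator's own code: the function name `unigram_prob` and the representation of words as lists of symbols are assumptions made for the example.

```python
from collections import Counter
from math import prod

def unigram_prob(word, training_words):
    """Product of the relative frequencies of the symbols in `word`.

    `word` and each training word are lists of symbols (an assumption
    made for this sketch).
    """
    # C(x): how often each symbol occurs anywhere in the training data.
    counts = Counter(sym for w in training_words for sym in w)
    total = sum(counts.values())
    # A Counter returns 0 for unseen symbols, so any word containing an
    # unknown symbol gets probability 0.
    return prod(counts[sym] / total for sym in word)

training = [["k", "æ", "t"], ["t", "æ", "k"]]
p_known = unigram_prob(["t", "æ", "t"], training)    # (1/3)**3
p_unknown = unigram_prob(["t", "æ", "b"], training)  # 0, "b" is unseen
```

Because the counts ignore position, any reordering of the same symbols receives the same probability.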
Each word is padded with a special start and end symbol, which allows us to calculate bigram probabilities for symbols that begin and end words.
This metric reflects the probability of words under a simple bigram model. The probability of a word is the product of the probability of all the bigrams it contains. Note that the probability of the bigrams is based only on their frequency of occurrence, not the position in which they occur or their sequencing with respect to one another.
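As a rough illustration of the padding and the product over bigrams, here is a minimal Python sketch. It assumes the usual conditional estimate P(x_i | x_{i-1}) = C(x_{i-1} x_i) / C(x_{i-1}), which the text above does not spell out. The padding symbols `<s>` and `</s>` and the function names are hypothetical.

```python
from collections import Counter
from math import prod

START, END = "<s>", "</s>"  # hypothetical padding symbols

def pad(word):
    """Add the start and end symbols around a list of symbols."""
    return [START] + list(word) + [END]

def bigram_prob(word, training_words):
    """Product of conditional bigram probabilities over the padded word."""
    bigrams = Counter()   # C(a b)
    contexts = Counter()  # C(a), counted as the first member of a bigram
    for w in training_words:
        padded = pad(w)
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    padded = pad(word)
    probs = []
    for a, b in zip(padded, padded[1:]):
        if contexts[a] == 0:
            return 0.0  # unknown symbol: no estimate is available
        probs.append(bigrams[(a, b)] / contexts[a])
    return prod(probs)

training = [["k", "æ", "t"], ["t", "æ", "k"]]
p_attested = bigram_prob(["k", "æ", "t"], training)  # (1/2)**4 = 0.0625
p_unattested = bigram_prob(["k", "t"], training)     # 0.0, "k t" never occurs
```

Without smoothing, a single unattested bigram drives the probability of the whole word to zero.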
Vitevitch and Luce (2004) add 1 to the sum of the unigram probabilities "to aid in locating these values when you cut and paste the output in the right field to another program." They recommend subtracting 1 from these values before reporting them.
Under this metric, the score assigned to a word is based on the sum of the probabilities of its individual symbols occurring at their respective positions. Note that the ordering of the symbols with respect to one another does not affect the score, only their relative frequencies within their given positions. Higher scores represent words with more probable phonotactics, but note that this score cannot be interpreted as a probability.
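A positional unigram score of this kind might be computed as in the following Python sketch. The function name, the `offset` parameter (set it to 1 to reproduce the Vitevitch and Luce convention), and the choice to let positions unseen in training contribute 0 are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def positional_unigram_score(word, training_words, offset=0.0):
    """Sum over positions i of P(symbol at position i).

    `offset` is a hypothetical parameter for this sketch; offset=1.0
    mimics the constant that Vitevitch and Luce (2004) add.
    """
    # pos_counts[i][x]: how often symbol x occurs at position i.
    pos_counts = defaultdict(Counter)
    for w in training_words:
        for i, sym in enumerate(w):
            pos_counts[i][sym] += 1
    score = offset
    for i, sym in enumerate(word):
        total = sum(pos_counts[i].values())
        if total:  # positions never seen in training contribute 0 (assumption)
            score += pos_counts[i][sym] / total
    return score

training = [["k", "æ", "t"], ["t", "æ", "k"]]
s1 = positional_unigram_score(["k", "æ", "t"], training)  # 0.5 + 1.0 + 0.5 = 2.0
s2 = positional_unigram_score(["æ", "t", "k"], training)  # 0 + 0 + 0.5 = 0.5
```

The two test words contain the same symbols, but they score differently because each symbol is evaluated against the counts for its own position. A score such as 2.0 also shows why the result is not a probability.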
Vitevitch and Luce (2004) add 1 to the sum of the bigram probabilities "to aid in locating these values when you cut and paste the output in the right field to another program." They recommend subtracting 1 from these values before reporting them.
Under this metric, the score assigned to a word is based on the sum of the probabilities of each contiguous pair of symbols occurring at their respective positions. Higher scores represent words with more probable phonotactics, but note that this score cannot be interpreted as a probability.
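The positional bigram score can be sketched the same way, counting adjacent pairs by the position of their first symbol. In this Python sketch the function name and the `offset` parameter are hypothetical, and the words are left unpadded, which is an assumption rather than something the text states.

```python
from collections import Counter, defaultdict

def positional_bigram_score(word, training_words, offset=0.0):
    """Sum over positions i of P(pair (x_i, x_{i+1}) at position i)."""
    # pos_counts[i][(a, b)]: how often the pair a b starts at position i.
    pos_counts = defaultdict(Counter)
    for w in training_words:
        for i, pair in enumerate(zip(w, w[1:])):
            pos_counts[i][pair] += 1
    score = offset
    for i, pair in enumerate(zip(word, word[1:])):
        total = sum(pos_counts[i].values())
        if total:  # positions never seen in training contribute 0 (assumption)
            score += pos_counts[i][pair] / total
    return score

training = [["k", "æ", "t"], ["t", "æ", "k"]]
b1 = positional_bigram_score(["k", "æ", "t"], training)  # 0.5 + 0.5 = 1.0
b2 = positional_bigram_score(["æ", "t", "k"], training)  # 0.0, no pair matches
```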
The calculator also includes token-weighted variants of each of the above measures, in which the phonotactic properties of frequent word types are weighted more heavily than those of less frequent word types. These are reported in all of the columns whose names contain freq_weighted.
These measures are computed by changing the count function C such that it is the number of occurrences of the configuration in question multiplied by the natural log of the count of the word containing each occurrence.
For example, suppose we have a corpus containing two word types: "kæt", which occurs 1000 times, and "tæk", which occurs 50 times. Under a token-weighted unigram model, C(æ) = ln(1000) + ln(50) \approx 10.82, while in a type-weighted unigram model C(æ) = 1 + 1 = 2.
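The arithmetic in this example can be checked with a short Python sketch. The function name `symbol_counts` and the corpus representation (a dictionary from word types to token frequencies) are assumptions made for illustration.

```python
from collections import Counter
from math import log

def symbol_counts(word_freqs, token_weighted):
    """Unigram counts C(x), either type-weighted or token-weighted.

    `word_freqs` maps each word type (a tuple of symbols) to its token
    frequency; this layout is an assumption of the sketch.
    """
    counts = Counter()
    for word, freq in word_freqs.items():
        # Each occurrence of a symbol counts as 1 (type-weighted) or as
        # ln(frequency of the containing word) (token-weighted).
        weight = log(freq) if token_weighted else 1.0
        for sym in word:
            counts[sym] += weight
    return counts

corpus = {("k", "æ", "t"): 1000, ("t", "æ", "k"): 50}
token_c = symbol_counts(corpus, token_weighted=True)
type_c = symbol_counts(corpus, token_weighted=False)
print(round(token_c["æ"], 2))  # 10.82
print(type_c["æ"])             # 2.0
```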
The token-weighted positional unigram and bigram scores correspond to the metrics presented in Vitevitch and Luce (2004), though they use the base-10 logarithm rather than the natural logarithm.
Under add-one smoothing, each configuration we count (unigrams, bigrams, positional unigrams, positional bigrams) begins with a default count of 1, rather than 0. This means that configurations that are not observed in the training data (that is, where C(x) = 0 for some configuration x) are treated as though they have been observed once, which gives them a small, rather than zero, probability. This effectively spreads some of the probability mass from attested configurations onto unattested ones.
Smoothing in these models assigns non-zero probabilities to unattested sequences of known symbols, but not to unknown symbols (which is why there is no smoothing for unigram probabilities). Any words in the test data containing symbols not found in the training data are assigned probabilities of zero.
In the token-weighted versions of the metrics, smoothing is also done by adding one to the log-weighted counts.
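To make the effect of add-one smoothing concrete, the following Python sketch applies it to type-weighted bigram counts. It is an illustration under stated assumptions, not the calculator's implementation: the padding symbols, the function name, the conditional-probability normalization, and the choice to enumerate every pair of known symbols are all hypothetical.

```python
from collections import Counter
from itertools import product

START, END = "<s>", "</s>"  # hypothetical padding symbols

def smoothed_bigram_probs(training_words):
    """Add-one smoothed P(b | a) for every pair of known symbols."""
    symbols = sorted({s for w in training_words for s in w})
    firsts = [START] + symbols   # symbols that can open a bigram
    seconds = symbols + [END]    # symbols that can close a bigram
    # Every possible bigram over known symbols starts with a count of 1.
    counts = {pair: 1 for pair in product(firsts, seconds)}
    for w in training_words:
        padded = [START] + list(w) + [END]
        for pair in zip(padded, padded[1:]):
            counts[pair] += 1
    # Normalize by the total smoothed count for each first symbol.
    totals = Counter()
    for (a, _), c in counts.items():
        totals[a] += c
    return {(a, b): c / totals[a] for (a, b), c in counts.items()}

training = [["k", "æ", "t"], ["t", "æ", "k"]]
probs = smoothed_bigram_probs(training)
p_unattested = probs[("k", "t")]  # 1/6: never observed, but non-zero
p_attested = probs[("k", "æ")]    # 2/6: observed once, plus the default 1
```

Only pairs built from known symbols receive an entry, so a bigram containing a symbol absent from the training data, such as ("k", "b"), still has no smoothed probability.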
Vitevitch, M.S., & Luce, P.A. (2004). A web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers, 36(3), 481-487.