UCI Phonotactic Calculator

File List

english.csv

A subset of the CMU Pronouncing Dictionary, restricted to words with CELEX frequencies > 1. Transcribed in ARPABET. The digits marking vowel stress have been removed.
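As an illustration of what removing the stress digits involves, here is a minimal sketch. It assumes each transcription is a string of space-separated ARPABET symbols, as in the CMU Pronouncing Dictionary, where vowels carry a trailing 0, 1, or 2. The function name `strip_stress` is hypothetical and this is not necessarily how english.csv was prepared.

```python
import re

# In the CMU Pronouncing Dictionary, vowel symbols end in a stress digit:
# 0 (unstressed), 1 (primary stress), or 2 (secondary stress).
_STRESS_DIGITS = re.compile(r"\d+$")


def strip_stress(transcription: str) -> str:
    """Remove trailing stress digits from each space-separated ARPABET symbol.

    Hypothetical helper for illustration; assumes symbols are separated
    by whitespace.
    """
    symbols = transcription.split()
    return " ".join(_STRESS_DIGITS.sub("", symbol) for symbol in symbols)


if __name__ == "__main__":
    print(strip_stress("K AE1 T"))      # K AE T
    print(strip_stress("AH0 B AW1 T"))  # AH B AW T
```

Consonant symbols have no digits and pass through unchanged, so the same function can be applied to every symbol without first checking whether it is a vowel.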

english_freq.csv

A subset of the CMU Pronouncing Dictionary with CELEX frequencies. This data is represented in ARPABET.

english_needle.csv

Data set from Needle et al. (2022). Consists of about 11,000 monomorphemic words from CELEX (Baayen et al. 1995) in ARPABET transcription.

english_onsets.csv

55 English onsets in ARPABET with their CELEX type frequencies, from Hayes & Wilson (2008). These are a subset of the onsets in the CMU Pronouncing Dictionary.

finnish.csv

From a word list generated by the Institute for the Languages of Finland (http://kaino.kotus.fi/sanat/nykysuomi/). Represented orthographically. See Mayer (2020) for details.

french.csv

French corpus used in Goldsmith & Xanthos (2009) and Mayer (2020). Represented in IPA.

polish_onsets.csv

Polish onsets with type frequencies from Jarosz (2017). Generated from a corpus of child-directed speech consisting of about 43,000 word types (Haman et al. 2011). Represented orthographically.

samoan.csv

Samoan word list from Milner (1993), compiled by Kie Zuraw. Represented in IPA.

spanish_stress.csv

A set of about 24,000 word types, including inflected forms, from the EsPal database (Duchon et al. 2013) in IPA with stress encoded. Frequencies are from a large collection of Spanish subtitle data.

turkish.csv

A set of about 18,000 citation forms from the Turkish Electronic Living Lexicon database (TELL; Inkelas et al. 2000) in IPA.