
UCI Phonotactic Calculator

File List


A subset of the CMU Pronouncing Dictionary containing words with CELEX frequencies > 1, transcribed in ARPABET. Numbers indicating vowel stress have been removed.
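As an illustration of what removing stress numbers means, here is a minimal Python sketch. It assumes space-separated ARPABET symbols in the CMU Pronouncing Dictionary style, where vowels carry a trailing digit (0, 1, or 2). The function name `strip_stress` is hypothetical, and this is not necessarily how the file was actually prepared.

```python
import re


def strip_stress(transcription: str) -> str:
    """Drop the trailing stress digit from each space-separated ARPABET symbol.

    Hypothetical helper for illustration only; assumes CMU-style input
    such as "K AE1 T", where only vowels end in a digit.
    """
    symbols = transcription.split()
    # Remove a single digit at the end of a symbol, if one is present.
    return " ".join(re.sub(r"\d$", "", sym) for sym in symbols)


if __name__ == "__main__":
    print(strip_stress("K AE1 T"))       # K AE T
    print(strip_stress("AH0 B AW1 T"))   # AH B AW T
```

Consonant symbols pass through unchanged, since they carry no digit.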


A subset of the CMU Pronouncing Dictionary with CELEX frequencies. This data is represented in ARPABET.


Data set from Needle et al. (2022). Consists of about 11,000 monomorphemic words from CELEX (Baayen et al. 1995) in ARPABET transcription.


55 English onsets and their CELEX type frequencies in ARPABET format from Hayes & Wilson (2008). A subset of the onsets in the CMU Pronouncing Dictionary.


Finnish data drawn from a word list generated by the Institute for the Languages of Finland (http://kaino.kotus.fi/sanat/nykysuomi/). Represented orthographically. See Mayer (2020) for details.


French corpus used in Goldsmith & Xanthos (2009) and Mayer (2020). Represented in IPA.


Polish onsets with type frequencies from Jarosz (2017). Generated from a corpus of child-directed speech consisting of about 43,000 word types (Haman et al. 2011). Represented orthographically.


Samoan word list from Milner (1993), compiled by Kie Zuraw. Represented in IPA.


A set of about 24,000 Spanish word types, including inflected forms, from the EsPal database (Duchon et al. 2013), in IPA with stress encoded. Frequencies come from a large collection of Spanish subtitle data.


A set of about 18,000 citation forms from the Turkish Electronic Living Lexicon database (TELL; Inkelas et al. 2000) in IPA.