UCI Phonotactic Calculator

File List

english.csv	A subset of the CMU Pronouncing Dictionary containing words with CELEX frequencies > 1. Transcribed in ARPABET, with the numbers indicating vowel stress removed.
english_freq.csv	A subset of the CMU Pronouncing Dictionary with CELEX frequencies. This data is represented in ARPABET.
english_needle.csv	Data set from Needle et al. (2022). Consists of about 11,000 monomorphemic words from CELEX (Baayen et al. 1995) in ARPABET transcription.
english_onsets.csv	55 English onsets and their CELEX type frequencies in ARPABET format from Hayes & Wilson (2008). A subset of the onsets in the CMU Pronouncing Dictionary.
finnish.csv	From a word list generated by the Institute for the Languages of Finland (http://kaino.kotus.fi/sanat/nykysuomi/). Represented orthographically. See Mayer (2020) for details.
french.csv	French corpus used in Goldsmith & Xanthos (2009) and Mayer (2020). Represented in IPA.
polish_onsets.csv	Polish onsets with type frequencies from Jarosz (2017). Generated from a corpus of child-directed speech consisting of about 43,000 word types (Haman et al. 2011). Represented orthographically.
samoan.csv	Samoan word list from Milner (1993), compiled by Kie Zuraw. Represented in IPA.
spanish_stress.csv	A set of about 24,000 word types, including inflected forms, from the EsPal database (Duchon et al. 2013), in IPA with stress encoded. Frequencies are drawn from a large collection of Spanish subtitle data.
turkish.csv	A set of about 18,000 citation forms from the Turkish Electronic Living Lexicon database (TELL; Inkelas et al. 2000) in IPA.
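
Loading a file

The list above does not describe the column layout of these files, so the following is only a minimal sketch under stated assumptions: that each row holds a space-separated transcription in the first column and, where frequencies are provided, a numeric frequency in the second column, with no header row. The function name read_word_list and the sample rows are hypothetical and are not taken from any of the files listed.

```python
import csv
import io


def read_word_list(handle):
    """Parse rows of (symbols, optional frequency) from an open CSV handle.

    Assumed layout (not confirmed by the file list): column 1 is a
    space-separated transcription, column 2 is an optional frequency.
    """
    entries = []
    for row in csv.reader(handle):
        # Skip blank lines and rows with an empty first column.
        if not row or not row[0].strip():
            continue
        symbols = row[0].split()
        has_freq = len(row) > 1 and row[1].strip()
        freq = float(row[1]) if has_freq else None
        entries.append((symbols, freq))
    return entries


# Hypothetical sample rows in ARPABET, for illustration only.
sample = io.StringIO("K AE T,120\nD AO G,95\nS T R IY T\n")
entries = read_word_list(sample)
for symbols, freq in entries:
    print(symbols, freq)
```

To read one of the listed files instead of the in-memory sample, the same function can be given a handle opened with open("english_freq.csv", newline="", encoding="utf-8"), provided the file actually follows the assumed layout.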