| english.csv | A subset of the CMU Pronouncing Dictionary with CELEX frequencies > 1. This is notated in ARPABET. Numbers indicating vowel stress have been removed. |
| english_freq.csv | A subset of the CMU Pronouncing Dictionary with CELEX frequencies. This data is represented in ARPABET. |
| english_needle.csv | Data set from Needle et al. (2022). Consists of about 11,000 monomorphemic words from CELEX (Baayen et al. 1995) in ARPABET transcription. |
| english_onsets.csv | 55 English onsets and their CELEX type frequencies in ARPABET format from Hayes & Wilson (2008). A subset of the onsets in the CMU Pronouncing Dictionary. |
| finnish.csv | From a word list generated by the Institute for the Languages of Finland (http://kaino.kotus.fi/sanat/nykysuomi/). Represented orthographically. See Mayer (2020) for details. |
| french.csv | French corpus used in Goldsmith & Xanthos (2009) and Mayer (2020). Represented in IPA. |
| polish_onsets.csv | Polish onsets with type frequencies from Jarosz (2017). Generated from a corpus of child-directed speech consisting of about 43,000 word types (Haman et al. 2011). Represented orthographically. |
| samoan.csv | Samoan word list from Milner (1993), compiled by Kie Zuraw. Represented in IPA. |
| spanish_stress.csv | A set of about 24,000 word types including inflected forms from the EsPal database (Duchon et al. 2013) in IPA with stress encoded. Frequencies from a large collection of Spanish subtitle data. |
| turkish.csv | A set of about 18,000 citation forms from the Turkish Electronic Living Lexicon database (TELL; Inkelas et al. 2000) in IPA. |
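
The stress removal mentioned for english.csv can be illustrated with a short sketch. This is not the script used to build the file; it assumes CMU-style ARPABET transcriptions in which symbols are separated by spaces and vowels carry a trailing stress digit (0, 1, or 2). The function name `strip_stress` is hypothetical.

```python
import re

# Assumption: CMU-style ARPABET, where a vowel symbol ends in a
# stress digit (0 = unstressed, 1 = primary, 2 = secondary).
_STRESS_DIGIT = re.compile(r"[012]$")


def strip_stress(transcription: str) -> str:
    """Return a space-separated ARPABET transcription without stress digits."""
    symbols = transcription.split()
    return " ".join(_STRESS_DIGIT.sub("", symbol) for symbol in symbols)


if __name__ == "__main__":
    # "cat" and "hello" in CMU-style notation
    print(strip_stress("K AE1 T"))
    print(strip_stress("HH AH0 L OW1"))
```

Consonant symbols carry no digit in this notation, so they pass through unchanged.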