Word frequency is an important variable in cognitive processing. High-frequency words are perceived and produced faster and more efficiently than low-frequency words. At the same time, they are easier to recall but more difficult to recognize in episodic memory tasks.
To investigate the word frequency effect or to match stimuli on word frequency, psychologists need estimates of how often words occur in a language. In American English the Kucera and Francis (KF) frequencies have become the norm. This is surprising because the KF frequencies are dated (from 1967) and based on a corpus of only 1.014 million words. Several studies have confirmed the poor quality of the Kucera and Francis word frequencies (Burgess & Livesay, 1998; Zevin & Seidenberg, 2002; Balota et al., 2004).
Another word frequency measure regularly used is based on the Celex database (Baayen, Piepenbrock, & van Rijn, 1993). This measure is better than Kucera and Francis, but not optimal either (Balota et al., 2004; Zevin & Seidenberg, 2002).
To assess the quality of a frequency measure, one needs word processing times. These have become available as part of the Elexicon project (http://elexicon.wustl.edu/). Brysbaert & New (Behavior Research Methods, in press) calculated the percentages of variance accounted for by the Kucera and Francis and the Celex frequencies in the accuracies and reaction times of a lexical decision task.
| Measure | Accuracy, all words (N = 37,059) | RT, all words (N = 31,201) |
| Kucera and Francis | 19.6 | 57.7 |
| Celex | 25.2 | 60.6 |
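To make "percentage of variance accounted for" concrete, the following is a minimal sketch that computes it as the squared Pearson correlation between one predictor and one outcome. It assumes a simple linear fit; the analysis by Brysbaert & New may have used a different regression model. The function name `variance_explained` and the example values are hypothetical and only illustrate the calculation.

```python
def variance_explained(predictor, outcome):
    """Return the percentage of variance in `outcome` accounted for by a
    simple linear fit on `predictor` (squared Pearson correlation x 100)."""
    n = len(predictor)
    if n != len(outcome) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")

    mean_x = sum(predictor) / n
    mean_y = sum(outcome) / n

    # Sums of squares and cross-products around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, outcome))
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    syy = sum((y - mean_y) ** 2 for y in outcome)
    if sxx == 0 or syy == 0:
        raise ValueError("both sequences must vary")

    return 100.0 * sxy ** 2 / (sxx * syy)


# Made-up illustration values, NOT taken from any corpus or experiment:
# log frequencies of five items and their mean lexical decision times (ms).
toy_log_freq = [0.5, 1.2, 2.0, 2.9, 3.6]
toy_rt_ms = [820, 760, 700, 655, 610]
print(f"{variance_explained(toy_log_freq, toy_rt_ms):.1f}% of RT variance")
```

A frequency measure that yields a higher percentage on the same set of words predicts the behavioral data better, which is the basis of the comparison in the table above.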
Brysbaert & New compiled a new frequency measure on the basis of American subtitles (51 million words in total). There are two measures:
The percentage of variance accounted for by these measures is significantly higher than the variance accounted for by Kucera & Francis, and Celex.
| Measure | Accuracy, all words (N = 37,059) | RT, all words (N = 31,201) |
| SUBTLWF | 30.1 | 62.3 |
| SUBTLCD | 31.3 | 62.9 |
For short words, the percentages of variance accounted for are also better than the fit with HAL, Zeno et al., and the word frequencies based on the British National Corpus. In addition, the corpus indicates which words are likely to be used as names (e.g., Mark, Archer, etc.). The total frequencies of these words are overestimates: more variance in RTs is accounted for when only the occurrences starting with a lowercase letter are counted than when the total frequencies are used. The full analysis by Brysbaert & New can be read here.
The new frequency measures based on the SUBTLEXUS database can be found here:
| Lg10WF | SUBTLWF |
| 1.00 | 0.2 |
| 2.00 | 2 |
| 3.00 | 20 |
| 4.00 | 200 |
| 5.00 | 2000 |
| Lg10CD | SUBTLCD |
| 0.95 | 0.1 |
| 1.93 | 1 |
| 2.92 | 10 |
| 3.92 | 100 |
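The correspondence between Lg10WF and SUBTLWF in the first table can be sketched in a few lines. The sketch assumes that Lg10WF is log10(raw count + 1) and that SUBTLWF is the raw count divided by the corpus size in millions of words (51, the round figure given in the text). Both assumptions reproduce the tabled values after rounding, but they are not spelled out on this page; the function names are hypothetical.

```python
import math

# Corpus size in millions of words, taken from the text (rounded figure).
CORPUS_SIZE_MILLIONS = 51.0


def lg10wf_to_per_million(lg10wf, corpus_millions=CORPUS_SIZE_MILLIONS):
    """Convert Lg10WF to frequency per million words.

    Assumes Lg10WF = log10(raw_count + 1).
    """
    raw_count = 10 ** lg10wf - 1
    return raw_count / corpus_millions


def per_million_to_lg10wf(per_million, corpus_millions=CORPUS_SIZE_MILLIONS):
    """Inverse conversion, under the same assumption."""
    raw_count = per_million * corpus_millions
    return math.log10(raw_count + 1)


# Print the per-million value for each Lg10WF step listed in the table.
for lg in (1.0, 2.0, 3.0, 4.0, 5.0):
    print(f"Lg10WF {lg:.2f} -> {lg10wf_to_per_million(lg):.3g} per million")
```

Working on the log scale is convenient because equal steps in Lg10WF correspond to tenfold changes in frequency per million.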
Click here to enter a list of words and immediately get your SUBTLEX frequencies. This site also allows you to select stimuli within a specific frequency range (e.g. between 1 and 10 per million).
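If you work with a downloaded frequency list instead of the website, the same kind of range selection can be done locally. The following is a minimal sketch, assuming the frequencies per million have already been loaded into a dictionary; the function name `select_by_frequency` and the placeholder entries are hypothetical and do not come from the SUBTLEX file.

```python
def select_by_frequency(freq_per_million, low, high):
    """Return the words whose frequency per million lies in [low, high],
    in alphabetical order."""
    if low > high:
        raise ValueError("low must not exceed high")
    return sorted(
        word for word, freq in freq_per_million.items() if low <= freq <= high
    )


# Placeholder entries with made-up values, for illustration only.
toy_frequencies = {
    "item_a": 0.4,
    "item_b": 1.0,
    "item_c": 6.5,
    "item_d": 10.0,
    "item_e": 250.0,
}

# Select stimuli between 1 and 10 per million (bounds included).
print(select_by_frequency(toy_frequencies, 1, 10))
```

The bounds are inclusive here; whether the website treats them the same way is not stated, so check the boundary cases if they matter for your stimulus set.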
We have now tagged the SUBTLEX-US corpus with the CLAWS tagger, so that we can add Part-of-Speech (PoS) information to the SUBTLEX-US word frequencies. Five new columns have been added to the file:
You can find more information about the tagging in Brysbaert, New, & Keuleers (Behavior Research Methods, 2012).
You can find a zipped Excel version of the SUBTLEX-US word frequency file with PoS information here.
You can find a zipped text version of the file here.