Word frequency is an important variable in cognitive processing. High-frequency words are perceived and produced faster and more efficiently than low-frequency words. At the same time, they are easier to recall but more difficult to recognize in episodic memory tasks.
To investigate the word frequency effect or to match stimuli on word frequency, psychologists need estimates of how often words occur in a language. In American English the Kucera and Francis (KF) frequencies have become the norm. This is surprising because the KF frequencies are dated (from 1967) and based on a corpus of only 1.014 million words. Several studies have confirmed the poor quality of the Kucera and Francis word frequencies (Burgess & Livesay, 1998; Zevin & Seidenberg, 2002; Balota et al., 2004).
Another word frequency measure regularly used is based on the Celex database (Baayen, Piepenbrock, & van Rijn, 1993). This measure is better than Kucera and Francis, but not optimal either (Balota et al., 2004; Zevin & Seidenberg, 2002).
To assess the quality of a frequency measure, one needs word processing times. These have become available as part of the Elexicon project (http://elexicon.wustl.edu/). Brysbaert & New (Behavior Research Methods, in press) calculated the percentages of variance accounted for by the Kucera and Francis and the Celex frequencies in the accuracies (Acc) and reaction times (RT) of a lexical decision task.
| | Acc, all words (N = 37,059) | RT, all words (N = 31,201) |
| Kucera and Francis | 19.6 | 57.7 |
| Celex | 25.2 | 60.6 |
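As an illustration of what "percentage of variance accounted for" means, here is a minimal sketch. It assumes a simple linear regression of reaction time on log10 frequency, which is a simplification and not necessarily the model Brysbaert & New fitted. The function name `variance_explained` and the example data are hypothetical.

```python
import math


def variance_explained(x, y):
    """Return R^2, as a percentage, for a simple linear regression of y on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    # With a single predictor, R^2 equals the squared Pearson correlation.
    return 100.0 * sxy ** 2 / (sxx * syy)


# Hypothetical data: frequencies per million and mean lexical decision RTs (ms).
freq_per_million = [0.5, 2.0, 10.0, 50.0, 200.0]
rt_ms = [780.0, 720.0, 660.0, 640.0, 590.0]
log_freq = [math.log10(f) for f in freq_per_million]
print(f"{variance_explained(log_freq, rt_ms):.1f}% of RT variance accounted for")
```

Comparing two frequency measures then amounts to computing this percentage for each measure on the same set of words and seeing which is higher.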
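Brysbaert & New compiled a new frequency measure on the basis of American subtitles (51 million words in total). There are two measures: SUBTLWF (word frequency per million words) and SUBTLCD (contextual diversity, the percentage of films in which the word appears).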
The percentage of variance accounted for by these measures is significantly higher than the variance accounted for by Kucera & Francis, and Celex.
| | Acc, all words (N = 37,059) | RT, all words (N = 31,201) |
| SUBTLWF | 30.1 | 62.3 |
| SUBTLCD | 31.3 | 62.9 |
For short words, the percentages of variance accounted for are also better than the fit with HAL, Zeno et al., and the word frequencies based on the British National Corpus. In addition, the corpus indicates which words are likely to be used as names (e.g., Mark, Archer, etc.). The total frequencies of these words are overestimates: more variance in RTs is accounted for when only the occurrences starting with a lowercase letter are counted than when the total frequencies are used. The full analysis by Brysbaert & New can be read here.
The new frequency measures based on the SUBTLEXUS database can be found here:
| Lg10WF | SUBTLWF |
| 1.00 | 0.2 |
| 2.00 | 2 |
| 3.00 | 20 |
| 4.00 | 200 |
| 5.00 | 2000 |

| Lg10CD | SUBTLCD |
| 0.95 | 0.1 |
| 1.93 | 1 |
| 2.92 | 10 |
| 3.92 | 100 |
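The Lg10WF rows above are consistent with Lg10WF being log10(raw count + 1) and SUBTLWF being the raw count divided by the corpus size in millions (51, as stated earlier). The sketch below converts between the two under that assumption. The formula is inferred from the table rather than stated in the text, and the function names are hypothetical.

```python
import math

# Corpus size in millions of words, as given in the text.
CORPUS_MILLIONS = 51.0


def lg10wf_to_per_million(lg10wf):
    """Convert Lg10WF to a frequency per million words.

    Assumes Lg10WF = log10(raw count + 1); this is an inference from the table.
    """
    raw_count = 10 ** lg10wf - 1
    return raw_count / CORPUS_MILLIONS


def per_million_to_lg10wf(per_million):
    """Inverse conversion, under the same assumption."""
    return math.log10(per_million * CORPUS_MILLIONS + 1)


for lg in (1.0, 2.0, 3.0, 4.0, 5.0):
    print(lg, round(lg10wf_to_per_million(lg), 2))
```

The printed values agree with the table once they are rounded to one significant digit, as the table appears to do.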
Click here to enter a list of words and immediately get your SUBTL frequencies. This site also allows you to select stimuli within a specific frequency range (e.g. between 1 and 10 per million).
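If the frequencies have been downloaded, the same kind of range selection can be done locally. The sketch below is only an illustration: it assumes the SUBTLWF values have already been loaded into a dictionary mapping words to frequencies per million. The function name and the example values are hypothetical, not taken from the database.

```python
def select_by_frequency(freqs, low, high):
    """Return, in alphabetical order, the words with low <= frequency <= high."""
    if low > high:
        raise ValueError("low must not exceed high")
    return sorted(word for word, f in freqs.items() if low <= f <= high)


# Hypothetical per-million frequencies, for illustration only.
example_freqs = {
    "anchor": 4.2,
    "house": 514.0,
    "lantern": 1.6,
    "quill": 0.3,
    "window": 86.2,
}
print(select_by_frequency(example_freqs, 1, 10))
```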