Wordlists - Benchmarking

Overview

FoundationStone 3.1 or later lets you update the frequencies in a wordlist against a benchmark list of frequencies. This matters because you want to learn the most frequent words first.

Two benchmark lists are available: a Modern list and a Classical list. You'll find the lists themselves inside the application at .../JavaSupport/Benchmarks/.

The Modern list is prepared from a web crawl of 30 million words from the left-leaning Israeli daily newspaper Haaretz. This work is available online at The Hebrew Frequency Database. Words of length 2 or more are included, each in the form in which it appeared on Haaretz's site, with a frequency per million words.

The Classical list is a complete count of the words in the Tanakh, with full niqudot (vowels). The counts are normalised to a frequency per million words.
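As a rough illustration of what "frequency per million words" means, the sketch below converts raw counts to that basis. The class name, method name and sample counts are invented for illustration; this is not FoundationStone's own code, and the numbers are not taken from either benchmark list.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: not FoundationStone's code, and the counts are made up.
class PerMillionSketch {

    // Convert raw occurrence counts to occurrences per million words of corpus.
    static Map<String, Double> perMillion(Map<String, Long> counts, long corpusSize) {
        if (corpusSize <= 0) {
            throw new IllegalArgumentException("corpusSize must be positive");
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            result.put(e.getKey(), e.getValue() * 1_000_000.0 / corpusSize);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("wordA", 600_000L); // invented count
        counts.put("wordB", 3_000L);   // invented count
        // Against a 30-million-word corpus these become 20000.0 and 100.0 per million.
        System.out.println(perMillion(counts, 30_000_000L));
    }
}
```

Because both lists are expressed on this same per-million basis, a frequency from one list can be compared directly with a frequency from the other.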

How To Use

Typically, you will have either a modern or a classical wordlist to benchmark. This is straightforward: run the corresponding benchmark list against it.

Sometimes, however, you have a list with both modern and classical words. Because the two benchmarks are normalised to the same basis (i.e. both are expressed as words per million), you can combine them. The recommended way of doing this for a mostly modern list with some classical words is:

  • Run the Modern list with the "overwrite higher frequencies" check box on. This resets the frequency of every word that matches the Modern list.

  • Run the Classical list with the "overwrite higher frequencies" check box on. This resets the frequency of every word that matches the Classical list.

  • Rerun the Modern list with the "overwrite higher frequencies" check box off. The net effect is that important Biblical words that have fallen out of regular usage end up with a higher frequency, while high-frequency modern words keep their Modern frequency.
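The three passes above can be sketched as follows. This is only an illustration under a stated assumption: that with the check box on, every matching word takes the benchmark frequency, and with it off, a word is updated only when the benchmark frequency is higher than the one it already has. The class, the method names and the sample frequencies are invented, and words are matched by exact string equality for simplicity, which is not how the application matches them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the three-pass procedure; names and numbers are invented.
class BenchmarkMergeSketch {

    // Apply one benchmark pass to a wordlist (word -> frequency per million).
    // ASSUMPTION: with overwriteHigher on, every matching word takes the benchmark
    // frequency; with it off, a word is only updated when the benchmark frequency
    // is higher than the one it already has.
    static void applyBenchmark(Map<String, Double> wordlist,
                               Map<String, Double> benchmark,
                               boolean overwriteHigher) {
        for (Map.Entry<String, Double> entry : wordlist.entrySet()) {
            Double benchmarkFreq = benchmark.get(entry.getKey());
            if (benchmarkFreq == null) {
                continue; // no match: leave the word untouched
            }
            if (overwriteHigher || benchmarkFreq > entry.getValue()) {
                entry.setValue(benchmarkFreq);
            }
        }
    }

    // Modern (on), Classical (on), then Modern again (off).
    static void benchmarkMixedList(Map<String, Double> wordlist,
                                   Map<String, Double> modern,
                                   Map<String, Double> classical) {
        applyBenchmark(wordlist, modern, true);
        applyBenchmark(wordlist, classical, true);
        applyBenchmark(wordlist, modern, false);
    }

    public static void main(String[] args) {
        Map<String, Double> modern = new HashMap<>();
        modern.put("everyday", 900.0);
        modern.put("archaic", 2.0);
        Map<String, Double> classical = new HashMap<>();
        classical.put("everyday", 40.0);
        classical.put("archaic", 350.0);

        Map<String, Double> wordlist = new HashMap<>();
        wordlist.put("everyday", 0.0);
        wordlist.put("archaic", 0.0);
        wordlist.put("unlisted", 5.0);

        benchmarkMixedList(wordlist, modern, classical);
        System.out.println(wordlist);
    }
}
```

Under that assumption, a word found in both lists ends up with the higher of its two frequencies: in the sample data "everyday" finishes at 900.0 and "archaic" at 350.0, while "unlisted", which matches neither list, keeps its original 5.0.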

Bugs

On the Windows platform, both benchmarks may be affected by an old Java Hebrew Unicode bug (number 6462930), which should be fixed by now. Select the Utilities -> Check Java Version menu item to see whether you're affected.

The Modern benchmark uses primary-strength Unicode matching, which means that the difference between a פ and a ף should be ignored, and dagesh and vowels are ignored completely. If the bug is present, what actually happens is tertiary matching, where all of these are treated as different.

The Classical benchmark uses secondary-strength Unicode matching, where פ and ף are considered identical and dagesh is ignored, but vowel differences are taken into account. Again, if the bug is present, this defaults to tertiary matching.
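If you want to see how your own Java installation compares such strings, a small probe along these lines may help. It assumes the matching strengths described above correspond to the strength settings of java.text.Collator, and it uses a Hebrew-locale collator; both are assumptions, not confirmed details of FoundationStone. The class and method names are invented. No expected output is given, because the results depend on the Java version and its collation rules.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical probe, not FoundationStone's code: reports how java.text.Collator
// on this JVM compares a few Hebrew strings at each strength.
class CollatorStrengthProbe {

    // True if the two strings compare as equal at the given collator strength.
    static boolean matches(String a, String b, int strength) {
        // ASSUMPTION: a Hebrew-locale collator is the relevant one.
        Collator collator = Collator.getInstance(Locale.forLanguageTag("he"));
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    // Print the comparison result for one pair at all three strengths.
    static void report(String label, String a, String b) {
        System.out.println(label
                + ": primary=" + matches(a, b, Collator.PRIMARY)
                + " secondary=" + matches(a, b, Collator.SECONDARY)
                + " tertiary=" + matches(a, b, Collator.TERTIARY));
    }

    public static void main(String[] args) {
        // U+05E4 is pe, U+05E3 is final pe.
        report("pe vs final pe", "\u05E4", "\u05E3");
        // U+05BC is dagesh.
        report("pe vs pe with dagesh", "\u05E4", "\u05E4\u05BC");
        // U+05B7 is patah, U+05B8 is qamats (two different vowel points).
        report("patah vs qamats", "\u05E4\u05B7", "\u05E4\u05B8");
    }
}
```

A pair reported as "true" at a given strength would be treated as a match by a comparison at that strength. The strings are written as Unicode escapes so that the result does not depend on the source file's encoding.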

If you are affected, the benchmark won't change any frequencies except where a word is an exact match, so it doesn't hurt to run a benchmark regardless.