The widespread use of computers and the growing availability of texts in
electronic form allow teachers and researchers to easily analyse texts. By
calculating the most frequently occurring words and teaching/learning those,
students are able to understand a language at an accelerated pace.
Other uses for this capability include searching job lists for the most
popular skill to acquire, looking for the most commonly asked questions
in a digest of email, etc.
Plot of fraction of words recognised (vertical axis) versus number of most
frequent words known in the Torah -
To read the graph, consider the following example -
Say you know a vocabulary of the two hundred most frequent words. Find 200
on the horizontal axis and read off the corresponding value on the vertical
axis. 200 words corresponds to recognising at least 0.3, or 30%, of the text.
To summarise, if you know -
| Words known | % of text recognised |
|---|---|
| 200 | 30 |
| 500 | 45 |
| 1000 | 55 |
| 2750 | 70 |
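The figures in a table like this can be computed from a plain word count. The following is a minimal sketch, assuming the text has already been split into a list of words; the function name `coverage` and the English sample sentence are illustrative only and are not taken from the original analysis.

```python
from collections import Counter

def coverage(words, n):
    """Fraction of all word occurrences accounted for by the n most frequent words."""
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # most_common(n) returns the n (word, count) pairs with the highest counts
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total

# Illustrative sample only; a real run would use the words of the whole text.
sample = "the cat and the dog and the bird".split()
for n in (1, 2, 5):
    print(n, coverage(sample, n))
```

Calling `coverage` for each vocabulary size (200, 500, 1000, 2750) on the full word list would give one row of the table per call.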
Another way of looking at this is:
56% of the text is known using a vocabulary of 1075 words, that is, all
words occurring 10 or more times -
Looking at the list shows that the large tail of words (44%) is made up mostly
of prefix/suffix versions of the known words (that is, the..., and...,
from..., etc.), and so perhaps 50% of those words will be known. A
proper analysis is more involved than a simple word count alone, and it seems
to be required.
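The "all words occurring 10 or more times" view can be sketched in the same way. This is a hedged illustration rather than the method actually used here: the function name `threshold_coverage` and the sample sentence are assumptions made for the example.

```python
from collections import Counter

def threshold_coverage(words, min_count):
    """Return (vocabulary size, fraction of text covered) for all words
    that occur at least min_count times."""
    counts = Counter(words)
    total = sum(counts.values())
    kept = {word: count for word, count in counts.items() if count >= min_count}
    fraction = sum(kept.values()) / total if total else 0.0
    return len(kept), fraction

# Illustrative sample only; the real input would be the full word list.
sample = "the cat and the dog and the bird".split()
size, fraction = threshold_coverage(sample, 2)
print(size, fraction)
```

With the full text and `min_count=10`, the two returned values correspond to the vocabulary size and the percentage quoted above.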
The Online Bible includes a BHM version, which places a colon (:) between
the prefixes and the root words. Breaking the words on this colon and at
a hyphen, then running the word count again, gives the following more
heartening results. (Remember also that if the suffixes are removed as
well, the word list shrinks again.)
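The splitting step might look like the sketch below. It assumes the markers are the plain ASCII characters `:` and `-`, and the tokens shown are made-up placeholders, not real entries from the Online Bible text; the function name `split_prefixes` is likewise hypothetical.

```python
import re
from collections import Counter

def split_prefixes(tokens):
    """Break each token at ':' (prefix marker) and '-' (hyphen),
    dropping any empty pieces."""
    parts = []
    for token in tokens:
        parts.extend(piece for piece in re.split(r"[:-]", token) if piece)
    return parts

# Made-up placeholder tokens: 'h:arc' stands for a prefix 'h' joined to a root 'arc'.
tokens = ["b:rashit", "h:arc", "w:h:arc", "al-h:arc"]
counts = Counter(split_prefixes(tokens))
print(counts.most_common())
```

Before splitting, the four tokens are four different "words"; after splitting, the root `arc` is counted three times, which is why the word list shrinks and the coverage figures improve.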
Plot of fraction of words recognised (vertical axis) versus number of most
frequent words known in the Torah (prefix modified version) -
84% of the Torah (prefix modified version) is known using a vocabulary of
1061 words, that is, all words occurring 10 or more times -
| Words known | % of text recognised |
|---|---|
| 200 | 60 |
| 500 | 75 |
| 1000 | 85 |
| 2750 | 95 |
Plot of fraction of words recognised (vertical axis) versus number of most
frequent words known in the Tanach (entire Hebrew Bible, prefix modified
version) -
| Words known | % of text recognised |
|---|---|
| 200 | 55 |
| 500 | 65 |
| 1000 | 75 |
| 2750 | 88 |
83% of the Tanach (entire Hebrew Bible, prefix modified version) is known
using a vocabulary of 2004 words, that is, all words occurring 18 or more
times -
Let's now compare this to the same text translated into English.
Plot of fraction of words recognised (vertical axis) versus number of most
frequent words known in the English Torah -
This is somewhat of a surprise, as one would expect English to have many
more words than the Hebrew. English has its origins in three languages -
Latin (the classical language shared by modern European languages), French
(from the Norman Conquest) and Anglo-Saxon. For example, the three words
sovereign, monarch and king show this heritage, which gives English novelists
"just the right word". The Revised Standard Edition is not a simplified
English translation.
However, this is explained by each English word always appearing in the same
form, whereas a Hebrew word appears with various prefixes and suffixes attached.
Students of English, take heart!
Mail me for more information or if you would like to share some of your results.