Lexical density refers to the ratio of lexical (content) words to functional words in a given text or collection of texts. It is a measure used in computational linguistics and linguistic analysis. It is linked to vocabulary, the known words of any individual, and can be used to compare the spoken and written lexicons of any one person. A lexicon differs from a total vocabulary because it does not include functional words such as pronouns and particles.
The density of a speech or text is calculated by comparing the number of lexical words with the number of functional words. For short sentences and small texts, this can be done by mental arithmetic or simple counting. Larger comparisons, say of the works of Charles Dickens or William Shakespeare, are done by feeding the text into a computer program that sifts it into functional and lexical words.
Balanced lexical density is approximately 50 percent, meaning that half of each sentence is made up of lexical words and half of functional words. A low-density text has proportionally fewer lexical words than functional words, while a high-density text has more. Academic texts and jargon-filled government documents tend to produce the highest densities.
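As a rough illustration of this calculation, the short Python sketch below separates a sentence into functional and lexical words and reports the percentage of lexical words. The small function-word list is an assumption made for the example only; a real analysis would rely on a much fuller list or a part-of-speech tagger.

```python
# A minimal sketch of the calculation described above. The function-word
# set below is illustrative only; real analyses use fuller lists or taggers.
import re

FUNCTION_WORDS = {
    "the", "a", "an", "and", "or", "but", "of", "to", "in", "on", "at",
    "it", "he", "she", "they", "we", "you", "i", "is", "are", "was", "were",
    "that", "this", "with", "for", "by", "not", "be", "have", "has", "had",
}

def lexical_density(text: str) -> float:
    """Return the percentage of words in the text that are lexical (content) words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    lexical = [w for w in words if w not in FUNCTION_WORDS]
    return 100 * len(lexical) / len(words)

sample = "The cat sat on the mat and watched the birds in the garden."
print(f"{lexical_density(sample):.0f}% lexical")  # roughly half the words here are content words
```

On this crude split, the sample sentence comes out close to the 50 percent balance point described above; a denser, more technical sentence would score higher.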
One flaw in the calculation of lexical density is that it does not take into account the different forms and cases of constituent words. The statistical analysis is concerned only with the ratio of word types; it does not produce a study of one individual’s lexical knowledge. If it did, the analysis would need to differentiate between forms such as “give” and “gave.” Theoretically, lexical density can also be applied to texts in order to study the frequency of certain lexical units.
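A brief sketch of this limitation follows, using a hypothetical hand-written form-to-base mapping (a real analysis would use a proper lemmatizer): a plain type count keeps “give” and “gave” apart, while collapsing inflected forms merges them into one entry.

```python
# A plain type count treats each inflected form as a separate type; mapping
# forms to a base form (illustrative mapping only) merges related forms.
from collections import Counter

LEMMAS = {"gave": "give", "given": "give", "gives": "give"}  # hypothetical, for illustration

words = ["give", "gave", "given", "take", "took"]

raw_types = Counter(words)                              # each form counted separately
lemma_types = Counter(LEMMAS.get(w, w) for w in words)  # related forms merged

print(raw_types)    # Counter({'give': 1, 'gave': 1, 'given': 1, 'take': 1, 'took': 1})
print(lemma_types)  # Counter({'give': 3, 'take': 1, 'took': 1})
```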
A person’s written lexicon can be aided by dictionaries and thesauruses, which provide alternate words and clarify meanings. When speaking, a person must rely on his or her mental vocabulary alone. This means that lexical density can be used as a tool to compare spoken and written lexicons; the lexical density of spoken language tends to be lower than that of written text.
Computational linguistics is an area of linguistic analysis that relies on statistical modeling. It was born out of the Cold War and America’s desire to use computers to translate texts from Russian into English. Doing so required the use of mathematics, statistics, artificial intelligence and computer programming. The largest problem for programmers was getting the computer to understand complex grammar and language pragmatics. This gave rise to the Chinese Room argument, which holds that computers can perform literal translations of words but cannot, ultimately, understand languages.