For most of my text analysis I use a program called Range, which is freely available from Prof. Nation's web site ([url removed, login to view]). Unfortunately, the program is limited in its ability to count word frequencies. Because it converts all words to uppercase, it cannot distinguish proper nouns from common words: for example, it cannot tell the difference between Potter and Black as characters in a book and potter and black, the craftsperson and the colour. It is also unable to count polywords and their inflected forms, such as ice cream and ice creams, or 'chunks' such as Thank you very much. While it can count ice-cream when its hyphen option is selected, that option also returns many words coined on the spot, since English lets you create-any-adjective-you-want-with-hyphenation.

I would like to sponsor the creation of a more sensitive program that can do these things. It should read in a number of text files or web pages and a number of word lists, and output (a) for each word on a word list, its location (the running-word position of its first occurrence) and the number of times the word or polyword appears in the text, and (b) for each word not on a word list, its location and frequency. If there are multiple text files as input, the output should report the location and frequency in each file as well. One possible design is a single summary file similar to the initial report in Range, plus separate comma-separated values (CSV) files for location and frequency.
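To illustrate the case-sensitivity requirement described above, here is a minimal sketch (in Python; the function name count_words and the sample sentence are my own, not part of the spec) of a counter that, unlike Range, keeps Black the name distinct from black the colour:

```python
import re
from collections import Counter

def count_words(text):
    """Count word frequencies WITHOUT folding case, so that
    capitalized and lowercase forms are tallied separately."""
    # Keep internal apostrophes so tokens like "didn't" stay whole.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    return Counter(words)

freqs = count_words("Black was on a mission. The black cat slept. Black smiled.")
# "Black" and "black" are separate entries in the Counter.
```

This is only the starting point; deciding whether a capitalized token is actually a proper noun is addressed by the algorithm later in this brief.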
Input:

1. Text files or web pages to be analyzed. Mostly these will be text files like those from Project Gutenberg, or saved Web pages. Text files are generally formatted in one of two ways, with one or two line breaks between paragraphs. If the text has been saved with hard-wrapped lines (line breaks at the right margin), the paragraphs are usually separated by two line breaks. If the text has been saved without wrapped lines, the paragraphs are usually separated by a single line break.

2. Lists of words and polywords, and their derived forms, to count. These lists can contain hundreds of thousands of words.

Output: a text file giving the first occurrence of each word and its frequency in the text.

For example, the program could read a text file and the following word list:

TITLE: Lexemes
program, programs, programming, programmed
black
TITLE: Proper Nouns
Black
TITLE: Polywords
web page, web pages
TITLE: Functors
a, an
the

and output:

Lexemes Frequency
Word,location,frequency
program,12,14
black,76,4

Proper Nouns Frequency
Word,location,frequency
Black,67,7

Polywords Frequency
Word,location,frequency
web page,172,4

Functors Frequency
Word,location,frequency
a,11,34

The program must mark the text so that words are not counted twice. It must also be able to distinguish between proper nouns (such as Black, the name) and common words (such as black, the colour).

Functions:
1. Count the number of polywords.
2. Count the number of proper nouns.
3. Count the number of lemmas.
4. Find the first occurrence in terms of running words.
5. Mark up the text with HTML codes so that the various word lists can be identified.

Phase 1: Find and count polywords and their inflected or derived forms in a text file.

Phase 2: Find and count proper nouns and their inflected or derived forms in a text file.
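The word-list format above (TITLE: lines introducing sections, with each following line listing one lemma and its comma-separated forms) could be parsed as in this sketch. The function name parse_word_lists and the dict-of-lists representation are assumptions for illustration, not requirements:

```python
def parse_word_lists(text):
    """Parse a word-list file into {section_title: [forms, ...]}.
    Each line after a "TITLE:" header is one lemma entry: a
    comma-separated list of its base and inflected/derived forms."""
    lists = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("TITLE:"):
            current = line[len("TITLE:"):].strip()
            lists[current] = []
        elif current is not None:
            lists[current].append([form.strip() for form in line.split(",")])
    return lists

sample = """TITLE: Lexemes
program, programs, programming, programmed
black
TITLE: Proper Nouns
Black
TITLE: Polywords
web page, web pages
TITLE: Functors
a, an
the"""
lists = parse_word_lists(sample)
```

Since the lists may contain hundreds of thousands of words, the counting program would likely convert these entries into a hash table keyed by form for fast lookup.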
The program must take into account (a) the rules of capitalization, and (b) the occurrence of the word on a proper-noun word list in determining whether a word is to be counted as a proper noun or a common word.

Phase 3: Find and count (a) polywords and their inflected or derived forms, (b) proper nouns and their inflected or derived forms, and (c) common words and their inflected or derived forms in either a text file or an HTML file.

Phase 4: Find and count (a) polywords and their inflected or derived forms, (b) proper nouns and their inflected or derived forms, and (c) common words and their inflected or derived forms in either a text file or an HTML file saved to disk.

Phase 5: Find and count (a) polywords and their inflected or derived forms, (b) proper nouns and their inflected or derived forms, and (c) common words and their inflected or derived forms on web pages from a list of URLs.

Algorithm

If a word is in the polyword file, count all occurrences as polywords. Do not count the word again as a single lemma. If the word is on the proper-noun word list, then count all capitalized occurrences of the word as a proper noun.

ISSUES

e.g. Black was the night. Black was a man on a mission. If the word Black is on the proper-noun list, then the first sentence will be counted incorrectly, because sentence-initial capitalization is indistinguishable from a proper noun. If there is a way to avoid this, I would be happy to hear suggestions.

If a word on a word list contains an apostrophe (didn't), then count it as part of the lemma. Otherwise, count it as two separate words (Harry's = Harry + 's). 's is special: it can be possessive (Harry's hat), reduced third person singular be (Harry's going home), or reduced third person singular have (Harry's got a new hat).
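The algorithm above (polywords claimed first so their tokens are never counted twice, then capitalized words checked against the proper-noun list, remainder counted as common lemmas) can be sketched as follows. All names here (tokenize, count, the data shapes) are my assumptions, and the 's splitting rule is deliberately left out to keep the sketch short:

```python
import re

def tokenize(text):
    # Keep internal apostrophes so "didn't" stays one token.
    # Splitting "Harry's" into Harry + 's is NOT implemented here.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)

def count(text, polywords, proper_nouns):
    """Count polywords first, marking their tokens as used so they are
    not counted again; then classify remaining tokens as proper nouns
    (capitalized and on the list) or common words. Returns
    {form: (first_location, frequency)} with 1-based running-word
    locations. polywords maps a canonical form to its variant strings;
    proper_nouns is a set of listed names."""
    tokens = tokenize(text)
    used = [False] * len(tokens)
    results = {}

    def record(key, loc):
        first, freq = results.get(key, (loc, 0))
        results[key] = (first, freq + 1)

    # Pass 1: polywords, longest variants first so "web pages"
    # is not shadowed by a shorter variant at the same position.
    for canon, variants in polywords.items():
        for variant in sorted(variants, key=lambda v: -len(v.split())):
            vtoks = variant.lower().split()
            n = len(vtoks)
            for i in range(len(tokens) - n + 1):
                if any(used[i:i + n]):
                    continue  # already claimed; never count twice
                if [t.lower() for t in tokens[i:i + n]] == vtoks:
                    for j in range(i, i + n):
                        used[j] = True
                    record(canon, i + 1)

    # Pass 2: remaining single words.
    for i, tok in enumerate(tokens):
        if used[i]:
            continue
        if tok in proper_nouns and tok[0].isupper():
            record(tok, i + 1)       # capitalized + listed => proper noun
        else:
            record(tok.lower(), i + 1)
    return results

text = "Black was the night. I read a web page and two web pages."
results = count(text, {"web page": ["web page", "web pages"]}, {"Black"})
```

Note that this sketch reproduces the issue described above: sentence-initial Black in "Black was the night." is counted as a proper noun because it is capitalized and on the list.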