Given a large text corpus of about 1 GB, I want to extract two-, three-, four- and five-word collocations (patterns) using the log-likelihood ratio (LLR).
More specifically, the requirements are:
(1) Given the corpus, I'd like to get the bigrams, trigrams, 4-grams and 5-grams using LLR
(2) I also want to find the collocations of any word that contains three or four specific letters in a given order. For example, the collocations of words that contain the letters "a", "d" and "f" in that order, whether the letters are adjacent or separated by other letters.
In both cases, the output should be sorted. Keep in mind that the corpus is about 1 GB, so it is really big.
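For requirement (2), the letter condition can be written as a regular expression. The following is only a minimal sketch: the function names and the sample word list are made up for illustration, and it assumes that matching should ignore case.

```python
import re

def letters_in_order_pattern(letters):
    """Build a regex that finds `letters` in order inside a word,
    allowing any number of other characters (including none) between them."""
    # "adf" becomes the pattern a.*d.*f ; re.escape guards special characters.
    return re.compile(".*".join(re.escape(ch) for ch in letters), re.IGNORECASE)

def matching_words(words, letters):
    """Return the words that contain `letters` in the given order."""
    pattern = letters_in_order_pattern(letters)
    return [w for w in words if pattern.search(w)]

# Hypothetical sample words, not taken from the corpus.
sample_words = ["adrift", "handful", "fad", "daft", "ad-free"]
print(matching_words(sample_words, "adf"))  # ['adrift', 'handful', 'ad-free']
```

The words selected this way would then be the ones whose collocations are looked up.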
I prefer working with Python, but I'm a novice, so the code needs to be clear and easy to use and understand.
P.S. Budget limited to $100
Thanks
NLTK (the Python library for this kind of task) has classes that calculate the log-likelihood ratio for bigrams and trigrams, and a general n-gram class that is missing only one method: the contingency matrix calculation. I could write new classes that handle 4-grams and 5-grams by implementing that calculation.
For a data set this large (1 GB might cause memory issues), I would make two passes over the corpus. The first pass would hash the words and count the hashes, so that every n-gram that is certainly below a chosen frequency threshold can be excluded before the exact counting in the second pass.
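The two-pass idea could look roughly like the sketch below. It is one possible reading of the approach, not a finished implementation: it assumes that whole n-grams are hashed (rather than single words), that tokens are split on whitespace, and that n-grams do not cross line boundaries. The names frequent_ngrams, bucket_of and read_lines are invented for this example.

```python
from collections import Counter
from zlib import crc32

def ngrams(tokens, n):
    """Yield successive n-word tuples from a list of tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def bucket_of(gram, num_buckets):
    # crc32 gives the same value on every run, unlike hash() on strings.
    return crc32(" ".join(gram).encode("utf-8")) % num_buckets

def frequent_ngrams(read_lines, n, min_count, num_buckets=1 << 20):
    """Count n-grams in two passes; `read_lines` is a function that
    returns a fresh iterator over the lines of the corpus each time."""
    # Pass 1: count hash buckets only, so memory use is fixed.
    buckets = [0] * num_buckets
    for line in read_lines():
        for gram in ngrams(line.lower().split(), n):
            buckets[bucket_of(gram, num_buckets)] += 1

    # Pass 2: count exactly, but only n-grams whose bucket reached the
    # threshold. Collisions can only inflate a bucket, so an n-gram in a
    # bucket below min_count is certainly below min_count itself.
    counts = Counter()
    for line in read_lines():
        for gram in ngrams(line.lower().split(), n):
            if buckets[bucket_of(gram, num_buckets)] >= min_count:
                counts[gram] += 1

    # Drop n-grams that only passed the filter because of collisions.
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Tiny in-memory stand-in for the corpus (hypothetical data).
sample = ["the quick brown fox", "the quick red fox", "a quick brown dog"]
result = frequent_ngrams(lambda: iter(sample), n=2, min_count=2, num_buckets=64)
print(sorted(result.items()))
```

For a real file, read_lines could be something like lambda: open(path, encoding="utf-8"), where path is whatever the corpus file is called. The surviving n-grams and their counts would then be passed on to the LLR scoring step.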