Collocations Extraction using Python

$30-250 USD

Closed

Posted

over 10 years ago

Paid on delivery

Given a large text corpus of about 1 GB, I want to extract two-word, three-word, four-word and five-word collocations (patterns) using the log-likelihood ratio (LLR). More specifically, the requirements are:

(1) Given the corpus, I'd like to get the bigrams, trigrams, 4-grams and 5-grams using LLR.

(2) I also want to find the collocations for any word that contains three or four specific letters, for example the collocations for words that contain the letters "a", "d", "f" in that order, whether the letters are adjacent or separated by other letters.

In both cases, the output should be sorted. As mentioned above, the corpus is about 1 GB, so it is really big. I prefer working with Python, but I'm a novice, so the code needs to be clear and easy to use and understand.

P.S. The budget is limited to $100. Thanks.
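To make the two requirements concrete, here is a minimal sketch for the bigram case only. It assumes whitespace tokenization, case-sensitive matching, and a token list small enough to hold in memory, none of which the posting specifies. The function names (`llr`, `score_bigrams`, `has_letters_in_order`, `collocations_for_letters`) are hypothetical, and the score is the standard log-likelihood ratio over a 2x2 contingency table; 3-grams to 5-grams would need a larger table and are not covered.

```python
import math
import re
from collections import Counter


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of counts."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for observed, r, c in cells:
        if observed > 0:  # an empty cell contributes nothing to the sum
            expected = rows[r] * cols[c] / n
            total += observed * math.log(observed / expected)
    return 2.0 * total


def score_bigrams(tokens, min_count=1):
    """Return [((w1, w2), score), ...] sorted by descending LLR."""
    tokens = list(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    first = Counter()   # how often each word opens a bigram
    second = Counter()  # how often each word closes a bigram
    for (w1, w2), count in pairs.items():
        first[w1] += count
        second[w2] += count
    n = sum(pairs.values())
    scored = []
    for (w1, w2), count in pairs.items():
        if count < min_count:
            continue
        k12 = first[w1] - count      # w1 followed by some other word
        k21 = second[w2] - count     # some other word followed by w2
        k22 = n - count - k12 - k21  # neither w1 first nor w2 second
        scored.append(((w1, w2), llr(count, k12, k21, k22)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


def has_letters_in_order(word, letters):
    """True if the letters occur in the word in order, adjacent or not."""
    pattern = ".*".join(re.escape(ch) for ch in letters)
    return re.search(pattern, word) is not None


def collocations_for_letters(scored, letters):
    """Keep scored bigrams in which at least one word matches the letters."""
    return [
        (pair, score)
        for pair, score in scored
        if any(has_letters_in_order(word, letters) for word in pair)
    ]
```

For example, `collocations_for_letters(score_bigrams(text.split()), "adf")` would keep bigrams containing a word such as "handful", already sorted by score. For a 1 GB corpus the counting step would have to read the file incrementally rather than from one in-memory list.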

Data Mining

Python

Statistics

Project ID: 5323958

About the project

5 proposals

Remote project

Active 10 yrs ago

5 freelancers are bidding on average $134 USD for this job

@anuyadav1

Hello, I can write a Python script for this. I can handle a 1 GB text. Thank you.

$100 USD in 3 days

4.8

(63 reviews)

5.9

@srinichal

I would like to discuss this further and deliver the project.

$147 USD in 3 days

4.7

(28 reviews)

5.4

@njwiggin

I have a great deal of python experience, will complete the project in a timely manner, and do it correctly. Thank you for considering my bid.

$100 USD in 3 days

0.0

(0 reviews)

0.0

@peterjrow

NLTK (the Python library for this kind of thing) has classes that calculate the log-likelihood ratio for bigrams and trigrams, and a general class that is missing only one method: the contingency matrix calculation. I could write new classes that handle 4-grams and 5-grams by writing the contingency matrix calculations. For very large data sets (1 GB might cause memory issues) I would do two passes. The first pass would hash the words, so that all n-grams that are certain to fall below a given threshold could be excluded.

$66 USD in 3 days
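The two-pass idea in @peterjrow's bid could be sketched roughly as follows. This is one possible reading, not the bidder's code: it assumes whitespace tokenization, n-grams that do not cross line boundaries, and a fixed-size table of hashed word counts. The names `bucket`, `first_pass`, `second_pass`, `num_buckets` and `min_count` are hypothetical. Hash collisions can only inflate a bucket's count, so a word whose bucket is below the threshold is certainly rare, and any n-gram containing it can be skipped safely.

```python
import zlib
from collections import Counter


def bucket(word, num_buckets):
    """Map a word to a slot in the fixed-size count table."""
    return zlib.crc32(word.encode("utf-8")) % num_buckets


def first_pass(lines, num_buckets=1 << 20):
    """Pass 1: count words per hash bucket using bounded memory."""
    counts = [0] * num_buckets
    for line in lines:
        for word in line.split():
            counts[bucket(word, num_buckets)] += 1
    return counts


def second_pass(lines, counts, n=2, min_count=5):
    """Pass 2: count only n-grams whose words all pass the threshold."""
    num_buckets = len(counts)
    ngrams = Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            # An n-gram cannot occur more often than its rarest word.
            if all(counts[bucket(w, num_buckets)] >= min_count for w in gram):
                ngrams[gram] += 1
    return ngrams
```

With a file on disk, each pass would open the file again and iterate over it line by line, so the corpus itself is never held in memory; only the bucket table and the surviving n-gram counts are.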