
Collocations Extraction using Python

$30-250 USD

Closed
Posted over 10 years ago


Paid on delivery
Given a large text corpus (about 1 GB), I want to extract two-word, three-word, four-word, and five-word collocations or patterns using the log-likelihood ratio (LLR). More specifically, the requirements are:

(1) Given the corpus, extract the bigrams, trigrams, 4-grams, and 5-grams ranked by LLR.

(2) Find the collocations for any word that contains three or four specific letters in a given order, whether the letters follow one another directly or are separated by other letters. For example, the collocations for words that contain the letters "a - d - f" in that order.

In both cases, I wish to have the output sorted. And of course, as I said earlier, the corpus is 1 GB, so it is really big. I prefer working with Python, but I am a novice, so the code needs to be clear and easy to use and understand.

P.S. Budget limited to $100. Thanks
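A minimal pure-Python sketch of both requirements, assuming whitespace-tokenized input. The function names are mine; the bigram scorer implements Dunning's log-likelihood ratio from a 2x2 contingency table (NLTK's collocations module offers a ready-made alternative), and the letter matcher uses a simple subsequence regex:

```python
import math
import re
from collections import Counter

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio from a 2x2 contingency table:
    k11 = count(a b), k12 = count(a, not b), k21 = count(not a, b),
    k22 = everything else."""
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    def term(k, row, col):
        # 0 * log(0) is taken as 0 by convention
        return k * math.log(k * n / (row * col)) if k else 0.0
    return 2.0 * (term(k11, row1, col1) + term(k12, row1, col2) +
                  term(k21, row2, col1) + term(k22, row2, col2))

def bigram_llr(tokens):
    """Requirement (1), shown for bigrams: score adjacent word pairs
    by LLR, sorted highest first. The same contingency-table idea
    extends to trigrams, 4-grams, and 5-grams."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    first = Counter(a for a, _ in bigrams.elements())   # word as left member
    second = Counter(b for _, b in bigrams.elements())  # word as right member
    n = sum(bigrams.values())
    scored = []
    for (a, b), k11 in bigrams.items():
        k12 = first[a] - k11
        k21 = second[b] - k11
        k22 = n - k11 - k12 - k21
        scored.append(((a, b), llr(k11, k12, k21, k22)))
    return sorted(scored, key=lambda pair: -pair[1])

def words_with_letters(tokens, letters):
    """Requirement (2): words containing the given letters in that
    order, whether adjacent or separated by other letters, sorted."""
    pattern = re.compile(".*".join(map(re.escape, letters)))
    return sorted({w for w in tokens if pattern.search(w)})
```

For example, `words_with_letters(["adrift", "leaf"], "adf")` keeps "adrift" (a, then d, then f, with other letters in between) but not "leaf" (no d). On a real 1 GB corpus you would stream tokens from disk rather than hold them in one list.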
Project ID: 5323958

About the project

5 proposals
Remote project
Active 10 yrs ago

5 freelancers are bidding on average $134 USD for this job
Hello, I can write a Python script for this; I can handle a 1 GB text file. Thank you.
$100 USD in 3 days
4.8 (63 reviews)
5.9
I would like to discuss this further and deliver the project.
$147 USD in 3 days
4.7 (28 reviews)
5.4
I have a great deal of Python experience, will complete the project in a timely manner, and will do it correctly. Thank you for considering my bid.
$100 USD in 3 days
0.0 (0 reviews)
0.0
NLTK (the Python library for this kind of thing) has a class that calculates the log-likelihood ratio for bigrams and trigrams, and it has a general n-gram class that is missing only one method: the contingency matrix. I could write new classes that handle 4-grams and 5-grams (by writing the contingency-matrix calculations). For very large data sets (1 GB might cause memory issues) I would do two passes: the first pass would hash the words, so you could exclude all the n-grams that are certainly below a given threshold.
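The two-pass idea in this bid can be sketched as follows (stdlib only; the function name and threshold are mine, and the bidder's hashing step is replaced here by an explicit frequent-word set). The pruning is safe because an n-gram can never occur more often than its rarest word, so grams containing rare words cannot clear the threshold:

```python
from collections import Counter

def count_ngrams_two_pass(path, n=4, min_word_freq=5):
    """Two-pass n-gram counting to keep memory bounded on a large corpus.
    Pass 1: count individual word frequencies (small table).
    Pass 2: count only n-grams whose every word clears the threshold;
    rare words cannot form frequent n-grams, so no n-gram at or above
    min_word_freq is lost. N-grams are counted within lines only."""
    word_counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            word_counts.update(line.split())
    frequent = {w for w, c in word_counts.items() if c >= min_word_freq}

    ngram_counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            toks = line.split()
            for i in range(len(toks) - n + 1):
                gram = tuple(toks[i:i + n])
                if all(w in frequent for w in gram):
                    ngram_counts[gram] += 1
    return ngram_counts
```

The surviving counts can then be fed into the contingency-matrix/LLR scoring the bid describes; only the (much smaller) filtered table ever has to fit in memory at once.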
$66 USD in 3 days
0.0 (0 reviews)
0.0

About the client

Champaign, United States
4.8
1
Payment method verified
Member since Aug 30, 2013

Client Verification
