
Crawl a provided set of websites for email addresses

$250-750 USD

Completed
Posted about 11 years ago

Paid on delivery
You will receive a large CSV file (approx. 1.2 million rows) of names of professors at American universities. For each professor, the URL of the university is listed as well. Your job will be to write software that can crawl each website, look for pages on which the professor's name appears, and extract email addresses from those pages. The goal is to obtain one or more email addresses for each professor.

Since it's impossible to determine from the name and the URL alone which email address corresponds to the professor, one potential approach is to retrieve multiple pages on which the name appears together with at least one email address (found using a regex). Then rank the email addresses by how frequently they appear. The address that appears most often is likely to be the correct one.

Example:

page 1: John Smith, [login to view URL](at)[login to view URL]
page 2: John Smith, [login to view URL](at)[login to view URL]
page 3: John Smith, [login to view URL](at)[login to view URL]
page 4: John Smith, [login to view URL](at)[login to view URL]

From this example it is pretty clear which address is likely to be the correct one.

The output of your software, provided in CSV or another database-readable format, should contain the professor ID (from the input file) and one or more email addresses, each with a rank. Each row should also contain the URL of the page where the address was found.
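The frequency-ranking idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the client's or any bidder's implementation: the function name `rank_emails`, the `(url, text)` input shape, the simple substring match on the name, and the handling of the "(at)" obfuscation are all hypothetical choices, and the regex is a deliberately loose approximation of an email address.

```python
import re
from collections import Counter

# Loose email pattern (an assumption; real-world addresses vary more than this).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def rank_emails(pages, name):
    """Rank addresses by the number of pages that mention `name` and the address.

    `pages` is an iterable of (url, page_text) pairs that were already fetched.
    Returns a list of (rank, email, page_count, first_url) tuples, best first.
    """
    counts = Counter()
    first_url = {}
    for url, text in pages:
        # Assumption: a case-insensitive substring match is enough to decide
        # that the page is about this professor.
        if name.lower() not in text.lower():
            continue
        # Assumption: undo the simple "(at)" obfuscation before matching.
        normalized = text.replace("(at)", "@")
        # Count each address at most once per page.
        for email in {e.lower() for e in EMAIL_RE.findall(normalized)}:
            counts[email] += 1
            first_url.setdefault(email, url)
    # Most frequent first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, email, n, first_url[email])
        for rank, (email, n) in enumerate(ranked, start=1)
    ]
```

Each returned tuple carries a rank, the address, and a page URL, so it maps onto one output row once the professor ID from the input file is added. Names in the input file appear in "Last, First" form, so a real matcher would probably need to try several name orderings.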
Here are a few sample rows from the input file:

ID  Name                Department        InstitutionID  InstitutionName     State  Location          URL
1   Obaid, Evelyn       Computer Science  881            Obaid, Evelyn       CA     San Jose, CA      [login to view URL]
2   Khuri, Sami         Computer Science  881            Khuri, Sami         CA     San Jose, CA      [login to view URL]
3   Beeson, Michael     Computer Science  881            Beeson, Michael     CA     San Jose, CA      [login to view URL]
15  Kubelka, Richard    Mathematics       881            Kubelka, Richard    CA     San Jose, CA      [login to view URL]
18  Lin, Ty             Computer Science  881            Lin, Ty             CA     San Jose, CA      [login to view URL]
29  Key, Scott          Philosophy        145            Key, Scott          CA     Riverside, CA     [login to view URL]
45  Lash, Jamie         Foundations       1230           Lash, Jamie         TX     Dallas, TX        [login to view URL]
47  Swain, John         Physics           696            Swain, John         MA     Boston, MA        [login to view URL]
48  Signorielli, Nancy  Communication     1094           Signorielli, Nancy  DE     Newark, DE        [login to view URL]
57  Frederick, Joan     English           457            Frederick, Joan     VA     Harrisonburg, VA  [login to view URL]

To save you time, one possibility is to query Google using their API for pages that contain the name of each professor and are on the domain provided. Example (this is from the second row above):

Query: "Khuri, Sami site:[login to view URL]"
[login to view URL]

As you can see, the first result in this case is actually a very good page to collect the email from: [login to view URL]

Generally speaking, the first 10-20 results are very likely to contain the correct address.

Once again, the deliverable of this project is a text (CSV or TSV) file containing one or more email addresses for each professor, ranked by probability of being correct. The project must be delivered in at most 1 month.
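As a rough illustration of the query construction described above, the sketch below turns input rows into "name site:domain" strings. It assumes the file is comma-delimited with the header names from the sample rows; the helper names `site_query` and `queries_from_csv` and the `example.edu` URL are hypothetical, and actually sending the query to a search API is left out because the posting does not specify one.

```python
import csv
import io
from urllib.parse import urlsplit


def site_query(name, url):
    """Build a 'name site:host' search string from a professor name and URL."""
    # urlsplit only fills in the host when the string has a '//' prefix,
    # so add one for bare values such as 'www.example.edu'.
    parts = urlsplit(url if "//" in url else "//" + url)
    host = parts.hostname or ""
    return f"{name} site:{host}"


def queries_from_csv(fileobj):
    """Yield (professor_id, query) pairs for each row of the input file."""
    # Assumption: comma-delimited input with an ID, Name, and URL column.
    for row in csv.DictReader(fileobj):
        yield row["ID"], site_query(row["Name"], row["URL"])


# Hypothetical one-row input; the real URLs are not shown in the posting.
SAMPLE = (
    "ID,Name,Department,InstitutionID,InstitutionName,State,Location,URL\n"
    '2,"Khuri, Sami",Computer Science,881,"Khuri, Sami",CA,"San Jose, CA",'
    "http://www.example.edu/cs\n"
)

if __name__ == "__main__":
    for professor_id, query in queries_from_csv(io.StringIO(SAMPLE)):
        print(professor_id, query)
```

Keeping the professor ID next to each query makes it straightforward to join the search results back to the input rows when the ranked output file is written.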
Project ID: 4330935

About the project

27 proposals
Remote project
Active 11 yrs ago

Awarded to:
I have extensive experience with scrapers and parsers, so this project won't be a problem for me. Please see PM for a question and more info.
$250 USD in 7 days
0.0 (0 reviews)
0.0
27 freelancers are bidding on average $403 USD for this job
I can deliver the project
$320 USD in 6 days
4.9 (76 reviews)
6.8
I am confident I can handle your project. Please check your inbox for details. Thank you.
$250 USD in 10 days
4.7 (120 reviews)
6.2
I can do it
$400 USD in 7 days
5.0 (19 reviews)
6.0
I can help with your project. Please check PMB and our ratings/reviews to get an idea of our experience. Please let me know if you have any queries.
$250 USD in 5 days
4.2 (23 reviews)
6.0
Hi, .NET/C#/ASP expert here. Please check PM for details. Thanks
$700 USD in 15 days
4.6 (21 reviews)
5.8
Hi, our team specializes in web crawlers. Please see PM for details.
$380 USD in 10 days
5.0 (10 reviews)
5.6
Details in PMB.
$750 USD in 14 days
5.0 (8 reviews)
5.1
I have worked on many similar projects and have extensive experience in data mining. I can finish this task in a short time, with the best quality.
$750 USD in 30 days
5.0 (2 reviews)
4.5
I've done many similar projects and already have a module to start with. It will crawl every university website from the CSV, looking for the name and an email pattern, and will then examine the left side of each email address, since in most cases the person's name appears in part of that string. Most universities, if not all, build a professor's email address from their name. We can make a small AI engine that checks all the patterns of the name and compares them to the left side of the addresses found on the same page. Here regex is of paramount importance. The Google API will be a way to confirm that the result is correct. Anyway, I'm confident I can complete this project to your fullest satisfaction. Hope to work for you soon.
$400 USD in 7 days
5.0 (7 reviews)
4.5
10+ years of hands-on programming experience. I can manage this work in C# and Java, and may provide previous sample code if required. Thanks/Denial
$500 USD in 20 days
4.8 (4 reviews)
3.9
I've read your project specs fully and carefully. They are very well written. I can definitely code this scraper for you; it's my specialty ;). I will send you a message with my proposed approach. Also, my bid is very negotiable. I wouldn't charge over $200 for this. Regards
$250 USD in 10 days
5.0 (8 reviews)
3.5
Hi, good day! Having read the project description, I am willing to work on this. I have extensive experience in web crawling in any language. Thanks
$250 USD in 7 days
5.0 (7 reviews)
3.4
Please check the PMB for detail.
$250 USD in 7 days
5.0 (4 reviews)
3.3
I will run the script on the Amazon cloud to parallelize it and extract the results quickly.
$500 USD in 10 days
5.0 (1 review)
2.8
Hello, please refer to your INBOX. Thank You. Best Regards.
$250 USD in 7 days
5.0 (3 reviews)
2.2
Hi, Please check Inbox for details.
$250 USD in 30 days
5.0 (1 review)
1.0
This sounds pretty simple. I have experience writing crawlers in different languages. Is this a throw-away app, or is it going to be maintained?
$700 USD in 20 days
0.0 (0 reviews)
0.0
Hi! I can write this program for you. Please refer to PM.
$720 USD in 40 days
0.0 (1 review)
0.0
Hi, I have experience with projects similar to this one. Please see my private message. Thank you, Vuong
$500 USD in 20 days
0.0 (0 reviews)
0.0
Hi. I can do it for you. Please check the PMB for more details. Thanks!
$250 USD in 30 days
0.0 (1 review)
2.7

About the client

Stamford, United States
5.0
65
Payment method verified
Member since Jan 29, 2010

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)