Text scraper and parser

Completed Posted Dec 25, 2013 Paid on delivery

Hello,

I'm interested in someone building a web app to scrape this section of [url removed, login to view]: [url removed, login to view]

INPUT

The user will input a starting URL and set the number of individual listings to scrape. From there, the app will visit each individual listing like this from the starting URL: [url removed, login to view], clicking "next" at the bottom of the listings page until it hits the number of listings set by the user.
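The crawl described above could be sketched roughly as follows in Python. This is only an illustration under assumptions: `collect_listing_urls`, `LinkCollector`, `fetch` and `is_listing` are hypothetical names, the "next" link is assumed to be an ordinary anchor whose text starts with "next", and page fetching is passed in as a function so the sketch makes no claim about how the site is actually accessed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs for every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_listing_urls(start_url, limit, fetch, is_listing):
    """Follow "next" links from start_url until `limit` listing URLs are found.

    fetch(url) -> HTML string and is_listing(url) -> bool are supplied by the
    caller; both are assumptions of this sketch, not part of the posting.
    """
    found, seen_pages, page_url = [], set(), start_url
    while page_url and page_url not in seen_pages and len(found) < limit:
        seen_pages.add(page_url)
        parser = LinkCollector()
        parser.feed(fetch(page_url))
        next_url = None
        for href, text in parser.links:
            url = urljoin(page_url, href)
            if is_listing(url) and url not in found:
                found.append(url)
                if len(found) == limit:
                    break
            elif text.lower().startswith("next"):
                next_url = url
        page_url = next_url  # None ends the loop when there is no next page
    return found
```

The loop stops at whichever comes first: the user's listing limit, a page without a "next" link, or a page it has already visited.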

NO DESIGN

This is only a prototype for private use, so I don't want any fancy design. Just the minimal UI needed to operate the app.

BYPASS BOT PROTECTION

Craigslist does have countermeasures in place to prevent bot activity, so part of this job must include reliably bypassing those bot defenses.

PARSING SCRAPED TEXT

Once the app starts to scrape the text of the individual adverts, it needs to parse out the attributes listed below. Many are "yes" or "no" answers, but some are numbers and some are text. Some can be found by simply searching for exact-match text, but others may require fuzzier logic.

For example, the number one may be written as "one" or "1", and the app should handle basic text interpretation like this. I'll leave it up to the coder's judgment to maximize accuracy within reason. This does not need to be 100% accurate.
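To illustrate the kind of basic interpretation meant here, a minimal Python sketch for the bedroom count might look like this. The names `NUMBER_WORDS`, `to_number` and `parse_bedrooms` are hypothetical, and the list of abbreviations (bedroom, bed, br, bd) is an assumption about how adverts are typically worded, not an exhaustive rule.

```python
import re

# Spelled-out counts the sketch understands; extend as needed.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}

# A count is either digits (optionally with a decimal part) or a number word.
_COUNT = r"\b(\d+(?:\.\d+)?|" + "|".join(NUMBER_WORDS) + r")"
_BEDROOMS = re.compile(_COUNT + r"[\s-]*(?:bedrooms?|beds?|br|bd)\b", re.IGNORECASE)


def to_number(token):
    """Turn "one" or "1" into 1, and "1.5" into 1.5."""
    token = token.lower()
    if token in NUMBER_WORDS:
        return NUMBER_WORDS[token]
    return float(token) if "." in token else int(token)


def parse_bedrooms(text):
    """Return the bedroom count found in text, or None if nothing matches."""
    match = _BEDROOMS.search(text)
    return to_number(match.group(1)) if match else None
```

With this sketch, "one bedroom", "1 bedroom" and "1BR" all yield the same value, and an advert with no recognizable count yields None so the CSV cell can be left blank rather than guessed.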

LIST OF ATTRIBUTES TO PARSE FROM TEXT

- Monthly $ rent

- Individual listing URL

- Length of lease

- Street address

- City

- State

- Zip code

- # of bedrooms

- # of bathrooms

- # of square feet

- If a washer/dryer or "laundry" is included in the unit

- If there is laundry on site instead of in the unit

- Is parking included?

----- If so, what kind? Street? Carport? Single? Double? Covered? Tandem? Other?

- Housing type (condo, duplex, etc.)

- If includes furnishing

- If includes air conditioning

- If includes gated access

- If includes a swimming pool

- If it includes a fitness center or gym

- If includes a fireplace

- If includes balcony

- If includes a deck

- If includes a back yard

- If includes a patio

- If includes water view

- If includes city view

- If includes courtyard view

- If includes a rooftop terrace or garden

- If cats allowed

- If dogs allowed

- Is newly remodeled?

- If includes stainless steel appliances?

- If includes storage?

- # of stories

- If includes free wifi?

- If includes granite counter tops

- If includes Quartz counter tops

- If it's nearby a public park

- If it mentions "sunlight"

- If it includes hardwood floor

- If it mentions "grocery"

- If it mentions "quiet"

- If it mentions "school" or "schools"

- If it mentions "laundromat" or "laundromats"

- If it mentions Caltrain

- If it mentions Muni

- If it mentions "Google bus"
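Many of the yes/no attributes above could be handled with simple keyword patterns. The following Python sketch covers only a handful of them as an illustration. The column names, the `FLAG_PATTERNS` table and `extract_flags` are hypothetical, and the phrases treated as "cats allowed" are an assumption about common advert wording.

```python
import re

# Hypothetical column name -> case-insensitive pattern.
# Word boundaries keep "muni" from matching inside "community".
FLAG_PATTERNS = {
    "cats_allowed": r"\bcats?\s+(?:are\s+)?(?:ok|okay|allowed|welcome)\b",
    "hardwood_floor": r"\bhardwood\b",
    "mentions_quiet": r"\bquiet\b",
    "mentions_school": r"\bschools?\b",
    "mentions_laundromat": r"\blaundromats?\b",
    "mentions_caltrain": r"\bcaltrain\b",
    "mentions_muni": r"\bmuni\b",
    "mentions_google_bus": r"\bgoogle\s+bus\b",
}
_COMPILED = {
    name: re.compile(pattern, re.IGNORECASE)
    for name, pattern in FLAG_PATTERNS.items()
}


def extract_flags(text):
    """Return a "yes"/"no" answer for every attribute in FLAG_PATTERNS."""
    return {
        name: "yes" if regex.search(text) else "no"
        for name, regex in _COMPILED.items()
    }
```

Keeping the patterns in one table means a new attribute is a one-line addition, and each key can double as a CSV column name.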

FINAL OUTPUT

Final output is a CSV file where each attribute listed above is a column.
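As a rough sketch of that output step, Python's standard `csv` module can write one row per listing with one column per attribute. The four column names below are hypothetical placeholders for the full attribute list, and `rows_to_csv` is an illustrative helper name.

```python
import csv
import io

# Hypothetical subset of columns; the real file would list every attribute.
COLUMNS = ["listing_url", "monthly_rent", "bedrooms", "cats_allowed"]


def rows_to_csv(rows, columns=COLUMNS):
    """Serialize a list of dicts to CSV text with a header row.

    Attributes missing from a row become empty cells; keys that are not
    known columns are ignored.
    """
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Returning the CSV as a string keeps the sketch independent of how the download is served; the same text could be written to a file or sent as an HTTP response.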

So this app requires a very simple UI where the user starts the crawl and, when it's finished, downloads the final CSV.

PROGRESS BAR

There must be a progress indicator that estimates time to completion of the crawling and parsing while it's happening. There must also be an option to stop a crawl before it's finished.
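One simple way to meet both requirements is to estimate the remaining time from the average time per listing so far, and to check a stop flag between listings. The Python sketch below assumes that approach. `CrawlProgress`, `run_crawl` and `scrape_one` are hypothetical names, and the estimate is only as good as the assumption that listings take roughly equal time.

```python
import threading
import time


class CrawlProgress:
    """Tracks completed listings, estimates time remaining, and holds a stop flag."""

    def __init__(self, total, clock=time.monotonic):
        self.total = total
        self.done = 0
        self._clock = clock
        self._started = clock()
        self._stop = threading.Event()  # safe to set from a UI thread

    def step(self):
        self.done += 1

    def eta_seconds(self):
        """Average seconds per finished listing times listings remaining."""
        if self.done == 0:
            return None  # nothing measured yet, so no estimate
        elapsed = self._clock() - self._started
        return elapsed / self.done * (self.total - self.done)

    def request_stop(self):
        self._stop.set()

    @property
    def stopped(self):
        return self._stop.is_set()


def run_crawl(urls, scrape_one, progress):
    """Scrape each URL in turn, stopping early if a stop was requested."""
    results = []
    for url in urls:
        if progress.stopped:
            break
        results.append(scrape_one(url))
        progress.step()
    return results
```

Because the stop flag is checked between listings, a stopped crawl still returns the rows gathered so far, which could be offered as a partial CSV.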

TESTING AND INSTALLATION

Please develop on your server. After user acceptance testing, the final step will be to install the working app on my server.

Thanks for your bids!

HTML MySQL PHP Python Software Architecture

Project ID: #5260474

About the project

4 proposals Remote project Active Dec 26, 2013

Awarded to:

MaximSky

Hello, thank you for providing the clear project description. I'm ready to create such a scraper for you. I'm online and can start right now. I'm a professional web developer with 10 years of experience in areas such as PH...

$9 USD / hour
(14 Reviews)
4.9

4 freelancers are bidding on average $17/hour for this job

suraj99p

Yes, Craigslist has a strong defense mechanism to identify bots, but a browser-based bot can do the job. I have made an ad poster bot for Craigslist, and I can make a Selenium-based bot for this task.

$20 USD / hour
(5 Reviews)
3.4