Custom Web Crawler

Cancelled · Posted Nov 28, 2011 · Paid on delivery

I need a web crawling platform that will crawl the first 1,000-5,000 search results on Google, Bing, and Yahoo for keywords that I specify. From those results, the crawler will gather links to common file sharing sites, drawn from a list that I will provide.
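To illustrate this step, here is a rough Python sketch of the kind of gathering I have in mind, assuming a requests/BeautifulSoup stack. The Bing URL parameters, the result-page selector, and the `FILE_HOSTS` list are all placeholder assumptions, not requirements; the developer may use whatever approach (including official search APIs) works more reliably.

```python
# Rough sketch: page through search results and keep links that point at
# known file sharing hosts. URL parameters and markup selectors are
# assumptions and will need adjusting per search engine.
import urllib.parse

import requests
from bs4 import BeautifulSoup

FILE_HOSTS = {"example-filehost.com", "another-host.net"}  # placeholder list

def gather_links(query, pages=100, page_size=10):
    found = set()
    for page in range(pages):
        params = {"q": query, "first": page * page_size + 1}
        resp = requests.get("https://www.bing.com/search", params=params,
                            headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.select("li.b_algo h2 a"):  # assumed Bing result markup
            href = a.get("href", "")
            host = urllib.parse.urlparse(href).netloc.lower()
            if any(host == h or host.endswith("." + h) for h in FILE_HOSTS):
                found.add(href)
    return found
```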

After the list of links has been gathered, the crawler will crawl each of the file sharing site links and check that it is relevant to my needs by searching for specific keywords that I specify. My current crawling system produces many false positives, so I need to be able to filter results effectively. Support for regular expressions would be ideal.
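As an example of the filtering I need, something along these lines would work; the include/exclude patterns below are purely illustrative and would come from my configuration.

```python
# Sketch of the relevance filter: a page must match at least one include
# pattern and no exclude pattern. These patterns are illustrative only.
import re

INCLUDE = [re.compile(p, re.IGNORECASE) for p in [r"\bmy product name\b", r"v\d+\.\d+"]]
EXCLUDE = [re.compile(p, re.IGNORECASE) for p in [r"file (not found|removed)"]]

def is_relevant(page_text):
    if any(p.search(page_text) for p in EXCLUDE):
        return False
    return any(p.search(page_text) for p in INCLUDE)
```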

The crawler must maintain a database of links that have already been checked so that it does not check them again. It must also enforce a delay before hitting the same file site URL again, to avoid being blocked for making too many consecutive connections.
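A minimal sketch of what I mean, using SQLite for the link database and a per-host delay; the 30-second figure and the table layout are assumptions the developer is free to change.

```python
# Sketch: a SQLite table of already-checked URLs, plus a per-host delay so
# the same file site is never hit twice in quick succession.
import sqlite3
import time
import urllib.parse

db = sqlite3.connect("crawler.db")
db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY, checked_at REAL)")
last_hit = {}  # host -> timestamp of the last request

def already_seen(url):
    return db.execute("SELECT 1 FROM seen WHERE url = ?", (url,)).fetchone() is not None

def mark_seen(url):
    db.execute("INSERT OR IGNORE INTO seen VALUES (?, ?)", (url, time.time()))
    db.commit()

def wait_for_host(url, min_delay=30.0):  # 30 s is an assumed default
    host = urllib.parse.urlparse(url).netloc
    elapsed = time.time() - last_hit.get(host, 0.0)
    if elapsed < min_delay:
        time.sleep(min_delay - elapsed)
    last_hit[host] = time.time()
```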

The crawler will only crawl the first page of search results unless a broken link is found; in that case it will follow through to the next page to get the actual link. For example, common file searching sites will put .. or ... in the link to prevent crawlers from capturing the full URL, but if the crawler follows these links, the full URL is often on the next page.
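A rough sketch of that follow-through, assuming the full URL appears somewhere in the HTML of the next page; how each listing site actually hides the link will need to be confirmed per site.

```python
# Sketch: follow a truncated link and search the next page for an absolute
# URL on the expected file host. The host-matching heuristic is an assumption.
import re

import requests

def resolve_truncated(link_url, expected_host):
    resp = requests.get(link_url, timeout=30)
    pattern = r"https?://[\w.-]*%s[^\s\"'<>]*" % re.escape(expected_host)
    match = re.search(pattern, resp.text)
    return match.group(0) if match else None
```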

The crawler should be written in either Python or Perl and be able to run in the background on a Linux server; I will be using Ubuntu Server specifically. I will schedule regular crawls via a cron job, or the developer may build a scheduling system instead.

The crawler must support multiple runs with different keyword searches. Ideally the search parameters would be stored in a configuration file, with the path to that config file passed to the crawler on execution.
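For example, something like the following, where the section and option names are placeholders of my own invention:

```python
# Sketch: one config file per keyword search, with its path given on the
# command line. Section/option names are placeholders.
import argparse
import configparser

def load_config():
    parser = argparse.ArgumentParser(description="file-share crawler")
    parser.add_argument("config", help="path to a search config file")
    args = parser.parse_args()
    cfg = configparser.ConfigParser()
    cfg.read(args.config)
    return {
        "query": cfg.get("search", "query"),
        "max_results": cfg.getint("search", "max_results", fallback=1000),
        "include": cfg.get("filter", "include_patterns").splitlines(),
        "exclude": cfg.get("filter", "exclude_patterns", fallback="").splitlines(),
    }
```

This would pair naturally with the cron scheduling mentioned above, e.g. a crontab entry such as `0 3 * * * python3 /opt/crawler/crawl.py /opt/crawler/configs/productA.ini` (paths hypothetical), one line per keyword search.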

Some file sharing sites use a CAPTCHA system, and the crawler will be unable to determine whether those links are relevant. I need the crawler to save these links and send them to me for manual checking.
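Detection can be crude; a sketch like the one below, which just looks for CAPTCHA markers in the page text, would be acceptable as a starting point (the marker strings are my guesses).

```python
# Sketch: crude CAPTCHA detection; matching links go to a manual-review
# list instead of being classified. Marker strings are assumptions.
CAPTCHA_MARKERS = ("captcha", "recaptcha", "are you human")

def needs_manual_check(page_text):
    text = page_text.lower()
    return any(marker in text for marker in CAPTCHA_MARKERS)
```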

I also require the crawler to save all HTML text/links from scanned pages to the local hard drive so that I can filter results manually via regex if needed. I also need to be able to delete the stored data once I determine it is no longer needed.
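A sketch of the archiving I have in mind, with the directory layout as a placeholder; deleting old data would then just be a matter of removing files or the whole directory.

```python
# Sketch: archive every scanned page to disk, keyed by a hash of its URL,
# so results can be re-filtered with regex later. Layout is a placeholder.
import hashlib
import os

ARCHIVE_DIR = "archive"  # hypothetical location

def save_page(url, html):
    os.makedirs(ARCHIVE_DIR, exist_ok=True)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(ARCHIVE_DIR, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(url + "\n" + html)  # first line records the source URL
    return path
```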

Once all of the above results have been gathered and filtered, I want the crawler to email me a list of the URLs found.
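Plain-text email is fine; a sketch along these lines, using a local SMTP server (the addresses are placeholders):

```python
# Sketch: mail the filtered URL list, plus any CAPTCHA-blocked links for
# manual checking, at the end of a run. Addresses are placeholders.
import smtplib
from email.message import EmailMessage

def email_results(urls, manual_urls, to_addr="me@example.com"):
    msg = EmailMessage()
    msg["Subject"] = "Crawler results: %d links" % len(urls)
    msg["From"] = "crawler@example.com"
    msg["To"] = to_addr
    body = "Relevant links:\n" + "\n".join(sorted(urls))
    if manual_urls:
        body += "\n\nCAPTCHA-blocked (check manually):\n" + "\n".join(sorted(manual_urls))
    msg.set_content(body)
    with smtplib.SMTP("localhost") as server:
        server.send_message(msg)
```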

## Deliverables

The crawler must be able to gather links to the following sites, with the ability to add more if needed.

[70 urls removed, login to view]

Engineering, PHP, Project Management, Script Install, Shell Script, Software Architecture, Software Testing, Web Hosting, Website Management, Website Testing

Project ID: #3722303

About the project

Remote project · Active Nov 28, 2011