
Python Scraper / Crawler

$30-250 USD

Completed
Posted almost 9 years ago

Paid on delivery
I need a Python scraper. Its main function will be to check [login to view URL] (website history) for spam content. It should scrape a website's full history and compare it against a keyword list I will provide. If any of the "Spam Keywords" is found in any saved copy, the domain will be marked as spam.

I need a simple UI to upload the URL list and to load the "Spam Keywords" list. I would also like to export the results as CSV. It needs to be fast, as I will be uploading big lists of 5k, 10k, or 15k URLs. The scraper will be hosted on a DigitalOcean or similar VPS (you will set it up).

I will feed the scraper a URL list. The scraper needs to go to [login to view URL], enter one URL from the list, and press Enter. Or, equivalently, visit this URL: [login to view URL]*/URLGOESHERE

For this example we will check the URL: [login to view URL] So the scraper will visit: [login to view URL]*/[login to view URL] The result will look like this: [login to view URL] This shows the "history" of the URL [login to view URL] that I need crawled.

Every blue dot in the calendar is a saved copy of the past website. Every black block on the year timeline ([login to view URL]) means there are saved copies in that month of that year. I need the scraper to visit all the blue dots, save all the content of the saved copies (HTML code/text) into a DB (or local temporary directory), then move through all the years (black blocks) and continue crawling and saving the content of all the blue dots.

For example, clicking the last (most recent) blue dot for this URL leads here: [login to view URL]://[login to view URL] Note: you can see the date in the URL, 2014-08-02. So it will need to crawl and save all the raw data from the entire history (HTML files) into a DB or local folder.
In this example it will visit and save Aug-2-2014 [login to view URL]://[login to view URL] and May-7-2014 [login to view URL]://[login to view URL] Then it will jump to 2008 (black blocks on the timeline) and crawl Feb-11-2008 [login to view URL]://[login to view URL] Then it will jump to 2007, crawl and save everything, and so on back to 1999.

Then the scraper needs to search for "Spam Keywords" within all the content scraped for that particular URL. If there is no match, the URL is clean; if it finds any keyword, the URL is marked as SPAM. Then it continues with the next URL in the file.

I should be able to export the results easily in comma-delimited format: URL, SPAM or CLEAN status, and, if the URL is marked as spam, the spam keywords found.

After I submit the URL list, the scraper should start working. I would like to see the status of each list, like: filename / "in process" or "finished" status / if finished, show an export button.

I'm attaching a sample URL list so you can check in depth what I'm looking for. I need a coder who is free to work on this project full time so we can do this ASAP. Let me know if you have any questions.
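The keyword check and CSV export described above could look roughly like the sketch below. It covers only the matching and export steps, not the crawling. The function names, the whole-word case-insensitive matching rule, the `|` separator for multiple keywords, and the `example.com` / `example.org` domains are all assumptions for illustration; the posting does not specify them.

```python
import csv
import io
import re


def find_spam_keywords(snapshots, keywords):
    """Return the sorted keywords that occur in any saved snapshot text."""
    found = set()
    for text in snapshots:
        lowered = text.lower()
        for kw in keywords:
            # Assumption: whole-word, case-insensitive matching.
            pattern = r"\b" + re.escape(kw.lower()) + r"\b"
            if re.search(pattern, lowered):
                found.add(kw)
    return sorted(found)


def classify(url, snapshots, keywords):
    """Build one result row: URL, SPAM or CLEAN, and the keywords found."""
    found = find_spam_keywords(snapshots, keywords)
    status = "SPAM" if found else "CLEAN"
    return [url, status, "|".join(found)]


def export_csv(rows, out):
    """Write result rows to a file-like object in comma-delimited format."""
    writer = csv.writer(out)
    writer.writerow(["URL", "Status", "Spam keywords found"])
    writer.writerows(rows)


if __name__ == "__main__":
    keywords = ["viagra", "casino"]  # hypothetical keyword list
    rows = [
        classify("example.com", ["<html>Buy cheap Viagra now</html>"], keywords),
        classify("example.org", ["<html>A plain page</html>"], keywords),
    ]
    buffer = io.StringIO()
    export_csv(rows, buffer)
    print(buffer.getvalue())
```

Matching against the raw HTML, as here, also catches keywords hidden in markup or links; stripping tags first would be a different design choice.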
Project ID: 7795647

About the project

3 proposals
Remote project
Active 9 yrs ago

Awarded to:
Hey there! This project would be a perfect fit for using Twisted and BeautifulSoup together. Instead of using threads in Python for network applications, a better practice is to use non-blocking network I/O, because the GIL (Global Interpreter Lock) allows only one thread in a process to run Python code at a time. Using Twisted, I can run non-threaded, non-blocking HTTP operations to fetch the required web pages. In addition, by using a DeferredSemaphore I can let you control how many network operations run at any given time. BeautifulSoup is a well-known HTML parsing library, and with an lxml backend, which is built on the C library libxml2, parsing will be almost as fast as parsing the HTML directly in C. Thanks for the opportunity, and I hope you have a nice day.
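The idea of capping how many network operations run at once can be sketched as follows. This is not the bidder's code: it swaps in the standard library's `asyncio.Semaphore` in place of Twisted's `DeferredSemaphore`, and `fetch_all`, `FakeFetcher`, and the `siteN.example` names are hypothetical stand-ins so the example runs without network access.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrent=5):
    """Run fetch(url) for every URL with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


class FakeFetcher:
    """Stand-in for a real HTTP client; records peak concurrency."""

    def __init__(self):
        self.in_flight = 0
        self.peak = 0

    async def __call__(self, url):
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.01)  # simulated network latency
        self.in_flight -= 1
        return f"<html>{url}</html>"


if __name__ == "__main__":
    fetcher = FakeFetcher()
    urls = [f"site{i}.example" for i in range(20)]
    pages = asyncio.run(fetch_all(urls, fetcher, max_concurrent=3))
    print(len(pages), "pages fetched; peak concurrency:", fetcher.peak)
```

Raising or lowering `max_concurrent` trades crawl speed against load on the remote server, which is the same knob the proposal describes.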
$177 USD in 2 days
5.0 (6 reviews)
4.1
3 freelancers are bidding on average $209 USD for this job
We are a Venezuelan company and a team of developers. All our projects are guaranteed. It will be a pleasure to resolve all your doubts before you award the project. We work with multiple programming languages and technologies. Greetings, Victor Villalobos, Firesoft C.A.
$250 USD in 3 days
4.9 (12 reviews)
4.3
A proposal has not yet been provided
$252 USD in 3 days
4.9 (28 reviews)
3.9
Hello, I have done web scraping with Python several times for various projects. The hardest of all was scraping betting sites, which is particularly difficult because they try to protect themselves against it by changing the tags and structure of their pages. I think I can be useful to you.
$200 USD in 7 days
4.9 (3 reviews)
3.1
Thank you for viewing my profile. You can see that I have a 5***/5*** rating, and you can check with our happy customers. We always strive for customer satisfaction. I have gone through your project requirement specification, and based on my previous experience we are capable of doing this project. Please give us a chance to prove our skills.

Welcome to one of the best services available for your online needs. We provide industry-standard mobile apps, software, desktop apps, web stores, websites, and web apps, so you get it all in one place!

We can provide the following:
- iOS applications
- Android applications
- Software for all purposes
- Desktop applications
- Websites
- Web applications
- Ecommerce websites / web stores

We specialize in:
- Objective C, Cocoa, iOS 4, 5, 6
- Java, Google Android
- WordPress, Joomla
- WP Ecommerce, Magento
- PHP5, PHP, MySQL
- C#, C++, C
- ActionScript 3.0, AIR, XML

Please check the reviews from our happy customers to boost your confidence in us! Thanks. Reply to me by PM; we need to discuss your project further.
$130 USD in 7 days
5.0 (10 reviews)
1.6

About the client

Capital Federal, Argentina
5.0
5
Payment method verified
Member since Jun 28, 2014

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)