Scrape, process and enrich biographical pages from >80 websites using Scrapy

Closed Posted 7 years ago Paid on delivery

I need a Scrapy project of spiders to regularly scrape executive biographical information from approximately 80 specified company websites.

For each company, the process should:

i) Scrape the personal profile records found on individual pages linked to from an index of people on each site. Some indexes may be paginated or contain JavaScript. Field names across all spiders should conform to a common schema, which I will supply.

For each person’s record, you will also perform the following using Item Pipelines, middleware or whatever else is necessary:

ii) check for and eliminate duplicates

iii) perform small amounts of simple text processing, likely using available Python libraries or features (a. split full names into components, b. create a username from the split names, c. join two of the split components into a new name field and d. create a domain-name field from the source URL)

iv) enrich each record with additional corresponding personal information, sourced externally via two APIs - Email Hunter and Clearbit.
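Steps ii) and iii) could be sketched as two Item Pipelines along the following lines. This is only an illustration under stated assumptions: the field names (`full_name`, `source_url`, `first_name`, `middle_name`, `last_name`, `username`, `short_name`, `domain`) are hypothetical placeholders, since the common schema is not shown here, and the name splitting is deliberately naive (first token, last token, everything in between).

```python
import re
from urllib.parse import urlparse

try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch also runs without Scrapy installed
    class DropItem(Exception):
        pass


def split_name(full_name):
    """Naive split: first token = first name, last token = last name."""
    parts = (full_name or "").split()
    if not parts:
        return {"first_name": "", "middle_name": "", "last_name": ""}
    return {
        "first_name": parts[0],
        "middle_name": " ".join(parts[1:-1]),
        "last_name": parts[-1] if len(parts) > 1 else "",
    }


def make_username(first, last):
    """Build a lowercase 'first.last' username, dropping other characters."""
    raw = f"{first}.{last}" if last else first
    return re.sub(r"[^a-z0-9.]", "", raw.lower())


def domain_from_url(url):
    """Return the host of a URL without port or a leading 'www.'."""
    host = urlparse(url).netloc.lower().split(":")[0]
    return host[4:] if host.startswith("www.") else host


class DedupePipeline:
    """Drop items whose (profile URL, name) pair was already seen in this run."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        name = (item.get("full_name") or "").strip().lower()
        key = (item.get("source_url"), name)
        if key in self.seen:
            raise DropItem(f"duplicate record: {key}")
        self.seen.add(key)
        return item


class NameFieldsPipeline:
    """Add the derived name, username and domain fields (hypothetical names)."""

    def process_item(self, item, spider):
        parts = split_name(item.get("full_name"))
        item.update(parts)
        item["username"] = make_username(parts["first_name"], parts["last_name"])
        item["short_name"] = " ".join(
            p for p in (parts["first_name"], parts["last_name"]) if p
        )
        item["domain"] = domain_from_url(item.get("source_url") or "")
        return item
```

In a real project both classes would be listed in the `ITEM_PIPELINES` setting, with the de-duplication step given the lower order number so it runs first. Names with titles, suffixes or multi-word surnames would need more careful handling than this split provides.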

I will supply JSON files containing the XPath and regex rules that should make up each spider. This should make it quick for anyone with the knowledge to write the spiders. However, these rules may need changes to a) enhance or b) correct them. If the supplied selectors are incorrect, you will find the appropriate selectors to capture the correct information, in discussion with me.
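To illustrate how one generic spider might be driven by such a rule file, here is a rough sketch. The JSON layout (`name`, `index_url`, `profile_link_xpath`, and a `fields` map with `xpath` and optional `regex` entries) is an assumption made for this example, not the actual format of the supplied files, and pagination or JavaScript handling is left out.

```python
import json
import re

# Hypothetical rule file; the real supplied JSON may be structured differently.
SAMPLE_RULES = r"""
{
  "name": "example_co",
  "index_url": "https://example.com/team",
  "profile_link_xpath": "//a[@class='bio']/@href",
  "fields": {
    "full_name": {"xpath": "//h1/text()"},
    "job_title": {"xpath": "//h2/text()", "regex": "^Title:\\s*(.+)$"}
  }
}
"""

REQUIRED_KEYS = ("name", "index_url", "profile_link_xpath", "fields")


def load_rules(text):
    """Parse a rule file, check required keys and pre-compile the regexes."""
    rules = json.loads(text)
    missing = [k for k in REQUIRED_KEYS if k not in rules]
    if missing:
        raise ValueError(f"rule file is missing keys: {missing}")
    for rule in rules["fields"].values():
        pattern = rule.get("regex")
        rule["compiled"] = re.compile(pattern) if pattern else None
    return rules


def apply_regex(rule, raw):
    """Normalise whitespace, then apply the field's optional regex."""
    if raw is None:
        return None
    text = " ".join(raw.split())
    compiled = rule.get("compiled")
    if compiled is None:
        return text
    match = compiled.search(text)
    if not match:
        return None
    # Prefer the first capture group when the pattern defines one.
    return match.group(1) if compiled.groups else match.group(0)


def make_spider(rules):
    """Build a Scrapy spider class from loaded rules (requires Scrapy)."""
    import scrapy

    class RuleSpider(scrapy.Spider):
        name = rules["name"]
        start_urls = [rules["index_url"]]

        def parse(self, response):
            # Follow every profile link found on the index page.
            for href in response.xpath(rules["profile_link_xpath"]).getall():
                yield response.follow(href, callback=self.parse_profile)

        def parse_profile(self, response):
            item = {"source_url": response.url}
            for field, rule in rules["fields"].items():
                raw = response.xpath(rule["xpath"]).get()
                item[field] = apply_regex(rule, raw)
            yield item

    return RuleSpider
```

Keeping the rules in data rather than code means a corrected selector is a one-line change in the JSON file. Sites with unusual layouts would probably still need a hand-written spider.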

IP rotation or other protection may be required. I already have ReverseProxies/MicroLeaves service available to run through if necessary, but you will advise.

This project was previously completed a year ago using different software. Previously-scraped data is available for comparison, and, for each of the 80 companies, I expect the data sets to broadly match up. The project goal is to migrate the process to Scrapy, to capture the same data using a different, more robust and repeatable technology.

Code files for each spider and the project should be supplied to me, and you will deploy them to ScrapyCloud, where I intend to run and manage them regularly. You will test the running and management in ScrapyCloud first.

The project should be set for weekly periodic scraping in ScrapyCloud. However, changes in target content from week to week are expected to be extremely minimal, as most content is static. Weekly scraping is therefore only a refresh (company staff may come, go or change details).

Where you have already enriched found data with information from the two paid APIs discussed, you should not make repeat calls to these APIs for the same information. Re-enriching data with already-accessed information would incur unnecessary and expensive duplicate API calls. Therefore, you will use the right methods (i.e. incremental crawls, DotScrapy Persistence or similar) to minimise both the effort and cost of i) crawling and ii) API enriching.
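One way to avoid paying twice for the same lookup is a small persistent cache that sits in front of the API calls. The sketch below is a minimal illustration, not a finished design: `EnrichmentCache`, the cache file path and the `fetch` callable are all hypothetical, and the real Email Hunter and Clearbit requests are not shown. It also assumes the cache file lives somewhere that survives between runs, such as a directory kept by DotScrapy Persistence.

```python
import json
import os


class EnrichmentCache:
    """Remember API responses on disk so each lookup is only paid for once."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.data = json.load(fh)

    def get_or_fetch(self, source, key, fetch):
        """Return the cached result for (source, key), calling fetch(key) once.

        `source` labels the API (e.g. "hunter" or "clearbit"); `fetch` is a
        caller-supplied function that performs the real, paid request.
        """
        cache_key = f"{source}:{key}"
        if cache_key not in self.data:
            self.data[cache_key] = fetch(key)
            self._save()
        return self.data[cache_key]

    def _save(self):
        # Write to a temp file first so a crash cannot leave a half-written cache.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump(self.data, fh)
        os.replace(tmp, self.path)
```

An enrichment pipeline would call something like `cache.get_or_fetch("clearbit", item["domain"], lookup_company)`, where `lookup_company` is whatever function wraps the paid request. Responses must be JSON-serialisable for this approach to work, and a policy for expiring stale entries would need to be agreed separately.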

My end goal is, using ScrapingHub, to download CSV or JSON files, containing the full records of each company’s personnel.

You will also be available after deployment to ensure that performance in the first few cycles executes as expected.

This brief is written in detail to ensure a common understanding and the best possible start. You should comprehend the project goals, and suggest and discuss methods before the job is agreed.

Python

Project ID: #11689432

About the project

8 proposals Remote project Active 7 years ago

8 freelancers are bidding on average £571 for this job

SuiGenSolutions

Hello Sir, I have extensive experience in development of crawlers using Scrapy. Please refer our profile. We have done more than 100 projects on Scrapy alone. We use proxies and even captcha services for efficient…

£526 GBP in 5 days
(35 Reviews)
5.9
mantislin

Dear sir, I am scraping expert, I have did too many scraping projects, please check my reviews then you will know. Can you tell me more details? then I will provide example data/script for you. Thanks,…

£548 GBP in 6 days
(4 Reviews)
4.6
Maariyaa

Can we discuss about the project. Feel free ask me question if any. I Look forward to hearing from you. Have a nice day and stay fine:-) Best Regards Maariyaa

£750 GBP in 10 days
(4 Reviews)
3.3
novepi

Hello, My current bid is just a placeholder. Let me know if your budget is flexible, otherwise just ignore my bid. Thanks Aydin

£555 GBP in 10 days
(1 Review)
2.4
excelentwork

Hello Sir/Madam, Thanks for project post. here we checked the posted details and review it, here we need some more clarification in it, So Please message us to clear our doubts and start work on it. Thanks

£356 GBP in 10 days
(0 Reviews)
0.0
prashushinde9

Hello, I understood the initial scope of this project. Although i want to discuss further this job in order to prepare the final concept for this project. After Complete discussion over the call or in chat, i wi…

£773 GBP in 20 days
(0 Reviews)
0.0
JeffreyCJensen

I am currently working on two Web Scraping Projects so my mind is already functioning in this domain. I have a long time-line for completion. But my price is low. I think I can do this for you.

£495 GBP in 90 days
(0 Reviews)
0.0
riteshkachariya

HI! My name is Hitesh Kachariya and I am an expert in scrape data from website. I would love to have the opportunity to discuss your project with you. To complete your initial project I would review your existing asset…

£500 GBP in 15 days
(0 Reviews)
0.0