Backend Tool To Scrape Data & Grab API Data & Store in SQL

$30-100 USD

Cancelled
Posted about 12 years ago

Paid on delivery
**Love to build intelligent & slick backend tools? Groovy with APIs? King of scraping data? Then this could really interest you...** I have a large-scale project in mind and I'd like YOU to help me with part 1. All going well, this could evolve into a lot more work for you.

We're going to start off with a tool that scrapes data from Google's keyword tool ([login to view URL]) and stores it in a database table. Next, we'll use Namecheap's API to check for domain availability, marking the results back in the DB. Pretty simple. To make it a bit smarter, scraping will be done through private proxies to spread the load. We'll also de-duplicate repeat keywords. That's the summary; full details are provided in the details section. Remember, this is just part 1 to get the ball rolling. The tool gets seriously cool later on :)

Part 1 really doesn't have a UI, but later on it will have some basic features. Design isn't important since the tool is not public-facing, but serious bonus points if you can make use of something like Twitter's Bootstrap library ([login to view URL]) and Glyphicons icons ([login to view URL]) to make it user friendly and slick.

I'd welcome use of PHP and MySQL for this task, as I'm able to code a little and would then be able to make tiny tweaks myself. If you have a strong preference for another language, just let me know. You'll notice that I have a glowing profile on vWorker, so rest assured that I'm great to work with :)

All applicants must understand the task details before submitting their bid. Please also include the word "bananas" in your bid so that I know you've read this! Template answers will be ignored!

## Deliverables

The tool starts off with a large empty textarea box. This is where users will enter a line-separated list of keywords (they'll typically put in around 50 each time). A submit button then adds those keywords into a 'queue' (i.e. into the database with a status of 0 = newly added; we should also store the timestamp of entry and give each keyword an incremental ID number). The form then refreshes blank, and that's all the user sees.

Next, the tool needs to fetch data from Google's keyword tool. [login to view URL] is the URL. Google uses CAPTCHAs if you're not logged in, so we need a dummy Google account that is logged in with a cookie in order to fetch the data. The keywords will go into the "word or phrase" textarea. We also need to select only the "[Exact]" "match type" option. Location should be only "United States", and the rest as per default.

We'll get back a bunch of results broken down into two sections, "search terms" and "keyword ideas", and we're interested in both. We need to collect and store: "global searches", "local searches", and "CPC". We'll also have a database field to store keyword type: 1 for "search terms", 0 for "keyword ideas". Google provides the resulting data in CSV/XML, which might be easier to parse into the DB than scraping the HTML table.

We're then going to do a domain availability lookup for keywords where local search volume is >1000 and CPC is >$1. We'll use Namecheap's free API: [login to view URL]:domains:check (I can provide you with the API access code/key, or you can create your own free account for testing). We need to check availability of the .com/.net/.org domains for the keyword phrase, removing word spacing. So if the keyword is "free ipod nano", the domains we want to check are "[login to view URL]", "[login to view URL]" and "[login to view URL]". We'll have three fields in the database, one per TLD, and use 0/1 for availability.
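To give a rough idea (this is only a sketch, not a spec), something along these lines in PHP/MySQL would cover the keyword queue and the Namecheap availability check. The `keywords` table, its column names, and the API credentials below are just placeholders; the request itself follows Namecheap's documented `namecheap.domains.check` command:

```php
<?php
// Rough sketch only. Assumes a `keywords` table something like:
//   id INT AUTO_INCREMENT, keyword VARCHAR(255) UNIQUE, status TINYINT,
//   added_at DATETIME, kw_type TINYINT, global_searches INT, local_searches INT,
//   cpc DECIMAL(6,2), com_available TINYINT, net_available TINYINT, org_available TINYINT
// Table/column names and the Namecheap credentials are placeholders.

$pdo = new PDO('mysql:host=localhost;dbname=kwtool', 'dbuser', 'dbpass');

// 1. Queue the submitted keywords (status 0 = newly added, timestamp recorded).
//    INSERT IGNORE plus the UNIQUE index on `keyword` skips duplicates already in the DB.
function queueKeywords(PDO $pdo, array $keywords)
{
    $stmt = $pdo->prepare(
        'INSERT IGNORE INTO keywords (keyword, status, added_at) VALUES (?, 0, NOW())'
    );
    foreach ($keywords as $kw) {
        $kw = strtolower(trim($kw));
        if ($kw !== '') {
            $stmt->execute([$kw]);
        }
    }
}

// 2. Check .com/.net/.org availability via Namecheap's domains.check command
//    and write 0/1 flags back to the keyword's row.
function checkDomains(PDO $pdo, $keywordId, $keyword)
{
    $base    = str_replace(' ', '', $keyword);        // "free ipod nano" -> "freeipodnano"
    $domains = "$base.com,$base.net,$base.org";

    $url = 'https://api.namecheap.com/xml.response?' . http_build_query([
        'ApiUser'    => 'YOUR_API_USER',
        'ApiKey'     => 'YOUR_API_KEY',
        'UserName'   => 'YOUR_API_USER',
        'ClientIp'   => 'YOUR_SERVER_IP',
        'Command'    => 'namecheap.domains.check',
        'DomainList' => $domains,
    ]);

    $xml       = simplexml_load_string(file_get_contents($url));
    $available = ['com' => 0, 'net' => 0, 'org' => 0];
    foreach ($xml->CommandResponse->DomainCheckResult as $result) {
        $tld = substr(strrchr((string) $result['Domain'], '.'), 1);
        $available[$tld] = ((string) $result['Available'] === 'true') ? 1 : 0;
    }

    $pdo->prepare(
        'UPDATE keywords SET com_available = ?, net_available = ?, org_available = ? WHERE id = ?'
    )->execute([$available['com'], $available['net'], $available['org'], $keywordId]);
}
```

The availability lookup would only run for rows that pass the filter above (local searches >1000 and CPC >$1).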
That's the process. But we need to add an additional process that limits the amount of scraping we're doing from our server's IP. I will provide a list of around 50 private proxies ([login to view URL]), and we should make use of them all to spread the load. I suggest only submitting 5 keywords per IP every 15 minutes. Maybe we need to use something like a CRON job to process another batch of keywords every 15 minutes? It would be really great if we could find a way to make those 15 minutes less predictable (i.e. generating a random 0-5 minute delay to add on to the 15; perhaps sleep/rand would be useful?). The scraper should do what it can to appear as close to a real user as possible (using a browser user-agent, for example). We don't need to worry about this kind of balancing when performing availability checks with Namecheap.

Lastly, we can also filter duplicates. When the user submits their list of keywords, we can ignore those already in the database. Also, when we fetch suggestions, we can ignore those that are already present.
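Again just a sketch rather than a spec: a CRON-driven batch runner along these lines would cover the proxy rotation, the randomized delay and the browser user-agent. The proxy file, the keyword-tool URL and form fields, and the parsing step are placeholders to be filled in by the developer:

```php
<?php
// Rough sketch only, meant to be run from CRON every 15 minutes, e.g.:
//   */15 * * * * php /path/to/scrape_batch.php
// The proxy file location, the keyword tool URL, its form fields and the
// CSV-parsing step are placeholders, not a working Google scraper.

$pdo = new PDO('mysql:host=localhost;dbname=kwtool', 'dbuser', 'dbpass');

// Make the 15-minute cycle less predictable: wait an extra 0-5 minutes.
sleep(rand(0, 300));

// 5 keywords per proxy IP per run.
$proxies = file('proxies.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($proxies as $proxy) {
    $rows = $pdo->query(
        'SELECT id, keyword FROM keywords WHERE status = 0 ORDER BY id LIMIT 5'
    )->fetchAll(PDO::FETCH_ASSOC);

    if (!$rows) {
        break; // queue is empty
    }

    foreach ($rows as $row) {
        $ch = curl_init('https://example.invalid/keyword-tool'); // placeholder URL
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_PROXY          => $proxy,
            CURLOPT_COOKIEFILE     => '/path/to/google_cookies.txt', // logged-in dummy account
            CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20100101 Firefox/15.0',
            // ...plus whatever POST fields the keyword tool form expects
            //    ("[Exact]" match type, United States only).
        ]);
        $response = curl_exec($ch);
        curl_close($ch);

        // parseAndStoreResults() is hypothetical: it would pull global/local searches
        // and CPC out of the CSV/XML and skip suggestions already in the DB
        // (the UNIQUE keyword column plus INSERT IGNORE handles that automatically).
        // parseAndStoreResults($pdo, $row['id'], $response);

        // Mark this keyword as processed so the next run picks up fresh ones.
        $pdo->prepare('UPDATE keywords SET status = 1 WHERE id = ?')
            ->execute([$row['id']]);
    }
}
```

The Namecheap availability checks can run in a separate, simpler job since they don't need proxying or throttling.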
Project ID: 2707907

About the project

Remote project
Active 12 yrs ago

About the client

Sussex, United Kingdom
5.0
104
Payment method verified
Member since Feb 7, 2005
