Web Crawler in PHP

$100-150 USD

Cancelled

Posted

over 13 years ago

Paid on delivery

Hello! I need a web crawler that checks the crawled websites for AdSense script code and, if it is found, looks for a contact page and saves the results to a local database. You can find the complete project description here on this site as well. Thank you, Marc

## Deliverables

The spider's main job is to find websites containing AdSense code and, if such a website is found, to try to find a contact mail address on the standard contact page and save the website address and the mail address into a database. That's the general overview. Now let me start with the details.

The crawler should start with a certain web address, let us say [login to view URL]. The crawler has two jobs here: checking whether AdSense code is present, and grabbing new website addresses for further crawling. In detail:

Job 1: The crawler has to find out whether AdSense code is part of this website's source. This is very easy to find out. The answer is "yes" if the website's source contains the word "google_ad_client" or the word "GA_googleFillSlot". If the answer is "no", only write the website's address into the database, so that later links to this website are recognized and it is not crawled again.

$insert = "INSERT into visited_websites (address,adsense,impressum) VALUES ('[login to view URL]', 0, 0)";
$ret = mysql_query($insert);

If the answer is "yes", i.e. AdSense code was found, the crawler should look for a link to the "Impressum" (the standard German word for the contact page, found on nearly every German website). The word "impressum" can be part of the linked address (file or folder name) or part of the link text.
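The AdSense check from Job 1 could be sketched roughly as follows. This is only a minimal illustration, not a finished implementation: the function name `detectAdsenseMarker` is made up for this example, and it assumes the page source has already been downloaded into a string.

```php
<?php
// Hypothetical helper (name is an assumption, not part of the project spec).
// Checks the two marker words in a fixed order and returns the first one
// that occurs anywhere in the page source, or null when neither is present.
function detectAdsenseMarker(string $html): ?string
{
    foreach (['google_ad_client', 'GA_googleFillSlot'] as $marker) {
        // Case-sensitive substring search, matching the markers as written.
        if (strpos($html, $marker) !== false) {
            return $marker;
        }
    }
    return null;
}
```

The returned marker word could then be used directly as the value for the adsense column, with 0 stored when the function returns null.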
Here are some examples of what a link to the impressum may look like:

<a class="link" href="[login to view URL]">Impressum</a>
<a href="<[login to view URL]>" class="dropdown">Impressum</a>
<a href="<[login to view URL]>" title="Impressum">Impressum</a>
<a href="<[login to view URL]>" title="Impressum">Impressum</a>
<a rel="nofollow" href="[login to view URL]" target="_blank" title="Internetradio"><img src="<[login to view URL]>" border="0" alt="Impressum"></a>
<a href="/Impressum-(Info)">Impressum</a>
<link rel="copyright" href="/de/impressum/" title="Copyright" />
<link rel="bookmark" href="/impressum/" title="Impressum" />
<a href="javascript:openNewWindow('[login to view URL]')">Impressum</a>
<a href="/impressum/" class="impressum" rel="nofollow">Impressum & AGB</a>

If the crawler is NOT able to find a link to the impressum, it should write into the database:

$insert = "INSERT into visited_websites (address,adsense,impressum) VALUES ('[login to view URL]', 'google_ad_client', 0)";
$ret = mysql_query($insert);

Comment: There are two different AdSense source codes in use. One contains the word "google_ad_client" and the other the word "GA_googleFillSlot". Depending on which word was found, the corresponding word must be inserted as the adsense value. In my example it was "google_ad_client".

If there IS a link to an impressum page, the crawler should parse this page and try to find a mail address (only the first one if there is more than one). Webmasters often try to mask the mail address to prevent spiders from grabbing it. You do not have to make the spider (crawler) Einstein-like, but it should be able to find patterns which COULD be a mail address. The crawler does not have to decrypt the masked mail address! This is not your job. Just find patterns which look like a mail address and write what was found into the database.
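One possible way to locate the impressum link in the example formats listed earlier is sketched below. It is an illustration under assumptions: the function name `findImpressumHref` is hypothetical, PHP's DOM extension is assumed to be available, and the returned href is left as found (relative links would still need to be resolved against the page address).

```php
<?php
// Hypothetical helper (name is an assumption). Returns the href of the first
// <a> or <link> element that mentions "impressum" in its href, its title,
// its text, or the alt text of a nested image. Returns null if none is found.
function findImpressumHref(string $html): ?string
{
    if (trim($html) === '') {
        return null;
    }
    // Real-world pages are rarely valid HTML, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href] | //link[@href]') as $node) {
        $haystack = $node->getAttribute('href') . ' '
            . $node->getAttribute('title') . ' '
            . $node->textContent;
        foreach ($xpath->query('.//img[@alt]', $node) as $img) {
            $haystack .= ' ' . $img->getAttribute('alt');
        }
        if (stripos($haystack, 'impressum') === false) {
            continue;
        }
        $href = $node->getAttribute('href');
        // Unwrap links of the form javascript:openNewWindow('...').
        if (preg_match('/^javascript:.*?[\'"]([^\'"]+)[\'"]/i', $href, $m)) {
            $href = $m[1];
        }
        return $href;
    }
    return null;
}
```

A DOM parser is used here instead of a single regular expression because the word can appear in several different places (address, title, link text, image alt text).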
Here are some examples of how masked mail addresses on the impressum page could look:

Scid1[at][login to view URL]
info (at) krankenversicherungprivat (dot) org
matthias<at>[login to view URL]
redaktion [at] freeware [punkt] de

So, the pattern is that an "at", "dot" or "punkt" in brackets is a very good marker to realize: Hey! Here is a mail address! (By the way, "punkt" is the German word for "dot".) So, I think the easiest instruction for your crawler is this: If one of the following expressions occurs in the impressum's source code, grab everything before and after this marker (including the marker) up to the next HTML or script tag:

marker word: at|AT|dot|DOT|punkt|PUNKT|@|et|ET
opening bracket: [ or ( or {
closing bracket: ] or ) or }

Between the opening/closing bracket and the marker word, the following characters may be present: " or ' or a space, or nothing, or a combination of them. There must not be more than two characters between the marker word and the bracket. Example:

<font face="verdana">hier ist meine mailadresse: info (at) krankenversicherungprivat (" dot' ] org</font>

Here your crawler should recognize (at) and (" dot' ] as markers (of course it is enough to find the first marker). So the crawler should grab everything between the HTML tags before and after the marker. And that is:

hier ist meine mailadresse: info (at) krankenversicherungprivat (" dot' ] org

Not a marker is, for example, { " AT "} because there are more than 2 characters between { and AT.

Now the database command is (note that the quotes inside the pattern have to be escaped):

$insert = "INSERT into visited_websites (address,adsense,impressum, mail_pattern) VALUES ('[login to view URL]', 'google_ad_client', 1, 'hier ist meine mailadresse: info (at) krankenversicherungprivat (\" dot'' ] org')";
$ret = mysql_query($insert);

Oh, very important: Mostly the mail address is written unmasked. Your crawler must find such mail addresses as well, of course!
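The marker rule above, combined with a check for ordinary unmasked addresses, might be sketched like this. All function names are hypothetical, and the plain-address regular expression is a common simplified pattern, not a full address validator. The marker match is case-insensitive, which is slightly broader than the listed spellings, and the angle-bracket form (matthias<at>...) is not covered because only [, ( and { are listed as opening brackets.

```php
<?php
// Hypothetical helper: returns the text surrounding the first bracketed
// marker (up to the neighbouring HTML tags), or null if no marker is found.
function findMaskedMailContext(string $html): ?string
{
    // Bracket, up to two of " ' or space, marker word, up to two of the
    // same characters, closing bracket. Bracket types need not match.
    $pattern = '/[\[({]["\' ]{0,2}(?:at|dot|punkt|et|@)["\' ]{0,2}[\])}]/i';
    if (!preg_match($pattern, $html, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    $start = $m[0][1];
    $end = $start + strlen($m[0][0]);

    // Text begins after the last '>' before the marker ...
    $before = strrpos(substr($html, 0, $start), '>');
    $from = $before === false ? 0 : $before + 1;
    // ... and ends at the next '<' after the marker.
    $after = strpos($html, '<', $end);
    $to = $after === false ? strlen($html) : $after;

    return trim(substr($html, $from, $to - $from));
}

// Hypothetical helper: tries an ordinary mail address first and falls back
// to the masked pattern. Returns null if nothing looks like an address.
function findMailPattern(string $html): ?string
{
    if (preg_match('/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/', $html, $m)) {
        return $m[0];
    }
    return findMaskedMailContext($html);
}
```

Before the result is written to the mail_pattern column it should be escaped (or passed as a bound parameter), since the grabbed text can contain quotes.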
Please use standard pattern recognition for mail addresses first and only then, if necessary, look for masked mail addresses. If your crawler does NOT find anything which could be a mail address, the database command is:

$insert = "INSERT into visited_websites (address,adsense,impressum, mail_pattern) VALUES ('[login to view URL]', 'google_ad_client', 1, 0)";
$ret = mysql_query($insert);

Job 2: The second job is to scan this website for other website addresses which can be crawled later. The links must

a) link to other domains
b) link to .de or .at domains (German and Austrian domains)

The found links must be written into the database, but reduced to the start page. Example: If the crawler is analysing the website [login to view URL] for other domains and finds the link [login to view URL], it should only save [login to view URL] to the database.

$insert = "INSERT into websites_to_visit (address) VALUES ('[login to view URL]')";
$ret = mysql_query($insert);

From this database table 'websites_to_visit' your crawler can pick up new website addresses for crawling when it has finished crawling the current website.

$select = mysql_query("SELECT address FROM websites_to_visit WHERE crawled = 0 LIMIT 1");
$update = "UPDATE websites_to_visit set crawled = 1 WHERE address = '[login to view URL]'";

The update query is for marking a web address as already crawled.

Okay, this is the description for the crawler. The only crawler setting I need is a variable to declare how many simultaneous crawling processes are allowed. For example, if $simultanous = 5; the crawler should process 5 web addresses simultaneously.

That's it! If you have any questions, please ask me!
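The link collection from Job 2 could look roughly like the sketch below. The function name `extractStartSites` and the domains used in the usage example are made up for illustration. As a simplification, "other domain" is decided by comparing host names, so www.example.de and example.de would count as different sites, and only absolute http/https links are considered.

```php
<?php
// Hypothetical helper: collects links to other hosts under .de or .at and
// reduces each one to its start page (scheme + host + "/"), without
// duplicates. $currentHost is the host of the page being analysed.
function extractStartSites(string $html, string $currentHost): array
{
    $sites = [];
    $pattern = '/href\s*=\s*["\']?(https?:\/\/[^"\'\s>]+)/i';
    if (!preg_match_all($pattern, $html, $matches)) {
        return [];
    }
    foreach ($matches[1] as $url) {
        $host = parse_url($url, PHP_URL_HOST);
        $scheme = parse_url($url, PHP_URL_SCHEME);
        if (!is_string($host) || !is_string($scheme)) {
            continue; // skip malformed URLs
        }
        $host = strtolower($host);
        if ($host === strtolower($currentHost)) {
            continue; // same site, not an external link
        }
        if (!preg_match('/\.(de|at)$/', $host)) {
            continue; // only German and Austrian domains
        }
        // Array keys keep each start page only once.
        $sites[strtolower($scheme) . '://' . $host . '/'] = true;
    }
    return array_keys($sites);
}
```

For instance, with the made-up page host www.current.de, a link to http://www.example.de/page/sub.html would be stored as http://www.example.de/, while links back to www.current.de or to a .com domain would be skipped. Each returned address could then be inserted into websites_to_visit.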
