We want to compare three versions of a web page to extract the non-changing text (the article body content). However, dynamic content on the page (ads, widgets, etc.) makes this a hard problem, because dynamic ads produce false positives when detecting content changes.
Our plan is therefore to visit a page three times and exclude all dynamic text that changes on every page refresh, leaving only the article content. In production this will be used on millions of different sites, so site-specific footprints (such as extracting content under a certain tag) can't be used. It should work for any web page.
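As a rough illustration of the three-visit idea, here is a minimal sketch in Python (chosen only for illustration; the same logic could be ported to PHP). It assumes the three HTML snapshots have already been fetched, treats each text node as one block, and keeps only the blocks that appear in every snapshot. The names `text_blocks`, `block_hash` and `stable_text` are hypothetical helpers, not part of any library. To keep memory low, the later snapshots are reduced to sets of 8-byte hashes rather than kept as full text.

```python
import hashlib
from html.parser import HTMLParser

# Tags whose text content is never visible article text.
SKIPPED_TAGS = {"script", "style", "noscript"}


class TextBlockParser(HTMLParser):
    """Collects whitespace-normalized text nodes, skipping script/style content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.blocks.append(text)


def text_blocks(html):
    """Return the visible text nodes of an HTML string, in document order."""
    parser = TextBlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def block_hash(text):
    """Short fixed-size fingerprint of a text block (8 bytes)."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()


def stable_text(snapshots):
    """Keep the blocks of the first snapshot that occur in all other snapshots."""
    first, *rest = snapshots
    # Only hashes of the later snapshots are held in memory, not their text.
    hash_sets = [{block_hash(b) for b in text_blocks(html)} for html in rest]
    return [
        b for b in text_blocks(first)
        if all(block_hash(b) in hashes for hashes in hash_sets)
    ]
```

Note that this sketch splits text at every tag boundary, so a paragraph containing inline markup becomes several blocks, and static navigation or footer text survives along with the article body. A real solution would likely need block merging and some extra filtering on top.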
It sounds simple, but we need a very low memory footprint, as this will run on millions of web pages. The script will return the non-changing text from the HTML of a web page, and a comparison function will then compare that text against other versions of the page to measure how much has changed.
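The comparison step could look something like the following sketch, again in Python purely for illustration. It assumes each page version has already been reduced to a list of stable text blocks, and it uses the standard library's `difflib.SequenceMatcher` to score similarity. The function name `change_ratio` is hypothetical, and the choice of a block-level diff (rather than a character-level one) is an assumption made to keep the cost low.

```python
from difflib import SequenceMatcher


def change_ratio(old_blocks, new_blocks):
    """Return how much the text changed, from 0.0 (identical) to 1.0 (nothing shared).

    Both arguments are lists of text blocks (strings) in document order.
    """
    # autojunk=False disables difflib's heuristic that ignores very frequent items.
    matcher = SequenceMatcher(None, old_blocks, new_blocks, autojunk=False)
    return 1.0 - matcher.ratio()
```

For example, `change_ratio(["a", "b", "c", "d"], ["a", "b", "c", "x"])` compares two four-block versions that share three blocks. Comparing whole blocks means a one-word edit counts as a fully changed block, which is coarse but cheap; a finer score would need a diff inside the changed blocks.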
Explain your approach and how it will be faster than any we can think of, and whether there are any PHP libraries you can use to help.
Hi, I'd like to work with you. This project seems to be a challenge, and I love challenges. Please provide me with the URL of the site and I'll start with a demo before you choose a coder. Thanks. Leo