Parallel Python Code That Counts How Many Websites Have Canvas
$10-30 USD
Completed
Posted over 6 years ago
Paid on delivery
I need a simple Python script that reads a list of websites from a CSV file (e.g. the attached top 500,000 Alexa sites), fetches each one, and checks whether it uses Canvas, either in the HTML (by checking for the "<canvas>" tag) or in JavaScript (by checking for createElement("canvas") or createElement('canvas')). The script should output the number and percentage of websites in the list that use Canvas.
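The check described above could be sketched roughly as follows. This is only an illustration, not the attached script: the function name `uses_canvas` and the two regular expressions are assumptions, and the check only looks at the source text it is given, so Canvas code loaded from external JavaScript files would not be detected.

```python
import re

# Assumed pattern: an opening <canvas> tag in any letter case, with or without attributes.
CANVAS_TAG = re.compile(r"<canvas[\s>/]", re.IGNORECASE)
# Assumed pattern: createElement("canvas") or createElement('canvas'), allowing inner spaces.
CANVAS_JS = re.compile(r"createElement\(\s*[\"']canvas[\"']\s*\)")


def uses_canvas(page_source: str) -> bool:
    """Return True if the source contains a canvas tag or creates one from JavaScript."""
    return bool(CANVAS_TAG.search(page_source) or CANVAS_JS.search(page_source))
```

A plain substring test would also work; the regular expressions are used here only to tolerate different letter case and attributes on the tag.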
It is recommended that the code use the Python libraries "Requests" and/or "BeautifulSoup4", following logic similar to the script I started writing (attached). The following requirements must be met:
• The code must use parallel execution for efficiency, so that it does not take too long to run
• The HTTP headers must look like they come from a real browser, so that websites do not block the requests
• Reading a website should take no more than 30 seconds: if there is no response within 30 seconds, the request should time out and the script should move on to the next website
• The script must count and print the number of sites from the CSV file that were read successfully and the number that were not (the script I am attaching already does this for the unread ones). A site may be unread because it is no longer available, is unresponsive, or for any other reason
• The script must handle errors without crashing
• The script must print the duration of execution (in hours, minutes and seconds)
• The script must print the number and percentage of sites containing Canvas in either the HTML source code or JavaScript
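One possible way to meet the requirements above is a thread pool around Requests, sketched below. Everything named here is an assumption made for illustration: the functions `fetch`, `has_canvas`, `check_site`, `scan` and `format_duration`, the User-Agent string, the `http://` prefix added to each domain, and the default of 50 worker threads. Note that the Requests `timeout` limits the connection wait and each wait for data from the server, not the total download time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Assumed browser-like headers; the exact User-Agent string is only an example.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
}
TIMEOUT_SECONDS = 30


def fetch(domain):
    """Return the page source for a domain, or None if it could not be read."""
    # Imported here so the counting logic can be tried with a stand-in fetcher
    # on a machine where Requests is not installed.
    import requests
    try:
        response = requests.get(f"http://{domain}", headers=HEADERS, timeout=TIMEOUT_SECONDS)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return None


def has_canvas(source):
    """Simple substring check for the three patterns named in the description."""
    lowered = source.lower()
    return ("<canvas" in lowered
            or 'createelement("canvas")' in lowered
            or "createelement('canvas')" in lowered)


def check_site(domain, fetch_page):
    """Return None for an unread site, otherwise True/False for Canvas use."""
    try:
        source = fetch_page(domain)
    except Exception:
        return None  # any unexpected error counts the site as unread
    return None if source is None else has_canvas(source)


def scan(domains, fetch_page=fetch, workers=50):
    """Check all domains in parallel and return the summary counts."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda d: check_site(d, fetch_page), domains))
    read = sum(1 for r in results if r is not None)
    with_canvas = sum(1 for r in results if r)
    return {
        "total": len(results),
        "read": read,
        "unread": len(results) - read,
        "with_canvas": with_canvas,
        "percent_with_canvas": 100.0 * with_canvas / len(results) if results else 0.0,
        "seconds": time.perf_counter() - start,
    }


def format_duration(seconds):
    """Format a number of seconds as hours, minutes and seconds."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"
```

Each worker returns only a small result per site rather than the page itself, so memory use stays modest even for a list of 500,000 sites. The percentage is taken over the whole list, including unread sites.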
It would be great to also have a non-parallel version to compare the performance, but this is not essential.
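A rough sketch of how such a comparison might be timed is shown below. It uses a hypothetical `simulated_fetch` that just sleeps for a moment in place of a real network request, so the numbers only illustrate the measurement approach, not real-world speedups.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_fetch(domain):
    """Hypothetical stand-in for a network request: waits briefly, then returns."""
    time.sleep(0.05)
    return domain


def timed(run):
    """Return how many seconds the given callable takes to run."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


domains = [f"site{i}.example" for i in range(8)]


def run_sequential():
    for domain in domains:
        simulated_fetch(domain)


def run_parallel():
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(simulated_fetch, domains))


sequential_seconds = timed(run_sequential)
parallel_seconds = timed(run_parallel)
print(f"sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

Because the work is dominated by waiting on the network, threads can overlap those waits, which is why the parallel run is expected to finish sooner.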