Create a scraper program to download and process files for CFR regulations

$250-750 USD

In Progress
Posted over 10 years ago

Paid on delivery
File scraper, downloader, and file processor.

This project consists of two parts:

1. Spider through a website and download all files that result from the spidering.
2. Convert each downloaded file to a specific format.

Part one:

You will be given a batch of starting URLs that look like this: [login to view URL]

Follow each of these URLs to a page with links that look like this: [login to view URL]

Follow each of those links to another page with links that look like this: [login to view URL]

Follow each of those links to a page that links to specific documents. The links within these pages tend to look like this:

<table width="480"><tr>
<td><table width="120">
<tr><td> <a class="tpl" href="/cgi/t/text/text-idx?c=ecfr&SID=f68f503ab8017206c54fb367aaaa7851&amp;rgn=div8&amp;view=text&amp;node=10:1.0.1.1.4.1.56.1&amp;idno=10"> &sect;5.100</a></td></tr>
</table></td>
<td><table width="354">
<tr><td>Purpose and effective date.</td></tr>
</table></td>
</tr></table>

Each of these links leads to a page that must be saved using a naming structure that looks like this: [login to view URL]

Other examples of naming structures: 6cfrAppendix A to Part [login to view URL]

Part two:

After you have downloaded each file, you will need to put it into a specific HTML page structure.

1. Strip everything before <!-- startDynamic --> and everything after <!-- endDynamic -->.
2. Create a header for each record that looks like the headers in the sample files.
3. Replace the following strings wherever the text references a graphic:
   Replace: <img src="/graphics/
   With: <img src="[login to view URL]
   And replace: <a href="/graphics/pdfs/
   With: <a href="[login to view URL]
4. Create a footer at the bottom of each section, after the <p class="cita"> element, that looks like this example:
   <p class="cita">[54 FR 53314, Dec. 28, 1989]</p> <br><p><center>Copyright 2013 Compliance Publishing Corporation (877) 500-6737</center> </body> </html>
5. The program must accommodate both regular regulations and the Appendix sections.
6. Some of the titles have one fewer level. The number of levels down at which the individual text is located must be selectable in the program.
7. All of the search and replace definitions must be kept 'outside' of the program, in text files that can be modified as needed.
8. We require the source code as well as the finished program at the end of the project.
9. Attached is a program that completed most of these tasks but no longer works correctly because of a minor change in the text formatting (the programmer is no longer available). You may wish to use this program as a guide.
10. Attached are raw data documents and finished documents to be used as a guide.

Please review the information carefully before you provide a bid, as there will be no changes to the contract price once we accept your bid.

Please view the attached file for a sample of what the file format will be when completed. The sample contains both regular and appendix text.

Terms:

The program must work on Windows Server 2008.
We provide all funds in a Freelancer escrow account.
You must complete this project within 30 days (or less).
You must reply to all communications within 24 hours.
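As a rough illustration of the requirements above, here is a minimal Python sketch of three pieces: collecting the class="tpl" document links from an index page, cutting a downloaded page down to the region between the startDynamic and endDynamic markers, and applying search and replace rules that live outside the program. Everything not stated in the posting is an assumption made for the sketch: the function names, the one-rule-per-line "search<TAB>replacement" file format, the choice to drop the two marker comments themselves, and the example.invalid host, which merely stands in for the real replacement URL hidden behind [login to view URL]. It is not the attached legacy program and does not cover headers, footers, or file naming.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

START_MARKER = "<!-- startDynamic -->"
END_MARKER = "<!-- endDynamic -->"


class TplLinkParser(HTMLParser):
    """Collects absolute URLs from <a class="tpl" href="..."> anchors."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "tpl" in classes and attr.get("href"):
            # HTMLParser has already unescaped &amp; in the attribute value.
            self.links.append(urljoin(self.base_url, attr["href"]))


def extract_tpl_links(html, base_url):
    """Return the document links found on one index page."""
    parser = TplLinkParser(base_url)
    parser.feed(html)
    return parser.links


def strip_to_dynamic(html):
    """Keep only the text between the two markers (markers are dropped)."""
    start = html.find(START_MARKER)
    if start == -1:
        raise ValueError("start marker not found")
    body_start = start + len(START_MARKER)
    end = html.find(END_MARKER, body_start)
    if end == -1:
        raise ValueError("end marker not found")
    return html[body_start:end]


def parse_rules(text):
    """Parse rule lines of the assumed form 'search<TAB>replacement'."""
    rules = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        search, sep, replacement = line.partition("\t")
        if not sep:
            raise ValueError(f"rule line has no tab separator: {line!r}")
        rules.append((search, replacement))
    return rules


def apply_rules(html, rules):
    """Apply each literal search and replace rule in file order."""
    for search, replacement in rules:
        html = html.replace(search, replacement)
    return html
```

In use, the rules text would be read from an editable file (for example with open(path, encoding="utf-8").read()) and passed to parse_rules, so changing a replacement string never requires touching the program. Literal string replacement is used rather than regular expressions because the posting describes fixed strings; a regex-based rule format would be a different design choice.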
Project ID: 5338385

About the project

12 proposals
Remote project
Active 10 yrs ago

About the client

Edina, United States
4.9
142
Payment method verified
Member since Aug 13, 2008

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)