Automated download of flash video and pdf, conversion of pdf to text and perl scripts to remove extra content
£20-250 GBP
Cancelled
Posted over 11 years ago
Paid on delivery
The Leveson Inquiry consists of about 222 half-day hearings ([login to view URL]). Each of these has a flash video and a pdf transcript. This project is to write a perl script that will download each hearing and its transcript, use pdftotext (or other software installable with yum on CentOS) to convert the pdf to text, and then clean up the text by removing page numbers, line numbers, etc., so that only the speaker names and the words spoken remain (times and dates are optional). For example, for [login to view URL] the output should be:
#date Tuesday, 10 July 2012
#time 10.00 am
#speaker LORD JUSTICE LEVESON
I have misstated the position in
relation to Associated Newspapers Limited, for which
I apologise. I intend now to hand down a ruling dealing
with the way forward in connection with the issue that
has been raised.
#speaker MR JAY
Sir, we're continuing with Lord Hunt.
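To make the cleanup rules concrete, here is a minimal sketch of the logic, written in Python purely as an illustration (the deliverable is still a perl script). It assumes a layout I have not verified against the real transcripts: a bare page number on its own line, a line number at the start of every text line, the date on its own line, the time in brackets, and speakers written in capitals followed by a colon. The function name `clean_transcript` and all the patterns are hypothetical. Other cases, such as question and answer markers, are not covered.

```python
import re

# All patterns below are assumptions about what pdftotext produces for these
# transcripts; they would need checking against real output.
BARE_NO = re.compile(r"^\s*\d+\s*$")            # page number, or a numbered line with no text
LINE_NO = re.compile(r"^\s*\d{1,2}\s+(?=\S)")   # transcript line number before the text
DATE = re.compile(r"^(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day, \d{1,2} \w+ \d{4}$")
TIME = re.compile(r"^\((\d{1,2}\.\d{2} [ap]m)\)$")
SPEAKER = re.compile(r"^([A-Z][A-Z .'-]+):\s*(.*)$")


def clean_transcript(raw):
    """Return the tagged text (#date, #time, #speaker plus spoken lines)."""
    out = []
    for line in raw.splitlines():
        if BARE_NO.match(line):
            continue  # drop page numbers and empty numbered lines
        text = LINE_NO.sub("", line).strip()
        text = re.sub(r"\s{2,}", " ", text)  # collapse runs of spaces
        if not text:
            continue
        date = DATE.match(text)
        time = TIME.match(text)
        speaker = SPEAKER.match(text)
        if date:
            out.append("#date " + text)
        elif time:
            out.append("#time " + time.group(1))
        elif speaker:
            out.append("#speaker " + speaker.group(1))
            if speaker.group(2):
                out.append(speaker.group(2))  # words spoken on the same line
        else:
            out.append(text)  # continuation of the current speaker
    return "\n".join(out)
```

Each spoken line keeps its original line break, as in the example output above. The speaker pattern only accepts capitals, spaces, full stops, apostrophes and hyphens before the colon, so an ordinary sentence containing a colon is not mistaken for a speaker.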
The resulting software must run on CentOS 6 (similar to Fedora and many other Linux variants). I estimate the perl cleanup script at a couple of hours of work to get it really clean (all the #speaker tags correct and no text other than what was spoken). I'm guessing the flash download is also a couple of hours of work; if it takes longer, manual download may be used. Overall, I would guess at less than a day. If successful, I would like to repeat this project.
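For the pdf-to-text step, something along these lines is what I have in mind, again sketched in Python only as an illustration. It assumes pdftotext is on the PATH (on CentOS it normally comes from the poppler-utils package) and uses its `-layout` option so that line numbers stay at the start of each line. The function names and file names are hypothetical.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path):
    """Build the pdftotext command line for one transcript."""
    # -layout keeps the physical page layout, so each transcript line
    # stays on one output line with its line number at the start.
    return ["pdftotext", "-layout", pdf_path, txt_path]


def convert_pdf(pdf_path, txt_path):
    """Run pdftotext; raises CalledProcessError if the conversion fails."""
    subprocess.check_call(pdftotext_command(pdf_path, txt_path))
```

A call such as `convert_pdf("hearing.pdf", "hearing.txt")` would then produce the raw text that the cleanup script works on.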