Extract information from .json files and spider content
This project was awarded to hectorenavarrou for $353 USD.
Project Budget: $250 - $750 USD
Given a list of URLs of Reddit pages, extract the information and store it in our JSON format.
- Reddit has .json files for each of its stories. Given a list of URLs to Reddit postings (e.g. [url removed, login to view]), your goal is to extract *all* the comment information for the posts. Reddit provides .json files that correspond to these stories: .....
- The information needs to be extracted and stored in the following way:
- The first time you encounter a user, they should get a unique 16-character ID. We will provide you with the code snippet to generate IDs. The final user records need to look like this:
Reddit stories have an initial post and then comments. Both posts and comments should be stored in a JSON file that looks like this:
{
  "id": "24_CHARACTER_ID", // Generated with the same code snippet as the user IDs
  "localid": "50t2rc", // This is the ID that Reddit uses to identify the post or comment
  "author": "ID_FROM_USER_TABLE", // The 16-character ID that you generated for this user
  "title": "Graffiti in abandoned Greece hotel", // NOTE: Posts have a "title"; comments have "content" but no title
  "link": "[url removed, login to view]", // Some posts have a link, some don't. No comments have links.
  "content": "RAW_CONTENT_OF_POST_OR_CONTENT", // All comments have content; only some posts have content, for example posts of the type "Ask Me Anything" (AMA)
  "parent": "24_CHARACTER_ID_OF_PARENT_POST_OR_COMMENT", // All comments have a parent. This is the ID you generated for the immediate parent, which could be a post or a comment. Posts have NO parent
  "children": ["ARRAY", "OF", "CHILD", "IDS"] // An array that holds the IDs of the immediate children of this post or comment. These are the IDs you generated for those children.
}
- For stories with many comments, not all comments will appear in the first .json file. You may have to look at additional URLs to extract all the comments for a story.
- You must maintain referential integrity for the IDs. In other words, if you assign an ID of "aK2usyH1KpeyeQ3fma4wa2fb" to a comment, this same ID must be used whenever this comment is a parent or child of another comment or post.
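The ID and record requirements above could be sketched roughly as follows. This is an illustration in Python only; the deliverable itself is to be written in PHP. `new_id` is a hypothetical stand-in for the ID snippet that will be provided, and the field names read from Reddit's data (`id`, `author`, `title`, `url`, `is_self`, `selftext`, `body`, `parent_id`) are assumptions about Reddit's JSON, not part of this brief. Keeping one lookup table per kind of ID is one simple way to make sure the same generated ID is reused everywhere.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def new_id(length):
    """Hypothetical stand-in for the ID snippet that will be provided."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


class IdRegistry:
    """Hands out one generated ID per key and returns it on every later lookup."""

    def __init__(self, length):
        self.length = length
        self.ids = {}

    def get(self, key):
        if key not in self.ids:
            self.ids[key] = new_id(self.length)
        return self.ids[key]


users = IdRegistry(16)  # Reddit username -> 16-character user ID
items = IdRegistry(24)  # Reddit item ID  -> 24-character post/comment ID


def make_record(kind, data):
    """Build one record; kind "t3" is assumed to be a post, anything else a comment."""
    record = {
        "id": items.get(data["id"]),
        "localid": data["id"],
        "author": users.get(data["author"]),
        "children": [],
    }
    if kind == "t3":  # post: has a title, no parent
        record["title"] = data["title"]
        if not data.get("is_self"):
            record["link"] = data["url"]
        if data.get("selftext"):
            record["content"] = data["selftext"]
    else:  # comment: has content and a parent, no title or link
        record["content"] = data["body"]
        # parent_id is assumed to look like "t3_50t2rc"; drop the type prefix
        parent_local = data["parent_id"].split("_", 1)[1]
        record["parent"] = items.get(parent_local)
    return record


def link_children(records):
    """Fill each record's "children" array from the "parent" fields."""
    by_id = {r["id"]: r for r in records}
    for r in records:
        parent = by_id.get(r.get("parent"))
        if parent is not None:
            parent["children"].append(r["id"])
```

Because parent IDs are looked up through the same registry as the records' own IDs, a comment can be processed before or after its parent and still end up pointing at the same 24-character ID.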
PHP. We will require all source code needed to make the spider work.
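One part of the spider worth sketching is the requirement that comments missing from the first .json file must still be found. The Python sketch below (illustrative only, since the deliverable is PHP) walks a comment listing and collects the IDs of comments that are referenced but not included. It assumes Reddit's response nests comments as `Listing` objects, marks omitted comments with entries of kind `"more"`, and uses an empty string for "no replies". Those details are assumptions about Reddit's format, and how the missing comments are then fetched is left open here.

```python
def walk(listing, found=None, missing=None):
    """Depth-first walk over a (assumed) Reddit Listing.

    Returns (found, missing): `found` holds (kind, data) pairs for the posts
    and comments present in the response; `missing` holds the Reddit IDs that
    only appear inside "more" stubs and would need a further request.
    """
    found = [] if found is None else found
    missing = [] if missing is None else missing
    for child in listing["data"]["children"]:
        kind, data = child["kind"], child["data"]
        if kind == "more":
            # Stub entry: comments exist but were left out of this response.
            missing.extend(data.get("children", []))
            continue
        found.append((kind, data))
        replies = data.get("replies")
        if isinstance(replies, dict):  # an empty string means no replies
            walk(replies, found, missing)
    return found, missing
```

A story is fully extracted only once `missing` comes back empty for every response that has been processed.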
Extremely simple. A webpage with a <textarea> to input a list of URLs to be spidered/scraped and a button that says "go". The return page should be the JSON for the users and the posts/comments.
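The input handling behind that page could look roughly like the Python sketch below (again illustrative; the deliverable is PHP). It assumes one URL per line in the textarea and assumes the .json file for a story is reached by appending `.json` to the story's URL path. Both function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def parse_url_list(textarea_value):
    """Split textarea contents into a de-duplicated list of URLs, order kept."""
    seen = set()
    urls = []
    for line in textarea_value.splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def to_json_url(post_url):
    """Append ".json" to the path of a story URL (assumed convention)."""
    parts = urlsplit(post_url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    # The query string and fragment are dropped; this sketch does not use them.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

Skipping blank lines and duplicates up front keeps the same story from being spidered twice when a URL is pasted more than once.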