Closed

Extract information from .json files and spider content

This project was awarded to hectorenavarrou for $353 USD.

Project Budget: $250 - $750 USD
Total Bids: 36
Project Description

Summary:

Given a list of URLs of pages at reddit, extract the information and put it in our JSON format.

Details:

- Reddit has .json files for each of its stories. Given a list of URLs to reddit postings (e.g. [url removed, login to view]), your goal is to extract *all* the comment information for the posts. Reddit provides .json files that correspond to these stories: .....

- The information needs to be extracted and stored in the following way:

- The first time you encounter a user, they should get a unique 16 character ID. We will provide you with the code snippet to generate IDs. The final user records need to look like this:

[
  {
    "id": "16_CHARACTER_ID",
    "reddit": "THEIR_REDDIT_USERNAME",
    "date": "TIMESTAMP_OF_WHEN_THIS_RECORD_CREATED"
  },
  ...
]
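The employer will supply the actual ID-generating snippet, so the generator below is only a stand-in; `generateId()` and `userRecordFor()` are illustrative names used to sketch how a record might be created the first time a username is seen.

```php
<?php
// Placeholder ID generator -- the real project uses the employer's snippet.
function generateId(int $length): string {
    $chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    $id = '';
    for ($i = 0; $i < $length; $i++) {
        $id .= $chars[random_int(0, strlen($chars) - 1)];
    }
    return $id;
}

// Return the record for a reddit username, creating it on first sight.
function userRecordFor(string $username, array &$users): array {
    if (!isset($users[$username])) {
        $users[$username] = [
            'id'     => generateId(16),
            'reddit' => $username,
            'date'   => date('c'), // timestamp of when this record was created
        ];
    }
    return $users[$username];
}

$users  = [];
$first  = userRecordFor('example_user', $users);
$second = userRecordFor('example_user', $users); // same record, no duplicate
echo json_encode(array_values($users), JSON_PRETTY_PRINT), "\n";
```

Keying the `$users` array by username is what makes the "first time you encounter a user" check cheap: a repeat lookup returns the existing record instead of minting a second ID.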

Reddit stories have an initial post and then comments. Both posts and comments should be stored in a JSON file that looks like this:

[
  {
    "id": "24_CHARACTER_ID",        // Generated with the same code snippet as user IDs
    "localid": "50t2rc",            // The ID reddit uses to identify the post or comment
    "author": "ID_FROM_USER_TABLE", // The 16-character ID that you generated for this user
    "title": "Graffiti in abandoned Greece hotel", // NOTE: Posts have a "title"; comments have "content" but no title
    "link": "[url removed, login to view]", // Some posts have a link, some don't. No comments have links.
    "content": "RAW_CONTENT_OF_POST_OR_COMMENT", // All comments have content; only some posts do, e.g. posts of the type "Ask Me Anything" (AMA)
    "date": "TIMESTAMP_OF_WHEN_POSTED",
    "crawled": "TIMESTAMP_OF_WHEN_SPIDERED",
    "parent": "24_CHARACTER_ID_OF_PARENT_POST_OR_COMMENT", // Every comment has a parent: the ID you generated for its immediate parent, which could be a post or a comment. Posts have NO parent.
    "children": ["ARRAY", "OF", "CHILD", "IDS"] // The IDs you generated for the immediate children of this post or comment
  }
]
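As a sketch of mapping one decoded reddit item into this schema: the reddit field names used below (`id`, `title`, `url`, `selftext`, `body`, `created_utc`) follow reddit's public .json output but should be verified against live responses, and `mapItem()` plus the placeholder IDs are illustrative, not part of the spec.

```php
<?php
// Hedged sketch: map one decoded reddit post/comment into the target schema.
// Assumed reddit field names: id, title, url, selftext (posts), body
// (comments), created_utc -- verify against live .json output.
function mapItem(array $d, string $ourId, string $authorId, ?string $parentId): array {
    $isPost = isset($d['title']);
    $item = [
        'id'       => $ourId,      // 24-character generated ID
        'localid'  => $d['id'],    // reddit's own ID, e.g. "50t2rc"
        'author'   => $authorId,   // 16-character ID from the user table
        'date'     => (string)($d['created_utc'] ?? ''),
        'crawled'  => (string)time(),
        'parent'   => $parentId,   // null for posts
        'children' => [],          // filled in as child items are mapped
    ];
    if ($isPost) {
        $item['title'] = $d['title'];
        if (!empty($d['url']))      { $item['link']    = $d['url']; }
        if (!empty($d['selftext'])) { $item['content'] = $d['selftext']; }
    } else {
        $item['content'] = $d['body'] ?? '';
    }
    return $item;
}

$post = mapItem(
    ['id' => '50t2rc', 'title' => 'Graffiti in abandoned Greece hotel',
     'url' => 'http://example.com/photo', 'created_utc' => 1470000000],
    'aK2usyH1KpeyeQ3fma4wa2fb', 'USER_ID_PLACEHOLDER', null
);
echo json_encode($post, JSON_PRETTY_PRINT), "\n";
```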

Important:

- For stories with many comments, not all comments will appear in the first .json file. You may have to look at additional URLs to extract all the comments for a story.

- You must maintain referential integrity for the IDs. In other words, if you assign an ID of "aK2usyH1KpeyeQ3fma4wa2fb" to a comment, this same ID must be used whenever this comment is a parent or child of another comment or post.
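One way to keep the IDs referentially intact is a single map from reddit's `localid` to the generated ID, consulted before any "parent" or "children" field is written. `idFor()` below is an illustrative helper, and the 24-character placeholder generator again stands in for the employer's snippet.

```php
<?php
// Sketch: one shared map from reddit's localid to our generated 24-char ID.
// Routing every parent/child reference through this map guarantees the same
// comment always resolves to the same ID, whichever side it appears on.
function idFor(string $localid, array &$map): string {
    if (!isset($map[$localid])) {
        // Placeholder generator; the real project uses the employer's snippet.
        $map[$localid] = substr(bin2hex(random_bytes(12)), 0, 24);
    }
    return $map[$localid];
}

$map = [];
$asChild  = idFor('d2xyz1a', $map); // first seen in a parent's children list
$asParent = idFor('d2xyz1a', $map); // later seen as the parent of a reply
// Both lookups yield the identical 24-character ID.
```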

Programming language:

PHP. We will require all source code needed to make the spider work.

User interface:

Extremely simple. A webpage with a <textarea> to input a list of URLs to be spidered/scraped and a button that says "go". The return page should be the JSON for the users and the posts/comments.
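A minimal sketch of that single page, assuming one URL per line in the textarea; `spiderUrls()` is a hypothetical entry point for the scraper itself and is left commented out.

```php
<?php
// Minimal sketch of the single-page UI: a <textarea> of URLs and a "go" button.
function renderForm(): string {
    return '<form method="post">'
         . '<textarea name="urls" rows="12" cols="80"></textarea><br>'
         . '<button type="submit">go</button>'
         . '</form>';
}

if (($_SERVER['REQUEST_METHOD'] ?? 'GET') === 'POST') {
    // One URL per line; drop blank lines and surrounding whitespace.
    $urls = array_filter(array_map('trim', explode("\n", $_POST['urls'] ?? '')));
    header('Content-Type: application/json');
    // echo json_encode(spiderUrls($urls)); // hypothetical: users + posts/comments JSON
} else {
    echo renderForm(), "\n";
}
```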
