StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
primarykey
Id
8849170
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2012-01-13T10:35:35.733
FavoriteCount
0
LastActivityDate
2012-01-13T10:56:27.283
LastEditDate
2012-01-13T10:56:27.283
LastEditorUserId
816374
OwnerUserId
816374
ParentId
8848998
PostTypeId
2
Score
1
ViewCount
0
LastEditorDisplayName
text
Body
That really depends on what sort of websites and data you face. Option 1: DOM / XPath based If you need to parse tables and very detailed things, you need to parse each site with a separate algorithm. One way would be to parse each specific site into a DOM representation and address each value via XPath. This will take some time and is affected by structure changes, and if you have to scrape each of these sites this way it will cost you more than a morning. Option 2: Density based However, if you need to parse something like a blog article and only want to extract the article's text, there are pretty good density-based algorithms which work across HTML structure changes. One of those is described here: <a href="https://www2.cs.kuleuven.be/cwis/research/liir/publication_files/978AriasEtAl2009.pdf" rel="nofollow">https://www2.cs.kuleuven.be/cwis/research/liir/publication_files/978AriasEtAl2009.pdf</a> An implementation is provided here: <a href="http://apoc.sixserv.org/code/ce_density.rb" rel="nofollow">http://apoc.sixserv.org/code/ce_density.rb</a> You would have to port it to PHP. For blogs and news sites this is a really effective way. Option 3: Pragmatic If you do not care about layout and structure and only want the data provided, you might download the contents and simply strip the tags. However, this will leave a lot of noise in the resulting text. Update After your update to the post, you might proceed in this order: <ol> <li>Check which pages are illegal to scrape. On this list there are surely some which you will not be allowed to scrape.</li> <li>You will need much more time than a day. I would talk about this and the legal problems with the project lead.</li> <li>Choose one option per page.</li> </ol>
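As an illustration of Option 3, here is a minimal sketch of tag stripping. It is written in Python rather than PHP and uses only the standard library's html.parser module. The names TagStripper and strip_tags are hypothetical, and skipping script and style content is an added assumption, not something the answer prescribes.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text nodes and ignores the content of script/style elements."""

    SKIPPED = {"script", "style"}  # assumption: their content is never wanted

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the visible text of an HTML string, with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling strip_tags on a downloaded page returns one whitespace-normalized string. Navigation menus, footers and similar page furniture are kept along with the real content, which is the noise the answer warns about.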
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. PO Is it better to try for one mega screen scraper or split it into a scraper for different sites?
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USfyr
UserOwnerUserId
1. USfyr
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. PO Is it better to try for one mega screen scraper or split it into a scraper for different sites?
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTDownMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTAcceptedByOriginator
CommentsPostId
1. This table or related slice is empty.

Querying!

Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Rows can be related in a to-one or a to-many fashion. The caption of a to-many relation links to a list view showing a filtered view of the related table.

Try moving around until you find a non-empty to-many entry and click on the label to get to one. You can move back to the root by clicking on the database name in the header.