StackOverflow2013


plurals

POWin32.: How to scrape HTML without regular expressions?
primarykey
Id
1790575
data
AcceptedAnswerId
0
AnswerCount
12
ClosedDate
CommentCount
12
CommunityOwnedDate
CreationDate
2009-11-24T14:52:30.297
FavoriteCount
4
LastActivityDate
2009-12-02T13:16:28.687
LastEditDate
2017-05-23T12:34:00.643
LastEditorUserId
-1
OwnerUserId
12597
ParentId
0
PostTypeId
1
Score
15
ViewCount
2090
LastEditorDisplayName
text
Body
<p>A recent <a href="http://www.codinghorror.com/blog/archives/001311.html" rel="nofollow noreferrer">blog entry by Jeff Atwood</a> says that you should never parse HTML using regular expressions - yet doesn't give an alternative.</p> <p>i want to scrape search results, extracting values:</p> <pre><code><div class="used_result_container"> ... ... <div class="vehicleInfo"> ... ... <div class="makemodeltrim"> ... <a class="carlink" href="[Url]">[MakeAndModel]</a> ... </div> <div class="kilometers">[Kilometers]</div> <div class="price">[Price]</div> <div class="location"> <span class='locationText'>Location:</span>[Location] </div> ... ... </div> ... ... </div> ...and it repeats </code></pre> <p>You can see the values i want to extract, [enclosed in brackets]:</p> <ul> <li>Url</li> <li>MakeAndModel</li> <li>Kilometers</li> <li>Price</li> <li>Location</li> </ul> <p><em>Assuming</em> we <strong>accept</strong> the premise that parsing HTML with regular expressions:</p> <ul> <li>is generally a bad idea</li> <li><a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454">rapidly devolves into madness</a></li> </ul> <p>What's the way to do it?</p> <p>Assumptions:</p> <ul> <li>native Win32</li> <li>loose html</li> </ul> <hr> <p>Assumption clarifications:</p> <p><strong>Native Win32</strong></p> <ul> <li>.NET/CLR is not native Win32</li> <li>Java is not native Win32</li> <li>perl, python, ruby are not native Win32</li> <li>assume C++, in Visual Studio 2000, compiled into a native Win32 application</li> </ul> <p>Native Win32 applications <strong>can</strong> call library code:</p> <ul> <li>copied source code</li> <li>DLLs containing function entry points</li> <li>DLLs containing COM objects</li> <li>DLLs containing COM objects that are COM-callable wrappers (CCW) around managed .NET objects</li> </ul> <p><strong>Loose HTML</strong></p> <ul> <li>xml is not loose HTML</li> <li>xhtml is not loose HTML</li> <li>strict HTML is not 
loose HTML</li> </ul> <p>Loose HTML implies that the HTML is not well-formed xml (strict HTML is not well-formed xml anyway), and so an XML parser cannot be used. In reality i was presenting the assumption that any HTML parser must be generous in the HTML it accepts.</p> <hr> <h2>Clarification#2</h2> <p><strong><em>Assuming</em></strong> you like the idea of turning the HTML into a Document Object Model (DOM), how then do you access repeating structures of data? How would <em>you</em> walk a DOM tree? i need a DIV node with a class of <em>used_result_container</em>, which has a child DIV with a class of <em>vehicleInfo</em>. But the nodes don't necessarily have to be direct children of one another. </p> <p>It sounds like i'm trading one set of regular expression problems for another. If they change the structure of the HTML, i will have to re-write my code to match - as i would with regular expressions. And assuming we want to avoid those problems, because those are the problems with regular expressions, what do i do instead?</p> <p>And would i not be writing a regular expression parser for DOM nodes? i'm writing an engine to parse a string of objects, using an internal state machine and forward and back capture. No, there must be a better way - the way that Jeff alluded to.</p> <p>i intentionally kept the original question vague, so as not to lead people down the wrong path. i didn't want to imply that the solution, necessarily, had anything to do with:</p> <ul> <li>walking a DOM tree</li> <li>xpath queries</li> </ul> <h2>Clarification#3</h2> <p>The sample HTML i provided i trimmed down to the important elements and attributes. The mechanism i used to trim the HTML down was based on my internal bias toward regular expressions. i naturally think that i need various "<strong>sign-posts</strong>" in the HTML that i look for.</p> <p>So don't confuse the presented HTML with the entire HTML. 
Perhaps some other solution depends on the presence of <em>all</em> the original HTML.</p> <h2>Update 4</h2> <p>The only proposed solutions seem to involve using a library to convert the HTML into a Document Object Model (DOM). The question then would have to become: <strong>then what</strong>?</p> <p>Now that i have the DOM, what do i do with it? It seems that i still have to walk the tree with some sort of <em>regular DOM expression parser</em>, capable of forward matching and capture.</p> <p>In this particular case i need all the <em>used_result_container</em> <strong>DIV</strong> nodes which contain <em>vehicleInfo</em> DIV nodes as children. Any <em>used_result_container</em> DIV nodes that do not contain <em>vehicleInfo</em> as a child are not relevant.</p> <p>Is there a DOM regular expression parser with capture and forward matching? i don't think XPath can select higher level nodes based on criteria of lower level nodes:</p> <pre><code>\\div[@class="used_result_container" && .\div[@class="vehicleInfo"]]\* </code></pre> <p><strong>Note:</strong> i use XPath so infrequently that i cannot make up hypothetical xpath syntax very well.</p>
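The DOM walk the question describes (find every used_result_container DIV that has a vehicleInfo DIV somewhere beneath it) can be sketched in a few lines of C++. This is only a minimal sketch under stated assumptions: the `Node` struct, `hasDescendantDiv`, and `findContainers` are hypothetical names invented for illustration, not part of any real HTML library, and the tree is assumed to have already been built by some lenient HTML parser. It also assumes a C++17 compiler (for range-based `for` and a `std::vector` of the enclosing type).

```cpp
#include <string>
#include <vector>

// Hypothetical minimal DOM node; a real HTML parser would supply its own type.
struct Node {
    std::string tag;             // element name, e.g. "div"
    std::string cls;             // value of the class attribute, empty if absent
    std::vector<Node> children;  // child elements in document order
};

// True if any descendant of n (at any depth) is a <div> with the given class.
bool hasDescendantDiv(const Node& n, const std::string& cls) {
    for (const Node& child : n.children) {
        if (child.tag == "div" && child.cls == cls) return true;
        if (hasDescendantDiv(child, cls)) return true;
    }
    return false;
}

// Collect every <div class=outerCls> that contains a <div class=innerCls>
// somewhere beneath it, not necessarily as a direct child.
void findContainers(const Node& n, const std::string& outerCls,
                    const std::string& innerCls,
                    std::vector<const Node*>& out) {
    if (n.tag == "div" && n.cls == outerCls && hasDescendantDiv(n, innerCls))
        out.push_back(&n);
    for (const Node& child : n.children)
        findContainers(child, outerCls, innerCls, out);
}
```

Calling `findContainers(root, "used_result_container", "vehicleInfo", hits)` fills `hits` with pointers to the matching containers, from which the individual values could then be read. For comparison, XPath 1.0 can express the same "parent filtered by descendant" selection with a nested predicate, for example `//div[@class="used_result_container"][.//div[@class="vehicleInfo"]]`, provided the chosen parser exposes an XPath engine.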
Tags
<html><windows><regex><winapi><screen-scraping>
Title
Win32.: How to scrape HTML without regular expressions?
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USCommunity
UserOwnerUserId
1. USIan Boyd
plurals
PostLinksPostIdRelatedPostId
1. PL
  singulars
  LinkTypeLinkTypeId
  LTLinked
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
  singulars
  PostTypePostTypeId
  PTAnswer
2. PO
  singulars
  PostTypePostTypeId
  PTAnswer
3. PO
  singulars
  PostTypePostTypeId
  PTAnswer
VotesPostIdCreationDate
1. VO
  singulars
  PostPostId
  POWin32.: How to scrape HTML without regular expressions?
  UserUserId
  This table or related slice is empty.
  VoteTypeVoteTypeId
  VTUpMod
2. VO
  singulars
  PostPostId
  POWin32.: How to scrape HTML without regular expressions?
  UserUserId
  This table or related slice is empty.
  VoteTypeVoteTypeId
  VTUpMod
3. VO
  singulars
  PostPostId
  POWin32.: How to scrape HTML without regular expressions?
  UserUserId
  USMajkel
  VoteTypeVoteTypeId
  VTFavorite
CommentsPostId
