StackOverflow2013


plurals

POParsing HTML document: Regular expression or LINQ?
primarykey
Id
907563
data
AcceptedAnswerId
907639
AnswerCount
4
ClosedDate
CommentCount
3
CommunityOwnedDate
CreationDate
2009-05-25T17:58:35.713
FavoriteCount
0
LastActivityDate
2015-01-16T12:59:03.943
LastEditDate
2017-05-23T12:12:37.640
LastEditorUserId
-1
OwnerUserId
23199
ParentId
0
PostTypeId
1
Score
8
ViewCount
8744
LastEditorDisplayName
text
Body
<p>Trying to parse an HTML document and extract some elements (any links to text files).</p> <p>The current strategy is to load an HTML document into a string. Then find all instances of links to text files. It could be any file type, but for this question, it's a text file.</p> <p>The end goal is to have an <code>IEnumerable</code> list of string objects. That part is easy, but parsing the data is the question.</p> <pre><code><html>
<head><title>Blah</title>
</head>
<body>
<br/>
<div>Here is your first text file: <a href="http://myServer.com/blah.txt"></div>
<span>Here is your second text file: <a href="http://myServer.com/blarg2.txt"></span>
<div>Here is your third text file: <a href="http://myServer.com/bat.txt"></div>
<div>Here is your fourth text file: <a href="http://myServer.com/somefile.txt"></div>
<div>Thanks for visiting!</div>
</body>
</html>
</code></pre> <p>The initial approaches are:</p> <ul> <li>load the string into an XML document, and attack it in a Linq-To-Xml fashion.</li> <li>create a regex, to look for a string starting with <code>href=</code>, and ending with <code>.txt</code></li> </ul> <p>The questions are: </p> <ul> <li>what would that regex look like? I am a regex newbie, and this is part of my regex learning. </li> <li>which method would you use to extract a list of tags?</li> <li>which would be the most performant way?</li> <li>which method would be the most readable/maintainable?</li> </ul> <p><hr> <strong>Update:</strong> Kudos to <a href="https://stackoverflow.com/questions/907563/parsing-html-document-regular-expression-or-linq/907571#907571">Matthew</a> on the HTML Agility Pack suggestion. It worked just fine! The XPath suggestion works as well. I wish I could mark both answers as 'The Answer', but I obviously cannot. 
They are both valid solutions to the problem.</p> <p>Here's a C# console app using the regex suggested by <a href="https://stackoverflow.com/questions/907563/parsing-html-document-regular-expression-or-linq/907639#907639">Jeff</a>. It reads the string correctly and excludes any href that does not end with .txt. With the given sample, it correctly does NOT include the <code>.txt.snarg</code> file in the results (as provided in the HTML string function).</p> <pre><code>using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace ParsePageLinks
{
    class Program
    {
        static void Main(string[] args)
        {
            GetAllLinksFromStringByRegex();
        }

        static List<string> GetAllLinksFromStringByRegex()
        {
            string myHtmlString = BuildHtmlString();

            // Capture the value of any double-quoted href attribute that ends in ".txt".
            string txtFileExp = "href=\"([^\\\"]*\\.txt)\"";

            List<string> foundTextFiles = new List<string>();

            MatchCollection textFileLinkMatches = Regex.Matches(myHtmlString, txtFileExp, RegexOptions.IgnoreCase);
            foreach (Match m in textFileLinkMatches)
            {
                foundTextFiles.Add(m.Groups[1].ToString()); // this is your captured group
            }

            return foundTextFiles;
        }

        static string BuildHtmlString()
        {
            return new StringReader(@"<html><head><title>Blah</title></head><body><br/>
                <div>Here is your first text file: <a href=""http://myServer.com/blah.txt""></div>
                <span>Here is your second text file: <a href=""http://myServer.com/blarg2.txt""></span>
                <div>Here is your third text file: <a href=""http://myServer.com/bat.txt.snarg""></div>
                <div>Here is your fourth text file: <a href=""http://myServer.com/somefile.txt""></div>
                <div>Thanks for visiting!</div></body></html>").ReadToEnd();
        }
    }
}
</code></pre>
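Since the stated end goal is an IEnumerable of strings, the regex approach above could also be combined with LINQ. The following is only a minimal sketch, not part of the original post: the class name TxtLinkExtractor and the method ExtractTxtLinks are hypothetical, and the pattern is assumed to be equivalent to the one quoted above. Like that pattern, it only handles double-quoted href values that end exactly in .txt, so links with query strings or single quotes would be missed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class TxtLinkExtractor
{
    // Assumed equivalent of the post's pattern: capture an href value ending in ".txt".
    private static readonly Regex TxtHref =
        new Regex("href=\"([^\"]*\\.txt)\"", RegexOptions.IgnoreCase);

    // Hypothetical helper (not from the post): lazily yields each captured URL.
    public static IEnumerable<string> ExtractTxtLinks(string html)
    {
        return TxtHref.Matches(html)
                      .Cast<Match>()
                      .Select(m => m.Groups[1].Value);
    }

    static void Main()
    {
        // Two sample links; only the first ends in ".txt".
        string html = "<a href=\"http://myServer.com/blah.txt\">" +
                      "<a href=\"http://myServer.com/bat.txt.snarg\">";

        foreach (string link in ExtractTxtLinks(html))
        {
            Console.WriteLine(link); // prints http://myServer.com/blah.txt
        }
    }
}
```

Returning IEnumerable instead of a List keeps the evaluation deferred; a caller that needs a concrete list can append ToList().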
Tags
<c#><regex><linq><parsing><linq-to-xml>
Title
Parsing HTML document: Regular expression or LINQ?
singulars
PostAcceptedAnswerId
1. PO
  singulars
  PostTypePostTypeId
  PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USCommunity
UserOwnerUserId
1. USp.campbell
plurals
PostLinksPostIdRelatedPostId
1. PL
  singulars
  LinkTypeLinkTypeId
  LTLinked
PostLinksRelatedPostIdPostId
1. PL
  singulars
  LinkTypeLinkTypeId
  LTLinked
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
  singulars
  PostTypePostTypeId
  PTAnswer
2. PO
  singulars
  PostTypePostTypeId
  PTAnswer
3. PO
  singulars
  PostTypePostTypeId
  PTAnswer
VotesPostIdCreationDate
1. VO
  singulars
  PostPostId
  POParsing HTML document: Regular expression or LINQ?
  UserUserId
  This table or related slice is empty.
  VoteTypeVoteTypeId
  VTUpMod
2. VO
  singulars
  PostPostId
  POParsing HTML document: Regular expression or LINQ?
  UserUserId
  This table or related slice is empty.
  VoteTypeVoteTypeId
  VTUpMod
3. VO
  singulars
  PostPostId
  POParsing HTML document: Regular expression or LINQ?
  UserUserId
  This table or related slice is empty.
  VoteTypeVoteTypeId
  VTUpMod
CommentsPostId
