C# Web Parsing Conflict
<p>I'm running into quite a few problems in what should be a simple attempt to parse some HTML. As practice, I'm writing a multi-threaded web crawler that starts with a list of sites to crawl. The list gets handed down through a few classes, which should eventually return the content of the sites to my system. This seems rather straightforward, but I've had no luck with either of the following tasks:</p>
<p>A. Convert the content of a website (as a string read from an HttpWebRequest stream) to an HtmlDocument by using the HtmlDocument.Write() method. (I cannot create a new instance of HtmlDocument, which doesn't make much sense to me.)</p>
<p>or</p>
<p>B. Collect an HtmlDocument via a WebBrowser instance.</p>
<p>Here is my code as it exists; any advice would be great.</p>
<pre><code>public void Start()
{
    if (this.RunningThread == null)
    {
        Console.WriteLine("Executing SiteCrawler for " + SiteRoot.DnsSafeHost);

        this.RunningThread = new Thread(this.Start);
        this.RunningThread.SetApartmentState(ApartmentState.STA);
        this.RunningThread.Start();
    }
    else
    {
        try
        {
            WebBrowser BrowserEmulator = new WebBrowser();
            BrowserEmulator.Navigate(this.SiteRoot);

            HtmlElementCollection LinkCollection = BrowserEmulator.Document.GetElementsByTagName("a");
            List&lt;PageCrawler&gt; PageCrawlerList = new List&lt;PageCrawler&gt;();

            foreach (HtmlElement Link in LinkCollection)
            {
                PageCrawlerList.Add(new PageCrawler(Link.GetAttribute("href"), true));
                continue;
            }

            return;
        }
        catch (Exception e)
        {
            throw new Exception("Exception encountered in SiteCrawler: " + e.Message);
        }
    }
}
</code></pre>
<p>This code seems to do nothing when it passes over the Navigate method. I've tried letting it open in a new window, which pops up a new instance of IE and navigates to the specified address, but not before my program has already stepped over the Navigate call. I've tried waiting for the browser to be "not busy", but it never seems to report itself as busy in the first place.</p>
<p>I've also tried creating a new document via Browser.Document.OpenNew() so that I could populate it with data from a WebRequest stream. As you might expect, I get a null reference exception when I try to reach through the Document part of that statement. From the research I've done, this appears to be the only way to create a new HtmlDocument.</p>
<p>As you can see, this method is intended to kick off a PageCrawler for every link in a specified page. I'm sure I could use an HttpWebRequest, collect the data from the stream, and then parse the HTML character by character to find all of the links, but that is far more work than should be necessary.</p>
<p>If anyone has any advice, it would be greatly appreciated. Thank you.</p>
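<p>To make the manual fallback mentioned above concrete, here is a minimal sketch of pulling links out of an HTML string without going through HtmlDocument. It assumes the page content has already been read into a string. The names LinkExtractor and ExtractLinks are hypothetical and not part of the crawler above. The regular expression is only a rough stand-in for real HTML parsing: it handles quoted href values on anchor tags and nothing more.</p>

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original crawler.
public static class LinkExtractor
{
    // Matches a double- or single-quoted href attribute inside an <a ...> tag.
    // Assumption: a regex is "good enough" here; it is not a full HTML parser.
    private static readonly Regex AnchorHref = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns every anchor target found in html, resolved against baseUri.
    public static List<Uri> ExtractLinks(string html, Uri baseUri)
    {
        var links = new List<Uri>();
        foreach (Match match in AnchorHref.Matches(html))
        {
            string href = match.Groups["url"].Value;
            Uri absolute;
            // Resolve relative links against the page address; skip malformed ones.
            if (Uri.TryCreate(baseUri, href, out absolute))
            {
                links.Add(absolute);
            }
        }
        return links;
    }
}
```

<p>Under these assumptions, something like LinkExtractor.ExtractLinks(html, this.SiteRoot) would return one Uri per link, and each could be handed to a PageCrawler in place of the Link.GetAttribute("href") value used above.</p>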