If your problem is with waiting for the response from the web request, then the engine or technique you use to parse it probably has a lot less to do with performance than simply waiting for each response synchronously. If you have a long list of pages you're scraping, you can do better by running simultaneous requests asynchronously. It's not clear that's what is going on here, though.

Try [CsQuery](https://github.com/jamietre/CsQuery) (also on [NuGet](http://www.nuget.org/packages/CsQuery)), a new C# port of jQuery which should do what you want. It has methods for grabbing data synchronously and asynchronously, so if you did want to start parallel web requests, it can do that out of the box. At the most basic level, the synchronous version looks like this:

```csharp
CQ doc = CQ.CreateFromUrl("http://www.jquery.com");
string allStuffInsideTag = doc["sometag"].Contents().RenderSelection();
```

It works like jQuery: the `CQ` object is the same as a jQuery object. `Contents` is the jQuery method that returns all children of an element; `RenderSelection` is a CsQuery method that renders the full HTML of every element in the selection set. So this returns the full text and HTML of everything inside every `sometag` block.

CsQuery also indexes each document for all common selector types and is much faster than HTML Agility Pack.
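
As a concrete illustration of the jQuery-style API described above (a sketch, not part of the original answer: the URL and `sometag` selector are the answer's placeholders, and `Text()` is assumed to mirror jQuery's `text()` since CsQuery ports the jQuery API):

```csharp
using System;
using CsQuery;

class CsQuerySketch
{
    static void Main()
    {
        // Download and parse the page in one step, as in the answer above.
        CQ doc = CQ.CreateFromUrl("http://www.jquery.com");

        // Full HTML of everything inside each <sometag> element,
        // exactly as the answer describes.
        string allStuffInsideTag = doc["sometag"].Contents().RenderSelection();

        // Assumption: Text() mirrors jQuery's text(), returning the combined
        // text content of the matched elements without the markup.
        string textOnly = doc["sometag"].Text();

        Console.WriteLine(allStuffInsideTag);
        Console.WriteLine(textOnly);
    }
}
```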

Comments:

1. Thank you. TBH I am not sure if the problem is with waiting for the response or with the processing. Now you mention it, though, it is much more likely to be the requests that are taking the time. I don't have a long list (at least when time is an issue); there will be a max of 10 requests. Doing the requests async might be an option, although it may get a bit messy as I would need them all to return a value before moving on. CsQuery certainly looks like an elegant solution, and impressive if it is faster than Agility Pack as you say.
2. It should not be too hard to manage waiting for all the async callbacks. In C# 5 you can use `Task.WaitAll`: http://msdn.microsoft.com/en-us/library/dd270695.aspx (see the sketch after these comments). Personally I'm still using VS2010, so I implemented something similar in CsQuery; see `When.All` under https://github.com/jamietre/csquery#promises. CsQuery is much faster than HAP for selectors; I just ran some numbers: http://blog.outsharked.com/2012/06/csquery-performance-vs-fizzler.html. But again, depending on what you're extracting and the size of the pages, you could just be waiting for the server.
3. Hi, sorry, I've not been on for a while because of trying to sort out some car trouble. I think I am probably going to request and process the one time-critical page, then process the rest after sending a response to the user. I will then see what the response time is like and will keep your suggestions in mind if I need to do further optimising (which I will do at some point). Many of your suggestions will also be very helpful in other projects, so thank you very much for your input.
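
To make the `Task.WaitAll` suggestion in comment 2 concrete, here is a minimal sketch that fetches several pages in parallel by wrapping the synchronous `CQ.CreateFromUrl` call in tasks, then blocks until they all return. The URL list and the `sometag` selector are placeholders, and this is one possible arrangement rather than the code either commenter actually wrote:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using CsQuery;

class ParallelScrapeSketch
{
    static void Main()
    {
        // Placeholder URLs; the question mentions a maximum of about 10 requests.
        string[] urls =
        {
            "http://example.com/page1",
            "http://example.com/page2",
            "http://example.com/page3"
        };

        // Start one task per page so the downloads overlap instead of
        // running one after another.
        Task<CQ>[] downloads = urls
            .Select(url => Task.Factory.StartNew(() => CQ.CreateFromUrl(url)))
            .ToArray();

        // Block until every request has returned a value, addressing the
        // concern in comment 1 about needing all results before moving on.
        Task.WaitAll(downloads);

        foreach (Task<CQ> download in downloads)
        {
            // "sometag" is the placeholder selector from the answer.
            string html = download.Result["sometag"].Contents().RenderSelection();
            Console.WriteLine(html.Length);
        }
    }
}
```

On C# 5 with .NET 4.5, the same shape can be written non-blockingly with `await Task.WhenAll(...)`; CsQuery's `When.All` promise helper linked in comment 2 serves the same purpose on VS2010.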
 
