  1. ASP.NET Web Page Mirror, Replacing all relative URLs with absolute Paths
    <p>I'm trying to build an ASP.NET page that can crawl web pages and display them correctly, with all relevant HTML elements edited to include absolute URLs where appropriate.</p>
    <p>This question has been partially answered here: <a href="https://stackoverflow.com/a/2719712/696638">https://stackoverflow.com/a/2719712/696638</a></p>
    <p>Using a combination of the answer above and this blog post <a href="http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/" rel="nofollow noreferrer">http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/</a> I have built the following:</p>
<pre><code>using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

public partial class Crawler : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        Response.Clear();

        // Download the page named in the "path" query-string parameter
        string url = Request.QueryString["path"];
        WebClient client = new WebClient();
        byte[] requestHTML = client.DownloadData(url);
        string sourceHTML = new UTF8Encoding().GetString(requestHTML);

        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(sourceHTML);

        // SelectNodes returns null when nothing matches, so guard before iterating
        HtmlNodeCollection links = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                HtmlAttribute att = link.Attributes["href"];
                string href = att.Value;
                if (string.IsNullOrEmpty(href)) continue;

                // ignore javascript on buttons using a tags
                if (href.StartsWith("javascript", StringComparison.InvariantCultureIgnoreCase)) continue;

                // Resolve relative URLs against the URL of the downloaded page
                Uri urlNext = new Uri(href, UriKind.RelativeOrAbsolute);
                if (!urlNext.IsAbsoluteUri)
                {
                    urlNext = new Uri(new Uri(url), urlNext);
                    att.Value = urlNext.ToString();
                }
            }
        }

        Response.Write(htmlDoc.DocumentNode.OuterHtml);
    }
}
</code></pre>
    <p>This only replaces the <code>href</code> attribute for links. By expanding this, I'd like to know what the most efficient way would be to cover:</p>
    <ul>
      <li><code>href</code> attribute for <code>&lt;a&gt;</code> elements</li>
      <li><code>href</code> attribute for <code>&lt;link&gt;</code> elements</li>
      <li><code>src</code> attribute for <code>&lt;script&gt;</code> elements</li>
      <li><code>src</code> attribute for <code>&lt;img&gt;</code> elements</li>
      <li><code>action</code> attribute for <code>&lt;form&gt;</code> elements</li>
    </ul>
    <p>Are there any others people can think of?</p>
    <p>Could these be found using a single call to <code>SelectNodes</code> with a monster XPath, or would it be more efficient to call <code>SelectNodes</code> multiple times and iterate through each collection?</p>
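    <p>For illustration, the single-call variant could look roughly like the sketch below. It assumes HtmlAgilityPack's <code>SelectNodes</code> accepts an XPath union (<code>|</code>) of the five element/attribute pairs listed above. The names <code>UrlRewriter</code>, <code>MakeUrlsAbsolute</code>, <code>UrlAttributes</code> and <code>Query</code> are hypothetical and not part of the original code. It does not measure whether one union query is faster than several separate calls.</p>

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper: rewrites URL-bearing attributes to absolute URLs.
public static class UrlRewriter
{
    // Element name -> attribute that holds a URL (the pairs listed in the question).
    private static readonly Dictionary<string, string> UrlAttributes =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "a", "href" },
            { "link", "href" },
            { "script", "src" },
            { "img", "src" },
            { "form", "action" }
        };

    // One XPath union covering every pair above.
    private const string Query =
        "//a[@href] | //link[@href] | //script[@src] | //img[@src] | //form[@action]";

    public static string MakeUrlsAbsolute(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(Query);
        if (nodes == null) return html;

        foreach (HtmlNode node in nodes)
        {
            string attrName = UrlAttributes[node.Name];
            string value = node.GetAttributeValue(attrName, "");
            if (string.IsNullOrEmpty(value)) continue;

            // Skip javascript: pseudo-links, as in the original loop.
            if (value.StartsWith("javascript", StringComparison.OrdinalIgnoreCase)) continue;

            // Resolves relative values against baseUri; absolute values pass through.
            Uri absolute;
            if (Uri.TryCreate(baseUri, value, out absolute))
                node.SetAttributeValue(attrName, absolute.AbsoluteUri);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

    <p>In <code>Page_Load</code> this would be called as <code>Response.Write(UrlRewriter.MakeUrlsAbsolute(sourceHTML, new Uri(url)));</code>. Keeping the element-to-attribute map in one place means further pairs can be added by extending both the dictionary and the XPath string.</p>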