StackOverflow2013


How do search engines find relevant content?
<p>How does Google find relevant content when it's parsing the web?</p>
<p>Let's say, for instance, Google uses the PHP native DOM library to parse content. What methods would it use to find the most relevant content on a web page?</p>
<p>My thought is that it would find all paragraphs, order them by length, and then use possible search strings and query parameters to work out how relevant each paragraph is as a percentage.</p>
<p>Let's say we had this URL:</p>
<pre><code>http://domain.tld/posts/stackoverflow-dominates-the-world-wide-web.html </code></pre>
<p>From that URL I would work out that the HTML file name is highly relevant, so I would then see how closely that string matches each paragraph in the page.</p>
<p>A really good example of this is Facebook share: when you share a page, Facebook quickly crawls the link and brings back images, content, and so on.</p>
<p>I was thinking that some sort of calculation-based method would be best, working out the percentage of relevancy from surrounding elements and metadata.</p>
<p>Are there any books or other information on best practices for content parsing that cover how to get the best content from a site, any algorithms that are discussed, or any in-depth replies?</p>
<hr>
<p>Some ideas that I have in mind are:</p>
<ul>
<li>Find all paragraphs and order by plain text length</li>
<li>Somehow find the width and height of <code>div</code> containers and order by (W+H) - @Benoit</li>
<li>Check meta keywords, title and description, and check their relevancy within the paragraphs</li>
<li>Find all image tags and order by size and by the number of nodes between each image and the main paragraph</li>
<li>Check for object data, such as videos, and count the nodes between it and the largest paragraph / content div</li>
<li>Work out resemblances from previously parsed pages</li>
</ul>
<hr>
<p>The reason why I need this information:</p>
<p>I'm building a website where webmasters send us links and we list their pages. I want the webmaster to submit a link, after which I crawl that page to find the following information:</p>
<ul>
<li>An image (if applicable)</li>
<li>A paragraph of under 255 characters from the best slice of text</li>
<li>Keywords that would be used for our search engine (Stack Overflow style)</li>
<li>Metadata: keywords, description, all images, change-log (for moderation and administration purposes)</li>
</ul>
<p>I hope you can understand that this is not for a search engine, but the way search engines tackle content discovery is in the same context as what I need it for.</p>
<p>I'm not asking for trade secrets; I'm asking what your personal approach to this would be.</p>
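<p>To make the first idea and the URL idea concrete, here is a rough sketch of what I mean. It is only an illustration, not how any real search engine works: it uses Python's standard <code>html.parser</code> rather than the PHP DOM library, and the names <code>ParagraphCollector</code>, <code>slug_keywords</code> and <code>rank_paragraphs</code> are made up for this example. Each paragraph is scored by the fraction of URL file-name words it contains, with plain text length as the tie-breaker.</p>

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class ParagraphCollector(HTMLParser):
    """Collects the whitespace-normalised plain text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p>

    def _flush(self):
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def slug_keywords(url):
    """Split the last path segment of a URL into a set of lowercase words."""
    slug = urlparse(url).path.rsplit("/", 1)[-1]
    slug = slug.rsplit(".", 1)[0]  # drop a file extension such as .html
    return {word for word in re.split(r"[^a-z0-9]+", slug.lower()) if word}


def rank_paragraphs(html, url):
    """Return (score, paragraph) pairs, most relevant first.

    score is the fraction of URL keywords found in the paragraph;
    ties are broken by paragraph length (longer first).
    """
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    keywords = slug_keywords(url)
    ranked = []
    for text in parser.paragraphs:
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        score = len(keywords & words) / len(keywords) if keywords else 0.0
        ranked.append((score, len(text), text))
    ranked.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [(score, text) for score, _, text in ranked]


if __name__ == "__main__":
    page = (
        "<p>Short intro.</p>"
        "<p>Stack Overflow dominates the world wide web, people say.</p>"
    )
    url = "http://domain.tld/posts/stackoverflow-dominates-the-world-wide-web.html"
    for score, text in rank_paragraphs(page, url):
        print(f"{score:.2f}  {text}")
```

<p>Common words such as "the" count as keywords here, so a real version would probably want a stop-word list and some weighting for the title and meta tags as well.</p>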
