Testing if an HTML table is used for layout vs. data?
    <p>This is more of a web scraping question. What are the recognized approaches to automatically determining whether a <code>&lt;table&gt;</code> is used for layout or for <em>data</em> in an HTML document you've never seen before?</p>
    <p>I'd like to be able to pass any HTML file as a string into some function that returns all of the <em>data tables</em> in the page but ignores tables used purely for layout. Sites like <a href="http://news.ycombinator.com/newcomments" rel="nofollow">http://news.ycombinator.com/newcomments</a> use HTML tables for layout, which makes this tricky.</p>
    <p>This function shouldn't be tailored to any specific website's DOM structure, so it should work with any HTML string (or have as high a success rate as possible).</p>
    <p>Are there any algorithms or checks people have figured out over the years that can distinguish between layout and data tables? It should be possible; it's just a matter of writing down all the variables and going through trial and error, which I imagine many people have already mapped out somewhere.</p>
    <p>I don't necessarily need the function itself (that would be awesome, but I imagine it would require a lot of fine-tuning). I'm just looking for some tried strategies.</p>
    <p><strong>Update</strong></p>
    <p>Here's a good start (thanks @JaredFarrish):</p>
    <ul> <li><a href="http://www2002.org/CDROM/refereed/199/" rel="nofollow">A Machine Learning Based Approach for Table Detection on The Web</a></li> <li>Keywords: Table Detection, Layout Analysis, Machine Learning, Decision tree, Support Vector Machine, Information Retrieval</li> </ul>
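To make the kind of function described in the question concrete, here is a minimal rule-based sketch using only Python's standard `html.parser`. It is not the method from the linked paper, which trains classifiers on extracted features. The signals used here (a `role="presentation"` attribute, the presence of `<th>`, `<thead>` or `<caption>`, nested tables, and consistent row widths) are assumptions chosen for illustration, and the names `TableClassifier`, `looks_like_data` and `classify_tables` are hypothetical.

```python
from html.parser import HTMLParser


class _TableStats:
    """Signals collected for one <table> element."""

    def __init__(self, attrs):
        self.attrs = dict(attrs)
        self.has_header = False        # saw <th>, <thead> or <caption>
        self.has_nested_table = False  # contains another <table>
        self.row_widths = []           # number of cells in each <tr>


class TableClassifier(HTMLParser):
    """Collects per-table statistics, tracking nesting with a stack."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tables, innermost last
        self.finished = []  # stats for closed tables, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.stack:
                self.stack[-1].has_nested_table = True
            self.stack.append(_TableStats(attrs))
        elif self.stack:
            table = self.stack[-1]
            if tag in ("th", "thead", "caption"):
                table.has_header = True
            if tag == "tr":
                table.row_widths.append(0)
            elif tag in ("td", "th") and table.row_widths:
                table.row_widths[-1] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            self.finished.append(self.stack.pop())


def looks_like_data(table):
    """Assumed heuristic: explicit markup first, then grid regularity."""
    role = (table.attrs.get("role") or "").lower()
    if role in ("presentation", "none"):
        return False  # author explicitly marked the table as layout
    if table.has_header:
        return True   # header or caption markup suggests tabular data
    if table.has_nested_table:
        return False  # nesting is typical of layout tables
    widths = [w for w in table.row_widths if w > 0]
    # A regular grid: at least 2 rows, at least 2 columns, equal widths.
    return len(widths) >= 2 and min(widths) >= 2 and len(set(widths)) == 1


def classify_tables(html):
    """Return one bool per closed <table>, in the order the tables close."""
    parser = TableClassifier()
    parser.feed(html)
    parser.close()
    return [looks_like_data(t) for t in parser.finished]
```

Because results are reported in closing order, an inner table appears before the table that contains it, and a table that is never closed is not reported at all. Rules like these will misclassify many real pages; in practice the same signals could serve as input features for a trained classifier instead of hard thresholds.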