
    <p>That really depends on what sort of websites and data you face.</p> <p><strong>Option 1: DOM / XPath based</strong></p> <p>If you need to parse tables and other very detailed content, you need a separate algorithm for each site. One way is to parse each site into a DOM representation and address each value via XPath. This takes some time and breaks whenever the page structure changes, and if you have to scrape every one of these sites this way it will cost you more than a morning.</p> <p><strong>Option 2: Density based</strong></p> <p>However, if you need to parse something like a blog article and only want to extract the article's text, there are pretty good density-based algorithms that keep working across HTML structure changes. One of them is described here: <a href="https://www2.cs.kuleuven.be/cwis/research/liir/publication_files/978AriasEtAl2009.pdf" rel="nofollow">https://www2.cs.kuleuven.be/cwis/research/liir/publication_files/978AriasEtAl2009.pdf</a></p> <p>An implementation is provided here: <a href="http://apoc.sixserv.org/code/ce_density.rb" rel="nofollow">http://apoc.sixserv.org/code/ce_density.rb</a></p> <p>You would have to port it to PHP. For blogs and news sites this is a really effective approach.</p> <p><strong>Option 3: Pragmatic</strong></p> <p>If you do not care about layout and structure and only want the data itself, you can download the contents and simply strip the tags. However, the resulting text will contain a lot of noise.</p> <p><strong>Update</strong></p> <p>Given the update to your post, I would proceed in this order:</p> <ol> <li><p>Check which pages you are legally allowed to scrape. Some of the sites on your list will certainly not permit it.</p></li> <li><p>You will need much more time than a day. I would raise this, along with the legal problems, with the project lead.</p></li> <li><p>Choose one option per page.</p></li> </ol>
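<p>To make Options 2 and 3 more concrete, here is a minimal sketch in Python (used only for illustration; the same idea would have to be ported to PHP). It relies only on the standard library's <code>html.parser</code>. The names <code>BlockCollector</code>, <code>collect_blocks</code>, <code>strip_tags</code> and <code>densest_block</code> are hypothetical, and so is the choice of block tags. The scoring rule (text length divided by the number of inline tags in a block) is a deliberately simplified stand-in for a density measure. It is not the algorithm from the linked paper or from <code>ce_density.rb</code>.</p>

```python
from html.parser import HTMLParser

# Tags whose contents are never visible text.
SKIP_TAGS = {"script", "style"}
# Tags treated as block boundaries (an assumption made for this sketch).
BLOCK_TAGS = {"p", "div", "li", "td", "article", "section", "h1", "h2", "h3"}


class BlockCollector(HTMLParser):
    """Collects visible text per block and counts the other tags inside it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []       # list of (text, inline_tag_count)
        self._text = []
        self._tags = 0
        self._skip_depth = 0

    def flush(self):
        # Normalise whitespace and store the block if it has any text.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._tags))
        self._text, self._tags = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        else:
            self._tags += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._text.append(data)


def collect_blocks(html):
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text after the last block tag
    return parser.blocks


def strip_tags(html):
    """Option 3: keep all visible text, noise included."""
    return " ".join(text for text, _ in collect_blocks(html))


def densest_block(html):
    """Option 2 (simplified): return the block with the most text per tag."""
    blocks = collect_blocks(html)
    if not blocks:
        return ""
    return max(blocks, key=lambda b: len(b[0]) / (1 + b[1]))[0]


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<div><a href='/'>Home</a> <a href='/about'>About</a></div>"
    "<p>Scraping a blog article usually means keeping the long paragraph "
    "and dropping the menu.</p>"
    "<script>var x = 1;</script>"
    "</body></html>"
)

print(strip_tags(sample))     # menu text and article text together
print(densest_block(sample))  # only the article paragraph
```

<p>On the sample page, <code>strip_tags</code> keeps the navigation links alongside the article, which is the noise Option 3 warns about, while <code>densest_block</code> returns only the paragraph. Real pages spread an article over many blocks, so a practical version would have to keep a run of neighbouring high-scoring blocks rather than a single one.</p>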
 
