<p>This is a common "problem" for web crawlers and bots: some pages contain content that is added dynamically from other sources. That content can be loaded on page load or triggered by an event (as in your example, by scrolling down). When the target page is downloaded and scraped, the DOM in most cases does not include the HTML elements for that externally loaded data.</p>
<p>What I suggest is to identify the source URL of this data, which you can do by carefully checking the scripts in the DOM, and then to request it as a secondary source that contains the missing data you need.</p>
<p>Edit:</p>
<p>In the example you linked, it is easy:</p>
<pre><code>- Install Firebug.
- Scroll down the page to find the script that fires the request.
- Now you can see the link and the variables used to add the content dynamically.
</code></pre>
<p>www.healthtap.com/#topics/Women%27s%20health:</p>
<p>Dynamic response link:</p>
<p><a href="https://www.healthtap.com/topics/Women%27s%20health.json" rel="nofollow">https://www.healthtap.com/topics/Women%27s%20health.json</a>?extended_categories=1&amp;auth_token=false&amp;per_page=8&amp;page=7&amp;per_page=8&amp;auth_token=false&amp;generate_token=true</p>
<p>As you can see, there are a few parameters you can play with:</p>
<pre><code>1/ topics/ + the page's first value name + .json?
2/ per_page=num -&gt; how many results to return
3/ generate_token=true -&gt; it is a security value, but just change it to false and it works fine
</code></pre>
<p>Now you can play with this link, load all the data you need, and merge it with the main page you crawled.</p>
<p>Tested!</p>
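<p>The idea above can be sketched in Python roughly as follows. This is only a minimal sketch: the helper names (build_page_url, fetch_json, fetch_pages) are hypothetical, the query parameters are copied from the request shown above, and it assumes the endpoint still answers with JSON. The shape of the JSON payload is not known here, so each page is returned as parsed but otherwise untouched.</p>

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Base path taken from the request observed in the browser tools.
BASE_URL = "https://www.healthtap.com/topics/"


def build_page_url(topic, page, per_page=8):
    """Build the JSON URL for one page of a topic (hypothetical helper)."""
    query = urlencode({
        "extended_categories": 1,
        "auth_token": "false",
        "per_page": per_page,        # how many results to return
        "page": page,                # which page of results to request
        "generate_token": "false",   # changed from true, as described above
    })
    # quote() percent-encodes the apostrophe and the space in the topic name.
    return f"{BASE_URL}{quote(topic)}.json?{query}"


def fetch_json(url):
    """Download a URL and parse the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_pages(topic, pages, fetch=fetch_json):
    """Return the parsed payload of every requested page, in order.

    `fetch` can be swapped out, for example to add headers or to test offline.
    """
    return [fetch(build_page_url(topic, page)) for page in pages]
```

<p>For example, fetch_pages("Women's health", range(1, 8)) would request pages 1 through 7; the results can then be merged with the data scraped from the main page.</p>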