
multi-thread, multi-curl crawler in PHP
<p><strong>Hi everyone once again!</strong></p>

<p>We need some help to develop and implement multi-curl functionality in our crawler. We have a huge array of "links to be scanned" and we loop through them with a foreach.</p>

<p>Let's use some pseudocode to illustrate the logic:</p>

<pre><code>1) While ($links_to_be_scanned &gt; 0)
2)   Foreach ($links_to_be_scanned as $link_to_be_scanned)
3)     Scan_the_link() and run some other functions
4)     Extract the new links from the xdom
5)     Push the new links into $links_to_be_scanned
6)     Push the current link into $links_already_scanned
7)     Remove the current link from $links_to_be_scanned
</code></pre>

<p><strong>Now, we need to define a maximum number of parallel connections and be able to run this process for each link in parallel.</strong></p>

<p>I understand that we're going to have to create a $links_being_scanned array or some kind of queue.</p>

<p>I'm really not sure how to approach this problem, to be honest; <strong>if anyone could provide a snippet or an idea to solve it, it would be greatly appreciated.</strong></p>

<p><strong>Thanks in advance! Chris;</strong></p>

<p>Extended:</p>

<p>I just realized that the multi-curl itself is not the tricky part, but rather the amount of operations done with each link after the request. Even with multi-curl, I would eventually have to find a way to run all of these operations in parallel. The whole algorithm described below would have to run in parallel.</p>

<p>So now, rethinking it, we would have to do something like this:</p>

<pre><code>While (there are links to be scanned)
  Foreach ($links_to_be_scanned as $link)
    If (there are fewer than 10 scanners running)
      Launch_a_new_scanner($link)
      Remove the link from the $links_to_be_scanned array
      Push the link into the $links_on_queue array
    Endif
</code></pre>

<p><strong>And each scanner does the following (this should run in parallel):</strong></p>

<pre><code>Create an object with the given link
Send a curl request to the given link
Create a dom and an Xdom with the response body
Perform other operations over the response body
Remove the link from the $links_on_queue array
Push the link into the $links_already_scanned array
</code></pre>

<p><strong>I assume we could approach this by creating a new PHP file with the scanner algorithm and using pcntl_fork() for each parallel process?</strong></p>

<p>Since even using multi-curl, I would eventually have to wait in a loop on a regular foreach structure for the other processes.</p>

<p>I assume I would have to approach this using fsockopen or pcntl_fork.</p>

<p><strong>Suggestions, comments, partial solutions, and even a "good luck" will be more than appreciated!</strong></p>

<p>Thanks a lot!</p>
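<p>The rolling-queue idea described above (keep at most N transfers in flight, harvest finished ones, feed new links back in) can be sketched with PHP's curl_multi API. This is a sketch, not a drop-in solution: it assumes PHP 8+ (where curl handles are objects, so <code>spl_object_id()</code> can key the queue), and the <code>$process_body</code> callback is a hypothetical stand-in for the dom/Xdom work done on each response.</p>

```php
<?php
// Sketch of a rolling curl_multi crawler with a bounded number of
// parallel connections. $process_body is a hypothetical callback that
// receives ($link, $body) and returns an array of newly found links.
function crawl(array $seed_links, int $max_parallel = 10, ?callable $process_body = null): array
{
    $to_scan = $seed_links;   // $links_to_be_scanned
    $queued  = [];            // $links_on_queue, keyed by handle object id
    $scanned = [];            // $links_already_scanned, keyed by link
    $mh = curl_multi_init();

    $add = function (string $link) use ($mh, &$queued): void {
        $ch = curl_init($link);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT        => 15,
        ]);
        curl_multi_add_handle($mh, $ch);
        $queued[spl_object_id($ch)] = $link;
    };

    while ($to_scan || $queued) {
        // Top the pool up to $max_parallel running transfers.
        while ($to_scan && count($queued) < $max_parallel) {
            $add(array_shift($to_scan));
        }

        curl_multi_exec($mh, $running);
        curl_multi_select($mh, 1.0); // wait for activity instead of busy-looping

        // Harvest finished transfers and feed new links back into the queue.
        while ($info = curl_multi_info_read($mh)) {
            $ch   = $info['handle'];
            $link = $queued[spl_object_id($ch)];
            $body = curl_multi_getcontent($ch); // may be null on failure

            if ($process_body !== null) {
                foreach ($process_body($link, $body) as $new_link) {
                    if (!isset($scanned[$new_link]) && !in_array($new_link, $to_scan, true)) {
                        $to_scan[] = $new_link;
                    }
                }
            }

            $scanned[$link] = true;
            unset($queued[spl_object_id($ch)]);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
    }

    curl_multi_close($mh);
    return array_keys($scanned);
}
```

<p>Note that the callback runs in the same process, so if the per-link dom/Xdom work is CPU-heavy it will still serialize; in that case forking worker processes with pcntl_fork (or splitting the scanner into a separate PHP script) remains the way to parallelize the post-processing, with curl_multi handling only the network side.</p>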