How can I extract external referrers from an Apache access log with Perl?
<p>I need help getting a regex to work that picks out of an Apache access log all referrers that <strong>come from real links offsite</strong> and that are valid referrals from real people rather than bots or spiders. I'm working in Perl.</p>
<p>This bit of code <em>almost</em> works already (the access log is opened with the filehandle <code>$fh</code>):</p>
<pre><code>my $totalreferals = 0;
while ( my $line = &lt;$fh&gt; ) {
    # Count every GET whose referrer is neither empty ("-") nor internal
    if ( $line !~ m!
            \[\d{2}/\w{3}/\d{4}(?::\d\d){3}.+?\]
            \s"GET\s\S+\sHTTP/\d\.\d"
            \s\S+
            \s\S+
            \s("-"|"http://(www\.|)mywebsite\.com.*")
        !xi )
    {
        $totalreferals++;
    }

    # Skip anything that is not a Google search; capture date, path, status and query
    $line =~ m!
            \[(\d{2}/\w{3}/\d{4})(?::\d\d){3}.+?\]
            \s"GET\s(\S+)\sHTTP/\d\.\d"
            \s(\S+)
            \s\S+
            \s"http://w{1,3}\.google\.
            (?:[a-z]{2}|com?\.[a-z]{2}|com)\.?/
            [^\"]*q=([^\"&amp;]+)[^\"]*"
        !xi or next;

    my ( $datestr, $path, $status, $query ) = ( $1, $2, $3, $4 );

    # ...
    # do other stuff
    # ...
}
</code></pre>
<p>The first regex successfully excludes internal links and records that have no referrer, but the resulting <code>$totalreferals</code> is still far too large.</p>
<p>Examples of log lines that the first regex counts but that I want excluded:</p>
<pre><code>61.247.221.45 - - [02/Jan/2009:20:51:41 -0600] "GET /oil-paintings/section.php/2451/0 HTTP/1.1" 200 85856 "-" "Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)"
</code></pre>
<p>-- Appears to be a spider from Korea</p>
<hr>
<pre><code>93.84.41.131 - - [31/Dec/2008:02:36:54 -0600] "GET /paintings/artists/w/Waterhouse_John_William/oil-big/Waterhouse_Destiny.jpg HTTP/1.1" 200 19924 "http://smrus.web-box.ru/Schemes" "Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5"
</code></pre>
<p>-- Request is for an image embedded within another website (we allow this)</p>
<hr>
<pre><code>87.115.8.230 - - [31/Dec/2008:03:08:17 -0600] "GET /paintings/artists/recently-added/july2008/big/Crucifixion-of-St-Peter-xx-Guido-Reni.JPG HTTP/1.1" 200 37348 "http://images.google.co.uk/im........DN&amp;frame=small" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5"
</code></pre>
<p>-- Request is from Google Images (could be viewing the image full-size, or spidering it)</p>
<hr>
<pre><code>216.145.5.42 - - [31/Dec/2008:02:21:49 -0600] "GET / HTTP/1.1" 200 53508 "http://whois.domaintools.com/mywebsite.com" "SurveyBot/2.3 (Whois Source)"
</code></pre>
<p>-- Request is from a whois bot</p>
<hr>
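<p>A minimal sketch of one way to approach the exclusions listed above, assuming the log uses the Apache combined format shown in the examples. It captures the path, referrer and user agent once, then applies each filter separately instead of folding everything into one negated regex. The helper name <code>is_external_referral</code> and the three patterns are hypothetical, and the bot and file-extension lists are illustrative guesses, not a complete catalogue.</p>

```perl
use strict;
use warnings;

# Illustrative patterns only (assumptions, not from the original post).
my $BOT_RE    = qr/bot|spider|crawl|slurp|yeti/i;                      # self-identified robots
my $STATIC_RE = qr/\.(?:jpe?g|gif|png|ico|css|js)(?:\?|\z)/i;          # embedded or hotlinked files
my $OWN_RE    = qr{^https?://(?:www\.)?mywebsite\.com(?:[/:]|\z)}i;    # internal referrers

# Hypothetical helper: true when a combined-format log line looks like
# a page view that arrived through a link on another site.
sub is_external_referral {
    my ($line) = @_;

    # [timestamp] "GET path HTTP/x.y" status bytes "referrer" "user agent"
    my ( $path, $referrer, $agent ) = $line =~ m{
        \[ [^\]]+ \]
        \s "GET \s (\S+) \s HTTP/\d\.\d"
        \s \d{3}
        \s \S+
        \s "([^"]*)"
        \s "([^"]*)"
    }x or return 0;    # not a GET line in the expected format

    return 0 if $referrer eq '-' or $referrer eq '';    # no referrer recorded
    return 0 if $referrer =~ $OWN_RE;                   # link from the site itself
    return 0 if $agent    =~ $BOT_RE;                   # user agent names itself a bot
    return 0 if $path     =~ $STATIC_RE;                # request for an image or asset
    return 1;
}
```

<p>Inside the existing <code>while</code> loop, the first <code>if</code> could then become <code>$totalreferals++ if is_external_referral($line);</code>. Robots that send a browser-like user agent would still be counted, so the total remains an approximation.</p>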
 
