You said matching the user agent on 'bot' may be awkward, but we've found it to be a pretty good match. Our studies have shown that it covers about 98% of the hits you receive, and we haven't come across any false positives yet either. If you want to raise that to 99.9%, you can include a few other well-known matches such as 'crawler', 'baiduspider', 'ia_archiver', 'curl', etc. We've tested this on our production systems over millions of hits.

Here are a few C# solutions for you:

### 1) Simplest

Fastest when processing a miss, i.e. traffic from a non-bot (a normal user). Catches 99+% of crawlers.

```csharp
bool iscrawler = Regex.IsMatch(Request.UserAgent,
    @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google",
    RegexOptions.IgnoreCase);
```

### 2) Medium

Fastest when processing a hit, i.e. traffic from a bot, and pretty fast for misses too. Catches close to 100% of crawlers. It matches 'bot', 'crawler' and 'spider' up front; you can add any other known crawlers to the list.

```csharp
List<string> Crawlers3 = new List<string>()
{
    "bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
    "lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",
    "atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
    "cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
    "esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
    "htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
    "image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
    "lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
    "motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
    "netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
    "patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
    "raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
    "searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
    "curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
    "urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
    "webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
    "webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
    "wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));
```

### 3) Paranoid

Pretty fast, but a little slower than options 1 and 2. It's the most accurate, and allows you to maintain the lists if you want. You can maintain a separate list of names with 'bot' in them if you are afraid of false positives in the future. If we get a short match, we log it and check it for a false positive.
```csharp
// crawlers that have 'bot' in their useragent
List<string> Crawlers1 = new List<string>()
{
    "googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
    "yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
    "botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
    "ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
    "dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
    "irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
    "simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
    "vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
    "spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
};

// crawlers that don't have 'bot' in their useragent
List<string> Crawlers2 = new List<string>()
{
    "baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
    "nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
    "bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
    "cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
    "fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
    "havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
    "jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
    "larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
    "merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
    "muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
    "objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
    "phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
    "roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
    "senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
    "spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
    "titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
    "webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
    "webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
    "robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
    "legs","curl","webs","wget","sift","cmc"
};

string ua = Request.UserAgent.ToLower();
string match = null;

if (ua.Contains("bot")) match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
else match = Crawlers2.FirstOrDefault(x => ua.Contains(x));

if (match != null && match.Length < 5)
    Log("Possible new crawler found: ", ua);

bool iscrawler = match != null;
```

**Notes:**

- It's tempting to just keep adding names to the regex in option 1, but if you do, it will become slower. If you want a more complete list, LINQ with a lambda is faster.
- Make sure `.ToLower()` is outside of your LINQ method – remember the method is a loop, and you would be lower-casing the string on every iteration.
- Always put the heaviest bots at the start of the list, so they match sooner.
- Put the lists into a static class so that they are not rebuilt on every page view (a sketch of this follows the list).
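On that last note, here is a minimal sketch of such a static holder, assuming the option 3 lists; the `CrawlerDetector` and `IsCrawler` names are just examples for this sketch, not part of the answer above.

```csharp
using System.Collections.Generic;
using System.Linq;

// Sketch only: the lists are static readonly fields, so they are built once per
// application rather than on every page view.
public static class CrawlerDetector
{
    // Fill these with the full Crawlers1 / Crawlers2 lists from option 3.
    private static readonly List<string> Crawlers1 = new List<string>
        { "googlebot", "bingbot", "yandexbot", /* ... */ "bot" };

    private static readonly List<string> Crawlers2 = new List<string>
        { "baiduspider", "80legs", "yahoo! slurp", /* ... */ "cmc" };

    public static bool IsCrawler(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent))
            return false;

        string ua = userAgent.ToLower();   // lower-case once, outside the lambda

        return ua.Contains("bot")
            ? Crawlers1.Any(x => ua.Contains(x))
            : Crawlers2.Any(x => ua.Contains(x));
    }
}

// Usage from a page or handler:
//   bool iscrawler = CrawlerDetector.IsCrawler(Request.UserAgent);
```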
**Honeypots**

The only real alternative to this is to create a 'honeypot' link on your site that only a bot will reach. You then log the user agent strings that hit the honeypot page to a database and use those logged strings to classify crawlers.

**Positives:** It will match some unknown crawlers that aren't declaring themselves.

**Negatives:** Not all crawlers dig deep enough to hit every link on your site, so they may not reach your honeypot.
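As a rough sketch of the honeypot idea, assuming classic ASP.NET: the `/trap.ashx` path, the `TrapHandler` class and the `Log` helper below are hypothetical names chosen for this example, not anything prescribed above.

```csharp
using System.Web;

// Sketch only. The honeypot link lives in your markup where a human will not
// follow it, e.g. <a href="/trap.ashx" style="display:none" rel="nofollow">x</a>.
// Anything that requests it gets its user agent logged for later classification.
public class TrapHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        string ua = context.Request.UserAgent ?? "(no user agent)";

        // Stand-in for whatever logging / database insert you already use.
        Log("Honeypot hit: ", ua);

        context.Response.StatusCode = 404;   // nothing useful to give back
    }

    private static void Log(string message, string ua)
    {
        System.Diagnostics.Trace.WriteLine(message + ua);
    }
}
```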