StackOverflow2013


<h1>Detecting a (naughty or nice) URL or link in a text string</h1>
<p><strong>How can I detect (with regular expressions or heuristics) a web site link in a string of text such as a comment?</strong></p>
<p>The purpose is to prevent spam. HTML is stripped so I need to detect invitations to copy-and-paste. <em>It should not be economical for a spammer to post links because most users could not successfully get to the page</em>. I would like suggestions, references, or discussion on best practices.</p>
<p>Some objectives:</p>
<ul>
<li>The low-hanging fruit like well-formed URLs (<code>http://some-fqdn/some/valid/path.ext</code>)</li>
<li>URLs but without the <code>http://</code> prefix (i.e. a valid FQDN + valid HTTP path)</li>
<li>Any other funny business</li>
</ul>
<p>Of course, I am blocking spam, but the same process could be used to auto-link text.</p>
<h2>Ideas</h2>
<p>Here are some things I'm thinking.</p>
<ul>
<li>The content is native-language prose so I can be trigger-happy in detection</li>
<li>Should I strip out all whitespace first, to catch "<code>www .example.com</code>"? Would common users know to remove the space themselves, or do any browsers "do-what-I-mean" and strip it for you?</li>
<li>Maybe multiple passes are a better strategy, with scans for:
<ul>
<li>Well-formed URLs</li>
<li>All non-whitespace followed by '.'
followed by any valid TLD</li>
<li>Anything else?</li>
</ul></li>
</ul>
<h2>Related Questions</h2>
<p>I've read these and they are now documented here, so you can just reference the regexes in those questions if you want.</p>
<ul>
<li><a href="https://stackoverflow.com/questions/37684/replace-url-with-html-links-javascript">replace URL with HTML Links javascript</a></li>
<li><a href="https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url">What is the best regular expression to check if a string is a valid URL</a></li>
<li><a href="https://stackoverflow.com/questions/27745/getting-parts-of-a-url-regex">Getting parts of a URL (Regex)</a></li>
</ul>
<h2>Update and Summary</h2>
<p>Wow, there are some very good heuristics listed in here! For me, the best bang-for-the-buck is a synthesis of the following:</p>
<ol>
<li>@Jon Bright's technique of detecting TLDs (a good defensive chokepoint)</li>
<li>For those suspicious strings, replace the dot with a dot-looking character as per @capar</li>
<li>A good dot-looking character is @Sharkey's subscripted &middot; (i.e. "<sub>·</sub>"). &middot; is also a word boundary so it's harder to casually copy &amp; paste.</li>
</ol>
<p>That should make a spammer's CPM low enough for my needs; the "flag as inappropriate" user feedback should catch anything else. Other solutions listed are also very useful:</p>
<ul>
<li>Strip out all dotted-quads (@Sharkey's comment to his own answer)</li>
<li>@Sporkmonger's requirement for client-side JavaScript which inserts a required hidden field into the form.</li>
<li>Pinging the URL server-side to establish whether it is a web site. (Perhaps I could run the HTML through SpamAssassin or another Bayesian filter as per @Nathan.)</li>
<li>Looking at Chrome's source for its smart address bar to see what clever tricks Google uses</li>
<li>Calling out to OWASP AntiSamy or other web services for spam/malware detection.</li>
</ul>
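<p>The synthesis above (detect TLDs, swap the dot for a subscripted &middot;, strip dotted-quads) could be sketched roughly as follows. This is only a minimal illustration, not the code from any of the answers: the names <code>TLDS</code>, <code>HOSTLIKE</code>, <code>DOTTED_QUAD</code>, <code>FAKE_DOT</code> and <code>defang</code> are made up for the example, the TLD tuple is a tiny assumed subset of the real list, and the input is assumed to be already HTML-stripped text. It is deliberately trigger-happy: spaces around a dot are tolerated so that "www .example.com" is caught, which also means some ordinary prose will be mangled.</p>

```python
import re

# Assumption: a tiny illustrative subset; a real filter would load the full TLD list.
TLDS = ("com", "net", "org", "info", "biz", "ru", "cn", "uk")

# A dot with optional spaces or tabs on either side, to catch "www .example.com".
DOT = r"[ \t]*\.[ \t]*"

# One or more "label + dot" pairs followed by a known TLD that ends at a word boundary.
HOSTLIKE = re.compile(
    r"(?:[A-Za-z0-9-]+" + DOT + r")+(?:" + "|".join(TLDS) + r")\b",
    re.IGNORECASE,
)

# Four groups of 1-3 digits separated by dots (not validated as a real IPv4 address).
DOTTED_QUAD = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Subscripted middle dot: looks like a period but does not paste as one.
FAKE_DOT = "<sub>\u00b7</sub>"


def defang(text: str) -> str:
    """Return text with dotted-quads removed and host-like dots replaced."""
    # Pass 1: strip anything that looks like a dotted-quad address.
    text = DOTTED_QUAD.sub("", text)
    # Pass 2: inside every host-like match, replace each dot with the fake dot.
    return HOSTLIKE.sub(lambda m: re.sub(DOT, FAKE_DOT, m.group(0)), text)


if __name__ == "__main__":
    print(defang("visit www .example.com/page now"))
    print(defang("go to 192.168.0.1 now"))
```

<p>Because the result contains <code>&lt;sub&gt;</code> markup, it would have to be emitted as trusted HTML after the user's own markup has been stripped or escaped; that ordering is an assumption of this sketch.</p>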
