    <p>OK, you're probably not going to like this answer much - but then maybe you will? I have a regular expression (adapted from the source linked below) that seems to find URLs in text. You can see a demo <a href="http://regex101.com/r/vV2nC5" rel="nofollow">on regex101.com</a>.</p> <p>The actual expression is very long because it contains every legal TLD (top-level domain), which is a good start for finding "good" URLs. Here it is:</p> <pre><code>((?:(?:http|ftp|https):\/{2}){0,1}(?:(?:[0-9a-z_-]+\.)+(?:aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cx|cy|cz|cz|de|dj|dk|dm|do|dz|ec|ee|eg|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mn|mn|mo|mp|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|nom|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ra|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sj|sk|sl|sm|sn|so|sr|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw|arpa)(?::[0-9]+)?(?:(?:\/(?:[~0-9a-zA-Z\#\+\%@\.\/_-]+))?(?:\?[0-9a-zA-Z\+\%@\/&amp;\[\];=_-]+)?)?))</code></pre> <p>As you can see, the vast majority of the expression is taken up making sure that the TLD is one of the many legal ones (270 alternatives; I didn't know there were this many until I stumbled on <a href="http://mathiasbynens.be/demo/url-regex" rel="nofollow">http://mathiasbynens.be/demo/url-regex</a>, where I found the seeds of this expression).</p> <p>The main change I made to the expression I found at the link above was to make all the groups (except the outer one) non-capturing, so there is just a single "match". 
In the sample I posted I showed that a "good" protocol definition (like <code>http://</code>) will be included in the capture, while a "bad" one (like <code>http:/</code>) will be ignored - however, the URL that follows the malformed protocol will still be captured. I also showed that adding punctuation right after the URL (tested with <code>;</code> and <code>!</code>) doesn't faze the expression: it captures "up to that point" and not beyond.</p> <p>Play with it and see how you like it. It is relatively poor (according to the above link) for "pathological" URLs, and doesn't work with Arabic etc. - but I don't think, based on your question, that this would be an issue.</p> <p>A short explanation:</p> <pre><code>(?:(?:http|ftp|https):\/{2}){0,1}
  (?:http|ftp|https)  - match one of http, ftp, or https - non-capturing "OR" group
  :\/{2}              - followed by a colon and exactly two forward slashes
  (?: …){0,1}         - the whole thing zero or one times (so no protocol, or properly formed)

(?:(?:[0-9a-z_-]+\.)+
  [0-9a-z_-]+\.       - at least one of the characters in the given range, followed by a period
  (?: )+              - the whole thing one or more times, non-capturing

(?:aero|asia …)       - one of these strings, non-capturing (these are all the valid TLDs)
(?::[0-9]+)?          - zero or one times a colon followed by one or more digits: port specification - this makes sure that www.something.us:8080 is valid
</code></pre> <p>Everything else that follows matches all the different things that can go after - directories, queries, etc.</p>
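<p>To try the behaviour described above outside regex101, here is a minimal Python sketch. It is not part of the original answer: the choice of Python, the names <code>TLDS</code>, <code>URL_PATTERN</code> and <code>find_urls</code>, and the deliberately short TLD list are all assumptions made for illustration. The structure of the pattern is the same as the expression above (with <code>&amp;amp;</code> written as a plain <code>&amp;</code>, since this is not HTML).</p>

```python
import re

# Assumption: a deliberately short TLD list, for illustration only.
# The expression in the answer enumerates far more alternatives.
TLDS = ["com", "org", "net", "us", "be"]

# Same structure as the answer's expression, split into commented pieces.
URL_PATTERN = (
    r"((?:(?:http|ftp|https):\/{2}){0,1}"        # optional, well-formed protocol
    r"(?:(?:[0-9a-z_-]+\.)+"                     # one or more labels, each ending in a period
    r"(?:" + "|".join(TLDS) + r")"               # one of the listed TLDs
    r"(?::[0-9]+)?"                              # optional port, e.g. :8080
    r"(?:(?:\/(?:[~0-9a-zA-Z\#\+\%@\.\/_-]+))?"  # optional path
    r"(?:\?[0-9a-zA-Z\+\%@\/&\[\];=_-]+)?)?))"   # optional query string
)

URL_RE = re.compile(URL_PATTERN)


def find_urls(text):
    """Return every substring of text matched by the outer (only) capturing group."""
    return URL_RE.findall(text)


if __name__ == "__main__":
    samples = [
        "Visit http://www.example.com/docs; thanks",  # good protocol, trailing ';'
        "Try www.something.us:8080!",                 # no protocol, port, trailing '!'
        "Broken http:/bad.example.org here",          # malformed protocol
    ]
    for sample in samples:
        print(sample, "->", find_urls(sample))
```

<p>With these samples the well-formed <code>http://</code> prefix is part of the match, the malformed <code>http:/</code> is skipped while the host after it is still found, and the trailing <code>;</code> and <code>!</code> are left out. Because only the outer group captures, <code>findall</code> returns plain strings rather than tuples.</p>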