StackOverflow2013


Pattern Matching for URL classification
<p>As part of a project, a few others and I are working on a URL classifier. What we are trying to implement is quite simple: we look at the URL, find relevant keywords occurring within it, and classify the page accordingly.</p> <p>E.g., if the URL is <a href="http://cnnworld/sports/abcd" rel="nofollow">http://cnnworld/sports/abcd</a>, we would classify it under the category "sports".</p> <p>To accomplish this, we have a database with mappings of the format: Keyword -> Category</p> <p>What we currently do is, for each URL, read every row in the database and use the String.find() method to check whether the keyword occurs within the URL. As soon as a match is found, we stop.</p> <p>But this approach has a few problems, the main ones being:</p> <p>(i) Our database is very big, and such repeated querying runs extremely slowly.</p> <p>(ii) A page may belong to more than one category, and our approach does not handle such cases. Of course, one simple way to handle this would be to continue querying the database even after a category match is found, but this would only make things even slower.</p> <p>I was thinking of alternatives and was wondering if the reverse could be done: parse the URL, find the words occurring within it, and then query the database for those words only.</p> <p>A naive algorithm for this would run in O(n^2) for a URL of length n: query the database for every substring that occurs within the URL.</p> <p>I was wondering if there is a better approach to accomplish this. Any ideas? Thank you in advance :)</p>
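
To make the "reverse" idea concrete, here is a minimal Python sketch, not a definitive implementation. It assumes the Keyword -> Category table is small enough to load once into an in-memory dict; the table contents and the names `KEYWORD_TO_CATEGORIES`, `tokenize`, `classify_tokens`, and `classify_substrings` are all hypothetical. The first function looks up only the delimiter-separated words of the URL; the second is the naive substring variant, bounded by the longest keyword so that it does roughly n * L lookups instead of n^2.

```python
import re

# Hypothetical in-memory copy of the Keyword -> Category table.
# Each keyword maps to a set, so one keyword may imply several categories.
KEYWORD_TO_CATEGORIES = {
    "sports": {"sports"},
    "football": {"sports"},
    "election": {"politics"},
    "stocks": {"finance"},
}


def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def classify_tokens(url, table=KEYWORD_TO_CATEGORIES):
    """Look up each URL token in the table and collect every category hit."""
    categories = set()
    for token in tokenize(url):
        categories |= table.get(token, set())
    return categories


def classify_substrings(url, table=KEYWORD_TO_CATEGORIES):
    """Look up every substring no longer than the longest keyword.

    This also catches keywords glued to other text (e.g. "cnnsports"),
    at the cost of about len(url) * max_len dictionary lookups.
    """
    url = url.lower()
    max_len = max((len(k) for k in table), default=0)
    categories = set()
    for start in range(len(url)):
        for end in range(start + 1, min(len(url), start + max_len) + 1):
            categories |= table.get(url[start:end], set())
    return categories


print(classify_tokens("http://cnnworld/sports/abcd"))
print(classify_substrings("http://cnnsports/election-2012"))
```

Both functions return a set, so a page matching several keywords ends up in several categories, which addresses problem (ii). The token version misses keywords that are not separated by delimiters; a multi-pattern matcher such as Aho-Corasick is the usual way to find all embedded keywords in a single pass over the URL.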
