StackOverflow2013

<p>I think your suggestion of breaking apart the URL to find the useful bits and then querying for just those items sounds like a decent way to go.</p>
<p>I tossed together some Java that might help illustrate, code-wise, what I think this would entail. The most valuable portions are probably the regexes, but I hope the general algorithm helps some as well:</p>
<pre class="lang-java prettyprint-override"><code>import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;

public class CategoryParser {

    /** The db field that keywords should be checked against */
    private static final String DB_KEYWORD_FIELD_NAME = "keyword";

    /** The db field that categories should be pulled from */
    private static final String DB_CATEGORY_FIELD_NAME = "category";

    /** The name of the table to query */
    private static final String DB_TABLE_NAME = "KeywordCategoryMap";

    /**
     * This method takes a URL and from that text alone determines what categories that URL belongs in.
     * @param url - String URL to categorize
     * @return categories - A List&lt;String&gt; of categories the URL seemingly belongs in
     */
    public static List<String> getCategoriesFromUrl(String url) {

        // Clean the URL to remove useless bits and encoding artifacts
        String normalizedUrl = normalizeURL(url);

        // Break the url apart and get the good stuff
        String[] keywords = tokenizeURL(normalizedUrl);

        // Construct the query we can query the database with
        String query = constructKeywordCategoryQuery(keywords);

        System.out.println("Generated Query: " + query);

        // At this point, you'd need to fire this query off to your database,
        // and the results you'd get back should each be a valid category
        // for your URL. This code is not provided because it's very implementation specific,
        // and you already know how to deal with databases.

        // Returning null to make this compile, even though you'd obviously want to return the
        // actual List of Strings
        return null;
    }

    /**
     * Removes the protocol, if it exists, from the front and
     * removes any random encoding characters
     * Extend this to do other url cleaning/pre-processing
     * @param url - The String URL to normalize
     * @return normalizedUrl - The String URL that has no junk or surprises
     */
    private static String normalizeURL(String url) {

        // Decode URL to remove any %20 type stuff
        String normalizedUrl = url;
        try {
            // I've used a URLDecoder that's part of Java here,
            // but this functionality exists in most modern languages
            // and is universally called url decoding
            normalizedUrl = URLDecoder.decode(url, "UTF-8");
        }
        catch(UnsupportedEncodingException uee) {
            System.err.println("Unable to Decode URL. Decoding skipped.");
            uee.printStackTrace();
        }

        // Remove the protocol, http:// ftp:// or similar from the front.
        // The limit of 2 keeps everything after the first "://" intact,
        // even if a decoded query parameter contains another "://"
        if (normalizedUrl.contains("://")) {
            normalizedUrl = normalizedUrl.split(":\\/\\/", 2)[1];
        }

        // Room here to do more pre-processing

        return normalizedUrl;
    }

    /**
     * Takes apart the url into the pieces that make at least some sense
     * This doesn't guarantee that each token is a potentially valid keyword, however
     * because that would require actually iterating over them again, which might be
     * seen as a waste.
     * @param url - Url to be tokenized
     * @return tokens - A String array of all the tokens
     */
    private static String[] tokenizeURL(String url) {

        // I assume that we're going to use the whole URL to find tokens in
        // If you want to just look in the GET parameters, or you want to ignore the domain
        // or you want to use the domain as a token itself, that would have to be
        // processed above the next line, and only the remaining parts split
        String[] tokens = url.split("\\b|_");

        // One could alternatively use a more complex regex to remove more invalid matches
        // but this is subject to your (?:in)?ability to actually write the regex you want

        // These next two get rid of tokens that are too short, also.

        // Destroys anything that's not alphanumeric and things that are
        // alphanumeric but only 1 character long
        //String[] tokens = url.split("(?:[\\W_]+\\w)*[\\W_]+");

        // Destroys anything that's not alphanumeric and things that are
        // alphanumeric but only 1 or 2 characters long
        //String[] tokens = url.split("(?:[\\W_]+\\w{1,2})*[\\W_]+");

        return tokens;
    }

    private static String constructKeywordCategoryQuery(String[] keywords) {

        // This will hold our WHERE body, keyword OR keyword2 OR keyword3
        StringBuilder whereItems = new StringBuilder();

        // Potential query, if we find anything valid
        String query = null;

        // Iterate over every found token
        for (String keyword : keywords) {

            // Reject invalid keywords
            if (isKeywordValid(keyword)) {

                // If we need an OR
                if (whereItems.length() > 0) {
                    whereItems.append(" OR ");
                }

                // Simply append this item to the query
                // Yields something like "keyword='thisKeyword'"
                whereItems.append(DB_KEYWORD_FIELD_NAME);
                whereItems.append("='");
                whereItems.append(keyword);
                whereItems.append("'");
            }
        }

        // If a valid keyword actually made it into the query
        if (whereItems.length() > 0) {
            query = "SELECT DISTINCT(" + DB_CATEGORY_FIELD_NAME + ") FROM " + DB_TABLE_NAME
                    + " WHERE " + whereItems.toString() + ";";
        }

        return query;
    }

    private static boolean isKeywordValid(String keyword) {
        // Keywords better be at least 2 characters long
        return keyword.length() > 1
                // And they better be only composed of letters and numbers
                && keyword.matches("\\w+")
                // And they better not be *just* numbers
                // && !keyword.matches("\\d+") // If you want this
                ;
    }

    // How this would be used
    public static void main(String[] args) {
        List<String> soQuestionUrlClassifications = getCategoriesFromUrl("http://stackoverflow.com/questions/10046178/pattern-matching-for-url-classification");
        List<String> googleQueryURLClassifications = getCategoriesFromUrl("https://www.google.com/search?sugexp=chrome,mod=18&sourceid=chrome&ie=UTF-8&q=spring+is+a+new+service+instance+created#hl=en&sugexp=ciatsh&gs_nf=1&gs_mss=spring%20is%20a%20new%20bean%20instance%20created&tok=lnAt2g0iy8CWkY65Te75sg&pq=spring%20is%20a%20new%20bean%20instance%20created&cp=6&gs_id=1l&xhr=t&q=urlencode&pf=p&safe=off&sclient=psy-ab&oq=url+en&gs_l=&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=2176d1af1be1f17d&biw=1680&bih=965");
    }
}
</code></pre>
<p>The generated query for the SO link would look like:</p>
<pre class="lang-sql prettyprint-override"><code>SELECT DISTINCT(category) FROM KeywordCategoryMap WHERE keyword='stackoverflow' OR keyword='com' OR keyword='questions' OR keyword='10046178' OR keyword='pattern' OR keyword='matching' OR keyword='for' OR keyword='url' OR keyword='classification';
</code></pre>
<p>There is plenty of room for optimization, but I imagine this is much faster than checking the URL string against every possible keyword.</p>
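<p>As one possible optimization, the chain of ORs could be swapped for a single IN clause with placeholders, so the keywords are bound as parameters rather than concatenated into the SQL text. The following is only a minimal sketch under assumptions: the class name <code>KeywordQuerySketch</code> and the methods <code>validKeywords</code> and <code>buildParameterizedQuery</code> are hypothetical and not part of the code above; the table and column names are the ones used earlier. Executing the query (for example through a JDBC <code>PreparedStatement</code>, binding each keyword with <code>setString</code>) is left out because it depends on your database setup.</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for illustration; not part of the CategoryParser class. */
final class KeywordQuerySketch {

    /**
     * Splits an already-normalized URL on word boundaries and underscores,
     * then keeps only tokens that are at least 2 characters long and made
     * up entirely of word characters (same rules as isKeywordValid).
     */
    static List<String> validKeywords(String normalizedUrl) {
        List<String> keywords = new ArrayList<>();
        for (String token : normalizedUrl.split("\\b|_")) {
            if (token.length() > 1 && token.matches("\\w+")) {
                keywords.add(token);
            }
        }
        return keywords;
    }

    /**
     * Builds a query with one "?" placeholder per keyword.
     * Returns null when there are no keywords to look up.
     */
    static String buildParameterizedQuery(int keywordCount) {
        if (keywordCount <= 0) {
            return null;
        }
        String placeholders = String.join(", ", Collections.nCopies(keywordCount, "?"));
        return "SELECT DISTINCT category FROM KeywordCategoryMap WHERE keyword IN ("
                + placeholders + ")";
    }

    public static void main(String[] args) {
        // Assumes the protocol has already been stripped, as normalizeURL does
        List<String> keywords = validKeywords(
                "stackoverflow.com/questions/10046178/pattern-matching-for-url-classification");
        System.out.println(keywords);
        System.out.println(buildParameterizedQuery(keywords.size()));
    }
}
```

<p>For the SO link this should pick out the same nine keywords that appear in the generated query above. Whether IN actually beats the OR chain depends on your database and its indexes, so treat it as something to measure rather than a guaranteed win.</p>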
