<p>You could use a list of common filename extensions to divine the type of data stored at a given URL, but that often won't be very reliable, particularly with Web 2.0 sites (just look at the URL of this SO question itself). In addition, a link to a PHP script (.php) or other dynamic page could return either HTML or XML. Or it could return something else entirely, such as a JPG file.</p>

<p>There are several simple heuristics you can use to tell HTML from XML just by looking at the beginning of the file. For example, you could look for the <code>&lt;!DOCTYPE ...&gt;</code> declaration, check for the <code>&lt;?xml ...?&gt;</code> declaration, and check whether the file contains a root <code>&lt;html&gt;</code> tag. Of course, these should all be case-insensitive checks.</p>

<p>You can also try to identify the type of file based on its <a href="http://en.wikipedia.org/wiki/Internet_media_type" rel="nofollow">MIME type</a> (for example, <em>text/html</em> or <em>text/xml</em>). Unfortunately, many servers return incorrect or invalid MIME types, so you often have to read the beginning of the file anyway to divine its content, as you can see in my first two inadequate versions of a getMimeType() method below. The third attempt worked better, but the third-party jMimeMagic library still provided disappointing results. Nevertheless, you could use the additional heuristics that I mentioned earlier to either replace or improve the getMimeType() method.</p>

<pre><code>package com.example.mimetype;

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.FileNameMap;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import net.sf.jmimemagic.Magic;
import net.sf.jmimemagic.MagicException;
import net.sf.jmimemagic.MagicMatchNotFoundException;
import net.sf.jmimemagic.MagicParseException;

public class MimeUtils {

    // After calling this method, you can retrieve a list of URLs for each mimetype.
    public static Map&lt;String, List&lt;String&gt;&gt; sortLinksByMimeType(List&lt;String&gt; links) {
        Map&lt;String, List&lt;String&gt;&gt; mapMimeTypesToLinks = new HashMap&lt;String, List&lt;String&gt;&gt;();
        for (String url : links) {
            try {
                String mimetype = getMimeType(url);
                System.out.println(url + " has mimetype " + mimetype);
                // If this mimetype hasn't already been initialized, initialize it.
                if (!mapMimeTypesToLinks.containsKey(mimetype)) {
                    mapMimeTypesToLinks.put(mimetype, new ArrayList&lt;String&gt;());
                }
                List&lt;String&gt; lst = mapMimeTypesToLinks.get(mimetype);
                lst.add(url);
            } catch (Exception e) {
                // Report the failure and carry on with the remaining links.
                e.printStackTrace();
            }
        }
        return mapMimeTypesToLinks;
    }

    public static String getMimeType(String url) throws MalformedURLException, IOException,
            MagicParseException, MagicMatchNotFoundException, MagicException {
        // First attempt at determining MIME type: returned null for all URLs that I tried.
        // FileNameMap filenameMap = URLConnection.getFileNameMap();
        // return filenameMap.getContentTypeFor(url);

        // Second attempt at determining MIME type: worked better, but still returned null for many URLs.
        // (Note: guessContentTypeFromStream requires a stream that supports mark/reset.)
        // URLConnection c = new URL(url).openConnection();
        // InputStream in = c.getInputStream();
        // String mimetype = URLConnection.guessContentTypeFromStream(in);
        // in.close();
        // return mimetype;

        // Third attempt: hand the first bytes of the response to jMimeMagic.
        URLConnection c = new URL(url).openConnection();
        BufferedInputStream in = new BufferedInputStream(c.getInputStream());
        try {
            byte[] content = new byte[100];
            // read() may return fewer than 100 bytes (or -1), so keep only what was read.
            int n = in.read(content);
            byte[] head = Arrays.copyOf(content, Math.max(n, 0));
            return Magic.getMagicMatch(head, false).getMimeType();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) {
        List&lt;String&gt; links = new ArrayList&lt;String&gt;();
        links.add("http://stackoverflow.com/questions/10082568/how-to-differentiate-xml-from-html-links-in-java");
        links.add("http://stackoverflow.com");
        links.add("http://stackoverflow.com/feeds");
        links.add("http://amazon.com");
        links.add("http://google.com");
        sortLinksByMimeType(links);
    }
}
</code></pre>
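<p>The "look at the beginning of the file" checks described above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the class <code>MarkupSniffer</code>, the enum <code>Kind</code>, and the method <code>sniff</code> are hypothetical names, and the sketch assumes the caller has already decoded the first few hundred bytes of the response into a string.</p>

```java
import java.util.Locale;

// Hypothetical helper (not part of any library): guesses HTML vs. XML
// from the first characters of a document.
class MarkupSniffer {

    enum Kind { HTML, XML, UNKNOWN }

    static Kind sniff(String head) {
        if (head == null) {
            return Kind.UNKNOWN;
        }
        // Drop a byte-order mark and surrounding whitespace, then lower-case
        // so that every check below is case-insensitive.
        String s = head.replace("\uFEFF", "").trim().toLowerCase(Locale.ROOT);

        // An HTML doctype or an <html> root tag wins, even after an XML
        // declaration (as in XHTML documents).
        if (s.contains("<!doctype html") || s.contains("<html")) {
            return Kind.HTML;
        }
        // Otherwise an XML declaration at the very start suggests plain XML.
        if (s.startsWith("<?xml")) {
            return Kind.XML;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(sniff("<!DOCTYPE html><html><head>"));
        System.out.println(sniff("<?xml version=\"1.0\"?><rss version=\"2.0\">"));
        System.out.println(sniff("GIF89a"));
    }
}
```

<p>A result of <code>UNKNOWN</code> is the point at which you might fall back to the server's reported MIME type or to a library such as jMimeMagic. Treating XHTML as HTML is a design choice made for this sketch; swap the order of the checks if you would rather classify it as XML.</p>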