<p>The <code>globStatus</code> method takes two complementary arguments that let you filter your files. The first is the glob pattern. When a glob pattern is not expressive enough to select specific files, you can also supply a <code>PathFilter</code>.</p>
<p>The following glob patterns are supported:</p>
<pre><code>Glob   | Matches
-------+------------------------------------------------------------------
*      | Matches zero or more characters
?      | Matches a single character
[ab]   | Matches a single character in the set {a, b}
[^ab]  | Matches a single character not in the set {a, b}
[a-b]  | Matches a single character in the range [a, b], where a is
       | lexicographically less than or equal to b
[^a-b] | Matches a single character not in the range [a, b], where a is
       | lexicographically less than or equal to b
{a,b}  | Matches either expression a or b
\c     | Matches character c when it is a metacharacter
</code></pre>
<p><code>PathFilter</code> is simply an interface like this:</p>
<pre><code>public interface PathFilter {
    boolean accept(Path path);
}
</code></pre>
<p>You can implement this interface and put your file-filtering logic in the <code>accept</code> method.</p>
<p>Here is an example taken from <a href="http://rads.stackoverflow.com/amzn/click/1449389732" rel="noreferrer">Tom White's excellent book</a>. It defines a <code>PathFilter</code> that excludes files whose path matches a given regular expression:</p>
<pre><code>import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexExcludePathFilter implements PathFilter {
    private final String regex;

    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }

    @Override
    public boolean accept(Path path) {
        // Reject any path that matches the regex as a whole.
        return !path.toString().matches(regex);
    }
}
</code></pre>
<p>You can filter your input directly with a <code>PathFilter</code> implementation by calling <code>FileInputFormat.setInputPathFilter(JobConf, RegexExcludePathFilter.class)</code> when initializing your job.</p>
<p><strong>EDIT</strong>: Since <code>setInputPathFilter</code> takes a class, you can't pass constructor arguments directly, but you should be able to get the same effect through the <code>Configuration</code>. If your filter also extends <code>Configured</code>, it receives the <code>Configuration</code> object that you initialized earlier with the desired values, so you can read those values inside the filter and use them in <code>accept</code>.</p>
<p>For example, if you initialize like this:</p>
<pre><code>conf.set("date", "2013-01-15");
</code></pre>
<p>Then you can define your filter like this:</p>
<pre><code>import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexIncludePathFilter extends Configured implements PathFilter {
    private String date;
    private FileSystem fs;

    @Override
    public boolean accept(Path path) {
        try {
            // Always accept directories so that their contents get filtered too.
            if (fs != null &amp;&amp; fs.isDirectory(path)) {
                return true;
            }
        } catch (IOException e) {
            // Could not check the path; fall through to the suffix check.
        }
        return date != null &amp;&amp; path.toString().endsWith(date);
    }

    @Override
    public void setConf(Configuration conf) {
        super.setConf(conf); // keep getConf() working
        if (conf != null) {
            this.date = conf.get("date");
            try {
                this.fs = FileSystem.get(conf);
            } catch (IOException e) {
                // fs stays null; accept() then skips the directory check.
            }
        }
    }
}
</code></pre>
<p><strong>EDIT 2</strong>: There were a few issues with the original code, so please see the updated class. You also need to remove the constructor, since it is no longer used, and check whether the path is a directory. If it is, return true so that the contents of the directory can be filtered too.</p>
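<p>One detail worth checking before wiring a regex filter into a job: <code>String.matches</code> succeeds only when the pattern matches the whole string, so the regex given to a filter like <code>RegexExcludePathFilter</code> has to cover the full path. The following is a minimal sketch of that behavior with no Hadoop dependency. <code>SimplePathFilter</code>, <code>regexExclude</code>, <code>apply</code>, and the sample paths are hypothetical stand-ins for illustration, not part of Hadoop's API.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PathFilterSketch {

    /** Hypothetical stand-in for Hadoop's PathFilter, using plain String paths. */
    interface SimplePathFilter {
        boolean accept(String path);
    }

    /** Builds a filter that rejects paths fully matching the regex. */
    static SimplePathFilter regexExclude(String regex) {
        Pattern pattern = Pattern.compile(regex);
        // matches() is a whole-string match, like String.matches(regex).
        return path -> !pattern.matcher(path).matches();
    }

    /** Returns the paths accepted by the filter, in their original order. */
    static List<String> apply(List<String> paths, SimplePathFilter filter) {
        List<String> kept = new ArrayList<>();
        for (String path : paths) {
            if (filter.accept(path)) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample paths, for illustration only.
        List<String> paths = Arrays.asList(
                "/logs/2013-01-14/part-0",
                "/logs/2013-01-15/part-0",
                "/logs/2013-01-15/_SUCCESS");

        // The regex covers the whole path, so the marker file is excluded.
        System.out.println(apply(paths, regexExclude(".*/_SUCCESS")));

        // The regex covers only the file name, so nothing is excluded.
        System.out.println(apply(paths, regexExclude("_SUCCESS")));
    }
}
```

<p>The first call keeps the two <code>part-0</code> paths. The second keeps all three paths, because <code>_SUCCESS</code> on its own never matches an entire path.</p>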