  1. Java patterns: engineering data flows for data mining tasks
    <p>I am a data miner, and as such I spend a lot of time transforming raw data in various ways so that predictive models can consume it. For instance: read a file in a certain format, tokenize, gram-ify, and project into some numeric representation. Over the years I have developed a rich set of methods for most of the data processing tasks I can think of, but I don't have a nice way of configuring these components in anything but the most rudimentary ways. Typically I end up making a lot of calls to specific methods, in source code that depends on a specific task. I'm now trying to refactor my libraries into something much nicer, but I'm not sure what that should look like.</p>
    <p>My current thinking is to have a list of function objects, each defining some method (say, operate( ... )), that are called in sequence. Each one either processes the contents of some data flow by reference or consumes the output of the previous function object. This is close to what I want, but because the type of data going in and coming out varies from step to step, using generics becomes very difficult. To use my example above, I'd like to pass something through this "pipeline" that processes data like:</p>
<pre><code>input: string filename
filename -&gt; collection of strings
collection&lt;string&gt; -&gt; (stemming, stopword removal) -&gt; collection of strings
collection&lt;string&gt; -&gt; (tokenize) -&gt; collection of string arrays
collection&lt;string[]&gt; -&gt; (gram-ify) -&gt; augment individual token strings with n-grams -&gt; collection of string arrays
collection&lt;string[]&gt; -&gt; projection into numeric vectors -&gt; collection&lt; double[] &gt;
</code></pre>
    <p>This is a simple example, but imagine I have hundreds of such components and I'd like to add them to some data flow. This meets my easy-to-configure requirement: I could easily build a pipeline factory that reads a YAML file and builds the pipeline out.
    However, the design of the components themselves has been stumping me for a while. What do the appropriate interfaces look like? It seems the only easy way to do this is to pass plain Objects between components, essentially doing away with generics (or to pass some context object that has an Object as a member variable), then check for compatibility at input and throw runtime exceptions. Both options seem equally bad. Still, I feel like I'm close to a really nice and flexible system here. Can you help me push this over the fence?</p>
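    <p>To make the typing problem concrete, here is a minimal sketch of the "list of function objects" idea with the input and output types kept as generic parameters. All the names (Stage, then, TOKENIZE, PROJECT) are hypothetical and chosen only for illustration, and the tokenizer and projection are toy stand-ins for real components. It is one possible shape for the interface, not a definitive design.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PipelineSketch {

    /** One typed step (hypothetical interface): consumes an I, produces an O. */
    @FunctionalInterface
    interface Stage<I, O> {
        O operate(I input);

        /** Chains this stage with the next one; mismatched types fail at compile time. */
        default <R> Stage<I, R> then(Stage<? super O, ? extends R> next) {
            return input -> next.operate(operate(input));
        }
    }

    /** Toy tokenizer: splits each line on whitespace. */
    static final Stage<List<String>, List<String[]>> TOKENIZE = lines -> {
        List<String[]> out = new ArrayList<>();
        for (String line : lines) {
            out.add(line.trim().split("\\s+"));
        }
        return out;
    };

    /** Toy projection: maps each token array to [token count, total characters]. */
    static final Stage<List<String[]>, List<double[]>> PROJECT = docs -> {
        List<double[]> out = new ArrayList<>();
        for (String[] tokens : docs) {
            double chars = 0;
            for (String t : tokens) {
                chars += t.length();
            }
            out.add(new double[] { tokens.length, chars });
        }
        return out;
    };

    public static void main(String[] args) {
        // The composed pipeline has a single static type from first input to last output.
        Stage<List<String>, List<double[]>> pipeline = TOKENIZE.then(PROJECT);
        List<double[]> vectors = pipeline.operate(Arrays.asList("the quick fox", "hello world"));
        for (double[] v : vectors) {
            System.out.println(Arrays.toString(v)); // prints [3.0, 11.0] then [2.0, 10.0]
        }
        // PROJECT.then(TOKENIZE) would not compile: List<double[]> is not a List<String>.
    }
}
```

    <p>Chaining stages pairwise keeps the compiler checking each connection, which a plain List of stages cannot do because every element would need the same type parameters. A pipeline assembled at runtime from a YAML file would presumably still need some compatibility check when it is built, since the static types are not available at that point.</p>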