<p>Given your problem description, the characteristics of your data, and your ML background and personal preferences, I would recommend <a href="http://orange.biolab.si" rel="noreferrer">Orange</a>.</p> <p>Orange is a mature, free and open-source project with a large selection of ML algorithms and excellent documentation and training materials. Most users probably use the GUI supplied with Orange, but the framework is scriptable with Python.</p> <p>Using this framework will therefore enable you to quickly experiment with a variety of classifiers, because <em>(i)</em> they are all in one place; and <em>(ii)</em> each is accessed through a common GUI configuration syntax. All of the ML techniques within the Orange framework can be run in "demo" mode on one or more sample data sets supplied with the Orange install. The documentation supplied with the Orange install is excellent. In addition, the home page includes links to numerous tutorials that cover probably every ML technique included in the framework.</p> <p>Given your problem, perhaps begin with a <strong><em>Decision Tree</em></strong> algorithm (either the <a href="http://orange.biolab.si/doc/catalog/Classify/C4.5.htm" rel="noreferrer">C4.5</a> or <em>ID3</em> implementation). A fairly recent edition of Dr. Dobb's Journal (online) includes an excellent article on using decision trees; the use case is web-server data (from the server access log).</p> <p>Orange has a <a href="http://orange.biolab.si/doc/catalog/Classify/C4.5.htm" rel="noreferrer">C4.5 implementation</a>, available from the GUI (as a "widget"). If that's too easy, about 100 lines is all it takes to code one in Python. <a href="http://www.oreillynet.com/mac/blog/2007/06/an_addendum_to_building_decisi.html" rel="noreferrer">Here</a>'s the source for a working implementation in that language.</p> <p>I recommend starting with a Decision Tree for several reasons.
</p> <ol> <li><p>If it works on your data, you will not only have a trained classifier, but you will also have a <strong>visual representation of the entire classification schema</strong> (represented as a binary tree). Decision Trees are (probably) unique among ML techniques in this respect.</p></li> <li><p>The <strong>characteristics of your data</strong> are aligned with the optimal performance scenario of C4.5; the data can be either categorical or continuous variables (though this technique performs better if more of the features (columns/fields) are discrete rather than continuous, which seems to describe your data); also, Decision Tree algorithms can accept incomplete data points without any pre-processing.</p></li> <li><p><strong>Simple data pre-processing.</strong> The data fed to a decision tree algorithm does not require as much pre-processing as most other ML techniques; pre-processing is often (usually?) the most time-consuming task in the entire ML workflow. It's also sparsely documented, so it's probably also the most likely source of error.</p></li> <li><p><strong>You can deduce the (relative) weight of each variable from each node's distance from the root--in other words, from a quick visual inspection of the trained classifier</strong>. Recall that the trained classifier is just a binary tree (and is often <a href="http://www.oreillynet.com/mac/blog/2005/10/graphviz_why_draw_when_you_can.html" rel="noreferrer">rendered</a> this way) in which the nodes correspond to one value of one feature (variable, or column in your data set); the two edges joined to that node represent the data points split into two groups based on each point's value for that feature (e.g., if the feature is the categorical variable "Publication Date in HTML Page Head?", then through the left edge will flow all data points in which the publication date is not within the opening and closing head tags, and the right edge gets the other group).
What is the significance of this? Since a node just represents a state or value of a particular variable, that variable's importance (or weight) in classifying the data can be deduced from its position in the tree--i.e., the closer it is to the root node, the more important it is.</p></li> </ol> <p><br/>From your question, it seems you have two tasks to complete before you can feed your training data to an ML classifier.</p> <p><strong>I. identify plausible class labels</strong></p> <p>What you want to predict is a date. Unless your resolution requirements are unusually strict (e.g., resolved to a single date), I would build a classification model (which returns a class label given a data point) rather than a regression model (which returns a single continuous value).</p> <p>Given that your response variable is a date, a straightforward approach is to set the earliest date as the baseline, 0, and represent every other date as an integer giving its distance in days from that baseline. Next, discretize all dates into a small number of <strong><em>ranges</em></strong>. One very simple technique for doing this is to calculate the five summary descriptive statistics for your response variable (min, 1st quartile, median, 3rd quartile, and max). From these five statistics, you get four sensibly chosen date ranges (though probably not of equal span or of equal membership size).
</p> <p>These four ranges of date values then represent your class labels--so, for instance, classI might be all data points (web pages, I suppose) whose response variable (publication date) is 0 to 10 days after 0; classII is 11 to 25 days after 0; etc.</p> <p><em><strong>[Note: added the code below in light of the OP's comment below this answer, requesting clarification.]</strong></em></p> <pre><code># suppose these are publication dates
&gt;&gt;&gt; pd0 = "04-09-2011"
&gt;&gt;&gt; pd1 = "17-05-2010"

# convert them to python datetime instances, e.g.,
&gt;&gt;&gt; from datetime import datetime
&gt;&gt;&gt; pd0 = datetime.strptime(pd0, "%d-%m-%Y")

# gather them in a python list and then call sort on that list:
&gt;&gt;&gt; pd_all = [pd0, pd1, pd2, pd3, ...]
&gt;&gt;&gt; pd_all.sort()
# 'sort' will perform an in-place sort on the list of datetime objects,
# such that the earliest date is at index 0, etc.

# now the first item in that list is of course the earliest publication date
&gt;&gt;&gt; pd_all[0]
datetime.datetime(2010, 5, 17, 0, 0)

# express all dates except the earliest one as the absolute difference in days
# from that earliest date
&gt;&gt;&gt; td0 = pd_all[1] - pd_all[0]   # td0 is a timedelta object
&gt;&gt;&gt; td0
datetime.timedelta(475)

# convert the time deltas to integers via the timedelta 'days' attribute:
&gt;&gt;&gt; fnx = lambda v: v.days
&gt;&gt;&gt; time_deltas = [td0, ...]

# d is just a python list of integers representing the number of days
# from a common baseline date
&gt;&gt;&gt; d = list(map(fnx, time_deltas))
</code></pre> <p><strong>II. convert your raw data to an "ML-useable" form.</strong></p> <p>For a C4.5 classifier, this task is far simpler and requires fewer steps than for probably every other ML algorithm.
What is preferred here is to discretize as many of your parameters as possible into a relatively small number of values--e.g., if one of your parameters/variables is "distance of the publication date string from the closing body tag", then I would suggest discretizing those values into ranges, much as marketing surveys often ask participants to report their age in one of a specified set of spans (18 - 35; 36 - 50; etc.) rather than as a single integer (41).</p>
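<p>Putting sections I and II together, the quartile-based discretization described above can be sketched in plain Python. This is only a sketch, not the author's code: the day-offsets below are hypothetical, and <code>statistics.quantiles</code> requires Python 3.8+.</p>

```python
from statistics import quantiles

# hypothetical day-offsets from the baseline (earliest) publication date,
# as produced by the datetime/timedelta snippet above
day_offsets = [0, 3, 11, 14, 25, 31, 47, 60, 88, 120, 200, 475]

# the three interior cut points (1st quartile, median, 3rd quartile);
# together with the min and max these are the five summary statistics
q1, q2, q3 = quantiles(day_offsets, n=4)

def date_class(days):
    """Map a day-offset to one of the four range-based class labels."""
    if days <= q1:
        return "classI"
    elif days <= q2:
        return "classII"
    elif days <= q3:
        return "classIII"
    return "classIV"

labels = [date_class(d) for d in day_offsets]
```

<p>Each of the four resulting ranges becomes one class label; as noted above, the ranges will generally not have equal span, though quartile cuts do give them roughly equal membership.</p>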
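<p>And regarding coding a decision tree yourself: the heart of an ID3-style learner is choosing, at each node, the feature with the highest information gain. Here is a minimal sketch of that core step--the feature names and data are hypothetical, and this is a simplification of what C4.5 actually does (C4.5 uses gain ratio and handles continuous features as well).</p>

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Reduction in label entropy from splitting on one categorical feature."""
    n = len(labels)
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[feature], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in split.values())
    return entropy(labels) - remainder

def best_feature(rows, labels):
    """The feature an ID3-style tree would place closest to the root."""
    return max(rows[0], key=lambda f: information_gain(rows, labels, f))

# hypothetical training data: a feature that predicts the class perfectly
# and one that carries no information
rows = [
    {"date_in_head": "yes", "long_page": "no"},
    {"date_in_head": "yes", "long_page": "yes"},
    {"date_in_head": "no",  "long_page": "no"},
    {"date_in_head": "no",  "long_page": "yes"},
]
labels = ["recent", "recent", "old", "old"]
best = best_feature(rows, labels)   # "date_in_head"
```

<p>This is exactly the "weight from distance to the root" property described in point 4 above: the feature with the highest gain ends up at the root of the tree.</p>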