It seems nobody knows, so I am answering my own question here, since other people face the same problem. I now know where to look, though I have not fully implemented it yet.

It lies deep inside `CountVectorizer` in `sklearn.feature_extraction.text`:

```python
def transform(self, raw_documents):
    """Extract token counts out of raw text documents using the
    vocabulary fitted with fit or the one provided in the constructor.

    Parameters
    ----------
    raw_documents: iterable
        an iterable which yields either str, unicode or file objects

    Returns
    -------
    vectors: sparse matrix, [n_samples, n_features]
    """
    if not hasattr(self, 'vocabulary_') or len(self.vocabulary_) == 0:
        raise ValueError("Vocabulary wasn't fitted or is empty!")

    # raw_documents can be an iterable so we don't know its size in
    # advance

    # XXX @larsmans tried to parallelize the following loop with joblib.
    # The result was some 20% slower than the serial version.
    analyze = self.build_analyzer()
    term_counts_per_doc = [Counter(analyze(doc)) for doc in raw_documents]
    # <<-- added here
    self.test_term_counts_per_doc = deepcopy(term_counts_per_doc)
    return self._term_count_dicts_to_matrix(term_counts_per_doc)
```

I added the line `self.test_term_counts_per_doc = deepcopy(term_counts_per_doc)` (note that `deepcopy` comes from the `copy` module), and it makes the per-document term counts callable from the vectorizer outside, like this:

```python
import os
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer

# recursive_load_files, trainer_path and tester_path are defined
# elsewhere in my script.
load_files = recursive_load_files
trainer_path = os.path.realpath(trainer_path)
tester_path = os.path.realpath(tester_path)

data_train = load_files(trainer_path, load_content=True, shuffle=False)
data_test = load_files(tester_path, load_content=True, shuffle=False)
print 'data loaded'

categories = None  # for case categories == None

print "%d documents (training set)" % len(data_train.data)
print "%d documents (testing set)" % len(data_test.data)
#print "%d categories" % len(categories)
print

# split a training set and a test set
print "Extracting features from the training dataset using a sparse vectorizer"
t0 = time()
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.7,
                             stop_words='english', charset_error="ignore")
X_train = vectorizer.fit_transform(data_train.data)
print "done in %fs" % (time() - t0)
print "n_samples: %d, n_features: %d" % X_train.shape
print

print "Extracting features from the test dataset using the same vectorizer"
t0 = time()
X_test = vectorizer.transform(data_test.data)

print "Test printing terms per document"
for counter in vectorizer.test_term_counts_per_doc:
    print counter
```

Here is my fork; I also submitted a pull request:

https://github.com/v3ss0n/scikit-learn

Please suggest a better way to do this if there is one.
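Update: since `build_analyzer()` is a public method, it may be possible to get the same per-document counts without patching scikit-learn at all. Below is an untested sketch of that idea, written for Python 3 and a current scikit-learn (unlike my Python 2 snippets above); the documents and variable names are made-up placeholders:

```python
# Sketch: reproduce term_counts_per_doc using only the public API,
# no modification of CountVectorizer.transform required.
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the quick brown fox", "the lazy dog sleeps", "quick quick dog"]

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')
X = vectorizer.fit_transform(docs)

# build_analyzer() returns the same preprocessing / tokenizing /
# stop-word-filtering callable that transform() uses internally, so
# counting its output per document gives the same Counter objects my
# patch exposes as test_term_counts_per_doc.
analyze = vectorizer.build_analyzer()
term_counts_per_doc = [Counter(analyze(doc)) for doc in docs]

for counter in term_counts_per_doc:
    print(counter)
```

The advantage is that this survives scikit-learn upgrades, since it does not depend on internals like `_term_count_dicts_to_matrix`.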
 
