Text classification using SVM works with unigrams but not higher order n-grams
    1. My first thought is that perhaps you are getting too many hash collisions. What hash algorithm are you using? I would sanity-check the hash and see how often two different n-grams hash to the same value.
    2. The function I'm using is relatively trivial: `Math.abs(text.hashCode() % _hash_size)`, where `_hash_size` is the bound on the feature set size (i.e. 4999). I haven't actually measured the collision rate, but I assume it is pretty low before I do the mod operation. As for quantifying the impact of the mod operation, I tried increasing the feature set size (i.e. to ~100k) and this didn't make any difference.
    3. Sure, by modding the result you are causing collisions, but the collisions are only partially related to the feature size. Even without the mod, if you have over 50,000 different n-grams you will start getting collisions with Java's `hashCode`. I'm assuming that if you have several texts with over 50,000 words, you have more n-grams than that. Hash functions are not all created equal: I would use a cryptographic hash function like MD5, which has a much lower collision rate. I also wouldn't cut the feature set down by modding it like that; use a feature selection algorithm instead.
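
The sanity check suggested in the first comment can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: the class name `HashCollisionCheck` and the helpers `hashCodeBucket`, `md5Bucket`, and `countCollisions` are hypothetical names introduced here. Only the expression `Math.abs(text.hashCode() % _hash_size)` comes from the thread. The MD5 variant assumes that taking the first four digest bytes is an acceptable way to get an integer out of the digest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.ToIntFunction;

public class HashCollisionCheck {

    // The bucket function quoted in the thread: String.hashCode() reduced modulo the feature set size.
    static int hashCodeBucket(String ngram, int hashSize) {
        return Math.abs(ngram.hashCode() % hashSize);
    }

    // Assumed alternative: derive the bucket from the first 4 bytes of the MD5 digest.
    static int md5Bucket(String ngram, int hashSize) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] d = md.digest(ngram.getBytes(StandardCharsets.UTF_8));
            int h = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                  | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
            return Math.floorMod(h, hashSize); // always in [0, hashSize)
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Counts distinct n-grams that land in a bucket already taken by a different n-gram.
    static int countCollisions(Iterable<String> ngrams, ToIntFunction<String> bucketOf) {
        Map<Integer, String> firstInBucket = new HashMap<>();
        Set<String> seen = new HashSet<>();
        int collisions = 0;
        for (String ngram : ngrams) {
            if (!seen.add(ngram)) {
                continue; // repeated occurrence of the same n-gram, not a collision
            }
            int bucket = bucketOf.applyAsInt(ngram);
            if (firstInBucket.putIfAbsent(bucket, ngram) != null) {
                collisions++;
            }
        }
        return collisions;
    }
}
```

Calling `countCollisions(ngrams, s -> hashCodeBucket(s, 4999))` on the extracted n-grams and dividing by the number of distinct n-grams gives a collision rate that can be compared across bucket counts and hash functions. Whichever hash is used, reducing it modulo a small `_hash_size` forces collisions as soon as there are more distinct n-grams than buckets, so a better hash only helps when the bucket count is large relative to the vocabulary.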