Note that there are some explanatory texts on larger screens.

plurals
  1. POcompression algorithm for street names wanted
    text
    copied!<p>I have a list of millions of street names and want to compress them using a compression algorithm. I am not sure which algorithm would fit best. Most street names have common substrings in them, such as for example "street", "way", ...</p> <p>The set of all street names is fixed and won't change dynamically. </p> <p>At first I was thinking of huffman coding, but that only codes single letters, so it won't give great performance. So I thought of generating a trie and counting the most common substrings. Then I could have some sort of code to traverse this trie in order to get the word back, and compress these codes using something like huffman coding. I'm not sure if this is not making it more complicated than it needs to be.</p> <p>Does anyone know a compression technique that makes sense in my case?</p> <p><strong>EDIT 1</strong></p> <p>My usecase is thus: I have a phone device with limited storage size. This phone needs to hold all the street names of all streets in a particular country. Now every street object has some values and amongst them the name of the street as a string. That takes up most space and I would like to minimize it. Since the names are quite similar, i.e. most ending on "...street" or "...way", I thought it may be worth implementing a specific compression algorithm geared towards this scenario.</p> <p>A simple gzip has brought a compression of about 50%. I think it should be possible to get more out of it.</p> <p><strong>EDIT 2</strong></p> <p>The solution of Ebbe M. Pedersen is actually giving very good performance results. Here is some code (written in C#):</p> <pre><code> private IndexedItem[] _items; public void CompressStrings(string[] strings) { Array.Sort(strings); _items = new IndexedItem[strings.Length]; string lastString = string.Empty; for (int i = 0; i &lt; strings.Length; i++) { byte j = 0; while (lastString.Length &gt; j &amp;&amp; lastString[j] == strings[i][j]) { j++; } _items[i] = new IndexedItem() { Prefix = j, Suffix = strings[i].Substring(j) }; lastString = strings[i]; } } private struct IndexedItem { public byte Prefix; public string Suffix; } </code></pre> <p>After compression I am also sending it through a DeflateStream, which results in a total compression of about <strong>30%</strong></p> <p>Thanks very much for the answers</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload