Thursday, November 1, 2018

Huffman coded wordlists

What does -iLog(i) imply for a wordlist?

It is the probability that the word list matches the text, where "match" is meant in a vector sense over the whole list, not on one atomic word. Set a requirement for a match, say that 3% of the text matches the list. The goal is to adjust your word lists so that -iLog(i) applies; once it does we have the structure, and to a specified accuracy we have the Huffman tree.
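
Here is a minimal sketch of how I read that, assuming a list "matches" a text when at least 3% of the text's tokens appear in the list; the function names and the 3% default are mine, not fixed by the post.

```python
import math

def match_probability(word_list, texts, threshold=0.03):
    """Fraction of sample texts that the word list matches, where a match
    means at least `threshold` of the text's tokens appear in the list."""
    vocab = set(word_list)
    hits = 0
    for text in texts:
        tokens = text.lower().split()
        if not tokens:
            continue
        overlap = sum(1 for t in tokens if t in vocab) / len(tokens)
        if overlap >= threshold:
            hits += 1
    return hits / len(texts) if texts else 0.0

def entropy_term(p):
    """The -p*log2(p) contribution of one word list, the -iLog(i) of the post."""
    return 0.0 if p == 0 else -p * math.log2(p)
```

So adjusting the word lists means nudging their membership until each list's match probability, and hence its entropy term, sits where you want it in the tree.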

We can do this with ensembles of related plain text. That works fine, but it is less accurate for any individual selection of plain text.

Extend the -iLog(i) concept to obtain structure, the step and skip. You get maximum distillation: the amount of word list space needed by the individual searcher is minimized. It all boils down to the same thing, congestion on maximum entropy encoding trees, a much better optimization process than neural nets. Word lists require much more training up front, like Watson. We have join technology for that.
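
As a sketch of where the tree structure comes from, here is a standard Huffman construction over word lists keyed by their match probabilities; this is my illustration of the general technique, not code from the post.

```python
import heapq

def huffman_tree(list_probs):
    """Build a Huffman tree over word lists.

    list_probs: dict mapping list name -> match probability.
    Returns a nested (left, right) tuple structure; leaves are list names.
    """
    heap = [(p, i, name) for i, (name, p) in enumerate(list_probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, counter, (left, right)))
        counter += 1
    return heap[0][2] if heap else None
```

The shape of that tree is the "step and skip" structure: rare lists sit deep, common lists sit shallow, and the searcher only carries the path it needs.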

HTML pages have structure that is captured in the text scrape process, and in general most formatted text has structure that is easily found; we just need to write a scraper for each format. However, the great bulk of data ends up as HTML sooner or later, and we have that done.
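
For the HTML case, a minimal text scrape can be done with the standard library alone; this is a generic sketch, not the scraper the post refers to as already done.

```python
from html.parser import HTMLParser

class TextScraper(HTMLParser):
    """Minimal scraper: keep visible text, skip script/style blocks."""
    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def scrape(html):
    scraper = TextScraper()
    scraper.feed(html)
    return " ".join(scraper.chunks)
```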

Consider the case where the word lists are visual feature sets, named with no assumed a priori relationship. Then the training consists of pruning the word lists, again, until you arrive at the optimum -iLog(i), and you are left with a finite but small set of guesses as to the image.
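
A crude version of that pruning loop, assuming the match_probability sketch above and a target probability you want each list to settle at; the greedy strategy is my assumption about how the pruning might be done.

```python
def prune_toward_target(word_list, texts, target_p, match_fn):
    """Greedily drop words so the list's match probability approaches target_p.

    match_fn is something like match_probability(word_list, texts) above.
    """
    words = list(word_list)
    best_err = abs(match_fn(words, texts) - target_p)
    improved = True
    while improved and len(words) > 1:
        improved = False
        for w in list(words):
            trial = [x for x in words if x != w]
            err = abs(match_fn(trial, texts) - target_p)
            if err < best_err:
                best_err, words, improved = err, trial, True
                break
    return words
```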

One can see that all of these techniques reduce identification to the process of creating a balanced Huffman tree: all pruned lists have equal, uniform probability. Then, against a large set of randomly selected text in the trained area of interest, the end user can try the word list structures in sequence, each equally probable.
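
The end-user step then looks roughly like this sequential trial, under the assumption that the pruned lists have been tuned to near-equal match probability; again a sketch with invented names.

```python
def identify(sample_texts, pruned_lists, match_fn, accept=0.5):
    """Try each pruned word list in sequence and return the first whose
    match probability over the sample clears the acceptance bar."""
    for name, words in pruned_lists.items():
        if match_fn(words, sample_texts) >= accept:
            return name
    return None
```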
