Saturday, April 15, 2017

Huffman encoding in statistics

Consider this paper, reported on the No Hesitations blog. He talks about a sampling method called bagging:
Given a standard training set D of size n, bagging generates m new training sets, each of size n′, by sampling from D uniformly and with replacement.

So we have a sequence of deposit events called the training set. We treat that set like a stationary sample; in the basket brigade model, stationary means the basket sizes never change. Then, repeatedly pick a pseudo sequence of deposits, say 100, from the original list of 100. With replacement means that some events can be selected more than once. In fact, about 37% of the draws in any pseudo sequence are repeats, so roughly 63% of the original events show up at least once.
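A minimal sketch of that resampling step, in plain Python with the standard library only; the names deposits and pseudo_sequence are mine, just for illustration, and the last lines check the coverage figure numerically.

import random

# Original training set: 100 "deposit" events (illustrative placeholder data).
deposits = list(range(100))

def pseudo_sequence(data):
    # One bagging draw: same size, uniform, with replacement.
    return [random.choice(data) for _ in data]

# Coverage check: roughly 63% of the original events appear at least once
# in a pseudo sequence, so about 37% of the draws are repeats.
samples = [pseudo_sequence(deposits) for _ in range(1000)]
coverage = sum(len(set(s)) for s in samples) / (1000 * len(deposits))
print(f"average fraction of originals present: {coverage:.3f}")   # about 0.632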

You have a model; you can send the pseudo sequence into the model, and the red/green indicator will tip away from amber. The duplicates carry extra weight in the pseudo sequence, so we expect the light to be more influenced by them. So we start running pseudo sequences through the box and watch the light. We find that sometimes the light barely moves; the duplicates were not all that important to the model. However, some pseudo sequences will really bounce the light one way or the other.
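A toy version of that experiment, again just a sketch: the box here is nothing but the sample mean, and the light reading is how far that mean tips away from the mean of the original data. None of this comes from the paper; it only shows why most pseudo sequences barely move the light while a few bounce it hard.

import random
import statistics

random.seed(1)
deposits = [random.gauss(0, 1) for _ in range(100)]    # stand-in training data
baseline = statistics.mean(deposits)                   # the amber position

readings = []
for _ in range(1000):
    pseudo = [random.choice(deposits) for _ in range(len(deposits))]
    readings.append(statistics.mean(pseudo) - baseline)   # how far the light tips

print(f"typical tip: {statistics.pstdev(readings):.3f}")
print(f"largest tip: {max(abs(r) for r in readings):.3f}")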

What is going on is that we are grabbing sets of possible duplicates at random, trying to find the set with the most impact. Those duplicates form the typical set, the set that makes the top of the bell-shaped curve, and we care less about the outliers.

But it is typical with respect to what the model expects. This process can be done better if the original data can be encoded by significance, turned into a Huffman tree with some specified precision. Then, using the Huffman tree, generate random events by feeding the tree a uniform random number. Increase the precision of the encoder until it matches the precision of the red/green light, where the precision of the red/green light means the accuracy of the notches in the color change.
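One way to read that, sketched below with nothing but the standard library: build a Huffman tree from the empirical frequencies, then generate pseudo events by decoding a uniform random bit stream through the tree. Each event comes out with probability 2 to the minus its code length, which is the rounding-by-significance being described. The function names and the toy frequencies are mine, not the paper's.

import heapq
import random

def huffman_tree(freqs):
    # Leaves are symbols, internal nodes are (left, right) tuples.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (a, b)))
        count += 1
    return heap[0][2]

def draw(tree, rng=random):
    # Walk from the root on fair coin flips (a uniform random bit stream)
    # until a leaf is reached; symbols must not themselves be tuples.
    node = tree
    while isinstance(node, tuple):
        node = node[rng.getrandbits(1)]
    return node

freqs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
tree = huffman_tree(freqs)
sample = [draw(tree) for _ in range(10000)]
print({s: sample.count(s) / len(sample) for s in freqs})   # close to freqs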

The assumption is that the model is a finite-element integrator/differentiator. The random sequencer adjusts the finite element sizes until the thing is tuned.

This is pre-discovery of the duplicates: the Huffman encoder, or compressor, is really rounding off, but doing so by the significance of the data elements. The probability space is allocated so data redundancy is erased evenly as precision increases. All it requires is that the measurements be equally accurate for all the original data.
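If "precision of the encoder" is read as block length, there is a concrete check on that claim, sketched below: encode blocks of k events at a time and the per-event redundancy, the dyadic rounding error above the entropy, is squeezed below 1/k. The distribution and the helper function are made up for illustration.

import heapq
import itertools
import math

def code_lengths(freqs):
    # Huffman code length in bits for each symbol of a finite distribution.
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count,
                              {s: l + 1 for s, l in {**d1, **d2}.items()}))
        count += 1
    return heap[0][2]

p = {"a": 0.7, "b": 0.2, "c": 0.1}
entropy = -sum(q * math.log2(q) for q in p.values())

for k in (1, 2, 3):   # k plays the role of encoder precision
    blocks = {sym: math.prod(p[s] for s in sym)
              for sym in itertools.product(p, repeat=k)}
    lengths = code_lengths(blocks)
    bits = sum(blocks[s] * lengths[s] for s in blocks) / k
    print(f"blocks of {k}: {bits:.4f} bits/event, entropy {entropy:.4f}")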

Let's call this Ito's equivalent of an impulse function, or an iterative inversion of the model.
