Tuesday, December 27, 2016

Data size vs significance

One of my favorites, Dave Giles, brings up this reference:


Abstract, abstracted:


The Internet has provided IS researchers with the opportunity to conduct studies with extremely large samples, frequently well over 10,000 observations. There are many advantages to large samples, but researchers using statistical inference must be aware of the p-value problem associated with them. In very large samples, p-values go quickly to zero, and solely relying on p-values can lead the researcher to claim support for results of no practical significance.
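
A toy simulation, not from the paper, makes the abstract's point concrete. Here I compare two groups separated by a tiny mean shift of 0.02 standard deviations with an ordinary two-sample t-test; the shift is practically meaningless, yet the p-value collapses toward zero once the sample gets into the hundreds of thousands.

```python
# Toy demonstration: the p-value for the same tiny mean shift
# (0.02 standard deviations) shrinks toward zero as n grows into
# the hundreds of thousands.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
for n in (100, 10_000, 300_000):
    a = rng.normal(0.00, 1.0, n)   # control group
    b = rng.normal(0.02, 1.0, n)   # practically meaningless shift
    t, p = stats.ttest_ind(a, b)
    print(f"n = {n:>7}  p = {p:.3g}")
```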

The concept of precision and significance, measured in bits, helps here.

If I know my data is generated by a process with only 4 bits of significance, then I don't need more than a few hundred samples. I can guesstimate: if I have 4 digits representing re-arrangements, then I can take the factorial and get 24 re-arrangements, and maybe do that a few times, getting the hundred or so samples needed. Just off the top of my head, a dumb back-of-the-envelope calculation.
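
Spelling out that back-of-the-envelope arithmetic; the number of repetitions for "do that a few times" is my own guess of four.

```python
# Back-of-the-envelope: a 4-bit process, 4! arrangements, a few repeats.
from math import factorial

bits = 4
states = 2 ** bits               # a 4-bit process distinguishes 16 values
arrangements = factorial(4)      # 4! = 24 re-arrangements of 4 items
repeats = 4                      # assumed count for "a few times"
print(states, arrangements, arrangements * repeats)   # 16 24 96 -- about a hundred samples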

So, I have twenty camera models and 300,000 sales on the Internet. What will I find in my samples? Any theory that comprises a five-bit process.
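
For what it's worth, here is where the "five bit" figure plausibly comes from; this is my reading, not something stated above: twenty camera models carry about log2(20) bits of choice, which rounds up to five.

```python
# Twenty camera models carry about log2(20) ≈ 4.3 bits of choice,
# which rounds up to a five-bit process.
import math

print(math.log2(20))   # 4.3219...
```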

Here is a better way. If you knew the theory was a five-to-seven-bit accuracy theory, then make a Huffman tree and fill the bins at each node with the data having significance at that node. The bins will be large, maybe hundreds per bin; keep dropping the outliers at each bin, recomputing the tree, until you get a tree with three members per bin.
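
A loose sketch of that reduction, with heavy caveats: the prose does not pin down the details, so nearly everything below is an assumption I am filling in, namely a 5-bit quantization, a textbook Huffman tree over the quantized symbols, leaves treated as the bins, and "outlier" meaning the bin member farthest from its bin median, dropped one per bin per pass until no bin holds more than three members.

```python
# Loose sketch only; the quantization width, the use of leaves as bins,
# and the outlier rule are all assumptions, not the author's spec.
import heapq

import numpy as np


def quantize(x, bits=5):
    """Map real-valued data onto 2**bits equal-width codes."""
    lo, hi = x.min(), x.max()
    codes = np.floor((x - lo) / (hi - lo + 1e-12) * (2 ** bits)).astype(int)
    return np.minimum(codes, 2 ** bits - 1)


def huffman_depths(freqs):
    """Code length (tree depth) per symbol from a standard Huffman build."""
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    depths = {s: 0 for s in freqs}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            depths[s] += 1            # each merge pushes these symbols one level deeper
        heapq.heappush(heap, (f1 + f2, tie, s1 + s2))
        tie += 1
    return depths


def reduce_bins(x, bits=5, target=3):
    """Drop per-bin outliers and rebuild the tree until bins are small."""
    data = np.asarray(x, dtype=float)
    while True:
        codes = quantize(data, bits)
        freqs = {int(c): int((codes == c).sum()) for c in np.unique(codes)}
        depths = huffman_depths(freqs)            # where each bin sits in the tree
        bins = {c: data[codes == c] for c in freqs}
        if all(len(b) <= target for b in bins.values()):
            return bins, depths
        kept = []
        for b in bins.values():
            if len(b) > target:                   # drop the member farthest from the bin median
                b = np.delete(b, np.argmax(np.abs(b - np.median(b))))
            kept.append(b)
        data = np.concatenate(kept)


rng = np.random.default_rng(1)
bins, depths = reduce_bins(rng.normal(size=400))
print({c: len(b) for c, b in bins.items()})
print(depths)
```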

At some point in the reduction there is no insignificance left to remove, but the bins are uneven. The theories inevitably describe how the tree jams, how it deviates from null jamming. Then you have a quantization sufficient to test the theory; your theory is how the process mechanized the quantization.
