Thursday, November 1, 2018

Redneck Inc has competition

Google’s on-device text classification AI achieves 86.7% accuracy

Giggles wants to compete with my join system. I won. My system is simple and easy to convert to self-learning.

I was going to do taxes today, but now I have to read a bunch of technical articles about reducing sample size by 'distillation'. I don't want Giggles to whomp me. My goal is to take their distillation technique and use it to narrow down word lists for classifying plain text.

What I do is let the pros move the work forward, then simplify their research. Reading the stuff for the last five minutes makes me realize I need joint distributions from many word lists, and to distill those down to the word list having the highest entropy (lowest redundancy) over the whole set of lists. Egad!
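A rough Python sketch of that last idea, nothing more than my own throwaway helpers: compute the Shannon entropy of each word list's frequency distribution and keep the list with the highest entropy, i.e. the least redundancy.

import math
from collections import Counter

def entropy(word_list):
    # Shannon entropy (in bits) of the word frequency distribution of one list.
    counts = Counter(word_list)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def least_redundant(lists):
    # Keep the list with the highest entropy, i.e. the lowest redundancy.
    return max(lists, key=entropy)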

What is the Hamming distance between two words?


In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.
OK, we need equal-length word lists. But we can be a bit flexible if we assume the larger word list is more general and weight its distance accordingly. But what exactly is the distance between two words? We have a multi-dimensional problem: we can measure parts of speech, or measure the root origin of words. Measuring by prior classification works if the higher classification was trained on very large data sets, all of them of interest.
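To make the definition concrete, here is a minimal Python sketch of Hamming distance over two equal-length word lists; the function and the sample lists are just illustration, not anyone's library.

def hamming_distance(list_a, list_b):
    # Number of positions where two equal-length word lists differ.
    if len(list_a) != len(list_b):
        raise ValueError("word lists must be the same length")
    return sum(1 for a, b in zip(list_a, list_b) if a != b)

# Two five-word lists differing in two positions -> distance 2.
print(hamming_distance(
    ["general", "common", "words", "match", "text"],
    ["general", "special", "words", "match", "list"]))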

This is the problem at hand: data reduction across a reliable dimension when many dimensions are available. The research is about how to do this with little prior knowledge; assume the prior classification is non-parametric, so we do not know the real distribution of any dimension. Tough problem. On the server side, join is free to stack word lists against training samples, and selectively refine the match process as we stack join's output to join's input. The distance we really want is available in the structure of the original text. Plain text can capture some of that, rather than just generating a one-dimensional list of plain words; some text processing is necessary, just enough to get the basic structure, and then we can find and narrow the word lists at each node of our graph. It is a mess of spaghetti, but doable. I am still trying to get my brain around the concept, find the simple recursion, make it automatic.
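Roughly what I mean by stacking, sketched in Python. The names match() and stack() are hypothetical stand-ins, not join's real Match step; the point is only that each pass's output word list becomes the next pass's input, narrowed against another training sample.

def match(word_list, text):
    # One match pass: keep only the words from word_list that occur in the sample text.
    tokens = set(text.lower().split())
    return [w for w in word_list if w.lower() in tokens]

def stack(word_list, training_samples):
    # Stacking: feed each pass's output word list back in as the next pass's input.
    for sample in training_samples:
        word_list = match(word_list, sample)
    return word_list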

In join we have the Match process; that is where we can subsample. I need a research staff, so I must go kidnap more Russian and Ukrainian mathematicians.

Think of it as Huffman compression of sets of word lists. Start with the original plain text. Match it against the general dictionary of common words longer than five characters.
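A quick sketch of that first step, assuming a plain Python set of dictionary words and whitespace tokens; the helper name is made up. It keeps only dictionary words longer than five characters and records the matches in order, with counts, which is the raw material for the Huffman-style weighting.

from collections import Counter

def match_against_dictionary(text, dictionary):
    # Keep dictionary words longer than five characters, then record the matches
    # in the order they occur in the plain text, plus their counts.
    long_words = {w.lower() for w in dictionary if len(w) > 5}
    hits = [w for w in text.lower().split() if w in long_words]
    return hits, Counter(hits)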

One can extract the run-time distributions, the order and count at which words match. Do this in overlapping sections of the original text, generating multiple word lists out. Those lists should contain the article's structure, and, using a Hamming distance between lists, we can reconstruct the step and skip of the original text. That structural summary is given to the end user, who can use it for more detailed and local searches on the text.
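Here is a sketch of the windowing idea, again with made-up helper names: slide overlapping windows across the token stream, record which dictionary words each window matched, and compare neighboring windows with a Hamming-style distance, padding the shorter list so length differences count as mismatches.

def window_word_lists(tokens, dictionary, width=100, step=50):
    # Slide overlapping windows across the token stream and record, for each
    # window, which dictionary words matched (first-occurrence order, no repeats).
    lists = []
    for start in range(0, max(1, len(tokens) - width + 1), step):
        window = tokens[start:start + width]
        seen, matched = set(), []
        for w in window:
            if w in dictionary and w not in seen:
                seen.add(w)
                matched.append(w)
        lists.append(matched)
    return lists

def list_distance(a, b):
    # Hamming-style distance between two windows' word lists; the shorter list
    # is padded so unequal lengths count as mismatches.
    n = max(len(a), len(b))
    a = a + [None] * (n - len(a))
    b = b + [None] * (n - len(b))
    return sum(1 for x, y in zip(a, b) if x != y)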

Anyway, that is the plan: me and a bunch of kidnapped mathematicians here at old Redneck Inc. Be happy that we have stackable join technology.
