nlp - Kneser-Ney smoothing of trigrams using Python NLTK
I'm trying to smooth a set of n-gram probabilities with Kneser-Ney smoothing using the Python NLTK. Unfortunately, the whole documentation is rather sparse.
What I'm trying to do is this: I parse a text into a list of tri-gram tuples. From this list I create a FreqDist and then use that FreqDist to calculate a KN-smoothed distribution.
I'm pretty sure, though, that the result is totally wrong. When I sum up the individual probabilities I get something way beyond 1. Take this code example:
    import nltk

    ngrams = nltk.trigrams("What a piece of work is a man! how noble in reason! how infinite in faculty! in \
    form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
    the beauty of the world, the paragon of animals!")

    freq_dist = nltk.FreqDist(ngrams)
    kneser_ney = nltk.KneserNeyProbDist(freq_dist)
    prob_sum = 0
    for i in kneser_ney.samples():
        prob_sum += kneser_ney.prob(i)
    print(prob_sum)
the output "41.51696428571428". depending on corpus size, value grows infinitely large. makes whatever prob() returns probability distribution.
looking @ nltk code implementation questionable. maybe don't understand how code supposed used. in case, give me hint please? in other case: know working python implementation? don't want implement myself.
Kneser-Ney (do also have a look at Goodman and Chen for a great survey of the different smoothing techniques) is a quite complicated smoothing method, and only a few packages that I am aware of got it right. I am not aware of a Python implementation, but you can definitely try SRILM if you just need the probabilities, etc.
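For reference, a typical SRILM recipe for an interpolated Kneser-Ney trigram model looks roughly like this (shell commands rather than Python; train.txt and test.txt are placeholder file names, and the exact flags are worth double-checking against the SRILM man pages):

    # estimate an interpolated, Kneser-Ney discounted trigram model from a
    # tokenized, one-sentence-per-line training file, written in ARPA format
    ngram-count -text train.txt -order 3 -kndiscount -interpolate -lm kn3.arpa

    # score held-out text with the resulting model (log probabilities / perplexity)
    ngram -lm kn3.arpa -order 3 -ppl test.txt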
- There is a chance your sample has words that didn't occur in the training data (a.k.a. out-of-vocabulary (OOV) words), which, if not handled properly, can mess up the probabilities you get. Perhaps that is what causes the outrageously large and invalid prob?
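As a minimal sketch of one common way to handle that (the <UNK> token and the min-count threshold of 2 are just illustrative choices, not something prescribed by NLTK): map rare training words to an unknown token before building the FreqDist, and map unseen test words to the same token at query time.

    import nltk
    from collections import Counter

    def map_oov(tokens, vocab, unk="<UNK>"):
        # anything outside the known vocabulary is replaced by the <UNK> token
        return [w if w in vocab else unk for w in tokens]

    train_tokens = "what a piece of work is a man how noble in reason".split()

    # keep only words seen at least twice; rarer words become <UNK>,
    # so the smoothed model has counts for the unknown token as well
    counts = Counter(train_tokens)
    vocab = {w for w, c in counts.items() if c >= 2}
    train_tokens = map_oov(train_tokens, vocab)

    freq_dist = nltk.FreqDist(nltk.trigrams(train_tokens))
    kneser_ney = nltk.KneserNeyProbDist(freq_dist)

    # at query time, unseen words get the same treatment, so the model is
    # never asked about a word it has no counts for at all
    test_tokens = map_oov("what a strange piece of work".split(), vocab)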