nlp - Kneser-Ney smoothing of trigrams using Python NLTK
I'm trying to smooth a set of n-gram probabilities with Kneser-Ney smoothing using the Python NLTK. Unfortunately, the whole documentation is rather sparse.
What I'm trying to do is this: I parse a text into a list of tri-gram tuples. From this list I create a FreqDist and then use that FreqDist to calculate a KN-smoothed distribution.
I'm pretty sure, though, that the result is totally wrong. When I sum up the individual probabilities I get something way beyond 1. Take this code example:
    import nltk

    ngrams = nltk.trigrams("What a piece of work is a man! how noble in reason! how infinite in faculty! in \
    form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
    the beauty of the world, the paragon of animals!")

    freq_dist = nltk.FreqDist(ngrams)
    kneser_ney = nltk.KneserNeyProbDist(freq_dist)
    prob_sum = 0
    for i in kneser_ney.samples():
        prob_sum += kneser_ney.prob(i)
    print(prob_sum)
the output "41.51696428571428". depending on corpus size, value grows infinitely large. makes whatever prob() returns probability distribution.
looking @ nltk code implementation questionable. maybe don't understand how code supposed used. in case, give me hint please? in other case: know working python implementation? don't want implement myself.
Kneser-Ney (do also have a look at Goodman and Chen for a great survey of the different smoothing techniques) is a quite complicated smoothing method, and only a few packages that I am aware of got it right. I am not aware of a Python implementation, but you can definitely try SRILM if you just need the probabilities, etc.
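For reference, a typical SRILM recipe for an interpolated Kneser-Ney trigram model looks roughly like this (shell commands rather than Python; train.txt and test.txt are placeholder file names, and the exact flags are worth double-checking against the SRILM man pages):

    # estimate an interpolated, Kneser-Ney discounted trigram model from a
    # tokenized, one-sentence-per-line training file, written in ARPA format
    ngram-count -text train.txt -order 3 -kndiscount -interpolate -lm kn3.arpa

    # score held-out text with the resulting model (log probabilities / perplexity)
    ngram -lm kn3.arpa -order 3 -ppl test.txt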
- There is a chance your sample has words that didn't occur in the training data (a.k.a. out-of-vocabulary (OOV) words), which, if not handled properly, can mess up the probabilities you get. Perhaps that is what causes the outrageously large and invalid prob?
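As a minimal sketch of one common way to handle that (the <UNK> token and the min-count threshold of 2 are just illustrative choices, not something prescribed by NLTK): map rare training words to an unknown token before building the FreqDist, and map unseen test words to the same token at query time.

    import nltk
    from collections import Counter

    def map_oov(tokens, vocab, unk="<UNK>"):
        # anything outside the known vocabulary is replaced by the <UNK> token
        return [w if w in vocab else unk for w in tokens]

    train_tokens = "what a piece of work is a man how noble in reason".split()

    # keep only words seen at least twice; rarer words become <UNK>,
    # so the smoothed model has counts for the unknown token as well
    counts = Counter(train_tokens)
    vocab = {w for w, c in counts.items() if c >= 2}
    train_tokens = map_oov(train_tokens, vocab)

    freq_dist = nltk.FreqDist(nltk.trigrams(train_tokens))
    kneser_ney = nltk.KneserNeyProbDist(freq_dist)

    # at query time, unseen words get the same treatment, so the model is
    # never asked about a word it has no counts for at all
    test_tokens = map_oov("what a strange piece of work".split(), vocab)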