nlp - Kneser-Ney smoothing of trigrams using Python NLTK


I'm trying to smooth a set of n-gram probabilities with Kneser-Ney smoothing using the Python NLTK. Unfortunately, the whole documentation is rather sparse.

What I'm trying to do is this: I parse a text into a list of tri-gram tuples. From this list I create a FreqDist, and then use that FreqDist to calculate a KN-smoothed distribution.

I'm pretty sure, though, that the result is totally wrong. When I sum up the individual probabilities I get something way beyond 1. Take this code example:

import nltk

ngrams = nltk.trigrams("What a piece of work is a man! how noble in reason! how infinite in faculty! in \
form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
the beauty of the world, the paragon of animals!")

freq_dist = nltk.FreqDist(ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)
prob_sum = 0
for i in kneser_ney.samples():
    prob_sum += kneser_ney.prob(i)
print(prob_sum)

the output "41.51696428571428". depending on corpus size, value grows infinitely large. makes whatever prob() returns probability distribution.

Looking at the NLTK code, I'd say the implementation is questionable, but maybe I just don't understand how the code is supposed to be used. In that case, could you give me a hint, please? In any other case: do you know of a working Python implementation? I don't want to implement it myself.

Kneser-Ney (also have a look at Goodman and Chen for a great survey on different smoothing techniques) is a quite complicated smoothing method which only a few packages that I am aware of got right. I am not aware of a Python implementation, but you can definitely try SRILM if you just need probabilities, etc.
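To see what an implementation has to get right, here is a rough sketch of interpolated Kneser-Ney for bigrams along the lines of the Chen and Goodman formulation. It is only an illustration (the names are mine, and it assumes every queried context was seen in training), not a vetted replacement for SRILM:

from collections import Counter, defaultdict

def make_kn_bigram_model(tokens, discount=0.75):
    # Raw bigram counts c(w1, w2) and context counts c(w1).
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    # Type counts: distinct words following each context, and distinct
    # contexts preceding each word (the "continuation" counts).
    followers = defaultdict(set)
    continuations = defaultdict(set)
    for w1, w2 in bigram_counts:
        followers[w1].add(w2)
        continuations[w2].add(w1)
    bigram_types = len(bigram_counts)

    def prob(w2, w1):
        # Discounted maximum-likelihood term for P(w2 | w1).
        higher = max(bigram_counts[(w1, w2)] - discount, 0) / context_counts[w1]
        # Interpolation weight: the mass freed by discounting this context.
        lam = discount * len(followers[w1]) / context_counts[w1]
        # Continuation probability: fraction of bigram types ending in w2.
        p_cont = len(continuations[w2]) / bigram_types
        return higher + lam * p_cont

    return prob

model = make_kn_bigram_model("what a piece of work is a man how like an angel".split())
print(model("a", "what"))  # estimate of P("a" | "what")

The subtle parts are exactly the ones that tend to be implemented wrongly: the continuation probability must count distinct preceding contexts, not raw frequencies, and the discounted mass must be redistributed so that each conditional distribution still sums to 1.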

  • There is a chance your sample has words that didn't occur in the training data (aka out-of-vocabulary (OOV) words), which, if not handled properly, can mess up the probabilities you get. Perhaps this is what causes you to get outrageously large and invalid probabilities? One common workaround is sketched below.
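If OOV words are indeed the problem, a common workaround is to map rare training words to a single <UNK> token before counting, so that unseen test words can be scored as <UNK> as well. A minimal sketch (replace_rare and the threshold are my own illustration, not part of NLTK):

from collections import Counter

import nltk

def replace_rare(tokens, min_count=2, unk="<UNK>"):
    # Words seen fewer than min_count times become <UNK>; at test time,
    # any word missing from the training vocabulary is mapped to <UNK> too.
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else unk for t in tokens]

tokens = "what a piece of work is a man how noble in reason how infinite in faculty".split()
freq_dist = nltk.FreqDist(nltk.trigrams(replace_rare(tokens)))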
