Lecture 3: Language Model Smoothing Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/NLP16 CS6501 Natural Language Processing 1 CuuDuongThanCong.com https://fb.com/tailieudientucntt This lecture Zipf’s law Dealing with unseen words/n-grams Add-one smoothing Linear smoothing Good-Turing smoothing Absolute discounting Kneser-Ney smoothing CS6501 Natural Language Processing 2 CuuDuongThanCong.com https://fb.com/tailieudientucntt Recap: Bigram language model <S> I am Sam </S> <S> I am legend </S> <S> Sam I am </S> Let P(<S>) = 1 P( I | <S>) = 2 / 3 P(am | I) = 1 P( Sam | am) = 1/3 P( </S> | Sam) = 1/2 P( <S> I am Sam</S>) = 1*2/3*1*1/3*1/2 CS6501 Natural Language Processing 3 CuuDuongThanCong.com https://fb.com/tailieudientucntt More examples: Berkeley Restaurant Project sentences can you tell me about any good cantonese restaurants close by mid priced thai food is what i’m looking for tell me about chez panisse can you give me a listing of the kinds of food that are available i’m looking for a good place to eat breakfast when is caffe venezia open during the day CS6501 Natural Language Processing 4 CuuDuongThanCong.com https://fb.com/tailieudientucntt Raw bigram counts Out of 9222 sentences CS6501 Natural Language Processing 5 CuuDuongThanCong.com https://fb.com/tailieudientucntt Raw bigram probabilities Normalize by unigrams: Result: CS6501 Natural Language Processing 6 CuuDuongThanCong.com https://fb.com/tailieudientucntt Zeros Training set: Test set … denied the allegations … denied the offer … denied the reports … denied the loan … denied the claims … denied the request P(“offer” | denied the) = 0 CS6501 Natural Language Processing 7 CuuDuongThanCong.com https://fb.com/tailieudientucntt Smoothing This dark art is why NLP is taught in the engineering school. There are more principled smoothing methods, too. We’ll look next at log-linear models, which are a good and popular general technique. But the traditional methods are easy to implement, run fast, and will give you intuitions about what you want from a smoothing method.
Credit: the following slides are adapted from Jason Eisner’s NLP course CS6501 Natural Language Processing 8 CuuDuongThanCong.com https://fb.com/tailieudientucntt What is smoothing? 20 200 2000 2000000 CS6501 Natural Language Processing 9 CuuDuongThanCong.com https://fb.com/tailieudientucntt ML 101: bias variance tradeoff Different samples of size 20 vary considerably though on average, they give the correct bell curve! 20 20 20 20 CS6501 Natural Language Processing 10 CuuDuongThanCong.com https://fb.com/tailieudientucntt Overfitting CS6501 Natural Language Processing 11 CuuDuongThanCong.com https://fb.com/tailieudientucntt The perils of overfitting N-grams only work well for word prediction if the test corpus looks like the training corpus In real life, it often doesn’t We need to train robust models that generalize! One kind of generalization: Zeros! Things that don’t ever occur in the training set But occur in the test set CS6501 Natural Language Processing 12 CuuDuongThanCong.com https://fb.com/tailieudientucntt The intuition of smoothing When we have sparse statistics: P(w | denied the) 3 allegations 2 reports allegations 1 claims outcome reports attack 1 request … request claims man 7 total Steal probability mass to generalize better P(w | denied the) 2.5 allegations allegations allegations outcome 1.5 claims reports … man request claims 0.5 request 2 other Credit: Dan Klein 7 total CS6501 Natural Language Processing 13 CuuDuongThanCong.com https://fb.com/tailieudientucntt Add-one estimation (Laplace smoothing) Pretend we saw each word one more time than we did Just add one to all the counts! MLE estimate: c(wi-1, wi ) PMLE (wi | wi-1 ) = c(wi-1 ) Add-1 estimate: c(wi-1, wi ) +1 PAdd-1 (wi | wi-1 ) = c(wi-1 ) +V CS6501 Natural Language Processing 14 CuuDuongThanCong.com https://fb.com/tailieudientucntt Add-One Smoothing xya 100 100/300 101 101/326 xyb 0 0/300 1 1/326 xyc 0 0/300 1 1/326 xyd 200 200/300 201 201/326 xye 0 0/300 1 1/326 … xyz 0 0/300 1 1/326 Total xy 300 300/300 326 326/326 CS6501 Natural Language Processing 15 CuuDuongThanCong.com https://fb.com/tailieudientucntt Berkeley Restaurant Corpus: Laplace smoothed bigram counts CuuDuongThanCong.com https://fb.com/tailieudientucntt Laplace-smoothed bigrams V=1446 in the Berkeley Restaurant Project corpus CuuDuongThanCong.com https://fb.com/tailieudientucntt Reconstituted counts CuuDuongThanCong.com https://fb.com/tailieudientucntt Compare with raw bigram counts CuuDuongThanCong.com https://fb.com/tailieudientucntt Problem with Add-One Smoothing We’ve been considering just 26 letter types … xya 1 1/3 2 2/29 xyb 0 0/3 1 1/29 xyc 0 0/3 1 1/29 xyd 2 2/3 3 3/29 xye 0 0/3 1 1/29 … xyz 0 0/3 1 1/29 Total xy 3 3/3 29 29/29 CS6501 Natural Language Processing 20 CuuDuongThanCong.com https://fb.com/tailieudientucntt Problem with Add-One Smoothing Suppose we’re considering 20000 word types see the abacus 1 1/3 2 2/20003 see the abbot 0 0/3 1 1/20003 see the abduct 0 0/3 1 1/20003 see the above 2 2/3 3 3/20003 see the Abram 0 0/3 1 1/20003 … see the zygote 0 0/3 1 1/20003 Total 3 3/3 20003 20003/20003 CS6501 Natural Language Processing 21 CuuDuongThanCong.com https://fb.com/tailieudientucntt Problem with Add-One Smoothing Suppose we’re considering 20000 word types see the abacus 1 1/3 2 2/20003 event” = event0 never happened see the abbot “Novel 0/3 1 1/20003 in training data. see the19998 Here: 0 abduct novel events, 0/3 with 1 1/20003 total estimated probability see the above 19998/20003. 2 2/3 3 3/20003 Add-one smoothing thinks we are extremely likely to see see the Abram 0 0/3 1 1/20003 novel events, rather than words we’ve seen. … see the zygote 0 0/3 1 1/20003 Total 3 3/3 20003 20003/20003 CS6501 Natural Language Processing 22 600.465 - Intro to NLP - J.com https://fb.com/tailieudientucntt 22 Infinite Dictionary? In fact, aren’t there infinitely many possible word types? see the aaaaa 1 1/3 2 2/(∞+3) see the aaaab 0 0/3 1 1/(∞+3) see the aaaac 0 0/3 1 1/(∞+3) see the aaaad 2 2/3 3 3/(∞+3) see the aaaae 0 0/3 1 1/(∞+3) … see the zzzzz 0 0/3 1 1/(∞+3) Total 3 3/3 (∞+3) (∞+3)/(∞+3) CS6501 Natural Language Processing 23 CuuDuongThanCong.com https://fb.com/tailieudientucntt Add-Lambda Smoothing A large dictionary makes novel events too probable.
To fix: Instead of adding 1 to all counts, add = 0.01? This gives much less probability to novel events. But how to pick best value for ? That is, how much should we smooth? CS6501 Natural Language Processing 24 CuuDuongThanCong.com https://fb.com/tailieudientucntt Add-0.001 Smoothing Doesn’t smooth much (estimated distribution has high variance) xya 1 1/3 1.026 1 CS6501 Natural Language Processing 25 CuuDuongThanCong.com https://fb.com/tailieudientucntt Add-1000 Smoothing Smooths too much (estimated distribution has high bias) xya 1 1/3 1001 1/26 xyb 0 0/3 1000 1/26 xyc 0 0/3 1000 1/26 xyd 2 2/3 1002 1/26 xye 0 0/3 1000 1/26 … xyz 0 0/3 1000 1/26 Total xy 3 3/3 26003 1 CS6501 Natural Language Processing 26 CuuDuongThanCong.com https://fb.com/tailieudientucntt Add-Lambda Smoothing A large dictionary makes novel events too probable. To fix: Instead of adding 1 to all counts, add = 0.01? This gives much less probability to novel events. But how to pick best value for ? That is, how much should we smooth? E., how much probability to “set aside” for novel events? Depends on how likely novel events really are! Which may depend on the type of text, size of training corpus, … Can we figure it out from the data? We’ll look at a few methods for deciding how much to smooth.
CS6501 Natural Language Processing 27 CuuDuongThanCong.com https://fb.com/tailieudientucntt Setting Smoothing Parameters How to pick best value for ? (in add- smoothing) Try many values & report the one that gets best results? Training Test How to measure whether a particular gets good results? Is it fair to measure that on test data (for setting )? Moral: Selective reporting on test data can make a method look artificially good. So it is unethical. Rule: Test data cannot influence system development. No peeking! Use it only to evaluate the final system(s).
Report all results on it. CS6501 Natural Language Processing 28 CuuDuongThanCong.com https://fb.com/tailieudientucntt Setting Smoothing Parameters How to pick best value for ? (in add- smoothing) Try many values & report the one that gets best results? Feynman’s Advice: “The first principle is that you Test Training must not How to measure fool ayourself, whether particular and gets good results? you are the easiest person Is it fair to measure that on test data (for setting )? to fool.” Moral: Selective reporting on test data can make a method look artificially good. So it is unethical. Rule: Test data cannot influence system development.
No peeking! Use it only to evaluate the final system(s). Report all results on it. CS6501 Natural Language Processing 29 CuuDuongThanCong.com https://fb.com/tailieudientucntt Setting Smoothing Parameters How to pick best value for ? Try many values & report the one that gets best results? Training Test Dev. Training … and Pick that … when we collect counts Now use that report gets best from this 80% and smooth to get results of results on them using add- smoothing.
smoothed that final this 20% … counts from model on all 100% … test data. CS6501 Natural Language Processing 30 600.465 - Intro to NLP - J.com https://fb.com/tailieudientucntt Large or small Dev set? Here we held out 20% of our training set (yellow) for development. Would like to use > 20% yellow: 20% not enough to reliably assess Would like to use > 80% blue: Best for smoothing 80% best for smoothing 100% CS6501 Natural Language Processing 31 CuuDuongThanCong.com https://fb.com/tailieudientucntt Cross-Validation Try 5 training/dev splits as below Pick that gets best average performance Dev. Tests on all 100% as yellow, so we can more reliably assess Still picks a that’s good at smoothing the 80% size, not 100%.
But now we can grow that 80% without trouble CS6501 Natural Language Processing 32 600.465 - Intro to NLP - J.com https://fb.com/tailieudientucntt 32 N-fold Cross-Validation (“Leave One Out”) Test … Test each sentence with smoothed model from other N-1 sentences Still tests on all 100% as yellow, so we can reliably assess Trains on nearly 100% blue data ((N-1)/N) to measure whether is good for smoothing that CS6501 Natural Language Processing 33 600.465 - Intro to NLP - J.com https://fb.com/tailieudientucntt 33 N-fold Cross-Validation (“Leave One Out”) Test … Surprisingly fast: why? Usually easy to retrain on blue by adding/subtracting 1 sentence’s counts CS6501 Natural Language Processing 34 600.465 - Intro to NLP - J.com https://fb.com/tailieudientucntt 34 More Ideas for Smoothing Remember, we’re trying to decide how much to smooth., how much probability to “set aside” for novel events? Depends on how likely novel events really are Which may depend on the type of text, size of training corpus, … Can we figure this out from the data? CS6501 Natural Language Processing 35 CuuDuongThanCong.com https://fb.com/tailieudientucntt How likely are novel events? 20000 types 300 tokens 300 tokens a 150 0 both 18 0 candy 0 1 donuts 0 2 every 50 versus 0 ??? 0 0 grapes 0 0/300 1 0/300 his 38 0 ice cream 0 7 … which zero would you expect is really rare? CS6501 Natural Language Processing 36 CuuDuongThanCong.com https://fb.com/tailieudientucntt How likely are novel events? 20000 types 300 tokens 300 tokens a 150 0 both 18 0 candy 0 1 donuts 0 2 every 50 versus 0 farina 0 0 grapes 0 1 his 38 0 ice cream 0 7 determiners:…a closed class CS6501 Natural Language Processing 37 CuuDuongThanCong.com https://fb.com/tailieudientucntt How likely are novel events? 20000 types 300 tokens 300 tokens a 150 0 both 18 0 candy 0 1 donuts 0 2 every 50 versus 0 farina 0 0 grapes 0 1 his 38 0 ice cream 0 7 … (food) nouns: an open class CS6501 Natural Language Processing 38 CuuDuongThanCong.com https://fb.com/tailieudientucntt Zipfs’ law the r-th most common word 𝑤𝑟 has P(𝑤𝑟 ) ∝ 1/r A few words are very frequent http://wugology.com/zipfs-law/ CS6501 Natural Language Processing 39 CuuDuongThanCong.