
A way of filtering based on statistics. For every document (email) that arrives, you look at each word in it and estimate how likely that word is to appear in SPAM or in HAM[1], based on previously classified documents (emails). You then use a Naive Bayesian calculation to combine those per-word probabilities into the probability that the document is SPAM or HAM. If it's SPAM, you put it into the SPAM folder.
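The procedure above can be sketched in a few lines of Python. This is only a minimal illustration, not any particular filter's implementation: the class name NaiveBayesFilter, the whitespace tokenizer, the add-one smoothing, and the 0.5 threshold are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real filters tokenize more carefully.
    return text.lower().split()


class NaiveBayesFilter:
    """Hypothetical minimal word-count Naive Bayes filter."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Record one previously classified document under "spam" or "ham".
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Prior for the class, with add-one smoothing so it is never zero.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            # Extra +1 reserves probability mass for words never seen in training.
            denominator = total_words + len(vocab) + 1
            for word in tokenize(text):
                # The "naive" step: each word contributes independently.
                count = self.word_counts[label][word]
                score += math.log((count + 1) / denominator)
            log_scores[label] = score
        # Turn the two log scores into a probability without underflow.
        highest = max(log_scores.values())
        weights = {k: math.exp(v - highest) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])

    def classify(self, text, threshold=0.5):
        # Assumption: anything above the threshold goes to the SPAM folder.
        return "spam" if self.spam_probability(text) > threshold else "ham"
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a single unseen word from forcing a probability of zero.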

It's called "Naive Bayesian" because it assumes that events (words) are independent, when they are obviously not. However, it works remarkably well, and attempts to make it "smarter" tend to push the error rate up rather than down. It's simple, fast, effective, wrong, and actually works. Welcome to the glorious world of MachineLearning.
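Written out, the independence assumption looks like the following. The symbols are introduced here for illustration only: w_1 through w_n stand for the words of one document, and the same expression with HAM in place of SPAM gives the competing score.

```latex
% Bayes' rule for a document made of words w_1, ..., w_n:
P(\mathrm{SPAM} \mid w_1, \dots, w_n)
  = \frac{P(\mathrm{SPAM}) \, P(w_1, \dots, w_n \mid \mathrm{SPAM})}{P(w_1, \dots, w_n)}

% The "naive" step: treat the words as independent given the class.
P(w_1, \dots, w_n \mid \mathrm{SPAM}) \approx \prod_{i=1}^{n} P(w_i \mid \mathrm{SPAM})
```

The second line is the part that is "wrong": real words are correlated, but dropping those correlations is what makes the calculation cheap enough to run on every incoming email.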

[1]: Ham, obviously, is not spam.