Bayesian Filtering - Waikato Linux Users Group

A statistical filtering method that assigns a probability to each unit of information that appears in a document and combines those probabilities to decide which category the document belongs to. It is commonly used to distinguish between Spam and Ham in Email, where each unit of information is a word and the probabilities are usually assigned according to a Naive Bayesian calculation. It can, however, sort messages into any number of categories, and it can be applied to any corpus of documents that are to be categorized, not just Email.
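As a rough illustration of the idea, here is a minimal sketch of such a filter in Python. Nothing in it comes from a particular spam filter: the class name `NaiveBayesFilter`, its methods, the whitespace tokenisation, the add-one smoothing, and the tiny training messages are all assumptions made for the example. It counts words per category, then scores a new message by adding up log-probabilities, treating each word as independent of the others.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesFilter:
    """Toy word-based Naive Bayes classifier (hypothetical, for illustration)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> documents seen
        self.vocabulary = set()                  # every word seen in training

    def train(self, text, category):
        # Assumption: a "unit of information" is a lowercase whitespace-separated word.
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def scores(self, text):
        """Return an unnormalised log-probability score for each category."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        result = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Start from the prior: how common this category was in training.
            score = math.log(n_docs / total_docs)
            for word in words:
                # Add-one smoothing (an assumption of this sketch) keeps unseen
                # words from driving the probability to zero.
                p = (counts[word] + 1) / (total_words + len(self.vocabulary))
                # Summing logs is equivalent to multiplying the probabilities.
                score += math.log(p)
            result[category] = score
        return result

    def classify(self, text):
        scores = self.scores(text)
        return max(scores, key=scores.get)


# Made-up training data, far too small for real use.
f = NaiveBayesFilter()
f.train("cheap pills buy now", "spam")
f.train("win money now click here", "spam")
f.train("meeting agenda for the users group", "ham")
f.train("patch for the kernel module attached", "ham")

print(f.classify("buy cheap money now"))
print(f.classify("agenda for the kernel meeting"))
```

Because categories are just dictionary keys here, the same sketch would handle more than two categories without changes. A real filter would need a much larger corpus and a better tokeniser.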

Naive Bayesian filtering assumes that events are independent, i.e. that words appearing in a document are unrelated to each other. Obviously they are not, but disregarding that information still allows remarkably accurate judgements. Attempts to make it "smarter" often reduce accuracy rather than improve it. Naive Bayesian is simple, fast, wrong, effective, and accurate. Welcome to the glorious world of machine learning.
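Written out, the independence assumption amounts to the following. The notation is assumed here, not taken from the page: C is a category such as Spam or Ham, and w_1 to w_n are the words of the document.

```latex
P(C \mid w_1, \dots, w_n) \;\propto\; P(C)\, P(w_1, \dots, w_n \mid C)
\;\approx\; P(C) \prod_{i=1}^{n} P(w_i \mid C)
```

The first step is Bayes' rule with the constant denominator dropped. The second step is the "naive" one: the joint probability of all the words is replaced by a product of per-word probabilities, which would only be exact if the words really were independent given the category.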

CategoryAntiSpam

Last edited on Wednesday, July 6, 2005 12:22:02 am by AristotlePagaltzis
