Diff: BayesianFiltering

Differences between current version and revision by previous author of BayesianFiltering.


Newer page: version 3 Last edited on Wednesday, July 6, 2005 12:22:02 am by AristotlePagaltzis
Older page: version 1 Last edited on Sunday, August 10, 2003 3:14:49 pm by PerryLorier
@@ -1,6 +1,6 @@
-A way of filtering based on statistics, for every document (email) that arrives, you look at each word in that document and see the probability that that word appears in previous SPAM or HAM [1] documents (emails). You then use a Naive Bayesian calculation to figure out the probability that it's SPAM or HAM. If it's SPAM you put it into the SPAM folder
+A statistical filtering method that assigns a probability to each unit of information that appears in a document and combines those probabilities to decide which category the document belongs to. It is commonly used to distinguish between [Spam] and Ham in [Email], where each unit of information is a word and the probabilities are usually assigned according to a Naive Bayesian calculation. It could, however, be used to sort messages into any number of categories, and it can be applied to any corpus of documents that are to be categorized, not just [Email].
  
-It's called "Naive Bayesian" because it assumes that events (Words) are independent, when they are obviously not. However, it works remarkably well, and attempts to make it "smarter" tend to end up with the error rate getting higher and higher. It's simple, fast, effective, wrong, and actually works. Welcome to the glorious world of MachineLearning
+Naive Bayesian filtering assumes that events are independent, i.e. words appearing in a document are unrelated to each other. Obviously, they are not, but disregarding that information still allows remarkably accurate judgements. Attempts to make it "smarter" in fact tend to reduce accuracy. Naive Bayesian is simple, fast, wrong, effective, and accurate. Welcome to the glorious world of machine learning.
  
-  
-[1]: Ham, obviously, is not spam.  
+----  
+CategoryAntiSpam
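
The newer revision describes assigning a probability to each word and combining those probabilities under an independence assumption to pick a category. A minimal sketch of that idea in Python follows. It is not taken from the page or from any particular spam filter: the class name `NaiveBayesFilter`, the whitespace tokenizer, and the add-one (Laplace) smoothing are all assumptions made for illustration. Log-probabilities are summed rather than multiplying raw probabilities, which is the same calculation but avoids numeric underflow on long documents.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesFilter:
    """Hypothetical multinomial naive Bayes classifier over words (sketch only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> occurrences
        self.doc_counts = Counter()              # category -> training documents seen
        self.vocabulary = set()                  # every word seen in training

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase and split on whitespace; real filters tokenize more carefully.
        return text.lower().split()

    def train(self, text, category):
        words = self.tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def scores(self, text):
        """Return a log-probability score per category (not normalized)."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = self.tokenize(text)
        result = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Prior: fraction of training documents in this category.
            score = math.log(n_docs / total_docs)
            # The "naive" step: treat each word as independent and add its log-likelihood.
            # Add-one smoothing keeps unseen words from producing log(0).
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            result[category] = score
        return result

    def classify(self, text):
        scores = self.scores(text)
        return max(scores, key=scores.get)


spam_filter = NaiveBayesFilter()
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("buy cheap watches now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday", "ham")
print(spam_filter.classify("buy cheap pills"))
```

Because categories are just dictionary keys, the same sketch sorts documents into any number of categories, not only spam and ham, which matches the generalization the newer revision makes.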