Differences between version 3 and predecessor to the previous major change of BayesianFiltering.
Newer page: | version 3 | Last edited on Wednesday, July 6, 2005 12:22:02 am | by AristotlePagaltzis |
Older page: | version 1 | Last edited on Sunday, August 10, 2003 3:14:49 pm | by PerryLorier |
@@ -1,6 +1,6 @@
-A way of filtering based on statistics, for every document (email) that arrives, you look at each word in that document and see the probability that that word appears in previous SPAM or HAM [1] documents (emails). You then use a Naive Bayesian calculation to figure out the probability that it's SPAM or HAM. If it's SPAM you put it into the SPAM folder.
+A statistical filtering method that assigns probabilities to each unit of information that appears in a document and uses the total of probabilities to decide which category the document belongs to. It is commonly used to distinguish between [Spam] and Ham in [Email], where each unit of information is a word and the probabilities are usually assigned according to a Naive Bayesian calculation. It could, however, be used to sort messages into any number of categories, and it can be applied with any corpus of documents which are to be categorized, not just [Email].
-It's called "Naive Bayesian" because it assumes that events (Words) are independent, when they are obviously not. However, it works remarkably well, and attempts to make it "smarter" tend to end up with the error rate getting higher and higher. It's simple, fast, effective, wrong, and actually works. Welcome to the glorious world of MachineLearning.
+Naive Bayesian assumes that events are independent, i.e. words appearing in a document are unrelated to each other. Obviously, they are not, but disregarding that information still allows remarkably accurate judgements. Attempts to make it "smarter" in fact tend to reduce accuracy. Naive Bayesian is simple, fast, wrong, effective, and accurate. Welcome to the glorious world of machine learning.
-
-[1]: Ham, obviously, is not spam.
+----
+CategoryAntiSpam
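
The calculation described in the newer revision can be illustrated with a small sketch. This is not code from the page or from any particular spam filter. The class name `NaiveBayesFilter`, the whitespace tokenizer, the add-one (Laplace) smoothing, and the toy training messages are all assumptions made for illustration. Each word's probability is estimated per category from previously seen documents, the log-probabilities are summed (which treats the words as independent, the "naive" part), and the category with the highest total wins.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a document into lowercase words (a deliberately crude tokenizer)."""
    return text.lower().split()


class NaiveBayesFilter:
    """Hypothetical minimal naive Bayesian classifier for any number of categories."""

    def __init__(self):
        self.word_counts = {}        # category -> Counter of word occurrences
        self.doc_counts = Counter()  # category -> number of training documents
        self.vocabulary = set()      # every word seen during training

    def train(self, text, category):
        """Record the words of one already-categorized document."""
        words = tokenize(text)
        self.word_counts.setdefault(category, Counter()).update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def log_scores(self, text):
        """Return a log-probability score per category for a new document."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        scores = {}
        for category, counts in self.word_counts.items():
            # Start from the prior: how common this category was in training.
            score = math.log(self.doc_counts[category] / total_docs)
            total_words = sum(counts.values())
            for word in words:
                # Add-one smoothing (an assumption of this sketch) keeps an
                # unseen word from forcing the probability to zero.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            scores[category] = score
        return scores

    def classify(self, text):
        """Pick the category with the highest total score."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy corpus, invented purely for this example.
spam_filter = NaiveBayesFilter()
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("buy cheap watches now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday with the team", "ham")

print(spam_filter.classify("buy cheap pills"))
print(spam_filter.classify("monday team meeting"))
```

Summing logarithms rather than multiplying raw probabilities avoids numeric underflow on long documents. Nothing in the sketch is specific to two categories: calling `train` with a third label is enough to sort messages into three.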