Why are spam detection algorithms so terrible?

It looks like most of them still rely on Naive Bayes applied to individual keywords to flag messages. They fail to catch 90% of the spam, yet have a terrible false-positive rate, as high as 5%.
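For readers who have not seen it, here is a minimal sketch of what "Naive Bayes applied to individual keywords" could look like. It is an illustration only, not the code of any particular spam filter: the function names `train` and `spam_log_odds`, the whitespace tokenization, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

def train(messages):
    """messages: iterable of (text, is_spam) pairs.
    Returns per-class keyword counts and per-class message counts."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, is_spam in messages:
        class_counts[is_spam] += 1
        word_counts[is_spam].update(text.lower().split())
    return word_counts, class_counts

def spam_log_odds(text, word_counts, class_counts):
    """Log-odds that `text` is spam; a positive value leans spam.
    Assumes the training set contained at least one spam and one
    non-spam message (otherwise the prior below is undefined)."""
    vocab = set(word_counts[True]) | set(word_counts[False])
    total_spam = sum(word_counts[True].values())
    total_ham = sum(word_counts[False].values())
    # Prior odds from the class frequencies.
    log_odds = math.log(class_counts[True] / class_counts[False])
    for word in text.lower().split():
        if word not in vocab:
            continue  # keywords never seen in training carry no evidence
        # Add-one (Laplace) smoothing avoids zero probabilities.
        p_spam = (word_counts[True][word] + 1) / (total_spam + len(vocab))
        p_ham = (word_counts[False][word] + 1) / (total_ham + len(vocab))
        log_odds += math.log(p_spam / p_ham)
    return log_odds
```

Each keyword contributes independently to the score, which is exactly the "naive" part: word order, headers, and sender information play no role.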

Are there any companies working on customized (e.g. per email account) solutions? Are there any spam detectors that

  • use lists of blacklisted IP addresses (e.g. known botnet hosts) for filtering, as well as whitelists,
  • use lists of scammy URLs (embedded in an email message), as well as whitelists,
  • use metrics other than individual keywords or combinations of two keywords (e.g. positive / negative keywords) for spam detection, such as a return address that differs from the sender address, or a return address that looks spammy,
  • use algorithms that are much more modern than Naive Bayes, such as hidden decision trees?
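To make the first three items concrete, here is a rough sketch of how such non-keyword signals might be extracted from a message. Everything in it is hypothetical: the list contents (taken from reserved documentation ranges), the `extract_features` name, and the choice to read "return address" as the `Reply-To` header are assumptions for illustration. Only Python standard-library calls are used.

```python
import re
from email import message_from_string
from email.utils import parseaddr
from urllib.parse import urlparse

# Hypothetical lists; a real system would load these from maintained feeds.
BLACKLISTED_IPS = {"203.0.113.7"}
WHITELISTED_IPS = {"198.51.100.1"}
BLACKLISTED_DOMAINS = {"scam.example"}

URL_RE = re.compile(r"https?://[^\s<>\"']+")

def extract_features(raw_message, sender_ip):
    """Return a dict of boolean signals for one raw RFC 822 style message."""
    msg = message_from_string(raw_message)
    _, from_addr = parseaddr(msg.get("From", ""))
    # Assumption: "return address" is taken to mean the Reply-To header.
    _, reply_addr = parseaddr(msg.get("Reply-To", ""))
    body = msg.get_payload()
    if not isinstance(body, str):
        body = ""  # multipart messages are not handled in this sketch
    url_hosts = {urlparse(url).hostname for url in URL_RE.findall(body)}
    return {
        "ip_blacklisted": sender_ip in BLACKLISTED_IPS,
        "ip_whitelisted": sender_ip in WHITELISTED_IPS,
        "url_blacklisted": any(h in BLACKLISTED_DOMAINS for h in url_hosts),
        "reply_to_mismatch": bool(reply_addr)
        and reply_addr.lower() != from_addr.lower(),
    }
```

The resulting flags could be fed to whatever classifier is in use, alongside or instead of keyword counts; how to weight them is left open here.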

Replies to This Discussion

Actually, I think they are.

Ok, let's start from the beginning. When spam first became unbearable, approaches like Naïve Bayes labeling were very good. Then of course the evil side could not stand still and moved to more elaborate strategies like botnets (which get around ACLs), so spam detection became a moving target, a cat-and-mouse game.

The one thing the bad guys can't get around is that their content is unsolicited, which reduces the entropy of the problem quite a bit and makes it something that can be tackled. At least the guys and gals at the company that starts with a G and provides email services are doing a very good job at it, AFAIAC.

Of course, the jury is still out on what constitutes actual spam for a specific individual. My guess is that the people at big G have me in some group of users with a similar sense of what counts as spam, so when one of us tags an email as bad, the rest of us benefit quickly.
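The group idea in the last paragraph can be sketched in a few lines. To be clear, this is only a toy illustration of the guess above, not a description of how any email provider actually works; the class name `GroupSpamFeedback`, the exact-match fingerprint, and the `min_reports` threshold are all made up for the example.

```python
import hashlib
from collections import defaultdict

class GroupSpamFeedback:
    """Share spam reports among users assigned to the same group."""

    def __init__(self, user_groups, min_reports=1):
        self.user_groups = dict(user_groups)  # user -> group id (assumed given)
        self.min_reports = min_reports
        self.reports = defaultdict(set)  # (group, fingerprint) -> reporting users

    @staticmethod
    def fingerprint(body):
        # Crude normalization: lowercase and collapse whitespace, then hash.
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def report_spam(self, user, body):
        key = (self.user_groups[user], self.fingerprint(body))
        self.reports[key].add(user)

    def is_spam_for(self, user, body):
        # A message counts as spam for a user once enough members of
        # that user's own group have reported the same fingerprint.
        key = (self.user_groups[user], self.fingerprint(body))
        return len(self.reports.get(key, ())) >= self.min_reports
```

With this setup, one report from a group member flags the same message for everyone else in that group, while users in other groups are unaffected. How users would be assigned to groups in the first place is the hard part and is not addressed here.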

© 2013   Data Science Central
