 Paul Bleicher
Very few people can say anything good about spam (other than spammers, that is). I am a passionate spam-hater, who goes to
great lengths to avoid any exposure to the frequently offensive emails known as spam (named, of course, for a classic Monty
Python comedy routine involving Vikings). With that introduction, let me tell you something good about spam. Understanding
spam, and the ways we deal with it, can help us understand how to identify and manage problematic adverse events in large,
untidy databases of spontaneously reported data. If that seems a stretch, let me explain.
What is spam?
Spam is unsolicited commercial email (UCE), sent to thousands or millions of email addresses simultaneously, that typically
advertises a commercial service or product or promotes a political viewpoint. It is a huge problem in the world of business,
for many reasons. The sheer cost of scanning through and deleting spam messages is enormous, often estimated in the hundreds
of thousands of dollars annually for a large company. In addition to productivity issues, spam can create significant security
issues, allowing viruses, "malware," and other dangerous programs into a company. Finally, spam can create a hostile work
environment for employees, exposing companies to liability.
UCE is estimated to account for 40% to 80% of all email received in the United States. It is a much bigger problem than
regular "junk" mail because sending email is free. A commonsense solution would be to charge for each email sent.
Unfortunately, most spam is sent through relays and through the illegal hijacking of innocent computers; charging for these
emails wouldn't affect the spammers or their pocketbooks.
Whether an email is spam or "not spam" is sometimes in the eye of the beholder, which certainly complicates company and/or
group antispam strategies. Almost everybody would consider advertisements for enlargement of body parts to be spam, while
reasonable people might differ about whether an advertisement for a low-rate mortgage is spam. Most noxious spam emails
cannot be traced to a legitimate Web site or company, and attempting to "unsubscribe" will likely increase the number of
emails the recipient gets.
Shortly after the first spam emails began arriving in the early 1990s, programmers began developing "anti-spam" strategies,
email filters, and programs. In a constant game of cat and mouse, each new strategy to identify and eliminate spam was met
by more sophisticated methods of escaping detection. The earliest strategies used blacklists, initially personal
ones, but later shared blacklists that could collect reports of spam and allow the blocking of emails from particular email
addresses or domains. While these strategies do block some spam, spammers typically change their email addresses and even
domains regularly, and/or use domains that can't be blocked because so much legitimate mail originates from them (e.g.,
http://hotmail.com/ or http://yahoo.com/). Later methods involved the creation of a "fingerprint" of the spam that could be
stored on a centralized server. These "distributed checksum clearinghouses" (DCCs) compare all incoming emails to known spam
and reject those that match. Unfortunately, the success of the DCC strategy depends on ongoing reporting of spam and on
static spam content. Neither condition is reliable in the real world.
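To make the two early strategies concrete, here is a minimal Python sketch of a blacklist check and a fingerprint lookup. Everything in it is an assumption made for illustration: the example addresses, the names is_blacklisted, fingerprint, and Clearinghouse, and the choice of a SHA-256 checksum over lightly normalized text. Real clearinghouses are more elaborate; this only shows why both approaches are brittle.

```python
import hashlib

# Hypothetical blacklist entries; real shared lists are far larger.
BLACKLISTED_SENDERS = {"deals@bulk-offers.example"}
BLACKLISTED_DOMAINS = {"bulk-offers.example"}


def is_blacklisted(sender):
    """Return True if the sender address or its domain is on the blacklist."""
    sender = sender.strip().lower()
    domain = sender.rpartition("@")[2]
    return sender in BLACKLISTED_SENDERS or domain in BLACKLISTED_DOMAINS


def fingerprint(body):
    """Checksum of the message body after collapsing case and whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class Clearinghouse:
    """Stand-in for a central server holding fingerprints of reported spam."""

    def __init__(self):
        self._known = set()

    def report(self, body):
        # Someone has to report the message before anyone else is protected.
        self._known.add(fingerprint(body))

    def is_known_spam(self, body):
        return fingerprint(body) in self._known


reported = "Buy cheap pills now! Limited offer."
dcc = Clearinghouse()
dcc.report(reported)

# A copy that differs only in case and spacing still matches.
exact_copy_caught = dcc.is_known_spam("buy  cheap pills NOW! limited offer.")
# Appending a few random characters changes the checksum entirely.
altered_copy_caught = dcc.is_known_spam(reported + " ref 8841")

# A blacklisted domain is blocked, but a freshly registered one is not.
old_domain_blocked = is_blacklisted("Deals@Bulk-Offers.example")
new_domain_blocked = is_blacklisted("deals@new-offers.example")
```

In this sketch the exact copy and the old domain are caught, while the altered copy and the new domain slip through, which is the weakness described above: the spammer only has to change the sender or a few characters of the content.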
As spammers became more sophisticated, spam-blocking programs developed scoring systems that could eliminate spam email based
upon the words and formatting used in the messages. Unfortunately, the rules were available to the spam generators as well,
and spam strategies arose to circumvent these algorithms. For example, screening for the word "Viagra" led spammers to begin
using V1agra, V!@gra, and many other variations. If a rule looked for a predominance of certain words or phrases, spammers
added "nonsense" phrases or text to dilute the actual content. Algorithms that incorporate many different rules are somewhat
arbitrary in their weighting criteria and are prone to both overreporting (false positives) and underreporting (false
negatives) of spam.
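A rule-based scoring filter of the kind described above might look roughly like the following Python sketch. The rules, weights, threshold, and word list are all invented for illustration (which is rather the point: the weights are arbitrary), and the names spam_score, is_spam, and suspect_density are hypothetical.

```python
import re

# Hypothetical rules as (pattern, weight) pairs. The weights are arbitrary.
RULES = [
    (re.compile(r"\bviagra\b", re.IGNORECASE), 3.0),
    (re.compile(r"\bfree\b", re.IGNORECASE), 1.0),
    (re.compile(r"\bact now\b", re.IGNORECASE), 1.5),
    (re.compile(r"!{2,}"), 1.0),
]
THRESHOLD = 3.0

# Hypothetical word list for a "predominance of certain words" rule.
SUSPECT_WORDS = {"viagra", "free", "winner"}


def spam_score(message):
    """Sum the weights of every rule that matches the message at least once."""
    return sum(weight for pattern, weight in RULES if pattern.search(message))


def is_spam(message):
    return spam_score(message) >= THRESHOLD


def suspect_density(message):
    """Fraction of the words in the message that are on the suspect list."""
    words = re.findall(r"[a-z0-9']+", message.lower())
    if not words:
        return 0.0
    return sum(word in SUSPECT_WORDS for word in words) / len(words)


# Caught: the plain spelling triggers the heaviest rule.
plain = is_spam("Cheap Viagra, act now!!")
# False negative: one swapped character defeats the word rule.
obfuscated = is_spam("Cheap V1agra for you")
# False positive: a legitimate notice trips several light rules at once.
legitimate = is_spam("The conference is free, so act now and register!!")

# Dilution: the same three suspect words, padded with filler text.
dense = suspect_density("free viagra winner")
diluted = suspect_density(
    "free viagra winner the quick brown fox jumps over lazily"
)
```

Here the plain message is flagged, the "V1agra" variant is missed, and the legitimate conference notice is flagged by mistake. Padding the message with seven filler words drops the suspect-word density from 1.0 to 0.3, so any density cutoff between those values would be evaded.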