
General ASSP Questions

ASSP overview questions and answers are here.

2003-Nov-14 1:48pm jhanna

Security Considerations

 As a proxy, ASSP passes through most of your host mail transport’s security features and vulnerabilities. 
 It also represents a running service accepting connections from the Internet public. 
 Perl in general has a good track record of offering few vulnerabilities. 
 
 As a proxy, ASSP’s only input/output is socket based, so that limits its exposure. 
 ASSP never opens files with user-inputted names and never shells to the operating system.
 
 In a *nix environment you will want to use ASSP’s ability to run as a non-root user. 
 You may also consider running it in a chroot jail. To do this, set the ChangeRoot variable 
 in the configuration to your ASSP directory and copy (or link) the /etc/protocols 
 file to an etc/protocols file in the ASSP directory.
 
 The collections of spam and non-spam email may represent a security risk, and access should 
 be restricted to mail administrators. The non-spam email collection will certainly contain 
 sensitive correspondence, and steps should be taken to protect it from those who don’t require access.
 
 Your administration password is transmitted with basic authentication (i.e. no encryption). 
 If you plan to use the web interface from a host where you feel sniffing is a possibility 
 I’d recommend installing stunnel (www.stunnel.org) to create an encrypted tunnel for your 
 web-admin sessions. The password is stored in plain text in the assp.cfg file -- make sure 
 file permissions protect this file from read access by unauthorized users. 
 You can also add IP addresses to the Allow Admin Connections From configuration entry 
 to restrict access to the admin interface, although this type of packet is quite easy to spoof.
 
 2003-Sep-04 12:36pm jhanna
 

Theory of Operation

 ASSP uses three complementary strategies to allow good mail and block unsolicited 
 email: a whitelist, spambuckets, and a Bayesian filter.
 
 Every time a message passes through your SMTP server it has a from address and one or more to addresses. 
 Your SMTP server also knows if the message is being sent from your local network (and to allow relaying 
 for that message), or if it’s coming from outside (and must be delivered to a local address). 
 Your local users don’t send unsolicited email (right?) and the people they correspond with would 
 only send you solicited email. In fact the people they email would also be unlikely to send UCE. 
 By monitoring these addresses ASSP builds a web of trust – local users are trusted, the addresses 
 in their TO or CC fields are trusted, as are the addresses in the TO and CC fields of mail those people send. 
 Any email from these people is considered not-spam without further checking. (Note this is not a good 
 strategy for virus containment, but it is a good strategy for UCE.)
 
 Users of the local mail domains are not added to the whitelist. They are identified by being a 
 part of the local network. Many spammers forge a from address with the same domain as the to 
 address, so it is important to avoid adding local addresses to the whitelist.
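
 The whitelist rules above can be sketched roughly as follows. This is an illustrative Python 
 sketch, not ASSP's own Perl code; the names update_whitelist, is_local_address and LOCAL_DOMAINS, 
 and the example domain, are assumptions made for the example.

```python
# Illustrative sketch only; all names here are hypothetical, not ASSP's own.
LOCAL_DOMAINS = {"example.com"}  # assumed local mail domain


def is_local_address(addr):
    """True if the address belongs to one of the local mail domains."""
    return addr.rsplit("@", 1)[-1].lower() in LOCAL_DOMAINS


def update_whitelist(whitelist, sender, recipients, from_local_network):
    """Add the TO/CC addresses of a trusted message to the whitelist.

    A message is trusted if it was relayed from the local network or if
    its sender is already whitelisted. Local addresses are never added,
    because spammers often forge a local from address.
    """
    trusted = from_local_network or sender.lower() in whitelist
    if trusted:
        for rcpt in recipients:
            if not is_local_address(rcpt):
                whitelist.add(rcpt.lower())
    return trusted
```

 Because local addresses never enter the set, a forged local from address arriving from outside 
 gains nothing from the whitelist check in this sketch.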
 
 With only a few days of operation you should see your whitelist grow to more than 1000 addresses. 
 The whitelist is not only helpful in identifying non-spam, but in building your database of 
 non-spam emails. The whitelist is automatically saved every $UpdateWhitelist seconds (1 hour by default).
 
 Spambuckets are addresses which receive only spam. They can be integrated on your web site, posted on 
 Usenet, or come naturally by having employees leave your site; after a reasonable period of time 
 bouncing their mail all mail received for these addresses can be considered unsolicited. 
 Any email whose sender is not whitelisted and is addressed to a spambucket is classified as spam. 
 Spambuckets are helpful both in identifying spam, and in building and maintaining your spam database.
 
 Finally, if an email arrives that is not from someone on your local network, nor from someone on the 
 whitelist, nor addressed to a spambucket, it is compared to the statistical profile generated by 
 the Bayesian filter. The Bayesian filter works by looking for words and phrases (up to three words long) 
 that occur significantly more often in either your non-spam collection, or your spam collection. 
 For most organizations spam identifiers include things like “get rich quick” while non-spam identifiers 
 are things like your organization’s full name or address, or personal names of people who work there. 
 They also include considerably more subtle references like HTML tags which spammers prefer, or jargon 
 specific to your line of business.
 
 To classify a new email all the words and phrases in the first 10000 bytes of the email (including the header) 
 are checked against the statistical model. The top 50 ranking words and phrases are combined according to 
 Bayes' theorem to predict how well the mail compares to spam / non-spam in your collections.
 
 I have made the working assumption that only the first 10000 bytes of an email are significant for 
 identifying spam. Spammers may change their profile, but historically spam has been relatively small, 
 and keeping many large files in your collection is a waste of disk space and processing time.
 
 After an email is classified as local or whitelisted, or as Bayesian spam or spam to a spambucket, 
 its first 10000 bytes are saved in the appropriate collection directory. It is given a random 
 number between 0 and MaxFiles (12000 by default) and written to that file name. In this way older 
 files will gradually (randomly) be replaced with newer files, thus keeping the collections both 
 diverse and up-to-date. Files in the errors folders (correctedspam and correctednotspam) 
 are never overwritten.
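
 The random replacement scheme described above can be sketched like this. It is a Python 
 illustration under stated assumptions, not ASSP's code: the function names are hypothetical, and 
 whether the upper bound MaxFiles itself is a valid file name is not stated in the text (the 
 sketch excludes it).

```python
import os
import random

MAX_FILES = 12000  # the "MaxFiles" default quoted in the text above


def collection_path(collection_dir, max_files=MAX_FILES):
    """Pick a random numbered file name inside a collection directory."""
    return os.path.join(collection_dir, str(random.randrange(max_files)))


def save_sample(collection_dir, message, max_files=MAX_FILES, limit=10000):
    """Write the first `limit` bytes of a message to a random slot.

    If the slot is already occupied, the older sample is overwritten,
    which keeps the collection bounded in size and gradually refreshed.
    """
    path = collection_path(collection_dir, max_files)
    with open(path, "wb") as handle:
        handle.write(message[:limit])
    return path
```

 With a fixed number of slots the directory never grows past MaxFiles files, and any given old 
 sample is equally likely to be replaced by each new one.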
 
 What follows is a sample statistical analysis of mail we received:
 
  As of Thu Mar 27 10:48:54 2003 the mail logfile shows:
  78843 messages, 47637 were spam (60.4%) in 73 days
   for 1080.0 messages per day or 652.6 spams per day
  8303 additions to / verifications of the whitelist (113.7 per day)
  28273 were judged spam by the bayesian filter (59.4% of spam)
  18862 were to spam addresses (39.6% of spam)
  502 were rejected for executable attachments (1% of spam)
  12608 were sent from local clients (40.4% of nonspam)
  7838 were from whitelisted addresses (25.1% of nonspam)
  10760 were ok after a bayesian check (34.5% of nonspam)
  14467 addresses are on the whitelist
  15108 hits on the blacklist
  14890 resulted in spam (52.7% of Bayesian spam, 98.6% of blacklist hits)
  218 resulted in non-spam (1.443% of blacklist hits)
  
 
 
 2003-Sep-04 12:37pm jhanna
 
 

ASSP uses a content filter - won’t spammers disguise their content?

 ASSP uses a sophisticated parsing filter to work around most spammer tricks to disguise their content. 
 As content-based filters like ASSP become more common spammers may find ways to better disguise their message. 
 I personally do not believe spammers will win that battle, but it’s hard to say for sure.
 2003-Sep-04 12:42pm jhanna
 
 

Everyone we email is added to the whitelist, won’t spammers just use addresses from the whitelist to spam us?

 It is possible, but more difficult than it sounds. Addresses from your local site aren’t added to the 
 whitelist, so a spammer will have to find someone your site emails. That list will be different for 
 every site using ASSP. A better strategy would be for the spammer to trick you into emailing him/her. 
 But that too will only work for one site at a time. Ultimately it is possible for the spammer to use 
 this strategy to spam your site, but she/he will have to do the same thing individually for every 
 site running ASSP. If this becomes a problem we will develop an appropriate defense.
 
 2003-Sep-04 12:42pm jhanna
 
 

Will ASSP block messages I want to receive?

 ASSP has been designed with great care to prevent this from happening. The whitelist is the single most 
 powerful tool to prevent this – anyone you email will never have a message blocked. The spam filter 
 keeps track of mail we send and spam we receive -- if an incoming message is not from someone we've 
 emailed and it's more like the mail we send than the spam we receive then it gets through. 
 Otherwise it's blocked and the sender gets the message, "Mail appears to be 
 unsolicited -- report errors to postmaster@ourhost.com."
 
 The type of email that most often falls in this category is confirmation emails from web sites. 
 Often these mails are only as personal as your email address and contain a lot of advertising - 
 they look a lot more like spam than they look like the mail you send. If someone has a good 
 idea how to recognize this type of email please let me know.
 
  
 
 Now that ASSP supports the "Expression to recognize non-spam" you can use that to help recognize 
 these confirmation emails. Often they'll include your address, phone number, or other personal 
 information that spam never includes. You can build a "regular expression" to recognize some of these.
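
 As a rough illustration of such an expression, the sketch below tests a pattern against two 
 sample texts. It is written in Python rather than ASSP's Perl, and the phone number and street 
 address are made-up placeholders; substitute details that actually appear in your own 
 confirmation mails.

```python
import re

# Hypothetical site details -- replace with your organization's own.
NON_SPAM_RE = re.compile(r"555-0100|1234 Example Street", re.IGNORECASE)


def looks_like_confirmation(text):
    """True if the text contains one of the personal details above."""
    return NON_SPAM_RE.search(text) is not None
```

 A simple alternation of literal strings like this one reads the same way in Perl's regular 
 expression syntax, so the pattern itself could be pasted into the configuration field.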
 
 2003-Sep-04 12:43pm jhanna
 
 

One man’s spam is another man’s ham - how does ASSP decide what to block?

 See the answer to the previous question. But this raises one theoretical limit for ASSP; ASSP is designed 
 to work for an entire site. This assumes that the users at your site have a fundamental agreement 
 on what is spam. For most small companies the difference between what they send and spam they receive 
 is clear enough that there isn’t a conflict here. However with a large and diverse company this 
 assumption begins to break down. In that case ASSP is probably not the best solution.
 
 2003-Sep-04 12:44pm jhanna
 
 

Will ASSP work with non-English languages?

 At this point ASSP looks for words built from A-Z and the symbols from \240-\377 and separated by spaces. 
 (It’s a little more complicated than that, but that’s basically it.) If your language is mostly 
 that way then ASSP will work fine – Spanish, French, German, Polish, etc, primarily use the Latin 
 alphabet and should work fine. Korean, Japanese, and Chinese don’t work well. Future plans may 
 include improvements to make them more functional.
 
 As of ASSP 0.3.4 we have active users working in Spanish, French, and German without problems.
 
 2003-Sep-04 12:44pm jhanna
 
 

I want to mess with the mail collections. What format are they in?

 One message per file. Only the first 10k bytes are significant. Keep attachments attached - ASSP parses 
 them up to the first 10k. Separate collections are kept in separate folders. Largely whitespace and 
 headers (except the subject) are ignored. Edit, delete, or add files and rebuild the database - that’s 
 about all there is to it. Files that have numbers as filenames will randomly be overwritten over time 
 keeping the collection up-to-date and limited in size.
 
 As of version 0.3.4 ASSP also began to track helo phrases passed in the SMTP conversation -- see the 
 format of the ASSP received header line to see how this should be formatted.
 
 2003-Sep-04 12:45pm jhanna
 
 

I’ve heard content filtering is CPU intensive. Is ASSP a CPU hog?

 ASSP's CPU and memory load are quite moderate. Excluding rebuilding the databases, ASSP uses fewer 
 CPU cycles per message than our mail transport does and significantly fewer per message than our 
 virus filter software.
 
 2003-Sep-04 12:46pm jhanna
 
 

I want to add per-user settings. How hard is that?

 Beyond the Spam Lovers and Redlist, per-user settings are beyond the scope of ASSP’s design goals. 
 They’re generally pretty hard to implement in the SMTP Proxy environment.
 
 2003-Sep-04 12:50pm jhanna
 
 

Is it required to take down (stop) assp to do rebuildspamdb & dnsbl?

 No. The rebuildspamdb and dnsbl scripts can run without stopping ASSP for all versions. 
 In versions prior to 0.2.0 ASSP had to be stopped to use the list.pl script, or to reload 
 the config.pl script. With 0.2.0 and later, a kill -HUP will reload the assp.cfg.
 
 2003-Sep-04 12:57pm jhanna
 
 

How does ASSP compare to SpamAssassin?

 1. Is SpamAssassin integrated in ASSP?
 
 	No.
 
 2. If not, why?
 
 	I used spamassassin (www.spamassassin.org) for some time prior to developing ASSP. I found SA difficult to install. 
 	It also had to be regularly upgraded. Finally, ASSP's Bayesian filter was more effective at stopping spam than SA. 
 	I understand that since then SA has developed a Bayesian component as well, but I'm not completely 
 	up-to-date on their development.
 
 3. What are the pros of SpamAssassin compared to ASSP
 
 	SA has a great investment in hand-made regular expressions and header analysis to recognize spam.
 
 4. What are the cons of SpamAssassin compared to ASSP
 
 	These same hand-crafted expressions are brittle as spammers adjust their strategies. 
 	ASSP relies on the flexibility (and customization) from your own site's Bayesian database. 
 	Furthermore, ASSP is a complete spam blocking solution, not just a filter that must be 
 	integrated to your mail transport.
 
 	I credit SA with some of the impetus for getting ASSP going -- it is a great tool with a 
 	lot of features. In fact SA's smtp proxy was part of the inspiration for ASSP. And I would 
 	cheer them on -- every effective anti-spam tool reduces spammers' success and makes spam less profitable.
 
 	However, my goal was to have a system that was easy to install and worked unmodified with nearly 
 	every MTA on any OS, and I believe ASSP is achieving those goals. Yes, a competent Linux system 
 	administrator can probably achieve similar results with SA, but ASSP broadens that opportunity 100 fold.
 
 I trust you will find the best tool for your situation.
 
 2003-Sep-04 1:03pm jhanna
 
 

What is the difference between the redlist, no-processing, and spamlover lists?

 Here's a matrix to help identify the differences:
 
                  | contributes to whitelist | doesn't contribute to whitelist
 -----------------+--------------------------+---------------------------------
 filtered mail    | normal                   | redlist (does contribute to spam/nonspam collections)
 unfiltered mail  | spamlover                | no processing (also doesn't contribute to spam/nonspam collections)
 
 2003-Sep-04 1:05pm jhanna
 
 

What is "cache reset" in the log file?

 You can probably ignore it.
 
 If one of your caches is resetting more often than every 7 minutes, then find the line where it says, 
 "if($this->{cnt}++ >5000" and change the 5000 to 20000. This will make ASSP use more RAM but 
 give you better performance.
 
 Note that after one of the databases has been updated (whitelist, redlist, spamdb, or dnsbl), 
 within an average of 255 hits on that database you'll get a "cache reset" because ASSP noticed that 
 the file modification timestamp changed. However new data can be read from the file from the 
 moment it's updated -- it's only cached data that won't be re-read.
 
 As of version 1.0.0 the cache size is in the configuration options.
 
 2003-Sep-04 1:06pm jhanna
 
 

What is "helo rndhelo" on the analysis page?

 When a mail client connects to a mail server to send mail it must send an SMTP command, "HELO" 
 (or the variant EHLO) followed by what it calls itself. Almost every server uses its host name 
 in this greeting: m11.lax.untd.com for example. However spammers often greet with a random string 
 of letters: slk845gjlkas perhaps. ASSP tries to recognize these greetings because they're 
 an excellent indicator of spaminess.
 
 Unfortunately, a bug in versions prior to 0.3.5 meant that all messages without a header are 
 interpreted as randomhelo greetings (or rndhelo).
 
 2003-Sep-04 1:07pm jhanna
 
 

I've seen discussion of configuration settings that aren't on my config page. What do I do?

 First, check the "Show Advanced Configuration Options" checkbox and submit the form. This will show all 
 available configuration options.
 
 Second, the wording may have changed, or an abbreviation may have been used -- look for another setting 
 with a similar use. For example, WhiteRE is actually "Expression to identify Non-Spam."
 
 
 2003-Sep-04 1:11pm jhanna
 
 

How really does ASSP detect spam?

 When you install ASSP a colony of super-intelligent thermophilus bacteria takes up residence on your CPU and 
 begins reading all your email. They communicate using radio waves directly with the CPU and interface with the 
 ASSP software choosing between spam and nonspam mail. If you choose to read further this myth will be sadly 
 dispelled, and I take no responsibility for the consequences. However, you can always refer your clients to 
 this page to prove to them that their email is actually being filtered by super-intelligent bacteria.
 
 The rebuildspamdb program is where I will start. It reads the files in your errors/spam, errors/notspam, spam 
 and notspam directories. As it reads the files in the errors directory it also builds a hash of the mail body 
 to be able to identify duplicate messages misfiled. This hash is used to delete messages from the notspam 
 collection that were also in the errors/spam collection and from the spam collection that were also in the 
 errors/notspam collection. Think of it like scrubbing bubbles - they do the work so you don’t have toooo!
 
 As rebuildspamdb reads the files it also does two things. First it runs a filter (the subroutine “clean”) that 
 prepares the message for statistical analysis. Second it walks through the file tallying word pairs in the spam 
 or not-spam categories according to the collection. Files in the errors/spam collection count double; files in 
 the errors/notspam collection count x4.
 
 The "clean" subroutine does a number of important operations. Primarily its function is to undo the things 
 spammers do to trick filters. It cleans up base64 encoding. It cleans up many HTML obfuscation techniques. 
 Look at the code of the "sub clean" for more details - it’s all commented. It also does two other things (and 
 may do more in the future) to help the Bayesian analysis. First, it inserts a keyword after each word of the 
 subject - this lets the Bayesian filter recognize words in the subject uniquely. For example the word "free" 
 in the subject will have a different Bayesian rating than the word “free” in the body of the message. Second 
 it does a couple of tricks to isolate the “HELO” greeting that was sent when the message was delivered. 
 This has also proven to be a useful Bayesian factor in identifying spam.
 
 Paul Graham’s "A Plan for Spam" recommends complete header analysis within the Bayesian filter. 
 Because ASSP initially used three-keyword identifiers, and now (as of 0.3.4) two-keyword 
 identifiers, I found this useless. However, header analysis will be a fruitful area of 
 development for improving ASSP’s spam / ham recognition rate in the future. That will take 
 place in the “clean” subroutine. There may be other pre-processing features that will be 
 introduced there in the future.
 
 Once each mail message is pre-processed (cleaned) each word pair is tallied (words being defined 
 as [-\$A-Za-z0-9\'\.!\240-\377]+ – shorter than 2 or longer than 19 are ignored and are further 
 cleaned in this way: s/[,.']+$//; s/!!!+/!!/g; s/--+/-/g;) [Sorry for the technical stuff for 
 those allergic to it.] So that in the end you end up with a big database of word pairs and their 
 counts: “in the”: spam=23210, total=46411; “order now”: spam=20001, total=20121. The rebuildspamdb 
 program then steps through this database discarding identifiers with total less than 5 (i.e. if a word 
 pair occurred 4 or fewer times in all the collections combined, with errors/spam counted x2 and errors/notspam 
 counted x4, then the pair can be ignored) and calculating the spaminess ratio this way:
 
 If the spam count = 0 or the spam count = the total count then square both counts. (This amplifies factors 
 which appear only in the spam or not-spam collection.)
 
 Spaminess = (spam count + 1) / (total count + 2) (This should look familiar to anyone with a basic 
 understanding of Bayesian filters. It also somewhat de-emphasizes rare identifiers and emphasizes common ones.)
 
 Throw out the identifier if it’s between 0.41 and 0.59 - this identifier appears almost equally in both 
 spam and non-spam, so there’s no point in keeping it.
 
 Force the result between 0.999999 and 0.000001 - Bayesian classifiers croak if the value is too close to 0 or 1.
 
 All of these results are sorted (by identifier) and stored in the spamdb for use by ASSP.
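
 The steps above can be sketched as a single function. This is a Python illustration of the 
 description, not the rebuildspamdb source; the function name is hypothetical, and whether the 
 0.41 and 0.59 boundaries themselves are discarded is not stated, so the sketch assumes they are kept.

```python
def spaminess(spam_count, total_count):
    """Return the spamdb value for one identifier, or None if it is discarded."""
    if total_count < 5:
        return None  # seen 4 or fewer times: too rare to be meaningful
    if spam_count == 0 or spam_count == total_count:
        # Seen in only one kind of collection: square both counts
        # to amplify the identifier.
        spam_count, total_count = spam_count ** 2, total_count ** 2
    value = (spam_count + 1) / (total_count + 2)
    if 0.41 < value < 0.59:
        return None  # nearly neutral: not worth keeping
    # Keep the value away from exactly 0 or 1.
    return min(0.999999, max(0.000001, value))
```

 Using the two examples from the text, "in the" (spam=23210, total=46411) works out to roughly 
 0.5001 and is thrown away, while "order now" (spam=20001, total=20121) works out to roughly 
 0.994 and is kept.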
 
 Rebuildspamdb also randomly (1 time in 20) prunes outdated entries in the whitelist and goodhosts databases.
 
 Now you know how the spamdb is built, so let’s see how it is used.
 
 Suppose a mailer on the Internet connects to ASSP. ASSP makes a connection to your "SMTP Destination" and 
 begins relaying their conversation. It notes the IP address of the connecting server. It notes their HELO string. 
 It notes their MAIL FROM (envelope sender). It notes their RCPT TOs. It notes their DATA directive. 
 (This is all in sub "getline".) Relay attempts are blocked. The presence of spam bucket addresses is noted. 
 Mail to the email interface is detected. Mail to no-processing or "spam lover" addresses is noted. 
 Assuming none of that qualifies the message is passed on to "getheader."
 
 Getheader is looking for the mail header. When the header is complete getheader calls "onwhitelist" which determines 
 if the message should be treated as whitelisted/local (it’s the same really) and if so to update the whitelist. 
 If not, processing goes on to "getbody."
 
 Getbody reads the rest of the message (or the first 10000 bytes including the header, whichever comes first), checks 
 for attached executables (if that’s enabled) and calls "isspam" which is probably why you’re reading this document.
 
 The isspam subroutine first checks WhiteRE and BlackRE, the expressions to identify non-spam and spam, respectively. 
 Then it calls "clean" to clean up any spammer obfuscation, and checks them again with the "cleaned" version. 
 Then it checks for a DNSBL hit, which adds 0.97 twice to the list of Bayesian factors for this message. 
 Then it checks for a goodhost miss, which adds whatever your site’s goodhost factor is twice, provided it is > 0.65. 
 Then it walks through the message’s word pairs, just like rebuildspamdb did, completing the list of Bayesian factors. 
 Unlike rebuildspamdb, an identifier hit will only be counted a maximum of two times, so if the identifier "free money" 
 rates 0.955 and "free money" occurs three or more times in the mail message, only the first two count.
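
 The word-pair walk and the two-hit cap can be sketched as below. This is a Python illustration 
 built from the character class and clean-up substitutions quoted earlier, not ASSP's Perl code. 
 The function names are hypothetical, and two details are assumptions: the length limit is applied 
 after the clean-up, and a pair is formed from neighbouring words that survive it.

```python
import re
from collections import Counter

# Character class from the text: [-\$A-Za-z0-9\'\.!\240-\377]+
WORD_RE = re.compile(r"[-$A-Za-z0-9'.!\xa0-\xff]+")


def words(text):
    """Yield cleaned words, applying the substitutions quoted in the text."""
    for raw in WORD_RE.findall(text):
        word = re.sub(r"[,.']+$", "", raw)  # s/[,.']+$//
        word = re.sub(r"!!!+", "!!", word)  # s/!!!+/!!/g
        word = re.sub(r"--+", "-", word)    # s/--+/-/g
        if 2 <= len(word) <= 19:            # shorter or longer words are ignored
            yield word


def pair_hits(text, max_hits=2):
    """Count adjacent word pairs, counting each pair at most `max_hits` times."""
    tokens = list(words(text))
    counts = Counter()
    for first, second in zip(tokens, tokens[1:]):
        pair = f"{first} {second}"
        if counts[pair] < max_hits:
            counts[pair] += 1
    return counts
```

 For a message in which "free money" appears three times, pair_hits reports it twice, so a single 
 repeated phrase cannot dominate the list of factors.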
 
 The list of factors is sorted and the thirty factors closest to 0 or 1 (i.e. the 30 furthest from 0.5 or neutral) are 
 combined as Bayes taught into a single probability. If this probability is greater than 0.6 the message is spam. 
 (Mail is very rarely between 0.2 and 0.8 - it’s almost always > 0.9 or < 0.1.)
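
 A minimal sketch of this last step follows. It assumes the combination is the usual naive Bayes 
 product popularized by Paul Graham's "A Plan for Spam", P = prod(p) / (prod(p) + prod(1 - p)); 
 ASSP's exact arithmetic may differ. The function name and its parameters are hypothetical, and 
 the code is Python rather than ASSP's Perl.

```python
import math


def combine(factors, keep=30, threshold=0.6):
    """Combine the `keep` factors furthest from 0.5 into one probability.

    Returns (probability, is_spam). Factors are expected to lie strictly
    between 0 and 1, as the spamdb clamps them to 0.000001 .. 0.999999.
    """
    extreme = sorted(factors, key=lambda p: abs(p - 0.5), reverse=True)[:keep]
    spam = math.prod(extreme)
    ham = math.prod(1.0 - p for p in extreme)
    probability = spam / (spam + ham)
    return probability, probability > threshold
```

 Two DNSBL factors of 0.97 on their own already combine to about 0.999, while a message with no 
 factors at all stays at the neutral 0.5 and is not marked as spam.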
 
 Spam is logged in the spam directory and local and whitelisted mail is logged in the notspam directory. 
 Headers are updated as configured. If you’re not in test-mode the connection to your "SMTP Destination" is 
 dropped if it is spam, and when the client stops spewing the mail body, it gets the "spam error" message, and 
 its connection is dropped. (In test mode the connection is completed and ASSP sends updated headers.)
 
 
 2003-Sep-04 1:13pm jhanna
 
 

What is goodhosts and what does it do?

 Note: As of version 1.0.5 it is recommended that you use the greylist feature and deactivate both 
 goodhosts and the dnsbl.
 
 
 I noticed that we were getting a number of spams slip through the filter all with the same qualities: 
 they were short, they were deliberately misspelled on many words, and they linked to some website.
 
 I started doing some research on (a) why they got through, and (b) how to block them.
 
 It turned out that because of the shortness and misspellings many passed through without any hits in the 
 bayesian database, good or bad.
 
 One solution would be to assume that all mail is just a little spammy and then force the content to justify 
 itself before being allowed to pass. This would have the added effect of possibly raising the false positive 
 ratio, although I didn't research it to be sure.
 
 But further research revealed something more useful.
 
 Because ASSP keeps a whitelist, it is a trivial addition to track what hosts send whitelisted mail. A site of 
 any size will quickly get AOL, Hotmail, and a few others on that list -- they'll also get their organizational 
 partners on it quickly. This is the goodhost database, and it represents a sort of social network for your email. 
 You're likely to email them, and they're likely to email you. Doing the math for our site I found that less 
 than 1% of mail from these goodhosts is spam. And 89% of spam was from a not-goodhost. Each site's ratio will 
 be different, but I expect that the goodhost marker is a healthy sign that an email is not spam.
 
 So the goodhost database is sort of like an inverse-dns-blacklist that you don't have to download. Hosts absent 
 from the goodhost list will get your site's non-goodhost-spam ratio added to the Bayesian determination, once 
 that ratio is higher than 65%.
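
 In code the rule amounts to something like the sketch below. It is a Python illustration with 
 hypothetical names; the "added twice" detail comes from the description of isspam earlier on this 
 page, and the returned values would simply be appended to the message's list of Bayesian factors.

```python
def goodhost_factors(host, goodhosts, non_goodhost_spam_ratio):
    """Return the extra Bayesian factors for a connecting host.

    Hosts on the goodhost list add nothing. Other hosts add the site's
    non-goodhost spam ratio twice, but only once that ratio exceeds 0.65.
    """
    if host in goodhosts or non_goodhost_spam_ratio <= 0.65:
        return []
    return [non_goodhost_spam_ratio] * 2
```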
 
 Other benefits of the goodhost strategy:
 1) requires no download (unlike the DNSBL)
 2) totally self-maintaining & tuning
 3) totally customized to your own site's traffic patterns
 4) unspoofable by spammers
 5) this is exactly the sort of push that these short & misspelled mails needed to correctly fall into the spam pit.
 
 This is a good reason to tell your friends about ASSP -- it's only the best anti-spam tool in existence... And it's free.
 
 
 2003-Oct-22 1:34pm jhanna
 
 

What is the http://[\w\.]+@ default expression to identify spam?

 That's quite a smart expression to identify spam. It catches all mails that contain URLs in the 
 form http://fakedurl@normalurl.com
 
 It is most often used to trick the reader's eye as http://www.mcafee.com@spamsite.com/securitypatch.exe "looks" 
 as if it would connect to the trustworthy "www.mcafee.com" site where in reality it connects to "spamsite.com" 
 with a "username" that is "www.mcafee.com". If this website does not need authentication (and they never do), 
 then the username part is discarded.
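
 The behaviour described above can be checked with a few lines of Python (a sketch for 
 illustration, not part of ASSP). It applies the same expression with Python's re module and 
 uses the standard urllib.parse.urlsplit to show which part of the example URL is the real host; 
 the helper name has_userinfo_trick is hypothetical.

```python
import re
from urllib.parse import urlsplit

# The default expression discussed in this answer.
SPAM_URL_RE = re.compile(r"http://[\w\.]+@")


def has_userinfo_trick(text):
    """True if the text contains a URL of the form http://something@host."""
    return SPAM_URL_RE.search(text) is not None


example = "http://www.mcafee.com@spamsite.com/securitypatch.exe"
parts = urlsplit(example)
real_host = parts.hostname    # the site actually contacted
fake_prefix = parts.username  # the misleading "username" part
```

 Here real_host is "spamsite.com" and fake_prefix is "www.mcafee.com", while an ordinary link 
 such as http://www.mcafee.com/index.html does not match the expression.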
 
 By using this expression you will quickly sort out a bunch of Spams, that in turn automatically provide you 
 with suitable spamwords. I found no need to manually add more expressions.
 
 (Robert Orso: 2003-11-17)
 
 
 2003-Nov-17 11:00am jhanna

Why does ASSP only show one recipient per message in the maillog.txt file?

 Messages can have from one to hundreds of recipients. 
 We decided to only show the first one in the maillog for simplicity.
 
 
 2003-Nov-17 3:26pm jhanna
 
 

Virus blocked -- what was blocked and why?

 The short reason for "why" is that ASSP found an executable attachment.
 
 The log file gives you the time and sender (the sender is often faked, but the IP address will be right). 
 If you use the "other" folder "External mail that wasn't spam (mostly)" you can find a copy of what was blocked 
 there, though it's only the first 10k. That might be enough to try to recognize what was sent, either by 
 inspecting the file or by running a virus scanner. (You can identify the file by the creation date/time -- it 
 will match the time in the log entry.) Files don't stay there forever, though.
 
 
 2003-Nov-19 9:16am jhanna
 
 

Can I delete files from the spam / notspam / other collections?

 You can delete files from the other directory at any time and as you see fit.
 
 The spam and notspam files are used by rebuildspamdb.pl to create your spamdb. 
 Do not delete these files unless you become aware that your spam collection is 
 hopelessly corrupted and want to start from scratch, categorizing spam and notspam by hand.
 
 
 2004-Jan-26 9:44am jhanna

You can limit the number of files in the spam / notspam directories with the "Max Files" option under "Other Settings" in the web configuration (default is 18009). For this to work you must have "Use Subject as Maillog Names" unchecked. Also you can run the perl script "move2num.pl" to convert existing subject named files to numbers.

What order does ASSP process mail to check if it is spam?

 Testing goes like this:
 
 1) Local or whitelisted?
 2) Blacklisted Domain?
 3) Spam Helo?
 4) Addressed to spam-bucket?
 5) Mail bomb?
 6) Blocked attachment?
 7) Matches expression to identify non-spam?
 8) Matches expression to identify spam?
 9) Bayesian evaluation
 
 If the message is identified as spam at any step along the way it goes to the spam directory.
 
 If the message is local or whitelisted it goes to the notspam directory.
 
 All other cases it goes to the other directory.
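
 The "first decisive test wins" flow can be sketched as follows. This is a Python illustration, 
 not ASSP's code: only three of the nine tests are shown, the message is modelled as a plain 
 dictionary of precomputed flags, and every name in the sketch is hypothetical.

```python
def local_or_whitelisted(msg):
    """Test 1: local or whitelisted mail goes to the notspam directory."""
    return "notspam" if msg.get("local") or msg.get("whitelisted") else None


def to_spam_bucket(msg):
    """Test 4: mail addressed to a spam-bucket address is spam."""
    return "spam" if msg.get("to_spambucket") else None


def bayesian(msg):
    """Test 9: Bayesian evaluation, using the 0.6 threshold quoted earlier."""
    return "spam" if msg.get("bayes_probability", 0.0) > 0.6 else None


CHECKS = [local_or_whitelisted, to_spam_bucket, bayesian]  # in testing order


def classify(msg, checks=CHECKS):
    """Run the checks in order; the first one that returns a verdict wins."""
    for check in checks:
        verdict = check(msg)
        if verdict is not None:
            return verdict
    return "other"  # passed every test without being local or whitelisted
```

 Because the whitelist test comes first, whitelisted mail addressed to a spam bucket still lands 
 in notspam in this sketch, which matches the rule that only non-whitelisted senders are caught 
 by spam buckets.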
 
 
 2004-Jun-21 8:15am jhanna
 
 -------------------------------
 
 There is some update information on the ASSP Processing Order here.
 
 2006-Dec-19 8:48am gedwest
 
 

What is the helo blacklist?

[From a list mail by John Hanna dated 2004-06-11]

Hi. There have been a few questions about the new helo blacklist feature.
I'll try to answer the questions I've seen so far and provide a little
background. I'll probably add most of this to a page in the FAQ somewhere
too.

A) Background

Catching spam relies on taking what information is available about the
message at hand and weighing it against valid mail and other samples of
spam. For an automated process that leaves one with basically this
information:

1) the connecting host
2) the smtp dialog (helo, mail from, rcpt to, data, and sometimes a bit
more)
3) the mail header (received's, from, subject, date, to, cc, and various
other headers)
4) the mail body (content, attachments, word/phrase/character analysis,
urls, etc)

DNS-blacklists make a big deal about the connecting host. ASSP's greylist
does a pretty good job handling that issue. Ultimately, I'm not convinced
that the connecting host really gives you any useful information -- past
performance is no guaranteed indicator of future behavior. I've posted my
thoughts on SPF elsewhere and won't restate them here.

Most of the smtp dialog is of little use. ASSP (from the first version) has
recognized spam-only addresses as a means to identify spam, and that's a
useful indicator. Mail-from is often forged for spam so it's of little
predictive value. However I found that the HELO / EHLO directive is
(currently) an excellent indicator of one type of spam. (More probably,
there are a couple of spam programs that leave a clue in the HELO. This may
change at some point, but for now, it's our benefit.) More on this in the
next section.

The problem with the mail header is that it can be entirely forged. Some of
these forgeries can be detected, but by and large this is a fleeting and
time-wasting practice (in my opinion).

The mail body (and header subject) is where most of the Bayesian filtering
pays off. There are many methods of pre-processing (converting base64,
pre-parsing html, etc) and tokenizing (breaking up the body into bite-sized
chunks) and these different strategies result in different levels of success
for the Bayesian categorizer.

B) What's going on in the HELO?

Connecting SMTP clients are expected to greet the server with a HELO (or
EHLO) message with the name of their host. This greeting is largely ignored
because it is somewhat arbitrary -- NAT and various proxies mean there is no
practical way to verify the greeting. However many, many spams come with the
ip address of the server host in the HELO greeting. It's possible there are
some mail servers that look at the helo to allow relaying, I don't know. But
for whatever reason, about 20% of our spam comes with the helo of our server
and no non-spam ever does.

Because the HELO is arbitrarily chosen by spammers there is nothing one
could say in a HELO that would ever help a message be recognized as
not-spam. So there's no point in setting up a greylist-type structure for
HELOs. However I have quite a few HELOs at our site that only ever send
spam. And thus the helo blacklist was born.

C) How does it work in ASSP?

Rebuildspamdb goes through your mail archive and looks for the HELOs
(they're stored in the received header) of spam and notspam. If it finds 5
or more spam and no not-spam for a HELO, that HELO goes on the blacklist. If
it finds > 50 spam per not-spam it goes on the blacklist. (The equation is
spams / (spams + hams + 0.1) > 0.98).
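
The equation can be tried out directly. The sketch below is Python written
for illustration (the function name is hypothetical, and it is not the
rebuildspamdb code). Rearranging the inequality gives
spams > 49 * hams + 4.9, which is where the "5 or more with no not-spam"
and the roughly 50-to-1 figures come from.

```python
def helo_blacklisted(spams, hams):
    """Apply the quoted rule: spams / (spams + hams + 0.1) > 0.98."""
    return spams / (spams + hams + 0.1) > 0.98
```

With no ham at all, 4 spams are not enough but 5 are; with a single ham, 53
spams are not enough but 54 are.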

When mail is being processed by ASSP, if the mail qualifies for a spam check
(i.e. it's not local or whitelisted) and doesn't have a blacklisted domain,
then it is checked against the helo blacklist. If it matches, the mail is
classified as spam.

D) How can I disable it?

The helo blacklist should work excellently in almost every situation
automatically. However, if there's one thing I've found about ASSP, there are
exceptions to everything. This is what you can do to make sure the helo
blacklist works for you:

1) If it misqualifies ham, be sure to put copies (which include the full
mail header) in the errors/notspam folder and run rebuildspamdb.pl. This
will change the ratio for the HELO string and probably remove the helo from
the blacklist.
2) Add the sender to the whitelist -- whitelisted mail isn't checked against
the helo blacklist.
3) There are a couple of ways to disable the helo blacklist.
 a) Comment out assp.pl line 559 with a # so it becomes
  # $HeloBlackObject=tie %HeloBlack,orderedtie,"$base/$spamdb.helo" if $spamdb;
   and restart assp
 b) simply remove the spamdb.helo file after rebuildspamdb.pl runs
4) I meant to add a "Don't use the spamdb.helo" config option and I
forgot -- I'll get it in the next release. Sorry.

 
 --N.B. - the "Don't use the Helo Blacklist" option was added in version 1.0.11--
 
 
2005-Jul-04 2:42pm cpaine
