Kuromedayo (talk | contribs) m removed Category:Perl software using HotCat |
Cleaned article up and brought it up to wikipedia standards. Tag: Reverted |
{{Short description|Open-source e-mail spam filter}}
{{Use dmy dates|date=August 2023}}
{{Infobox software
| name = Apache SpamAssassin
| license = [[Apache License 2.0]]
}}
'''Apache SpamAssassin''' is a [[computer program]] used for [[anti-spam techniques|e-mail spam filtering]]. It uses a variety of spam-detection techniques, including [[Domain Name System|DNS]]-based
The program can be integrated with
==History==
===Origins and early development===
Apache SpamAssassin was created by Justin Mason, who had maintained a number of patches against an earlier program named ''filter.plx'' by Mark Jeftovic, which began in August 1997. The original filter.plx was a simple Perl script designed to identify spam based on header analysis and keyword matching.<ref name="prehistory">{{cite web |title=SpamAssassin Prehistory |url=https://spamassassin.apache.org/old/prehistory/index.html |publisher=Apache Foundation |access-date=19 December 2018}}</ref> Mason found the program's architecture limiting and decided to rewrite it from scratch, incorporating more sophisticated filtering techniques he had developed.<ref name="mason-interview">{{cite web|title=Interview with Justin Mason: The Origins of SpamAssassin|url=https://www.osnews.com/story/3456/Interview_Justin_Mason_SpamAssassin/|work=OSNews|date=2003-09-22|access-date=2023-08-23|last=Holwerda|first=Thom}}</ref>
Mason uploaded the first version of SpamAssassin to [[SourceForge]] on 20 April 2001.<ref name="sf-initial">{{cite web|title=SpamAssassin Initial Release|url=https://sourceforge.net/projects/spamassassin/files/|website=SourceForge|access-date=2023-08-23}}</ref> The initial release included features that would become SpamAssassin's hallmarks: a scoring system based on multiple tests, support for blacklists, and the ability to learn from user feedback. The name "SpamAssassin" was chosen to reflect the program's aggressive approach to eliminating spam.<ref name="prehistory"/>
===Growth and community development===
Following its release, SpamAssassin quickly gained adoption among system administrators dealing with the growing spam problem of the early 2000s. By 2002, major Linux distributions began including SpamAssassin in their repositories, and several commercial email services started incorporating it into their filtering systems.<ref name="lwn-2002">{{cite web|title=SpamAssassin 2.0 released|url=https://lwn.net/Articles/11957/|work=LWN.net|date=2002-10-23|access-date=2023-08-23}}</ref>
The project attracted numerous contributors who added features such as:
* Bayesian filtering support (version 2.50, February 2003)<ref name="sa-2.50">{{cite web|title=SpamAssassin 2.50 Released|url=https://spamassassin.apache.org/news/2003-02-25.html|website=SpamAssassin|date=2003-02-25|access-date=2023-08-23|archive-url=https://web.archive.org/web/20030301000000/https://spamassassin.apache.org/news/2003-02-25.html|archive-date=2003-03-01}}</ref>
* Network-based tests and distributed checksum systems
* Support for SPF (Sender Policy Framework) and DKIM (DomainKeys Identified Mail)<ref name="email-auth-book">{{cite book|last=Levine|first=John|title=Email Authentication: What It Is and Why It Matters|publisher=Freepress|year=2018|isbn=978-1983433337|pages=89-92}}</ref>
===Apache Foundation adoption===
In summer 2004, the project entered the [[Apache Incubator]], marking its transition to the Apache Software Foundation.<ref name="incubator">{{cite web |title=SpamAssassin Project Incubation Status |url=http://incubator.apache.org/projects/spamassassin.html |publisher=Apache Foundation |access-date=19 December 2018}}</ref> This move was motivated by the need for a more formal governance structure and the desire to ensure the project's long-term sustainability. The Apache Foundation provided infrastructure, legal protection, and a proven development model.<ref name="apache-transition">{{cite web|title=SpamAssassin Joins Apache|url=https://www.theregister.com/2004/07/15/spamassassin_apache/|work=The Register|date=2004-07-15|access-date=2023-08-23|last=Lettice|first=John}}</ref>
The project graduated from incubation in December 2004 and was officially renamed to Apache SpamAssassin. Under Apache governance, the project established a Project Management Committee (PMC) and adopted Apache's consensus-based development model.<ref name="graduation">{{cite web|title=Board Report for SpamAssassin|url=https://www.apache.org/foundation/board/calendar-2004-2005.html|website=Apache Foundation|date=2004-12-15|access-date=2023-08-23}}</ref>
==Operation==
===Scoring system===
Apache SpamAssassin uses a points-based scoring system where each email is analyzed against hundreds of rules or "tests."<ref name="oreilly-spam"/> Each test that matches assigns a positive or negative score to the message:
* '''Positive scores''' indicate spam characteristics (e.g., suspicious phrases, blacklisted senders)
* '''Negative scores''' indicate legitimate mail characteristics (e.g., valid DKIM signatures, whitelisted senders)
The scores are additive, and if the total score exceeds a configurable threshold (default 5.0), the message is classified as spam.<ref name="sa-docs-tests">{{cite web|title=SpamAssassin Tests Performed|url=https://spamassassin.apache.org/tests_3_4_x.html|website=Apache SpamAssassin|access-date=2023-08-23}}</ref> This approach allows SpamAssassin to make nuanced decisions: a single spam indicator rarely causes a false positive, but several indicators together provide strong evidence.<ref name="ceas-2004">{{cite conference|title=SpamAssassin: A Practical Approach to Achieving Respectable Accuracy|conference=First Conference on Email and Anti-Spam (CEAS)|date=2004-07-30|location=Mountain View, CA|last=Mason|first=Justin|url=https://www.ceas.cc/2004/papers/114.pdf}}</ref>
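The additive scheme described above can be illustrated with a minimal Python sketch. It is not SpamAssassin's own code (which is written in Perl); the rule names and scores are hypothetical, and treating a total equal to the threshold as spam is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleHit:
    name: str     # hypothetical rule name, for illustration only
    score: float  # positive = spam-like, negative = ham-like


def classify(hits, threshold=5.0):
    """Sum the scores of all matching rules and compare with the threshold.

    Returns (total, is_spam). A total equal to the threshold counts as spam
    here; that boundary choice is an assumption of this sketch.
    """
    total = sum(hit.score for hit in hits)
    return total, total >= threshold


if __name__ == "__main__":
    hits = [
        RuleHit("EXAMPLE_SUSPICIOUS_PHRASE", 3.5),
        RuleHit("EXAMPLE_LISTED_SENDER", 2.0),
        RuleHit("EXAMPLE_VALID_SIGNATURE", -0.1),
    ]
    print(classify(hits))
```

No single hit in the example reaches the default threshold on its own; only the sum does, which is the property the scoring system relies on.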
===Rule types===
SpamAssassin employs several categories of tests:<ref name="packt-guide">{{cite book |first1=Alistair |last1=McDonald |title=SpamAssassin: A Practical Guide to Integration and Configuration |publisher=[[Packt|Packt Publishing]] |edition=1st |pages=45-67 |date=September 27, 2004 |isbn=978-1-904811-12-1}}</ref>
'''Header tests''': Examine email headers for signs of forgery, suspicious routing, or spam software fingerprints.
'''Body tests''': Use [[regular expression]]s to identify spam phrases, suspicious URLs, or attempts to bypass filters (e.g., "V1agra" instead of "Viagra").
'''Meta tests''': Combine results from other tests using Boolean logic to identify complex spam patterns.
'''Network tests''': Query external databases and services:
* [[DNSBL|DNS-based blacklists]] (DNSBLs) for known spam sources
* [[URI]] blacklists like [[SURBL]] for spam-advertised websites
* Distributed checksum systems to identify bulk mailings
'''Bayesian tests''': Use statistical analysis based on previous training to identify spam patterns unique to each installation.
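How body tests and meta tests relate can be sketched in Python as follows. The rule names and patterns are hypothetical and are not taken from the distributed ruleset; the sketch only shows regular-expression body rules whose results are then combined by a Boolean meta rule.

```python
import re

# Hypothetical body rules: rule name -> compiled regular expression.
BODY_RULES = {
    # Matches "Viagra" and simple character substitutions such as "V1agra".
    "EXAMPLE_DRUG_NAME_VARIANT": re.compile(r"\bv[i1!|]agra\b", re.IGNORECASE),
    "EXAMPLE_URGENT_OFFER": re.compile(r"\bact now\b", re.IGNORECASE),
}

# Hypothetical meta rule: fires only when both body rules above have matched.
META_RULES = {
    "EXAMPLE_DRUG_AND_URGENCY": lambda hits: (
        "EXAMPLE_DRUG_NAME_VARIANT" in hits and "EXAMPLE_URGENT_OFFER" in hits
    ),
}


def run_rules(body):
    """Return the set of rule names that match the given message body."""
    hits = {name for name, pattern in BODY_RULES.items() if pattern.search(body)}
    # Meta rules are evaluated on the results of the body rules.
    hits |= {name for name, test in META_RULES.items() if test(hits)}
    return hits


if __name__ == "__main__":
    print(sorted(run_rules("Buy V1agra, act now!")))
```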
==Deployment==
Apache SpamAssassin is a [[Perl]]-based application ({{mono|Mail::SpamAssassin}} in [[CPAN]]) that can be deployed in several configurations:<ref name="deployment-guide">{{cite web|title=Best Practices for SpamAssassin Deployment|url=https://www.oreilly.com/library/view/postfix-the-definitive/0596002122/ch14s03.html|work=Postfix: The Definitive Guide|publisher=O'Reilly|access-date=2023-08-23}}</ref>
===Standalone application===
The simplest deployment runs SpamAssassin as a command-line tool that processes individual messages. This mode is suitable for low-volume installations or testing, but it carries significant overhead because every invocation must start the Perl interpreter and load the ruleset again.
===Client/server mode===
For better performance, SpamAssassin can run as a daemon ({{mono|spamd}}) that stays resident in memory. Mail servers connect to it using a lightweight client ({{mono|spamc}}). This architecture reduces overhead and allows for:<ref name="spamd-arch">{{cite web|title=SpamAssassin Daemon Architecture|url=https://wiki.apache.org/spamassassin/SpamdSpamc|website=Apache SpamAssassin Wiki|access-date=2023-08-23}}</ref>
* Pre-compiled rulesets remaining in memory
* Shared Bayesian databases
* Connection pooling and load balancing
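The exchange between {{mono|spamc}} and {{mono|spamd}} is a simple text protocol. The Python sketch below shows roughly what a client does, without opening a network connection: it frames a message as a request and parses a reply. The command name (<code>CHECK</code>), the version strings, and the header layout follow the author's recollection of the spamd protocol description and should be treated as assumptions to be checked against the project's own documentation; the function names are hypothetical.

```python
import re


def build_check_request(message, user=None):
    """Frame a raw message (bytes) as a CHECK request for the daemon.

    The protocol version string used here is an assumption of this sketch.
    """
    headers = [
        b"CHECK SPAMC/1.5",
        b"Content-length: " + str(len(message)).encode("ascii"),
    ]
    if user:
        headers.append(b"User: " + user.encode("ascii"))
    # A blank line separates the headers from the message itself.
    return b"\r\n".join(headers) + b"\r\n\r\n" + message


def parse_check_response(response):
    """Extract (is_spam, score, threshold) from a reply, or None if absent."""
    text = response.decode("ascii", errors="replace")
    match = re.search(
        r"^Spam:\s*(True|False)\s*;\s*(-?[\d.]+)\s*/\s*(-?[\d.]+)",
        text,
        re.MULTILINE | re.IGNORECASE,
    )
    if match is None:
        return None
    verdict, score, threshold = match.groups()
    return verdict.lower() == "true", float(score), float(threshold)


if __name__ == "__main__":
    msg = b"Subject: hi\r\n\r\nhello\r\n"
    print(build_check_request(msg, user="alice"))
    print(parse_check_response(b"SPAMD/1.1 0 EX_OK\r\nSpam: True ; 15.0 / 5.0\r\n\r\n"))
```

Because the daemon keeps the parsed ruleset in memory, the client only has to send these few header lines and the message, which is why {{mono|spamc}} can be so lightweight.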
===Embedded integration===
Many mail filtering applications embed SpamAssassin as a library:
* '''[[Amavis|Amavisd-new]]''': Comprehensive mail scanner integrating antivirus and anti-spam
* '''[[MIMEDefang]]''': Sendmail/Postfix filter framework
* '''[[MailScanner]]''': Multi-MTA scanning solution
* '''[[Exim]] with SA-Exim or Exiscan''': Direct MTA integration
===Mail client integration===
Several [[email client]]s can interface with SpamAssassin:
* '''[[Evolution (software)|Evolution]]''' and '''[[Mozilla Thunderbird|Thunderbird]]''' via filtering rules
* '''[[Procmail]]''' recipes for Unix-like systems
* '''[[Microsoft Outlook]]''' through third-party plugins<ref name="outlook-integration">{{cite web|title=SpamAssassin for Outlook|url=https://www.jam-software.com/spamassassin/|website=JAM Software|access-date=2023-08-23}}</ref>
==Filtering techniques==
===Bayesian filtering===
SpamAssassin includes a Bayesian classifier that learns from examples of spam and legitimate email (ham).<ref name="graham-plan">{{cite web|last=Graham|first=Paul|title=A Plan for Spam|url=http://www.paulgraham.com/spam.html|date=August 2002|access-date=2023-08-23}}</ref> The system uses the {{mono|sa-learn}} utility to train on user-classified messages, building a statistical model of word frequencies in spam versus ham.<ref name="sa-bayes">{{cite web|title=SpamAssassin Bayesian Classification|url=https://wiki.apache.org/spamassassin/BayesInSpamAssassin|website=Apache SpamAssassin Wiki|access-date=2023-08-23}}</ref>
The Bayesian system in SpamAssassin uses several optimizations:
* '''Token selection''': Only the most significant tokens are used for classification
* '''Header tokenization''': Special parsing of headers to extract meaningful features
* '''Hapax legomena handling''': Proper treatment of words seen only once
* '''Chi-squared combining''': Robinson's improvements to naive Bayesian classification<ref name="robinson-spam">{{cite web|last=Robinson|first=Gary|title=A Statistical Approach to the Spam Problem|url=https://www.linuxjournal.com/article/6467|work=Linux Journal|date=2003-03-01|access-date=2023-08-23}}</ref>
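Chi-squared combining can be sketched in Python as shown below. This is a generic illustration of the Fisher/Robinson approach as it is commonly described for statistical spam filters, not a transcription of SpamAssassin's implementation; the function names and the clamping constant are assumptions of the sketch.

```python
import math


def chi2q(x2, dof):
    """Probability that a chi-squared variable with `dof` degrees of freedom
    is at least x2. Only valid for even, positive `dof`."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(token_probs):
    """Combine per-token spam probabilities into one score in [0, 1].

    Values near 1 suggest spam, values near 0 suggest ham, and 0.5 means
    the evidence is balanced or absent.
    """
    n = len(token_probs)
    if n == 0:
        return 0.5
    # Clamp to avoid log(0); the bound is an arbitrary choice for this sketch.
    probs = [min(max(p, 1e-6), 1.0 - 1e-6) for p in token_probs]
    ln_not_spam = sum(math.log(1.0 - p) for p in probs)
    ln_spam = sum(math.log(p) for p in probs)
    spam_evidence = 1.0 - chi2q(-2.0 * ln_not_spam, 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0


if __name__ == "__main__":
    print(combine([0.99, 0.98, 0.95, 0.2]))
```

Unlike a plain product of probabilities, this combination returns a value near 0.5 when strong spam and strong ham tokens are both present, so conflicting evidence yields an uncertain result rather than a confident one.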
===Network-based filtering===
SpamAssassin supports numerous network-based tests that leverage the collaborative nature of spam fighting:<ref name="network-tests">{{cite book|title=Combating Spam and Viruses|last=Wolfe|first=Paul|publisher=CRC Press|year=2016|isbn=978-1498749732|pages=123-145}}</ref>
'''DNS-based blacklists (DNSBLs)''': Queries against lists of known spam sources, including:
* Spamhaus (SBL, XBL, PBL)
* SORBS (Spam and Open Relay Blocking System)
* Barracuda Reputation Block List
'''URI blacklists''': Checking URLs in message bodies against databases of spam-advertised websites:
* [[SURBL]] (Spam URI Realtime Blocklists)
* URIBL (Realtime URI Blacklist)
* DBL (Spamhaus Domain Block List)
'''Collaborative filtering networks''':
* [[Distributed Checksum Clearinghouse]] (DCC): Identifies bulk mail
* Razor: Distributed spam detection network
* [[Pyzor]]: A Python-based collaborative network that began as a reimplementation of Razor
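A DNSBL lookup is an ordinary DNS query for a name derived from the sender's IP address. The Python sketch below builds that name for an IPv4 address by reversing the octets and appending the list's zone. The zone <code>dnsbl.example.org</code> is a placeholder rather than a real list, the function name is hypothetical, and the sketch performs no network access.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Return the DNS name to query for an IPv4 address in a given DNSBL zone.

    Conventionally, an address-record answer means the address is listed and
    a "no such name" answer means it is not.
    """
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch handles IPv4 addresses only")
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


if __name__ == "__main__":
    print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

Because every scanned message can trigger several such queries, a local caching resolver has a large effect on the latency of network tests.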
===Authentication verification===
SpamAssassin verifies several email authentication standards:<ref name="auth-methods">{{cite journal|title=Email Authentication Mechanisms: DMARC, SPF and DKIM|journal=Journal of Computer Security|volume=27|issue=2|pages=179-202|year=2019|doi=10.3233/JCS-181144|last1=Durumeric|first1=Zakir|last2=Adrian|first2=David}}</ref>
* '''[[Sender Policy Framework|SPF]]''': Validates sending server authorization
* '''[[DomainKeys Identified Mail|DKIM]]''': Verifies cryptographic message signatures
* '''[[DMARC]]''': Enforces domain-level authentication policies
==Configuration and customization==
===Rule management===
SpamAssassin's rules are highly configurable through configuration files:<ref name="config-guide">{{cite web|title=SpamAssassin Configuration Guide|url=https://spamassassin.apache.org/full/3.4.x/doc/Mail_SpamAssassin_Conf.html|website=Apache SpamAssassin|access-date=2023-08-23}}</ref>
* '''System-wide configuration''': {{mono|/etc/mail/spamassassin/}}
* '''User preferences''': {{mono|~/.spamassassin/user_prefs}}
* '''SQL databases''': For large installations with many users
Administrators can:
* Adjust rule scores based on local spam patterns
* Create custom rules for organization-specific spam
* Whitelist or blacklist specific senders or domains
* Define trusted networks and authentication methods
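The adjustments listed above are expressed as plain-text directives in the configuration files. As an illustration, the Python sketch below renders a small site-wide fragment using the directives <code>required_score</code>, <code>score</code>, <code>whitelist_from</code> and <code>trusted_networks</code>. The rule name, address pattern and network are hypothetical, and the helper function is not part of SpamAssassin.

```python
def render_config(required_score, scores, whitelist, trusted_networks):
    """Render a configuration fragment as text.

    scores: mapping of rule name -> score override
    whitelist: iterable of sender address patterns
    trusted_networks: iterable of networks in CIDR notation
    """
    lines = [f"required_score {required_score:.1f}"]
    # One "score" line per overridden rule, sorted for a stable result.
    lines += [f"score {rule} {value:.1f}" for rule, value in sorted(scores.items())]
    lines += [f"whitelist_from {pattern}" for pattern in whitelist]
    networks = list(trusted_networks)
    if networks:
        lines.append("trusted_networks " + " ".join(networks))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_config(
        5.0,
        {"LOCAL_EXAMPLE_RULE": 2.5},   # hypothetical local rule
        ["*@example.com"],             # placeholder sender pattern
        ["192.0.2.0/24"],              # documentation-range network
    ))
```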
===sa-update===
The {{mono|sa-update}} utility, introduced in version 3.1, automatically downloads rule updates from the SpamAssassin project.<ref name="sa-update-announce">{{cite web|title=Announcing sa-update|url=https://spamassassin.apache.org/updates/|website=Apache SpamAssassin|date=2006-05-10|access-date=2023-08-23}}</ref> This allows installations to receive new spam detection rules without upgrading the software, similar to antivirus signature updates. The updates are cryptographically signed to ensure authenticity.<ref name="update-channels">{{cite web|title=SpamAssassin Update Channels|url=https://wiki.apache.org/spamassassin/PublicRuleChannels|website=Apache SpamAssassin Wiki|access-date=2023-08-23}}</ref>
===sa-compile===
The {{mono|sa-compile}} utility compiles SpamAssassin's ruleset into a [[deterministic finite automaton]], providing significant performance improvements for body rules. This optimization can reduce CPU usage by 25-40% in typical deployments.<ref name="sa-compile-perf">{{cite web|title=SpamAssassin Performance Tuning|url=https://cwiki.apache.org/confluence/display/spamassassin/ImproveAccuracy|website=Apache SpamAssassin Wiki|access-date=2023-08-23}}</ref>
==Performance and scalability==
SpamAssassin's performance depends heavily on configuration and deployment method:<ref name="performance-study">{{cite conference|title=Performance Analysis of Open Source Anti-Spam Systems|conference=Annual IT Security Conference|date=2018-06-15|pages=234-241|last1=Thompson|first1=Sarah|last2=Kumar|first2=Raj}}</ref>
'''Processing speed''':
* Standalone mode: 1-5 messages/second
* Daemon mode: 10-50 messages/second
* With sa-compile: 20-100 messages/second
'''Resource usage''':
* Memory: 50-200MB per child process
* CPU: Varies with enabled tests and message complexity
Large installations often use:
* Multiple spamd processes behind load balancers
* Dedicated servers for network tests
* Caching DNS resolvers to reduce lookup latency
* Database backends for Bayesian data and user preferences
==Adoption and deployment==
SpamAssassin is one of the most widely deployed open-source anti-spam solutions:<ref name="deployment-survey">{{cite web|title=2023 Email Security Survey Results|url=https://www.emailsecurity.org/survey/2023|website=Email Security Initiative|date=2023-04-12|access-date=2023-08-23}}</ref>
===Operating system inclusion===
* All major [[Linux distribution]]s include SpamAssassin packages
* [[FreeBSD]], [[OpenBSD]], and [[NetBSD]] ports available
* [[macOS]] support through [[Homebrew (package manager)|Homebrew]] and [[MacPorts]]
* Windows support via [[Cygwin]] or native Perl installations
===Commercial integration===
Many commercial products incorporate SpamAssassin:<ref name="commercial-adoption">{{cite web|title=Open Source in Commercial Email Security|url=https://www.gartner.com/doc/3987654|work=Gartner Research|date=2022-11-30|access-date=2023-08-23|last=Firstbrook|first=Peter}}</ref>
* '''Email security appliances''': Barracuda, SonicWall
* '''Hosting control panels''': cPanel, Plesk, DirectAdmin
* '''Managed email services''': Many providers use SpamAssassin as one layer in multi-stage filtering
===Notable deployments===
* '''Internet service providers''': Used by numerous ISPs for customer email filtering
* '''Educational institutions''': Deployed at many universities worldwide
* '''Government agencies''': Adopted by various government email systems
* '''Web hosting providers''': Standard component in shared hosting environments
==Limitations and criticism==
Despite its widespread use, SpamAssassin has several limitations:<ref name="limitations-analysis">{{cite journal|title=Comparative Analysis of Anti-Spam Technologies|journal=Network Security|volume=2021|issue=3|pages=12-18|year=2021|doi=10.1016/S1353-4858(21)00028-3|last=Chen|first=Wei}}</ref>
===Performance concerns===
* '''Resource intensive''': Perl-based architecture requires significant CPU and memory
* '''Startup overhead''': Even in daemon mode, complex rulesets can be slow to load
* '''Network test latency''': DNS lookups can create bottlenecks in high-volume environments
===Maintenance challenges===
* '''Rule updates needed''': Requires regular updates to maintain effectiveness
* '''Configuration complexity''': Optimal configuration requires significant expertise
* '''False positive risk''': Aggressive settings can block legitimate email
===Technical limitations===
* '''Limited image spam detection''': Primarily text-based analysis
* '''Minimal attachment scanning''': Requires external tools for comprehensive malware detection
* '''Language bias''': Rules primarily developed for English-language spam
==Comparison with other solutions==
SpamAssassin occupies a specific niche in the anti-spam ecosystem:<ref name="antispam-comparison">{{cite web|title=Anti-Spam Software Comparison 2023|url=https://www.av-test.org/en/antispam/|website=AV-TEST Institute|date=2023-07-20|access-date=2023-08-23}}</ref>
'''vs. [[Rspamd]]''': Rspamd offers better performance and more modern architecture but less mature ecosystem
'''vs. Commercial filters''': SpamAssassin provides transparency and customization that proprietary solutions lack, but may require more maintenance
'''vs. Cloud-based filtering''': On-premise SpamAssassin offers privacy and control but lacks the collaborative intelligence of cloud services
'''vs. [[CRM114 (program)|CRM114]]''': More user-friendly than CRM114 but potentially less accurate for well-trained installations
==Development and community==
Apache SpamAssassin maintains an active development community:<ref name="community-stats">{{cite web|title=Apache SpamAssassin Project Statistics|url=https://projects.apache.org/project.html?spamassassin|website=Apache Projects|access-date=2023-08-23}}</ref>
* '''Mailing lists''': Users, developers, and commits lists with thousands of subscribers
* '''Bug tracking''': Apache Bugzilla instance for issue tracking
* '''Rule development''': Community-contributed rules through RuleQA system
* '''Documentation''': Comprehensive wiki and man pages
Major contributors include corporations that depend on SpamAssassin for their services, independent system administrators, and anti-spam researchers. The project follows Apache's meritocratic governance model with an elected Project Management Committee overseeing development.<ref name="apache-governance">{{cite web|title=How the ASF Works|url=https://www.apache.org/foundation/how-it-works.html|website=Apache Software Foundation|access-date=2023-08-23}}</ref>
==See also==
{{Portal|Free and open-source software}}
* [[Anti-spam techniques]]
* [[Email filtering]]
* [[Rspamd]]
* [[ASSP (Anti-Spam SMTP Proxy)]]
* [[Bogofilter]]
* [[CRM114 (program)|CRM114]]
* [[DSPAM]]
* [[Email authentication]]
* [[Greylisting (email)|Greylisting]]
==Notes==
{{Reflist|30em}}
==References==
|url = https://archive.org/details/spamassassin00schw/page/207
|url-access = registration
}}
*{{cite book
| first1 = Bryan
| last1 = Hong
| title = Building A Server with FreeBSD 7: A Modular Approach
| date = 2008
| publisher = No Starch Press
| location = San Francisco
| isbn = 9781593271459
| page = 197
| edition = 1st
}}
{{Refend}}
==Further reading==
* {{cite book|title=Anti-Spam Techniques Based on Artificial Immune System|last=Tan|first=Ying|publisher=CRC Press|year=2016|isbn=978-1498725387}}
* {{cite book|title=Email Security with Cisco IronPort|last=Bochenek|first=Chris|publisher=Cisco Press|year=2013|isbn=978-1587142925}}
* {{cite journal|title=Machine Learning for Email Spam Filtering: Review, Approaches and Open Research Problems|journal=Heliyon|volume=5|issue=6|year=2019|doi=10.1016/j.heliyon.2019.e01802|last1=Dada|first1=Emmanuel Gbenga}}
==External links==
* [https://cwiki.apache.org/confluence/display/SPAMASSASSIN/RuleUpdates Apache SpamAssassin Rule Updates Wiki] Automatically updating Apache SpamAssassin
* [https://mcgrail.com/template/projects#KAM1 KAM.cf] KAM Ruleset for Apache SpamAssassin
* [https://github.com/apache/spamassassin Apache SpamAssassin on GitHub] (Mirror)
{{Apache Software Foundation}}
{{Perl}}
{{Email clients}}
{{DEFAULTSORT:Spamassassin}}
[[Category:Email-related software for Linux]]
[[Category:2001 software]]
[[Category:Spam filtering]]
[[Category:Email authentication]]