We store all Bayesian and whitelist data for SpamAssassin in a PostgreSQL database. Keeping it in one database allows every member of our email cluster to access the same data.
Bayesian processing works by noticing lots of little unique snippets (tokens) and storing them for future reference. If many of a message's tokens were previously seen in messages flagged as spam, that message is more likely to be spam as well. The same weighting applies in the other direction for non-spam messages. Over time, tracking these tokens improves the quality of your spam detection.
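To make the idea concrete, here is a toy sketch in Python of token counting and scoring. It is not SpamAssassin's actual algorithm: the tokenizer, the `ToyBayes` class, and the simple averaging of per-token scores are all simplifications invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a message into a set of lowercase word tokens (a crude stand-in
    for a real tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class ToyBayes:
    """Hypothetical, heavily simplified token store for illustration only."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam messages containing it
        self.ham_tokens = Counter()   # token -> number of non-spam messages containing it
        self.spam_msgs = 0
        self.ham_msgs = 0

    def learn(self, text, is_spam):
        """Record every token of a message under the spam or non-spam counter."""
        tokens = tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_msgs += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_msgs += 1

    def token_spamminess(self, token):
        """Return a value in (0, 1); above 0.5 means the token leans spam."""
        # Add-one smoothing keeps unseen tokens at a neutral 0.5.
        spam_rate = (self.spam_tokens[token] + 1) / (self.spam_msgs + 2)
        ham_rate = (self.ham_tokens[token] + 1) / (self.ham_msgs + 2)
        return spam_rate / (spam_rate + ham_rate)

    def score(self, text):
        """Average the per-token values (a simplification of real Bayesian combining)."""
        tokens = tokenize(text)
        if not tokens:
            return 0.5
        return sum(self.token_spamminess(t) for t in tokens) / len(tokens)


if __name__ == "__main__":
    bayes = ToyBayes()
    bayes.learn("cheap pills buy now", is_spam=True)
    bayes.learn("meeting notes attached", is_spam=False)
    print(bayes.score("buy cheap pills"))
    print(bayes.score("meeting notes"))
```

The more messages are learned, the more tokens carry a meaningful lean one way or the other, which is also why the stored token data grows so quickly.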
We have been running a SQL-backed Bayesian instance for several months, with remarkable results.
The downside is a massive and growing requirement for data storage and processing power. SpamAssassin cleans up after itself and won't grow infinitely, but it still wants to maintain data for roughly 4,000 emails per Bayes database, and in our setup every account had its own. With over 2,000 active email accounts, each generating tens of thousands of tokens, our SQL database quickly grew to over 11 million rows. A few weeks ago, we allocated a dedicated server just to house this data!
Rather than fighting a losing resource battle, we wanted to consolidate this data. SpamAssassin lets you set a variable, bayes_sql_override_username, which overrides the username the Bayes data is stored under so that several accounts can share one database. The recommended use is to share a single database across all accounts.
But that wouldn't work given the variety and scope of our clients. Imagine we hosted email for a doctor, who could quite plausibly receive a legitimate email containing the word "viagra."
Instead, I decided to add a bayes_sql_override_username attribute to every email account we host and to use it to group data by domain. I then wrote a script that added this attribute to each email address in our LDAP tree.
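The grouping logic itself is simple. Below is a minimal Python sketch of how the per-domain value could be derived for each address. It assumes the override value is just the lowercased domain name, which is my illustration rather than a record of the exact value we used; `bayes_group_for` and `plan_overrides` are hypothetical names, and the actual LDAP write is left out.

```python
def bayes_group_for(address):
    """Return the per-domain group name for an email address.

    Assumption: the group name is simply the lowercased domain part.
    """
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not a valid email address: {address!r}")
    return domain.lower()


def plan_overrides(addresses, special_cases=None):
    """Map each address to the override value it should receive.

    special_cases lets individual addresses opt out of domain grouping,
    for example by keeping a Bayes database of their own.
    """
    special_cases = special_cases or {}
    return {
        addr: special_cases.get(addr, bayes_group_for(addr))
        for addr in addresses
    }


if __name__ == "__main__":
    accounts = ["alice@example.com", "bob@example.com", "carol@example.org"]
    # Hypothetical special case: bob keeps his own token data.
    plan = plan_overrides(accounts, {"bob@example.com": "bob@example.com"})
    for addr, value in plan.items():
        print(f"{addr}: bayes_sql_override_username = {value}")
```

Keeping the value per address, rather than hard-coding one rule per domain, is what leaves room for exceptions later.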
Rather than throwing away the data we had already collected, I wrote a script that finds the account with the most data in each domain and renames it to become the domain account. All other per-address entries were then deleted.
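The selection step can be sketched in Python as follows. This is an illustration of the approach rather than the script itself: it assumes you have already pulled a token count per username out of the database (in the stock SQL schema that information should be in the bayes_vars table, but the query is omitted here), and `plan_consolidation` is a hypothetical name.

```python
from collections import defaultdict


def plan_consolidation(token_counts):
    """Decide, per domain, which account to rename and which to delete.

    token_counts maps a Bayes username (an email address) to its token count.
    Returns {domain: {"rename": keeper, "delete": [other usernames]}}.
    """
    by_domain = defaultdict(list)
    for username, count in token_counts.items():
        domain = username.rpartition("@")[2].lower()
        by_domain[domain].append((count, username))

    plan = {}
    for domain, accounts in by_domain.items():
        # Keep the account with the most tokens; ties are broken by
        # username so the result is deterministic.
        keeper = max(accounts)[1]
        plan[domain] = {
            "rename": keeper,
            "delete": sorted(u for _, u in accounts if u != keeper),
        }
    return plan


if __name__ == "__main__":
    counts = {
        "alice@example.com": 12000,
        "bob@example.com": 48000,
        "carol@example.org": 9000,
    }
    for domain, actions in plan_consolidation(counts).items():
        print(domain, actions)
```

Producing a plan first, and only then applying the renames and deletes, makes it easy to review the outcome before touching any data.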
This got the Bayesian token database down from over 11 million rows to under 4.5 million, with a much more rational growth pattern. The trade-off in accuracy should be negligible. And because the attribute is set on each email address individually, we can always set up special cases.
All in all, a good day's work!







