Matthew Weier O'Phinney [2006-07-29 15:27 UTC] Could you put up a .phps to make viewing the source easier? Thanks.
Markus Wolff [2006-07-29 16:02 UTC] Hi,
nice idea, something like that is surely needed. One suggestion though: when training, the class only takes either HAM or SPAM as the classification parameter. If there were another, maybe optional, parameter where you could also specify a kind of "pool" or "domain" (or whatever) for the current text, the user could classify several kinds of texts. Like, "this text has an 80% probability of fitting in drawer A, but also a 75% probability of fitting in drawer B...". Alternatively, this could be passed as an option to the storage container, so that it knows which pool to insert the classification data into.
I hope this was at least half understandable :-)
Philippe Jausions [2006-07-29 16:42 UTC] Is there any specific reason not to use MDB2 instead of PDO?
Andreas Ahlenstorf [2006-07-29 16:43 UTC] @Markus Wolff:
Sorry, I don't get it.
Would you like to have something like multi-user capability? We would be able to maintain different spam/ham counts for each user instead of having a global spam/ham database for all users.
Or are you thinking about user-supplied spam probabilities? That doesn't make much sense, because that's already what the training mechanism is for.
Christian Weiske [2006-07-29 16:46 UTC] I think Markus wants to be able to define multiple containers, not only spam/ham.
Philippe Jausions [2006-07-29 16:46 UTC] Also, check whether Services_SpamCheck (http://pear.php.net/pepr/pepr-proposal-show.php?id=378) and your package could play nicely together. Not necessarily as one package, but at least try to bring consistency to the API.
Andreas Ahlenstorf [2006-07-29 16:54 UTC] @Philippe Jausions:
We don't use MDB2, and I personally have never used it either. But we're using PDO heavily (reasons: sqlite3, prepared statement emulation, etc.).
Adding a driver for MDB2, the native database extensions or even Berkeley DB wouldn't be a problem. That's what the interface is for.
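To illustrate the point about the interface, here is a minimal sketch of how a storage interface keeps the classifier independent of the backend. It is written in Python only to keep it short and is not the package's actual PHP interface; the names `TokenStorage`, `MemoryStorage`, `SqliteStorage`, `increment`, `count` and `train` are all assumptions made up for this example.

```python
import sqlite3
from abc import ABC, abstractmethod


class TokenStorage(ABC):
    """Hypothetical storage interface: one driver per backend."""

    @abstractmethod
    def increment(self, token: str, label: str) -> None:
        """Record one more occurrence of token under label ('ham' or 'spam')."""

    @abstractmethod
    def count(self, token: str, label: str) -> int:
        """Return how often token was seen under label."""


class MemoryStorage(TokenStorage):
    """Driver that keeps the counts in a plain dict."""

    def __init__(self):
        self._counts = {}

    def increment(self, token, label):
        key = (token, label)
        self._counts[key] = self._counts.get(key, 0) + 1

    def count(self, token, label):
        return self._counts.get((token, label), 0)


class SqliteStorage(TokenStorage):
    """Driver that keeps the counts in an SQLite table."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS tokens ("
            "token TEXT, label TEXT, n INTEGER NOT NULL, "
            "PRIMARY KEY (token, label))"
        )

    def increment(self, token, label):
        # Try to bump an existing row first; insert if there was none.
        cur = self._db.execute(
            "UPDATE tokens SET n = n + 1 WHERE token = ? AND label = ?",
            (token, label),
        )
        if cur.rowcount == 0:
            self._db.execute(
                "INSERT INTO tokens (token, label, n) VALUES (?, ?, 1)",
                (token, label),
            )
        self._db.commit()

    def count(self, token, label):
        row = self._db.execute(
            "SELECT n FROM tokens WHERE token = ? AND label = ?",
            (token, label),
        ).fetchone()
        return row[0] if row else 0


def train(storage: TokenStorage, text: str, label: str) -> None:
    """The training code only sees the interface, never the backend."""
    for token in text.lower().split():
        storage.increment(token, label)


# Swapping the backend does not change the calling code.
for backend in (MemoryStorage(), SqliteStorage()):
    train(backend, "cheap watches", "spam")
```

A driver for another database layer would be one more class implementing the same two methods.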
Andreas Ahlenstorf [2006-07-29 19:44 UTC] Regarding the multiple containers (not only spam/ham):
I thought about it for some time, but I couldn't find a real use case for it. Any suggestions?
Markus Wolff [2006-07-29 19:47 UTC] Andreas said:
> Sorry, I don't get it.
> Would you like to have something like multi-user capability?
...and Christian responded:
> I think Markus wants to be able to define multiple containers,
> not only spam/ham.
Almost. I just envision multiple containers in which to put texts. Within these containers, the choices boil down to spam and ham, for this particular container.
So the user does not just say "this is ham" or "this is spam", but he says: "this is ham for the container 'Advertisements'", "this is ham for the container 'Personal Email'", or "this is spam for the container 'Open Source Projects'"... so you could basically build a filter for incoming emails (for example) that is solely based on classification.
It would eliminate the need to sort mail into the right folders by matching mail headers: you'd just classify mail and then let the Bayes filter do its work. The filtering is then done by content assessment.
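A minimal sketch of that idea, in Python for brevity rather than the package's PHP: separate ham/spam counts are kept per container, and a text is routed to the container where it looks most like ham. The class `PooledBayes`, its methods and the `pool` parameter are hypothetical names for this example, and the scoring is plain naive Bayes with add-one smoothing, which may differ from what the proposed package actually computes.

```python
import math
from collections import defaultdict


class PooledBayes:
    """Keeps separate ham/spam token counts for every container (pool)."""

    LABELS = ("ham", "spam")

    def __init__(self):
        # counts[pool][label][token] -> occurrences seen during training
        self.counts = defaultdict(
            lambda: {label: defaultdict(int) for label in self.LABELS}
        )
        # docs[pool][label] -> number of training texts
        self.docs = defaultdict(lambda: {label: 0 for label in self.LABELS})

    def learn(self, text, label, pool="default"):
        if label not in self.LABELS:
            raise ValueError("label must be 'ham' or 'spam'")
        self.docs[pool][label] += 1
        for token in text.lower().split():
            self.counts[pool][label][token] += 1

    def ham_probability(self, text, pool="default"):
        """Return P(ham | text) using only the counts of the given pool."""
        docs = self.docs[pool]
        total_docs = docs["ham"] + docs["spam"]
        if total_docs == 0:
            return 0.5  # nothing learned for this pool yet
        counts = self.counts[pool]
        vocab = set(counts["ham"]) | set(counts["spam"])
        scores = {}
        for label in self.LABELS:
            total_tokens = sum(counts[label].values())
            # Work in log space to avoid underflow on long texts.
            score = math.log((docs[label] + 1) / (total_docs + 2))
            for token in text.lower().split():
                seen = counts[label].get(token, 0)
                # Add-one smoothing; the extra +1 reserves room for unseen tokens.
                score += math.log((seen + 1) / (total_tokens + len(vocab) + 1))
            scores[label] = score
        top = max(scores.values())
        ham = math.exp(scores["ham"] - top)
        spam = math.exp(scores["spam"] - top)
        return ham / (ham + spam)


f = PooledBayes()
f.learn("buy cheap watches now", "ham", pool="Advertisements")
f.learn("patch for the release branch", "spam", pool="Advertisements")
f.learn("patch for the release branch", "ham", pool="Open Source Projects")
f.learn("buy cheap watches now", "spam", pool="Open Source Projects")

# Route a new text to the container where it looks most like ham.
pools = ["Advertisements", "Open Source Projects"]
best = max(pools, key=lambda p: f.ham_probability("cheap watches", p))
```

With this toy training data, `best` ends up as "Advertisements", so no header-based rule is needed to file the message.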
Justin Patrin [2006-07-29 19:48 UTC] Regarding multiple containers, I also think that this would be useful. Think of not only dealing with spam/ham but phishing/ham, etc. It would be nice to be able to set up any number of categories.
Christian Weiske [2006-07-29 19:53 UTC] Do any of you know Popfile (popfile.sf.net)? It allows you to define as many buckets as you like, and then train the filter to sort this message into bucket #1, another into bucket #3, and so on. This way you can bypass your email program's filtering functionality by defining buckets for the different categories of mail you get -> personal, work, oss project #1, ...
Andreas Ahlenstorf [2006-07-29 20:57 UTC] Ok, I'll add the "multi user" capability (I need this) and the thingie with the multiple containers.
Any other suggestions, (CS) objections?