Mail::SpamAssassin::Plugin::TextCat - TextCat language guesser
loadplugin Mail::SpamAssassin::Plugin::TextCat
This plugin will try to guess the language used in the message text.
You can then specify which languages are considered okay for incoming
mail; if the guessed language is not among them, the rule
UNWANTED_LANGUAGE_BODY is triggered.
It always adds the results to an "X-Language" name-value pair in the message metadata data structure, which may be useful as Bayes tokens. The results can also be added to marked-up messages using "add_header" with the _LANGUAGES_ tag. See Mail::SpamAssassin::Conf for details.
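As a minimal sketch of the add_header approach, the line below would add the guessed languages to every processed message. The header name "Languages" is an arbitrary choice for illustration; SpamAssassin prefixes it with "X-Spam-", so the result would appear as X-Spam-Languages.

```
# Illustrative only: the header name "Languages" is an assumption, not a required value.
add_header all Languages _LANGUAGES_
```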
Note: the language cannot always be recognized with sufficient
confidence. In that case, UNWANTED_LANGUAGE_BODY will not trigger.
This option is used to specify which languages are considered okay for incoming mail. SpamAssassin will try to detect the language used in the message text.
Note that the language cannot always be recognized with sufficient confidence. In that case, no points will be assigned.
The rule UNWANTED_LANGUAGE_BODY is triggered based on how this is set.
In your configuration, you must use the two or three letter language
specifier in lowercase, not the English name for the language. You may
also specify "all" if a desired language is not listed, or if you want to
allow any language. The default setting is "all".
Examples:
    ok_languages all         (allow all languages)
    ok_languages en          (only allow English)
    ok_languages en ja zh    (allow English, Japanese, and Chinese)
Note: if there are multiple ok_languages lines, only the last one is used.
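A minimal sketch of a complete setup, assuming mail is expected only in English and German. The score value is an arbitrary illustration, not a recommended or default figure.

```
loadplugin Mail::SpamAssassin::Plugin::TextCat

# Accept English and German; any other confidently guessed language
# triggers UNWANTED_LANGUAGE_BODY.
ok_languages en de

# Hypothetical score chosen for this example only.
score UNWANTED_LANGUAGE_BODY 2.0
```

Because only the last ok_languages line is used, all accepted languages should be listed on that single line.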
Select the languages to allow from the list below:
This option is used to specify which languages will not be considered
when trying to guess the language. For performance reasons, supported
languages that have fewer than about 5 million speakers are disabled by
default. Note that listing a language in ok_languages automatically
enables it for that user.
The default setting is:
That list is Bosnian, Welsh, Esperanto, Estonian, Basque, Frisian, Irish Gaelic, Scottish Gaelic, Icelandic, Latin, Lithuanian, Latvian, Rhaeto-Romance, Sanskrit, Scots, Slovenian, and Yiddish.
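As a sketch of how this might be overridden, the line below re-enables Welsh and Irish Gaelic by leaving them out of the list while keeping the other languages named above inactive. The codes are inferred from those language names using their usual ISO 639 identifiers, so treat them as assumptions and check them against the supported language list.

```
# Assumed codes; cy (Welsh) and ga (Irish Gaelic) are deliberately omitted
# so that they are considered when guessing.
inactive_languages bs eo et eu fy gd is la lt lv rm sa sco sl yi
```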
The maximum number of languages before the classification is considered unknown.
If the number of ngrams is lower than this number then they will be removed. This can be used to speed up the program for longer inputs. For shorter inputs, this should be set to 0.
The maximum number of ngrams that should be compared with each of the language models (note that each of those models is used completely).
Include any language that scores at least textcat_acceptable_score in the
returned list of languages.
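The four tuning settings described above could be set together as in the sketch below. Only textcat_acceptable_score is named in this text; the other three option names are assumptions taken to follow the same textcat_ prefix, and every value is illustrative rather than a documented default.

```
# Assumed option names (except textcat_acceptable_score) and illustrative values.
textcat_max_languages 3
textcat_optimal_ngrams 0
textcat_max_ngrams 400
textcat_acceptable_score 1.02
```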