Cyrus IMAP Server: Cyrus Murder Concepts

Abstract

The Cyrus IMAP Aggregator transparently distributes IMAP and POP mailboxes across multiple servers. Unlike other systems for load balancing IMAP mailboxes, the aggregator allows users to access mailboxes on any of the IMAP servers in the system.

The software described below is now available as part of the Cyrus IMAP distribution , versions 2.1.3 and higher. Please refer to the documentation for setup and install instructions.

1.0 Overview

Scaling a service usually takes one of two paths: buy bigger and faster machines, or distribute the load across multiple machines. The first approach is obvious and (hopefully) easy, though at some point software tuning becomes necessary to take advantage of the bigger machines. However, if one of these large machines go down, then your entire system is unavailable.

The second approach has the benefit that there is no longer a single point of failure and the aggregate cost of multiple machines may be significantly lower than the cost of a single large machine. However, the system may be harder to implement as well as harder to manage.

In the IMAP space, the approach of buying a larger machine is pretty obvious. Distributing the load is a bit trickier since there is no concept of mailbox location in IMAP (excluding RFC2193 mailbox referrals, which are not widely implemented by clients). Clients traditionally assume that the server they are talking to is the server with the mailbox they are looking for.

The approaches to distributing the load among IMAP servers generally sacrifice the unified system image. For pure email, this is an acceptable compromise; however, trying to share mailboxes becomes difficult or even impossible. Specific examples can be found in Appendix A: DNS Name Load Balancing and Appendix B: IMAP Multiplexing).

We propose a new approach to overcome these problems. We call it the Cyrus IMAP Aggregator. The Cyrus aggregator takes a murder of IMAP servers and presents a server independent view to the clients. That is, all the mailboxes across all the IMAP servers are aggregated to a single image, thereby appearing to be only one IMAP server to the clients.

2.0 Architecture

The Cyrus IMAP Aggregator has three classes of servers: IMAP frontend, IMAP backend, and MUPDATE. The frontend servers act as the primary communication point between the end user clients and the back endservers. The frontends use the MUPDATE server as an authoritative source for mailbox names, locations, and permissions. The back end servers store the actual IMAP data (and keep the MUPDATE server appraised as to changes in the Mailbox list).

The Cyrus IMAP Aggregator requires version 2.1.3 or higher of the Cyrus IMAP Distribution.

2.1 Back End Servers

The backend servers serve the actual data and are fully functional standalone IMAP servers that serve a set of mailboxes. Each backend server maintains a local mailboxes database that lists what mailboxes are available on that server.

The imapd processes on a backend server can stand by themselves, so that each backend IMAP server can be used in isolation, without a MUPDATE server or any frontend servers. However, they are configured so that they won't process any mailbox operations (CREATE, DELETE, RENAME, SETACL, etc) unless the master MUPDATE server can be contacted and authorizes the transaction.

In this mode, the imapd processes update the local mailboxes database themselves. Additionally, on a CREATE they need to reserve a place with the MUPDATE server to insure that other backend servers aren't creating the same mailbox before proceeding. Once the local aspects of mailbox creation are complete, the mailbox is activated on the MUPDATE server, and is considered available to any client through the frontends.

2.2 Front End Servers

The front end servers, unlike the back end servers, are fully interchangeable with each other and the front end servers can be considered 'dataless'; any loss of a proxy results in no loss of data. The only persistent data that is needed (the mailbox list) is kept on the MUPDATE master server. This list is synchronized between the frontend and the MUPDATE master when the frontend comes up.

The list of mailboxes in the murder is maintained by the MUPDATE server. The MUPDATE protocol is described at RFC3656.

For IMAP service on a frontend, there are two main types of processes, the proxyd and the mupdate (slave mode) synchronization process. The proxyd handles the IMAP session with clients. It relies on a consistent and complete mailboxes database that reflects the state of the world. It never writes to the mailboxes database. Instead, the mailboxes database is kept in sync with the master by a slave mupdate process.

2.3 Mail Delivery

The incoming mail messages go to an lmtp proxy (either running on a frontend, a mail exchanger, or any other server). The lmtp proxy running on the front end server uses the master MUPDATE server to determine the location of the destination folder and then transfers the message to the appropriate back end server via LMTP. If the backend is not up (or otherwise fails to accept the message), then the LMTP proxy returns a failure to the connected MTA.

If a sieve script is present, the lmtp proxy server must do the processing as the end result of the processing may result in the mail message going to a different back end server than where the user's INBOX is. Note that the current implementation runs SIEVE on the backend servers, and holds the requirement that all of a user's mailboxes live on the same backend.

2.4 Clients

Clients that support RFC2193 IMAP referrals can bypass the aggregator front end. See section 3.8 for more details.

Clients are encouraged to bypass the front ends via approved mechanisms. This should result in better performance for the client and less load for the servers.

3.0 Implementation

3.1 Assumptions

Operations that change the mailbox list are (comparatively) rare. The vast majority of IMAP sessions do not manipulate the state of the mailbox list.
Read operations on the mailbox list are very frequent.
A mailbox name must be unique among all the back end servers.
The MUPDATE master server will be able to handle the load from the frontend, backend, and LMTP proxy servers. Currently, the MUPDATE master can be a bottleneck in the throughput of mailbox operations, but as the MUPDATE protocol allows for slave server to act as replicas, it is theoretically possible to reduce the load of read operations against the master to a very low level.
IMAP clients are not sensitive to somewhat loose mailbox tree consistency, and some amount of consistency can be sacrificed for speed. As is, IMAP gives no guarantees about the state of the mailbox tree from one command to the next. However, it's important to note that different IMAP sessions do communicate out of band: two sessions for the same client should see sensible results. In the Murder case, this means that the same client talking to two different frontends should see sensible results.
A single IMAP connection should see consistent results: once an operation is done, it is done, and needs to be reflected in the current session. The straightforward case that must work correctly is (provided there is no interleaved DELETE in another session):
```
     A001 CREATE INBOX.new
     A002 SELECT INBOX.new
```
Accesses to non-existant mailboxes are rare.

3.2 Authentication

The user authenticates to the frontend server via any supported SASL mechanism or via plaintext. If authentication is successful, the front end server will authenticate to the back end server using a SASL mechanism (in our case KERBEROS_V4 or GSSAPI) as a privileged user. This user is able to switch to the authorization of the actual user being proxied for and any authorization checks happen as if the user actually authenticated directly to the back end server. Note this is a native feature of many SASL mechanisms and nothing special with the aggregator.

To help protect the backends from a compromised frontends, all administrative actions (creating users, top level mailboxes, quota changes, etc) must be done directly from the client to the backend, as administrative permissions are not granted to any of the proxy servers. IMAP Referrals provide a way to accomplish this with minimal client UI changes.

3.3 Subscriptions

[LSUB, SUBSCRIBE, UNSUBSCRIBE]
The front end server directs the LSUB to the back end server that has the user's INBOX. As such, the back end server may have entries in the subscription database that do not exist on that server. The frontend server needs to process the list returned by the backend server and either remove or tag with \NoSelect the entries which are not currently active within the murder.

If the user's INBOX server is down and the LSUB fails, then the aggregator replies with NO with an appropriate error message. Clients should not assume that the user has no subscriptions (though apparently some clients do this).

3.4 Finding a Mailbox

[SETQUOTA, GETQUOTA, EXAMINE, STATUS]
The front end machine looks up the location of the mailbox, connects via IMAP to the back end server, and issues the equivalent command there.

A quota root is not allowed to span across multiple servers. Atleast, not with the semantics that it will be inclusive across the murder.

[SELECT]
To SELECT a mailbox:

proxyd: lookup foo.bar in local mailboxes database
if yes, proxyd -> back end: send SELECT
if no, proxyd -> mupdate slave -> mupdate master: send a ping along the UPDATE channel in order to ensure that we have received the latest data from the MUPDATE master.
if mailbox still doesn't exist, fail operation
if mailbox does exist, and the client supports referrals, refer the client. Otherwise continue as a proxy with a selected mailbox.

SELECT on mailboxes that do not exist are much more expensive but the assumption is that this does not frequently occur (or if it does, it is just after the mailbox has been created and the frontend hasn't seen the update yet).

3.5 Operations within a Mailbox

[APPEND, CHECK, CLOSE, EXPUNGE, SEARCH, FETCH, STORE, UID]
These commands are sent to the appropriate back end server. The aggregator does not need to modify any of these commands before sending them to the back end server.

3.6 COPY

COPY is somewhat special as it acts upon messages in the currently SELECT'd mailbox but then interacts with another mailbox.

In the case where the destination mailbox is on the same back end server as the source folder, the COPY command is issued to the back end server and the back end server takes care of the command.

If the destination folder is on a different back end server, the front end intervenes and does the COPY by FETCHing the messages from the source back end server and then APPENDs the messages to the destination server.

3.7 Operations on the Mailbox List

[CREATE, DELETE, RENAME, SETACL]
These commands are all done by the back end server using the MUPDATE server as a lock manager. Changes are then propagated to the frontend via the MUPDATE protocol.

[LIST]

LIST is handled by the front end servers; no interaction is required with the back end server as the front ends have a local database that is never more than a few seconds out of date.

[CREATE]
CREATE creates the mailbox on the same back end server as the parent mailbox. If the parent exists but exists on multiple back end servers, if there is no parent folder, a tagged NO response is returned.

When this happens, the administrator has two choices. He may connect directly to a back end server and issue the CREATE on that server. Alternatively, a second argument can be given to CREATE after the mailbox name. This argument specifies the specific host name on which the mailbox is to be created.

The following operations occur for CREATE on the front end:

proxyd: verify that mailbox doesn't exist in MUPDATE mailbox list.
proxyd: decide where to send CREATE (the server of the parent mailbox, as top level mailboxes cannot be created by the proxies).
proxyd -> back end: duplicate CREATE command and verifies that the CREATE does not create an inconsistency in the mailbox list (i.e. the folder name is still unique).

The following operations occur for CREATE on the back end:

imapd: verify ACLs to best of ability (CRASH: aborted)
imapd: start mailboxes transaction (CRASH: aborted)
imapd may have to open an MUPDATE connection here if one doesn't already exist
imapd -> MUPDATE: set foo.bar reserved (CRASH: MUPDATE externally inconsistent)
imapd: create foo.bar in spool disk (CRASH: MUPDATE externally inconsistent, back end externally inconsistent, this can be resolved when the backend comes back up by clearing the state from both MUPDATE and the backend)
imapd: add foo.bar to mailboxes dataset (CRASH: ditto)
imapd: commit transaction (CRASH: ditto, but the recovery can activate the mailbox in mupdate instead)
imapd -> MUPDATE: set foo.bar active (CRASH: committed)

Failure modes: Above, all back end inconsistencies result in the next CREATE attempt failing. The earlier MUPDATE inconsistency results in any attempts to CREATE the mailbox on another back end failing. The latter one makes the mailbox unreachable and un-createable. Though, this is safer than potentially having the mailbox appear in two places when the failed backend comes back up.

[RENAME]
RENAME is only interesting in the cross-server case. In this case it issues a (non-standard) XFER command to the backend that currently hosts the mailbox, which performs a binary transfer of the mailbox (and in the case of a user's inbox, their associated seen state and subscription list) to the new backend. During this time the mailbox is marked as RESERVED in mupdate, and when it is complete it is activated on the new server in MUPDATE. The deactivation prevents clients from accessing the mailbox, and causes mail delivery to temporarily fail.

3.8 IMAP Referrals

If clients support IMAP Mailbox Referrals [MBOXREF], the client can improve performance and reduce the load on the aggregator by using the IMAP referrals that are sent to it and going to the appropriate back end servers.

The front end servers will advertise the MAILBOX-REFERRALS capability. The back end servers will also advertise this capability (but only because they need to refer clients while a mailbox is moving between servers).

Since there is no way for the server to know if a client supports referrals, the Cyrus IMAP Aggregator will assume the clients do not support referrals unless the client issues a RLSUB or a RLIST command.

Once a client issues one of those commands, then the aggregator will issue referrals for any command that is safe for the client to contact the IMAP server directly. Most commands that perform operations within a mailbox (cf Section 3.3) fall into this category. Some commands will not be possible without a referrals-capable client (such as most commands done as administrator).

RFC2193 indicates that the client does not stick the referred server. As such the SELECT will get issued to the front end server and not the referred server. Additionally, CREATE, RENAME, and DELETE get sent to the frontend which will proxy the command to the correct back end server.

3.9 POP

POP is easy given that POP only allows access to the user's INBOX. When it comes to POP, the IMAP Aggregator acts just like a multiplexor. The user authenticates to front end server. The front end determines where the user's INBOX is located and does a direct pass through of the POP commands from the client to the appropriate back end server.

3.10 MUPDATE

The mupdate (slave) process (one per front end) holds open an MUPDATE connection and listens for updates from the MUPDATE master server (as backends inform it of updates). The slave makes these modifications on the local copy of the mailboxes database.

4.0 Analysis

??? Add timing info? Random load testing?

4.1 Mailboxes Database

A benefit of having the mailbox information on the front end is that LIST is very cheap. The front end servers can process this request without having to contact each back end server.

We're also assuming that LIST is a much more frequent operation than any of the mailbox operations and thus should be the case to optimize. (In addition to the fact that any operation that needs to be forwarded to a backend needs to know which backend it is being forwarded to, so lookups in the mailbox list are also quite frequent).

4.2 Failure Mode Analysis

What happens when a back end server comes up? Resynchronization with the MUPDATE server. Any mailboxes that exist locally but are not in MUPDATE are pushed to MUPDATE. Any mailboxes that exist locally but are in MUPDATE as living on a different server are deleted. Any mailboxes that do not exist locally but exist in MUPDATE as living on this server are removed from MUPDATE.

What happens when a front end server comes up? The only thing that needs to happen is for the front end to connect to the MUPDATE server, issue an UPDATE command, and resynchronize its local database copy with the copy on the master server.

Where's the true mailboxes file? The MUPDATE master contains authoritative information as to the location of any mailbox (in the case of a conflict), but the backends are authoritative as to which mailboxes actually exist.

4.3 Summary of Benefits

Availability - By allowing multiple front-ends, failures of the front-end only result in a reduction of capacity. Users currently connected still lose their session but can just reconnect to get back online.
The failure of the back-ends will result in the loss of availability. However, given that the data is distributed among multiple servers, the failure of a single server does not result the entire system being down. Our experience with AFS was that this type of partitioned failure was acceptable (if not ideal).
The failure of the mupdate master will cause write operations to the mailbox list to fail, but accesses to mailboxes themselves (as well as read operations to the mailbox list) will continue uninterrupted.
At this point, there may be some ideas but no plans for providing a high availability solution which would allow for back-end servers or the MUPDATE server to fail with no availability impact.
Load scalability - We have not done any specific benchmarks to show that this system actually performs better. However, it is clear that it scales to a larger number of users than a single server architecture would. Though, based on the fact that we have not had any performance problems similar to when we were running a single machine, and we are handling about 20% more concurrent users, things have been a rousing success.
Live statistics can be found under the "cyrus overview" section at: http://graphs.andrew.cmu.edu.
Management benefits - As with AFS, administrators have the flexibility of placement of data on the servers, "live" move of data between servers,
User benefits - The user only needs to know a single server name for configuration. The same name can be handed out to all users.
Users don't lose the ability to share their folders and those folders are visible to other users. A user's INBOX folder hierarchy can also exist across multiple machines.

5.0 Futures

It would be nice to be able to replicate the messages in a mailbox among multiple servers and not just do partitioning for availability.
We are also evaluating using the aggregator to be able to provide mailboxes to the user with a different backup policy or even different "quality of service." For example, we are looking to give users a larger quota than default but not back up the servers where these mailboxes exist.
There is possibility that LDAP could be used instead of MUPDATE. However at this time the replication capabilities of LDAP are insufficient for the needs of the Aggregator
It would be nice if quotaroots had some better semantics with respect to the murder (either make them first-class entities, or have them apply across servers).

Appendix A: DNS Name Load Balancing

One method of load balancing is to use DNS to spread your users to multiple machines.

One method is to create a DNS CNAME for each letter of the alphabet. Then, each user sets their IMAP server to be the first letter of their userid. For example, the userid 'tom' would set his IMAP server to be T.IMAP.ANDREW.CMU.EDU and T.IMAP.ANDREW.CMU.EDU would resolve to an actual mail server.

Given that this does not provide a good distribution, another option is to create a DNS CNAME for each user. Using the previous example, the user 'tom' would set his IMAP server to be TOM.IMAP.ANDREW.CMU.EDU which then points to an actual mail server.

The good part is that you don't have all your users on one machine and growth can be accommodated without any user reconfiguration.

The drawback is with shared folders. The mail client now must support multiple servers and users must potentially configure a server for each user with a shared folder he wishes to view. Also, the user's INBOX hierarchy must also reside on a single machine.

Appendix B: IMAP Multiplexing

Another method of spreading out the load is to use IMAP multiplexing. This is very similar to the IMAP Aggregator in that there are frontend and backend servers. The frontend servers do the lookup and then forward the request to the appropriate backend server.

The multiplexor looks at the user who has authenticated. Once the user has authenticated, the frontend does a lookup for the backend server and then connects the session to a single backend server. This provides the flexibility of balancing the users among any arbitrary server but it creates a problem where a user can not share a folder with a user on a different back end server.

Multiplexors references:

BLUETAIL Mail Robustifier - Note this link broken due to the acquisition of Bluetail by Alteon Web Systems.
Netscape Messaging Multiplexor
Paul Fleming's IMAP Proxy
Perdition IMAP Proxy
Mirapoint Message Director - This is a hardware solution that also does content filtering.

Appendix C: Definitions

IMAP connection: A single IMAP TCP/IP session with a single IMAP server is a "connection".
client: A client is a process on a remote computer that communicates with the set of servers distributing mail data, be they ACAP, IMAP, or LDAP servers. A client opens one or more connections to various servers.
mailbox tree: The collection of all mailboxes at a given site in a namespace is called the mailbox tree. Generally, the user Bovik's personal data is found in user.bovik.
mailboxes database: A local database containing a list of mailboxes known to a particular server. (In old Cyrus terms, this maps to /var/imap/mailboxes.)
mailbox dataset: The store of mailbox information on the ACAP server is the "mailbox dataset".
mailbox operation: The following IMAP commands are "mailbox operations": CREATE, RENAME, DELETE, and SETACL.
MTA: The mail transport agent (e.g. sendmail, postfix).
Murder of IMAP servers: A grouping of IMAP servers. It sounded cool for crows so we decided to use it for IMAP servers as well.
quota operations: The quota IMAP commands (GETQUOTA, GETQUOTAROOT, and SETQUOTA) operate on mailbox trees. In future versions of Cyrus, it is expected that a quotaroot will be a subset of a mailbox tree that resides on one partition on one server. For rational, see section xxx.

Appendix D: ACAP

It was originally intended to use the general purpose protocol ACAP instead of the (task specific) MUPDATE. We expected the following commands to query the ACAP server:

[LSUB, SUBSCRIBE, UNSUBSCRIBE, SETQUOTA, GETQUOTA, EXAMINE, STATUS]
All these commands could be handled by the mailboxes dataset.[MBOXDSET] This would mainly be a client enhancement. For example, a client could use this to quickly check all the new messages in every folder a user is subscribed to and would also be able to talk directly to the back-end where the data is and thereby skipping the front-ends.

Appendix E: Naming

An alternate name was suggested (unfortunately too late) by Philip Lewis. He writes:

i would have called it an atlas.... or iatlas (pronounced "yatlas")
it is a collection of (i)maps.

References

[ACAP] Newman, Myers, "ACAP -- Application Configuration Access Protocol", RFC 2244, November 1997.
[AFS] J. Howard, "An Overview of the Andrew File System", Usenix, Feb 1988.
[IMAP] M. Crispin, "Internet Message Access Protocol", RFC 2060, December 1996.
[LMTP] J. Myers, "Local Mail Transfer Protocol", RFC 2033, October 1996.
[MBOXDSET] L. Greenfield, "ACAP Mailbox Dataset Class", November 1998, available from the author, leg+@andrew.cmu.edu.
[MBOXREF] M. Gahrns, "IMAP4 Mailbox Referrals", RFC 2193, September 1997.
[SIEVE] T. Showalter, "Sieve: A Mail Filtering Language", RFC 3028, January 2001.

Document Change Log

3.1 - cleanups
3.0 - actual implementation. MUPDATE instead of ACAP, mailbox moves now possible (4/4/02)
2.0 - closer to implementation. ACAP is now required. Ditch MOIS
1.0 - Initial revision.