Postfix Queue Scheduler


Disclaimer

Many of the transport-specific configuration parameters discussed in this document will not show up in "postconf" command output before Postfix version 2.9. This limitation applies to many parameters whose name is a combination of a master.cf service name such as "relay" and a built-in suffix such as "_destination_concurrency_limit".

Overview

The queue manager is by far the most complex part of the Postfix mail system. It schedules delivery of new mail, retries failed deliveries at specific times, and removes mail from the queue after the last delivery attempt. There are two major classes of mechanisms that control the operation of the queue manager.

Topics covered by this document:

Concurrency scheduling

The following sections document the Postfix 2.5 concurrency scheduler, after a discussion of the limitations of the earlier concurrency scheduler. This is followed by results of medium-concurrency experiments, and a discussion of trade-offs between performance and robustness.

The material is organized as follows:

Drawbacks of the existing concurrency scheduler

From the start, Postfix has used a simple but robust algorithm where the per-destination delivery concurrency is decremented by 1 after delivery failed due to connection or handshake failure, and incremented by 1 otherwise. Of course the concurrency is never allowed to exceed the maximum per-destination concurrency limit. And when a destination's concurrency level drops to zero, the destination is declared "dead" and delivery is suspended.

Drawbacks of +/-1 concurrency feedback per delivery are:

(*) A pseudo-cohort is a number of delivery requests equal to a destination's delivery concurrency.

The revised concurrency scheduler has a highly modular structure. It uses separate mechanisms for per-destination concurrency control and for "dead destination" detection. The concurrency control in turn is built from two separate mechanisms: it supports less-than-1 feedback per delivery to allow for more gradual concurrency adjustments, and it uses feedback hysteresis to suppress concurrency oscillations. And instead of waiting for delivery concurrency to throttle down to zero, a destination is declared "dead" after a configurable number of pseudo-cohorts reports connection or handshake failure.

Summary of the Postfix 2.5 concurrency feedback algorithm

We want to increment a destination's delivery concurrency when some (not necessarily consecutive) number of deliveries complete without connection or handshake failure. This is implemented with positive feedback g(N) where N is the destination's delivery concurrency. With g(N)=1 feedback per delivery, concurrency increases by 1 after each positive feedback event; this gives us the old scheduler's exponential growth in time. With g(N)=1/N feedback per delivery, concurrency increases by 1 after an entire pseudo-cohort N of positive feedback reports; this gives us linear growth in time. Less-than-1 feedback per delivery and integer truncation naturally give us hysteresis, so that transitions to larger concurrency happen every 1/g(N) positive feedback events.

We want to decrement a destination's delivery concurrency when some (not necessarily consecutive) number of deliveries complete after connection or handshake failure. This is implemented with negative feedback f(N) where N is the destination's delivery concurrency. With f(N)=1 feedback per delivery, concurrency decreases by 1 after each negative feedback event; this gives us the old scheduler's behavior where concurrency is throttled down dramatically after a single pseudo-cohort failure. With f(N)=1/N feedback per delivery, concurrency backs off more gently. Again, less-than-1 feedback per delivery and integer truncation naturally give us hysteresis, so that transitions to lower concurrency happen every 1/f(N) negative feedback events.

However, with negative feedback we introduce a subtle twist. We "reverse" the negative hysteresis cycle so that the transition to lower concurrency happens at the beginning of a sequence of 1/f(N) negative feedback events. Otherwise, a correction for overload would be made too late. This makes the choice of f(N) relatively unimportant, as borne out by measurements later in this document.

In summary, the main ingredients for the Postfix 2.5 concurrency feedback algorithm are a) the option of less-than-1 positive feedback per delivery to avoid overwhelming servers, b) the option of less-than-1 negative feedback per delivery to avoid giving up too fast, c) feedback hysteresis to avoid rapid oscillation, and d) a "reverse" hysteresis cycle for negative feedback, so that it can correct for overload quickly.

Summary of the Postfix 2.5 "dead destination" detection algorithm

We want to suspend deliveries to a specific destination after some number of deliveries suffers connection or handshake failure. The old scheduler declares a destination "dead" when negative (-1) feedback throttles the delivery concurrency down to zero. With less-than-1 feedback per delivery, this throttling down would obviously take too long. We therefore have to separate "dead destination" detection from concurrency feedback. This is implemented by introducing the concept of pseudo-cohort failure. The Postfix 2.5 concurrency scheduler declares a destination "dead" after a configurable number of pseudo-cohorts suffers from connection or handshake failures. The old scheduler corresponds to the special case where the pseudo-cohort failure limit is equal to 1.

Pseudocode for the Postfix 2.5 concurrency scheduler

The pseudo code shows how the ideas behind new concurrency scheduler are implemented as of November 2007. The actual code can be found in the module qmgr/qmgr_queue.c.

Types:
        Each destination has one set of the following variables
        int concurrency
        double success
        double failure
        double fail_cohorts

Feedback functions:
        N is concurrency; x, y are arbitrary numbers in [0..1] inclusive
        positive feedback: g(N) = x/N | x/sqrt(N) | x
        negative feedback: f(N) = y/N | y/sqrt(N) | y

Initialization:
        concurrency = initial_concurrency
        success = 0
        failure = 0
        fail_cohorts = 0

After success:
        fail_cohorts = 0
        Be prepared for feedback > hysteresis, or rounding error
        success += g(concurrency)
        while (success >= 1)            Hysteresis 1
            concurrency += 1            Hysteresis 1
            failure = 0
            success -= 1                Hysteresis 1
        Be prepared for overshoot
        if (concurrency > concurrency limit)
            concurrency = concurrency limit

Safety:
        Don't apply positive feedback unless
            concurrency < busy_refcount + init_dest_concurrency
        otherwise negative feedback effect could be delayed

After failure:
        if (concurrency > 0)
            fail_cohorts += 1.0 / concurrency
            if (fail_cohorts > cohort_failure_limit)
                concurrency = 0
        if (concurrency > 0)
            Be prepared for feedback > hysteresis, rounding errors
            failure -= f(concurrency)
            while (failure < 0)
                concurrency -= 1        Hysteresis 1
                failure += 1            Hysteresis 1
                success = 0
            Be prepared for overshoot
            if (concurrency < 1)
                concurrency = 1

Results for delivery to concurrency-limited servers

Discussions about the concurrency scheduler redesign started early 2004, when the primary goal was to find alternatives that did not exhibit exponential growth or rapid concurrency throttling. No code was implemented until late 2007, when the primary concern had shifted towards better handling of server concurrency limits. For this reason we measure how well the new scheduler does this job. The table below compares mail delivery performance of the old +/-1 feedback per delivery with several less-than-1 feedback functions, for different limited-concurrency server scenarios. Measurements were done with a FreeBSD 6.2 client and with FreeBSD 6.2 and various Linux servers.

Server configuration:

Client configuration:

Impact of the 30s SMTP connect timeout

The first results are for a FreeBSD 6.2 server, where our artificially low listen(2) backlog results in a very short kernel queue for established connections. The table shows that all deferred deliveries failed due to a 30s connection timeout, and none failed due to a server greeting timeout. This measurement simulates what happens when the server's connection queue is completely full under load, and the TCP engine drops new connections.

client
limit
server
limit
feedback
style
connection
caching
percentage
deferred
client concurrency
average/stddev
timed-out in
connect/greeting

20 5 1/N no 9.9 19.4 0.49 198 -
20 5 1/N yes 10.3 19.4 0.49 206 -
20 5 1/sqrt(N) no 10.4 19.6 0.59 208 -
20 5 1/sqrt(N) yes 10.6 19.6 0.61 212 -
20 5 1 no 10.1 19.5 1.29 202 -
20 5 1 yes 10.8 19.3 1.57 216 -

A busy server with a completely full connection queue. N is the client delivery concurrency. Failed deliveries time out after 30s without completing the TCP handshake. See text for a discussion of results.

Impact of the 300s SMTP greeting timeout

The next table shows results for a Fedora Core 8 server (results for RedHat 7.3 are identical). In this case, the artificially small listen(2) backlog argument does not impact our measurement. The table shows that practically all deferred deliveries fail after the 300s SMTP greeting timeout. As these timeouts were 10x longer than with the first measurement, we increased the recipient count (and thus the running time) by a factor of 10 to keep the results comparable. The deferred mail percentages are a factor 10 lower than with the first measurement, because the 1s per-recipient delay was 1/300th of the greeting timeout instead of 1/30th of the connection timeout.

client
limit
server
limit
feedback
style
connection
caching
percentage
deferred
client concurrency
average/stddev
timed-out in
connect/greeting

20 5 1/N no 1.16 19.8 0.37 - 230
20 5 1/N yes 1.36 19.8 0.36 - 272
20 5 1/sqrt(N) no 1.21 19.9 0.23 4 238
20 5 1/sqrt(N) yes 1.36 20.0 0.23 - 272
20 5 1 no 1.18 20.0 0.16 - 236
20 5 1 yes 1.39 20.0 0.16 - 278

A busy server with a non-full connection queue. N is the client delivery concurrency. Failed deliveries complete at the TCP level, but time out after 300s while waiting for the SMTP greeting. See text for a discussion of results.

Impact of active server concurrency limiter

The final concurrency-limited result shows what happens when SMTP connections don't time out, but are rejected immediately with the Postfix server's smtpd_client_connection_count_limit feature (the server replies with a 421 status and disconnects immediately). Similar results can be expected with concurrency limiting features built into other MTAs or firewalls. For this measurement we specified a server concurrency limit and a client initial destination concurrency of 5, and a server process limit of 10; all other conditions were the same as with the first measurement. The same result would be obtained with a FreeBSD or Linux server, because the "pushing back" is done entirely by the receiving side.

client
limit
server
limit
feedback
style
connection
caching
percentage
deferred
client concurrency
average/stddev
theoretical
defer rate

20 5 1/N no 16.5 5.17 0.38 1/6
20 5 1/N yes 16.5 5.17 0.38 1/6
20 5 1/sqrt(N) no 24.5 5.28 0.45 1/4
20 5 1/sqrt(N) yes 24.3 5.28 0.46 1/4
20 5 1 no 49.7 5.63 0.67 1/2
20 5 1 yes 49.7 5.68 0.70 1/2

A server with active per-client concurrency limiter that replies with 421 and disconnects. N is the client delivery concurrency. The theoretical defer rate is 1/(1+roundup(1/feedback)). This is always 1/2 with the fixed +/-1 feedback per delivery; with the concurrency-dependent feedback variants, the defer rate decreases with increasing concurrency. See text for a discussion of results.

Discussion of concurrency-limited server results

All results in the previous sections are based on the first delivery runs only; they do not include any second etc. delivery attempts. It's also worth noting that the measurements look at steady-state behavior only. They don't show what happens when the client starts sending at a much higher or lower concurrency.

The first two examples show that the effect of feedback is negligible when concurrency is limited due to congestion. This is because the initial concurrency is already at the client's concurrency maximum, and because there is 10-100 times more positive than negative feedback. Under these conditions, it is no surprise that the contribution from SMTP connection caching is also negligible.

In the last example, the old +/-1 feedback per delivery will defer 50% of the mail when confronted with an active (anvil-style) server concurrency limit, where the server hangs up immediately with a 421 status (a TCP-level RST would have the same result). Less aggressive feedback mechanisms fare better than more aggressive ones. Concurrency-dependent feedback fares even better at higher concurrencies than shown here, but has limitations as discussed in the next section.

Limitations of less-than-1 per delivery feedback

Less-than-1 feedback is of interest primarily when sending large amounts of mail to destinations with active concurrency limiters (servers that reply with 421, or firewalls that send RST). When sending small amounts of mail per destination, less-than-1 per-delivery feedback won't have a noticeable effect on the per-destination concurrency, because the number of deliveries to the same destination is too small. You might just as well use zero per-delivery feedback and stay with the initial per-destination concurrency. And when mail deliveries fail due to congestion instead of active concurrency limiters, the measurements above show that per-delivery feedback has no effect. With large amounts of mail you might just as well use zero per-delivery feedback and start with the maximal per-destination concurrency.

The scheduler with less-than-1 concurrency feedback per delivery solves a problem with servers that have active concurrency limiters. This works only because feedback is handled in a peculiar manner: positive feedback will increment the concurrency by 1 at the end of a sequence of events of length 1/feedback, while negative feedback will decrement concurrency by 1 at the beginning of such a sequence. This is how Postfix adjusts quickly for overshoot without causing lots of mail to be deferred. Without this difference in feedback treatment, less-than-1 feedback per delivery would defer 50% of the mail, and would be no better in this respect than the old +/-1 feedback per delivery.

Unfortunately, the same feature that corrects quickly for concurrency overshoot also makes the scheduler more sensitive for noisy negative feedback. The reason is that one lonely negative feedback event has the same effect as a complete sequence of length 1/feedback: in both cases delivery concurrency is dropped by 1 immediately. As a worst-case scenario, consider multiple servers behind a load balancer on a single IP address, and no backup MX address. When 1 out of K servers fails to complete the SMTP handshake or drops the connection, a scheduler with 1/N (N = concurrency) feedback stops increasing its concurrency once it reaches a concurrency level of about K, even though the good servers behind the load balancer are perfectly capable of handling more traffic.

This noise problem gets worse as the amount of positive feedback per delivery gets smaller. A compromise is to use fixed less-than-1 positive feedback values instead of concurrency-dependent positive feedback. For example, to tolerate 1 of 4 bad servers in the above load balancer scenario, use positive feedback of 1/4 per "good" delivery (no connect or handshake error), and use an equal or smaller amount of negative feedback per "bad" delivery. The downside of using concurrency-independent feedback is that some of the old +/-1 feedback problems will return at large concurrencies. Sites that must deliver mail at non-trivial per-destination concurrencies will require special configuration.

Concurrency configuration parameters

The Postfix 2.5 concurrency scheduler is controlled with the following configuration parameters, where "transport_foo" provides a transport-specific parameter override. All parameter default settings are compatible with earlier Postfix versions.

Parameter name Postfix version Description

initial_destination_concurrency
transport_initial_destination_concurrency
all
2.5
Initial per-destination delivery concurrency
default_destination_concurrency_limit
transport_destination_concurrency_limit
all
all
Maximum per-destination delivery concurrency
default_destination_concurrency_positive_feedback
transport_destination_concurrency_positive_feedback
2.5
2.5
Per-destination positive feedback amount, per delivery that does not fail with connection or handshake failure
default_destination_concurrency_negative_feedback
transport_destination_concurrency_negative_feedback
2.5
2.5
Per-destination negative feedback amount, per delivery that fails with connection or handshake failure
default_destination_concurrency_failed_cohort_limit
transport_destination_concurrency_failed_cohort_limit
2.5
2.5
Number of failed pseudo-cohorts after which a destination is declared "dead" and delivery is suspended
destination_concurrency_feedback_debug 2.5 Enable verbose logging of concurrency scheduler activity

Preemptive scheduling

The following sections describe the new queue manager and its preemptive scheduler algorithm. Note that the document was originally written to describe the changes between the new queue manager (in this text referred to as nqmgr, the name it was known by before it became the default queue manager) and the old queue manager (referred to as oqmgr). This is why it refers to oqmgr every so often.

This document is divided into sections as follows:

The structures used by nqmgr

Let's start by recapitulating the structures and terms used when referring to the queue manager and how it operates. Many of these are partially described elsewhere, but it is nice to have a coherent overview in one place:

The first four structures are common to both nqmgr and oqmgr, the latter two were introduced by nqmgr.

These terms are used extensively in the text below, feel free to look up the description above anytime you'll feel you have lost a sense what is what.

What happens when nqmgr picks up the message

Whenever nqmgr moves a queue file into the "active" queue, the following happens: It reads all necessary information from the queue file as oqmgr does, and also reads as many recipients as possible - more on that later, for now let's just pretend it always reads all recipients.

Then it resolves the recipients as oqmgr does, which means obtaining (address, nexthop, transport) triple for each recipient. For each triple, it finds the transport; if it does not exist yet, it instantiates it (unless it's dead). Within the transport, it finds the destination queue for the given nexthop; if it does not exist yet, it instantiates it (unless it's dead). The triple is then bound to given destination queue. This happens in qmgr_resolve() and is basically the same as in oqmgr.

Then for each triple which was bound to some queue (and thus transport), the program finds the job which represents the message within that transport's context; if it does not exist yet, it instantiates it. Within the job, it finds the peer which represents the bound destination queue within this jobs context; if it does not exist yet, it instantiates it. Finally, it stores the address from the resolved triple to the recipient entry which is appended to both the queue entry list and the peer entry list. The addresses for the same nexthop are batched in the entries up to the transport_destination_recipient_limit for that transport. This happens in qmgr_message_assign(), and apart from that it operates with job and peer structures, it is basically the same as in oqmgr.

When the job is instantiated, it is enqueued on the transport's job list based on the time its message was picked up by nqmgr. For first batch of recipients this means it is appended to the end of the job list, but the ordering of the job list by the enqueue time is important as we will see shortly.

[Now you should have a pretty good idea what the state of the nqmgr is after a couple of messages were picked up, and what the relation is between all those job, peer, queue and entry structures.]

How the entry selection works

Having prepared all those above mentioned structures, the task of the nqmgr's scheduler is to choose the recipient entries one at a time and pass them to the delivery agent for corresponding transport. Now how does this work?

The first approximation of the new scheduling algorithm is like this:

foreach transport (round-robin-by-transport)
do
    if transport busy continue
    if transport process limit reached continue
    foreach transport's job (in the order of the transport's job list)
    do
        foreach job's peer (round-robin-by-destination)
             if peer->queue->concurrency < peer->queue->window
                 return next peer entry.
        done
    done
done

Now what is the "order of the transport's job list"? As we know already, the job list is by default kept in the order the message was picked up by the nqmgr. So by default we get the top-level round-robin transport, and within each transport we get the FIFO message delivery. The round-robin of the peers by the destination is perhaps of little importance in most real-life cases (unless the transport_destination_recipient_limit is reached, in one job there is only one peer structure for each destination), but theoretically it makes sure that even within single jobs, destinations are treated fairly.

[By now you should have a feeling you really know how the scheduler works, except for the preemption, under ideal conditions - that is, no recipient resource limits and no destination concurrency problems.]

How the preemption works

As you might perhaps expect by now, the transport's job list does not remain sorted by the job's message enqueue time all the time. The most cool thing about nqmgr is not the simple FIFO delivery, but that it is able to slip mail with little recipients past the mailing-list bulk mail. This is what the job preemption is about - shuffling the jobs on the transport's job list to get the best message delivery rates. Now how is it achieved?

First I have to tell you that there are in fact two job lists in each transport. One is the scheduler's job list, which the scheduler is free to play with, while the other one keeps the jobs always listed in the order of the enqueue time and is used for recipient pool management we will discuss later. For now, we will deal with the scheduler's job list only.

So, we have the job list, which is first ordered by the time the jobs' messages were enqueued, oldest messages first, the most recently picked one at the end. For now, let's assume that there are no destination concurrency problems. Without preemption, we pick some entry of the first (oldest) job on the queue, assign it to delivery agent, pick another one from the same job, assign it again, and so on, until all the entries are used and the job is delivered. We would then move onto the next job and so on and on. Now how do we manage to sneak in some entries from the recently added jobs when the first job on the job list belongs to a message going to the mailing-list and has thousands of recipient entries?

The nqmgr's answer is that we can artificially "inflate" the delivery time of that first job by some constant for free - it is basically the same trick you might remember as "accumulation of potential" from the amortized complexity lessons. For example, instead of delivering the entries of the first job on the job list every time a delivery agent becomes available, we can do it only every second time. If you view the moments the delivery agent becomes available on a timeline as "delivery slots", then instead of using every delivery slot for the first job, we can use only every other slot, and still the overall delivery efficiency of the first job remains the same. So the delivery 11112222 becomes 1.1.1.1.2.2.2.2 (1 and 2 are the imaginary job numbers, . denotes the free slot). Now what do we do with free slots?

As you might have guessed, we will use them for sneaking the mail with little recipients in. For example, if we have one four-recipient mail followed by four one recipients mail, the delivery sequence (that is, the sequence in which the jobs are assigned to the delivery slots) might look like this: 12131415. Hmm, fine for sneaking in the single recipient mail, but how do we sneak in the mail with more than one recipient? Say if we have one four-recipient mail followed by two two-recipient mails?

The simple answer would be to use delivery sequence 12121313. But the problem is that this does not scale well. Imagine you have mail with a thousand recipients followed by mail with a hundred recipients. It is tempting to suggest the delivery sequence like 121212...., but alas! Imagine there arrives another mail with say ten recipients. But there are no free slots anymore, so it can't slip by, not even if it had only one recipient. It will be stuck until the hundred-recipient mail is delivered, which really sucks.

So, it becomes obvious that while inflating the message to get free slots is a great idea, one has to be really careful of how the free slots are assigned, otherwise one might corner himself. So, how does nqmgr really use the free slots?

The key idea is that one does not have to generate the free slots in a uniform way. The delivery sequence 111...1 is no worse than 1.1.1.1, in fact, it is even better as some entries are in the first case selected earlier than in the second case, and none is selected later! So it is possible to first "accumulate" the free delivery slots and then use them all at once. It is even possible to accumulate some, then use them, then accumulate some more and use them again, as in 11..1.1 .

Let's get back to the one hundred recipient example. We now know that we could first accumulate one hundred free slots, and only after then to preempt the first job and sneak the one hundred recipient mail in. Applying the algorithm recursively, we see the hundred recipient job can accumulate ten free delivery slots, and then we could preempt it and sneak in the ten-recipient mail... Wait wait wait! Could we? Aren't we overinflating the original one thousand recipient mail?

Well, despite the fact that it looks so at the first glance, another trick will allow us to answer "no, we are not!". If we had said that we will inflate the delivery time twice at maximum, and then we consider every other slot as a free slot, then we would overinflate in case of the recursive preemption. BUT! The trick is that if we use only every n-th slot as a free slot for n>2, there is always some worst inflation factor which we can guarantee not to be breached, even if we apply the algorithm recursively. To be precise, if for every k>1 normally used slots we accumulate one free delivery slot, than the inflation factor is not worse than k/(k-1) no matter how many recursive preemptions happen. And it's not worse than (k+1)/k if only non-recursive preemption happens. Now, having got through the theory and the related math, let's see how nqmgr implements this.

Each job has so called "available delivery slot" counter. Each transport has a transport_delivery_slot_cost parameter, which defaults to default_delivery_slot_cost parameter which is set to 5 by default. This is the k from the paragraph above. Each time k entries of the job are selected for delivery, this counter is incremented by one. Once there are some slots accumulated, a job which requires no more than that number of slots to be fully delivered can preempt this job.

[Well, the truth is, the counter is incremented every time an entry is selected and it is divided by k when it is used. But to understand, it's good enough to use the above approximation of the truth.]

OK, so now we know the conditions which must be satisfied so one job can preempt another one. But what job gets preempted, how do we choose what job preempts it if there are several valid candidates, and when does all this exactly happen?

The answer for the first part is simple. The job whose entry was selected the last time is the so called current job. Normally, it is the first job on the scheduler's job list, but destination concurrency limits may change this as we will see later. It is always only the current job which may get preempted.

Now for the second part. The current job has a certain amount of recipient entries, and as such may accumulate at maximum some amount of available delivery slots. It might have already accumulated some, and perhaps even already used some when it was preempted before (remember a job can be preempted several times). In either case, we know how many are accumulated and how many are left to deliver, so we know how many it may yet accumulate at maximum. Every other job which may be delivered by less than that number of slots is a valid candidate for preemption. How do we choose among them?

The answer is - the one with maximum enqueue_time/recipient_entry_count. That is, the older the job is, the more we should try to deliver it in order to get best message delivery rates. These rates are of course subject to how many recipients the message has, therefore the division by the recipient (entry) count. No one shall be surprised that a message with n recipients takes n times longer to deliver than a message with one recipient.

Now let's recap the previous two paragraphs. Isn't it too complicated? Why don't the candidates come only among the jobs which can be delivered within the number of slots the current job already accumulated? Why do we need to estimate how much it has yet to accumulate? If you found out the answer, congratulate yourself. If we did it this simple way, we would always choose the candidate with the fewest recipient entries. If there were enough single recipient mails coming in, they would always slip by the bulk mail as soon as possible, and the two or more recipients mail would never get a chance, no matter how long they have been sitting around in the job list.

This candidate selection has an interesting implication - that when we choose the best candidate for preemption (this is done in qmgr_choose_candidate()), it may happen that we may not use it for preemption immediately. This leads to an answer to the last part of the original question - when does the preemption happen?

The preemption attempt happens every time next transport's recipient entry is to be chosen for delivery. To avoid needless overhead, the preemption is not attempted if the current job could never accumulate more than transport_minimum_delivery_slots (defaults to default_minimum_delivery_slots which defaults to 3). If there are already enough accumulated slots to preempt the current job by the chosen best candidate, it is done immediately. This basically means that the candidate is moved in front of the current job on the scheduler's job list and decreasing the accumulated slot counter by the amount used by the candidate. If there are not enough slots... well, I could say that nothing happens and the another preemption is attempted the next time. But that's not the complete truth.

The truth is that it turns out that it is not really necessary to wait until the jobs counter accumulates all the delivery slots in advance. Say we have ten-recipient mail followed by two two-recipient mails. If the preemption happened when enough delivery slots accumulate (assuming slot cost 2), the delivery sequence becomes 11112211113311. Now what would we get if we would wait only for 50% of the necessary slots to accumulate and we promise we would wait for the remaining 50% later, after we get back to the preempted job? If we use such a slot loan, the delivery sequence becomes 11221111331111. As we can see, it makes it not considerably worse for the delivery of the ten-recipient mail, but it allows the small messages to be delivered sooner.

The concept of these slot loans is where the transport_delivery_slot_discount and transport_delivery_slot_loan come from (they default to default_delivery_slot_discount and default_delivery_slot_loan, whose values are by default 50 and 3, respectively). The discount (resp. loan) specifies how many percent (resp. how many slots) one "gets in advance", when the number of slots required to deliver the best candidate is compared with the number of slots the current slot had accumulated so far.

And that pretty much concludes this chapter.

[Now you should have a feeling that you pretty much understand the scheduler and the preemption, or at least that you will have after you read the last chapter a couple more times. You shall clearly see the job list and the preemption happening at its head, in ideal delivery conditions. The feeling of understanding shall last until you start wondering what happens if some of the jobs are blocked, which you might eventually figure out correctly from what had been said already. But I would be surprised if your mental image of the scheduler's functionality is not completely shattered once you start wondering how it works when not all recipients may be read in-core. More on that later.]

How destination concurrency limits affect the scheduling algorithm

The nqmgr uses the same algorithm for destination concurrency control as oqmgr. Now what happens when the destination limits are reached and no more entries for that destination may be selected by the scheduler?

From the user's point of view it is all simple. If some of the peers of a job can't be selected, those peers are simply skipped by the entry selection algorithm (the pseudo-code described before) and only the selectable ones are used. If none of the peers may be selected, the job is declared a "blocker job". Blocker jobs are skipped by the entry selection algorithm and they are also excluded from the candidates for preemption of the current job. Thus the scheduler effectively behaves as if the blocker jobs didn't exist on the job list at all. As soon as at least one of the peers of a blocker job becomes unblocked (that is, the delivery agent handling the delivery of the recipient entry for the given destination successfully finishes), the job's blocker status is removed and the job again participates in all further scheduler actions normally.

So the summary is that the users don't really have to be concerned about the interaction of the destination limits and scheduling algorithm. It works well on its own and there are no knobs they would need to control it.

From a programmer's point of view, the blocker jobs complicate the scheduler quite a lot. Without them, the jobs on the job list would be normally delivered in strict FIFO order. If the current job is preempted, the job pree