Divisible without remainder

Summary

Problem

For a “1 in n sampling” problem I was looking for a fast way to determine whether a value x (a random number) is divisible by a previously known divisor d (the sampling rate) without remainder. The algorithm therefore has to return true in 1 of d cases and false otherwise. x and d are both unsigned integers, in my case 32 bit long. Since the algorithm will be used for sampling and x is produced by a random number generator (see below: Motivation), a slight inaccuracy (< 0.01%) is tolerable. The usual range for d is whole numbers between 1 and 2000, based on the recommendations for sFlow sampling rates.

tl;dr Solution

The obvious solution is of course the modulo operation: x % d == 0. In my special case the modulo operation is not available, so I had to look for an alternative (for details see: Motivation).

The proposed solution only works with positive (unsigned) values.
d = 0 is not allowed (division by 0); for d = 1, the result is always true.

If d is a power of 2 (e.g. 4, 8, 16), the result is determined with x & (d - 1) == 0. If true, there is no remainder.

For all other cases, the following calculation is performed:
x * reciprocal(d) < reciprocal(d)
The reciprocal function is defined as:
reciprocal(d) = maxuint / d + 1
where the division is an integer division, with the result rounded down.
maxuint is the maximum unsigned integer value of the used data type, in my case 32 bit (maxuint = 2^32 - 1 = 4294967295).

The check whether d is a power of 2 and the calculation of reciprocal(d) are performed only once at initialization, because the sampling rate is constant during the whole sampling process.
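The whole scheme can be sketched in Go as follows (a minimal sketch; the sampler type and the function names are my own, not taken from zeroremainder):

```go
package main

import "fmt"

// sampler holds the values precomputed at initialization,
// because the sampling rate d is constant during sampling.
type sampler struct {
	pow2       bool
	mask       uint32 // d - 1, used if d is a power of 2
	reciprocal uint32 // maxuint/d + 1, used otherwise
}

func newSampler(d uint32) sampler {
	const maxuint = ^uint32(0) // 2^32 - 1 = 4294967295
	if d&(d-1) == 0 { // d is a power of 2 (d = 0 is not allowed)
		return sampler{pow2: true, mask: d - 1}
	}
	return sampler{reciprocal: maxuint/d + 1}
}

// divisible reports whether x is divisible by d without remainder.
func (s sampler) divisible(x uint32) bool {
	if s.pow2 {
		return x&s.mask == 0
	}
	// Relies on the wrap-around of unsigned integer overflow.
	return x*s.reciprocal < s.reciprocal
}

func main() {
	s := newSampler(400) // e.g. sampling rate 1 in 400
	fmt.Println(s.divisible(800), s.divisible(801)) // true false
}
```

newSampler runs once at initialization; divisible is then a single multiplication and comparison per packet.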

For x from 0 to maxuint and d from 1 to 10000 (inclusive), the proposed algorithm returns the same count of true (and respectively false) results as the exact method x % d == 0 for 9988 of the divisors. In the other 12 cases, the counts of true results differ by 1, which amounts to an inaccuracy below 0.001%.

These results are verified by a small program of mine, called zeroremainder, available on GitHub.

Bottom Line

The proposed algorithm is usable for the use case in question, as the difference of the results compared to the exact method using the modulo operation is marginal (below 0.001%).

Motivation

I am working on a tool to sample network packets at an arbitrary sampling rate (normally between 1 and 2000), as proposed by sFlow. My implementation is based on gopacket, which uses libpcap, the same library used by tcpdump. One of my approaches for sampling the packets is based on BPF (Berkeley Packet Filter), which allows filtering the captured packets in the kernel. The assumption is that filtering in kernel mode should be faster, because fewer context switches and memory copy operations are necessary. BPF allows calling the prandom interface of the kernel to get a random uint32 number. One way to decide whether the current packet should be sampled is to use the modulo operation: random number modulo sampling rate. If the remainder is 0, the packet is sampled; otherwise the packet is ignored.

Although BPF assembler does provide a modulo instruction, I do not want to use it, for two reasons:

  1. The modulo BPF instruction is not supported by libpcap < 1.6.0
  2. Unsigned integer division (and therefore modulo) is one of the slowest operations on a modern microprocessor (see benchmark results below)

Inspiration

While searching the web for possible solutions that could be implemented easily in assembler, I stumbled over Division Without DIV and IDIV and the concept of magic numbers (Stackoverflow - Most optimized way to calculate modulus in C, Magic number generator, Labor of Division), which replace division by a multiplication with a magic number and a binary shift operation.

Inspired by these ideas, I started to experiment with reciprocals, magic numbers and the overflow of unsigned integers.

Solution

Preconditions

d being a power of 2

This part of the solution is straightforward. If d is a power of 2 (e.g. 4, 8, 16), the modulo can be determined with a simple and-operation: x & (d - 1) == 0. The result of the bitwise and-operation of x and d - 1 is the remainder. Therefore, if the result is 0, there is no remainder and the equation is true.
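That the bitwise and-operation really extracts the remainder for powers of 2 is easy to check empirically (a small sketch of my own):

```go
package main

import "fmt"

func main() {
	const d = uint32(8) // a power of 2
	for x := uint32(0); x < 100; x++ {
		// For a power of 2, d - 1 is a bitmask of the low bits,
		// and x & (d-1) is exactly the remainder of x / d.
		if x&(d-1) != x%d {
			fmt.Println("mismatch at", x)
			return
		}
	}
	fmt.Println("x & (d-1) == x % d for all tested x")
}
```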

Multiply by reciprocal

For all other cases (d not being a power of 2), the basic idea is to multiply x by the reciprocal of d, calculated on the basis of maxuint (the maximum integer value of the respective data type or architecture, e.g. 32 bit), and to use the free wrap-around modulo operation of the integer overflow. The integer overflow is described in Wikipedia - Integer overflow as follows:

Since an arithmetic operation may produce a result larger than the maximum representable value, a potential error condition may result. In the C programming language, signed integer overflow causes undefined behavior, while unsigned integer overflow causes the number to be reduced modulo a power of two, meaning that unsigned integers “wrap around” on overflow.
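This wrap-around behavior can be observed directly with Go's uint32 (my example):

```go
package main

import "fmt"

func main() {
	var maxuint uint32 = ^uint32(0) // 4294967295
	// Unsigned overflow wraps around: the result is the
	// mathematical value reduced modulo 2^32.
	fmt.Println(maxuint + 1) // 0
	fmt.Println(maxuint + 5) // 4
	var a uint32 = 3000000000
	fmt.Println(a + a) // 6000000000 mod 2^32 = 1705032704
}
```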

First we handle some special cases:

  1. d = 0 is not allowed (division by 0).
  2. For d = 1, the result is always true.
  3. If d is a power of 2, the bitwise and-operation described above is used.
  4. For x = 0 and x = d, the result is true; for 0 < x < d, the result is false.

The elimination of these special cases ensures that a wrap-around modulo operation is always performed.

For the remaining cases (x > d) the following calculation is performed to decide if a packet should be sampled or not:
x * reciprocal(d) < reciprocal(d).

The reciprocal used is calculated as the rounded-up division of maxuint by d (the division itself being an integer division, with the result rounded down):
reciprocal(d) = maxuint / d + 1
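As a concrete worked example (my own numbers, for 32 bit and d = 3):

```go
package main

import "fmt"

func main() {
	var maxuint uint32 = ^uint32(0) // 4294967295
	var d uint32 = 3
	r := maxuint/d + 1 // 1431655765 + 1 = 1431655766
	fmt.Println(r)

	// x divisible by 3: the product wraps around to a value below r.
	fmt.Println(6*r < r) // true  (6*r mod 2^32 = 4)
	// x not divisible by 3: the product lands at or above r.
	fmt.Println(7*r < r) // false (7*r mod 2^32 = 1431655770)
}
```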

One of the first questions I asked myself about this equation was: why does x * reciprocal(d) have to be smaller than reciprocal(d)? To explain this, let’s have a look at an example with d being a power of 2. We know that if d is a power of 2, there is a simpler approach to determine whether x is divisible by d without remainder, namely the bitwise and-operation described above.
Nevertheless, we may use the proposed algorithm. In this case, x * reciprocal(d) will always result in 0 if x is divisible by d without remainder. This works because reciprocal(d) then is the effective reciprocal value of d, without rounding error. For all other d, not being a power of 2, reciprocal(d) is always a rounded value, with 0 < error < 1.
The error may be calculated as follows:
error = reciprocal(d) - 2^size / d
where 2^size / d is the exact (non-rounded) quotient.

This rounding error is the reason why x * reciprocal(d) does not equal 0 if x % d == 0. The higher x, the higher the accumulated error becomes, as the error is multiplied by x as well. Now it becomes clear that there is a point where the accumulated error even exceeds reciprocal(d), which leads to a situation where x * reciprocal(d) < reciprocal(d) no longer provides the same result as x % d == 0.

With (2^size / d) / error (integer division), the highest x may be calculated for which the proposed algorithm still provides the same result as x % d == 0. Unfortunately this buys us nothing, because in my scenario x may be any number between 0 and maxuint. So the next question arises: what happens beyond this point, or a little bolder: does it matter? The short answer is: no, for my use case it does not matter. Why?
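The error and this limit can be approximated numerically, here using float64 for the exact quotient instead of the integer division mentioned above (my own illustrative sketch):

```go
package main

import "fmt"

func main() {
	const size = 32
	pow := float64(uint64(1) << size) // 2^32
	for _, d := range []uint32{3, 5, 7} {
		r := ^uint32(0)/d + 1
		// Rounding error of the reciprocal, 0 < error < 1.
		err := float64(r) - pow/float64(d)
		// Approximate highest x for which the proposed algorithm
		// still matches x % d == 0.
		limit := uint64(pow / float64(d) / err)
		fmt.Printf("d=%d reciprocal=%d error=%.4f highest exact x≈%d\n",
			d, r, err, limit)
	}
}
```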

We can understand the proposed algorithm x * reciprocal(d) < reciprocal(d) as a kind of sliding window of size d. Because we cannot exactly represent the reciprocal of d as a whole number, the sliding window is a little bigger than d. This leads to a difference between the proposed algorithm and the exact calculation, but the proposed algorithm still fulfills the requirement of selecting approximately every nth element (respectively every dth).

Applied verification

To verify the above statements, I wrote a small program called zeroremainder, available on GitHub.

This program allows comparing the results of the proposed algorithm with the results of the exact method (modulo operation) and calculating the difference between them.

As I am primarily interested in the results on 32 bit data types and for divisors between 1 and 2000, my applied verification was focused on these parameters, while the zeroremainder test program allows checking other number ranges as well.

Results

I ran zeroremainder on my 4-core i5 at 2.67 GHz for about 60 hours, using 100% CPU, to fully calculate the divisors from 2 to 10000 (inclusive) with dividends from 0 to 2^32 - 1.

Executed command:

zeroremainder -outputalldivisors -dividendend=0xffffffff -divisorend=10000 > 0_to_0xffffffff_2_to_10000_alldivisors.txt

The results of these calculations are:

For easier verification of the results I sorted these by divisor and added a column with an exact value for the difference between the proposed algorithm and the modulo operation as a percent value as follows:

cat 0_to_0xffffffff_2_to_10000_alldivisors.txt | sort -n -k 2 | awk ' {print $0 " Difference exact (%): " sprintf("%.16f", 100-100/$8*$5)}' > 0_to_0xffffffff_2_to_10000_alldivisors_sorted_exact.txt

You may download these results as compressed bz2 archive 0_to_0xffffffff_2_to_10000_alldivisors_sorted_exact.txt.bz2.

Benchmarks

To get a feeling about the performance difference of the different algorithms I implemented a small go benchmark.

$ go test -bench=.
testing: warning: no tests to run
PASS
BenchmarkDivisionMod-4                100000000                11.1 ns/op
BenchmarkDivisionPow2-4               300000000                 4.21 ns/op
BenchmarkZeroremainderUint32-4        300000000                 4.46 ns/op
BenchmarkZeroremainderUint64-4        300000000                 4.50 ns/op
BenchmarkDivisionLt-4                 500000000                 3.34 ns/op
ok          /home/lubr/go/src/github.com/breml/zeroremainder    8.439s

I am aware of the fact that these results are not comparable with an implementation of these calculations in BPF, but they give me a feeling for the dimensions.

The easiest solution

While working on this solution, I became aware of an even easier way to do the sampling: evaluating x < maxuint / d. The part maxuint / d may be calculated at initialization (like reciprocal(d)), where maxuint is the maximum number generated by the random number generator.

With this solution a simple comparison is enough to determine if a packet should be sampled or not.

This solution may be used if the source of the random numbers guarantees an even distribution throughout the whole range of possible numbers.

Disclaimer

This document is a compilation of observations and empirical studies of the described algorithm and by no means a formal proof.