|  | // Copyright 2009 The Go Authors. All rights reserved. | 
|  | // Use of this source code is governed by a BSD-style | 
|  | // license that can be found in the LICENSE file. | 
|  |  | 
|  | /* | 
|  |  | 
|  | Multi-precision division. Here be dragons. | 
|  |  | 
|  | Given u and v, where u is n+m digits, and v is n digits (with no leading zeros), | 
|  | the goal is to return quo, rem such that u = quo*v + rem, where 0 ≤ rem < v. | 
|  | That is, quo = ⌊u/v⌋ where ⌊x⌋ denotes the floor (truncation to integer) of x, | 
|  | and rem = u - quo·v. | 
|  |  | 
|  |  | 
|  | Long Division | 
|  |  | 
|  | Division in a computer proceeds the same as long division in elementary school, | 
|  | but computers are not as good as schoolchildren at following vague directions, | 
|  | so we have to be much more precise about the actual steps and what can happen. | 
|  |  | 
|  | We work from most to least significant digit of the quotient, doing: | 
|  |  | 
|  | • Guess a digit q, the number of v to subtract from the current | 
|  | section of u to zero out the topmost digit. | 
|  | • Check the guess by multiplying q·v and comparing it against | 
|  | the current section of u, adjusting the guess as needed. | 
|  | • Subtract q·v from the current section of u. | 
|  | • Add q to the corresponding section of the result quo. | 
|  |  | 
|  | When all digits have been processed, the final remainder is left in u | 
|  | and returned as rem. | 
|  |  | 
|  | For example, here is a sketch of dividing 5 digits by 3 digits (n=3, m=2). | 
|  |  | 
|  | q₂ q₁ q₀ | 
|  | _________________ | 
|  | v₂ v₁ v₀ ) u₄ u₃ u₂ u₁ u₀ | 
|  | ↓  ↓  ↓  |  | | 
|  | [u₄ u₃ u₂]|  | | 
|  | - [  q₂·v  ]|  | | 
|  | ----------- ↓  | | 
|  | [  rem  | u₁]| | 
|  | - [    q₁·v   ]| | 
|  | ----------- ↓ | 
|  | [  rem  | u₀] | 
|  | - [    q₀·v   ] | 
|  | ------------ | 
|  | [  rem   ] | 
|  |  | 
|  | Instead of creating new storage for the remainders and copying digits from u | 
|  | as indicated by the arrows, we use u's storage directly as both the source | 
|  | and destination of the subtractions, so that the remainders overwrite | 
|  | successive overlapping sections of u as the division proceeds, using a slice | 
|  | of u to identify the current section. This avoids all the copying as well as | 
|  | shifting of remainders. | 
|  |  | 
|  | Division of u with n+m digits by v with n digits (in base B) can in general | 
|  | produce at most m+1 digits, because: | 
|  |  | 
|  | • u < B^(n+m)               [B^(n+m) has n+m+1 digits] | 
|  | • v ≥ B^(n-1)               [B^(n-1) is the smallest n-digit number] | 
|  | • u/v < B^(n+m) / B^(n-1)   [divide bounds for u, v] | 
|  | • u/v < B^(m+1)             [simplify] | 
|  |  | 
|  | The first step is special: it takes the top n digits of u and divides them by | 
|  | the n digits of v, producing the first quotient digit and an n-digit remainder. | 
|  | In the example, q₂ = ⌊u₄u₃u₂ / v⌋. | 
|  |  | 
|  | The first step divides n digits by n digits to ensure that it produces only a | 
|  | single digit. | 
|  |  | 
|  | Each subsequent step appends the next digit from u to the remainder and divides | 
|  | those n+1 digits by the n digits of v, producing another quotient digit and a | 
|  | new n-digit remainder. | 
|  |  | 
|  | Subsequent steps divide n+1 digits by n digits, an operation that in general | 
|  | might produce two digits. However, as used in the algorithm, that division is | 
|  | guaranteed to produce only a single digit. The dividend is of the form | 
|  | rem·B + d, where rem is a remainder from the previous step and d is a single | 
|  | digit, so: | 
|  |  | 
|  | • rem ≤ v - 1                 [rem is a remainder from dividing by v] | 
|  | • rem·B ≤ v·B - B             [multiply by B] | 
|  | • d ≤ B - 1                   [d is a single digit] | 
|  | • rem·B + d ≤ v·B - 1         [add] | 
|  | • rem·B + d < v·B             [change ≤ to <] | 
|  | • (rem·B + d)/v < B           [divide by v] | 
|  |  | 
|  |  | 
|  | Guess and Check | 
|  |  | 
|  | At each step we need to divide n+1 digits by n digits, but this is for the | 
|  | implementation of division by n digits, so we can't just invoke a division | 
|  | routine: we _are_ the division routine. Instead, we guess at the answer and | 
|  | then check it using multiplication. If the guess is wrong, we correct it. | 
|  |  | 
|  | How can this guessing possibly be efficient? It turns out that the following | 
|  | statement (let's call it the Good Guess Guarantee) is true. | 
|  |  | 
|  | If | 
|  |  | 
|  | • q = ⌊u/v⌋ where u is n+1 digits and v is n digits, | 
|  | • q < B, and | 
|  | • the topmost digit of v = vₙ₋₁ ≥ B/2, | 
|  |  | 
|  | then q̂ = ⌊uₙuₙ₋₁ / vₙ₋₁⌋ satisfies q ≤ q̂ ≤ q+2. (Proof below.) | 
|  |  | 
|  | That is, if we know the answer has only a single digit and we guess an answer | 
|  | by ignoring the bottom n-1 digits of u and v, using a 2-by-1-digit division, | 
|  | then that guess is at least as large as the correct answer. It is also not | 
|  | too much larger: it is off by at most two from the correct answer. | 
|  |  | 
|  | Note that in the first step of the overall division, which is an n-by-n-digit | 
|  | division, the 2-by-1 guess uses an implicit uₙ = 0. | 
|  |  | 
|  | Note that using a 2-by-1-digit division here does not mean calling ourselves | 
|  | recursively. Instead, we use an efficient direct hardware implementation of | 
|  | that operation. | 
|  |  | 
|  | Note that because q is u/v rounded down, q·v must not exceed u: u ≥ q·v. | 
|  | If a guess q̂ is too big, it will not satisfy this test. Viewed a different way, | 
|  | the remainder r̂ for a given q̂ is u - q̂·v, which must be positive. If it is | 
|  | negative, then the guess q̂ is too big. | 
|  |  | 
|  | This gives us a way to compute q. First compute q̂ with 2-by-1-digit division. | 
|  | Then, while u < q̂·v, decrement q̂; this loop executes at most twice, because | 
|  | q̂ ≤ q+2. | 
|  |  | 
|  |  | 
|  | Scaling Inputs | 
|  |  | 
|  | The Good Guess Guarantee requires that the top digit of v (vₙ₋₁) be at least B/2. | 
|  | For example in base 10, ⌊172/19⌋ = 9, but ⌊18/1⌋ = 18: the guess is wildly off | 
|  | because the first digit 1 is smaller than B/2 = 5. | 
|  |  | 
|  | We can ensure that v has a large top digit by multiplying both u and v by the | 
|  | right amount. Continuing the example, if we multiply both 172 and 19 by 3, we | 
|  | now have ⌊516/57⌋, the leading digit of v is now ≥ 5, and sure enough | 
|  | ⌊51/5⌋ = 10 is much closer to the correct answer 9. It would be easier here | 
|  | to multiply by 4, because that can be done with a shift. Specifically, we can | 
|  | always count the number of leading zeros i in the first digit of v and then | 
|  | shift both u and v left by i bits. | 
|  |  | 
|  | Having scaled u and v, the value ⌊u/v⌋ is unchanged, but the remainder will | 
|  | be scaled: 172 mod 19 is 1, but 516 mod 57 is 3. We have to divide the remainder | 
|  | by the scaling factor (shifting right i bits) when we finish. | 
|  |  | 
|  | Note that these shifts happen before and after the entire division algorithm, | 
|  | not at each step in the per-digit iteration. | 
|  |  | 
|  | Note the effect of scaling inputs on the size of the possible quotient. | 
|  | In the scaled u/v, u can gain a digit from scaling; v never does, because we | 
|  | pick the scaling factor to make v's top digit larger but without overflowing. | 
|  | If u and v have n+m and n digits after scaling, then: | 
|  |  | 
|  | • u < B^(n+m)               [B^(n+m) has n+m+1 digits] | 
|  | • v ≥ B^n / 2               [vₙ₋₁ ≥ B/2, so vₙ₋₁·B^(n-1) ≥ B^n/2] | 
|  | • u/v < B^(n+m) / (B^n / 2) [divide bounds for u, v] | 
|  | • u/v < 2 B^m               [simplify] | 
|  |  | 
|  | The quotient can still have m+1 significant digits, but if so the top digit | 
|  | must be a 1. This provides a different way to handle the first digit of the | 
|  | result: compare the top n digits of u against v and fill in either a 0 or a 1. | 
|  |  | 
|  |  | 
|  | Refining Guesses | 
|  |  | 
|  | Before we check whether u < q̂·v, we can adjust our guess to change it from | 
|  | q̂ = ⌊uₙuₙ₋₁ / vₙ₋₁⌋ into the refined guess ⌊uₙuₙ₋₁uₙ₋₂ / vₙ₋₁vₙ₋₂⌋. | 
|  | Although not mentioned above, the Good Guess Guarantee also promises that this | 
|  | 3-by-2-digit division guess is more precise and at most one away from the real | 
|  | answer q. The improvement from the 2-by-1 to the 3-by-2 guess can also be done | 
|  | without n-digit math. | 
|  |  | 
|  | If we have a guess q̂ = ⌊uₙuₙ₋₁ / vₙ₋₁⌋ and we want to see if it also equal to | 
|  | ⌊uₙuₙ₋₁uₙ₋₂ / vₙ₋₁vₙ₋₂⌋, we can use the same check we would for the full division: | 
|  | if uₙuₙ₋₁uₙ₋₂ < q̂·vₙ₋₁vₙ₋₂, then the guess is too large and should be reduced. | 
|  |  | 
|  | Checking uₙuₙ₋₁uₙ₋₂ < q̂·vₙ₋₁vₙ₋₂ is the same as uₙuₙ₋₁uₙ₋₂ - q̂·vₙ₋₁vₙ₋₂ < 0, | 
|  | and | 
|  |  | 
|  | uₙuₙ₋₁uₙ₋₂ - q̂·vₙ₋₁vₙ₋₂ = (uₙuₙ₋₁·B + uₙ₋₂) - q̂·(vₙ₋₁·B + vₙ₋₂) | 
|  | [splitting off the bottom digit] | 
|  | = (uₙuₙ₋₁ - q̂·vₙ₋₁)·B + uₙ₋₂ - q̂·vₙ₋₂ | 
|  | [regrouping] | 
|  |  | 
|  | The expression (uₙuₙ₋₁ - q̂·vₙ₋₁) is the remainder of uₙuₙ₋₁ / vₙ₋₁. | 
|  | If the initial guess returns both q̂ and its remainder r̂, then checking | 
|  | whether uₙuₙ₋₁uₙ₋₂ < q̂·vₙ₋₁vₙ₋₂ is the same as checking r̂·B + uₙ₋₂ < q̂·vₙ₋₂. | 
|  |  | 
|  | If we find that r̂·B + uₙ₋₂ < q̂·vₙ₋₂, then we can adjust the guess by | 
|  | decrementing q̂ and adding vₙ₋₁ to r̂. We repeat until r̂·B + uₙ₋₂ ≥ q̂·vₙ₋₂. | 
|  | (As before, this fixup is only needed at most twice.) | 
|  |  | 
|  | Now that q̂ = ⌊uₙuₙ₋₁uₙ₋₂ / vₙ₋₁vₙ₋₂⌋, as mentioned above it is at most one | 
|  | away from the correct q, and we've avoided doing any n-digit math. | 
|  | (If we need the new remainder, it can be computed as r̂·B + uₙ₋₂ - q̂·vₙ₋₂.) | 
|  |  | 
|  | The final check u < q̂·v and the possible fixup must be done at full precision. | 
|  | For random inputs, a fixup at this step is exceedingly rare: the 3-by-2 guess | 
|  | is not often wrong at all. But still we must do the check. Note that since the | 
|  | 3-by-2 guess is off by at most 1, it can be convenient to perform the final | 
|  | u < q̂·v as part of the computation of the remainder r = u - q̂·v. If the | 
|  | subtraction underflows, decremeting q̂ and adding one v back to r is enough to | 
|  | arrive at the final q, r. | 
|  |  | 
|  | That's the entirety of long division: scale the inputs, and then loop over | 
|  | each output position, guessing, checking, and correcting the next output digit. | 
|  |  | 
|  | For a 2n-digit number divided by an n-digit number (the worst size-n case for | 
|  | division complexity), this algorithm uses n+1 iterations, each of which must do | 
|  | at least the 1-by-n-digit multiplication q̂·v. That's O(n) iterations of | 
|  | O(n) time each, so O(n²) time overall. | 
|  |  | 
|  |  | 
|  | Recursive Division | 
|  |  | 
|  | For very large inputs, it is possible to improve on the O(n²) algorithm. | 
|  | Let's call a group of n/2 real digits a (very) “wide digit”. We can run the | 
|  | standard long division algorithm explained above over the wide digits instead of | 
|  | the actual digits. This will result in many fewer steps, but the math involved in | 
|  | each step is more work. | 
|  |  | 
|  | Where basic long division uses a 2-by-1-digit division to guess the initial q̂, | 
|  | the new algorithm must use a 2-by-1-wide-digit division, which is of course | 
|  | really an n-by-n/2-digit division. That's OK: if we implement n-digit division | 
|  | in terms of n/2-digit division, the recursion will terminate when the divisor | 
|  | becomes small enough to handle with standard long division or even with the | 
|  | 2-by-1 hardware instruction. | 
|  |  | 
|  | For example, here is a sketch of dividing 10 digits by 4, proceeding with | 
|  | wide digits corresponding to two regular digits. The first step, still special, | 
|  | must leave off a (regular) digit, dividing 5 by 4 and producing a 4-digit | 
|  | remainder less than v. The middle steps divide 6 digits by 4, guaranteed to | 
|  | produce two output digits each (one wide digit) with 4-digit remainders. | 
|  | The final step must use what it has: the 4-digit remainder plus one more, | 
|  | 5 digits to divide by 4. | 
|  |  | 
|  | q₆ q₅ q₄ q₃ q₂ q₁ q₀ | 
|  | _______________________________ | 
|  | v₃ v₂ v₁ v₀ ) u₉ u₈ u₇ u₆ u₅ u₄ u₃ u₂ u₁ u₀ | 
|  | ↓  ↓  ↓  ↓  ↓  |  |  |  |  | | 
|  | [u₉ u₈ u₇ u₆ u₅]|  |  |  |  | | 
|  | - [    q₆q₅·v    ]|  |  |  |  | | 
|  | ----------------- ↓  ↓  |  |  | | 
|  | [    rem    |u₄ u₃]|  |  | | 
|  | - [     q₄q₃·v      ]|  |  | | 
|  | -------------------- ↓  ↓  | | 
|  | [    rem    |u₂ u₁]| | 
|  | - [     q₂q₁·v      ]| | 
|  | -------------------- ↓ | 
|  | [    rem    |u₀] | 
|  | - [     q₀·v     ] | 
|  | ------------------ | 
|  | [    rem    ] | 
|  |  | 
|  | An alternative would be to look ahead to how well n/2 divides into n+m and | 
|  | adjust the first step to use fewer digits as needed, making the first step | 
|  | more special to make the last step not special at all. For example, using the | 
|  | same input, we could choose to use only 4 digits in the first step, leaving | 
|  | a full wide digit for the last step: | 
|  |  | 
|  | q₆ q₅ q₄ q₃ q₂ q₁ q₀ | 
|  | _______________________________ | 
|  | v₃ v₂ v₁ v₀ ) u₉ u₈ u₇ u₆ u₅ u₄ u₃ u₂ u₁ u₀ | 
|  | ↓  ↓  ↓  ↓  |  |  |  |  |  | | 
|  | [u₉ u₈ u₇ u₆]|  |  |  |  |  | | 
|  | - [    q₆·v   ]|  |  |  |  |  | | 
|  | -------------- ↓  ↓  |  |  |  | | 
|  | [    rem    |u₅ u₄]|  |  |  | | 
|  | - [     q₅q₄·v      ]|  |  |  | | 
|  | -------------------- ↓  ↓  |  | | 
|  | [    rem    |u₃ u₂]|  | | 
|  | - [     q₃q₂·v      ]|  | | 
|  | -------------------- ↓  ↓ | 
|  | [    rem    |u₁ u₀] | 
|  | - [     q₁q₀·v      ] | 
|  | --------------------- | 
|  | [    rem    ] | 
|  |  | 
|  | Today, the code in divRecursiveStep works like the first example. Perhaps in | 
|  | the future we will make it work like the alternative, to avoid a special case | 
|  | in the final iteration. | 
|  |  | 
|  | Either way, each step is a 3-by-2-wide-digit division approximated first by | 
|  | a 2-by-1-wide-digit division, just as we did for regular digits in long division. | 
|  | Because the actual answer we want is a 3-by-2-wide-digit division, instead of | 
|  | multiplying q̂·v directly during the fixup, we can use the quick refinement | 
|  | from long division (an n/2-by-n/2 multiply) to correct q to its actual value | 
|  | and also compute the remainder (as mentioned above), and then stop after that, | 
|  | never doing a full n-by-n multiply. | 
|  |  | 
|  | Instead of using an n-by-n/2-digit division to produce n/2 digits, we can add | 
|  | (not discard) one more real digit, doing an (n+1)-by-(n/2+1)-digit division that | 
|  | produces n/2+1 digits. That single extra digit tightens the Good Guess Guarantee | 
|  | to q ≤ q̂ ≤ q+1 and lets us drop long division's special treatment of the first | 
|  | digit. These benefits are discussed more after the Good Guess Guarantee proof | 
|  | below. | 
|  |  | 
|  |  | 
|  | How Fast is Recursive Division? | 
|  |  | 
|  | For a 2n-by-n-digit division, this algorithm runs a 4-by-2 long division over | 
|  | wide digits, producing two wide digits plus a possible leading regular digit 1, | 
|  | which can be handled without a recursive call. That is, the algorithm uses two | 
|  | full iterations, each using an n-by-n/2-digit division and an n/2-by-n/2-digit | 
|  | multiplication, along with a few n-digit additions and subtractions. The standard | 
|  | n-by-n-digit multiplication algorithm requires O(n²) time, making the overall | 
|  | algorithm require time T(n) where | 
|  |  | 
|  | T(n) = 2T(n/2) + O(n) + O(n²) | 
|  |  | 
|  | which, by the Bentley-Haken-Saxe theorem, ends up reducing to T(n) = O(n²). | 
|  | This is not an improvement over regular long division. | 
|  |  | 
|  | When the number of digits n becomes large enough, Karatsuba's algorithm for | 
|  | multiplication can be used instead, which takes O(n^log₂3) = O(n^1.6) time. | 
|  | (Karatsuba multiplication is implemented in func karatsuba in nat.go.) | 
|  | That makes the overall recursive division algorithm take O(n^1.6) time as well, | 
|  | which is an improvement, but again only for large enough numbers. | 
|  |  | 
|  | It is not critical to make sure that every recursion does only two recursive | 
|  | calls. While in general the number of recursive calls can change the time | 
|  | analysis, in this case doing three calls does not change the analysis: | 
|  |  | 
|  | T(n) = 3T(n/2) + O(n) + O(n^log₂3) | 
|  |  | 
|  | ends up being T(n) = O(n^log₂3). Because the Karatsuba multiplication taking | 
|  | time O(n^log₂3) is itself doing 3 half-sized recursions, doing three for the | 
|  | division does not hurt the asymptotic performance. Of course, it is likely | 
|  | still faster in practice to do two. | 
|  |  | 
|  |  | 
|  | Proof of the Good Guess Guarantee | 
|  |  | 
|  | Given numbers x, y, let us break them into the quotients and remainders when | 
|  | divided by some scaling factor S, with the added constraints that the quotient | 
|  | x/y and the high part of y are both less than some limit T, and that the high | 
|  | part of y is at least half as big as T. | 
|  |  | 
|  | x₁ = ⌊x/S⌋        y₁ = ⌊y/S⌋ | 
|  | x₀ = x mod S      y₀ = y mod S | 
|  |  | 
|  | x  = x₁·S + x₀    0 ≤ x₀ < S    x/y < T | 
|  | y  = y₁·S + y₀    0 ≤ y₀ < S    T/2 ≤ y₁ < T | 
|  |  | 
|  | And consider the two truncated quotients: | 
|  |  | 
|  | q = ⌊x/y⌋ | 
|  | q̂ = ⌊x₁/y₁⌋ | 
|  |  | 
|  | We will prove that q ≤ q̂ ≤ q+2. | 
|  |  | 
|  | The guarantee makes no real demands on the scaling factor S: it is simply the | 
|  | magnitude of the digits cut from both x and y to produce x₁ and y₁. | 
|  | The guarantee makes only limited demands on T: it must be large enough to hold | 
|  | the quotient x/y, and y₁ must have roughly the same size. | 
|  |  | 
|  | To apply to the earlier discussion of 2-by-1 guesses in long division, | 
|  | we would choose: | 
|  |  | 
|  | S  = Bⁿ⁻¹ | 
|  | T  = B | 
|  | x  = u | 
|  | x₁ = uₙuₙ₋₁ | 
|  | x₀ = uₙ₋₂...u₀ | 
|  | y  = v | 
|  | y₁ = vₙ₋₁ | 
|  | y₀ = vₙ₋₂...u₀ | 
|  |  | 
|  | These simpler variables avoid repeating those longer expressions in the proof. | 
|  |  | 
|  | Note also that, by definition, truncating division ⌊x/y⌋ satisfies | 
|  |  | 
|  | x/y - 1 < ⌊x/y⌋ ≤ x/y. | 
|  |  | 
|  | This fact will be used a few times in the proofs. | 
|  |  | 
|  | Proof that q ≤ q̂: | 
|  |  | 
|  | q̂·y₁ = ⌊x₁/y₁⌋·y₁                      [by definition, q̂ = ⌊x₁/y₁⌋] | 
|  | > (x₁/y₁ - 1)·y₁                  [x₁/y₁ - 1 < ⌊x₁/y₁⌋] | 
|  | = x₁ - y₁                         [distribute y₁] | 
|  |  | 
|  | So q̂·y₁ > x₁ - y₁. | 
|  | Since q̂·y₁ is an integer, q̂·y₁ ≥ x₁ - y₁ + 1. | 
|  |  | 
|  | q̂ - q = q̂ - ⌊x/y⌋                      [by definition, q = ⌊x/y⌋] | 
|  | ≥ q̂ - x/y                        [⌊x/y⌋ < x/y] | 
|  | = (1/y)·(q̂·y - x)                [factor out 1/y] | 
|  | ≥ (1/y)·(q̂·y₁·S - x)             [y = y₁·S + y₀ ≥ y₁·S] | 
|  | ≥ (1/y)·((x₁ - y₁ + 1)·S - x)    [above: q̂·y₁ ≥ x₁ - y₁ + 1] | 
|  | = (1/y)·(x₁·S - y₁·S + S - x)    [distribute S] | 
|  | = (1/y)·(S - x₀ - y₁·S)          [-x = -x₁·S - x₀] | 
|  | > -y₁·S / y                      [x₀ < S, so S - x₀ < 0; drop it] | 
|  | ≥ -1                             [y₁·S ≤ y] | 
|  |  | 
|  | So q̂ - q > -1. | 
|  | Since q̂ - q is an integer, q̂ - q ≥ 0, or equivalently q ≤ q̂. | 
|  |  | 
|  | Proof that q̂ ≤ q+2: | 
|  |  | 
|  | x₁/y₁ - x/y = x₁·S/y₁·S - x/y          [multiply left term by S/S] | 
|  | ≤ x/y₁·S - x/y             [x₁S ≤ x] | 
|  | = (x/y)·(y/y₁·S - 1)       [factor out x/y] | 
|  | = (x/y)·((y - y₁·S)/y₁·S)  [move -1 into y/y₁·S fraction] | 
|  | = (x/y)·(y₀/y₁·S)          [y - y₁·S = y₀] | 
|  | = (x/y)·(1/y₁)·(y₀/S)      [factor out 1/y₁] | 
|  | < (x/y)·(1/y₁)             [y₀ < S, so y₀/S < 1] | 
|  | ≤ (x/y)·(2/T)              [y₁ ≥ T/2, so 1/y₁ ≤ 2/T] | 
|  | < T·(2/T)                  [x/y < T] | 
|  | = 2                        [T·(2/T) = 2] | 
|  |  | 
|  | So x₁/y₁ - x/y < 2. | 
|  |  | 
|  | q̂ - q = ⌊x₁/y₁⌋ - q                    [by definition, q̂ = ⌊x₁/y₁⌋] | 
|  | = ⌊x₁/y₁⌋ - ⌊x/y⌋                [by definition, q = ⌊x/y⌋] | 
|  | ≤ x₁/y₁ - ⌊x/y⌋                  [⌊x₁/y₁⌋ ≤ x₁/y₁] | 
|  | < x₁/y₁ - (x/y - 1)              [⌊x/y⌋ > x/y - 1] | 
|  | = (x₁/y₁ - x/y) + 1              [regrouping] | 
|  | < 2 + 1                          [above: x₁/y₁ - x/y < 2] | 
|  | = 3 | 
|  |  | 
|  | So q̂ - q < 3. | 
|  | Since q̂ - q is an integer, q̂ - q ≤ 2. | 
|  |  | 
|  | Note that when x/y < T/2, the bounds tighten to x₁/y₁ - x/y < 1 and therefore | 
|  | q̂ - q ≤ 1. | 
|  |  | 
|  | Note also that in the general case 2n-by-n division where we don't know that | 
|  | x/y < T, we do know that x/y < 2T, yielding the bound q̂ - q ≤ 4. So we could | 
|  | remove the special case first step of long division as long as we allow the | 
|  | first fixup loop to run up to four times. (Using a simple comparison to decide | 
|  | whether the first digit is 0 or 1 is still more efficient, though.) | 
|  |  | 
|  | Finally, note that when dividing three leading base-B digits by two (scaled), | 
|  | we have T = B² and x/y < B = T/B, a much tighter bound than x/y < T. | 
|  | This in turn yields the much tighter bound x₁/y₁ - x/y < 2/B. This means that | 
|  | ⌊x₁/y₁⌋ and ⌊x/y⌋ can only differ when x/y is less than 2/B greater than an | 
|  | integer. For random x and y, the chance of this is 2/B, or, for large B, | 
|  | approximately zero. This means that after we produce the 3-by-2 guess in the | 
|  | long division algorithm, the fixup loop essentially never runs. | 
|  |  | 
|  | In the recursive algorithm, the extra digit in (2·⌊n/2⌋+1)-by-(⌊n/2⌋+1)-digit | 
|  | division has exactly the same effect: the probability of needing a fixup is the | 
|  | same 2/B. Even better, we can allow the general case x/y < 2T and the fixup | 
|  | probability only grows to 4/B, still essentially zero. | 
|  |  | 
|  |  | 
|  | References | 
|  |  | 
|  | There are no great references for implementing long division; thus this comment. | 
|  | Here are some notes about what to expect from the obvious references. | 
|  |  | 
|  | Knuth Volume 2 (Seminumerical Algorithms) section 4.3.1 is the usual canonical | 
|  | reference for long division, but that entire series is highly compressed, never | 
|  | repeating a necessary fact and leaving important insights to the exercises. | 
|  | For example, no rationale whatsoever is given for the calculation that extends | 
|  | q̂ from a 2-by-1 to a 3-by-2 guess, nor why it reduces the error bound. | 
|  | The proof that the calculation even has the desired effect is left to exercises. | 
|  | The solutions to those exercises provided at the back of the book are entirely | 
|  | calculations, still with no explanation as to what is going on or how you would | 
|  | arrive at the idea of doing those exact calculations. Nowhere is it mentioned | 
|  | that this test extends the 2-by-1 guess into a 3-by-2 guess. The proof of the | 
|  | Good Guess Guarantee is only for the 2-by-1 guess and argues by contradiction, | 
|  | making it difficult to understand how modifications like adding another digit | 
|  | or adjusting the quotient range affects the overall bound. | 
|  |  | 
|  | All that said, Knuth remains the canonical reference. It is dense but packed | 
|  | full of information and references, and the proofs are simpler than many other | 
|  | presentations. The proofs above are reworkings of Knuth's to remove the | 
|  | arguments by contradiction and add explanations or steps that Knuth omitted. | 
|  | But beware of errors in older printings. Take the published errata with you. | 
|  |  | 
|  | Brinch Hansen's “Multiple-length Division Revisited: a Tour of the Minefield” | 
|  | starts with a blunt critique of Knuth's presentation (among others) and then | 
|  | presents a more detailed and easier to follow treatment of long division, | 
|  | including an implementation in Pascal. But the algorithm and implementation | 
|  | work entirely in terms of 3-by-2 division, which is much less useful on modern | 
|  | hardware than an algorithm using 2-by-1 division. The proofs are a bit too | 
|  | focused on digit counting and seem needlessly complex, especially compared to | 
|  | the ones given above. | 
|  |  | 
|  | Burnikel and Ziegler's “Fast Recursive Division” introduced the key insight of | 
|  | implementing division by an n-digit divisor using recursive calls to division | 
|  | by an n/2-digit divisor, relying on Karatsuba multiplication to yield a | 
|  | sub-quadratic run time. However, the presentation decisions are made almost | 
|  | entirely for the purpose of simplifying the run-time analysis, rather than | 
|  | simplifying the presentation. Instead of a single algorithm that loops over | 
|  | quotient digits, the paper presents two mutually-recursive algorithms, for | 
|  | 2n-by-n and 3n-by-2n. The paper also does not present any general (n+m)-by-n | 
|  | algorithm. | 
|  |  | 
|  | The proofs in the paper are remarkably complex, especially considering that | 
|  | the algorithm is at its core just long division on wide digits, so that the | 
|  | usual long division proofs apply essentially unaltered. | 
|  | */ | 
|  |  | 
|  | package big | 
|  |  | 
|  | import "math/bits" | 
|  |  | 
|  | // div returns q, r such that q = ⌊u/v⌋ and r = u%v = u - q·v. | 
|  | // It uses z and z2 as the storage for q and r. | 
|  | func (z nat) div(z2, u, v nat) (q, r nat) { | 
|  | if len(v) == 0 { | 
|  | panic("division by zero") | 
|  | } | 
|  |  | 
|  | if u.cmp(v) < 0 { | 
|  | q = z[:0] | 
|  | r = z2.set(u) | 
|  | return | 
|  | } | 
|  |  | 
|  | if len(v) == 1 { | 
|  | // Short division: long optimized for a single-word divisor. | 
|  | // In that case, the 2-by-1 guess is all we need at each step. | 
|  | var r2 Word | 
|  | q, r2 = z.divW(u, v[0]) | 
|  | r = z2.setWord(r2) | 
|  | return | 
|  | } | 
|  |  | 
|  | q, r = z.divLarge(z2, u, v) | 
|  | return | 
|  | } | 
|  |  | 
|  | // divW returns q, r such that q = ⌊x/y⌋ and r = x%y = x - q·y. | 
|  | // It uses z as the storage for q. | 
|  | // Note that y is a single digit (Word), not a big number. | 
|  | func (z nat) divW(x nat, y Word) (q nat, r Word) { | 
|  | m := len(x) | 
|  | switch { | 
|  | case y == 0: | 
|  | panic("division by zero") | 
|  | case y == 1: | 
|  | q = z.set(x) // result is x | 
|  | return | 
|  | case m == 0: | 
|  | q = z[:0] // result is 0 | 
|  | return | 
|  | } | 
|  | // m > 0 | 
|  | z = z.make(m) | 
|  | r = divWVW(z, 0, x, y) | 
|  | q = z.norm() | 
|  | return | 
|  | } | 
|  |  | 
|  | // modW returns x % d. | 
|  | func (x nat) modW(d Word) (r Word) { | 
|  | // TODO(agl): we don't actually need to store the q value. | 
|  | var q nat | 
|  | q = q.make(len(x)) | 
|  | return divWVW(q, 0, x, d) | 
|  | } | 
|  |  | 
|  | // divWVW overwrites z with ⌊x/y⌋, returning the remainder r. | 
|  | // The caller must ensure that len(z) = len(x). | 
|  | func divWVW(z []Word, xn Word, x []Word, y Word) (r Word) { | 
|  | r = xn | 
|  | if len(x) == 1 { | 
|  | qq, rr := bits.Div(uint(r), uint(x[0]), uint(y)) | 
|  | z[0] = Word(qq) | 
|  | return Word(rr) | 
|  | } | 
|  | rec := reciprocalWord(y) | 
|  | for i := len(z) - 1; i >= 0; i-- { | 
|  | z[i], r = divWW(r, x[i], y, rec) | 
|  | } | 
|  | return r | 
|  | } | 
|  |  | 
|  | // div returns q, r such that q = ⌊uIn/vIn⌋ and r = uIn%vIn = uIn - q·vIn. | 
|  | // It uses z and u as the storage for q and r. | 
|  | // The caller must ensure that len(vIn) ≥ 2 (use divW otherwise) | 
|  | // and that len(uIn) ≥ len(vIn) (the answer is 0, uIn otherwise). | 
|  | func (z nat) divLarge(u, uIn, vIn nat) (q, r nat) { | 
|  | n := len(vIn) | 
|  | m := len(uIn) - n | 
|  |  | 
|  | // Scale the inputs so vIn's top bit is 1 (see “Scaling Inputs” above). | 
|  | // vIn is treated as a read-only input (it may be in use by another | 
|  | // goroutine), so we must make a copy. | 
|  | // uIn is copied to u. | 
|  | shift := nlz(vIn[n-1]) | 
|  | vp := getNat(n) | 
|  | v := *vp | 
|  | shlVU(v, vIn, shift) | 
|  | u = u.make(len(uIn) + 1) | 
|  | u[len(uIn)] = shlVU(u[0:len(uIn)], uIn, shift) | 
|  |  | 
|  | // The caller should not pass aliased z and u, since those are | 
|  | // the two different outputs, but correct just in case. | 
|  | if alias(z, u) { | 
|  | z = nil | 
|  | } | 
|  | q = z.make(m + 1) | 
|  |  | 
|  | // Use basic or recursive long division depending on size. | 
|  | if n < divRecursiveThreshold { | 
|  | q.divBasic(u, v) | 
|  | } else { | 
|  | q.divRecursive(u, v) | 
|  | } | 
|  | putNat(vp) | 
|  |  | 
|  | q = q.norm() | 
|  |  | 
|  | // Undo scaling of remainder. | 
|  | shrVU(u, u, shift) | 
|  | r = u.norm() | 
|  |  | 
|  | return q, r | 
|  | } | 
|  |  | 
|  | // divBasic implements long division as described above. | 
|  | // It overwrites q with ⌊u/v⌋ and overwrites u with the remainder r. | 
|  | // q must be large enough to hold ⌊u/v⌋. | 
|  | func (q nat) divBasic(u, v nat) { | 
|  | n := len(v) | 
|  | m := len(u) - n | 
|  |  | 
|  | qhatvp := getNat(n + 1) | 
|  | qhatv := *qhatvp | 
|  |  | 
|  | // Set up for divWW below, precomputing reciprocal argument. | 
|  | vn1 := v[n-1] | 
|  | rec := reciprocalWord(vn1) | 
|  |  | 
|  | // Compute each digit of quotient. | 
|  | for j := m; j >= 0; j-- { | 
|  | // Compute the 2-by-1 guess q̂. | 
|  | // The first iteration must invent a leading 0 for u. | 
|  | qhat := Word(_M) | 
|  | var ujn Word | 
|  | if j+n < len(u) { | 
|  | ujn = u[j+n] | 
|  | } | 
|  |  | 
|  | // ujn ≤ vn1, or else q̂ would be more than one digit. | 
|  | // For ujn == vn1, we set q̂ to the max digit M above. | 
|  | // Otherwise, we compute the 2-by-1 guess. | 
|  | if ujn != vn1 { | 
|  | var rhat Word | 
|  | qhat, rhat = divWW(ujn, u[j+n-1], vn1, rec) | 
|  |  | 
|  | // Refine q̂ to a 3-by-2 guess. See “Refining Guesses” above. | 
|  | vn2 := v[n-2] | 
|  | x1, x2 := mulWW(qhat, vn2) | 
|  | ujn2 := u[j+n-2] | 
|  | for greaterThan(x1, x2, rhat, ujn2) { // x1x2 > r̂ u[j+n-2] | 
|  | qhat-- | 
|  | prevRhat := rhat | 
|  | rhat += vn1 | 
|  | // If r̂  overflows, then | 
|  | // r̂ u[j+n-2]v[n-1] is now definitely > x1 x2. | 
|  | if rhat < prevRhat { | 
|  | break | 
|  | } | 
|  | // TODO(rsc): No need for a full mulWW. | 
|  | // x2 += vn2; if x2 overflows, x1++ | 
|  | x1, x2 = mulWW(qhat, vn2) | 
|  | } | 
|  | } | 
|  |  | 
|  | // Compute q̂·v. | 
|  | qhatv[n] = mulAddVWW(qhatv[0:n], v, qhat, 0) | 
|  | qhl := len(qhatv) | 
|  | if j+qhl > len(u) && qhatv[n] == 0 { | 
|  | qhl-- | 
|  | } | 
|  |  | 
|  | // Subtract q̂·v from the current section of u. | 
|  | // If it underflows, q̂·v > u, which we fix up | 
|  | // by decrementing q̂ and adding v back. | 
|  | c := subVV(u[j:j+qhl], u[j:], qhatv) | 
|  | if c != 0 { | 
|  | c := addVV(u[j:j+n], u[j:], v) | 
|  | // If n == qhl, the carry from subVV and the carry from addVV | 
|  | // cancel out and don't affect u[j+n]. | 
|  | if n < qhl { | 
|  | u[j+n] += c | 
|  | } | 
|  | qhat-- | 
|  | } | 
|  |  | 
|  | // Save quotient digit. | 
|  | // Caller may know the top digit is zero and not leave room for it. | 
|  | if j == m && m == len(q) && qhat == 0 { | 
|  | continue | 
|  | } | 
|  | q[j] = qhat | 
|  | } | 
|  |  | 
|  | putNat(qhatvp) | 
|  | } | 
|  |  | 
|  | // greaterThan reports whether the two digit numbers x1 x2 > y1 y2. | 
|  | // TODO(rsc): In contradiction to most of this file, x1 is the high | 
|  | // digit and x2 is the low digit. This should be fixed. | 
|  | func greaterThan(x1, x2, y1, y2 Word) bool { | 
|  | return x1 > y1 || x1 == y1 && x2 > y2 | 
|  | } | 
|  |  | 
|  | // divRecursiveThreshold is the number of divisor digits | 
|  | // at which point divRecursive is faster than divBasic. | 
|  | const divRecursiveThreshold = 100 | 
|  |  | 
|  | // divRecursive implements recursive division as described above. | 
|  | // It overwrites z with ⌊u/v⌋ and overwrites u with the remainder r. | 
|  | // z must be large enough to hold ⌊u/v⌋. | 
|  | // This function is just for allocating and freeing temporaries | 
|  | // around divRecursiveStep, the real implementation. | 
|  | func (z nat) divRecursive(u, v nat) { | 
|  | // Recursion depth is (much) less than 2 log₂(len(v)). | 
|  | // Allocate a slice of temporaries to be reused across recursion, | 
|  | // plus one extra temporary not live across the recursion. | 
|  | recDepth := 2 * bits.Len(uint(len(v))) | 
|  | tmp := getNat(3 * len(v)) | 
|  | temps := make([]*nat, recDepth) | 
|  |  | 
|  | z.clear() | 
|  | z.divRecursiveStep(u, v, 0, tmp, temps) | 
|  |  | 
|  | // Free temporaries. | 
|  | for _, n := range temps { | 
|  | if n != nil { | 
|  | putNat(n) | 
|  | } | 
|  | } | 
|  | putNat(tmp) | 
|  | } | 
|  |  | 
|  | // divRecursiveStep is the actual implementation of recursive division. | 
|  | // It adds ⌊u/v⌋ to z and overwrites u with the remainder r. | 
|  | // z must be large enough to hold ⌊u/v⌋. | 
|  | // It uses temps[depth] (allocating if needed) as a temporary live across | 
|  | // the recursive call. It also uses tmp, but not live across the recursion. | 
|  | func (z nat) divRecursiveStep(u, v nat, depth int, tmp *nat, temps []*nat) { | 
|  | // u is a subsection of the original and may have leading zeros. | 
|  | // TODO(rsc): The v = v.norm() is useless and should be removed. | 
|  | // We know (and require) that v's top digit is ≥ B/2. | 
|  | u = u.norm() | 
|  | v = v.norm() | 
|  | if len(u) == 0 { | 
|  | z.clear() | 
|  | return | 
|  | } | 
|  |  | 
|  | // Fall back to basic division if the problem is now small enough. | 
|  | n := len(v) | 
|  | if n < divRecursiveThreshold { | 
|  | z.divBasic(u, v) | 
|  | return | 
|  | } | 
|  |  | 
|  | // Nothing to do if u is shorter than v (implies u < v). | 
|  | m := len(u) - n | 
|  | if m < 0 { | 
|  | return | 
|  | } | 
|  |  | 
|  | // We consider B digits in a row as a single wide digit. | 
|  | // (See “Recursive Division” above.) | 
|  | // | 
|  | // TODO(rsc): rename B to Wide, to avoid confusion with _B, | 
|  | // which is something entirely different. | 
|  | // TODO(rsc): Look into whether using ⌈n/2⌉ is better than ⌊n/2⌋. | 
|  | B := n / 2 | 
|  |  | 
|  | // Allocate a nat for qhat below. | 
|  | if temps[depth] == nil { | 
|  | temps[depth] = getNat(n) // TODO(rsc): Can be just B+1. | 
|  | } else { | 
|  | *temps[depth] = temps[depth].make(B + 1) | 
|  | } | 
|  |  | 
|  | // Compute each wide digit of the quotient. | 
|  | // | 
|  | // TODO(rsc): Change the loop to be | 
|  | //	for j := (m+B-1)/B*B; j > 0; j -= B { | 
|  | // which will make the final step a regular step, letting us | 
|  | // delete what amounts to an extra copy of the loop body below. | 
|  | j := m | 
|  | for j > B { | 
|  | // Divide u[j-B:j+n] (3 wide digits) by v (2 wide digits). | 
|  | // First make the 2-by-1-wide-digit guess using a recursive call. | 
|  | // Then extend the guess to the full 3-by-2 (see “Refining Guesses”). | 
|  | // | 
|  | // For the 2-by-1-wide-digit guess, instead of doing 2B-by-B-digit, | 
|  | // we use a (2B+1)-by-(B+1) digit, which handles the possibility that | 
|  | // the result has an extra leading 1 digit as well as guaranteeing | 
|  | // that the computed q̂ will be off by at most 1 instead of 2. | 
|  |  | 
|  | // s is the number of digits to drop from the 3B- and 2B-digit chunks. | 
|  | // We drop B-1 to be left with 2B+1 and B+1. | 
|  | s := (B - 1) | 
|  |  | 
|  | // uu is the up-to-3B-digit section of u we are working on. | 
|  | uu := u[j-B:] | 
|  |  | 
|  | // Compute the 2-by-1 guess q̂, leaving r̂ in uu[s:B+n]. | 
|  | qhat := *temps[depth] | 
|  | qhat.clear() | 
|  | qhat.divRecursiveStep(uu[s:B+n], v[s:], depth+1, tmp, temps) | 
|  | qhat = qhat.norm() | 
|  |  | 
|  | // Extend to a 3-by-2 quotient and remainder. | 
|  | // Because divRecursiveStep overwrote the top part of uu with | 
|  | // the remainder r̂, the full uu already contains the equivalent | 
|  | // of r̂·B + uₙ₋₂ from the “Refining Guesses” discussion. | 
|  | // Subtracting q̂·vₙ₋₂ from it will compute the full-length remainder. | 
|  | // If that subtraction underflows, q̂·v > u, which we fix up | 
|  | // by decrementing q̂ and adding v back, same as in long division. | 
|  |  | 
|  | // TODO(rsc): Instead of subtract and fix-up, this code is computing | 
|  | // q̂·vₙ₋₂ and decrementing q̂ until that product is ≤ u. | 
|  | // But we can do the subtraction directly, as in the comment above | 
|  | // and in long division, because we know that q̂ is wrong by at most one. | 
|  | qhatv := tmp.make(3 * n) | 
|  | qhatv.clear() | 
|  | qhatv = qhatv.mul(qhat, v[:s]) | 
|  | for i := 0; i < 2; i++ { | 
|  | e := qhatv.cmp(uu.norm()) | 
|  | if e <= 0 { | 
|  | break | 
|  | } | 
|  | subVW(qhat, qhat, 1) | 
|  | c := subVV(qhatv[:s], qhatv[:s], v[:s]) | 
|  | if len(qhatv) > s { | 
|  | subVW(qhatv[s:], qhatv[s:], c) | 
|  | } | 
|  | addAt(uu[s:], v[s:], 0) | 
|  | } | 
|  | if qhatv.cmp(uu.norm()) > 0 { | 
|  | panic("impossible") | 
|  | } | 
|  | c := subVV(uu[:len(qhatv)], uu[:len(qhatv)], qhatv) | 
|  | if c > 0 { | 
|  | subVW(uu[len(qhatv):], uu[len(qhatv):], c) | 
|  | } | 
|  | addAt(z, qhat, j-B) | 
|  | j -= B | 
|  | } | 
|  |  | 
|  | // TODO(rsc): Rewrite loop as described above and delete all this code. | 
|  |  | 
|  | // Now u < (v<<B), compute lower bits in the same way. | 
|  | // Choose shift = B-1 again. | 
|  | s := B - 1 | 
|  | qhat := *temps[depth] | 
|  | qhat.clear() | 
|  | qhat.divRecursiveStep(u[s:].norm(), v[s:], depth+1, tmp, temps) | 
|  | qhat = qhat.norm() | 
|  | qhatv := tmp.make(3 * n) | 
|  | qhatv.clear() | 
|  | qhatv = qhatv.mul(qhat, v[:s]) | 
|  | // Set the correct remainder as before. | 
|  | for i := 0; i < 2; i++ { | 
|  | if e := qhatv.cmp(u.norm()); e > 0 { | 
|  | subVW(qhat, qhat, 1) | 
|  | c := subVV(qhatv[:s], qhatv[:s], v[:s]) | 
|  | if len(qhatv) > s { | 
|  | subVW(qhatv[s:], qhatv[s:], c) | 
|  | } | 
|  | addAt(u[s:], v[s:], 0) | 
|  | } | 
|  | } | 
|  | if qhatv.cmp(u.norm()) > 0 { | 
|  | panic("impossible") | 
|  | } | 
|  | c := subVV(u[0:len(qhatv)], u[0:len(qhatv)], qhatv) | 
|  | if c > 0 { | 
|  | c = subVW(u[len(qhatv):], u[len(qhatv):], c) | 
|  | } | 
|  | if c > 0 { | 
|  | panic("impossible") | 
|  | } | 
|  |  | 
|  | // Done! | 
|  | addAt(z, qhat.norm(), 0) | 
|  | } |