Computation of 2700 billion decimal digits of Pi using a Desktop Computer
Fabrice Bellard
Jan 6, 2010
This article describes some of the methods used to set the world record for the
computation of the digits of pi using an inexpensive desktop computer.
1 Notations
We assume that numbers are represented in base B with B = 2^64. A digit in
base B is called a limb. M(n) is the time needed to multiply two n-limb numbers.
We assume that M(Cn) is approximately C·M(n), which means M(n) is mostly linear,
which is the case when handling very large numbers with the Schönhage–Strassen
multiplication [5].
log(n) means the natural logarithm of n. log2(n) is log(n)/log(2).
SI and binary prefixes are used (i.e. 1 TB = 10^12 bytes, 1 GiB = 2^30 bytes).
2 Evaluation of the Chudnovsky series
2.1 Introduction
The digits of pi were computed using the Chudnovsky series [10]
$$\frac{1}{\pi} = \frac{12}{C^{3/2}} \sum_{n=0}^{\infty} \frac{(-1)^n\,(6n)!\,(U + Vn)}{(n!)^3\,(3n)!\,C^{3n}}$$
with
$$U = 13591409 \qquad V = 545140134 \qquad C = 640320.$$
It was evaluated with the binary splitting algorithm. The asymptotic running
time is O(log(n)^2 M(n)) for an n-limb result. It is worse than the asymptotic
running time of the Arithmetic-Geometric Mean algorithms, O(log(n) M(n)), but it
has better locality and many improvements can reduce its constant factor.
Let S(a, b) be defined as
$$S(a, b) = \sum_{n=a}^{b-1} a(n)\,\frac{\prod_{k=a}^{n} g(k)}{\prod_{k=a}^{n} q(k)}$$
If we define
$$G(a, b) = \prod_{k=a}^{b-1} g(k) \qquad Q(a, b) = \prod_{k=a}^{b-1} q(k)$$
and P(a, b) such that
$$S(a, b) = \frac{P(a, b)}{Q(a, b)}$$
then if a < c < b, we have:
$$\frac{P(a, b)}{Q(a, b)} = \frac{P(a, c)}{Q(a, c)} + \frac{G(a, c)\,P(c, b)}{Q(a, c)\,Q(c, b)}$$
hence
$$P(a, b) = P(a, c)\,Q(c, b) + P(c, b)\,G(a, c)$$
which gives a way to compute P recursively. Moreover, both Q and G can be
computed recursively by seeing that:
$$G(a, b) = G(a, c)\,G(c, b) \qquad Q(a, b) = Q(a, c)\,Q(c, b)$$
The binary splitting algorithm computes S(a, b) recursively:
If a = b − 1 then
    G(b − 1, b) ← g(b)
    Q(b − 1, b) ← q(b)
    P(b − 1, b) ← a(b) g(b)
else choose c = ⌊(a + b)/2⌋ and evaluate:
    G(a, b) ← G(a, c) G(c, b)
    Q(a, b) ← Q(a, c) Q(c, b)
    P(a, b) ← P(a, c) Q(c, b) + P(c, b) G(a, c)
For the Chudnovsky series we take:
g(n) = (2n − 1)(6n − 5)(6n − 1)
a(n) = (−1)^n (U + V n)
q(n) = n^3 C^3/24
We then get
$$\pi \approx \frac{Q(0, N)}{12\,P(0, N) + 12\,U\,Q(0, N)}\,C^{3/2}$$
(the 12 U Q(0, N) term accounts for the n = 0 term U of the series, which the
recursion, whose leaves are evaluated at n = b, does not include).
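As a concrete illustration, the recursion above can be sketched in a few lines of Python with arbitrary-precision integers. This is our sketch, not the paper's implementation (which works on disk-backed numbers); the names `bs` and `pi_digits` and the guard-digit choice are ours. The leaves are evaluated at n = b, so the n = 0 term U is added back at the end:

```python
from math import isqrt

U, V, C = 13591409, 545140134, 640320

def bs(a, b):
    """Return (P, Q, G) over the half-open range [a, b); leaves use n = b."""
    if b - a == 1:
        n = b
        g = (2*n - 1) * (6*n - 5) * (6*n - 1)
        return (-1)**n * (U + V*n) * g, n**3 * C**3 // 24, g
    c = (a + b) // 2
    p1, q1, g1 = bs(a, c)
    p2, q2, g2 = bs(c, b)
    # P(a,b) = P(a,c) Q(c,b) + P(c,b) G(a,c)
    return p1*q2 + p2*g1, q1*q2, g1*g2

def pi_digits(digits, guard=10):
    N = digits // 14 + 2                  # ~14.18 digits per term
    P, Q, _ = bs(0, N)
    scale = 10 ** (digits + guard)
    c32 = isqrt(C**3 * scale * scale)     # C^(3/2) scaled by 10^(digits+guard)
    # pi ~ Q C^(3/2) / (12 P + 12 U Q); the 12 U Q term is the n = 0 term
    return Q * c32 // (12 * P + 12 * U * Q)

print(str(pi_digits(40))[:16])  # -> 3141592653589793
```

The same recursion, with the gcd and precomputed-power refinements of the next sections, is what the record computation runs at terabyte scale.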
2.2 Improvements
2.2.1 Common factor reduction
The result is not modified if common factors are removed from the numerator and
the denominator. The recursion becomes:
D ← gcd(G1(a, c), Q1(c, b))
GD ← G1(a, c)/D
QD ← Q1(c, b)/D
G1(a, b) ← GD · G1(c, b)
Q1(a, b) ← Q1(a, c) · QD
P1(a, b) ← P1(a, c) · QD + P1(c, b) · GD
The gcd (Greatest Common Divisor) is computed by explicitly storing the
factorization of G and Q in small prime factors at every step. The factorization
is computed by using a sieve [3]. In [3] the sieve takes a memory of O(N log(N))
which is very large. Our implementation only uses a memory of O(N1/2 log(N)),
which is negligible compared to the other memory use.
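To illustrate the idea (our sketch, not the paper's code): when G and Q are kept with explicit factorizations into small primes, the gcd becomes a per-prime minimum of exponents and the division a per-prime subtraction, with no big-integer arithmetic at all:

```python
from collections import Counter

def gcd_factored(fg, fq):
    """fg, fq: Counter {prime: exponent}. Return their gcd as a Counter."""
    return Counter({p: min(e, fq[p]) for p, e in fg.items() if p in fq})

def divide_factored(f, d):
    """Divide a factored value by a factored divisor (exponent subtraction)."""
    r = Counter(f)
    r.subtract(d)
    return Counter({p: e for p, e in r.items() if e})

fg = Counter({2: 3, 3: 1, 7: 2})   # G = 2^3 * 3 * 7^2 = 1176
fq = Counter({2: 1, 7: 3})         # Q = 2 * 7^3 = 686
d = gcd_factored(fg, fq)           # gcd(1176, 686) = 2 * 7^2 = 98
print(sorted(d.items()))           # -> [(2, 1), (7, 2)]
```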
As
$$\frac{(6n)!}{(n!)^3\,(3n)!} = \prod_{k=1}^{n} \frac{24\,(2k-1)(6k-5)(6k-1)}{k^3}$$
is an integer, it is useful to note that the final denominator Q(0, N) of the sum
divides C^{3N} when all the common factors have been removed.
So the maximum size of the final P(0, N) can easily be evaluated. Here it is
log(C)/log(C/12) ≈ 1.22 times the size of the final result.
2.2.2 Precomputed powers of C^3
It is not efficient to recompute the powers of C^3/24 at every step of the recursion.
So we precompute them separately.
Another point is that the final value of Q is known to divide C^{3N}. So in order
to have a bounded size result, it is interesting to precompute the powers of C^3
instead of C^3/24. Some space is lost for the storage of G, but the overall memory
usage is smaller.
The recursion becomes:
g2(n) = 24(2n − 1)(6n − 5)(6n − 1)
a2(n) = (−1)^n (U + V n)
q2(n) = n^3
D ← gcd(G2(a, c), Q2(c, b))
GD ← G2(a, c)/D
QD ← Q2(c, b)/D
G2(a, b) ← GD · G2(c, b)
Q2(a, b) ← Q2(a, c) · QD
P2(a, b) ← P2(a, c) · QD · C^{3(b−c)} + P2(c, b) · GD
C^{3n} needs only be computed for log2(N) exponents if N is a power of two. If
N is not a power of two, it is also possible to use ⌈log2(N)⌉ exponents by choosing
c correctly at every step.
2.3 Multi-threading
When the operands are small enough, different parts of the binary splitting
recursion are executed in different threads to optimally use the CPU cores. When
the operands get bigger, a single thread is used, but the large multiplications are
multi-threaded.
2.4 Restartability
When operands are stored on disk, each step of the computation is implemented
so that it is restartable. It means that if the program is stopped (for example
because of a power loss or a hardware failure), it can be restarted from a known
synchronization point.
3 Verification
3.1 Binary result
The usual way is to compute the same result using a different formula. It is then
highly unlikely that a software or hardware bug modifies both results in the same way.
Our original plan was to use the Ramanujan formula, which is very similar to
the Chudnovsky one but less efficient (8 digits per term instead of 14). However,
a hard disk failure on a second PC doing the verification delayed it and we did
not want to spend more time on it.
Instead of redoing the full computation, we decided to only compute the last
binary digits of the result using the Borwein-Bailey-Plouffe algorithm [8] with a
formula we found which is slightly faster (and easier to prove) than the original
BBP formula [9].
Of course, verifying the last digits only demonstrates that it is very unlikely
that the overall algorithm is incorrect, but hardware errors can still have modified
the previous digits. So we used a second proof relying on the fact that the last
computation step of the Chudnovsky formula is implemented as a floating point
multiplication of two N-limb floating point numbers C1 and C2. We suppose here
that the floating point results are represented as size-N integers. Then:
$$R = C_1 C_2 \qquad (R \text{ has } 2N \text{ limbs})$$
$$\pi B^N \approx \lfloor R/B^N \rfloor \qquad (\text{truncation keeping the high } N \text{ limbs})$$
where C2/B^N ≈ C^{3/2} and C1/B^N is the result of the binary splitting.
If we assume that C1 and C2 are random like, then the last digits of the result
of the multiplication depend on all the input digits of C1 and C2.
More precisely, if there is an additive error e·B^k in C1 with −B < e < B and
0 ≤ k < N, then
$$R = (C_1 + e B^k)\,C_2 = C_1 C_2 + e\,C_2\,B^k$$
The fact that C2 has random-like digits ensures that digits near position N
in R are modified (multiplying e·C2 by B^k can be seen as a left shift by k limbs
of e·C2).
So it remains to prove that the multiplication itself was done without error.
To do it, we check the 2N-limb result R (before doing the truncation) modulo a
small number p:
c1 ← C1 mod p
c2 ← C2 mod p
r1 ← c1 c2 mod p
r2 ← R mod p
The check consists in verifying that r1 = r2. p is chosen as a prime number
so that the modulus depends on all the limbs of the input numbers. Its size was
fixed to 64 bits, which gives a negligible probability of not catching an error.
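The check above is cheap to model. In this sketch (ours; a 61-bit Mersenne prime stands in for the 64-bit prime of the actual run, and the operand sizes are illustrative), any additive corruption of the product is caught unless it happens to be a multiple of p:

```python
import random

random.seed(1)
p = (1 << 61) - 1                # a Mersenne prime, stand-in for the 64-bit prime
C1 = random.getrandbits(4096)
C2 = random.getrandbits(4096)
R = C1 * C2                      # the big multiplication under test

r1 = (C1 % p) * (C2 % p) % p     # cheap: reduce operands first, then multiply
r2 = R % p
assert r1 == r2                  # the multiplication checks out

# a corrupted result is detected, since the error is nonzero mod p:
assert (R + 123456789) % p != r1
```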
3.2 Base conversion
The binary to base 10 conversion also needs to be verified. The common way is to
convert back the result from base 10 to binary and to verify that it matches the
original binary result.
We implemented this method, but it would have doubled the computation
time. Instead we did a more efficient (but less direct) verification:
Let R be the N-limb floating point binary result and R2 be the base 10 result
expressed as an N2-limb floating point number in base B2 (B2 = 10^19 in our case).
Assuming that R2 is normalized between 1/B2 and 1, we want to verify that
$$\lfloor R\,B_2^{N_2} \rfloor = \lfloor R_2\,B_2^{N_2} \rfloor$$
neglecting possible rounding errors. Doing the full computation would need
converting R2·B2^{N2} back to an integer, which does not save time. So we check
the equality modulo a small prime p.
We compute r1 = ⌊R·B2^{N2}⌋ mod p the usual way (cost of O(M(N))).
r2 = ⌊R2·B2^{N2}⌋ mod p is evaluated by considering ⌊R2·B2^{N2}⌋ as a polynomial
in B2 evaluated modulo p using the Horner algorithm. It has a cost of O(N)
assuming that single word operations have a running time of O(1).
Then the check consists in verifying that r1 = r2.
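The Horner evaluation costs one multiply-add per base-B2 limb. A minimal sketch (ours; the prime and the limb values are illustrative, not those of the computation):

```python
B2 = 10**19                      # the base of the decimal limbs
p = 1000000007                   # illustrative small prime

def horner_mod(limbs, p):
    """limbs, most significant first: evaluate sum(limbs[i] * B2^(n-1-i)) mod p."""
    r = 0
    for d in limbs:
        r = (r * B2 + d) % p     # one multiply-add per limb: O(N) total
    return r

limbs = [3, 1415926535897932384, 6264338327950288419]
n = len(limbs)
# check against the direct (expensive) evaluation
value = sum(d * B2**(n - 1 - i) for i, d in enumerate(limbs))
assert horner_mod(limbs, p) == value % p
```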
4 Fast multiplication algorithm
4.1 Small sizes
For small sizes (a few limbs), the naive O(n^2) method is used. For larger sizes (up
to a few tens of limbs), the Karatsuba method is used (O(n^{log2(3)}) running time).
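For reference, the Karatsuba method trades four half-size multiplications for three. A minimal sketch (ours) on nonnegative Python integers; real limb code works in place on arrays:

```python
def karatsuba(x, y, cutoff_bits=64):
    """Multiply nonnegative integers with three recursive half-size products."""
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y                                  # naive base case
    m = max(x.bit_length(), y.bit_length()) // 2
    xh, xl = x >> m, x & ((1 << m) - 1)
    yh, yl = y >> m, y & ((1 << m) - 1)
    z0 = karatsuba(xl, yl)
    z2 = karatsuba(xh, yh)
    z1 = karatsuba(xl + xh, yl + yh) - z0 - z2        # three multiplies, not four
    return (z2 << (2 * m)) + (z1 << m) + z0

a, b = 3**200, 7**150
assert karatsuba(a, b) == a * b
```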
4.2 Large sizes (operands fit in RAM)
For larger sizes, the multiplication is implemented as a cyclic convolution of poly-
nomials. The cyclic convolution of polynomials is implemented with floating point
Discrete Fourier Transforms (DFT) [5].
Instead of doing more complicated real valued DFTs, we do complex DFTs and
use a dyadic multiplication instead of the point-wise multiplication. Negacyclic
convolutions use a similar trick [2].
To reduce the memory usage, large multiplications giving a result of size
2N limbs are split into two multiplications modulo B^N − 1 (cyclic convolution)
and modulo B^N + 1 (negacyclic convolution). The two results are combined using
the Chinese Remainder Theorem (CRT) to get a multiplication modulo B^{2N} − 1.
Although it is possible to design the floating point DFT convolution with
guaranteed error bounds [4] [1], doing so severely reduces its performance. Instead,
we use the fact that the input numbers are random-like and we use
a DFT limb size found empirically. We systematically use a fast modulo (B − 1)
checksum for the cyclic convolution to detect most rounding errors. We made the
assumption that any remaining rounding errors would be detected by the verification
steps.
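The modulo (B − 1) checksum is cheap because B ≡ 1 (mod B − 1), so the checksum of a limb array is just the sum of its limbs mod B − 1, and it is multiplicative across a product. A small sketch (ours):

```python
B = 1 << 64

def limbs_of(x):
    """Split a nonnegative integer into base-2^64 limbs, least significant first."""
    out = []
    while x:
        out.append(x & (B - 1))
        x >>= 64
    return out or [0]

def checksum(limbs):
    # B = 1 mod (B-1), so the value mod (B-1) is the limb sum mod (B-1)
    return sum(limbs) % (B - 1)

x, y = 123456789**20, 987654321**20
cx, cy = checksum(limbs_of(x)), checksum(limbs_of(y))
# the checksum of the product must equal the product of checksums mod (B-1)
assert checksum(limbs_of(x * y)) == (cx * cy) % (B - 1)
```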
A DFT of size N = n1·n2 is evaluated by doing n2 DFTs of size n1, N multiplications
by complex exponentials (usually called twiddle factors) and n1 DFTs of size n2.
The method can be applied recursively [6].
When N is small, we use n1 or n2 = 2, 3, 5 or 7, which gives well-known Fast
Fourier Transform algorithms using decimation in time or decimation in frequency.
Using factors other than 2 allows N to be chosen very close to the actual
size of the result, which is important to reduce the memory usage and to increase
the speed.
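A toy model of this n1·n2 factorization (ours, in decimation-in-time order, so the small DFTs run in the transposed order of the enumeration above; the production code is an iterative, cache-blocked variant), checked against a direct DFT:

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT, used as reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dft_factored(x, n1, n2):
    n = n1 * n2
    w = lambda e: cmath.exp(-2j * cmath.pi * e / n)
    # n1 DFTs of size n2 on the strided subsequences x[j1::n1]
    a = [dft(x[j1::n1]) for j1 in range(n1)]
    # N multiplications by the twiddle factors w^(j1*k2)
    b = [[a[j1][k2] * w(j1 * k2) for k2 in range(n2)] for j1 in range(n1)]
    # n2 DFTs of size n1; output index k = k2 + n2*k1
    out = [0j] * n
    for k2 in range(n2):
        col = dft([b[j1][k2] for j1 in range(n1)])
        for k1 in range(n1):
            out[k2 + n2 * k1] = col[k1]
    return out

x = [complex(j, 0) for j in range(12)]
assert all(abs(u - v) < 1e-6 for u, v in zip(dft(x), dft_factored(x, 3, 4)))
```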
When N is large, a better CPU cache usage is obtained by choosing factors so
that each DFT fits in the L1 cache. The smaller DFTs are evaluated using several
threads to use all the CPU cores. Since the task is mostly I/O bound, it does not
scale linearly with the number of CPUs.
Another important detail is how the complex exponentials are computed. For
the small DFTs fitting in the L1 cache, a table caches all the possible values.
The twiddle factors of the large DFTs are evaluated on the fly because it would
use too much memory to save them (the binary splitting uses multiplications of
very different sizes). The twiddle factors are evaluated using multiplication
trees, ensuring there is little error propagation. The intermediate complex
values are stored using 80-bit floating point numbers, which are natively
supported by x86 CPUs.
All the other floating point computations are achieved using 64 bit floating
point numbers. All the operations are vectorized using the x86 SSE3 instruction
set.
4.3 Huge sizes (operands are on disk)
4.3.1 Main parameters
For huge sizes (hundreds of gigabytes as of year 2009), the main goal was to reduce
the amount of disk I/O. A key parameter is what we call the DFT expansion ratio
r, which is the ratio between the DFT word size and the DFT limb size.
For example, for DFTs of 2^27 complex samples, the expansion ratio is about
4.5.
The DFT expansion ratio cannot be smaller than two because the result of the
convolution needs at least twice the size of the DFT limb size. A ratio close to 2
can be obtained by using large DFT word sizes. Using the hardware of year 2009,
it is easier to do it using a Number Theoretic Transform (NTT) which is a DFT
in Z/nZ where n is chosen so that there are primitive roots of unity of order N ,
where N is the size of the NTT.
A possible choice is n = 2^{2^k} + 1, which gives the Schönhage–Strassen
multiplication algorithm. However, it is difficult to make it efficient for all
result sizes N, although it can be improved by using the Chinese Remainder
Theorem (CRT) as described in [7].
Another choice is to use several small primes n and to combine the results with
the CRT. The advantage is that the same multi-threading and cache optimizations
as for the floating point DFT can be reused, and that it is possible to find an
NTT size N very close to the needed result size. The drawback is that the
computation involves modular arithmetic on small numbers, which is not as
efficient as floating point arithmetic on x86 CPUs of the year 2009.
However, it turns out that the latest x86 CPUs are very fast at doing 64 bit by
64 bit multiplications giving a 64 bit result. We improved the NTT speed by
using the fact that most modular multiplications are modular multiplications by
a constant. This trick is attributed to Victor Shoup and is described in section
4.2 of [11].
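A sketch of the constant-multiplication trick (our Python model of what is plain 64-bit integer arithmetic in C; the modulus and constant below are illustrative). Precompute w' = ⌊w·2^64/p⌋; then q = (x·w') >> 64 is within one of ⌊x·w/p⌋, so a single conditional subtraction yields the exact remainder with no division at runtime:

```python
def shoup_precompute(w, p):
    """Precompute w' = floor(w * 2^64 / p) once per constant w."""
    return (w << 64) // p

def mulmod_shoup(x, w, w_shoup, p):
    q = (x * w_shoup) >> 64     # high part of a 64x64-bit product in C
    r = x * w - q * p           # two low 64-bit products in C; 0 <= r < 2p
    return r - p if r >= p else r

p = (1 << 62) - 57              # illustrative modulus below 2^62
w = 123456789123456789 % p
ws = shoup_precompute(w, p)
for x in range(0, p, p // 7):
    assert mulmod_shoup(x, w, ws, p) == x * w % p
```

The point is that the expensive division by p moves into the one-time precomputation of w', which pays off because the NTT multiplies many samples by the same twiddle constant.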
We chose eight 64-bit moduli, which gives a DFT expansion ratio of about 2.2
for NTTs of a few terabytes. Using more moduli would bring the ratio closer to 2,
but it complicates the final CRT step.
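The final CRT step recombines, for each coefficient, its residues modulo the word-sized moduli into the exact value. A minimal sketch (ours; the moduli below are illustrative pairwise-coprime examples, not the eight used in the computation):

```python
from math import prod

def crt(residues, moduli):
    """Recover x mod prod(moduli) from its residues (moduli pairwise coprime)."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # pow(Mi, -1, m): modular inverse
    return x % M

moduli = [2**61 - 1, 2**31 - 1, 4294967291]   # pairwise coprime
value = 123456789012345678901234567890
res = [value % m for m in moduli]
assert crt(res, moduli) == value               # value < prod(moduli)
```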
4.3.2 NTT convolution
The NTT size N is split into two factors n1 and n2. n2 is chosen as large as possible
with the constraint that an NTT of size n2 can be done in memory. The resulting
convolution is split into 3 steps:
1. n2 NTTs of size n1 on both operands, multiplication by the twiddle factors
2. n1 NTTs of size n2 on both operands, pointwise multiplication, n1 inverse
NTTs of size n2
3. multiplication by the twiddle factors, n2 NTTs of size n1 on the result.
Step (2) involves only linear disk accesses, which are fast. Steps (1) and (3)
involve more disk seeks as n1 gets bigger, because the NTTs must be done using
samples separated by n2 samples.
If n1 is too big, it is split again in two steps, which gives a 5-step convolution
algorithm.
For the 3-step algorithm, the amount of disk I/O is 12r + 4, assuming two input
operands of unit size and a DFT expansion ratio of r.
The needed disk space, assuming the operands are destroyed, is 4r + 1.
If the disk space is limited, we use the same CRT trick as for the floating point
DFT by doing two multiplications modulo B^N − 1 and B^N + 1. The amount of
disk I/O is then 12r + 10, but the needed disk space is lowered to 2r + 2.
The disk space can be further reduced at the expense of more I/Os by using
more than two moduli.
5 Other operations
5.1 Division
The evaluation of the Chudnovsky formula involves a lot of integer divisions
because of the GCD trick, so having a good division algorithm is critical. We
implemented the one described in [1] using the concept of middle product. Then
the floating point inversion has a cost of 2 multiplications and the division a cost
of 2.5 multiplications.
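The division of [1] rests on Newton's iteration for the reciprocal, which doubles the correct precision at every step. Here is a minimal integer sketch of that iteration (our naming, and without the middle-product optimization that yields the 2 and 2.5 multiplication costs):

```python
def invert(d, k):
    """Return floor(2^(2k) / d) for a k-bit integer d by Newton's iteration."""
    if k <= 32:
        return (1 << (2 * k)) // d          # small base case: direct division
    h = k // 2 + 1
    # reciprocal of the top h bits of d, at half precision
    x = invert(d >> (k - h), h) << (k - h)
    # one Newton step x <- 2x - d*x^2 (scaled), doubling the precision
    x = 2 * x - ((d * x * x) >> (2 * k))
    # fix the few units of error left by the truncations
    while x * d > (1 << (2 * k)):
        x -= 1
    while (x + 1) * d <= (1 << (2 * k)):
        x += 1
    return x

d = 1234567891234567891
k = d.bit_length()
assert invert(d, k) == (1 << (2 * k)) // d
```

A division a/d then costs one such inversion plus one multiplication by the reciprocal, which is where the middle product saves work in the real implementation.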
5.2 Base conversion
The usual base conversion algorithm for huge numbers involves recursively dividing
by powers of the new base. The running time is 0.5D(n/2) log2(n) [1] when n is
a power of two and D(n) is the time needed to divide a 2n limb number by a n
limb number.
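The recursive scheme can be sketched as follows (our illustration; a real implementation precomputes the powers 10^(2^i) and uses balanced splits instead of recomputing the divisor at every call):

```python
def to_decimal(x, digits):
    """Render x (with x < 10^digits) as exactly `digits` decimal characters."""
    if digits <= 18:
        return str(x).zfill(digits)          # fits in a native conversion
    h = digits // 2
    # split by a power of the new base and recurse on both halves
    hi, lo = divmod(x, 10 ** (digits - h))
    return to_decimal(hi, h) + to_decimal(lo, digits - h)

n = 2**256
assert to_decimal(n, len(str(n))) == str(n)
print(to_decimal(12345, 8))  # -> 00012345
```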
We implemented a better algorithm by using the Bernstein scaled remainder
tree and the middle product concept, as suggested in exercise 1.32 of [1]. The
running time becomes 0.5M(n/2) log2(n).
6 Pi computation parameters
We describe the various parameters to compute 2700 billion decimal digits of pi.
6.1 Global memory use
The full pi computation, including the base conversion, uses a total storage of 6.42
times the size of the final result. Here the result is 1.12 TB, hence at least 7.2
TB of mass storage is needed.
It is possible to further reduce the memory usage by using more moduli in the
large multiplications (see section 4.2).
6.2 Hardware
A standard desktop PC was used, using a Core i7 CPU at 2.93 GHz. This CPU
contains 4 physical cores.
We put only 6 GiB of RAM in it to reduce the cost. This means the amount of RAM
is about 170 times smaller than the size of the final result, so the I/O performance
of the mass storage is critical. Unfortunately, the RAM had no ECC (Error
Correcting Code), so random bit errors could be neither corrected nor detected. Since
the computation lasted more than 100 days, such errors were likely [12]. Fortunately,
the computations included verification steps which could detect such errors.
For the mass storage, five 1.5 TB hard disks were used. The aggregated peak
I/O speed is about 500 MB/s.
6.3 Computation time
• Computation of the binary digits: 103 days
• Verification of the binary digits: 13 days
• Conversion to base 10: 12 days
• Verification of the conversion: 3 days
7 Ideas for future work
Here are some potentially interesting topics:
• Study the theoretical running time and memory usage of the evaluation of
the Chudnovsky series using the binary splitting and the mentioned improvements.
• See if the Schönhage–Strassen multiplication can be faster than the NTT
with small moduli when using mass storage.
• Speed up the computations using a network of PCs and/or many-core CPUs.
• Lower the memory usage of the pi computation.
8 Conclusion
We showed that a careful implementation of the latest arbitrary-precision
arithmetic algorithms allows computations on numbers in the terabyte range with
inexpensive hardware resources.
References
[1] Richard Brent and Paul Zimmermann, Modern Computer Arithmetic, version
0.4, November 2009,
http://www.loria.fr/~zimmerma/mca/mca-0.4.pdf .
[2] Richard Crandall, Integer convolution via split-radix fast Galois transform,
Feb 1999,
http://people.reed.edu/~crandall/papers/confgt.pdf .
[3] Hanhong Xue, gmp-chudnovsky.c program to compute the digits of Pi using
the GMP library, http://gmplib.org/pi-with-gmp.html .
[4] Richard P. Brent, Colin Percival, and Paul Zimmermann, Error bounds
on complex floating-point multiplication. Mathematics of Computation,
76(259):1469-1481, 2007.
[5] Arnold Schönhage and Volker Strassen, Schnelle Multiplikation großer
Zahlen, Computing 7:281-292, 1971.
[6] James W. Cooley and John W. Tukey, An algorithm for the machine calculation
of complex Fourier series, Math. Comput. 19, 297-301 (1965).
[7] Pierrick Gaudry, Alexander Kruppa, Paul Zimmermann, A GMP-based
Implementation of Schönhage–Strassen's Large Integer Multiplication Algorithm,
ISSAC'07.
[8] David H. Bailey, Peter B. Borwein, and Simon Plouffe, On the Rapid
Computation of Various Polylogarithmic Constants, April 1997, Mathematics of
Computation 66 (218): 903-913.
[9] Fabrice Bellard, A new formula to compute the n’th binary digit of Pi, 1997,
http://bellard.org/pi/pi_bin.pdf .
[10] D. V. Chudnovsky and G. V. Chudnovsky, Approximations and complex
multiplication according to Ramanujan, in Ramanujan Revisited, Academic Press
Inc., Boston, (1988), p. 375-396 & p. 468-472.
[11] Tommy Färnqvist, Number Theory Meets Cache Locality - Efficient
Implementation of a Small Prime FFT for the GNU Multiple Precision Arithmetic
Library, 2005,
http://www.nada.kth.se/utbildning/grukth/exjobb/rapportlistor/
2005/rapporter05/farnqvist_tommy_05091.pdf .
[12] Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber, DRAM Errors in
the Wild: A Large-Scale Field Study, SIGMETRICS/Performance’09, 2009.