A Tour of NTL: Summary of Changes
# threads time (ms) effectiveness
1 606
2 310 97%
4 168 90%
8 103 74%
For ZZX:
# threads time (ms) effectiveness
1 461
2 247 93%
4 144 80%
8 96 60%
n 1024 2048 4096 8192 16384 non-AVX 0.171 0.741 3.192 14.348 60.812 AVX512 0.089 0.372 1.648 7.740 35.588
NTL_STD_CXX11=off NTL_THREADS=off NTL_SAFE_VECTORS=off
for (auto&& item : vec) { ... }
I also compared NTL's new mat_ZZ_p multiplication to FLINT's fmpz_mat multiplication. The latter also uses a multi-modular strategy. For n=128,256,512,1024 as above, NTL's code is between 2.7 and 3.1 times faster than FLINT's (and we did not count the time it would take to reduce the entries mod p for the FLINT code).
Part of this may be due to the AVX-enhanced small-prime matrix multiplication code used by NTL, and part may be due to better CRT code.
% configure TUNE=x86 % make -j20then the whole business of configuring and building NTL takes less than a minute :-).
It would be great to discuss these issues with any C++ experts out there!
Fixed a "pseudo" bug in the test script: BitMatTest in make check was reporting "failure", but this was a bug in BitMatTest, not in NTL itself.
NTL_MAXIMIZE_SP_NBITS=off # Allows for 62-bit single-precision moduli on 64-bit platforms. # By default, such moduli are restricted to 60 bits, which # usually gives *slightly* better performance across a range of # of parameters.
This feature is a work in progress. Currently, basic ZZ_pX arithmetic has been thread boosted. More code will be boosted later.
I also expanded the PRG interface a bit: see here for details.
One can still revert to the "classical" (pre-9.0) implementation using double-precision arithmetic (which imposes a 50-bit limit), or to the "long double" implementation introduced in v9.0 (60-bit limit).
Note that one must compile NTL with GMP to get any of these improvements. It would have perhaps been better to use GMP's longlong.h facility instead of relying on compiler support for extended integer types. However, at the moment, it is a bit inconvenient to use longlong.h as a freestanding header file. This might change in the future.
For details, see here, including the comments entitled "Compatibility notes".
Programming notes: MulMod(a, b, n) is equivalent to mulmod_t ninv = PrepMulMod(n); MulMod(a, b, n, ninv). Compared to the older, floating-point implementation, the relative cost of computing ninv is higher in the new regime. In a loop where n is invariant, the compiler should "hoist" the computation of ninv, so it is only done once. However, it is usually better to precompute and store ninv, and use the second form of MulMod, with ninv passed as a parameter (NTL does this internally quite consistently). The performance of MulMod(a, b, n, ninv) is somewhat faster in the new implementation. Where possible, one should use MulModPrecon, which is faster still (useful in situations where both n and b are invariant).
Note that this is an alternative to building NTL against the gf2x library (the latter is currently not thread or exception safe).
Basically, these changes to the interface abstract away some implementation details that arguably should never been there in the first place. By coding to the new interface, NTL clients will be able to benefit from the current and future improvements.
In particular, on 64-bit x86/GCC platforms, single precision moduli can now be up to 60 bits, rather than 50 bits. While some operations may in fact be a little slower, the most important ones (like MulModPrecon) should not be. Using larger moduli speeds up a number of things, like ZZ_pX arithmetic, as fewer primes need to be used in Chinese Remaindering steps. Other applications benefit from larger moduli as well.
It is expected that most NTL clients will not be affected at all. Moreover, any code that needs to be updated will be detected by the compiler, and the updates should be simple and mechanical. There is also a configuration flag that will enable the legacy interface (although this is not recommended practice).
For details, see here, including the comments entitled "Compatibility notes".
The goal since v6.2 has been to allow all modular types (ZZ_p, etc.) and all types derived from them (vectors, polynomials, matrices, etc.) to be safely copy constructed and assigned out of context. Hopefully, this goal has now been reached.
This is another major milestone for NTL, and hence the big version number bump (this will be the last of these big bumps for a while).
Prior to this version, error handling consisted of "abort with an error message". To enable exceptions in NTL, configure with NTL_EXCEPTIONS=on. You will also need a C++11 compiler for this to work properly (and if you don't enable exceptions, any old C++98 compiler will work, as always).
With exceptions enabled, errors are reported by throwing an appropriate exception. Of course, this was the easy part. The hard part was making NTL's code exception safe, which (among other things) means that no resources (i.e., memory) are leaked when an exception is thrown. This required a very painful top-to-bottom scrub of the whole library.
Despite major changes to the code base and many internal interfaces, the external (i.e., documented) interfaces remain completely unchanged.
More details are available here.
This is a major milestone for NTL (and hence a bump in the major version number). However, to actually use it, you will need a "bleeding edge" C++ that supports C++11 concurrency features. Most importantly, the C++11 storage class thread_local needs to be fully and correctly implemented. Some compilers claim to support it, but are very buggy to the point of being useless. All I can say is, as of right now, I have been able to successfully build and test a multi-threaded NTL program using GCC 4.9.2 on a Red Hat Linux distribution. I don't think any pre-4.9.x version of GCC will work. And unfortunatly, I don't think any compiler (GCC or CLANG) on any current Mac will work, but I haven't been able to directly test this.
As time goes on, I expect C++ compilers will provide the necessary support. In the meantime, you can try it out and see if it works for you. Configure with the NTL_THREADS flag turned on and see what happens. The test program ThreadTest that runs as the last step of make check will let you know if it works. If not, you can try building GCC 4.9.2 yourself. It is actually not that hard!
See the portability and implementation section for more information. In any case, if threads don't work for you, just don't use them. Everything still works as before using almost any compiler.
|
T a; Vec<T> v(INIT_SIZE, n, a); |
is equivalent to
|
T a; Vec<T> v; v.SetLength(n, a); |
In both cases, the copy constructor for T is used.
|
ZZ w = ZZ(1); // legal ZZ w(1); // legal ZZ w{1}; // legal in C++11 ZZ w = 1; // not legal |
Also added new names for the "monomial constructors", e.g., ZZX(INIT_MONO, i, c) is now preferred to ZZX(i, c), although the old constructors are still there. There are also new constructors like ZZX(INIT_MONO, i) for making monic monomials.
ZZ_p, zz_p, ZZ_pE, lzz_pE, GF2E,may now be used a bit more flexibly.
It is critical that such objects created under one modulus are not used in any non-trivial way "out of context", i.e., under a different (or undefined) modulus. However, for ease-of-use, some operations may be safely performed out of context. These safe operations now include: the default and copy constructor, the destructor, and the assignment operator. In addition it is generally safe to read any object out of context (i.e., printing it out, or fetching its underlying representive using the rep() function). (In the past, it was generally unsafe to use the the default and copy constructors out of context, which also prevented vectors and polynomials of such objects from being copied out of context.)
The implementations of Vec<ZZ_p> and Vec<GF2E> are still specialized to manage memory more efficiently than in the default implementation of Vec<T>. Contiguous elements in such an array are allocated in a contiguous region of memory. This reduces the number of calls to the memory allocator, and leads to greater locality of reference. A consequence of this implementation is that any calls to SetLength on such a vector will need to use information about the current modulus, and so such calls should only be done "in context". That said, it is still safe to construct a such a vector using the default or copy contructor, and to assign or append one to another "out of context".
|
{ ZZ_pPush push(p); ... } |
will save the current modulus and install p as the new modulus; when the destructor for push is invoked, the old modulus will be re-installed.
There are a few remaining trouble spots I've tagged: these mostly involve lazy initialization of tables; I have a plan for making this code thread safe using nearly lock-free coding techniques.
I will hopefully get this done within the next 6-12 months. One thing that is slowing me down is the lack of availibility of C++11 features that I need to do some of this, but it will come.
The main reason for getting rid of the esoteric compilation modes mentioned above is to make it easier to do this thread-safety work.
|
static void zz_p::UserFFTInit(long p); zz_pContext::zz_pContext(INIT_USER_FFT_TYPE, long p); |
in the lzz_p module.
For backwards compatibilty, all the names that were used in previous versions (e.g., vec_ZZ_p, mat_ZZ_p) have been replaced with appropriate typedefs.
For many years, I resisted the temptation of using templates, because compiler support was very inconsistent. But that no longer seems to be the case.
This change, while rather sweeping, should create very few, if any, incompatibilities with existing software. The biggest issue would be for software that uses the old template-like macros: such macro invocations can simply be replaced with appropriate typedefs.
There are many new conversions provided. Moreover, whenever there is a conversion from a ring R to a ring S, there is a corresponding, coefficiet-wise conversion from the polynomial ring R[X] to the polynomial ring S[X].
In addition, using the template mechanism, there are generic conversions for vectors and matrices. For example, if there is a conversion from S to T, then there is automatically a corresponding component-wise conversion from Vec<S> to Vec<T>.
Thanks to Paul Zimmermann for finding (and fixing) these bugs! Paul has also validated NTL's RR class by cross-checking it with the MPFR library.