Results 1 to 4 of 4
Thread: 64bit fp precision problem

Enjoy an ad free experience by logging in. Not a member yet? Register.



04252006 #1
 Join Date
 Aug 2005
 Location
 London
 Posts
 3
64bit fp precision problem
I tried this little test program, just to get a feel for roughly how many decimal significant figures of precision I can expect:
a=0.1;
b=1.0;
for (i=0;i<22;i++) {
printf("%3d%75.70llf\n",i,b);
b=1.0+a;
a/=10.0;
}
Now, when a,b are declared as long double, I get this output:
0 1.000000000000000000000000000000000000000000000000 0000000000000000000000
1 1.100000000000000005551115123125782702118158340454 1015625000000000000000
2 1.010000000000000000533427468862868181531666778028 0113220214843750000000
3 1.001000000000000000020816681711721685132943093776 7028808593750000000000
4 1.000100000000000000045449755070592345873592421412 4679565429687500000000
5 1.000010000000000000026229018956769323267508298158 6456298828125000000000
6 1.000001000000000000024306945345387021006899885833 2633972167968750000000
7 1.000000099999999999948220585910263480400317348539 8292541503906250000000
8 1.000000010000000000049032167215301569740404374897 4800109863281250000000
9 1.000000001000000000015745238446385201314114965498 4474182128906250000000
10 1.000000000100000000034100589019203653151635080575 9429931640625000000000
11 1.000000000009999999960041972002500187954865396022 7966308593750000000000
12 1.000000000000999999996004197200250018795486539602 2796630859375000000000
13 1.000000000000099999977916376270314913199399597942 8291320800781250000000
14 1.000000000000010000030317702801596624340163543820 3811645507812500000000
15 1.000000000000000999959663683380739485073718242347 2404479980468750000000
16 1.000000000000000099963440303163508815487148240208 6257934570312500000000
17 1.000000000000000009974659986866640792868565768003 4637451171875000000000
18 1.000000000000000000975781955236953990606707520782 9475402832031250000000
19 1.000000000000000000108420217248550443400745280086 9941711425781250000000
20 1.000000000000000000000000000000000000000000000000 0000000000000000000000
21 1.000000000000000000000000000000000000000000000000 0000000000000000000000
And, when a,b are declared as double (and also changing the format in the printf() from llf to lf), I get this output:
0 1.000000000000000000000000000000000000000000000000 0000000000000000000000
1 1.100000000000000088817841970012523233890533447265 6250000000000000000000
2 1.010000000000000008881784197001252323389053344726 5625000000000000000000
3 1.000999999999999889865875957184471189975738525390 6250000000000000000000
4 1.000099999999999988986587595718447118997573852539 0625000000000000000000
5 1.000010000000000065512040237081237137317657470703 1250000000000000000000
6 1.000000999999999917733362053695600479841232299804 6875000000000000000000
7 1.000000100000000058386717682878952473402023315429 6875000000000000000000
8 1.000000009999999939225290290778502821922302246093 7500000000000000000000
9 1.000000001000000082740370999090373516082763671875 0000000000000000000000
10 1.000000000100000008274037099909037351608276367187 5000000000000000000000
11 1.000000000010000000827403709990903735160827636718 7500000000000000000000
12 1.000000000001000088900582341011613607406616210937 5000000000000000000000
13 1.000000000000099920072216264088638126850128173828 1250000000000000000000
14 1.000000000000009992007221626408863812685012817382 8125000000000000000000
15 1.000000000000001110223024625156540423631668090820 3125000000000000000000
16 1.000000000000000000000000000000000000000000000000 0000000000000000000000
17 1.000000000000000000000000000000000000000000000000 0000000000000000000000
18 1.000000000000000000000000000000000000000000000000 0000000000000000000000
19 1.000000000000000000000000000000000000000000000000 0000000000000000000000
20 1.000000000000000000000000000000000000000000000000 0000000000000000000000
21 1.000000000000000000000000000000000000000000000000 0000000000000000000000
The problem is, there's not as much difference as I expected: with long double it goes off the end of the available precision at about 20 sig. figs, and with double at about 16 sig. figs. A difference of only about 10**4 or 2**14 or less than 2 bytes. Without knowing exactly how the floatingpoint numbers are encoded, I'd still expect several bytes more binary precision in 16byte over 8byte floatingpoint arithmetic.
I can see that there are lots of possible levels at which this poor improvement might result (incorrect use of printf formats, compiler type conversion conventions, kernel/CPU treatment of floatingpoint operations, ...) Can anyone explain it  and, more importantly, what, if anything I can do to get full 16byte floatingpoint arithmetic?
Thanks.
aw99

04262006 #2
 Join Date
 Aug 2005
 Location
 London
 Posts
 3
As a followup to the above, I have found an "answer"  by experiment, however, and I've still got no idea why they chose to design it that way ... so if there are any experts out there your insights would still be appreciated!
It appears that, although a long double on my computer occupies 16 bytes of storage, the floatingpoint representation it contains is only in fact 10 bytes long. The last six bytes of the 16 (or the first six, depending how you look at it) are empty. Without having tested the thing exhaustively, it looks to me like it is using 64 bits of abscissa, 15 bits of exponent and a sign bit, making a total of 80 bits, or 10 bytes. That's only 2 bytes longer than the conventional 8byte double, which explains why I'm only getting about a couple of bytes more binary precision in my calculations. The long double arithmetic is also quite a lot slower than double (80% more time taken, for my arbitrary collection of adds and multiplies), further eroding the benefits of using long doubles over conventional 8byte ones.
aw99

04272006 #3
 Join Date
 Aug 2005
 Location
 Italy
 Posts
 401
About double arithmetic, I can say only to trust about CPU calculations...
If you need speed, fixed point arithmetic is a LOT faster, because based on integer types. I can explain that: you should define a base type for fixed point operations; haveing N bit available, F bits (least significant) are dedicated to fixed point values, while the NF bit (most significant) are dedicated to the integer part and (eventually) sign.
Here is the weight of the bits (assuming 16 bit number, with 8 bits for fixed point, with sign):
bit[8] = 2^0
bit[9] = 2^1
bit[10] = 2^2
...
bit[15] = 2^7
bit[16] = sign (0: positive; 1: negative)
...
bit[7] = 2^1
bit[6] = 2^2
bit[5] = 2^3
...
bit[0] = 2^8
To build a fixed point operation you should use shifting operations: then a generic number fixedpoint number is managed as follow:
[code]
/* Build an integer */
fp = intval << FIXED_POINT_PRECISION;
/* Get integer part */
ip = fp >> FIXED_POINT_PRECISION
/* Addition and subtraction are trasparent */
fp_res = fp1 + fp2;
fp_res = fp3  fp4;
/* Multiplication and division take the double of bits */
fp_res = (fp1 * fp2) >> (FIXED_POINT_PRECISION);
fp_res = (fp1 / fp2) >> (FIXED_POINT_PRECISION);
[code]
I hope you've understood. Now you should realize what ranges of values you need, and what precision you need, so you can determine the number of bits to dedicate to the integer part and fracted part of the fixed point number.When using Windows, have you ever told "Ehi... do your business?"
Linux user #396597 (http://counter.li.org)

04272006 #4
 Join Date
 Oct 2004
 Posts
 158
Okay. First off the limits of accuracy of fp arithmetic is implmentation specific.
Read the /usr/include/limits.h file:
DBL_DIG tells you how many significant digits are in a double  usually fifteen.
DBL_MAX and DBL_MIN tell you the max and min values for the datatype.
If you need extreme precision download gmp  it's a bignum library that works with most platforms.
http://swox.com/gmp/