  1. #1
    Just Joined!
    Join Date
    Aug 2005
    Location
    London
    Posts
    3

    64bit fp precision problem


    I've got a workstation at work with (as far as I can tell) 64-bit Red Hat WS 3 on a 64-bit AMD Opteron-based machine, using whatever gcc version came with it. sizeof(double) returns 8 and sizeof(long double) returns 16. So far, so good. I'm trying to get as much floating-point precision as I easily can (i.e. without writing my own floating-point arithmetic).

    I tried this little test program, just to get a feel for roughly how many decimal significant figures of precision I can expect:

    #include <stdio.h>

    int main(void)
    {
        long double a = 0.1, b = 1.0;   /* the second run uses plain double instead */
        int i;
        for (i = 0; i < 22; i++) {
            printf("%3d%75.70Lf\n", i, b);   /* %Lf is the standard length modifier for long double */
            b = 1.0 + a;
            a /= 10.0;
        }
        return 0;
    }

    Now, when a,b are declared as long double, I get this output:

     0 1.0000000000000000000000000000000000000000000000000000000000000000000000
     1 1.1000000000000000055511151231257827021181583404541015625000000000000000
     2 1.0100000000000000005334274688628681815316667780280113220214843750000000
     3 1.0010000000000000000208166817117216851329430937767028808593750000000000
     4 1.0001000000000000000454497550705923458735924214124679565429687500000000
     5 1.0000100000000000000262290189567693232675082981586456298828125000000000
     6 1.0000010000000000000243069453453870210068998858332633972167968750000000
     7 1.0000000999999999999482205859102634804003173485398292541503906250000000
     8 1.0000000100000000000490321672153015697404043748974800109863281250000000
     9 1.0000000010000000000157452384463852013141149654984474182128906250000000
    10 1.0000000001000000000341005890192036531516350805759429931640625000000000
    11 1.0000000000099999999600419720025001879548653960227966308593750000000000
    12 1.0000000000009999999960041972002500187954865396022796630859375000000000
    13 1.0000000000000999999779163762703149131993995979428291320800781250000000
    14 1.0000000000000100000303177028015966243401635438203811645507812500000000
    15 1.0000000000000009999596636833807394850737182423472404479980468750000000
    16 1.0000000000000000999634403031635088154871482402086257934570312500000000
    17 1.0000000000000000099746599868666407928685657680034637451171875000000000
    18 1.0000000000000000009757819552369539906067075207829475402832031250000000
    19 1.0000000000000000001084202172485504434007452800869941711425781250000000
    20 1.0000000000000000000000000000000000000000000000000000000000000000000000
    21 1.0000000000000000000000000000000000000000000000000000000000000000000000

    And when a and b are declared as double (with the format in the printf() changed from %Lf to %f), I get this output:

     0 1.0000000000000000000000000000000000000000000000000000000000000000000000
     1 1.1000000000000000888178419700125232338905334472656250000000000000000000
     2 1.0100000000000000088817841970012523233890533447265625000000000000000000
     3 1.0009999999999998898658759571844711899757385253906250000000000000000000
     4 1.0000999999999999889865875957184471189975738525390625000000000000000000
     5 1.0000100000000000655120402370812371373176574707031250000000000000000000
     6 1.0000009999999999177333620536956004798412322998046875000000000000000000
     7 1.0000001000000000583867176828789524734020233154296875000000000000000000
     8 1.0000000099999999392252902907785028219223022460937500000000000000000000
     9 1.0000000010000000827403709990903735160827636718750000000000000000000000
    10 1.0000000001000000082740370999090373516082763671875000000000000000000000
    11 1.0000000000100000008274037099909037351608276367187500000000000000000000
    12 1.0000000000010000889005823410116136074066162109375000000000000000000000
    13 1.0000000000000999200722162640886381268501281738281250000000000000000000
    14 1.0000000000000099920072216264088638126850128173828125000000000000000000
    15 1.0000000000000011102230246251565404236316680908203125000000000000000000
    16 1.0000000000000000000000000000000000000000000000000000000000000000000000
    17 1.0000000000000000000000000000000000000000000000000000000000000000000000
    18 1.0000000000000000000000000000000000000000000000000000000000000000000000
    19 1.0000000000000000000000000000000000000000000000000000000000000000000000
    20 1.0000000000000000000000000000000000000000000000000000000000000000000000
    21 1.0000000000000000000000000000000000000000000000000000000000000000000000

    The problem is, there's not as much difference as I expected: with long double the values run out of precision at about 20 significant figures, and with double at about 16. That's a difference of only about 10**4, or roughly 2**13 - less than 2 bytes. Without knowing exactly how the floating-point numbers are encoded, I'd still expect several bytes more binary precision from 16-byte than from 8-byte floating-point arithmetic.

    I can see that there are lots of levels at which this poor improvement might arise (incorrect use of printf formats, compiler type-conversion conventions, kernel/CPU handling of floating-point operations, ...). Can anyone explain it - and, more importantly, tell me what, if anything, I can do to get full 16-byte floating-point arithmetic?

    Thanks.

    aw99

  2. #2
    Just Joined!
    Join Date
    Aug 2005
    Location
    London
    Posts
    3
    As a follow-up to the above, I have found an "answer" - by experiment, though, so I've still got no idea why they chose to design it that way ... if there are any experts out there, your insights would still be appreciated!

    It appears that, although a long double on my computer occupies 16 bytes of storage, the floating-point representation it contains is in fact only 10 bytes long. The last six bytes of the 16 (or the first six, depending on how you look at it) are just padding. Without having tested the thing exhaustively, it looks to me like it is using 64 bits of mantissa, 15 bits of exponent and a sign bit, making a total of 80 bits, or 10 bytes. That's only 2 bytes longer than the conventional 8-byte double, which explains why I'm only getting about a couple of bytes more binary precision in my calculations. The long double arithmetic is also quite a lot slower than double (80% more time taken, for my arbitrary collection of adds and multiplies), further eroding the benefit of using long doubles over conventional 8-byte ones.
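
    One rough way to check this is to dump the raw bytes of a long double and print the mantissa width that <float.h> reports; on x86 hardware LDBL_MANT_DIG should come out as 64, with the last six of the 16 bytes being padding:

    [code]
    #include <stdio.h>
    #include <string.h>
    #include <float.h>

    int main(void)
    {
        long double x = 1.1L;
        unsigned char bytes[sizeof(long double)];
        size_t i;

        memcpy(bytes, &x, sizeof x);
        printf("sizeof(long double) = %zu, LDBL_MANT_DIG = %d bits, LDBL_DIG = %d digits\n",
               sizeof(long double), LDBL_MANT_DIG, LDBL_DIG);
        for (i = 0; i < sizeof bytes; i++)
            printf("%02x ", bytes[i]);      /* little-endian: the padding bytes come last */
        printf("\n");
        return 0;
    }
    [/code]

    The padding bytes may contain whatever garbage happened to be in that memory, which is another hint that they carry no part of the value.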

    aw99

  3. #3
    Linux User
    Join Date
    Aug 2005
    Location
    Italy
    Posts
    401
    About double arithmetic, I can only say to trust the CPU's calculations...

    If you need speed, fixed-point arithmetic is a LOT faster, because it is based on integer types. I can explain that: you define a base type for fixed-point operations; having N bits available, the F least-significant bits are dedicated to the fractional part, while the remaining N-F most-significant bits are dedicated to the integer part and (possibly) the sign.

    Here are the weights of the bits (assuming a 16-bit number, with 8 bits for the fractional part, plus a sign):
    bit[8] = 2^0
    bit[9] = 2^1
    bit[10] = 2^2
    ...
    bit[14] = 2^6
    bit[15] = sign (0: positive; 1: negative)
    bit[7] = 2^-1
    bit[6] = 2^-2
    bit[5] = 2^-3
    ...
    bit[0] = 2^-8

    To build fixed-point operations you use shift operations; a generic fixed-point number is then handled as follows:
    [code]
    /* 16-bit signed values with 8 fractional bits (int32_t is from <stdint.h>) */
    #define FIXED_POINT_PRECISION 8

    /* Build a fixed-point number from an integer */
    fp = intval << FIXED_POINT_PRECISION;
    /* Get the integer part back */
    ip = fp >> FIXED_POINT_PRECISION;
    /* Addition and subtraction are transparent */
    fp_res = fp1 + fp2;
    fp_res = fp3 - fp4;
    /* Multiplication needs twice the bits for the intermediate result */
    fp_res = ((int32_t)fp1 * fp2) >> FIXED_POINT_PRECISION;
    /* Division: widen and pre-scale the dividend, then divide */
    fp_res = ((int32_t)fp1 << FIXED_POINT_PRECISION) / fp2;
    [/code]

    I hope that's clear. Now you need to work out what range of values and what precision you need, so you can decide how many bits to dedicate to the integer part and the fractional part of the fixed-point number; there's a small worked example below.
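
    For instance, here's a rough sketch of that layout in use (the fixed_t type and the FP() conversion macro are just names made up for the example), multiplying 3.25 by 2.5 in 8.8 fixed point:

    [code]
    #include <stdio.h>
    #include <stdint.h>

    #define FIXED_POINT_PRECISION 8
    typedef int16_t fixed_t;                  /* 1 sign bit, 7 integer bits, 8 fraction bits */

    /* Convert a constant to fixed point (illustrative helper, rounds to nearest for positive values) */
    #define FP(x) ((fixed_t)((x) * (1 << FIXED_POINT_PRECISION) + 0.5))

    int main(void)
    {
        fixed_t a = FP(3.25);                 /* 3.25 * 256 = 832 = 0x0340 */
        fixed_t b = FP(2.5);                  /* 2.50 * 256 = 640 = 0x0280 */
        fixed_t prod = (fixed_t)(((int32_t)a * b) >> FIXED_POINT_PRECISION);

        /* prod is 2080 = 8.125 * 256; divide by 2^8 to print it as a real number */
        printf("%f\n", (double)prod / (1 << FIXED_POINT_PRECISION));
        return 0;
    }
    [/code]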

  4. #4
    Linux Newbie
    Join Date
    Oct 2004
    Posts
    158
    Okay. First off, the limits of accuracy of floating-point arithmetic are implementation-specific.

    Read the <float.h> header (not limits.h - that one covers the integer types):
    DBL_DIG tells you how many significant decimal digits a double carries - usually fifteen.
    DBL_MAX and DBL_MIN tell you the largest finite value and the smallest positive normalised value for the datatype.
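
    For example, a rough check that just prints those limits for double and long double:

    [code]
    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        /* Decimal precision and value ranges, as reported by <float.h> */
        printf("DBL_DIG  = %d, DBL_MAX  = %g, DBL_MIN  = %g\n",
               DBL_DIG, DBL_MAX, DBL_MIN);
        printf("LDBL_DIG = %d, LDBL_MAX = %Lg, LDBL_MIN = %Lg\n",
               LDBL_DIG, LDBL_MAX, LDBL_MIN);
        return 0;
    }
    [/code]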

    If you need extreme precision, download GMP - it's a bignum library that works on most platforms.

    http://swox.com/gmp/
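
    As a rough sketch (using GMP's mpf floating-point type with an arbitrarily chosen 256-bit precision; build with -lgmp), the original 1 + 0.1 test could look something like this:

    [code]
    #include <stdio.h>
    #include <gmp.h>

    int main(void)
    {
        mpf_t a, b;

        mpf_set_default_prec(256);          /* ask for at least 256 bits of mantissa */
        mpf_init_set_str(a, "0.1", 10);     /* parsed from decimal, no detour through a double */
        mpf_init_set_ui(b, 1);
        mpf_add(b, b, a);                   /* b = 1 + 0.1 */
        gmp_printf("%.70Ff\n", b);          /* print 70 decimal places */

        mpf_clear(a);
        mpf_clear(b);
        return 0;
    }
    [/code]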
