StackOverflow2013

plurals

PO
primarykey
Id
2699886
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
3
CommunityOwnedDate
CreationDate
2010-04-23T15:38:24.560
FavoriteCount
0
LastActivityDate
2010-04-23T16:25:54.207
LastEditDate
2010-04-23T16:25:54.207
LastEditorUserId
80448
OwnerUserId
80448
ParentId
2661541
PostTypeId
2
Score
10
ViewCount
0
LastEditorDisplayName
text
Body
I could not resist spending an hour on your problem...

This algorithm is described in section 5.5.2 of "Arithmétique des ordinateurs" by Jean-Michel Muller (in French). It is actually a special case of Newton iterations with 1 as the starting point. The book gives a simple formulation of the algorithm to compute N/D, with D normalized to the range [1/2, 1):

<pre><code>e = 1 - D
Q = N
repeat K times:
  Q = Q * (1+e)
  e = e*e
</code></pre>

The number of correct bits doubles at each iteration: after K iterations, Q = (N/D) * (1 - e0^(2^K)), where e0 = 1 - D <= 1/2 is the initial value of <code>e</code>. For a full 32-bit result, 5 iterations are enough in the worst case (4 iterations guarantee only 16 bits when D is close to 1/2). You can also iterate until <code>e</code> becomes too small to modify <code>Q</code>.

Normalization is used because it provides the maximum number of significant bits in the result. It is also easier to compute the error and the number of iterations needed when the inputs are in a known range.

Once your input value is normalized, you don't need to bother with the value of BASE until you have the inverse. You simply have a 32-bit number X normalized to the range 0x80000000 to 0xFFFFFFFF, and compute an approximation of Y=2^64/X (Y is at most 2^33).

This simplified algorithm may be implemented for your Q22.10 representation as follows:

<pre><code>// Fixed point inversion
// EB Apr 2010

#include <math.h>
#include <stdio.h>

// Number X is represented by integer I: X = I/2^BASE.
// We have (32-BASE) bits in integral part, and BASE bits in fractional part
#define BASE 22
typedef unsigned int uint32;
typedef unsigned long long int uint64;

// Convert FP to/from double (debug)
double toDouble(uint32 fp) { return fp/(double)(1<<BASE); }
uint32 toFP(double x) { return (int)floor(0.5+x*(1<<BASE)); }

// Return inverse of FP
uint32 inverse(uint32 fp)
{
  if (fp == 0) return (uint32)-1; // invalid

  // Shift FP to have the most significant bit set
  int shl = 0; // normalization shift
  uint32 nfp = fp; // normalized FP
  while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead

  uint64 q = 0x100000000ULL; // 2^32
  uint64 e = 0x100000000ULL - (uint64)nfp; // 2^32-NFP
  int i;
  for (i=0;i<5;i++) // iterate: 5 steps bring the relative error down to 2^-32
  {
    // Both multiplications are actually
    // 32x32 bits truncated to the 32 high bits
    q += (q*e)>>(uint64)32;
    e = (e*e)>>(uint64)32;
    printf("Q=0x%llx E=0x%llx\n",q,e);
  }
  // Here, (Q/2^32) is the inverse of (NFP/2^32).
  // We have 2^31<=NFP<2^32 and 2^32<Q<=2^33
  return (uint32)(q>>(64-2*BASE-shl));
}

int main()
{
  double x = 1.234567;
  uint32 xx = toFP(x);
  uint32 yy = inverse(xx);
  double y = toDouble(yy);

  printf("X=%f Y=%f X*Y=%f\n",x,y,x*y);
  printf("XX=0x%08x YY=0x%08x XX*YY=0x%016llx\n",xx,yy,(uint64)xx*(uint64)yy);
}
</code></pre>

As noted in the code, the multiplications are not full 32x32->64 bits. E fits in 32 bits initially and gets smaller at every iteration. Q always fits in 34 bits. We take only the high 32 bits of the products.

The derivation of <code>64-2*BASE-shl</code> is left as an exercise for the reader :-). If it becomes 0 or negative, the result is not representable (the input value is too small).

EDIT. As a follow-up to my comment, here is a second version where bit 32 of Q (the 2^32 term) is implicit. Both E and Q now fit in 32 bits:

<pre><code>uint32 inverse2(uint32 fp)
{
  if (fp == 0) return (uint32)-1; // invalid

  // Shift FP to have the most significant bit set
  int shl = 0; // normalization shift for FP
  uint32 nfp = fp; // normalized FP
  while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead
  int shr = 64-2*BASE-shl; // normalization shift for Q
  if (shr <= 0) return (uint32)-1; // overflow

  uint64 e = 1 + (0xFFFFFFFF ^ nfp); // 2^32-NFP, max value is 2^31
  uint64 q = e; // 2^32 implicit bit, and implicit first iteration
  int i;
  for (i=0;i<4;i++) // iterate: 4 more steps after the implicit first one
  {
    e = (e*e)>>(uint64)32;
    q += e + ((q*e)>>(uint64)32);
  }
  return (uint32)(q>>shr) + (1u<<(32-shr)); // insert implicit bit
}
</code></pre>
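
The "use clz instead" comment and the shift `64-2*BASE-shl` can be sketched together as follows. This is only an illustration under stated assumptions, not part of the original code: the name `inverse_clz` is hypothetical, `__builtin_clz` is the GCC/Clang builtin (other compilers need a different intrinsic), and `unsigned int` is assumed to be 32 bits wide. The shift follows from fp = X*2^BASE and nfp = fp*2^shl: the loop yields q close to 2^64/nfp, and the wanted result 2^BASE/X = 2^(2*BASE)/fp equals q*2^(2*BASE+shl-64). Five iterations are used because, with e at most 1/2, the relative error after K steps is e^(2^K).

```c
#include <assert.h>
#include <stdint.h>

#define BASE 22 /* fractional bits, same value as in the code above */

/* Hypothetical variant of inverse(): returns the reciprocal of fp
   (value fp / 2^BASE), or UINT32_MAX when it is not representable. */
uint32_t inverse_clz(uint32_t fp)
{
    if (fp == 0) return UINT32_MAX;       /* invalid; clz(0) is undefined */

    /* Assumption: GCC/Clang builtin, 32-bit unsigned int. */
    int shl = __builtin_clz(fp);          /* leading zero bits */
    uint32_t nfp = fp << shl;             /* 2^31 <= nfp < 2^32 */

    /* result = 2^(2*BASE)/fp = (2^64/nfp) * 2^(2*BASE + shl - 64) */
    int shr = 64 - 2 * BASE - shl;
    if (shr <= 0) return UINT32_MAX;      /* input too small: overflow */

    uint64_t q = 1ULL << 32;              /* Q = 1.0 scaled by 2^32 */
    uint64_t e = (1ULL << 32) - nfp;      /* e = 1 - D, at most 2^31 */
    for (int i = 0; i < 5; i++) {
        q += (q * e) >> 32;               /* Q = Q * (1 + e), truncated */
        e = (e * e) >> 32;                /* e = e * e, truncated */
    }
    return (uint32_t)(q >> shr);
}
```

Because every product is truncated, the result never exceeds the exact quotient `(1ULL << 44) / fp`; for example `inverse_clz(1u << 22)` (the value 1.0) returns `4194303`, one unit below the exact `4194304`.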
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POPicking good first estimates for Goldschmidt division
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USEric Bainville
UserOwnerUserId
1. USEric Bainville
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. POPicking good first estimates for Goldschmidt division
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
