Showing newest posts with label Optimization. Show older posts
Showing newest posts with label Optimization. Show older posts

Tuesday, July 10, 2007

Nearest Neighbor

What's the fastest way to find the closest pair of three-dimensional points in a large set? That was the question posed for the most recent Hacker's Delight challenge.

We just released the results for Hacker's Delight #7: Nearest Neighbor. A little premature, it turns out, we misplaced a bunch of the entries. Once the remaining entries were found, I ran the tests before I came home tonight and Jim should be putting them up soon. It turns out that the missing entries don't really affect the outcome.

This is the third Hacker's Delight challenge I've blogged about. Challenge #6: 64 Choose 4 came from the post Let Me Count The Ways and, in full circle, Let Me Count The Ways (Part II) comes from the challenge. From Hacker's Delight #4: Primes came Sieve of Eratosthenes.

Alistair Milne from EA Montreal blew Nearest Neighbor out of the water with a blazingly fast and beautiful algorithm for finding the nearest two points in a cloud of points in three dimensional space. For comparison the very slow reference implementation is here. I wondered if I could get any more performance out of Alistair's algorithm and came up with an uglier but ~25% faster SIMD optimized version of his algorithm. Sadly, my own reasonably fast implementation for Nearest Neighbor is incorrect about 7% of the time.

My implementation, and many others we received were based on the canonical Nearest Neighbor algorithm as described in Cormen's Introduction to Algorithms. It seems that in every respect, Alistair's algorithm is superior. It's more elegant, it's much faster and it does less work.

Be sure to check it out.

Wednesday, April 04, 2007

Optimal Bytecode Interpretation

I always feel like I should blog more about engineering related stuff, since that's what I do all day.

For now, my earlier plans for bytecode to C++ compilation are on hold as we're looking into more immediate gains from aggressively optimizing the interpretation. One of the downsides of the hybrid C++ compilation/bytecode interpretation approach that I have in mind is that it will require a workflow change. The C++ will eventually need to be compiled and then linked to the executable. There's a pipeline/workflow bubble there that's not appealing. Still, I think it would be fantastic to get that built for critical core libraries of script (utilities) that don't change much.

Optimizing Interpretation

Our bytecode is that of a dynamic, and (as a result of the structure of the instruction set) terribly slow language. Unfortunately, we use an off-the-shelf compiler, so we can't infer much that was in the original source but fails to show up in the bytecode output.

Did some memoization, in this case caching of variable and function lookups, with excellent (awesome!) results.

A clever universal hashing scheme for both special strings and user strings opened up a lot of possibilities.

Close attention paid to string handling yielded good results.

Thread synchronizaton primitives were killing us. We usually try to keep our assembly language to a bare minimum for portability sake. This translates to a handful of primitives to improve PS2 performance. But to rid ourselves of our most time consuming locks, we applied asm to a couple of critical spots where we need to read in, modify, then write out whole cache lines as atomic operations on the PowerPC machines (PS3, 360). This avoids the expense of a more heavyweight lock such as a mutex in those places and gives us back a lot of performance for multithreaded operation.

Anyhow, we've significantly improved the bytecode execution performance, which is something the game teams have been asking (begging!) for for ages.

Of course, it's never fast enough.. ;)

Wednesday, March 21, 2007

Bytecode to Native Compilation

At work, some of the software I develop uses a bytecode interpreter. And, as always, we need better performance from the whole system. So I'm looking into bytecode-to-native (in this case C++) compilation. I've done this before, with embedded Lisp-based languages and there are a number of compilers available that do this for Java (GCC has a back-end for this), C#, Lisp and its derivatives (Bigloo for Scheme, for instance). Compiling to C or C++ is great as it serves as a sort of portable assembly language and it's possible to leverage further the fine optimization skillz of modern C/C++ compilers. I'll report on this when I make some progress - if I don't get pulled off onto something else.

Wednesday, January 24, 2007

7

(Updated 7/10/07)

Inspired by the interest in my 5-card poker hand code that plugs into Cactus Kev's evaluator, I've decided to revisit my unholy 7-card evaluator and make a faster?, cleaner one that I can then post up here.

For the 5-card hash I used Bob Jenkin's Perfect Hashing code. Check out his excellent site for great perfect hashing code & ideas.

My current 7 card evaluator first determines if there is a flush. If not, it looks up the final value in a 13 * 13 * 13 * 13 * 13 * 13 * 13 (13^7, 63M entries) precalc'd table. Arghh! If it is a flush, though, it evaluates all 21 combinations (7 choose 5) in the normal (albeit optimized) way.

But this is not how I want my grandchildren to remember my code. Let's think about other options. Now { 52 choose 7 } yields about 133 million possiblities, right? The first crucial step in thinking about optimizing the seven hand evaluation is figuring out a way to efficiently map every unique set of 7 out of 52 cards to one unique number of the 133 million possiblilities.

As it turns out, I've got some code to do that. Nevertheless, I need to do a little cleanup before I post that. So look for "7 Part II" sometime soon.. ;)

Part II: 52 Choose 7

As promised, code to map any 7 of 52 items (7 of 52 bits) to a unique index in the range of 0-133M (52 choose 7).

index52c7.h

This, of course, could be used for a super-fast 7 card hand evaluator with a precomputed table of size 266mb.

Jing, commenting below mentions that a 2+2 forum has some super-fast seven hand evaluators. Glancing briefly at the site I notice claims of 12.5 cycles per evaluation, which seems too good to be true. After all, a single out-of-cache table lookup can cost much more than that. But if it is true - sweet!

Some Clarifications

Andy Reagan emailed me and made some excellent points concerning the readability of index52c7.h.

"[It's hard] to understand what the code was doing without comments and with the generalized table and variable names.."

I apologize for that. Actually, I wrote another program to generate this file, which is one of the reasons why it's so obtuse. It would probably be a good idea to publish the generator program as well.

"What does the function index52c7 do?"

Here's the reasoning for index52c7:

We can completely represent a hand of 7 cards of 52 with a single 52-bit number with 7 bits set. We assign each possible card in a deck a number between 1 and 52, inclusive. For example, the Queen/Hearts might be 43 and 2/Spades might be 17. Then, we take a 64-bit number (large enough to contain the 52-bits) and set a bit for each of the 7 cards we have. If two of the seven cards we have are Queen/Hearts and 2/Spades we'd set bits 43 and 17 along with the bits that correspond to the other five cards.

Now, if we had unlimited memory, we could just use this number as an index into an enormous and very sparse array. Unfortunately, this array would have 2^52 (4.5 quadrillion) entries. Assuming two bytes per entry, that would require 9 petabytes of memory! So we need to somehow hash this number into a much smaller space. It turns out that the number of possible combinations of 7 items among 52 is about 133 million (52 choose 7), so ideally, we could somehow hash the 52 bit number into a number between 0 and 133 million that uniquely identifies a given hand.

That's what index52c7 does. It translates the 52 bit hand representation into a much smaller, but still unique number. At two bytes per entry, that gives us a table of 266 megabytes, which is large and in certain cases inconvenient, but certainly doable.

Using, say, Cactus Kev's code to evaluate each possible 7-card hand, we'd first generate the 266MB table and populate it by looking up the corresponding index with index52c7. Now that the table's fully populated, we can just pass index52c7 the 52-bit number and use the resulting index to pull the answer straight out of the array.

Monday, December 04, 2006

Inline Assembly

On my current project I will soon delve into optimization tasks at the level of inline assembly for PowerPC. These days the use of inline assembly is almost never justified. It's about as unportable as code can be and it's nearly impossible to understand once it's written. Most of the time, unless you devote a great deal of energy or unless you are using processor features (SIMD, for example) inaccessible through C or C++, hand-written assembly will actually be slower than compiler generated code. Furthermore, most of what you learned a few years ago about optimizing assembly code simply does not hold any more. For example, what's the faster way to multiply an integer by 5 on the x86?

A. x = (x << 2) + x;

or

B. x *= 5;

Old school assembly says that A is faster. Not true anymore. The imul (integer multiply) instruction is as fast as a single shift on the x86 these days. Counting cycles? Hard to do these days with deep pipelines, instruction reordering, branch prediction and unpredictable memory latency. The most effective way to optimize assembly seems to be aggressive profiling and trial and error. Gone are the days when you can optimize code by counting cycles with the processor manual tables in hand. Even so, these guidelines are important:

1. Most importantly, make sure you have the most efficient algorithm possible for the job before moving to assembly! There are a million good reasons for this and nothing could be more embarrassing than having your finely tuned assembly bubble sort owned by a C (or Java!) mergesort written in 12 minutes.

2. Profile changes aggressively and with the finest resolution (usually the CPU cycle counters) possible.

3. Space out memory accesses. Because of memory latency (and asynchronous memory access), you can hide cycles between your memory reads and writes.

4. Know your memory access patterns and take advantage of them. Do you only write and never read back from certain areas of memory? It may be beneficial to write-through directly to memory and avoid caching. It can also be useful to prefetch memory in certain cases.

5. Keep your data structures small enough to fit completely in cache. This will yield enormous benefits if you can do it.

6. Use SIMD where appropriate. This can give great benefit and itself may justify moving to inline assembly. However, don't spend an excessive number of cycles trying to fit data into SIMD-ready structures. It'll probably cost more than you'll get from it. Use SIMD when it's a good fit.

7. Unroll loops - to a point. Unroll tiny loops until they no longer provide a performance benefit. Keep unrolling and profiling. When you've gone too far you'll see a significant performance drop as that piece of code outgrows the instruction cache. If you have enough information on the hardware, you can figure out where this threshold will be.

8. On PC use SIMD for 64-bit integer arithmetic instead of the atrocious code that's generated for this by Visual C++.

Just so you know, this entry is subject to revision. Have any other guidelines? Let me know about 'em!

Tuesday, August 22, 2006

Let Me Count The Ways (Part II)

In the Sieve of Eratosthenes post, I mentioned that at work (EA) we have an optional programming challenge every once in a while. It was started last year by Jim Hejl and it's called Hacker's Delight - named after the excellent book of programming tricks by Henry S. Warren.

I love this stuff. Why? Who knows? Anyhow, this time I wrote the challenge - Hacker's Delight #6. I chose the { 64 choose 4 } problem from Let Me Count The Ways. My fast solution was initially ~3.9ms. Since I had not yet aggressively optimized it, I did that and got it down to ~2.5ms. That solution was written completely in C++. I figured that was about as fast as it got because it turned out that ~2.5ms is (slightly) faster than memset(data, 0, 635376 * sizeof(unsigned __int64)) !!

In short order, solutions came in that were almost exactly as fast as mine with quite different algorithms. Hmm. I thought, the bottleneck is memory-processor bandwidth - ~2.5ms is simply how long it takes to write 4.8mb of data. All the other processing is swamped by that.

Even so, Jim was experimenting with SIMD approaches and had an SSE version of 'memset' that executed in about half memset's time. I started playing with that but couldn't figure out how to use that to get better performance out of my algorithm.

Then Jim tells me that he received an entry from EA United Kingdom that executes in ~1.25ms - all SIMD (MMX)! The gauntlet is down. I asked him not to forward it to me or let me see it.

So today I finally got my own MMX version to about ~1.25ms. It seems to execute slightly faster than Jim's SSE memset. While minor speed improvements may be possible (on the order of tenths of milliseconds), I'm convinced that there's no way to get significantly faster performance.

I could be wrong.

When it's all over, I will post a more detailed description of the optimization steps for those of you who are interested. For those who are not, z z z z z z z z z z...

[The related, and more detailed document Optimizing 64 Choose 4 (.pdf)]

Thursday, June 08, 2006

Some Perfect Hash

The computer science department of the University of Alberta in Edmonton researches Artificial Intelligence (AI) for Poker. From their site I came upon Cactus Kev's Poker Hand Evaluator which is a killer fast five card hand evaluator. Reading through his algorithm you'll notice that the last step for any yet unclassified hand is a binary search through a list of values. Most hands end up in that search which is the most time-consuming part of the algorithm.

An Optimization

Replacing the binary search with a precomputed perfect hash for the 4888 values in the list yields a significant improvement over the original. The original test code included with Kevin's source (using eval_5hand) runs in 172 milliseconds on my machine and with my eval_5hand_fast it runs in 63 milliseconds. Yay, an improvement of 2.7 times!

fast_eval.c

(Updated 7/10/07)

This post has garnered a bit of attention. Cactus Kev added a comment and a link on his site back to the post and I've had an email conversation with a programmer (anonymous unless he's cool if I mention him) who ported the code to C# and reports:

Kev: 159ms
Your mod: 66ms
My C# version of your mod: 88ms

Pointers for seven hand evaluation? Check out the post 7.

Sieve of Eratosthenes

At work we had a little challenge in December to see who could come up with a program to count the number of primes between any two numbers (0 to 2^32-1 inclusive) as fast as possible. To this end I wrote this optimized Sieve of Eratosthenes (algorithm) that counts all the primes from 0 to 4,294,967,295 in about 13.75 seconds on my (admittedly fast) development machine at work.

If you come up with any improvements to it, let me know!

(2/2/2007 - I'm going to go ahead and add the link for the multicore Sieve of Eratosthenes here. On a 3GHz dual core Xeon, 7.15s!)

Friday, June 02, 2006

Let Me Count The Ways

Lately, I've been tinkering with card games (poker) a bit and one of the little questions that came up was what is the fastest way to enumerate all possible combinations of 4 items out of a possible 64? Actually, it doesn't have to be 4 of 64, it could be 7 of 48, or 2 specific aces from 52 cards or 3 bits of 10. The field of combinatorics can tell us how many there are. This is expressed as { n choose r } and has the formula factorial(n) / (factorial(r) * factorial(n - r)). In the case of n = 64 and r = 4 it yields 635,376 different combinations.

So, the task is to enumerate each of these unique combinations exactly one time until all 635,376 have been generated. This is one of those problems where the right approach makes all the difference.

The Naive Approach

Let's look, for a moment, at the most basic brute force approach: iterate through every value that can be contained in 64 bits and check if it has four bits. How long would this take? On my machine it takes 1 second to check 250 million numbers. Pretty fast, right? At this rate, however, it will take 2,340 years to check all 2^64 numbers! Clearly not a usable solution.

A Better Approach

Poking around the internet I found a function that takes a number and returns the next highest number with the same number of bits. I adapt it a bit and bam!


int enumerate_combinations(int n, int r, unsigned __int64 *v)
{
unsigned __int64 y, r, x;
unsigned count = (unsigned)math::choose(n, r);
v[0] = ((unsigned __int64)1 << r) - 1;
for (unsigned i = 1; i < count; i++)
{
x = v[i - 1];
y = x & -(__int64)x;
r = x + y;
v[i] = r | (((x ^ r) >> 2) / y);
}
return count;
}


This variation turns in 32 milliseconds for complete enumeration. Not a bad improvement over 2,340 years, eh?

The Best Approach

I actually developed the following function before the above. However, I figured (from looking at the code) that the above would blow this one away. How wrong I was. One problem with the above approach is that it has a nasty little division. Another thing is that this approach takes advantage of certain special cases like when r == 1 and n == r. The following approach is based on my initial recursive approach, but I removed the recursion so that I could rewrite it as an iterator. Removing the recursion did not seem to have a significant impact on performance. Anyhow the following runs to completion in just 3 milliseconds, over 10 times faster than the above version.


template <typename T = unsigned __int64>
int enumerate_combinations(int n, int r, T *v)
{
struct { int n, r; T h; } s[sizeof(T) * 8] = { { n, r, 0 } }, q;
int si = 1, i = 0;
T one = 1;

while (si)
{
q = s[--si];

tail:

if (q.r != 0)
{
one = 1;
if (q.r == 1)
{
for (int j = 0; j < q.n; j++)
{
v[i++] = q.h | one;
one <<= 1;
}
}
else if (q.r == q.n)
{
v[i++] = q.h | (one << q.n) - 1;
}
else
{
--q.n; s[si++] = q; q.r--;
q.h |= one << q.n;
goto tail;
}
}
}

return i;
}

Tuesday, May 23, 2006

Data Access Latency

Ryan and I have talked about making a chart to easily visualize the relative costs of common computer operations. A significant part of this is data access latencies in modern computer hardware. Yesterday I stumbled across some numbers in Raymond Chen's PDC 05 Talk: Five Things Every Windows Programmer Should Know, and behold! a chart is born:



Main memory latency has become a crucial consideration in modern software development. As the processor (CPU in the chart) becomes faster, the gulf between the processor memory (registers and caches) and main memory latency widens, making cache misses increasingly more expensive with respect to processor cycles. An increasingly frequent software design decision is to use simple array based data structures as opposed to pointer based tree and list structures for many operations. Typically, pointer structures (linked lists, binary trees, etc.) are theoretically more efficient than their linear counterparts (for example, binary trees [O(lg n) vs. O(n)] vs. linear search). Given the high cost of cache misses in modern hardware and the good memory locality of array based approaches, pointer based structures may perform 25-100 times more poorly than their simpler counterparts. In developing for the Xbox 360 and PlayStation 3 with crazy powerful processors, linearizing data structures is a crucial optimization.

So instead of using that binary tree next time, consider using a sorted linear array with a binary search.