Paul Khuong mostly on Lisp

rss feed

Sun, 11 Jul 2010

 

There's more to locality than caches

After some theoretical experiments on hash tables, I implemented a prototype for 2-left hash tables with large buckets (16 entries) to see how it’d work in the real world. It worked pretty well, but the way its performance scaled with table size sometimes baffled me. Playing with the layout (despite not fully understanding the initial prototype’s behaviour) helped, but the results were still confounding. The obvious solution was to run microbenchmarks and update my mental performance model for accesses to memory!

(The full set of results are at the end, but I’ll copy the relevant bits inline.)

The microbenchmark

The microbenchmark consists of ten million independent repetitions of a small loop that executes an access pattern on 16 independent cache lines. The addresses are generated with a Galois LFSR to minimise the number of uninteresting memory accesses while foiling any prefetch logic.

There are four access patterns. The first pattern, “0”, reads the first word in the line; “0-3” reads the first and fourth words; “0-3-7” reads the first, fourth and eighth (last) words; finally, “0-3-7-8” reads the first, fourth and eighth words of the cache line and the first of the next cache line. It would be interesting to test “0-8”, but it’s not that relevant to my application.

The working sets are allocated in regular pages (4K bytes) or in huge pages (2M bytes) to investigate the effect of the TLB (Translation Lookaside Buffer) when appropriate. A wide number of working set sizes are tested, from 16 KB to 1 GB.

Finally, “Cache miss” measures the number of cache misses (that is, accesses that hit main memory) across the execution of the program, in millions (including a small amount of noise, e.g. to record timings); “TLB miss” measures the number of TLB misses (the TLB caches mappings from logical to physical address), again in millions; and “Cycle/pattern” is the median number of cycles required to execute the access pattern once, computed as the average of 16 independent pattern executions (without compensating for timing or looping overhead, which should be on the order of 20-30 cycles for all 16 executions).

Mistake number one: Use the whole cache line, it’s free

CPUs deal with memory one cache line at a time. Barring non-temporal accesses, reading even a single byte results in reading the whole corresponding cache line (64 bytes on current x86oids). Thus, the theory goes, we’re always going to pay for loading the whole cache line from memory and it shouldn’t be any slower to read it all than to access only a single word.

My initial prototype had buckets (of 16 entries) split in two vectors, one for the hash values (four byte each, so one cache line per bucket of hash values), and another for the key-value pairs (more than a cache line per bucket, but the odds of two different keys hashing identically are low enough that it shouldn’t be an issue). Having the hashes in a vector of their own should have improved locality and definitely simplified the use of SSE to compare four hash values at a time.

The in-cache performance was as expected. Reads had a pretty much constant overhead, with some of it hidden by out of order and superscalar execution.

In my microbenchmark, that’s represented by the test cases with working set size from 16KB to 256KB (which all fit in L1 or L2 caches). All the misses are noise (5 million misses for 160 million pattern execution is negligible), and we observe a slow increase for “Cycle/pattern” as the size of the working set and the number of accesses go up.

                +--------------------------------+  

                |           4K pages             |  

                |    0      0-3    0-3-7  0-3-7-8|  

                +--------------------------------+  

  Size 16KB     |                                |  

Cache miss (M)  |   3.89    3.85    3.88    3.88 |  

TLB miss   (M)  |   0.50    0.51    0.50    0.58 |  

Cycle/pattern   |   4.50    6.25    6.50    6.50 |  

                |                                |  

                +--------------------------------+  

  Size 32KB     |                                |  

Cache miss (M)  |   3.79    3.87    3.86    3.88 |  

TLB miss   (M)  |   0.52    0.51    0.50    0.50 |  

Cycle/pattern   |   4.75    6.25    6.25    6.50 |  

                |                                |  

                +--------------------------------+  

  Size 128KB    |                                |  

Cache miss (M)  |   3.80    3.67    3.84    3.66 |  

TLB miss   (M)  |   0.52    0.36    0.50    0.39 |  

Cycle/pattern   |   5.25    6.25    6.50    7.25 |  

                |                                |  

                +--------------------------------+  

  Size 256KB    |                                |  

Cache miss (M)  |   5.03    5.07    5.07    4.06 |  

TLB miss   (M)  |   0.51    0.50    0.51    0.47 |  

Cycle/pattern   |   5.25    6.50    7.25    7.25 |  

                |                                |  

                +--------------------------------+

Once we leave cache, for instance at 128MB, things get weirder:

                +--------------------------------+  

                |           4K pages             |  

                |    0     0-3    0-3-7  0-3-7-8 |  

                +--------------------------------+  

  Size 128MB    |                                |  

Cache miss (M)  | 153.21  153.35  296.01  299.06 |  

TLB miss   (M)  | 158.00  158.07  158.10  160.58 |  

Cycle/pattern   |  30.50   34.50   41.00   44.00 |  

                |                                |  

                +--------------------------------+

According to my theory, the cost for accessing additional words in the same cache line should be negligible compared to loading it from memory. Yet, “Cycle/pattern” increases slowly but regularly. I believe that the reason for that is that memory controllers don’t read a full cache line at a time. When an uncached address is accessed, the CPU does load the whole line into cache, but only in multiple steps.

The first step also executes much slower than the others because of the way memory is addressed in RAM: addresses are sent in two halves, first the “column”, and then the “row”. To read an arbitrary address, both halves must be sent, one after the other. Reading an address close to the previous one, however, only updates the row.

It’s also interesting to note that both patterns “0-3-7” and “0-3-7-8” trigger about twice as many cache misses as “0” and “0-3”. Yet, “0-3-7” only reads from one cache line, while “0-3-7-8” reads from two. I believe that’s because of prefetching. We’ll come back to the issue of multiple reads from out-of-cache memory later.

So, while it is true that it’s better to fully use a cache line (because it’ll be completely read regardless), reading more words still incurs additional latency, and it’s only slightly cheaper than hitting an adjacent cache line.

Mistake number two: Cache misses dominate everything else

In response to my better understanding of memory, I changed the layouts of my buckets to minimise the expected number of reads. In the end, I decided to pack each bucket’s hash values in a header of 128 bit (the size of an XMM register). With 16 bytes used for the hash values, I could append 15 key-value pairs to have a round size of 256 bytes per bucket, and execute only adjacent accesses on successful lookups. The extra 16th hash value stored the number of entries in the bucket.

So, instead of having one vector of hash buckets and one of value buckets per subtable (and two subtables per hash table):

struct entry {  

        u64 key, val;  

};  

 

struct hash_bucket {  

        u32 hashes[16];  

};  

 

struct value_bucket {  

        struct entry entries[16];  

};

I had:

struct bucket {  

        union {  

                vu8 pack; // SSE vector of u8  

                u8  vec[16];  

        } hash;  

        struct entry entries[15];  

};

The new layout meant that I only had one byte of hash value for each entry. It wasn’t such an issue, since I was already computing two independent hash values per key (for the two subtables). When working with the right subtable, I could simply store the hash value from the left subtable (but still index buckets with the right subtable’s hash value), and vice versa. Since the hashes are independent, the odds of false positives are on the order of 5%. According to my new-found knowledge, this should perform really well: only one access to scan the hash values, and then, when the hash values match, the key-value pairs are at adjacent addresses.

The histogram of timings for inserts and lookups did improve and even showed nice, steep peaks (depending on whether one or two subtables were probed). Yet, there was still something really hard to explain: I seemed to run out cache much earlier than expected.

I have 256KB of L2 cache, and 12MB of L3 on my machine... but, in my microbenchmark, we observe drastic changes in timings even between working sets of 2MB and 8MB:

                +--------------------------------+  

                |           4K pages             |  

                |    0     0-3    0-3-7  0-3-7-8 |  

                +--------------------------------+  

  Size 2MB      |                                |  

Cache miss (M)  |   5.06    5.10    5.11    5.14 |  

TLB miss   (M)  |   0.67    0.85    0.88    0.90 |  

Cycle/pattern   |   6.50    7.25    8.75   10.75 |  

                |                                |  

                +--------------------------------+  

  Size 8MB      |                                |  

Cache miss (M)  |   7.29    7.28    8.67    8.68 |  

TLB miss   (M)  | 120.45  120.55  120.63  122.52 |  

Cycle/pattern   |  11.00   13.00   15.50   17.25 |  

                |                                |  

                +--------------------------------+

The access times nearly double, for working sets that both are much larger than L2 but do fit in L3 (as evidenced by the very low number of cache misses). This is where the “TLB miss” row is interesting: the number of TLB misses goes from negligible to nearly one miss per access (each access to memory triggers a TLB lookup to map from logical to physical address). The L2 TLB on my machine holds 512 pages, at 4K each, for a total of 2MB; a working set not fitting in TLB has as much of an impact as not fitting in cache!

I should have thought of that: people who ought to know like kernel developers or Kazushige Goto (of GotoBLAS and libflame fame) have been writing about the effect of TLB misses since at least 2005. So, I used huge pages (2MB instead of 4KB) and observed a return to sanity. On my microbenchmark, this shows up as:

                +--------------------------------+---------------------------------+  

                |           4K pages             :          2M pages               |  

                |    0     0-3    0-3-7  0-3-7-8 :   0      0-3    0-3-7   0-3-7-8 |  

                +--------------------------------+---------------------------------+  

  Size 2MB      |                                :                                 |  

Cache miss (M)  |   5.06    5.10    5.11    5.14 :   5.03   5.01    5.02    5.05   |  

TLB miss   (M)  |   0.67    0.85    0.88    0.90 :   0.23   0.27    0.23    0.24   |  

Cycle/pattern   |   6.50    7.25    8.75   10.75 :   5.25   6.75    7.75    9.75   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+  

  Size 8MB      |                                :                                 |  

Cache miss (M)  |   7.29    7.28    8.67    8.68 :   5.21   5.22    5.22    5.25   |  

TLB miss   (M)  | 120.45  120.55  120.63  122.52 :   0.23   0.30    0.27    0.25   |  

Cycle/pattern   |  11.00   13.00   15.50   17.25 :   5.00   6.75    7.75   10.00   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+

Using huge pages cuts the times by almost 50% on the microbenchmark; that’s on par the difference between L3 and L1 (only 33%, but timing overhead is much more significative for L1). More importantly, the timings are the same for two working sets that fit in L3 but largely spill out of L2.

Improvements are almost as good when hitting main memory:

                +--------------------------------+---------------------------------+  

                |           4K pages             :          2M pages               |  

                |    0     0-3    0-3-7  0-3-7-8 :   0      0-3    0-3-7   0-3-7-8 |  

                +--------------------------------+---------------------------------+  

  Size 16MB     |                                :                                 |  

Cache miss (M)  |  49.37   49.41   90.72   91.77 :  47.82  47.72   81.46   88.01   |  

TLB miss   (M)  | 140.59  140.60  140.67  142.87 :   0.25   0.25    0.24    0.25   |  

Cycle/pattern   |  18.00   20.50   24.00   26.50 :  14.00  15.25   17.00   18.75   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+  

  Size 32MB     |                                :                                 |  

Cache miss (M)  | 106.93  107.09  203.73  207.82 : 106.55  106.62  186.50  206.83  |  

TLB miss   (M)  | 150.56  150.74  150.82  153.09 :   0.24   0.26    0.25    0.27   |  

Cycle/pattern   |  22.25   24.75   31.00   34.00 :  15.75  17.25   20.00   27.50   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+  

  Size 64MB     |                                :                                 |  

Cache miss (M)  | 137.03  137.23  263.73  267.88 : 136.67  136.82  232.64  266.93  |  

TLB miss   (M)  | 155.63  155.79  155.81  158.21 :   5.09   5.25    5.69    5.78   |  

Cycle/pattern   |  26.00   29.25   36.75   39.75 :  16.75  18.25   24.25   30.75   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+

The access times are much better (on the order of 30% fewer cycles), but they also make a lot more sense: the difference in execution time between working sets that don’t fit in last level cache (L3) is a lot smaller with huge pages. Moreover, now that TLB misses are out of the picture, accesses to two cache lines (“0-3-7-8”) are almost exactly twice as expensive as an access to one cache line (“0”).

My test machine has a 32 entry TLB for 2M pages (and another 32 for 4M pages, but my kernel doesn’t seem to support multiple huge page sizes). That’s enough for 64 MB of address space. Indeed, we observe TLB misses with larger working sets:

                +--------------------------------+---------------------------------+  

                |           4K pages             :          2M pages               |  

                |    0     0-3    0-3-7  0-3-7-8 :   0      0-3    0-3-7   0-3-7-8 |  

                +--------------------------------+---------------------------------+  

  Size 128MB    |                                :                                 |  

Cache miss (M)  | 153.21  153.35  296.01  299.06 : 152.77  152.86  261.09  298.71  |  

TLB miss   (M)  | 158.00  158.07  158.10  160.58 :  80.84   84.96   91.48   96.40  |  

Cycle/pattern   |  30.50   34.50   41.00   44.00 :  18.75   20.75   27.25   33.25  |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+  

  Size 512MB    |                                :                                 |  

Cache miss (M)  | 170.65  170.90  326.84  329.59 : 169.90  170.22  286.47  326.54  |  

TLB miss   (M)  | 160.39  160.41  162.17  164.62 : 140.35  147.28  160.81  179.58  |  

Cycles/patterm  |  36.75   41.00   47.00   50.00 :  20.50  23.00   29.75   35.50   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+  

  Size 1GB      |                                :                                 |  

Cache miss (M)  | 184.29  184.29  353.23  356.73 : 180.11  180.43  300.66  338.62  |  

TLB miss   (M)  | 163.64  164.26  176.04  178.88 : 150.37  157.85  169.58  190.89  |  

Cycle/pattern   |  37.25   41.50   52.00   55.25 :  22.00  24.75   30.50   37.00   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+

However, even with a working set of 1GB, with nearly as many TLB misses for huge as for regular pages, we see a reduction in timing by 30-40%. I think that’s simply because the page table fits better in cache and is quicker to search. 1GB of address space uses 256K pages (at 4KB each). If each page descriptor uses only 16 bytes (one quad word for the logical address and another for the physical address), that’s still 4MB for the page table!

TL;DR
  • Multiple accesses to the same cache line still incur a latency overhead over only one access (but not in memory throughput, since the cache line will be fully read anyway).
  • Use huge pages if you can. Otherwise, you’ll run out of TLB space at about the same time as you’ll leave L2 (or even earlier)... and TLB misses are more expensive than L2 misses, almost as bad as hitting main memory.
  • Prefer accessing contiguous cache lines. If you can’t use huge pages or if your working set is very large, only one TLB miss is incurred for accesses to lines in the same page. You might also benefit from automatic prefetching.
  • This is why cache-oblivious algorithms are so interesting: they manage to take advantage of all those levels of caching (L1, L2, TLB, L3) without any explicit tuning, or even considering multiple levels of caches.

The test code can be found at http://discontinuity.info/~pkhuong/cache-test.c.

I could have tried to tweak the layout of my 2-left hash table some more in reaction to my new cost model. However, it seems that it’s simply faster to hit multiple contiguous cache lines (e.g. like linear or quadratic probing) than to access a few uncorrelated cache lines (2-left or cuckoo hashing). I’m not done playing with tuning hash tables for caches, though! I’m currently testing a new idea that seems to have both awesome utilisation and very cheap lookups, but somewhat heavier inserts. More on that soon(ish).

Annex: full tables of results from the microbenchmark
  • Test machine: unloaded 2.8 GHz Xeon (X5660) with DDR3-1333 (I don’t remember the timings)
  • Cache sizes:
    • L1D: 32 KB
    • L2: 256 KB
    • L3: 12 MB
  • TLB sizes
    • L1 dTLB (4KB): 64 entries
    • L1 dTLB (2M):32 entries
    • L2 TLB (4 KB): 512 entries

Benchmark description: access 16 independent cache lines, following one of four access patterns. 10M repetitions (with different addresses). Test with regular 4KB pages and with huge pages (2MB). Report the total number of cache and TLB misses (in million), and the median of the number of cycle per repetition (divided by 16, without adjusting for looping or timing overhead, which should be around 30 cycles per repetition). Source at http://discontinuity.info/~pkhuong/cache-test.c.

Access patterns:

  • 0: Read the cache line’s first word
  • 0-3: Read the cache line’s first and fourth words
  • 0-3-7: Read the cache line’s first, fourth and eighth words
  • 0-3-7-8: Read the cache line’s first, fourth and eighth words, and the next cache line’s first word

                +--------------------------------+---------------------------------+  

                |           4K pages             :          2M pages               |  

                |    0     0-3    0-3-7  0-3-7-8 :   0      0-3    0-3-7   0-3-7-8 |  

                +--------------------------------+---------------------------------+  

  Size 16KB     |                                :                                 |  

Cache miss (M)  |   3.89    3.85    3.88    3.88 :   3.58   3.78    3.78    3.79   |  

TLB miss   (M)  |   0.50    0.51    0.50    0.58 :   0.16   0.25    0.25    0.25   |  

Cycle/pattern   |   4.50    6.25    6.50    6.50 :   4.50   6.25    6.50    6.50   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+  

  Size 32KB     |                                :                                 |  

Cache miss (M)  |   3.79    3.87    3.86    3.88 :   3.77   3.78    3.78    3.78   |  

TLB miss   (M)  |   0.52    0.51    0.50    0.50 :   0.25   0.24    0.26    0.24   |  

Cycle/pattern   |   4.75    6.25    6.25    6.50 :   4.75   6.25    6.25    6.50   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+  

  Size 128KB    |                                :                                 |  

Cache miss (M)  |   3.80    3.67    3.84    3.66 :   3.76   3.77    3.77    3.78   |  

TLB miss   (M)  |   0.52    0.36    0.50    0.39 :   0.26   0.26    0.24    0.26   |  

Cycle/pattern   |   5.25    6.25    6.50    7.25 :   5.25   6.25    6.50    7.25   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+  

  Size 256KB    |                                :                                 |  

Cache miss (M)  |   5.03    5.07    5.07    4.06 :   3.98   3.98    3.83    3.83   |  

TLB miss   (M)  |   0.51    0.50    0.51    0.47 :   0.27   0.26    0.23    0.25   |  

Cycle/pattern   |   5.25    6.50    7.25    7.25 :   5.25   6.25    6.75    7.25   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+



Figure 1: Working sets fit in L1 or L2 (16 to 256 KB)


                +--------------------------------+---------------------------------+  

                |           4K pages             :          2M pages               |  

                |    0     0-3    0-3-7  0-3-7-8 :   0      0-3    0-3-7   0-3-7-8 |  

                +--------------------------------+---------------------------------+  

  Size 1MB      |                                :                                 |  

Cache miss (M)  |   5.04    5.09    5.09    5.09 :   5.00   4.99    5.00    4.99   |  

TLB miss   (M)  |   0.50    0.50    0.49    0.50 :   0.23   0.25    0.24    0.24   |  

Cycle/pattern   |   6.25    7.25    8.50   10.50 :   5.25   6.75    7.75    9.50   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+  

  Size 2MB      |                                :                                 |  

Cache miss (M)  |   5.06    5.10    5.11    5.14 :   5.03   5.01    5.02    5.05   |  

TLB miss   (M)  |   0.67    0.85    0.88    0.90 :   0.23   0.27    0.23    0.24   |  

Cycle/pattern   |   6.50    7.25    8.75   10.75 :   5.25   6.75    7.75    9.75   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+  

  Size 4MB      |                                :                                 |  

Cache miss (M)  |   5.19    5.19    5.19    5.22 :   5.08   5.07    5.08    5.11   |  

TLB miss   (M)  |  80.42   80.59   80.70   81.98 :   0.24   0.25    0.24    0.24   |  

Cycle/pattern   |   8.25   10.00   12.00   13.75 :   5.00   6.75    7.75    9.75   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+  

  Size 8MB      |                                :                                 |  

Cache miss (M)  |   7.29    7.28    8.67    8.68 :   5.21   5.22    5.22    5.25   |  

TLB miss   (M)  | 120.45  120.55  120.63  122.52 :   0.23   0.30    0.27    0.25   |  

Cycle/pattern   |  11.00   13.00   15.50   17.25 :   5.00   6.75    7.75   10.00   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+



Figure 2: Working sets fit in L3 and in huge TLB, but not always in regular TLB


                +--------------------------------+---------------------------------+  

                |           4K pages             :          2M pages               |  

                |    0     0-3    0-3-7  0-3-7-8 :   0      0-3    0-3-7   0-3-7-8 |  

                +--------------------------------+---------------------------------+  

  Size 16MB     |                                :                                 |  

Cache miss (M)  |  49.37   49.41   90.72   91.77 :  47.82  47.72   81.46   88.01   |  

TLB miss   (M)  | 140.59  140.60  140.67  142.87 :   0.25   0.25    0.24    0.25   |  

Cycle/pattern   |  18.00   20.50   24.00   26.50 :  14.00  15.25   17.00   18.75   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+  

  Size 32MB     |                                :                                 |  

Cache miss (M)  | 106.93  107.09  203.73  207.82 : 106.55  106.62  186.50  206.83  |  

TLB miss   (M)  | 150.56  150.74  150.82  153.09 :   0.24   0.26    0.25    0.27   |  

Cycle/pattern   |  22.25   24.75   31.00   34.00 :  15.75  17.25   20.00   27.50   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+  

  Size 64MB     |                                :                                 |  

Cache miss (M)  | 137.03  137.23  263.73  267.88 : 136.67  136.82  232.64  266.93  |  

TLB miss   (M)  | 155.63  155.79  155.81  158.21 :   5.09   5.25    5.69    5.78   |  

Cycle/pattern   |  26.00   29.25   36.75   39.75 :  16.75  18.25   24.25   30.75   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+



Figure 3: Working sets exceed L3, but still fit in huge TLB


                +--------------------------------+---------------------------------+  

                |           4K pages             :          2M pages               |  

                |    0     0-3    0-3-7  0-3-7-8 :   0      0-3    0-3-7   0-3-7-8 |  

                +--------------------------------+---------------------------------+  

  Size 128MB    |                                :                                 |  

Cache miss (M)  | 153.21  153.35  296.01  299.06 : 152.77  152.86  261.09  298.71  |  

TLB miss   (M)  | 158.00  158.07  158.10  160.58 :  80.84   84.96   91.48   96.40  |  

Cycle/pattern   |  30.50   34.50   41.00   44.00 :  18.75   20.75   27.25   33.25  |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+  

  Size 512MB    |                                :                                 |  

Cache miss (M)  | 170.65  170.90  326.84  329.59 : 169.90  170.22  286.47  326.54  |  

TLB miss   (M)  | 160.39  160.41  162.17  164.62 : 140.35  147.28  160.81  179.58  |  

Cycles/patterm  |  36.75   41.00   47.00   50.00 :  20.50  23.00   29.75   35.50   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+  

  Size 1GB      |                                :                                 |  

Cache miss (M)  | 184.29  184.29  353.23  356.73 : 180.11  180.43  300.66  338.62  |  

TLB miss   (M)  | 163.64  164.26  176.04  178.88 : 150.37  157.85  169.58  190.89  |  

Cycle/pattern   |  37.25   41.50   52.00   55.25 :  22.00  24.75   30.50   37.00   |  

                |                                :                                 |  

                +--------------------------------+---------------------------------+



Figure 4: Working sets exceed caches and TLBs

posted at: 22:31 | /LowLevel | permalink

Mon, 12 Apr 2010

 

The type-lower-bound branch

Nikodemus was intrigued by the type-lower-bound branch I pushed on repo.or.cz a few weeks ago. It’s a convenience hack that wound up being slightly more intricate and interesting than planned: a user was complaining that there was no nice way to muffle some optimisation notes. Take

(lambda (x)  

  (declare (type integer x)  

           (optimize speed))  

  (1+ x))  

 

; in: LAMBDA (X)  

;     (1+ X)  

; ==>  

;   (+ X 1)  

;  

; note: forced to do GENERIC-+ (cost 10)  

;       unable to do inline fixnum arithmetic (cost 1) because:  

;       The first argument is a INTEGER, not a FIXNUM.  

;       The result is a (VALUES INTEGER &OPTIONAL), not a (VALUES FIXNUM &REST T).  

;       unable to do inline fixnum arithmetic (cost 2) because:  

;       The first argument is a INTEGER, not a FIXNUM.  

;       The result is a (VALUES INTEGER &OPTIONAL), not a (VALUES FIXNUM &REST T).  

;       etc.

If the user already knows that the best they can do is declare that x is an integer, there is no way to muffle only the notes that amount to wishing that the type of x was more precisely known.

My first (failed) attempt tagged variables as hopeless (any note mentioning these variables’ types would then be muffled). I forgot why I couldn’t make it work, but I believe it’s because invisible copies are very common, so that transforms manipulated lvars that were mere untagged copies of the tagged variable.

Luckily, we have a well-established mechanism to make properties flow across copies: the type (propagation) system. Again, my first attempt was to create a type for hopelessly vague types. However, types must implement all three set operations (negation, intersection and union), and making that work with such an ad hoc type wasn’t an attractive prospect.

This is where type lower bounds come in. In CL, types, as exposed through declarations, are always upper bounds: they are treated as conservative approximations of the set of values the annotated form, variable, etc. can take. The conservativeness is necessary because the exact static type is generally undecidable. In other words, the meaning of (the integer x) is that x can evaluate to any integer; it could actually only ever evaluate to a fixnum (or any other subset of integers), which is what the note in the original example asks for.

If we had a way to also denote lower bounds (e.g. that x can take (not fixnum) values) on the exact static type of forms, compiler notes could be tested against these lower bounds to determine when the programmer knows that the declared or derived type cannot be improved enough for a transform to fire. In the original example, this would amount to declaring that x’s exact static type is between fixnum (exclusively) and integer (inclusively), or, equivalently, that x will always be an integer, and will sometimes not be a fixnum. The note is then obviously moot, and can be muffled by the compiler.

Unlike ad hoc “hopeless” types, intersection and union of lower bound types (really, range of types, since lower bound types are always accompanied by an upper bound) are straightforward and theoretically sound. The only issue is with type negation. Since this is all for convenience, I decided to just punt and drop the lower bound before negating.

The branch, as pushed on repo.or.cz, seems to be working. In order to keep changes to a minimum, regular types are treated as having an implicit lower bound of nil, and range types (with a non-trivial lower bound) are aggressively converted to regular types. This gives the muffling effect for some interesting simple cases, and reverts to the old behaviour very quickly. There are probably hidden bugs (both in the code and in the design), but since they could only be triggered by using the extension, I’m not too worried.

N.B. This isn’t actually useful for compilation speed

I originally thought type lower bounds could be useful to improve compilation speed: by keeping around both lower and upper bounds, we are able to overapproximate types even in the presence of type negation. Once I implemented a quick prototype, ir1-widening, I realised we don’t need lower bounds at all: we only have to make sure we only ever approximate types once we’re sure they’ll never be negated. Actually, what would be even more useful is a way to compute approximate unions and intersections quickly, instead of widening types after the fact.

posted at: 20:44 | /Lisp | permalink

Mon, 28 Dec 2009

 

Constraint sets in SBCL: preliminary exploration

Part of what makes SBCL’s static analyses powerful is their flow-sensitivity: the system can express the fact that something is true only in one of the branches of a conditional. At the heart of that capability are sets of constraints that represent that knowledge; e.g. (eql x y) to assert that x and y hold the same value in a given branch.

Until September 2008, we implemented these sets as specialised hash tables. In commit 1.0.20.4, Richard M. Kreuter replaced the hash tables with bit vectors. On most workloads, the change was beneficial, if only because the compiler now spends less time in GC. However, in some cases (I would expect, with a large number of variables and conditional branches), users have reported severe performance regressions (e.g. https://bugs.launchpad.net/sbcl/+bug/394206), regarding not only compilation times, but also memory usage.

Exploring alternative representations for constraint sets (consets) looks like good way to get started with SBCL: it’s interesting as a data structure challenge, involves “real world” engineering trade-offs, and will have an immediate observable impact on the performance of the compiler. I’ve said as much before. Maybe this post will spur some interest.

As a first step, I instrumented the conset implementation to output some profiling information on each operation: the size of the constraint universe (approximately, the number of unique constraints in the compilation unit), and, for each conset argument, the size of the bit vector and the size of the set. I’m only using a random sample of 1% of the output from a full SBCL build.

Before graphing out histograms for the population and size of the consets, I counted the operations that were actually performed to see what was important:



operation frequency (%)


member 53
do 19
adjoin 14
copy 4.4
do-intersect 3.1
equal 2.7
union 1.4
empty 0.98
intersection 0.88
difference 0.38


All the operations should be obvious, except maybe do, which iterates over the constraints in the set, do-intersect, which iterates over the intersection of two sets, and empty, which tests for emptiness.

Of the 53% calls to member, 38% come from do-intersect, and another 14% from adjoin. There are actually very few calls to member per se. A large number of the calls to member returned T: 29%. That’s unfortunate, since failure is much easier to fast path. do is also slightly over-represented, since do-intersect also count as calls to do.

Graphs galore

PIC

The spikes are powers of 2, and the last non-zero value is 4096. Kreuter reported that the largest universe he’d seen was 8k elements in an email. We don’t really have to worry about enormous universes except to avoid disastrous blow-ups.

PIC

PIC

PIC

We can also see that the representation as bitvector isn’t very dense. However, keep in mind that a bit is much smaller than a word. It takes very sparse sets for alternatives such as hash tables to save space.

PIC

The very long tails here means that while there are some fairly large consets, the vast majority of the operations only touch sets that have less than a half dozen elements.

PIC

PIC

PIC

That’s also true for calls to member, but even more for adjoin and empty.

PIC

The profile for binary operations (union, intersection, difference and equal) is very similar to that of adjoin. In detail, we have:

PIC

PIC

PIC

PIC

equal only shows a heavier tail. intersection and difference, however, both peak around 20 elements, instead of 1 to 3. Finally, we can see that union is pretty much a non-issue.

It’s also interesting to see the difference between the size of the largest and smallest argument for binary operations.

PIC

PIC

The heavier tails on binary operations probably come from the large arguments, while the smaller arguments seem to follow the overall distribution.

Finally, iteration through the set’s elements (do or do-intersect) is interesting because the ratio between the set’s population and the bit vector’s size is a good indicator of that operation’s performance. For do-intersect only the set that’s iterated over is counted (not the one on which member tests are then executed).

PIC

PIC

PIC

Making sense of the data

I’m not sure what information we can extract from this data, and especially not how to extrapolate to the cases reported by users who experience serious regressions.

One obvious way to speed do up is to work with vectors of fixnum or machine words, and do the rest ourselves. That would help with long spans of 0s, and would simplify the inner loop. We could do something similar for empty for implementations that don’t special-case FIND on bit-vectors (e.g. SBCL).

This does nothing to reduce memory usage. We probably want to switch to another representation for really sparse sets, a hash table or a (sorted) vector. But, if we do that, we run the risk of switching to a sparse implementation only because we have constraints with very high and very low indices, but nothing in the middle. A fat and shallow (two-level?) tree would help avoid that situation, and let us choose locally between a few representations.

posted at: 22:16 | /Lisp | permalink

Sat, 26 Dec 2009

 

Some notes on Warren

N.B. A pdf version is also available.

My girlfriend gave me Warren’s Hacker’s Delight for Christmas. It’s really a nice compendium of tricks that are usually available on the web, but strewn across a dozen websites.

I only started reading it this morning, and I figured I’d put some of my notes here instead of leaving them in the margins. The page numbers refer to the ninth printing.

2-5: Sign Extension (p. 18)

For sign extension (i.e. replicate the kth bit to the left), Warren suggests (for sign extension of a byte into a word):

  1. ((x + 0x00000080) & 0x000000FF) - 0x00000080
  2. ((x & 0x000000FF) 0x00000080) - 0x00000080

When one knows that the higher bits of x are all zero, the second variant becomes (x 0x00000080) - 0x00000080. A similar variant is x|- (x & 0x00000080).

Warren’s variant doesn’t require any temporary register, but needs a single constant twice. Mine only requires that constant once, but needs a temporary register. On x86, with its good support for constant operands, Warren’s is probably preferable. With a RISCier ISA, the other version could be useful.

2-9: Decoding a “Zero Means 2**n” Field (p. 20)

The idea here is that we have a field which will never take a value of 0; it could however, take any value from 1 to 2n. We obviously want to pack this into exactly n bits. A simple encoding would simply map 0 to 1, 1 to 2, etc. For various reasons, we’re sometimes stuck with an encoding where everything except 0 maps to itself, and 0 to 2n.

Notice that 0 2n mod 2n. What we want to do is perform an identity modulo 2n, but skip the modulo on the final result. Obvious candidates are x - 1 + 1, x + 1 - 1 and 0 --x (and since we’re working modulo 2n, -1 2n- 1 and 0 2n).

From Warren’s list of eight “identities” (for 2n = 8), three clearly fall from the above:

  1. ((x - 1) & 7) + 1
  2. 8 - (-x & 7)
  3. ((x + 7) & 7) + 1

Interestingly, those involving  |  - 8 also do! x |  - 8 computes (x & 7) - 8: it’s sending x to a representative from its equivalence class modulo 8, but to the smallest negative value, instead of the smallest positive value. The intuition is that, like masking with 7, all but the three low bits are discarded; however, instead of filling the rest with 0s, like & 7, |  - 8 fills them with 1s.

Extra! Extra!

This entry is more markup-heavy than usual. That would be because I’m actually typing this in LATEX, while a Lisp script drives the conversion (via tex4ht) into XHTML for pyblosxom. You can find the script at http://discontinuity.info/~pkhuong/tex2blosxom.lisp. It’s a hack, but it works!

posted at: 15:18 | /LowLevel | permalink

Mon, 29 Jun 2009

 

Complex float improvements for sbcl 1.0.30/x86-64

Complex single- and double- floats used to be represented as pairs of SSE registers on the x86-64 port. Doing so simplified the porting work, since more code could be shared with other backends. Representing complexes packed in SSE registers as would be obviously preferable was left for the future. Luckily, most of my work on SSE intrinsics is directly applicable to that task: all the SSE{1,2} instructions were finally defined, and modifying a pre-existing primitive object type is much simpler than adding a new one.

The initial work was mostly straightforward, since, by default, the only operations on complexes are moves (to/from registers, the stack/arguments, vectors, or the heap [as boxed objects]), creation from scalar components, and extraction of components. The modifications mostly entailed modifying the way register allocation manipulates complexes (they only need a single SSE register, like scalar values), and making the boxed representation more SSE-friendly (alignment, packing single floats for MOVQ). There's only one problem I can remember in that part of the code: the stack isn't guaranteed to be aligned on 16 byte boundaries! That means that I'm forced to use MOVUPD when moving complex double floats between the stack and an SSE register.

Already, using half the registers and half the memory accesses is interesting. The real pay-off with packed complexes, however, is the ability to use vector instructions in arithmetic operations. The obvious cases are addition and subtraction (both complex-complex and complex-real), which exhibit natural horizontal parallelism: each pair of components (real-real or imaginary-imaginary) can be processed separately. For complex-real multiplication or division, things are only slightly more complex: the real operand has to be duplicated before operating in parallel. I encountered two important bugs in that code.

First, scalar values were still passed around with narrow register-register moves (movsd or movss). When we only ever manipulated scalars, that wasn't an issue. Complex-real addition and subtraction, however, really want to exploit the fact that the high part of a scalar value is filled with 0. That's not the case when a narrow move is used to move a single float value into a register that used to hold a double... Using full register moves instead (for both complex and scalar values) not only made the code correct, but also improved performance, due to their being simpler for the SSE pipeline to handle.

Second, packed single float division executes four divisions, not two. Thus, it's not enough to duplicate the real operand in a complex-real division before executing a packed division: depending on the exception mask, the 0.0/0.0 in the higher half of the result may trigger an arithmetic exception. The simplest solution I found was to perform the same operation in both halves of the operands. Obviously, an exception will be signalled iff operating only on the lower half would have signalled too. The cost is one additional shuffle step for the complex value: the real value always has to be duplicated, and the high half of the result must be reset to 0 anyway.

That leaves complex-complex multiplication, and complex-complex/real-complex division. Complex multiplication is a fairly common operation; common enough for Intel to pretty much dedicate an instruction to that operation (ADDSUBP{S,D}) in SSE3. Even when restricting ourselves to SSE2, the code is fairly simple. The most unnatural part is computing the conjugate, given the absence of ADDSUB (more on that later). Complex division, however, is a much more complicated beast. I decided to stick to an implementation in CL, but rewritten to use block (complex-at-a-time) instead of component-wise arithmetic. In the end, it still wanted an additional operation: swapping the real and imaginary components of a complex.

Finally, that leaves non-arithmetic operations, like equality tests (both bitwise and as numbers), conjugate and negation. Once there's a guarantee that the unused components in a SSE register are 0-filled (a guarantee that is supported by most load-from-memory instructions), testing for equality is very straightforward. Computing the conjugate of a complex or negating one, on the other hand, isn't so nice. One look at the way floats are laid out in register shows that they may be implemented as simple bit-wise operations. That leaves the problem of loading the relevant constants in. That's where a later patch comes in.

Negating and taking the absolute value of a real requires constants for bit-wise operations, like conjugating or negating a complex. Until now, SBCL generated code that loaded the constant in a GPR, and then moved it from the GPR into an SSE register. Generating the constant usually required some shifting (often slow), and moving data between the GPR and SSE register files takes a relatively long time on current microarchitectures. Loading such constants from memory directly into SSE registers would be preferable. However, by the time assembly code is generated, it's too late to add constants, and some of these constants would have to be boxed (be it as fixnums or bignums). In other words, both not feasible and not always a good idea.

A second patch adds support for unboxed constants stored inline, after the executable code. The system must already support code relocation, so the moving GC isn't an issue more than it already is (RIP-relative addressing on x86-64, and post-GC fixups on x86). I haven't bothered to look at the situation for other platforms yet, but, with some luck, things will be similarly simple. Storing constants in the code segments was already possible, with a small bit of hacking. What the patch mostly offers is the ability to pack the constants together. The advantage is that they can be stored far enough away from code (one cache line) to not confuse caches, and that a more global outlook allows the code generator to merge identical constants together and to reorder them to guarantee sufficient alignment without wasting too much space.

With that infrastructure in place, it was easy to generate much simpler and faster code for the previous bitwise operations on floats. It also became interesting to store some unboxed constants inline. On x86-64, I implemented that for constant floats (both real and complex) and fixnums that fit in registers, but can't be assembled as immediates (so +/- 2^29 and wider). On x86, fixnums can always be assembled as immediates, so only floats are stored as unboxed inline constants. Float constants are interesting to store inline because they're frequently found in speed-oriented code. Wider-than-fixnum constants, on the other hand, not so much. There's also a certain trade-off to consider: unboxed constants cause additional consing (and code bloat) each time they must be boxed.

Unboxed constants open the door for another set of improvements for x86oids: many instructions can load one operand directly from memory. Using that feature may save a register, and always makes the code smaller. The latter can often have a heavier impact on performance than the former. Using fewer registers mostly matters when the compiler has to spill values; that does not happen so often in inner loops on the x86-64. Smaller code, on the other hand, helps with memory (fits better in cache, needs less bandwidth) and reduces the pressure on the instruction decoder. Finally, when a constant shows up as an operand, it pretty much always pays off to store is inline, so even wider-than-fixnum integer constants are stored inline in that case.

The union of these changes transparently speeds up most complex float operations by up to ~100% on my Core 2. In the current version of Bordeaux-FFT, that translates into a 110% speed-up for 1024-point FFTs. Scalar operations also show significant improvements. Nikodemus Siivola's implementation of the (broken) mandelbrot routine at http://random-state.net/log/3452921796.html shows a 55% speed-up for hand-rolled complex arithmetic (within 20% of the execution time of the fastest g++ version). The complex-ful version shows a lesser speed-up, 38%, probably because a good portion of the runtime is spent operating element-wise to compute the 2-norm of complexes. If you have performance-sensitive code that performs a lot of floating point computations, you might be pleasantly surprised by this month's release.

P.S. Considering the size of these patches, I would probably be even more pleasantly surprised if they were bug-free. Bug reports (to sbcl-bugs, or, more realistically, directly to sbcl-devel) are always welcome, especially during the code freeze period starting today!

posted at: 02:00 | /Lisp | permalink

Mon, 13 Apr 2009

 

Interlude: Numerical experiments in hashing

EDIT: Added clarifications, modifed the wording a bit, and linked to the source (see bottom of post).

Cleaning up the SSE work, making it useful and using it in something interesting would take too much time given my current situation. SSE intrinsics, part 2, will have to wait. However, there's always time for a couple experiments induced by reading papers. d-left hashing seems to have gotten some attention in the last couple years. Unfortunately, I have a hard time mapping from the papers to real-world performance.

The idea behind d-left hashing is simple. Insertions work as follows:

  1. Split our hash table in d equal sections (typically two, the left and the right halves), and order the sections from left to right arbitrarily.
  2. Associate d independent hash values to the key, h_1, h_2, ... h_d.
  3. Use the d hash values to find a bucket in each section of the hash table, and put the key-value pair in the least loaded bucket; break ties not randomly, but by taking the bucket in the leftmost section (as per the arbitrary ordering established in 1.).

The procedure for lookups is easily derived from the above.

Surprisingly, breaking ties unfairly behaves slightly better than random tie-breaking. Mitzenmacher's, Richa's and Sitaraman's The Power of Two Random Choices: A Survey of Techniques and Results covers the topic fairly extensively.

Having a choice in which bucket to use gives a substantial (theoretic) advantage over having a single hash function (d = 1). Assuming ideal hash functions, we expect the greatest number of elements in the buckets of a regular bucketed hash table with n buckets in which we insert n elements to be around log n/log log n; for d > 1, that's log log n/log d + Theta(1). Clearly, setting d to 2 yields a great improvement over d = 1, and d > 2 only shaves a constant factor.

I don't doubt the theory, but "approximately" and "Theta(1)" aren't negligible for hash tables, where we're pretty much always fighting for better constant factors regarding space usage. Luckily, this idealised ball and bin setting is easily to simulate (the hash functions are just PRNGs here and in the sequel). I ran numerical experiments with n = 2^4, 2^5, ... 2^20, each time with 1000 repetition (I'll use the same number of independent repetitions throughout the post). The distribution of bucket sizes is fairly constant over the range, so I'll only show the results for n = 2^20.

Size  1-left (%)  2-left (%)
 0      36.8        22.8
 1      36.8        54.8
 2      18.4        21.9
 3       6.1         0.4
 4       1.5         0.0
 5       0.3          0
 6       0.1          0
         ...         ...
12       0.0          0
13        0           0

The theoretical results predict a maximal size of 5~6 for 1-left and 3~4 for 2-left, which is pretty close to what we observe here.

Unfortunately, I don't find that result satisfying: how will the buckets be implemented? If they're preallocated, we can only tell that the buckets would have to all be 3-entry deep (assuming a short overflow list to supplement the 6e-6% buckets with more than 3 entries) for all 2^20 entries to fit. That uses 3 times as much space as the actual data! Buckets implemented as linked lists or allocated adaptively might use less space, but, again, the constant factors are extremely important, since we expect to have relatively shallow buckets; the additional pointers might outweigh most or all the space savings. The results above are a good hint that 2-left hashing will perform much better than regular buckets, but otherwise don't seem particularly telling.

I found it interesting to compare the space efficiency of 2-left hash tables to that of a regular, even naive, hash table with linear open addressing. Again, the experiments fills 2^20 bins with 2^20 balls (the hash functions are again PRNGs), so the load is the inverse of the space usage. However, each bin can only contain up to 1 ball; when full, we try to put the ball in the next bin, up to a certain probe length limit (so we examine at most a small fixed number of bins). We won't be able to place all the balls (without a lot of luck, anyway), but it'll be interesting to see how much of the 2^20 balls will fit. I tried that with a maximal probe length of 16, and another one with again a maximal probe length of 16, but also an overflow list of at most 8 entries. The overflow list is used to host a small number of entries that exceed the probe length limit (or bucket size) instead of aborting immediately. Since very few entries hit the limit, a short list is often enough to substantially increase the expected load.

           Open       Open+Overflow
Load (%)  Freq. (%)  Freq. (%)
18           0         0
19          0.10       0
20          0.20       0
21          0.31       0
22          0.31       0
23          0.20       0
24          0.82       0
25          1.33       0
26          2.34       0
27          3.37       0
28          3.88       0
29          8.17       0
30          8.99       0
31         11.34       0
32         12.77       0
33         13.89       0
34         13.59      0.10
35         10.83      0.31
36          5.82      2.65
37          1.63      8.56
38          0.10     21.71
39           0       35.07
40           0       24.57
41           0        6.93
42           0        0.10
43           0         0

With linear open addressing, we can already expect to have approximately the same space overhead as 2-left tables with preallocated buckets if we wanted to guarantee that all 2^20 entries fit. The small overflow list has two effects: the worst case is greatly improved, and the mode (and average) is also considerably better. The expected space overhead is now even lower, probably close to what 2-left with buckets as linked lists would achieve! However, probing 16 cells might be a bit slow. Looking at the experiment's results show that most probes are very short (although this doesn't tell us much about lookups of missing keys), so it's not that bad. For the open addressing table with overflow list, we have:

Probe length  Frequency (%)
 1             80.55
 2             12.77
 3              3.84
 4              1.49
 5              0.66
 6              0.32
 7              0.16
 8              0.09
 9              0.05
10              0.03
11              0.02
12              0.01
13              0.01
14              0.00
15              0.00
16              0.00

Still, as a comparison point, 2-left hashing with 1/4 * 2^20 4-deep buckets and an 8-entry overflow list achieves much better loads (regular preallocated buckets are almost pitiful). Unlike the theoretical setting, buckets have a bounded size (at most 4), so the table uses 2^20 cells (+ the overflow list). The amount of space allocated is much more interesting, but it also means that it will take much fewer than 2^20 balls to have enough collisions to fill the overflow list. Note that for the sequel, I'll always use the same space budget: 2^20+8 key-value cells, whether they're arranged in buckets or not. Load is always computed as the fraction of 2^20 balls that could be placed before hitting the probe length or bucket size limits 8 times (enough to fill the overflow list). All the statistics are aggregated from 1000 independent runs.

           2-left, 4-buckets, 8-overflow
Load (%)  Frequency (%)
54           0
55          1.31
56         14.76
57         49.75
58         32.46
59          1.72
60           0

Bucket utilisation is also pretty good:

Bucket size  Frequency (%)
0              1.90
1             13.39
2             42.50
3             38.38
4              3.83

2-left hashing is clearly very interesting, even with preallocated buckets; that's much more obvious here than with the previous unbounded bucket experiment. Can we do better?

Having the choice of which table to use can clearly be very effective. A similar trick can be used on linear open addressing tables: split in two sub-tables, probe both sub-tables and stop as soon as one of them has an empty slot (again, breaking ties unfairly). I simulated such a scheme with 2 sub-tables of 1/2 * 2^20 cells, an 8-element overflow list, and maximal probe lengths of 8 and 16.

           Probe: 8
Load (%)   Freq. (%)
52            0
53           0.51
54           4.18
55          18.33
56          34.83
57          31.16
58          10.29
59           0.71
60            0
           Probe: 16
Load (%)   Freq. (%)
68            0
69           0.10
70           4.76
71          25.63
72          47.11
73          21.38
74           1.01
75            0

With a maximal probe length of 8, the 2-open addressing table already hits an expected load similar to that of 2-left table. When the maximal length is pushed to 16, load is much better, and, again, successful probes usually aren't very long: 98.9% of probes are of length at most 4 (in both tables, meaning that at most 8 bins are examined).

Allowing probes (of length greater than 1) has the advantage of letting buckets overflow into adjacent ones, so that lightly loaded buckets are better exploited in the rare cases when one overflows. Deeper buckets, on the other hand, avoid the clustering issues of linear open addressing; they're also slightly more cache-friendly. When data is moved into cache, it's loaded a whole cache line at a time. Linear addressing can exploit caches nicely since it will probe consecutive addresses. However, there's no guarantee that the probe will start at the beginning of the cache line. If we take care to align buckets cache lines will be used more efficiently.

Why not do both? Once we've settled on using multiple tables, short probes and small preallocated buckets don't really make the code more complex. For a first attempt, I simulated hash tables with 2 sub-tables, 4-deep buckets, up to 2, 3 or 4 -long probes and an overflow list of at most 8 entries. So, an insertion will compute two hash values to decide where to start probing, and insert in the least loaded bucket (breaking ties as usual), continuing to the next bucket in each table if both are empty. Load is pretty good, better than 2-left hashing:

           2-probes
Load (%)   Freq. (%)
66           0
67          0.91
68         17.00
69         51.11
70         30.77
71          0.20
72           0
           3-probes
Load (%)   Freq. (%)
72           0
73          0.20
74         10.82
75         53.79
76         34.78
77          0.40
78           0
           4-probes
Load (%)   Freq. (%)
76           0
77          0.10
78         11.36
79         62.27
80         26.17
81          0.10
82           0

The 2-table with 3-long probes and 4-deep buckets already performs better than the 2-open table with 16-long probes while examining fewer elements in the worst case. Longer probes on shallower buckets aren't really an improvement; deeper buckets, on the other hand, are. With 8-deep buckets (but still the same space budget) and no open addressing, we hit a load factor of ~78% (85% with probes of length up to 2), and 90% for 16-deep buckets (open addressing isn't that effective anymore). The problem with deeper buckets is that they take more time to search (since they're usually nearly full), while the vast majority of probes are of length 1 (> 99% for the last three parameter sets). Let's also keep in mind that we have to execute the search on two sub-tables; searching among 16 elements twice adds up to a substantial amount of work.

I've listed many possibilities. The following table recapitulates the modal load factor for all the simulations above. Keep in mind that the inverse, the space usage, is actually more relevant in practice. As load values get closer to 1, the difference in space overhead becomes smaller, so the 6% difference between open(16) without overflow and open(16) is much more important than the 5% difference between 2-combo(8,2) and 2-left(16).

Implementation   Modal load (%)
open(16) w/o OF      33         no overflow list
open(16)             39         16-probe at most
2-left(4)            57         4-buckets
2-open(8)            56         2 x 8-probe at most
2-open(16)           72
2-combo(4,2)         69         2-left, 4-buckets, 2-probes
2-combo(4,3)         75
2-combo(4,4)         79
2-left(8)            78
2-combo(8,2)         85
2-left(16)           90

Implementation-wise, 4-deep buckets are particularly interesting because four 32-bit values (e.g. keys or hash values) are exactly the size of an SSE register and can easily be searched in a single SSE instruction. 16-deep buckets, on the other hand, usually fill cache lines (often 64 bytes) exactly. 8-deep buckets still exploit cache lines fairly well, without being too long to search in the common case. Moreover, if we modify 2-combo(8, 2) so that probes always stay in the same 64 bytes segment (probe bucket i, then i xor 1), the load factor isn't affected (in simulations), but cache lines are used as well as with 16-deep buckets. That's a nice bunch of interesting structures to implement and test. I'll try and come back to the topic with real timings in a couple posts.

Unorganised source code for the simulations may be found at http://discontinuity.info/~pkhuong/hashing-simulation.lisp.

posted at: 23:09 | / | permalink

Sun, 22 Mar 2009

 

Hacking SSE Intrinsics in SBCL (part 1)

I think I managed to add sane support for SSE values (octwords) and operations to SBCL this weekend (all that because of a discussion on PRNGs which lead to remembering that SFMT wasn't too hard to implement, which lead to SSE intrinsics). It wasn't too hard, but clearly the path could have been better documented. The interesting part is the creation of a new primitive data type in the compiler; once that is done, we only have to define new VOPs. Note that since I'm only targetting x86-64, alignment isn't an issue: objects are aligned to 128 bit by default.

The compiler has to be informed of the new data type at two levels:

  • The front/middle -end, to create a new type (as in CL:TYPEP), and to define how to map from that type to the primitive type (which is more concerned about representation than semantics) used in the back-end;
  • The back-end, which has to be informed about the existence of a new primitive type, and must also be modified to correctly map that primitive type to machine registers or stack locations, and to know how to move from one to the other or how to box such values in a GC-friendly object on the heap.

Obviously the runtime was also be modified to be able to GC the new type of objects.

I believe it makes more sense to do this by starting low-level, in the back-end, and then building things up to the front-end, so I'll try and explain it that way.

Hacking the machine definition

SBCL's backend has a sort of type system at the VOP / TN level (virtual operations/registers). Two elements of type information are associated with each TN: the primitive type and the storage class. As its name implies, the primitive type is a low-level, C-style type (e.g. SIMPLE-ARRAY-UNSIGNED-BYTE-64 or DOUBLE-FLOAT), which is almost entirely concerned with representation. Apart from bad puns, there is no subtyping; COMPLEX and COMPLEX-DOUBLE-FLOAT are disjoint types. However, that still leaves some leeway to the back-end. A DOUBLE-FLOAT may be stored in a FP register, on the stack, or as a boxed value. Thus, TN are also assigned a storage class before generating code.

The first step was to define new storage classes for SSE values: sse-reg and sse-stack for SSE values in XMM registers and on the stack, respectively. That's done in src/compiler/x86-64/vm.lisp, !define-storage-classes:

  (sse-stack stack :element-size 2 :alignment 2)

[...]

  (sse-reg float-registers
           :locations #.(loop for i from 0 below 15 collect i)
           :constant-scs ()
           :save-p t
           :alternate-scs (sse-stack))

The first form defines a new storage class (SC) that's stored on the stack (the storage base, SB), where each element takes two locations (in this case, 64 bit words) and requires an alignment of two locations. The second form defines a new storage class that uses the float-registers storage base, and may take any of the first 15 locations in that SB (xmm15 is reserved). Values in that SC must be saved to the stack when needed. The sse-stack SC is to be used when there aren't enough registers or when saving registers (e.g. for a call). Some more modifications were needed in the assembler to make it aware of the new SC, but that's a mostly orthogonal concern.

Adding a new primitive type

That's enough to define a new primitive type in src/compiler/generic/primtype.lisp:

(!def-primitive-type sse-value (sse-reg descriptor-reg))

sse-value can be stored in sse-reg or descriptor-reg (as boxed values), or in any of their alternate SC.

Defining a new kind of primitive object

I've defined a new primitive type that can be stored as a boxed value. However, I haven't yet defined how that boxed value should be represented. I allocated one of the unused widetags to sse-values in src/compiler/generic/early-objdef.lisp:

  #!-x86-64
  unused02
  #!+x86-64
  sse-value                         ; 01100010

That's not necessary, but I would then have to define my own typechecking VOPs, adapt the GC somehow, etc. It will define a new constant, SB!VM::SSE-VALUE-WIDETAG. I use it in the primitive object definition (src/compiler/generic/objdef.lisp):

(define-primitive-object (sse-value
                          :lowtag other-pointer-lowtag
                          :widetag sse-value-widetag)
  (filler) ; preserve the natural 128 bit alignment
  (lo-value :c-type "long" :type (unsigned-byte 64))
  (hi-value :c-type "long" :type (unsigned-byte 64)))

The macro will also define useful constants, e.g. SSE-VALUE-SIZE, as well as slot offsets for the low and high values. That information will be needed by the GC. Genesis only exports CL constants to C when the symbols are external to the package. Thus, I had to add some symbols to the export list for SB!VM in ./package-data-list.lisp-expr:

               #!+x86-64 "SSE-VALUE"
               #!+x86-64 "SSE-VALUE-P" ; will be defined later on
               #!+x86-64 "SSE-VALUE-HI-VALUE-SLOT"
               #!+x86-64 "SSE-VALUE-LO-VALUE-SLOT"
               #!+x86-64 "SSE-VALUE-SIZE"
               #!+x86-64 "SSE-VALUE-WIDETAG"

and similarly for SB!KERNEL:

               #!+x86-64
               "OBJECT-NOT-SSE-VALUE-ERROR" ; so will this

Adapting the GC

I chose a very simple representation for sse-values: the header word will contain the right widetag, obviously, and the rest will encode the size of the object (in words). That's a common scheme in SBCL, and well supported by generic code everywhere. The garbage collector (a copying Cheney GC) has three important tables that are used to dispatch to the correct function given an object's widetag: scavtab to scavenge objects for pointers, transother to copy objects to the new space and sizetab to compute the size of an object. They're all initialised in src/runtime/gc-common.c, gc_init_tables. The representation is standard, so I only had to add pointers to predefined functions, scav_unboxed, trans_unboxed and size_unboxed:

#ifdef SSE_VALUE_WIDETAG
    scavtab[SSE_VALUE_WIDETAG] = scav_unboxed;
#endif

[...]

#ifdef SSE_VALUE_WIDETAG
    transother[SSE_VALUE_WIDETAG] = trans_unboxed;
#endif

[...]

#ifdef SSE_VALUE_WIDETAG
    sizetab[SSE_VALUE_WIDETAG] = size_unboxed;
#endif

Adding utility VOPs to the compiler

That's enough code to be able to use the sse-value primitive type, use our new storage classes in VOP definitions, and pass boxed sse-values around without crashing the GC. The first VOPs to define are probably those that let the compiler move sse-values around, from an sse-reg, sse-stack or descriptor-reg to another. Not all VOPs in the cartesian product must be defined; the compiler can figure out how to piece together move VOPs to a certain extent.

(define-move-fun (load-sse-value 2) (vop x y)
  ((sse-stack) (sse-reg))
  (inst movdqa y (ea-for-sse-stack x)))

(define-move-fun (store-sse-value 2) (vop x y)
  ((sse-reg) (sse-stack))
  (inst movdqa (ea-for-sse-stack y) x))

are enough to define how to move between the stack and xmm registers (ea-for-sse-stack is a helper function that generates an effective address from an sse-value 's location in the compile-time stack frame).

(define-vop (sse-move)
  (:args (x :scs (sse-reg)
            :target y
            :load-if (not (location= x y))))
  (:results (y :scs (sse-reg)
               :load-if (not (location= x y))))
  (:note "sse move")
  (:generator 0
    (unless (location= y x)
      (inst movdqa y x))))
(define-move-vop sse-move :move (sse-reg) (sse-reg))

provides and registers code to move from one xmm register to another.

(define-vop (move-from-sse)
  (:args (x :scs (sse-reg)))
  (:results (y :scs (descriptor-reg)))
  (:node-var node)
  (:note "sse to pointer coercion")
  (:generator 13
     (with-fixed-allocation (y
                             sse-value-widetag
                             sse-value-size
                             node)
       (inst movdqa (make-ea :qword
                             :base y
                             :disp 1)
             x))))
(define-move-vop move-from-sse :move
  (sse-reg) (descriptor-reg))

(define-vop (move-to-sse)
  (:args (x :scs (descriptor-reg)))
  (:results (y :scs (sse-reg)))
  (:note "pointer to sse coercion")
  (:generator 2
    (inst movdqa y (make-ea :qword :base x :disp 1))))
(define-move-vop move-to-sse :move (descriptor-reg) (sse-reg))

will be used to move from an xmm register to a boxed representation and vice-versa.

Finally,

(define-vop (move-sse-arg)
  (:args (x :scs (sse-reg) :target y)
         (fp :scs (any-reg)
             :load-if (not (sc-is y sse-reg))))
  (:results (y))
  (:note "SSE argument move")
  (:generator 4
     (sc-case y
       (sse-reg
        (unless (location= x y)
          (inst movdqa y x)))
       (sse-stack
        (inst movdqa (ea-for-sse-stack y fp) x)))))
(define-move-vop move-sse-arg :move-arg
  (sse-reg descriptor-reg) (sse-reg))

defines how arguments are loaded before calling a function.

Creating a new CL type and associated functions

Now that pretty much all the groundwork has been done in the backend, it's time to inform the frontend about the new data type. The system's built-in classes are defined in src/code/class.lisp, right after (defvar *built-in-classes*). I only had to insert another sublist in the list of built-in classes.

     (sb!vm:sse-value
      :codes (#.sb!vm:sse-value-widetag))

That takes care of defining a new class for the middle-end, and of mapping the class's name, in the front-end, to the class object in the middle-end.

SBCL tries pretty hard to always provide a safe language by default, so I have to make sure the type sse-value can be checked. First, a new error kind is defined in src/code/interr.lisp:

(deferr object-not-sse-value-error (object)
  (error 'type-error
         :datum object
         :expected-type 'sb!vm:sse-value))

The mapping from internal error number to a meaningful message is specified in src/compiler/generic/interr.lisp, define-internal-errors: (object-not-sse-value "Object is not of type SSE-VALUE."). Since I use a normal representation, I can use preexisting machinery to define the type checking VOPs in src/compiler/generic/late-type-vops.lisp:

(!define-type-vops sse-value-p check-sse-value sse-value
    object-not-sse-value-error
  (sse-value-widetag))

That only creates a VOP for sse-value-p; we'd sometimes like to have a real function. That's created in src/code/pred.lisp, with (def-type-predicate-wrapper sb!vm:sse-value-p). Moreover, when a VOP is defined as a translation for a function, that function must be defknown ed to the compiler. I do that in src/compiler/generic/vm-fndb.lisp, with (defknown sb!vm:sse-value-p (t) boolean (foldable flushable)).

TYPEP must also be informed to use that function. That's set up in src/compiler/generic/vm-typetran.lisp, with (define-type-predicate sb!vm:sse-value-p sb!vm:sse-value).

Making the middle and the back meet

The modifications above added a new primitive type to the back-end, along with some machinery to represent and manipulate values of that type. They also added a new built-in class to the front and middle -end and some type-checking code for that new class. The only thing left is to make sure the new class is mapped to the correct primitive type. The default is to map everything to the primitive type T, a boxed value. That's done at the bottom of src/compiler/generic/primtype.lisp, in primitive-type-aux. I only had to modify the case of translating built-in class(oids) to primitive types: sse-value are treated like complex, function, system-area-pointer or weak-pointer. The class sse-value is mapped to the primitive type sse-value. That way, a function that is defknown to, e.g., take an argument of type sse-value (the class) can be translated by a VOP that takes an argument of primitive type sse-value, which can be stored in an sse-reg (an xmm register).

Now what?

We have the data type definitions. To make them useful we still have to define a lot of functions and VOPs (and SSE instructions). However, that's much closer to regular development and doesn't require as much digging around in the source. I'll leave that for another post.

posted at: 18:25 | /Lisp | permalink

Mon, 06 Oct 2008

 

Revisiting VM tricks for safepoints

In February this year, Nathan Froyd described allocation sequences in SBCL and alternative approaches other implementations use to make sure the runtime never sees half-initialised objects. At first sight, SBCL's sequence seems fairly large and slow for what, in the end, increments a thread-local pointer, even on x86-64, with its awesome 14 (RBP actually serves as a frame pointer in SBCL) GPRs:

    Set the pseudo-atomic bit
OR BYTE PTR [R12+160], 8

    (what you'd expect for bump:) Load the allocation pointer
MOV R11, [R12+80]
    Test for overflow
LEA RDX, [R11+16]
CMP [R12+88], RDX
    Jump to an out of line sequence
JBE L4
    Or save incremented pointer
MOV [R12+80], RDX
    ... and tag it
LEA RDX, [R11+7]
    (end of allocation per se)

    Finally, unset the pseudo-atomic bit
XOR BYTE PTR [R12+160], 8
    Check for interrupt pending
JEQ L2
    Trigger a signal if so
BREAK 9                    ;      pending interrupt trap
L2: ...

That's two load/store and a test just to make sure we're not caught with our pants down, handling improperly initialised objects.

One element of the alternatives is to actively check for pending interrupts (or GCs, etc). Instead of allowing interruptions everywhere except around key sequences, interruptions are queued, and explicit checks inserted by the compiler. That seems like an interesting implementation choice, so I wanted to see what it'd look like in SBCL.

It was pretty easy to adapt the compiler to insert such checks at appropriate locations (in the current case, at the end of each function' prologue, and at the head of loops). Using memory protection to turn an access into a conditional interruption seemed like a good idea, and that's what I quickly implemented.

These checks can be executed fairly often in tight loops, so it's important that they be very efficient. The first version loaded a magic address from thread-local storage (an address pointed to by a register), and then wrote to that address. When the thread had to be interrupted, the page containing that address was made unreadable and unwritable, so the last access triggered a segfault, which was then handled specially.

The result was slow... 25% as much time as the original (signals-ful) version for a simple (loop repeat n do [not cons]) function, and no difference for a consing loop. Modifying the safepoint sequence to read from the magic address instead of writing to it halved the slowdown to ~10%, and did not improve the runtime of the consing loop. That's still far from impressive.

Executing two instructions at each safepoint seems obviously fishy. Indeed, things improved sensibly when the safepoint sequence became a single TEST, as in Nathan's example from HotSpot. Instead of having to read an arbitrary address from the thread struct, I moved the "magic page" to a fixed offset from the thread structure. The safepoint sequence then became a single instruction, TEST EAX, [THREAD_STRUCT_REG + offset] (a register is dedicated to thread-local storage). That's a single instruction, reads from memory and does not clobber any register. Unfortunately, that was only enough to bring the runtime for safepoints to the same level (+/- 1-2%) as that of the original code (it does save ~50 KB out of 40 MB on the x86-64 core :).

I'm not sure how to explain the results, except by saying that x86-64 (both Core 2 and K10) memory subsystems are awesome. In any case, unless I was doing something extremely wrong, the current XOR/XOR/JEQ/BREAK sequence seems to perform well, even when compared to what other implementations (not constrained by any systems programming goal) do. There still are reasons to look into a safe point system for SBCL (simpler to reason about, easier to avoid certain states, easier to manipulate where and when interruptions can happen, ...), but performance doesn't seem to be one of them.

posted at: 23:25 | /LowLevel | permalink

Wed, 03 Sep 2008

 

SWAR implementation of (some #'zerop ...)

SWAR (SIMD Within A Register) codes can portably express short-vector parallelism for operations on small packed data (e.g. byte or even nybble vectors). A trivial application of the technique is when we test whether any bit in a word is set (equal to 1) by comparing the whole word against 0. Obviously, that also work to test whether any field (of arbitrary width) is not filled with 0. This document, from 1997, provides a fairly clear and complete overview.

Just like testing whether any bit is set, it is easy to find whether some bit is not set (it's simply the opposite of whether every bit in the word is set). Things are more complex when the data are wider than a single bit (but obviously narrower than a full word). I found a short implementation (and barely tested it), but it might be possible to do even shorter. Skip to the series of asterisks if you want to solve that puzzle (to efficiently find whether any field in a sequence of data, itself packed into a single word, is 0) yourself.

To simplify the description, I'll assume that we're working with 4-bit-wide fields. It should be clear how the code can be adapted to other widths or even mixed widths.

Let x = aaaabbbbccccdddd... be the input.

1. x' = x | (x >> 1). The first bit in each field is now, for our purposes, noise. However, some of the remaining 3 bits are now non-zero iff some the original 4 were.

2. ones = x' & ~0x8888.... The noise bits are masked out.

3. borrow = 0x8888... - ones. The first bit of each field in borrow is 0 iff some of the 3 other bits in ones aren't (iff some of the 4 bits in x weren't).

4. result = borrow & 0x8888... is zero iff the first bit of every field in borrow is 0 (iff every field in x was non-null).

And, finally, that is easy to test for, a word at a time. In the end, it takes 5 (hopelessly serial) operations (>>, |, &, - and &) and a conditional branch.

**** Testing whether any field in a word is filled with 0 may seem incredibly obscure. Apart from (some #'zerop [packed-unboxed-vector]), what use is there for such code sequences? One trick is to exploit xor. xor lets us compare multiple fields at a time: a field is 0 in a xor b iff the corresponding fields in a and b are identical (bit for bit). Now that we can determine when at least one pair of fields is equal, it's simple to implement, e.g., default FIND on specialised vectors without having to test each datum separately (until the very end, when we know that one of the pairs is equal, but not which). As usual, a full implementation, capable of dealing with :start, :end and displaced vectors is a lot more work.

posted at: 02:07 | /LowLevel | permalink

Mon, 02 Jun 2008

 

Yet another way to fake continuations

The developers of the COMET constraint programming library faced an interesting challenge when they decided to provide means to distribute computations. Their search model is, at the bottom level, based on continuations (implemented by copying the C stack in the sequential or threaded cases). To distribute work, they had to come up with a way to send continuations across the wire. Obviously, a copy of the stack isn't likely to work here, with pointers to the heap, randomised address space, multiple architectures, etc. Instead, they send a log of the non-deterministic choices taken so far, in order to make the same choices on the receiving machine (and then, hopefully, different ones).

"Continuations represent the rest of the computation" is a common way to try to explain continuations. One approach to represent the rest of the computation is simply to store all the actions performed on the way to getting there. It is easy to see that only actions that result from non referentially-transparent code must be stored. In COMET's case, that means the values returned at choice points (the program is assumed to be otherwise referentially transparent).

The approach is much more practical than it sounds: in a good constraint solver, most of the work is done outside the search itself, to propagate information through the constraint graph. Thus, to avoid the vast majority of recomputations, it suffices to send the constraint graph along with the 'continuation'. In general, it would also be possible to treat computationally expensive operations as non RT and save their results... they do take an observable amount of resources to evaluate after all.

In some ways, stateful network interactions are similar. While they do have to somehow save and restore their computational state, most computationally expensive code is completely executed between receiving a request and sending a response back. The results of these computations are also easily serialised. On the other hand, code that must use continuations (because it waits for responses) is often nearly pure IO and flow control, directing the data to and from code that does the heavy lifting. In other words, if, as Cox wants us to, we instead used a state machine, we'd have a simple state machine with a couple fat states.

One problem with this approach is that restoring a continuation takes time proportional to the number of actions previously executed. This can be alleviated by logging the result of forms that execute many actions; once the forms have been fully evaluated, there is no need to save or replay the intermediate actions. Still, that's not very good when suspending a potentially infinite loop. I see the approach more as a way to implement the part of the aforementioned state machine that is painful to do by hand (with many sequential or nested states). For the outer, driving loop, I would stick to a real state machine.

A toy implementation of such continuations would look like:

(defconstant +in-eval+ '+in-eval+
  "Used as a flag for calls that have been captured
before completion")
(defparameter *action-log* ()
  "Stack of executed actions.")
(defparameter *replay* ()
  "Stack of actions to replay.")
(defun log-action (action)
  (push action *action-log*)
  (values-list (first action)))

(defmacro /log (&body body)
  `(if (and *replay*
            (if (eq (car *replay*) +in-eval+)
                (prog1 nil
                  (pop *replay*))
                t))
       (values-list (pop *replay*))
       (log-action (multiple-value-list
                    (let ((*action-log* (cons +in-eval+
                                              *action-log*)))
                      ;; rebind, since we don't need to replay
                      ;; intermediate actions if we can simulate
                      ;; the complete evaluation of body
                      ,@body)))))

(defun call-with-log (fn &rest args)
  (let ((*replay*     ())
        (*action-log* ()))
    (apply fn args)))

(defun capture-log ()
  (reverse *action-log*))

(defun replay-log (log fn &rest args)
  (let ((*replay*     log)
        (*action-log* ()))
    (apply fn args)))

With these, we can implement a simple backtracking search.

(defparameter *search-states* ()
  "Stack of states to explore.")
(defparameter *next-choices* ()
  "List of choices to take in the first uncommitted choice point")
(defun fail ()
  (throw 'fail nil))

(defun choose (&rest choices)
  (/log                  ; capture itself is non RT!
    (when *next-choices* ; override the choices if we must
      (shiftf choices *next-choices* nil))
    (cond ((null choices)
           (fail))
          ((null (rest choices))
           (first choices))
          (t
           (push (cons (rest choices) (capture-log))
                 *search-states*)
           ;; this call to choose still hasn't returned in the
           ;; captured log, so will be entered when replayed.
           (first choices)))))

(defun execute-search (fn &rest args)
  (let ((*search-states* ())
        (count           0))
    (flet ((body ()
             (incf count)
             (catch 'fail
               (multiple-value-list
                (apply fn args)))))
      (call-with-log #'body)
      (values (loop while *search-states*
                    for (*next-choice* . log) = (pop *search-states*)
                    for values = (replay-log log #'body)
                    when values collect values)
              count))))

Finally, this can be used to stupidly search for pythagorean triples:

(defun iota ()
  (let ((n (/log
             (format t "How many integers? ")
             (parse-integer (read-line)))))
    (loop for i below n collect i)))

(defun pythagorean-triples ()
  (let* ((nats (/log (iota)))
         (x    (1+ (apply 'choose nats)))
         (y    (1+ (apply 'choose nats)))
         (z    (1+ (apply 'choose nats))))
    (if (and (< x y)
             (= (* z z) (+ (* x x) (* y y))))
        (values x y z)
        (fail))))

(execute-search #'pythagorean-triples)

This will prompt, once, for an integer, n, and then find all the x, y, z in [1, n] such that z^2 = x^2 + y^2. Note how the two non-referentially transparent operations, the IO and choose, are wrapped in /log. What happens if we replace the /log in iota with a progn? The program prompts for a number at every replay (continuation invocation). What about removing the /log around the call to iota? With /log still around the IO, iota is referentially transparent (but most probably not pure), so we can't easily see any difference. However, /log still ensures it is only called once, instead of at every replay, sensibly affecting how much the program conses.

The logging code manipulates actions, not just return values. In fact, it already manipulates two types of actions: entering an expression, and returning from it. In general, it is possible to track arbitrary effects of the logged expressions (e.g., assignment to variables), as long as the effects can be replayed.

In the context of suspending and serialising a computation's state between communications, this approach seems especially interesting. The disadvantages (slow replay of long journals, potential replay of expensive operations) can be worked around, by using continuations where they have the most to offer and by logging expensive operations when possible. More importantly, the model is simple to understand, as are the performance consequences and the way to address them. It also does not itself depend on serialising closures. Finally, it is hard to misuse! Only specific operations (those that affect global state or otherwise communicate with the outside world) have to be specially annotated for correctness; all the rest is an optimisation. The control flow isn't restructured, so third-party functions can be safely used, as long as they're referentially transparent... even higher-order functions!

posted at: 23:50 | /Lisp | permalink

Made with PyBlosxom Contact me by email: pvk@pvk.ca.