Assembly Optimizations I: (Un)Packing Structures

This is the first post in a series about optimizations that can be performed when compiling C (or possibly other languages) to x86_64 that are currently not performed by popular compilers (Clang/LLVM, GCC/GNU, ICC/Intel). While some of these may be relatively straightforward, others may require a little... creative thinking to arrive at.

Procedure

In each post, I will start with a problem description / explanation and a well-written GNU11 C solution. The challenge is then to see how the performance of the optimized assembly produced by the various compilers and by me stacks up. To ensure fairness, I write my assembly routine before compiling the others. I then use a simple test file (written in C, compiled by Clang) to call the assembly function in a really long loop and link it with each of the assembly files individually.

The resulting linked executables are then benchmarked using perf stat -B and the results along with a basic strategy analysis of the output assembly routines are shown in a table. Finally, I walk through each of the assembly routines and analyze them in detail.

All C solution code will be compiled to Intel Syntax x86_64 assembly using Matt Godbolt's excellent Compiler Explorer with the following options:

  • Language: C
    • Standard: GNU11
    • Enable GNU platform exclusive functions
  • Optimization Level: 2 (highest safe optimizations)
    • Enable Vectorization (SIMD)
    • Enable Prefetch
  • Output Comments w/ Assembly
  • Optimize for Sandy Bridge (I have a Sandy Bridge Xeon E5-2687W)

Altogether, those options are (not in order): -xc -O2 -ftree-vectorize -D_GNU_SOURCE -std=gnu11 -fprefetch-loop-arrays -fverbose-asm -march=sandybridge. Now then, on towards the meat of this post:

Structure (Un)Packing

Parallel Arrays are a data structure used in place of arrays of structures (AoS):

typedef struct {  
        int id;
        int age;
        char *name;
} S_t;

typedef struct {  
        int *id;
        int *age;
        char **name;
} PA_t;

S_t AoS[100];        // Array of Structures

static int id[100], age[100];  
static char *name[100];  
PA_t PA[100];        // Parallel Array  

Essentially, you store each member in its own array and keep a single structure with pointers to each of the individual arrays. For this reason, it is also often known in the literature as a Structure of Arrays (SoA). The pros and cons of the parallel array are outlined in the Wikipedia article, but suffice it to say that it is usually used to improve cache locality / reduce cache pollution, to allow vector (SIMD) processing of individual members, and to reduce space wasted on alignment padding.
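
As a quick illustration (my own sketch reusing the S_t type and the parallel arrays declared above; the function names are made up), summing a single member touches far less memory with the parallel array, and the loop over the dense int array is trivially vectorizable:

#include <stddef.h>

long sum_ages_aos (const S_t *aos, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; i++)
                total += aos[i].age;    /* strides over whole structures, dragging id and name through the cache */
        return total;
}

long sum_ages_pa (const int *age, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; i++)
                total += age[i];        /* dense, sequential, SIMD friendly */
        return total;
}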

That being said, when was the last time you used a parallel array? How about... never? The vast majority of code that I have seen or personally written stores things in an array of structures, usually either for ease of use (that is the default, after all) or because members are commonly accessed together. On the other hand, a parallel array can accelerate processing of an individual member via vectorization, as mentioned above, so it would be useful to be able to rapidly shuffle data between parallel arrays and arrays of structures.

Problem

Now, for a contrived example, let's pack and unpack a structure of 32-bit RGBA values (8 bits each):

#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <pthread.h>
#include <stdio.h>

#ifdef    __ICC
#define restrict
#endif

typedef struct {  
  uint8_t r,g,b,a;
} rgba_s;

typedef struct {  
  uint8_t *restrict r;
  uint8_t *restrict g;
  uint8_t *restrict b;
  uint8_t *restrict a;
} rgba_pa;

void s2pa (const rgba_s *restrict s, rgba_pa *restrict pa, size_t n) {  
  n *= 8;

  size_t x;
  for (x = 0; x < n; x++) {
    pa->r[x] = s[x].r;
    pa->g[x] = s[x].g;
    pa->b[x] = s[x].b;
    pa->a[x] = s[x].a;
  }

  return;
}

void pa2s (const rgba_pa *restrict pa, rgba_s *restrict s, size_t n) {  
  n *= 8;

  size_t x;
  for (x = 0; x < n; x++) {
    s[x].r = pa->r[x];
    s[x].g = pa->g[x];
    s[x].b = pa->b[x];
    s[x].a = pa->a[x];
  }

  return;
}

You might have noticed an odd multiply-by-8:

$$\frac{4\;\frac{\text{bytes}}{\text{structure}} \cdot 8\;\frac{\text{structures}}{\text{iteration}} \cdot 8\;\frac{\text{bits}}{\text{byte}}}{128\;\frac{\text{bits}}{\text{SSE register}}} = 2\;\frac{\text{SSE registers}}{\text{iteration}}$$

Therefore, I intentionally process in sets of 8 structures and divide by 8 when calling from the test functions:

s2pa_test:

#include <stdint.h>
#include <stdlib.h>

typedef struct {  
        uint8_t r,g,b,a;
} rgba_s;

typedef struct {  
        uint8_t *restrict r;
        uint8_t *restrict g;
        uint8_t *restrict b;
        uint8_t *restrict a;
} rgba_pa;

void s2pa (const rgba_s *restrict s, rgba_pa *restrict pa, size_t n);

#define        ITEMS        256

rgba_s s[ITEMS] = { [0 ... (ITEMS - 1)] = { 'r', 'g', 'b', 'a' }};

uint8_t r[ITEMS] = { 0 }, g[ITEMS] = { 0 }, b[ITEMS] = { 0 }, a[ITEMS] = { 0 };  
rgba_pa pa = { r, g, b, a };

int main (void) {  
        size_t x, y;
        for (x = 0; x < (1000 * 1000); x++)
                for (y = 0; y < 64; y++)        // loop unroll
                        s2pa (s, &pa, ITEMS/8);

        return 0;
}

pa2s_test:

#include <stdint.h>
#include <stdlib.h>

typedef struct {  
        uint8_t r,g,b,a;
} rgba_s;

typedef struct {  
        uint8_t *restrict r;
        uint8_t *restrict g;
        uint8_t *restrict b;
        uint8_t *restrict a;
} rgba_pa;

void pa2s (const rgba_pa *restrict pa, rgba_s *restrict s, size_t n);

#define        ITEMS        256

rgba_s s[ITEMS] = { 0 };

uint8_t        r[ITEMS] = { [0 ... (ITEMS - 1)] = 'r' },  
        g[ITEMS] = { [0 ... (ITEMS - 1)] = 'g' },
        b[ITEMS] = { [0 ... (ITEMS - 1)] = 'b' },
        a[ITEMS] = { [0 ... (ITEMS - 1)] = 'a' };
rgba_pa pa = { r, g, b, a };

int main (void) {  
        size_t x, y;
        for (x = 0; x < (1000 * 1000); x++)
                for (y = 0; y < 64; y++)        // loop unroll
                        pa2s (&pa, s, ITEMS/8);

        return 0;
}

Converting Structures to Parallel Arrays

hand.s
  Basic Loop Strategy:
    1. Vector Load
    2. Vector Mask
    3. Vector Shift
    4. Vector Saturating Reduce
    5. Micro-Vector Extract
    6. Micro-Vector Store
  Time: 2.92 s | Cycles: 9.05 B | Instructions: 29.88 B | IPC: 3.30 | Stalled Cycles / Instruction: 0.02 | Idle Frontend: 5.06 % | Idle Backend: 1.53 % | Branches: 1.28 B | Branch Misses: 0.08 %

clang.s
  Basic Loop Strategy:
    1. Scalar Load
    2. Scalar Store
  Time: 27.61 s | Cycles: 85.39 B | Instructions: 246.83 B | IPC: 2.89 | Stalled Cycles / Instruction: 0.08 | Idle Frontend: 23.36 % | Idle Backend: 0.57 % | Branches: 16.65 B | Branch Misses: 0.39 %

gcc.s
  Basic Loop Strategy:
    1. Reload CONSTANT Pointer?!?!?
    2. Scalar Load
    3. Scalar Store
  Time: 33.01 s | Cycles: 102.26 B | Instructions: 262.88 B | IPC: 2.57 | Stalled Cycles / Instruction: 0.07 | Idle Frontend: 19.15 % | Idle Backend: 0.42 % | Branches: 16.65 B | Branch Misses: 0.39 %

icc.s
  Basic Loop Strategy:
    1. Reload CONSTANT Pointer?!?!?
    2. Scalar Load
    3. Scalar Store
  Time: 37.47 s | Cycles: 116.07 B | Instructions: 246.50 B | IPC: 2.12 | Stalled Cycles / Instruction: 0.01 | Idle Frontend: 1.37 % | Idle Backend: 0.80 % | Branches: 16.65 B | Branch Misses: 0.39 %

Ironically, Intel (icc) was the slowest (on Intel's own chip!), so let's see what it did (the boilerplate prior to the s2pa: label was kept identical and manually inserted into each file):

icc.s:

.file "icc.s"
.intel_syntax noprefix

.text

# rgba_s:        struct {
#                        uint8_t r;
#                        uint8_t g;
#                        uint8_t b;
#                        uint8_t a;
#                }

# rgba_pa:        struct {
#                        uint8_t *restrict r;
#                        uint8_t *restrict g;
#                        uint8_t *restrict b;
#                        uint8_t *restrict a;
#                }

.globl        s2pa
.type        s2pa, @function
# void s2pa (const rgba_s *restrict s, rgba_pa *restrict pa, size_t n)
# s2pa (rdi *, rsi *, rdx)
s2pa:  
        xor       eax, eax                                      #26.8
        shl       rdx, 3                                        #23.3
        test      rdx, rdx                                      #26.19
        jbe       ..B1.5        # Prob 10%                      #26.19
..B1.3:                         # Preds ..B1.1 ..B1.3
        mov       r8, QWORD PTR [rsi]                           #27.5
        mov       cl, BYTE PTR [rdi+rax*4]                      #27.16
        mov       BYTE PTR [r8+rax], cl                         #27.5
        mov       r10, QWORD PTR [8+rsi]                        #28.5
        mov       r9b, BYTE PTR [1+rdi+rax*4]                   #28.16
        mov       BYTE PTR [r10+rax], r9b                       #28.5
        mov       rcx, QWORD PTR [16+rsi]                       #29.5
        mov       r11b, BYTE PTR [2+rdi+rax*4]                  #29.16
        mov       BYTE PTR [rcx+rax], r11b                      #29.5
        mov       r9, QWORD PTR [24+rsi]                        #30.5
        mov       r8b, BYTE PTR [3+rdi+rax*4]                   #30.16
        mov       BYTE PTR [r9+rax], r8b                        #30.5
        inc       rax                                           #26.22
        cmp       rax, rdx                                      #26.19
        jb        ..B1.3        # Prob 82%                      #26.19
..B1.5:                         # Preds ..B1.3 ..B1.1
        ret                                                     #33.3

It starts by zeroing eax so that rax can be used as a counter, then multiplies n by 8 (a left shift by 3 bit positions). After that, it checks whether n is zero. Once the setup is complete, it loads and stores the bytes individually. For some odd reason, it also decided to keep reloading the destination array pointers from the rgba_pa structure (pointed to by rsi) on every iteration. When the loop finishes, it returns.
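
To make the reload easier to see, here is roughly what icc's inner loop is doing, translated back into C (my own sketch, not anything icc emitted; the function name is made up):

void s2pa_as_icc_compiled_it (const rgba_s *s, rgba_pa *pa, size_t n) {
        n *= 8;
        for (size_t x = 0; x < n; x++) {
                uint8_t *r = pa->r;     /* re-read from the rgba_pa structure... */
                r[x] = s[x].r;
                uint8_t *g = pa->g;     /* ...on every single iteration... */
                g[x] = s[x].g;
                uint8_t *b = pa->b;
                b[x] = s[x].b;
                uint8_t *a = pa->a;     /* ...instead of being hoisted out of the loop */
                a[x] = s[x].a;
        }
}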

Now, except for the weird let's-reload-the-addresses-every-iteration thing, this is more or less a perfect translation of the C code into assembly. In other words, any decent C/C++/FORTRAN programmer who was just learning x86_64 could write something along these lines. Nothing special here.

Here's what GCC spit out:

gcc.s:

.file "gcc.s"
.intel_syntax noprefix

.text

# rgba_s:    struct {
#            uint8_t r;
#            uint8_t g;
#            uint8_t b;
#            uint8_t a;
#        }

# rgba_pa:    struct {
#            uint8_t *restrict r;
#            uint8_t *restrict g;
#            uint8_t *restrict b;
#            uint8_t *restrict a;
#        }

.globl    s2pa
.type    s2pa, @function
# void s2pa (const rgba_s *restrict s, rgba_pa *restrict pa, size_t n)
# s2pa (rdi *, rsi *, rdx)
s2pa:  
        xor     eax, eax  # x
        shl     rdx, 3    # n,        # clang wouldn't accept "sal"...?
        je      .L7 #,
.L5:
        movzx   r8d, BYTE PTR [rdi]   # D.4428, MEM[base: _2, offset: 0B]
        add     rdi, 4    # ivtmp.7,
        mov     rcx, QWORD PTR [rsi]      # *pa_6(D).r, *pa_6(D).r
        mov     BYTE PTR [rcx+rax], r8b   # *_8, D.4428
        movzx   r8d, BYTE PTR [rdi-3] # D.4428, MEM[base: _2, offset: 1B]
        mov     rcx, QWORD PTR [rsi+8]    # *pa_6(D).g, *pa_6(D).g
        mov     BYTE PTR [rcx+rax], r8b   # *_15, D.4428
        movzx   r8d, BYTE PTR [rdi-2] # D.4428, MEM[base: _2, offset: 2B]
        mov     rcx, QWORD PTR [rsi+16]   # *pa_6(D).b, *pa_6(D).b
        mov     BYTE PTR [rcx+rax], r8b   # *_19, D.4428
        movzx   r8d, BYTE PTR [rdi-1] # D.4428, MEM[base: _2, offset: 3B]
        mov     rcx, QWORD PTR [rsi+24]   # *pa_6(D).a, *pa_6(D).a
        mov     BYTE PTR [rcx+rax], r8b   # *_23, D.4428
        add     rax, 1    # x,
        cmp     rdx, rax  # n, x
        jne     .L5       #,
.L7:
        ret

One thing to note here is that while GCC emitted a sal instruction (shift arithmetic left), Clang wouldn't accept it (I used Clang to assemble and link all of the programs), so I had to substitute a shl instruction (shift logical left). This probably has something to do with the fact that, semantically, an arithmetic left shift is identical to a logical left shift. In fact, if we use the Online x86 / x64 Assembler and Disassembler (generously provided by Defuse Computer Security R&D), we find that sal assembles to exactly the same bytes as shl:

Assembly:  
    shl rdx, 3
    sal rdx, 3
Disassembly:  
    0:  48 c1 e2 03             shl    rdx,0x3
    4:  48 c1 e2 03             shl    rdx,0x3

GCC does more or less the same thing as icc, although the running time is a bit lower (icc: 37.47 s vs. gcc: 33.01 s). If I were to guess, I'd say that GCC's simpler address calculation is giving it a slight speed edge, since it has both a higher IPC (icc: 2.12 vs. gcc: 2.57) and less backend (execution) idling (icc: 0.80% vs. gcc: 0.42%). However, this isn't something I'm sure enough of that I'd put money on it (do let me know if this is the case, though).

One little curiosity to notice is that in the pre-loop code, GCC doesn't test for zero, instead taking advantage of shl setting the zero flag if the result is zero. This is a simple micro-optimization that easily eliminates an extra instruction. Neat!

Clang was the second fastest using this:

clang.s:

.file "clang.s"
.intel_syntax noprefix

.text

# rgba_s:    struct {
#            uint8_t r;
#            uint8_t g;
#            uint8_t b;
#            uint8_t a;
#        }

# rgba_pa:    struct {
#            uint8_t *restrict r;
#            uint8_t *restrict g;
#            uint8_t *restrict b;
#            uint8_t *restrict a;
#        }

.globl    s2pa
.type    s2pa, @function
# void s2pa (const rgba_s *restrict s, rgba_pa *restrict pa, size_t n)
# s2pa (rdi *, rsi *, rdx)
s2pa:                                   # @s2pa  
        shl     rdx, 3
        test    rdx, rdx
        je      .LBB0_3
        mov     r8, qword ptr [rsi]
        mov     r9, qword ptr [rsi + 8]
        mov     rax, qword ptr [rsi + 16]
        mov     rsi, qword ptr [rsi + 24]
        add     rdi, 3
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        mov     cl, byte ptr [rdi - 3]
        mov     byte ptr [r8], cl
        mov     cl, byte ptr [rdi - 2]
        mov     byte ptr [r9], cl
        mov     cl, byte ptr [rdi - 1]
        mov     byte ptr [rax], cl
        mov     cl, byte ptr [rdi]
        mov     byte ptr [rsi], cl
        inc     r8
        inc     r9
        inc     rax
        inc     rsi
        add     rdi, 4
        dec     rdx
        jne     .LBB0_2
.LBB0_3:                                # %._crit_edge
        ret

Clang decided to do the sensible, straightforward thing: multiply n by 8, check whether n is zero, load in the pointers, and then fire off the load/store loop. It does do an odd offset-by-3 on rdi and then loads from negative offsets (and I haven't a clue why), but spread out across the many iterations of the loop, the overhead of that single extra instruction has no significant impact.

It's interesting to see that Clang chooses to increment each of the pointer registers on every loop iteration rather than incrementing a single register and using it as an offset. In theory, using a single index register would mean fewer instructions (and fewer cycles per loop, since the increments all compete for the same execution ports). However, if my hypothesis about simpler addressing from the GCC section is true, then bumping the pointer registers might indeed be faster. Perhaps I'll test that another time.
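
To make the two strategies concrete, here they are sketched in C (my own illustration, not code from this post): the first loop uses a single index register with base + index addressing, while the second bumps the pointers themselves the way Clang's output does.

#include <stddef.h>
#include <stdint.h>

void copy_indexed (uint8_t *restrict dst, const uint8_t *restrict src, size_t n) {
        for (size_t i = 0; i < n; i++)
                dst[i] = src[i];        /* one counter register, more complex addressing */
}

void copy_bumped (uint8_t *restrict dst, const uint8_t *restrict src, size_t n) {
        while (n--)
                *dst++ = *src++;        /* more increments, but simple [reg] addressing */
}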

All in all, props to the Clang/LLVM developers for making Clang do the obvious and reasonable thing here.

When I hand-coded mine, I mainly focused on vectorizing (commented-out lines were removed prior to pasting):

hand.s:

.file "hand.s"
.intel_syntax noprefix

.data

.align    16
shift:  
.zero    3
.byte    0xFF
.zero    3
.byte    0xFF
.zero    3
.byte    0xFF
.zero    3
.byte    0xFF

.text

# rgba_s:    struct {
#            uint8_t r;
#            uint8_t g;
#            uint8_t b;
#            uint8_t a;
#        }

# rgba_pa:    struct {
#            uint8_t *restrict r;
#            uint8_t *restrict g;
#            uint8_t *restrict b;
#            uint8_t *restrict a;
#        }

.globl    s2pa
.type    s2pa, @function
# void s2pa (const rgba_s *restrict s, rgba_pa *restrict pa, size_t n)
# s2pa (rdi *, rsi *, rdx)
s2pa:    # sandy optimized

    # free rename | free load s.r shift | prefetch vector read
    xor eax, eax
    vmovdqa xmm0, XMMWORD PTR [shift + rip]
    prefetcht0  [rdi]

    # zero check
    cmp rdx, rax
    je  .s2pa_RET

    # load in pa indexes | calculate masks s.{g,b,a} | zero pa offset
    mov r8, QWORD PTR [rsi]
    mov r9, QWORD PTR [rsi + 8]
    vpsrld  xmm1, xmm0, 8
    vpsrld  xmm2, xmm0, 16

    mov r10, QWORD PTR [rsi + 16]
    mov r11, QWORD PTR [rsi + 24]
    vpsrld  xmm3, xmm0, 24

.s2pa_LOOP:
    # load in *s
    vmovdqu     xmm15, XMMWORD PTR [rdi + rax * 8]

    # mask what we want | prefetch next pa.{r,g,b,a} for write
    vpand       xmm4, xmm15, xmm0
    vpand       xmm5, xmm15, xmm1
    prefetcht0  [r8 + rax * 2 + 4]
    prefetcht0  [r9 + rax * 2 + 4]

    vpand       xmm6, xmm15, xmm2
    vpand       xmm7, xmm15, xmm3
    prefetcht0  [r10 + rax * 2 + 4]
    prefetcht0  [r11 + rax * 2 + 4]

    # shift r,g,b (a is already in place) | prefetch next s
    vpsrld  xmm4, xmm4, 24
    vpsrld  xmm5, xmm5, 16
    vpsrld  xmm6, xmm6, 8
    prefetcht0  [rdi + rax * 8  + 32]

    # reduce 32 to 16
    vpackusdw   xmm4, xmm4, xmm4
    vpackusdw   xmm5, xmm5, xmm5
    vpackusdw   xmm6, xmm6, xmm6
    vpackusdw   xmm7, xmm7, xmm7

    # reduce 16 to 8
    vpackuswb   xmm4, xmm4, xmm4
    vpackuswb   xmm5, xmm5, xmm5
    vpackuswb   xmm6, xmm6, xmm6
    vpackuswb   xmm7, xmm7, xmm7

    # we need to write v/4 at a time (reversed due to little endian)
    vmovd   DWORD PTR [r11 + rax * 2], xmm4
    vmovd   DWORD PTR [r10 + rax * 2], xmm5
    vmovd   DWORD PTR [r9 + rax * 2], xmm6
    vmovd   DWORD PTR [r8 + rax * 2], xmm7

    # loop end
    add rax, 2
    cmp rax, rdx
    jne .s2pa_LOOP

.s2pa_RET:
    ret

Upon first reading that, you might notice that I structured the code in blocks of four or fewer instructions. I do that because the Sandy Bridge microarchitecture can issue up to 4 instructions per cycle. Realistically, that doesn't generally happen (dependencies and execution-port limits usually get in the way), but it also makes the code easier to read and comment on.

Prior to the loop, I load a vector byte mask for s.r and then generate offset masks for the other members of s based off of that. Then in the loop, I do the following:

  1. vmovdqu: load in a 128-bit (16-byte) vector (4 packed structures)
  2. vpand: apply the prior generated vector masks so we only have the bytes that we want for each of the members
  3. prefetcht0: preload the cache lines that we will be writing to
  4. vpsrld: shift the vectors as necessary to bring the bytes (8-bit) we want to the lowest portion of the doublewords (32-bit) in the vector
  5. prefetcht0: preload the next cache line that we will read in the next loop iteration
  6. vpackusdw: reduce each of the 4 doublewords (32-bit) in each vector to words (16-bit)
  7. vpackuswb: reduce each of the 8 words (16-bit) in each vector to bytes (8-bit, duh)
  8. vmovd: write out the lowest doubleword (32-bit) of each of the vectors to the appropriate parallel array

It may occur to you that we shouldn't have to mask the bytes if we're going to be reducing them later. The thing is, though, x86_64 doesn't actually have an instruction that simply truncates the elements of an SSE vector to a narrower width, so I ended up having to get creative. What it does have are PACKUSWB (SSE2) and PACKUSDW (SSE4.1), which perform saturating size reductions. By using a mask to keep only the bytes that we want before applying the PACKUS* instructions, all of the other bytes are set to zero, so the saturating reduction leaves the byte values that we want unchanged.
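
For reference, here is the same mask / shift / saturating-pack idea for a single channel, expressed with SSE intrinsics in C (my own sketch, not part of the benchmarked code; the helper name is made up):

#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_packus_epi32 */

/* Extract the g byte of 4 packed rgba_s values into 4 contiguous bytes. */
static inline uint32_t extract_g4 (__m128i four_pixels) {
        const __m128i g_mask = _mm_set1_epi32(0x0000FF00);
        __m128i g = _mm_and_si128(four_pixels, g_mask); /* vpand:     isolate the g bytes            */
        g = _mm_srli_epi32(g, 8);                       /* vpsrld:    move g down to bits 0-7        */
        g = _mm_packus_epi32(g, g);                     /* vpackusdw: 32-bit -> 16-bit, saturating   */
        g = _mm_packus_epi16(g, g);                     /* vpackuswb: 16-bit -> 8-bit, saturating    */
        return (uint32_t) _mm_cvtsi128_si32(g);         /* vmovd:     low doubleword = the 4 g bytes */
}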

Also, after a few iterations of the loop, the manual prefetch becomes utterly worthless: the hardware prefetcher detects the sequential memory accesses and begins fetching cache lines ahead of time on its own. Despite this, manual prefetching may marginally help during the first few iterations, and since Sandy Bridge can only execute two vector arithmetic operations per cycle, I had two issue slots left over anyway (4 IPC max. - 2 vector ALU IPC = 2 IPC left), so in this case, prefetching couldn't hurt.
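
For completeness, manual prefetching can also be expressed from C via the GNU builtin (a sketch; the 64-byte look-ahead distance is an arbitrary pick, not a tuned value):

#include <stddef.h>
#include <stdint.h>

void copy_with_prefetch (uint8_t *restrict dst, const uint8_t *restrict src, size_t n) {
        for (size_t x = 0; x < n; x++) {
                if (x + 64 < n)                                 /* stay 64 bytes ahead of the loads */
                        __builtin_prefetch(&src[x + 64], 0, 3); /* read, keep in all cache levels (prefetcht0) */
                dst[x] = src[x];
        }
}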

Summary

Compilers usually generate pretty solid code, but when given tasks that aren't particularly common, they may not have been tuned for them and thus may not generate particularly good (or even half-decent) code. Also, modern compilers can only use optimizations and instructions that they have been designed to use; unlike a human, a modern compiler cannot yet create convoluted combinations of instructions that magically work (this may become possible in the future via superoptimizers; for instance, see Google's Souper).

Also, it is difficult to keep adding instructions to an instruction set:

  1. Instruction encoding space is limited. Adding more instructions requires you to be conservative with encoding space, and sometimes forces you to introduce encoding extensions (horrifying to have to decode; e.g. REX, VEX, EVEX, etc.).
  2. Having more instructions requires you to implement more functionality on the chip and/or provide chip space for a larger, more sophisticated microcode. This, in turn, can increase the complexity of designing and manufacturing the chips, which ultimately means more expensive products that customers may be less willing to purchase.
  3. The more instructions you provide, the less likely it is that any given instruction will be used. Compiler developers only have so much time to spend teaching the compiler to use them, and as a general rule, modern code is rarely written directly in assembly.

In this scenario, that means that while SSE 4.1 provides a variety of instructions for widening integers (PMOV{S,Z}X*), there are no instructions for narrowing them (presumably because you can accomplish it with combinations of existing instructions, as I did here). Therefore, when writing assembly, if you can't seem to find an instruction that does what you want (check Intel's Intrinsics Guide and Felix Cloutier's x86 Instruction Set Reference), see what combination of available instructions you can use while still maintaining high speed.

If you haven't had a look already, Agner Fog's Instruction Tables provide timing data for individual instructions to assist you in picking fast instructions or combinations thereof.


Converting Parallel Arrays to Structures

hand.s
  Basic Loop Strategy:
    1. Vector Zero Extend Load
    2. Vector Shift
    3. Vector Combine
    4. Vector Store
  Time: 7.22 s | Cycles: 22.36 B | Instructions: 62.54 B | IPC: 2.80 | Stalled Cycles / Instruction: 0.02 | Idle Frontend: 6.23 % | Idle Backend: 0.37 % | Branches: 2.31 B | Branch Misses: 0.05 %

clang.s
  Basic Loop Strategy:
    1. Scalar Load
    2. Scalar Store
  Time: 29.33 s | Cycles: 90.74 B | Instructions: 246.80 B | IPC: 2.72 | Stalled Cycles / Instruction: 0.10 | Idle Frontend: 27.86 % | Idle Backend: 0.74 % | Branches: 16.65 B | Branch Misses: 0.39 %

gcc.s
  Basic Loop Strategy:
    1. Vector Load
    2. Vector Unpack and Interleave
    3. Vector Store
  Time: 2.50 s | Cycles: 7.72 B | Instructions: 24.75 B | IPC: 3.21 | Stalled Cycles / Instruction: 0.04 | Idle Frontend: 12.68 % | Idle Backend: 1.75 % | Branches: 962.1 M | Branch Misses: 0.11 %

icc.s
  Basic Loop Strategy:
    1. Reload CONSTANT Pointer?!?!?
    2. Scalar Load
    3. Scalar Store
  Time: 37.70 s | Cycles: 116.78 B | Instructions: 246.56 B | IPC: 2.11 | Stalled Cycles / Instruction: 0.01 | Idle Frontend: 1.97 % | Idle Backend: 0.70 % | Branches: 16.65 B | Branch Misses: 0.39 %

Once again, Intel (icc) was the slowest...

icc.s:

.file "icc.s"
.intel_syntax noprefix

.text

# rgba_s:    struct {
#            uint8_t r;
#            uint8_t g;
#            uint8_t b;
#            uint8_t a;
#        }

# rgba_pa:    struct {
#            uint8_t *restrict r;
#            uint8_t *restrict g;
#            uint8_t *restrict b;
#            uint8_t *restrict a;
#        }

.globl    pa2s
.type    pa2s, @function
# void pa2s (const rgba_pa *restrict pa, rgba_s *restrict s, size_t n)
# pa2s (rdi *, rsi *, rdx)
pa2s:  
        xor       eax, eax                                      #40.8
        shl       rdx, 3                                        #37.3
        test      rdx, rdx                                      #40.19
        jbe       ..B2.5        # Prob 10%                      #40.19
..B2.3:                         # Preds ..B2.1 ..B2.3
        mov       rcx, QWORD PTR [rdi]                          #41.14
        mov       r8b, BYTE PTR [rcx+rax]                       #41.14
        mov       BYTE PTR [rsi+rax*4], r8b                     #41.5
        mov       r9, QWORD PTR [8+rdi]                         #42.14
        mov       r10b, BYTE PTR [r9+rax]                       #42.14
        mov       BYTE PTR [1+rsi+rax*4], r10b                  #42.5
        mov       r11, QWORD PTR [16+rdi]                       #43.14
        mov       cl, BYTE PTR [r11+rax]                        #43.14
        mov       BYTE PTR [2+rsi+rax*4], cl                    #43.5
        mov       r8, QWORD PTR [24+rdi]                        #44.14
        mov       r9b, BYTE PTR [r8+rax]                        #44.14
        mov       BYTE PTR [3+rsi+rax*4], r9b                   #44.5
        inc       rax                                           #40.22
        cmp       rax, rdx                                      #40.19
        jb        ..B2.3        # Prob 82%                      #40.19
..B2.5:                         # Preds ..B2.3 ..B2.1
        ret                                                     #47.3

icc generated pretty much the same code as it did with s2pa and had pretty much the same awful results. Moving on.

clang.s:

.file "clang.s"
.intel_syntax noprefix

.text

# rgba_s:    struct {
#            uint8_t r;
#            uint8_t g;
#            uint8_t b;
#            uint8_t a;
#        }

# rgba_pa:    struct {
#            uint8_t *restrict r;
#            uint8_t *restrict g;
#            uint8_t *restrict b;
#            uint8_t *restrict a;
#        }

.globl    pa2s
.type    pa2s, @function
# void pa2s (const rgba_pa *restrict pa, rgba_s *restrict s, size_t n)
# pa2s (rdi *, rsi *, rdx)
pa2s:                                   # @pa2s  
        shl     rdx, 3
        test    rdx, rdx
        je      .LBB1_3
        mov     r8, qword ptr [rdi]
        mov     r9, qword ptr [rdi + 8]
        mov     rax, qword ptr [rdi + 16]
        mov     rdi, qword ptr [rdi + 24]
        add     rsi, 3
.LBB1_2:                                # =>This Inner Loop Header: Depth=1
        mov     cl, byte ptr [r8]
        mov     byte ptr [rsi - 3], cl
        mov     cl, byte ptr [r9]
        mov     byte ptr [rsi - 2], cl
        mov     cl, byte ptr [rax]
        mov     byte ptr [rsi - 1], cl
        mov     cl, byte ptr [rdi]
        mov     byte ptr [rsi], cl
        inc     r8
        inc     r9
        inc     rax
        inc     rdi
        add     rsi, 4
        dec     rdx
        jne     .LBB1_2
.LBB1_3:                                # %._crit_edge
        ret

Once again, Clang did the rational, sane thing, and it delivered performance similar to its s2pa routine (29.33 s vs. 27.61 s).

hand.s:

.file "hand.s"
.intel_syntax noprefix

.text

# rgba_s:    struct {
#            uint8_t r;
#            uint8_t g;
#            uint8_t b;
#            uint8_t a;
#        }

# rgba_pa:    struct {
#            uint8_t *restrict r;
#            uint8_t *restrict g;
#            uint8_t *restrict b;
#            uint8_t *restrict a;
#        }

.globl    pa2s
.type    pa2s, @function
# void pa2s (const rgba_pa *restrict pa, rgba_s *restrict s, size_t n)
# pa2s (rdi *, rsi *, rdx)
pa2s:    # sandy optimized

    # free rename | prefetch vector write
    xor eax, eax
    prefetcht0  [rdi]

    # processed in blocks of 8 structures
    shl rdx, 3

    # zero check
    cmp rdx, rax
    je  .pa2s_RET

    # load in pa indexes (reversed because little endian)
    mov r11, QWORD PTR [rdi]
    mov r10, QWORD PTR [rdi + 8]
    mov r9, QWORD PTR [rdi + 16]
    mov r8, QWORD PTR [rdi + 24]

.pa2s_LOOP:
    # load and expand 8 to 32 | shift r,g,b (a is already in place) | prefetch next s write
    pmovzxbd    xmm0, DWORD PTR [rax + r8]
    pmovzxbd    xmm1, DWORD PTR [rax + r9]
    pslld       xmm0, 24
    pslld       xmm1, 16

    pmovzxbd    xmm2, DWORD PTR [rax + r10]
    pmovzxbd    xmm3, DWORD PTR [rax + r11]
    pslld       xmm2, 8
    prefetcht0  [rdi + rax * 4 + 32]

    # load 2nd batch | merge 1st batch | shift 2nd batch
    pmovzxbd    xmm4, DWORD PTR [rax + r8 + 4]
    pmovzxbd    xmm5, DWORD PTR [rax + r9 + 4]
    por     xmm0, xmm1
    por     xmm2, xmm3

    pmovzxbd    xmm6, DWORD PTR [rax + r10 + 4]
    pmovzxbd    xmm7, DWORD PTR [rax + r11 + 4]
    por     xmm0, xmm2
    pslld       xmm4, 24

    # shift 2nd batch cont. | write 1st batch | prefetch next pa.{r,g,b,a}
    pslld   xmm5, 16
    pslld   xmm6, 8
    movdqu  XMMWORD PTR [rsi + rax * 4], xmm0
    prefetcht0  [rax + r8 + 8]

    # merge 2nd batch | prefetch next pa.{r,g,b,a} cont.
    por xmm4, xmm5
    por xmm6, xmm7
    prefetcht0  [rax + r9 + 8]
    prefetcht0  [rax + r10 + 8]

    por xmm4, xmm6
    prefetcht0  [rax + r11 + 8]

    # write 2nd batch | loop end
    movdqu  XMMWORD PTR [rsi + rax * 4 + 16], xmm4
    add rax, 8
    cmp rax, rdx
    jne .pa2s_LOOP

.pa2s_RET:
    ret

You may have noticed that unlike in the s2pa code, I didn't use any VEX-prefixed instructions (vp*), instead opting for the plain, unprefixed SSE instructions. In writing this code, I faced virtually no register pressure (how much you have to shuffle values around and, in some cases, temporarily spill register values elsewhere), so I had no reason to use the three-operand VEX-prefixed forms. Ordinary SSE instructions are along the lines of a = operation (a, b), while VEX-prefixed three-operand instructions are along the lines of a = operation (b, c), which allows for better register allocation in some cases because you don't stomp all over your existing values.

Anyways, in repacking structures from parallel arrays, I built a simple loop strategy:

  1. pmovzxbd: load 4 bytes (8-bit) and zero-extend each of them into a doubleword (32-bit)
  2. pslld: shift the vectors as necessary so that the bytes (8-bit) that we want are in the appropriate position of the doublewords (32-bit) within the vector (128-bit)
  3. prefetcht0: preload the cache line that we will be writing to later
  4. por: merge the vectors containing the separate members so that we end up with a single vector containing the rebuilt structure with all members
  5. movdqu: store the result vector
  6. prefetcht0: preload the cache lines for the parallel arrays

Additionally, in order to make better use of Sandy Bridge's ability to perform two loads per cycle, I perform two batches' worth of work in each iteration of the loop.
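
For reference, here is one 4-structure step of that strategy expressed with SSE4.1 intrinsics in C (my own sketch, not the benchmarked code; the helper names are made up):

#include <stdint.h>
#include <string.h>
#include <smmintrin.h>  /* SSE4.1: _mm_cvtepu8_epi32 */

/* Zero-extend 4 bytes from one plane into the 4 doubleword lanes (pmovzxbd). */
static inline __m128i load4_epu8 (const uint8_t *p) {
        uint32_t w;
        memcpy(&w, p, sizeof w);        /* unaligned 4-byte load */
        return _mm_cvtepu8_epi32(_mm_cvtsi32_si128((int) w));
}

/* Rebuild 4 rgba_s values (16 bytes) from the four parallel arrays. */
static inline void repack4 (uint8_t *dst, const uint8_t *r, const uint8_t *g,
                            const uint8_t *b, const uint8_t *a) {
        __m128i vr = load4_epu8(r);                     /* r stays in byte 0 of each pixel */
        __m128i vg = _mm_slli_epi32(load4_epu8(g),  8); /* pslld: g into byte 1 */
        __m128i vb = _mm_slli_epi32(load4_epu8(b), 16); /* pslld: b into byte 2 */
        __m128i va = _mm_slli_epi32(load4_epu8(a), 24); /* pslld: a into byte 3 */
        __m128i px = _mm_or_si128(_mm_or_si128(vr, vg), /* por: merge the four channels */
                                  _mm_or_si128(vb, va));
        _mm_storeu_si128((__m128i *) dst, px);          /* movdqu: 4 packed structures */
}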

However, my implementation was not the fastest for repacking structures.

gcc.s:

.file "gcc.s"
.intel_syntax noprefix

.text

# rgba_s:    struct {
#            uint8_t r;
#            uint8_t g;
#            uint8_t b;
#            uint8_t a;
#        }

# rgba_pa:    struct {
#            uint8_t *restrict r;
#            uint8_t *restrict g;
#            uint8_t *restrict b;
#            uint8_t *restrict a;
#        }

.globl    pa2s
.type    pa2s, @function
# void pa2s (const rgba_pa *restrict pa, rgba_s *restrict s, size_t n)
# pa2s (rdi *, rsi *, rdx)
pa2s:  
        mov     rcx, rdx  # n, n
        shl     rcx, 3    # n,
        je      .L36        #,
        push    r15     #
        shl     rdx, 5    # D.4550,
        push    r14     #
        add     rdx, rsi  # D.4552, s
        push    r13     #
        push    r12     #
        push    rbp     #
        push    rbx     #
        mov     r8, QWORD PTR [rdi]       # D.4551, pa_9(D)->r
        mov     r9, QWORD PTR [rdi+8]     # D.4551, pa_9(D)->g
        mov     r10, QWORD PTR [rdi+16]   # D.4551, pa_9(D)->b
        mov     rdi, QWORD PTR [rdi+24]   # D.4551, pa_9(D)->a
        lea     rax, [r8+rcx]     # D.4546,
        cmp     rsi, rax  # s, D.4546
        setnb   bl    #, D.4547
        cmp     r8, rdx   # D.4551, D.4552
        setnb   al    #, D.4547
        or      ebx, eax    # D.4547, D.4547
        lea     rax, [r9+rcx]     # D.4546,
        cmp     rsi, rax  # s, D.4546
        setnb   al    #, D.4547
        cmp     rdx, r9   # D.4552, D.4551
        setbe   r11b  #, D.4547
        or      eax, r11d   # D.4547, D.4547
        and     eax, ebx  # D.4547, D.4547
        cmp     rcx, 15   # n,
        seta    r11b    #, D.4547
        and     eax, r11d # D.4547, D.4547
        lea     r11, [r10+rcx]    # D.4546,
        cmp     rsi, r11  # s, D.4546
        setnb   bl    #, D.4547
        cmp     rdx, r10  # D.4552, D.4551
        setbe   r11b  #, D.4547
        or      r11d, ebx   # D.4547, D.4547
        test    al, r11b        # D.4547, D.4547
        je      .L11        #,
        lea     rax, [rdi+rcx]    # D.4546,
        cmp     rsi, rax  # s, D.4546
        setnb   r11b  #, D.4547
        cmp     rdx, rdi  # D.4552, D.4551
        setbe   al    #, D.4547
        or      r11b, al    # tmp302, D.4547
        je      .L11        #,
        lea     rdx, [rcx-16]     # D.4555,
        shr     rdx, 4    # D.4555,
        lea     r12, [rdx+1]      # bnd.16,
        sub     rdx, 4    # D.4555,
        mov     rax, r12  # x, bnd.16
        shl     rax, 4    # x,
        cmp     rdx, -6   # D.4555,
        ja      .L21        #,
        shr     rdx, 2    # D.4555,
        mov     r13, rdi  # vectp_pretmp.28, D.4551
        mov     rbp, r10  # vectp_pretmp.25, D.4551
        lea     r15, [4+rdx*4]    # ivtmp.66,
        mov     rbx, r9   # vectp_pretmp.22, D.4551
        mov     rdx, rsi  # vectp_s.31, s
        mov     r11, r8   # vectp_pretmp.19, D.4551
        xor     r14d, r14d        # D.4549
.L13:
        vmovdqu xmm1, XMMWORD PTR [r11]   # MEM[base: vectp_pretmp.19_105, offset: 0B], MEM[base: vectp_pretmp.19_105, offset: 0B]
        add     r14, 4    # D.4549,
        add     r11, 64   # vectp_pretmp.19,
        prefetcht0      [rdx+560]   #
        vmovdqu xmm2, XMMWORD PTR [rbx]   # MEM[base: vectp_pretmp.22_109, offset: 0B], MEM[base: vectp_pretmp.22_109, offset: 0B]
        add     rbp, 64   # vectp_pretmp.25,
        add     rbx, 64   # vectp_pretmp.22,
        add     r13, 64   # vectp_pretmp.28,
        vmovdqu xmm5, XMMWORD PTR [r13-64]        # MEM[base: vectp_pretmp.28_117, offset: 0B], MEM[base: vectp_pretmp.28_117, offset: 0B]
        prefetcht0      [rdx+624]   #
        prefetcht0      [rdx+688]   #
        prefetcht0      [rdx+752]   #
        vmovdqu xmm0, XMMWORD PTR [rbp-64]        # MEM[base: vectp_pretmp.25_113, offset: 0B], MEM[base: vectp_pretmp.25_113, offset: 0B]
        add     rdx, 256  # vectp_s.31,
        vpunpcklbw      xmm4, xmm2, xmm5    # D.4548, MEM[base: vectp_pretmp.22_109, offset: 0B], MEM[base: vectp_pretmp.28_117, offset: 0B]
        vpunpcklbw      xmm3, xmm1, xmm0    # D.4548, MEM[base: vectp_pretmp.19_105, offset: 0B], MEM[base: vectp_pretmp.25_113, offset: 0B]
        vpunpckhbw      xmm0, xmm1, xmm0    # D.4548, MEM[base: vectp_pretmp.19_105, offset: 0B], MEM[base: vectp_pretmp.25_113, offset: 0B]
        vpunpckhbw      xmm1, xmm2, xmm5    # D.4548, MEM[base: vectp_pretmp.22_109, offset: 0B], MEM[base: vectp_pretmp.28_117, offset: 0B]
        vpunpcklbw      xmm2, xmm3, xmm4    # D.4548, D.4548, D.4548
        vpunpckhbw      xmm3, xmm3, xmm4    # D.4548, D.4548, D.4548
        vmovups XMMWORD PTR [rdx-256], xmm2       # MEM[base: vectp_s.31_121, offset: 0B], D.4548
        vpunpcklbw      xmm2, xmm0, xmm1    # D.4548, D.4548, D.4548
        vpunpckhbw      xmm0, xmm0, xmm1    # D.4548, D.4548, D.4548
        vmovups XMMWORD PTR [rdx-240], xmm3       # MEM[base: vectp_s.31_121, offset: 16B], D.4548
        vmovups XMMWORD PTR [rdx-224], xmm2       # MEM[base: vectp_s.31_121, offset: 32B], D.4548
        vmovups XMMWORD PTR [rdx-208], xmm0       # MEM[base: vectp_s.31_121, offset: 48B], D.4548
        vmovdqu xmm1, XMMWORD PTR [r11-48]        # MEM[base: vectp_pretmp.19_105, offset: 16B], MEM[base: vectp_pretmp.19_105, offset: 16B]
        vmovdqu xmm2, XMMWORD PTR [rbx-48]        # MEM[base: vectp_pretmp.22_109, offset: 16B], MEM[base: vectp_pretmp.22_109, offset: 16B]
        vmovdqu xmm5, XMMWORD PTR [r13-48]        # MEM[base: vectp_pretmp.28_117, offset: 16B], MEM[base: vectp_pretmp.28_117, offset: 16B]
        vmovdqu xmm0, XMMWORD PTR [rbp-48]        # MEM[base: vectp_pretmp.25_113, offset: 16B], MEM[base: vectp_pretmp.25_113, offset: 16B]
        vpunpcklbw      xmm4, xmm2, xmm5    # D.4548, MEM[base: vectp_pretmp.22_109, offset: 16B], MEM[base: vectp_pretmp.28_117, offset: 16B]
        vpunpcklbw      xmm3, xmm1, xmm0    # D.4548, MEM[base: vectp_pretmp.19_105, offset: 16B], MEM[base: vectp_pretmp.25_113, offset: 16B]
        vpunpckhbw      xmm0, xmm1, xmm0    # D.4548, MEM[base: vectp_pretmp.19_105, offset: 16B], MEM[base: vectp_pretmp.25_113, offset: 16B]
        vpunpckhbw      xmm1, xmm2, xmm5    # D.4548, MEM[base: vectp_pretmp.22_109, offset: 16B], MEM[base: vectp_pretmp.28_117, offset: 16B]
        vpunpcklbw      xmm2, xmm3, xmm4    # D.4548, D.4548, D.4548
        vpunpckhbw      xmm3, xmm3, xmm4    # D.4548, D.4548, D.4548
        vmovups XMMWORD PTR [rdx-192], xmm2       # MEM[base: vectp_s.31_121, offset: 64B], D.4548
        vpunpcklbw      xmm2, xmm0, xmm1    # D.4548, D.4548, D.4548
        vpunpckhbw      xmm0, xmm0, xmm1    # D.4548, D.4548, D.4548
        vmovups XMMWORD PTR [rdx-176], xmm3       # MEM[base: vectp_s.31_121, offset: 80B], D.4548
        vmovups XMMWORD PTR [rdx-160], xmm2       # MEM[base: vectp_s.31_121, offset: 96B], D.4548
        vmovups XMMWORD PTR [rdx-144], xmm0       # MEM[base: vectp_s.31_121, offset: 112B], D.4548
        vmovdqu xmm1, XMMWORD PTR [r11-32]        # MEM[base: vectp_pretmp.19_105, offset: 32B], MEM[base: vectp_pretmp.19_105, offset: 32B]
        vmovdqu xmm2, XMMWORD PTR [rbx-32]        # MEM[base: vectp_pretmp.22_109, offset: 32B], MEM[base: vectp_pretmp.22_109, offset: 32B]
        vmovdqu xmm5, XMMWORD PTR [r13-32]        # MEM[base: vectp_pretmp.28_117, offset: 32B], MEM[base: vectp_pretmp.28_117, offset: 32B]
        vmovdqu xmm0, XMMWORD PTR [rbp-32]        # MEM[base: vectp_pretmp.25_113, offset: 32B], MEM[base: vectp_pretmp.25_113, offset: 32B]
        vpunpcklbw      xmm4, xmm2, xmm5    # D.4548, MEM[base: vectp_pretmp.22_109, offset: 32B], MEM[base: vectp_pretmp.28_117, offset: 32B]
        vpunpcklbw      xmm3, xmm1, xmm0    # D.4548, MEM[base: vectp_pretmp.19_105, offset: 32B], MEM[base: vectp_pretmp.25_113, offset: 32B]
        vpunpckhbw      xmm0, xmm1, xmm0    # D.4548, MEM[base: vectp_pretmp.19_105, offset: 32B], MEM[base: vectp_pretmp.25_113, offset: 32B]
        vpunpckhbw      xmm1, xmm2, xmm5    # D.4548, MEM[base: vectp_pretmp.22_109, offset: 32B], MEM[base: vectp_pretmp.28_117, offset: 32B]
        vpunpcklbw      xmm2, xmm3, xmm4    # D.4548, D.4548, D.4548
        vpunpckhbw      xmm3, xmm3, xmm4    # D.4548, D.4548, D.4548
        vmovups XMMWORD PTR [rdx-128], xmm2       # MEM[base: vectp_s.31_121, offset: 128B], D.4548
        vpunpcklbw      xmm2, xmm0, xmm1    # D.4548, D.4548, D.4548
        vpunpckhbw      xmm0, xmm0, xmm1    # D.4548, D.4548, D.4548
        vmovups XMMWORD PTR [rdx-112], xmm3       # MEM[base: vectp_s.31_121, offset: 144B], D.4548
        vmovups XMMWORD PTR [rdx-96], xmm2        # MEM[base: vectp_s.31_121, offset: 160B], D.4548
        vmovups XMMWORD PTR [rdx-80], xmm0        # MEM[base: vectp_s.31_121, offset: 176B], D.4548
        vmovdqu xmm1, XMMWORD PTR [r11-16]        # MEM[base: vectp_pretmp.19_105, offset: 48B], MEM[base: vectp_pretmp.19_105, offset: 48B]
        vmovdqu xmm2, XMMWORD PTR [rbx-16]        # MEM[base: vectp_pretmp.22_109, offset: 48B], MEM[base: vectp_pretmp.22_109, offset: 48B]
        vmovdqu xmm0, XMMWORD PTR [rbp-16]        # MEM[base: vectp_pretmp.25_113, offset: 48B], MEM[base: vectp_pretmp.25_113, offset: 48B]
        vmovdqu xmm5, XMMWORD PTR [r13-16]        # MEM[base: vectp_pretmp.28_117, offset: 48B], MEM[base: vectp_pretmp.28_117, offset: 48B]
        vpunpcklbw      xmm3, xmm1, xmm0    # D.4548, MEM[base: vectp_pretmp.19_105, offset: 48B], MEM[base: vectp_pretmp.25_113, offset: 48B]
        vpunpckhbw      xmm0, xmm1, xmm0    # D.4548, MEM[base: vectp_pretmp.19_105, offset: 48B], MEM[base: vectp_pretmp.25_113, offset: 48B]
        vpunpcklbw      xmm4, xmm2, xmm5    # D.4548, MEM[base: vectp_pretmp.22_109, offset: 48B], MEM[base: vectp_pretmp.28_117, offset: 48B]
        vpunpckhbw      xmm1, xmm2, xmm5    # D.4548, MEM[base: vectp_pretmp.22_109, offset: 48B], MEM[base: vectp_pretmp.28_117, offset: 48B]
        vpunpcklbw      xmm2, xmm3, xmm4    # D.4548, D.4548, D.4548
        vpunpckhbw      xmm3, xmm3, xmm4    # D.4548, D.4548, D.4548
        vmovups XMMWORD PTR [rdx-64], xmm2        # MEM[base: vectp_s.31_121, offset: 192B], D.4548
        vpunpcklbw      xmm2, xmm0, xmm1    # D.4548, D.4548, D.4548
        vpunpckhbw      xmm0, xmm0, xmm1    # D.4548, D.4548, D.4548
        vmovups XMMWORD PTR [rdx-48], xmm3        # MEM[base: vectp_s.31_121, offset: 208B], D.4548
        vmovups XMMWORD PTR [rdx-32], xmm2        # MEM[base: vectp_s.31_121, offset: 224B], D.4548
        vmovups XMMWORD PTR [rdx-16], xmm0        # MEM[base: vectp_s.31_121, offset: 240B], D.4548
        cmp     r15, r14  # ivtmp.66, D.4549
        jne     .L13      #,
.L12:
        xor     r14d, r14d        # ivtmp.61
.L17:
        vmovdqu xmm3, XMMWORD PTR [rbp+0+r14]     # MEM[base: vectp_pretmp.25_158, index: ivtmp.61_82, offset: 0B], MEM[base: vectp_pretmp.25_158, index: ivtmp.61_82, offset: 0B]
        add     r15, 1    # ivtmp.66,
        vmovdqu xmm4, XMMWORD PTR [r13+0+r14]     # MEM[base: vectp_pretmp.28_159, index: ivtmp.61_82, offset: 0B], MEM[base: vectp_pretmp.28_159, index: ivtmp.61_82, offset: 0B]
        vmovdqu xmm0, XMMWORD PTR [r11+r14]       # MEM[base: vectp_pretmp.19_156, index: ivtmp.61_82, offset: 0B], MEM[base: vectp_pretmp.19_156, index: ivtmp.61_82, offset: 0B]
        vmovdqu xmm1, XMMWORD PTR [rbx+r14]       # MEM[base: vectp_pretmp.22_157, index: ivtmp.61_82, offset: 0B], MEM[base: vectp_pretmp.22_157, index: ivtmp.61_82, offset: 0B]
        vpunpcklbw      xmm2, xmm0, xmm3    # D.4548, MEM[base: vectp_pretmp.19_156, index: ivtmp.61_82, offset: 0B], MEM[base: vectp_pretmp.25_158, index: ivtmp.61_82, offset: 0B]
        vpunpckhbw      xmm0, xmm0, xmm3    # D.4548, MEM[base: vectp_pretmp.19_156, index: ivtmp.61_82, offset: 0B], MEM[base: vectp_pretmp.25_158, index: ivtmp.61_82, offset: 0B]
        vpunpcklbw      xmm3, xmm1, xmm4    # D.4548, MEM[base: vectp_pretmp.22_157, index: ivtmp.61_82, offset: 0B], MEM[base: vectp_pretmp.28_159, index: ivtmp.61_82, offset: 0B]
        vpunpckhbw      xmm1, xmm1, xmm4    # D.4548, MEM[base: vectp_pretmp.22_157, index: ivtmp.61_82, offset: 0B], MEM[base: vectp_pretmp.28_159, index: ivtmp.61_82, offset: 0B]
        vpunpcklbw      xmm4, xmm2, xmm3    # D.4548, D.4548, D.4548
        vpunpckhbw      xmm2, xmm2, xmm3    # D.4548, D.4548, D.4548
        vmovups XMMWORD PTR [rdx+r14*4], xmm4     # MEM[base: vectp_s.31_160, index: ivtmp.61_82, step: 4, offset: 0B], D.4548
        vmovups XMMWORD PTR [rdx+16+r14*4], xmm2  # MEM[base: vectp_s.31_160, index: ivtmp.61_82, step: 4, offset: 16B], D.4548
        vpunpcklbw      xmm2, xmm0, xmm1    # D.4548, D.4548, D.4548
        vpunpckhbw      xmm0, xmm0, xmm1    # D.4548, D.4548, D.4548
        vmovups XMMWORD PTR [rdx+32+r14*4], xmm2  # MEM[base: vectp_s.31_160, index: ivtmp.61_82, step: 4, offset: 32B], D.4548
        vmovups XMMWORD PTR [rdx+48+r14*4], xmm0  # MEM[base: vectp_s.31_160, index: ivtmp.61_82, step: 4, offset: 48B], D.4548
        add     r14, 16   # ivtmp.61,
        cmp     r12, r15  # bnd.16, ivtmp.66
        ja      .L17        #,
        lea     rdx, [rsi+rax*4]  # ivtmp.50,
        cmp     rcx, rax  # n, x
        je      .L33        #,
.L14:
        movzx   esi, BYTE PTR [r8+rax]        # D.4554, MEM[base: _32, index: x_80, offset: 0B]
        add     rdx, 4    # ivtmp.50,
        mov     BYTE PTR [rdx-4], sil     # MEM[base: _61, offset: 0B], D.4554
        movzx   esi, BYTE PTR [r9+rax]        # D.4554, MEM[base: _34, index: x_80, offset: 0B]
        mov     BYTE PTR [rdx-3], sil     # MEM[base: _61, offset: 1B], D.4554
        movzx   esi, BYTE PTR [r10+rax]       # D.4554, MEM[base: _36, index: x_80, offset: 0B]
        mov     BYTE PTR [rdx-2], sil     # MEM[base: _61, offset: 2B], D.4554
        movzx   esi, BYTE PTR [rdi+rax]       # D.4554, MEM[base: _38, index: x_80, offset: 0B]
        add     rax, 1    # x,
        mov     BYTE PTR [rdx-1], sil     # MEM[base: _61, offset: 3B], D.4554
        cmp     rcx, rax  # n, x
        ja      .L14        #,
.L33:
        pop     rbx       #
        pop     rbp       #
        pop     r12       #
        pop     r13       #
        pop     r14       #
        pop     r15       #
.L36:
        ret
.L11:
        xor     eax, eax  # x
.L19:
        movzx   edx, BYTE PTR [r8+rax]        # D.4554, MEM[base: _32, index: x_58, offset: 0B]
        add     rsi, 4    # ivtmp.37,
        mov     BYTE PTR [rsi-4], dl      # MEM[base: _229, offset: 0B], D.4554
        movzx   edx, BYTE PTR [r9+rax]        # D.4554, MEM[base: _34, index: x_58, offset: 0B]
        mov     BYTE PTR [rsi-3], dl      # MEM[base: _229, offset: 1B], D.4554
        movzx   edx, BYTE PTR [r10+rax]       # D.4554, MEM[base: _36, index: x_58, offset: 0B]
        mov     BYTE PTR [rsi-2], dl      # MEM[base: _229, offset: 2B], D.4554
        movzx   edx, BYTE PTR [rdi+rax]       # D.4554, MEM[base: _38, index: x_58, offset: 0B]
        add     rax, 1    # x,
        mov     BYTE PTR [rsi-1], dl      # MEM[base: _229, offset: 3B], D.4554
        cmp     rcx, rax  # n, x
        jne     .L19      #,
        pop     rbx       #
        pop     rbp       #
        pop     r12       #
        pop     r13       #
        pop     r14       #
        pop     r15       #
        jmp     .L36      #
.L21:
        mov     rdx, rsi  # vectp_s.31, s
        mov     r13, rdi  # vectp_pretmp.28, D.4551
        mov     rbp, r10  # vectp_pretmp.25, D.4551
        mov     rbx, r9   # vectp_pretmp.22, D.4551
        mov     r11, r8   # vectp_pretmp.19, D.4551
        xor     r15d, r15d        # ivtmp.66
        jmp     .L12      #

In all honesty, I'm not quite sure what makes this code work. At a high level, it's unpacking the data in the vectors and then interleaving them, but there's enough dense code in there that I can't easily see precisely what GCC is doing.

That's okay though, because this code runs INSANELY FAST. Whereas Clang's straightforward implementation took about 29.33s and my hand-coded implementation took around 7.22s, GCC took only 2.5s. Chalk this up as a major victory for GCC.

Summary

Just because compilers don't always generate particularly optimal code doesn't mean you should discount them; they usually do generate fairly good code. While a compiler may not have been tuned for a particular task, it has many, many optimizations that work across a variety of cases and that, when combined, can result in extremely fast code.

You know that saying everyone throws around about not optimizing prematurely? It holds true most of the time, both because you might spend time optimizing the wrong things and because the time you spend optimizing something might be disproportionate to the resulting performance gain. For ahead-of-time (AOT) compilers, such as the C compilers here, each individual optimization pass runs in such a short amount of time (on the order of tens of microseconds to a few milliseconds in most cases) that they can perform a large variety of optimizations in a fairly short time (a few seconds per file at most?). While most optimizations might only provide a small performance boost on their own, multiplied together, the additional performance they provide can be quite considerable. In addition, there are a few optimizations that provide massive performance gains on their own (instruction scheduling - reordering instructions; vectorization/SIMD - rewriting loops to use vector instructions; register allocation - minimizing spills and restores; etc.).

All of that being said, try to write code that has basic hand-optimizations built in (these are just general rules I always follow when writing code):

  1. Loops and their accesses should go forward (a[x++]) when possible. The hardware prefetcher may eventually pick up on access patterns that go in reverse, but until that point, you've just been wasting cycles.
  2. When you have a lot of operations to perform on a huge amount of data, consider doing each operation on all of your data before moving to the next operation (simplified example), as opposed to doing all operations on a subset of the data and then repeating. This allows the core to make more effective use of the instruction cache, thus providing higher performance.
  3. That being said, make each operation big enough that you don't waste time doing loads and stores. If loads and stores are inducing latency increases and/or throughput decreases, consider making your individual operations bigger so you spend more time doing actual work instead of doing memory accesses.
  4. When possible, use the restrict keyword (void m (int *a, int *b, size_t n) vs. void m (int *restrict a, int *restrict b, size_t n)). restrict tells the compiler that the pointers do not alias (the memory they point to does not overlap), which allows a variety of optimizations; at minimum, the compiler can often eliminate reloads caused by the possibility that a store through one pointer modified data read through another (see the sketch after this list).
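
As a tiny example of what restrict buys you (my own sketch): without restrict, the compiler must assume the store to out[i] might modify *scale and therefore reload *scale on every iteration; with restrict, *scale can be kept in a register for the entire loop.

#include <stddef.h>

void scale_all (int *restrict out, const int *restrict in,
                const int *restrict scale, size_t n) {
        for (size_t i = 0; i < n; i++)
                out[i] = in[i] * *scale;        /* *scale can be hoisted out of the loop */
}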

Conclusion

Modern compilers have extensive optimizations built in, but there do still exist cases that compilers don't cover. For most code, this shouldn't matter, but when dealing with high-performance (read: super low latency or super high throughput) code where the CPU is the limiting factor, it may be worthwhile to have a look at precisely what the compiler is spitting out for the function(s) that is/are inducing a delay (but remember, measure first!). Iff (if and only if) you find that a function is both CPU-bound and inducing a large delay in your program's execution, do the following:

  1. Rewrite that function using the four steps I outlined above.
  2. If you are doing any kind of interesting math, consider using approximations or cheaper equivalents when possible (e.g.: if dividing by a constant float (f / 5), instead multiply by its reciprocal (f * (1.0f / 5))). If accuracy isn't too much of an issue, you might also try using -O3 or -ffast-math, which will allow the compiler to use optimizations and approximations that might reduce accuracy and/or precision.
  3. If you know that your code is vectorizable (SIMD) but the compiler isn't picking up on that, try using vector extensions to show the compiler how to use vectors to accelerate your code. If you need to use a particular instruction, use intrinsics. More on both in a future post.
  4. At this point, if your code still isn't up to satisfactory speed, then try actually rewriting [the relevant parts of] the function in assembly.

As you go through this process, here are some of the resources I usually use:

Resources