Do AES-NI instructions accelerate both AES-128 and AES-256?

Yes, AES-NI accelerates both AES-128 and AES-256. And yes, there is a performance difference between hardware-accelerated AES-128 and AES-256, according to https://software.intel.com/en-us/articles/intel-aes-ni-performance-enhancements-hytrust-datacontrol-case-study#_Toc397546813:

On Ivy Bridge, here are the raw numbers for both Cipher Block Chaining (CBC) and XEX-based tweaked-codebook mode with ciphertext stealing (XTS), with both 128- and 256-bit keys.

Note that XTS mode splits its key into two halves, one for encrypting the data and one for the tweak, so XTS-512 effectively uses a 256-bit AES key (and XTS-256 a 128-bit one).
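To make the XTS key sizes in the tables easier to read, here is a minimal sketch of the key split. The function name `split_xts_key` is hypothetical and only illustrates the halving; it performs no encryption, and the assignment of the first half to the data cipher follows the usual XTS-AES convention rather than anything stated in the quoted article.

```python
import os
from typing import Tuple


def split_xts_key(key: bytes) -> Tuple[bytes, bytes]:
    """Split an XTS-AES key into its two equal halves.

    By convention the first half keys the data cipher and the
    second half keys the tweak cipher (hypothetical helper).
    """
    if len(key) not in (32, 64):
        raise ValueError("expected a 256-bit or 512-bit XTS key")
    half = len(key) // 2
    return key[:half], key[half:]


# "aes-xts 512b" in the tables: a 64-byte key, i.e. two AES-256 keys.
data_key, tweak_key = split_xts_key(os.urandom(64))
print(len(data_key) * 8, len(tweak_key) * 8)
```

Read this way, the "aes-xts 256b" rows are comparable to AES-128 and the "aes-xts 512b" rows to AES-256.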

# Tests are approximate using memory only (no storage IO).
#  Algorithm | Key |  Encryption |  Decryption     
     aes-cbc   128b   581.3 MiB/s  1961.8 MiB/s     
     aes-cbc   256b   431.4 MiB/s  1503.1 MiB/s     
     aes-xts   256b  1665.6 MiB/s  1642.3 MiB/s     
     aes-xts   512b  1318.3 MiB/s  1282.1 MiB/s

And for Haswell:

# Tests are approximate using memory only (no storage IO).
#  Algorithm | Key |  Encryption |  Decryption     
     aes-cbc   128b   663.8 MiB/s  2486.8 MiB/s     
     aes-cbc   256b   493.9 MiB/s  2043.6 MiB/s     
     aes-xts   256b  2265.2 MiB/s  2261.1 MiB/s     
     aes-xts   512b  1778.0 MiB/s  1778.7 MiB/s

We made the following observations:

  • For CBC encryption, we see a 40% improvement for 128-bit keys over 256-bit keys.
  • For XTS encryption, we see a 30% improvement for 256-bit keys over 512-bit keys.
  • For CBC decryption, we see a 20% improvement for 128-bit keys over 256-bit keys.
  • For XTS decryption, we see a 30% improvement for 256-bit XTS keys over 512-bit keys.
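The percentages in the list are rounded. As a rough cross-check, the following sketch recomputes the relative advantage of the shorter key directly from the throughput figures in the two tables. The names `RESULTS`, `speedup_percent` and `key_size_speedups` are my own and not part of the quoted article.

```python
# Throughput in MiB/s copied from the tables above: (encryption, decryption).
RESULTS = {
    "Ivy Bridge": {
        ("aes-cbc", 128): (581.3, 1961.8),
        ("aes-cbc", 256): (431.4, 1503.1),
        ("aes-xts", 256): (1665.6, 1642.3),
        ("aes-xts", 512): (1318.3, 1282.1),
    },
    "Haswell": {
        ("aes-cbc", 128): (663.8, 2486.8),
        ("aes-cbc", 256): (493.9, 2043.6),
        ("aes-xts", 256): (2265.2, 2261.1),
        ("aes-xts", 512): (1778.0, 1778.7),
    },
}


def speedup_percent(fast: float, slow: float) -> float:
    """Return how much faster `fast` is than `slow`, in percent."""
    return (fast / slow - 1.0) * 100.0


def key_size_speedups(cpu: str) -> dict:
    """Compare the shorter key against the longer key for each mode and operation."""
    table = RESULTS[cpu]
    out = {}
    for mode, short_key, long_key in (("aes-cbc", 128, 256), ("aes-xts", 256, 512)):
        for index, operation in enumerate(("encryption", "decryption")):
            out[(mode, operation)] = speedup_percent(
                table[(mode, short_key)][index], table[(mode, long_key)][index]
            )
    return out


for cpu in RESULTS:
    for (mode, operation), percent in key_size_speedups(cpu).items():
        print(f"{cpu:10s} {mode} {operation}: {percent:.1f}%")
```

On these figures the shorter key comes out roughly 22% to 35% faster depending on mode, operation and CPU. For reference, AES-256 runs 14 rounds per block against 10 for AES-128, so about 40% is the most one would expect from the round count alone.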

Note that this is the raw performance of the AES-NI instructions. In the real world, disk or network I/O happens while data is being encrypted or decrypted, which affects the throughput you actually see.

Also, I believe the above numbers are for single-core performance. In implementations that use multiple cores, even AES-256 can easily saturate the entire I/O bandwidth of an SSD and most networks.

As noted here, though, choosing the right chaining mode for your purpose affects throughput much more significantly than the key size does.
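The tables above already illustrate this for encryption. The sketch below compares the effect of key size within CBC against the effect of switching from CBC to XTS at the same AES key size. It assumes that the "aes-xts 256b" row corresponds to AES-128, as the XTS key note implies, and the dictionary keys are my own labels.

```python
# Encryption throughput in MiB/s, taken from the tables above.
# Assumption: "aes-xts 256b" uses AES-128 internally (XTS halves the key).
ENCRYPTION = {
    "Ivy Bridge": {"cbc_aes128": 581.3, "cbc_aes256": 431.4, "xts_aes128": 1665.6},
    "Haswell": {"cbc_aes128": 663.8, "cbc_aes256": 493.9, "xts_aes128": 2265.2},
}


def compare_effects(cpu: str) -> tuple:
    """Return (key-size factor, mode factor) for encryption on one CPU."""
    row = ENCRYPTION[cpu]
    key_size_factor = row["cbc_aes128"] / row["cbc_aes256"]  # same mode, shorter key
    mode_factor = row["xts_aes128"] / row["cbc_aes128"]      # same key size, other mode
    return key_size_factor, mode_factor


for cpu in ENCRYPTION:
    key_size_factor, mode_factor = compare_effects(cpu)
    print(f"{cpu}: key size x{key_size_factor:.2f}, mode x{mode_factor:.2f}")
```

The gap comes from CBC encryption being inherently serial: each block needs the previous ciphertext block as input, so the pipelined AES-NI units cannot work on several blocks at once. CBC decryption and XTS have no such dependency, which is consistent with their much higher figures.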