It is true that the standard's definition of memcpy is in terms of copying a sequence of bytes. It is also true that memcpy is one of the most important and most heavily optimized library functions.
These days any credible compiler has a means of determining that an invocation of a function named 'memcpy' is actually an invocation of the standard's memcpy. E.g. gcc treats memcpy as a built-in by default, and the memcpy exposed by some headers is an inline whose body simply calls __builtin_memcpy.
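With such knowledge a compiler can bring to bear all kinds of optimizations, for instance expanding a small copy of known size inline as a few loads and stores instead of emitting a call. Dmitry's measurements seem to bear this out.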
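
To make the point concrete, here is a minimal sketch of the kind of call a compiler can see through. The helper names load_u32 and load_u32_bytes are hypothetical, invented for this illustration. Both functions have the same byte-copy semantics; with optimization enabled, compilers such as gcc will typically turn the memcpy version into a single load rather than a library call, though the exact code generated depends on the compiler, flags, and target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from a buffer that may be
 * unaligned. The copy size is a compile-time constant, so a compiler
 * that recognizes memcpy can expand it inline. */
uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: the same copy written out byte by byte, which
 * is how the standard describes what memcpy does. */
uint32_t load_u32_bytes(const unsigned char *p)
{
    uint32_t v;
    unsigned char *d = (unsigned char *)&v;
    for (size_t i = 0; i < sizeof v; i++)
        d[i] = p[i];
    return v;
}
```

Comparing the output of gcc -O2 -S for the two functions is one way to check what a particular compiler actually does with them.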
/john