MD家族
MD是Message Digest的缩写,其家族目前成员有MD2、MD4、MD5和MD6,这些算法都出自一个人:Ronald Rivest,这个人就是RSA的R!这些算法(MD2/MD4/MD5/MD6)别公布于1989、1990、1992和2008年。笔者在准备学习MD算法之前,就一直对这么多序号感到困惑:哎,是否存在MD1和MD3呢?各个算法之间的关系是怎么样的?
然后发现居然也有人好奇“MD1和MD3算法是否存在”的问题:Is the first version of the Message-Digest algorithm by Ronald Rivest publically available?。从回答来看,可以认为MD1和MD3存在的,但是没有公开发表:
Bart Preneel in his 1993 PhD thesis writes "R. Rivest of RSA Data Security Inc. has designed a series of hash functions, that were named MD for “message digest” followed by a number. MD1 is a proprietary algorithm. MD2 [175] was suggested in 1990, and was recommended to replace BMAC [193]. MD3 was never published, and it seems to have been abandoned by its designer." Since I suppose Bart as an insider should be well informed, I suppose that the algorithm exists but has never been made public.
那么,接着让我们来看看这些已经和我们见面的MD成员们,都具体是什么:
- MD6:Rivest在Crypto2008上提出的,该算法有MD6-128、MD6-256和MD6-512三种。我们可以发现这个算法目前并不比MD5普及,原因可以参考What happened to MD6?
- MD5:Rivest 于1991年对MD4的改进版本,输出是一串定长128bits的摘要/密文。该算法在2004年的时候被山东大学的王小云院士证明存在强碰撞可能,网上很多说法是”被破解“了,这种说法不严谨:王小云院士只是证明MD5不具有抗碰撞性,而不是反解了MD5!参见《MD5 和SHA-1的强抗碰撞性不是早已被王小云教授攻破了吗,为什么现在仍然在使用?》高赞回答。
- MD4:Rivest 在 1990 年设计提出的,输出是一串定长128bits的摘要/密文。
- MD2:Rivest在1989年提出的,很早就不再使用了。
MD5算法
本文重点介绍MD5算法,推荐阅读Rivest的MD5原稿:RFC1321-The MD5 Message-Digest Algorithm。下面分步骤来介绍MD5的具体做法。
1.1 数据补齐/填充(Padding)
MD5也可以认为是基于block的算法,它要求每次处理的数据长度为512bits,但是实际中要处理的明文长度并不一定是512的整数倍,怎么办呢?参考AES的做法,做数据补齐/填充(Padding)。假设原始明文消息的长度为K,MD5的Padding可以细分为2个子步骤:
512bits512-(K%512-448)448-K%512
1.3 Initialize MD Buffer 初始化MD缓冲区
MD Buffer是4个32bits(=4bytes)的向量,准确来说是word。RFC原文对MD Buffer的初始化描述如下:
A four-word buffer (A,B,C,D) is used to compute the message digest. Here each of A, B, C, D is a 32-bit register. These registers are initialized to the following values in hexadecimal, low-order bytes first):
word A: 01 23 45 67
word B: 89 ab cd ef
word C: fe dc ba 98
word D: 76 54 32 10
一个4-word的缓冲区(A、B、C、D)用来产生消息摘要,这里A、B、C、D都是32bits的寄存器,这些寄存器被初始化为以下值(十六进制,低位字节在前):
word A: 01 23 45 67
word B: 89 ab cd ef
word C: fe dc ba 98
word D: 76 54 32 10
必须要【敲黑板】的是文章明确表示:低字节在前,这也就意味着真实代码中存储的时候是(请注意顺序):
unsigned int A = 0x67452301;
unsigned int B = 0xEFCDAB89;
unsigned int C = 0x98BADCFE;
unsigned int D = 0x10325476;
1.4 Process Message in 16-Word Blocks 处理消息组
这一步绝对是整个MD5算法最绕的地方,看了很多网上的文章,越看越绕,看了md5的好多代码实现后终于明白了,先祭出Wikipedia的图:
说实话,乍看这个图并不清楚发生了什么。一步步来解释,在此之前必须了解图中每个字母所代表的的含义:
- 明文M:图中M_i是一个32bits的明文数据,跟A/B/C/D长度相同,也是一个word的大小。问题来了:一个block不是512bits吗?那怎么这里只有32bits呢?回答:所以MD5对一个block数据要进行16次(512/32)完全相同的一套完整操作,一套完整的操作包括4轮向量运算。也就是说,每32bits明文数据为一个单位,要进行4轮的向量运算;而每一个block要对16个(32bits)单位明文数据进行运算,共计64次。
- 正弦函数表(sine table)K:sine table共有64个常量数值,每个数值长度32bits,图中K_i就是就是这个sine table。笔者指出一点:图中的下标是有问题的,K_i和M_i容易让人产生误解,以为明文和sine table的循环顺序一致,实际上两者遍历的顺序不一致!sine table的内容如下:
const unsigned int k[] = {
0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee, 0xf57c0faf, 0x4787c62a, 0xa8304613, 0xfd469501,
0x698098d8, 0x8b44f7af, 0xffff5bb1, 0x895cd7be, 0x6b901122, 0xfd987193, 0xa679438e, 0x49b40821,
0xf61e2562, 0xc040b340, 0x265e5a51, 0xe9b6c7aa, 0xd62f105d, 0x02441453, 0xd8a1e681, 0xe7d3fbc8,
0x21e1cde6, 0xc33707d6, 0xf4d50d87, 0x455a14ed, 0xa9e3e905, 0xfcefa3f8, 0x676f02d9, 0x8d2a4c8a,
0xfffa3942, 0x8771f681, 0x6d9d6122, 0xfde5380c, 0xa4beea44, 0x4bdecfa9, 0xf6bb4b60, 0xbebfbc70,
0x289b7ec6, 0xeaa127fa, 0xd4ef3085, 0x04881d05, 0xd9d4d039, 0xe6db99e5, 0x1fa27cf8, 0xc4ac5665,
0xf4292244, 0x432aff97, 0xab9423a7, 0xfc93a039, 0x655b59c3, 0x8f0ccc92, 0xffeff47d, 0x85845dd1,
0x6fa87e4f, 0xfe2ce6e0, 0xa3014314, 0x4e0811a1, 0xf7537e82, 0xbd3af235, 0x2ad7d2bb, 0xeb86d391};
至于sine table是怎么生成的,推荐阅读:MD5 implementation notes。
- 位移S:MD5定义了一个shift或者说是rotate操作,RFC原文描述如下:
Let X <<< s denote the 32-bit value obtained by circularly shifting (rotating) X left by s bit positions.
令X <<< s表示通过将X向左循环移位(旋转)s个位位置而获得的32位值。
位移S共有16个常量数值:
round 1: 7, 12, 17, 22
round 2: 5, 9, 14, 20
round 3: 4, 11, 16, 23
round 4: 6, 10, 15, 21
0xd76aa47870xe9b6c7aa200xeaa127fa110xf42922446
{正弦函数表K, 明文数据组别(从0起),位移S}
// round 1:
{0xd76aa478, 0, 7}
{0xe8c7b756, 1, 12}
{0x242070db, 2, 17}
{0xc1bdceee, 3, 22}
{0xf57c0faf, 4, 7}
{0x4787c62a, 5, 12}
{0xa8304613, 6, 17}
{0xfd469501, 7, 22}
{0x698098d8, 8, 7}
{0x8b44f7af, 9, 12}
{0xffff5bb1, 10, 17}
{0x895cd7be, 11, 22}
{0x6b901122, 12, 7}
{0xfd987193, 13, 12}
{0xa679438e, 14, 17}
{0x49b40821, 15, 22}
// round 2:
{0xf61e2562, 1, 5}
{0xc040b340, 6, 9}
{0x265e5a51, 11, 14}
{0xe9b6c7aa, 0, 20}
{0xd62f105d, 5, 5}
{0x2441453, 10, 9}
{0xd8a1e681, 15, 14}
{0xe7d3fbc8, 4, 20}
{0x21e1cde6, 9, 5}
{0xc33707d6, 14, 9}
{0xf4d50d87, 3, 14}
{0x455a14ed, 8, 20}
{0xa9e3e905, 13, 5}
{0xfcefa3f8, 2, 9}
{0x676f02d9, 7, 14}
{0x8d2a4c8a, 12, 20}
// round 3:
{0xfffa3942, 5, 4}
{0x8771f681, 8, 11}
{0x6d9d6122, 11, 16}
{0xfde5380c, 14, 23}
{0xa4beea44, 1, 4}
{0x4bdecfa9, 4, 11}
{0xf6bb4b60, 7, 16}
{0xbebfbc70, 10, 23}
{0x289b7ec6, 13, 4}
{0xeaa127fa, 0, 11}
{0xd4ef3085, 3, 16}
{0xd9d4d039, 9, 4}
{0xe6db99e5, 12, 11}
{0x1fa27cf8, 15, 16}
{0xc4ac5665, 2, 23}
// round 4:
{0xf4292244, 0, 6}
{0x432aff97, 7, 10}
{0xab9423a7, 14, 15}
{0xfc93a039, 5, 21}
{0x655b59c3, 12, 6}
{0x8f0ccc92, 3, 10}
{0xffeff47d, 10, 15}
{0x85845dd1, 1, 21}
{0x6fa87e4f, 8, 6}
{0xfe2ce6e0, 15, 10}
{0xa3014314, 6, 15}
{0x4e0811a1, 13, 21}
{0xf7537e82, 4, 6}
{0xbd3af235, 11, 10}
{0x2ad7d2bb, 2, 15}
{0xeb86d391, 9, 21}
- 非线性函数F:函数F是整个MD5算法中唯一非线性变换,虽然图中只写了F,但其实F有4种形式,有的文章会分别描述为F、G、H和I函数(或者是FF\GG\HH\II):
round 1: F(X,Y,Z) = XY v not(X) Z
round 2: G(X,Y,Z) = XZ v Y not(Z)
round 3: H(X,Y,Z) = X xor Y xor Z
round 4: I(X,Y,Z) = Y xor (X v not(Z))
对于一个M_i而言,都会经历这4轮操作。
- MD Buffer:MD Buffer就是A、B、C、D,笔者认为MD5的核心就是不断对MD Buffer做线性、非线性向量变换,最终得到消息摘要。一个block共计64次的操作,指的就是对MD Buffer的操作。而整个明文操作下来应该有N*64(N是明文数据Padding之后整除512的数值),最终得到了信息摘要,过程如下图:
了解以上几个关键点,对整个MD5的加密过程应该也基本有概念了,这个时候推荐回到RFC再去阅读Rivest给出的C代码实现。这里,笔者给出一段C++的MD5实现,希望能够辅助理解:
const unsigned int k[] = {
0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee, 0xf57c0faf, 0x4787c62a, 0xa8304613, 0xfd469501,
0x698098d8, 0x8b44f7af, 0xffff5bb1, 0x895cd7be, 0x6b901122, 0xfd987193, 0xa679438e, 0x49b40821,
0xf61e2562, 0xc040b340, 0x265e5a51, 0xe9b6c7aa, 0xd62f105d, 0x02441453, 0xd8a1e681, 0xe7d3fbc8,
0x21e1cde6, 0xc33707d6, 0xf4d50d87, 0x455a14ed, 0xa9e3e905, 0xfcefa3f8, 0x676f02d9, 0x8d2a4c8a,
0xfffa3942, 0x8771f681, 0x6d9d6122, 0xfde5380c, 0xa4beea44, 0x4bdecfa9, 0xf6bb4b60, 0xbebfbc70,
0x289b7ec6, 0xeaa127fa, 0xd4ef3085, 0x04881d05, 0xd9d4d039, 0xe6db99e5, 0x1fa27cf8, 0xc4ac5665,
0xf4292244, 0x432aff97, 0xab9423a7, 0xfc93a039, 0x655b59c3, 0x8f0ccc92, 0xffeff47d, 0x85845dd1,
0x6fa87e4f, 0xfe2ce6e0, 0xa3014314, 0x4e0811a1, 0xf7537e82, 0xbd3af235, 0x2ad7d2bb, 0xeb86d391};
const unsigned int r[] = {7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 5, 9, 14, 20, 5, 9,
14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23,
4, 11, 16, 23, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21};
unsigned int h[4];
/* initial
h[0] = 0x67452301;
h[1] = 0xefcdab89;
h[2] = 0x98badcfe;
h[3] = 0x10325476;
*/
void md5(unsigned char *p, int len, unsigned char* digest) {
unsigned int a, b, c, d, f, tmp, i, len;
unsigned char *end, t;
unsigned int x[16];
// append padding bits
for (i = len + 1; i % 64 != 56; ++i)
;
p[len] = 0x80;
for (tmp = len + 1; tmp < i; ++tmp) p[tmp] = 0;
// append length
p[i] = (len << 3);
p[i + 1] = (len >> 5);
p[i + 2] = (len >> 13);
p[i + 3] = (len >> 21);
p[i + 4] = (len >> 29);
p[i + 5] = (len >> 29) >> 8;
p[i + 6] = (len >> 29) >> 16;
p[i + 7] = (len >> 29) >> 24;
// update message length
len = i;
// block loop
for (end = p + len; p < end; p += 64) {
a = h[0];
b = h[1];
c = h[2];
d = h[3];
// to int32
for (i = 0; i < 16; ++i) x[i] = p[i * 4 + 0] | (p[i * 4 + 1] << 8) | (p[i * 4 + 2] << 16) | (p[i * 4 + 3] << 24);
// loop for a block, which contains 4 round for a message-word
for (i = 0; i < 64; i++) {
if (i < 16) {
f = (b & c) | (~b & d);
t = i;
} else if (i < 32) {
f = (b & d) | (c & ~d);
t = (5 * i + 1) % 16;
} else if (i < 48) {
f = b ^ c ^ d;
t = (3 * i + 5) % 16;
} else {
f = c ^ (b | ~d);
t = (7 * i) % 16;
}
// update MD Buffer inner loop
tmp = d;
d = c;
c = b;
b = b + LEFTROTATE((a + f + k[i] + x[t]), r[i]);
a = tmp;
}
// update MD Buffer
h[0] += a;
h[1] += b;
h[2] += c;
h[3] += d;
}
// get message digest
for (i = 0; i < 4; ++i) {
digest[0 + 4 * i] = h[i];
digest[1 + 4 * i] = h[i] >> 8;
digest[2 + 4 * i] = h[i] >> 16;
digest[3 + 4 * i] = h[i] >> 24;
}
// show md
for (i = 0; i < 16; ++i) {
printf("%.2x", digest[i]);
}
putchar('\n');
}
1.5 Output 输出
输出这个小节也没啥具体可交代的,上面的代码已经非常明了。
至此,整个MD5算法细节交代完毕。任何问题,欢迎交流~