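🚄 Background
As Redis runs, the AOF keeps growing, because a single key can accumulate many log entries. Recovering data by replaying such an AOF wastes a lot of time. Redis's answer is AOF Rewrite.
AOF Rewrite regenerates the log from the current contents of the DB, so that each key is described by a minimal set of commands (typically one). A rewrite is triggered in two cases:
1) The user issues the bgrewriteaof command.
2) The AOF grows past the configured thresholds.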
🚄 When AOF Rewrite Is Triggered
First, the handler for bgrewriteaof:
void bgrewriteaofCommand(redisClient *c) {
if (server.aof_child_pid != -1) {
addReplyError(c,"Background append only file rewriting already in progress");
} else if (server.rdb_child_pid != -1) {
server.aof_rewrite_scheduled = 1;
addReplyStatus(c,"Background append only file rewriting scheduled");
} else if (rewriteAppendOnlyFileBackground() == REDIS_OK) {
addReplyStatus(c,"Background append only file rewriting started");
} else {
addReply(c,shared.err);
}
}
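If an AOF rewrite is already in progress, the client gets an error.
If an RDB dump is in progress, aof_rewrite_scheduled is set to 1 to avoid putting extra pressure on the disk; the rewrite starts later, once neither an AOF rewrite nor an RDB dump is running.
If neither is running, rewriteAppendOnlyFileBackground is called to start the rewrite immediately.
If that call fails, an error is returned to the client.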
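The branching in the handler can be restated as a small pure function. This is only a sketch: the enum and the function name decide_rewrite are made up for illustration and are not part of Redis, and the fork-failure branch is left out.

```c
/* Hypothetical names: this enum and function are not part of Redis. */
typedef enum {
    REWRITE_REJECTED,   /* an AOF rewrite child is already running */
    REWRITE_SCHEDULED,  /* an RDB dump is running: defer the rewrite */
    REWRITE_STARTED     /* nothing is running: start right away */
} rewrite_decision;

/* A pid of -1 means "no such child", as in the Redis server struct. */
rewrite_decision decide_rewrite(int aof_child_pid, int rdb_child_pid) {
    if (aof_child_pid != -1) return REWRITE_REJECTED;
    if (rdb_child_pid != -1) return REWRITE_SCHEDULED;
    return REWRITE_STARTED;
}
```

The order of the checks matters: a running AOF rewrite wins over a running RDB dump, so a second bgrewriteaof is rejected rather than queued.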
Next, how serverCron triggers an AOF rewrite.
The first trigger point is the deferred rewrite, postponed to avoid clashing with an RDB dump.
/* Start a scheduled AOF rewrite if this was requested by the user while
* a BGSAVE was in progress. */
if (server.rdb_child_pid == -1 && server.aof_child_pid == -1 &&
server.aof_rewrite_scheduled){
rewriteAppendOnlyFileBackground();
}
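This checks that no AOF rewrite or RDB dump is running (both pids are -1) and that aof_rewrite_scheduled is set, then calls rewriteAppendOnlyFileBackground to start the rewrite.
The second trigger point fires when the AOF file has grown past a configured percentage.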
/* Trigger an AOF rewrite if needed */
if (server.rdb_child_pid == -1 &&
server.aof_child_pid == -1 &&
server.aof_rewrite_perc &&
server.aof_current_size > server.aof_rewrite_min_size){
long long base = server.aof_rewrite_base_size ?
server.aof_rewrite_base_size : 1;
long long growth = (server.aof_current_size*100/base) - 100;
if (growth >= server.aof_rewrite_perc) {
redisLog(REDIS_NOTICE,"Starting automatic rewriting of AOF on %lld%% growth",growth);
rewriteAppendOnlyFileBackground();
}
}
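Once the AOF file exceeds the configured minimum size (aof_rewrite_min_size) and has grown by at least aof_rewrite_perc percent relative to its size after the previous rewrite (aof_rewrite_base_size), an AOF rewrite is triggered.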
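To make the arithmetic concrete, here is the same size check as a standalone function. The function name and parameter names are illustrative; the formula is the one in the snippet above, and the check that no child is running is omitted.

```c
/* Returns 1 if an automatic rewrite should start, 0 otherwise.
 * Illustrative helper: the name and signature are not from Redis. */
int should_auto_rewrite(long long current_size, long long base_size,
                        long long min_size, int rewrite_perc) {
    if (rewrite_perc == 0) return 0;           /* automatic rewrite disabled */
    if (current_size <= min_size) return 0;    /* file still too small */
    long long base = base_size ? base_size : 1;
    long long growth = (current_size * 100 / base) - 100;
    return growth >= rewrite_perc;
}
```

For example, with a 64 MB minimum and a threshold of 100 percent, a file that was 100 MB after the last rewrite triggers a new rewrite at 200 MB (growth of 100) but not at 150 MB (growth of 50).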
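🚄 AOF Rewrite Core Flow
The rewrite roughly works as follows:
The parent forks a child process, which sees a snapshot of the data at that moment; from then on the parent also records incoming commands in aof_rewrite_buf_block.
The child walks the db, writes a temporary AOF file, and exits.
Once the child has finished writing, the parent learns of its exit by polling wait3 in serverCron (no pipe or semaphore is involved in the code shown here).
The parent then appends the contents of aof_rewrite_buf_block to that AOF file.
Finally, the temporary file is renamed to the real AOF file.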
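The steps above can be sketched as a toy program. Everything here is a simplification under stated assumptions: toy_rewrite and its parameters are hypothetical, the "snapshot" and the buffered "diff" are plain strings, and the parent blocks in waitpid instead of polling from a timer as Redis does.

```c
#define _POSIX_C_SOURCE 200809L  /* expose fork(), fsync(), fileno() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Toy model of the rewrite flow; returns 0 on success, -1 on error.
 * snapshot: what the child writes; diff: what the parent buffered. */
int toy_rewrite(const char *final_path, const char *snapshot, const char *diff) {
    char tmp[256];
    pid_t pid = fork();
    if (pid == -1) return -1;
    if (pid == 0) {
        /* Child: dump the snapshot to a temp file named after its pid. */
        snprintf(tmp, sizeof(tmp), "%s.tmp-%d", final_path, (int)getpid());
        FILE *fp = fopen(tmp, "w");
        if (!fp) _exit(1);
        int ok = fputs(snapshot, fp) != EOF;
        if (fflush(fp) == EOF) ok = 0;
        if (fsync(fileno(fp)) == -1) ok = 0;
        if (fclose(fp) == EOF) ok = 0;
        _exit(ok ? 0 : 1);
    }
    /* Parent: a real server keeps serving and polls; here we just block. */
    int status;
    if (waitpid(pid, &status, 0) != pid) return -1;
    snprintf(tmp, sizeof(tmp), "%s.tmp-%d", final_path, (int)pid);
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        unlink(tmp);
        return -1;
    }
    /* Append the commands that arrived while the child was writing. */
    int fd = open(tmp, O_WRONLY | O_APPEND);
    if (fd == -1) { unlink(tmp); return -1; }
    size_t len = strlen(diff);
    if (write(fd, diff, len) != (ssize_t)len) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);
    /* Atomically replace the old file with the rewritten one. */
    if (rename(tmp, final_path) == -1) { unlink(tmp); return -1; }
    return 0;
}
```

Calling toy_rewrite("/tmp/toy.aof", "SET a 1\n", "SET b 2\n") should leave a file holding the snapshot followed by the diff.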
Now for the code itself, starting with rewriteAppendOnlyFileBackground.
pid_t childpid;
long long start;
// <MM>
// Avoid running several rewrite processes at once
// </MM>
if (server.aof_child_pid != -1)
return REDIS_ERR;
If another AOF rewrite child is already running, return an error right away.
start = ustime();
if ((childpid = fork()) == 0) {
char tmpfile[256];
/* Child */
// <MM>
// The child must not accept connections
// </MM>
closeListeningSockets(0);
redisSetProcTitle("redis-aof-rewrite");
// <MM>
// Build the temporary AOF file name
// </MM>
snprintf(tmpfile,256,"temp-rewriteaof-bg-%d.aof", (int) getpid());
if (rewriteAppendOnlyFile(tmpfile) == REDIS_OK) {
size_t private_dirty = zmalloc_get_private_dirty();
if (private_dirty) {
redisLog(REDIS_NOTICE,
"AOF rewrite: %zu MB of memory used by copy-on-write",
private_dirty/(1024*1024));
}
exitFromChild(0);
} else {
exitFromChild(1);
}
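The function takes the current time so it can measure how long fork takes.
It then calls fork and enters the child's code path. The child first closes the listening sockets so it cannot accept client connections.
It also sets the process title, then builds the name of the temporary file the rewrite will write to.
Next it calls rewriteAppendOnlyFile to perform the rewrite.
If the rewrite succeeds, it logs the amount of memory dirtied by copy-on-write and exits with code 0.
If the rewrite fails, it exits with code 1.
Now the parent's code path: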
} else {
/* Parent */
server.stat_fork_time = ustime()-start;
server.stat_fork_rate = (double) zmalloc_used_memory() * 1000000 / server.stat_fork_time / (1024*1024*1024); /* GB per second. */
latencyAddSampleIfNeeded("fork",server.stat_fork_time/1000);
if (childpid == -1) {
redisLog(REDIS_WARNING,
"Can't rewrite append only file in background: fork: %s",
strerror(errno));
return REDIS_ERR;
}
redisLog(REDIS_NOTICE,
"Background append only file rewriting started by pid %d",childpid);
server.aof_rewrite_scheduled = 0;
server.aof_rewrite_time_start = time(NULL);
server.aof_child_pid = childpid;
updateDictResizePolicy();
/* We set appendseldb to -1 in order to force the next call to the
* feedAppendOnlyFile() to issue a SELECT command, so the differences
* accumulated by the parent into server.aof_rewrite_buf will start
* with a SELECT statement and it will be safe to merge. */
server.aof_selected_db = -1;
replicationScriptCacheFlush();
return REDIS_OK;
}
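The parent first records the fork time and samples it for latency monitoring.
It calls updateDictResizePolicy to adjust the resize policy of the db key space: now that a child exists, dict resizing (and the rehash it would start) is disabled so that copy-on-write does not end up copying large numbers of memory pages.
It sets aof_selected_db to -1 so that the next AOF entry starts with a SELECT command, which is also written to aof_rewrite_buf; that makes it safe to append aof_rewrite_buf to the rewritten file. replicationScriptCacheFlush is not covered here.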
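The effect of resetting aof_selected_db can be seen in a tiny model. The struct toy_aof and the function toy_feed are hypothetical stand-ins for the server state and feedAppendOnlyFile, and commands are written as plain text rather than in the Redis protocol.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical mini-model: selected_db plays the role of
 * server.aof_selected_db, and out stands in for the AOF / rewrite buffer. */
typedef struct {
    int selected_db;   /* -1 forces a SELECT on the next append */
    char out[256];
} toy_aof;

/* Append cmd for database dbid, emitting SELECT first when the target
 * db differs from the one last written. */
void toy_feed(toy_aof *aof, int dbid, const char *cmd) {
    size_t used = strlen(aof->out);
    if (dbid != aof->selected_db) {
        snprintf(aof->out + used, sizeof(aof->out) - used, "SELECT %d\n", dbid);
        aof->selected_db = dbid;
        used = strlen(aof->out);
    }
    snprintf(aof->out + used, sizeof(aof->out) - used, "%s\n", cmd);
}
```

Because no database id equals -1, the first command fed after the reset is always preceded by a SELECT, so the buffered diff is self-contained no matter which db the rewritten file ended on.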
Now for the rewrite itself in the child, in rewriteAppendOnlyFile. Broadly, it iterates over every key, serializes it, and writes it to the AOF file.
dictIterator *di = NULL;
dictEntry *de;
rio aof;
FILE *fp;
char tmpfile[256];
int j;
long long now = mstime();
/* Note that we have to use a different temp name here compared to the
* one used by rewriteAppendOnlyFileBackground() function. */
snprintf(tmpfile,256,"temp-rewriteaof-%d.aof", (int) getpid());
fp = fopen(tmpfile,"w");
if (!fp) {
redisLog(REDIS_WARNING, "Opening the temp file for AOF rewrite in rewriteAppendOnlyFile(): %s", strerror(errno));
return REDIS_ERR;
}
Take the current time, build the temporary file name, and create the file.
rioInitWithFile(&aof,fp);
if (server.aof_rewrite_incremental_fsync)
rioSetAutoSync(&aof,REDIS_AOF_AUTOSYNC_BYTES);
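rio is a stream-oriented I/O interface that can have different backends; at this point it has file and in-memory buffer implementations.
This initializes the rio. If server.aof_rewrite_incremental_fsync is set, the AOF is fsynced incrementally while it is written, here once every 32 MB (REDIS_AOF_AUTOSYNC_BYTES), so that one big sync at the end does not saturate the disk. Next comes a loop over every Redis db, rewriting each in turn. Inside the loop: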
char selectcmd[] = "*2\r\n$6\r\nSELECT\r\n";
redisDb *db = server.db+j;
dict *d = db->dict;
if (dictSize(d) == 0) continue;
di = dictGetSafeIterator(d);
if (!di) {
fclose(fp);
return REDIS_ERR;
}
/* SELECT the new DB */
if (rioWrite(&aof,selectcmd,sizeof(selectcmd)-1) == 0) goto werr;
if (rioWriteBulkLongLong(&aof,j) == 0) goto werr;
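First it prepares the SELECT command for the db. If the db is empty it is skipped and the rewrite moves on to the next one. It then obtains an iterator over the db, returning an error if that fails. Finally it writes the SELECT command for the db to the file. Next comes another loop, over every key in the db, generating the corresponding command.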
while ((de = dictNext(di)) != NULL) {
// ...
}
dictReleaseIterator(di);
Inside that loop:
sds keystr;
robj key, *o;
long long expiretime;
keystr = dictGetKey(de);
o = dictGetVal(de);
initStaticStringObject(key,keystr);
expiretime = getExpire(db,&key);
/* If this key is already expired skip it */
if (expiretime != -1 && expiretime < now) continue;
de is a dict entry holding a key and its value. The code first fetches the key and value and wraps the key in a robj. It then looks up the key's expire time; if the key has already expired, it is skipped.
/* Save the key and associated value */
if (o->type == REDIS_STRING) {
/* Emit a SET command */
char cmd[]="*3\r\n$3\r\nSET\r\n";
if (rioWrite(&aof,cmd,sizeof(cmd)-1) == 0) goto werr;
/* Key and value */
if (rioWriteBulkObject(&aof,&key) == 0) goto werr;
if (rioWriteBulkObject(&aof,o) == 0) goto werr;
} else if (o->type == REDIS_LIST) {
if (rewriteListObject(&aof,&key,o) == 0) goto werr;
} else if (o->type == REDIS_SET) {
if (rewriteSetObject(&aof,&key,o) == 0) goto werr;
} else if (o->type == REDIS_ZSET) {
if (rewriteSortedSetObject(&aof,&key,o) == 0) goto werr;
} else if (o->type == REDIS_HASH) {
if (rewriteHashObject(&aof,&key,o) == 0) goto werr;
} else {
redisPanic("Unknown object type");
}
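Next, depending on the object's type, the value is serialized into the corresponding command and written to the AOF file. The per-type serialization is not covered in detail here.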
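The literal "*3\r\n$3\r\nSET\r\n" in the string branch is the start of a command in the Redis protocol: an argument count, then each argument prefixed by its byte length. A minimal encoder for that format might look like the following. resp_encode is a hypothetical helper, not a Redis function, and because it relies on strlen it is not binary-safe the way the real rio helpers are.

```c
#include <stdio.h>
#include <string.h>

/* Encode args as one multi-bulk command into buf.
 * Returns the number of bytes written, or -1 if buf is too small. */
int resp_encode(char *buf, size_t cap, int count, const char **args) {
    int n = snprintf(buf, cap, "*%d\r\n", count);
    if (n < 0 || (size_t)n >= cap) return -1;
    size_t off = (size_t)n;
    for (int i = 0; i < count; i++) {
        n = snprintf(buf + off, cap - off, "$%zu\r\n%s\r\n",
                     strlen(args[i]), args[i]);
        if (n < 0 || (size_t)n >= cap - off) return -1;
        off += (size_t)n;
    }
    return (int)off;
}
```

Encoding the three arguments SET, key, value yields "*3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$5\r\nvalue\r\n", whose prefix matches the literal used in the rewrite code.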
/* Make sure data will not remain on the OS's output buffers */
if (fflush(fp) == EOF) goto werr;
if (fsync(fileno(fp)) == -1) goto werr;
if (fclose(fp) == EOF) goto werr;
/* Use RENAME to make sure the DB file is changed atomically only
* if the generate DB file is ok. */
if (rename(tmpfile,filename) == -1) {
redisLog(REDIS_WARNING,"Error moving temp append only file on the final destination: %s", strerror(errno));
unlink(tmpfile);
return REDIS_ERR;
}
redisLog(REDIS_NOTICE,"SYNC append only file rewrite performed");
return REDIS_OK;
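fflush and fsync push the data to disk, and the file is then closed. The temporary file is renamed only after that, so the final AOF file is replaced only by a complete file and never left half-written. Finally, a log line is written and the function returns.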
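The same flush, sync, close, rename sequence works as a general pattern for replacing a file safely. Below is a sketch under assumptions: write_then_rename and its parameters are illustrative names, and the error path mirrors the idea of the werr label rather than reproducing the Redis code.

```c
#define _POSIX_C_SOURCE 200809L  /* expose fsync() and fileno() */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write data to tmp_path, force it to disk, then rename it over
 * final_path. Returns 0 on success, -1 on error. */
int write_then_rename(const char *tmp_path, const char *final_path,
                      const char *data) {
    FILE *fp = fopen(tmp_path, "w");
    if (!fp) return -1;
    if (fputs(data, fp) == EOF) goto werr;
    if (fflush(fp) == EOF) goto werr;         /* stdio buffer -> kernel */
    if (fsync(fileno(fp)) == -1) goto werr;   /* kernel cache -> disk */
    if (fclose(fp) == EOF) { fp = NULL; goto werr; }
    fp = NULL;
    /* Readers see either the old file or the complete new one. */
    if (rename(tmp_path, final_path) == -1) goto werr;
    return 0;
werr:
    if (fp) fclose(fp);
    unlink(tmp_path);   /* never leave a partial temp file behind */
    return -1;
}
```

The ordering is the point: renaming before the data is synced could expose a file whose contents are not yet on disk.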
werr:
fclose(fp);
unlink(tmpfile);
redisLog(REDIS_WARNING,"Write error writing append only file on disk: %s", strerror(errno));
if (di)
dictReleaseIterator(di);
return REDIS_ERR;
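After the file has been opened, a failure at any step jumps to werr for error handling: close the file, delete the temporary file, and release the dict iterator if it has not been released yet. Finally, an error is returned.
At this point the child's part of the AOF rewrite is done and the rewritten file exists. However, the commands logged while the rewrite was running are not in that file yet, so some finishing work remains, and it is done in the main process.
🚄 Appending the AOF Rewrite Buffer
In multi-process programming, a parent must reap a child after it exits; otherwise the child becomes a zombie process. The main process also reaps the rewrite child in serverCron.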
/* Check if a background saving or AOF rewrite in progress terminated. */
if (server.rdb_child_pid != -1 || server.aof_child_pid != -1) {
int statloc;
pid_t pid;
if ((pid = wait3(&statloc,WNOHANG,NULL)) != 0) {
int exitcode = WEXITSTATUS(statloc);
int bysignal = 0;
if (WIFSIGNALED(statloc)) bysignal = WTERMSIG(statloc);
if (pid == server.rdb_child_pid) {
backgroundSaveDoneHandler(exitcode,bysignal);
} else if (pid == server.aof_child_pid) {
backgroundRewriteDoneHandler(exitcode,bysignal);
} else {
redisLog(REDIS_WARNING,
"Warning, detected child with unmatched pid: %ld",
(long)pid);
}
updateDictResizePolicy();
}
} else {
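While an RDB dump or AOF rewrite is in progress, the main process calls wait3 without blocking, so it can collect the child's exit status once it exits. If the process that exited is the AOF rewrite child, backgroundRewriteDoneHandler is called to finish the job.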
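Non-blocking reaping can be tried in isolation. The sketch below uses waitpid with WNOHANG on one specific child rather than wait3 on any child, and a short sleep stands in for the gap between serverCron ticks; poll_child_exit is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L  /* expose fork(), waitpid(), nanosleep() */
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Fork a child that exits with code, then poll for it without blocking.
 * Returns the child's exit code, or -1 on error or death by signal. */
int poll_child_exit(int code) {
    pid_t child = fork();
    if (child == -1) return -1;
    if (child == 0) _exit(code);      /* child: exit immediately */

    for (;;) {
        int status;
        pid_t pid = waitpid(child, &status, WNOHANG);
        if (pid == -1) return -1;     /* wait error */
        if (pid == 0) {               /* still running: check again later */
            struct timespec delay = {0, 1000000};  /* 1 ms */
            nanosleep(&delay, NULL);
            continue;
        }
        if (WIFSIGNALED(status)) return -1;
        return WEXITSTATUS(status);
    }
}
```

A return value of 0 from waitpid means "no child has changed state yet", which is why the caller can keep doing other work between checks.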
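Now for backgroundRewriteDoneHandler itself. The first branch handles a normal exit, meaning the child was not killed by a signal and its exit code is 0.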
int newfd, oldfd;
char tmpfile[256];
long long now = ustime();
mstime_t latency;
redisLog(REDIS_NOTICE,
"Background AOF rewrite terminated with success");
/* Flush the differences accumulated by the parent to the
* rewritten AOF. */
latencyStartMonitor(latency);
snprintf(tmpfile,256,"temp-rewriteaof-bg-%d.aof",
(int)server.aof_child_pid);
newfd = open(tmpfile,O_WRONLY|O_APPEND);
if (newfd == -1) {
redisLog(REDIS_WARNING,
"Unable to open the temporary AOF produced by the child: %s", strerror(errno));
goto cleanup;
}
First it logs a message, then opens the temporary file the rewrite wrote.
// <MM>
// Append the rewrite buffer to the file
// </MM>
if (aofRewriteBufferWrite(newfd) == -1) {
redisLog(REDIS_WARNING,
"Error trying to flush the parent diff to the rewritten AOF: %s", strerror(errno));
close(newfd);
goto cleanup;
}
latencyEndMonitor(latency);
latencyAddSampleIfNeeded("aof-rewrite-diff-write",latency);
redisLog(REDIS_NOTICE,
"Parent diff successfully flushed to the rewritten AOF (%lu bytes)", aofRewriteBufferSize());
Next, the AOF rewrite buffer is appended to the file.
/* The only remaining thing to do is to rename the temporary file to
* the configured file and switch the file descriptor used to do AOF
* writes. We don't want close(2) or rename(2) calls to block the
* server on old file deletion.
*
* There are two possible scenarios:
*
* 1) AOF is DISABLED and this was a one time rewrite. The temporary
* file will be renamed to the configured file. When this file already
* exists, it will be unlinked, which may block the server.
*
* 2) AOF is ENABLED and the rewritten AOF will immediately start
* receiving writes. After the temporary file is renamed to the
* configured file, the original AOF file descriptor will be closed.
* Since this will be the last reference to that file, closing it
* causes the underlying file to be unlinked, which may block the
* server.
*
* To mitigate the blocking effect of the unlink operation (either
* caused by rename(2) in scenario 1, or by close(2) in scenario 2), we
* use a background thread to take care of this. First, we
* make scenario 1 identical to scenario 2 by opening the target file
* when it exists. The unlink operation after the rename(2) will then
* be executed upon calling close(2) for its descriptor. Everything to
* guarantee atomicity for this switch has already happened by then, so
* we don't care what the outcome or duration of that close operation
* is, as long as the file descriptor is released again. */
if (server.aof_fd == -1) {
// <MM>
// AOF is disabled; the rewrite was triggered by a command
// </MM>
/* AOF disabled */
/* Don't care if this fails: oldfd will be -1 and we handle that.
* One notable case of -1 return is if the old file does
* not exist. */
oldfd = open(server.aof_filename,O_RDONLY|O_NONBLOCK);
} else {
/* AOF enabled */
oldfd = -1; /* We'll set this to the current AOF filedes later. */
}
/* Rename the temporary file. This will not unlink the target file if
* it exists, because we reference it with "oldfd". */
latencyStartMonitor(latency);
if (rename(tmpfile,server.aof_filename) == -1) {
redisLog(REDIS_WARNING,
"Error trying to rename the temporary AOF file: %s", strerror(errno));
close(newfd);
if (oldfd != -1) close(oldfd);
goto cleanup;
}
latencyEndMonitor(latency);
latencyAddSampleIfNeeded("aof-rename",latency);
if (server.aof_fd == -1) {
/* AOF disabled, we don't need to set the AOF file descriptor
* to this new file, so we can close it. */
close(newfd);
} else {
/* AOF enabled, replace the old fd with the new one. */
oldfd = server.aof_fd;
server.aof_fd = newfd;
if (server.aof_fsync == AOF_FSYNC_ALWAYS)
aof_fsync(newfd);
else if (server.aof_fsync == AOF_FSYNC_EVERYSEC)
aof_background_fsync(newfd);
server.aof_selected_db = -1; /* Make sure SELECT is re-issued */
aofUpdateCurrentSize();
server.aof_rewrite_base_size = server.aof_current_size;
/* Clear regular AOF buffer since its contents was just written to
* the new AOF from the background rewrite buffer. */
sdsfree(server.aof_buf);
server.aof_buf = sdsempty();
}
Then the temporary file is renamed to the final AOF file.
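The trick described in the long comment, holding a descriptor on the old file so that rename does not have to free it, can be demonstrated on its own. This is a sketch with hypothetical names (swap_keeping_old, renamed); in Redis the returned descriptor would be closed by a background job rather than by the caller directly.

```c
#define _POSIX_C_SOURCE 200809L  /* expose POSIX open()/close() prototypes */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Rename tmp_path over final_path while holding a descriptor on the old
 * target. Returns that descriptor, or -1 if there was no old file or the
 * rename failed. *renamed is set to 1 only if the rename succeeded. */
int swap_keeping_old(const char *tmp_path, const char *final_path,
                     int *renamed) {
    /* May be -1, for example when the old file does not exist. */
    int oldfd = open(final_path, O_RDONLY | O_NONBLOCK);
    *renamed = 0;
    if (rename(tmp_path, final_path) == -1) {
        if (oldfd != -1) close(oldfd);
        return -1;
    }
    *renamed = 1;
    /* The old contents stay reachable through oldfd until it is closed. */
    return oldfd;
}
```

After the call, the path names the new file while the old data is still readable through the returned descriptor; the potentially slow release of the old file happens only when that descriptor is closed.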
server.aof_lastbgrewrite_status = REDIS_OK;
redisLog(REDIS_NOTICE, "Background AOF rewrite finished successfully");
/* Change state from WAIT_REWRITE to ON if needed */
if (server.aof_state == REDIS_AOF_WAIT_REWRITE)
server.aof_state = REDIS_AOF_ON;
/* Asynchronously close the overwritten AOF. */
if (oldfd != -1) bioCreateBackgroundJob(REDIS_BIO_CLOSE_FILE,(void*)(long)oldfd,NULL,NULL);
redisLog(REDIS_VERBOSE,
"Background AOF rewrite signal handler took %lldus", ustime()-now);
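Finally, it updates the state and closes the previous AOF file asynchronously. If the rewrite child exited abnormally, either killed by a signal or with a non-zero exit code, it only records the failure in aof_lastbgrewrite_status and logs a warning.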
} else if (!bysignal && exitcode != 0) {
server.aof_lastbgrewrite_status = REDIS_ERR;
redisLog(REDIS_WARNING,
"Background AOF rewrite terminated with error");
} else {
server.aof_lastbgrewrite_status = REDIS_ERR;
redisLog(REDIS_WARNING,
"Background AOF rewrite terminated by signal %d", bysignal);
}
If appending the rewrite buffer or renaming the file fails, cleanup work is needed; it is handled by the cleanup branch:
cleanup:
aofRewriteBufferReset();
aofRemoveTempFile(server.aof_child_pid);
server.aof_child_pid = -1;
server.aof_rewrite_time_last = time(NULL)-server.aof_rewrite_time_start;
server.aof_rewrite_time_start = -1;
/* Schedule a new rewrite if we are waiting for it to switch the AOF ON. */
if (server.aof_state == REDIS_AOF_WAIT_REWRITE)
server.aof_rewrite_scheduled = 1;