
Redis In-Depth Series, Part 8: Eviction Policies

Author: Nick
  • June 22, 2022


LRU

Having covered the expiration policies, let's move on to eviction. Redis uses an approximate LRU policy. Why approximate? First let's look at what LRU actually is, starting from the Wikipedia description.



In the Wikipedia example there are four storage slots, and the access sequence is A B C D E D F. When D is accessed for the first time the slots become full and D gets the value 4 (the smaller this value, the sooner the entry is evicted). When E arrives, A has the smallest value of the four existing entries, meaning it is the oldest and least recently used, so A is evicted and E takes its slot with value 5. D is then accessed again; it already exists, so its value is refreshed to 6. Finally F arrives, and now B is the oldest and least recently used entry, so B is the one evicted. That is LRU in a nutshell, but Redis does not implement it strictly, for the same reason discussed for the expiration policies: the strictest expiration scheme would give every key its own timer so it can be removed the moment it expires, but the CPU cost is not worth the memory it saves. Eviction faces the same trade-off. Keeping a fully ordered LRU value for every key, as in the figure, and scanning for the minimum would be too expensive, so Redis samples instead. The original implementation randomly picked 5 keys from the dict and evicted the one with the smallest LRU value. That roughly does the job, but not particularly well, so starting with Redis 3.0 the sampling was combined with a pool of 16 entries kept ordered by idle time: a sampled key is admitted only if the pool still has room or the key is a better eviction candidate (larger idle time) than at least one entry already there, and once the pool is full, admitting a better candidate pushes out the entry with the smallest idle time. antirez ran an experiment to evaluate this improvement, illustrated by the figure below.
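
To make the sampling idea concrete, here is a minimal, stand-alone sketch of approximate LRU (illustration only, not Redis code; the key count, timestamps and SAMPLES constant are invented): draw a few random keys and evict the one with the oldest access time.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NKEYS   8
#define SAMPLES 5   /* Redis originally sampled 5 keys per eviction */

/* Toy cache: last_access[i] is a fake timestamp of the last access to key i. */
static long last_access[NKEYS];

/* Approximate LRU: sample a few random keys and evict the oldest among them. */
static int pick_victim(void) {
    int victim = rand() % NKEYS;
    for (int i = 1; i < SAMPLES; i++) {
        int k = rand() % NKEYS;
        if (last_access[k] < last_access[victim]) victim = k;
    }
    return victim;
}

int main(void) {
    srand((unsigned)time(NULL));
    for (int i = 0; i < NKEYS; i++) last_access[i] = i;  /* key 0 is the oldest */
    printf("evict key %d\n", pick_victim());             /* usually a low-numbered key */
    return 0;
}

The victim is only probably the globally oldest key, which is exactly the trade-off Redis accepts; the eviction pool described above narrows the gap without tracking every key.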



As background, every dot in the figure corresponds to a Redis key. The gray keys were added first and then accessed once in order, after which the green keys were added. Under theoretical LRU (the top-left panel) all of the light gray keys should be evicted. Compare that with the other three panels: the bottom-left shows Redis 2.8, which samples 5 random keys and evicts the one with the smallest LRU value, and both gray and light gray keys end up evicted; the bottom-right shows Redis 3.0 with the same sample size, which does somewhat better; and once the 3.0 sample size is raised to 10, the result is already quite close to theoretical LRU. Let's walk through the code.


typedef struct redisObject {
    unsigned type:4;
    unsigned encoding:4;
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
                            * LFU data (least significant 8 bits frequency
                            * and most significant 16 bits access time). */
    int refcount;
    void *ptr;
} robj;


For the LRU policy, the lru field records the redisObject's LRU time. Every time Redis accesses a key it calls lookupKey:


/* Low level key lookup API, not actually called directly from commands
 * implementations that should instead rely on lookupKeyRead(),
 * lookupKeyWrite() and lookupKeyReadWithFlags(). */
robj *lookupKey(redisDb *db, robj *key, int flags) {
    dictEntry *de = dictFind(db->dict,key->ptr);
    if (de) {
        robj *val = dictGetVal(de);

        /* Update the access time for the ageing algorithm.
         * Don't do it if we have a saving child, as this will trigger
         * a copy on write madness. */
        if (!hasActiveChildProcess() && !(flags & LOOKUP_NOTOUCH)){
            if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
                // Covered in the LFU section below
                updateLFU(val);
            } else {
                // In this branch the lru value is refreshed on every access
                val->lru = LRU_CLOCK();
            }
        }
        return val;
    } else {
        return NULL;
    }
}

/* This function is used to obtain the current LRU clock.
 * If the current resolution is lower than the frequency we refresh the
 * LRU clock (as it should be in production servers) we return the
 * precomputed value, otherwise we need to resort to a system call. */
unsigned int LRU_CLOCK(void) {
    unsigned int lruclock;
    if (1000/server.hz <= LRU_CLOCK_RESOLUTION) {
        // If server.hz >= 1 the cached clock is refreshed often enough,
        // so the precomputed server.lruclock is used
        lruclock = server.lruclock;
    } else {
        lruclock = getLRUClock();
    }
    return lruclock;
}

/* Return the LRU clock, based on the clock resolution. This is a time
 * in a reduced-bits format that can be used to set and check the
 * object->lru field of redisObject structures. */
unsigned int getLRUClock(void) {
    return (mstime()/LRU_CLOCK_RESOLUTION) & LRU_CLOCK_MAX;
}


Redis handles commands in processCommand:


/* If this function gets called we already read a whole
 * command, arguments are in the client argv/argc fields.
 * processCommand() execute the command or prepare the
 * server for a bulk read from the client.
 *
 * If C_OK is returned the client is still alive and valid and
 * other operations can be performed by the caller. Otherwise
 * if C_ERR is returned the client was destroyed (i.e. after QUIT). */
int processCommand(client *c) {
    moduleCallCommandFilters(c);

    /* Handle the maxmemory directive.
     *
     * Note that we do not want to reclaim memory if we are here re-entering
     * the event loop since there is a busy Lua script running in timeout
     * condition, to avoid mixing the propagation of scripts with the
     * propagation of DELs due to eviction. */
    if (server.maxmemory && !server.lua_timedout) {
        int out_of_memory = freeMemoryIfNeededAndSafe() == C_ERR;
        /* freeMemoryIfNeeded may flush slave output buffers. This may result
         * into a slave, that may be the active client, to be freed. */
        if (server.current_client == NULL) return C_ERR;

        /* It was impossible to free enough memory, and the command the client
         * is trying to execute is denied during OOM conditions or the client
         * is in MULTI/EXEC context? Error. */
        if (out_of_memory &&
            (c->cmd->flags & CMD_DENYOOM ||
             (c->flags & CLIENT_MULTI &&
              c->cmd->proc != execCommand &&
              c->cmd->proc != discardCommand)))
        {
            flagTransaction(c);
            addReply(c, shared.oomerr);
            return C_OK;
        }
    }
}


Only part of the function is shown here. When memory needs to be reclaimed, it calls freeMemoryIfNeededAndSafe:


/* This is a wrapper for freeMemoryIfNeeded() that only really calls the
 * function if right now there are the conditions to do so safely:
 *
 * - There must be no script in timeout condition.
 * - Nor we are loading data right now.
 * */
int freeMemoryIfNeededAndSafe(void) {
    if (server.lua_timedout || server.loading) return C_OK;
    return freeMemoryIfNeeded();
}

/* This function is periodically called to see if there is memory to free
 * according to the current "maxmemory" settings. In case we are over the
 * memory limit, the function will try to free some memory to return back
 * under the limit.
 *
 * The function returns C_OK if we are under the memory limit or if we
 * were over the limit, but the attempt to free memory was successful.
 * Otehrwise if we are over the memory limit, but not enough memory
 * was freed to return back under the limit, the function returns C_ERR. */
int freeMemoryIfNeeded(void) {
    int keys_freed = 0;
    /* By default replicas should ignore maxmemory
     * and just be masters exact copies. */
    if (server.masterhost && server.repl_slave_ignore_maxmemory) return C_OK;

    size_t mem_reported, mem_tofree, mem_freed;
    mstime_t latency, eviction_latency;
    long long delta;
    int slaves = listLength(server.slaves);

    /* When clients are paused the dataset should be static not just from the
     * POV of clients not being able to write, but also from the POV of
     * expires and evictions of keys not being performed. */
    if (clientsArePaused()) return C_OK;
    if (getMaxmemoryState(&mem_reported,NULL,&mem_tofree,NULL) == C_OK)
        return C_OK;

    mem_freed = 0;

    if (server.maxmemory_policy == MAXMEMORY_NO_EVICTION)
        goto cant_free; /* We need to free memory, but policy forbids. */

    latencyStartMonitor(latency);
    while (mem_freed < mem_tofree) {
        int j, k, i;
        static unsigned int next_db = 0;
        sds bestkey = NULL;
        int bestdbid;
        redisDb *db;
        dict *dict;
        dictEntry *de;

        if (server.maxmemory_policy & (MAXMEMORY_FLAG_LRU|MAXMEMORY_FLAG_LFU) ||
            server.maxmemory_policy == MAXMEMORY_VOLATILE_TTL)
        {
            struct evictionPoolEntry *pool = EvictionPoolLRU;

            while(bestkey == NULL) {
                unsigned long total_keys = 0, keys;

                /* We don't want to make local-db choices when expiring keys,
                 * so to start populate the eviction pool sampling keys from
                 * every DB. */
                for (i = 0; i < server.dbnum; i++) {
                    db = server.db+i;
                    dict = (server.maxmemory_policy & MAXMEMORY_FLAG_ALLKEYS) ?
                            db->dict : db->expires;
                    if ((keys = dictSize(dict)) != 0) {
                        evictionPoolPopulate(i, dict, db->dict, pool);
                        total_keys += keys;
                    }
                }
                if (!total_keys) break; /* No keys to evict. */

                /* Go backward from best to worst element to evict. */
                for (k = EVPOOL_SIZE-1; k >= 0; k--) {
                    if (pool[k].key == NULL) continue;
                    bestdbid = pool[k].dbid;

                    if (server.maxmemory_policy & MAXMEMORY_FLAG_ALLKEYS) {
                        de = dictFind(server.db[pool[k].dbid].dict,
                            pool[k].key);
                    } else {
                        de = dictFind(server.db[pool[k].dbid].expires,
                            pool[k].key);
                    }

                    /* Remove the entry from the pool. */
                    if (pool[k].key != pool[k].cached)
                        sdsfree(pool[k].key);
                    pool[k].key = NULL;
                    pool[k].idle = 0;

                    /* If the key exists, is our pick. Otherwise it is
                     * a ghost and we need to try the next element. */
                    if (de) {
                        bestkey = dictGetKey(de);
                        break;
                    } else {
                        /* Ghost... Iterate again. */
                    }
                }
            }
        }

        /* volatile-random and allkeys-random policy */
        else if (server.maxmemory_policy == MAXMEMORY_ALLKEYS_RANDOM ||
                 server.maxmemory_policy == MAXMEMORY_VOLATILE_RANDOM)
        {
            /* When evicting a random key, we try to evict a key for
             * each DB, so we use the static 'next_db' variable to
             * incrementally visit all DBs. */
            for (i = 0; i < server.dbnum; i++) {
                j = (++next_db) % server.dbnum;
                db = server.db+j;
                dict = (server.maxmemory_policy == MAXMEMORY_ALLKEYS_RANDOM) ?
                        db->dict : db->expires;
                if (dictSize(dict) != 0) {
                    de = dictGetRandomKey(dict);
                    bestkey = dictGetKey(de);
                    bestdbid = j;
                    break;
                }
            }
        }

        /* Finally remove the selected key. */
        if (bestkey) {
            db = server.db+bestdbid;
            robj *keyobj = createStringObject(bestkey,sdslen(bestkey));
            propagateExpire(db,keyobj,server.lazyfree_lazy_eviction);
            /* We compute the amount of memory freed by db*Delete() alone.
             * It is possible that actually the memory needed to propagate
             * the DEL in AOF and replication link is greater than the one
             * we are freeing removing the key, but we can't account for
             * that otherwise we would never exit the loop.
             *
             * AOF and Output buffer memory will be freed eventually so
             * we only care about memory used by the key space. */
            delta = (long long) zmalloc_used_memory();
            latencyStartMonitor(eviction_latency);
            if (server.lazyfree_lazy_eviction)
                dbAsyncDelete(db,keyobj);
            else
                dbSyncDelete(db,keyobj);
            latencyEndMonitor(eviction_latency);
            latencyAddSampleIfNeeded("eviction-del",eviction_latency);
            latencyRemoveNestedEvent(latency,eviction_latency);
            delta -= (long long) zmalloc_used_memory();
            mem_freed += delta;
            server.stat_evictedkeys++;
            notifyKeyspaceEvent(NOTIFY_EVICTED, "evicted",
                keyobj, db->id);
            decrRefCount(keyobj);
            keys_freed++;

            /* When the memory to free starts to be big enough, we may
             * start spending so much time here that is impossible to
             * deliver data to the slaves fast enough, so we force the
             * transmission here inside the loop. */
            if (slaves) flushSlavesOutputBuffers();

            /* Normally our stop condition is the ability to release
             * a fixed, pre-computed amount of memory. However when we
             * are deleting objects in another thread, it's better to
             * check, from time to time, if we already reached our target
             * memory, since the "mem_freed" amount is computed only
             * across the dbAsyncDelete() call, while the thread can
             * release the memory all the time. */
            if (server.lazyfree_lazy_eviction && !(keys_freed % 16)) {
                if (getMaxmemoryState(NULL,NULL,NULL,NULL) == C_OK) {
                    /* Let's satisfy our stop condition. */
                    mem_freed = mem_tofree;
                }
            }
        } else {
            latencyEndMonitor(latency);
            latencyAddSampleIfNeeded("eviction-cycle",latency);
            goto cant_free; /* nothing to free... */
        }
    }
    latencyEndMonitor(latency);
    latencyAddSampleIfNeeded("eviction-cycle",latency);
    return C_OK;

cant_free:
    /* We are here if we are not able to reclaim memory. There is only one
     * last thing we can try: check if the lazyfree thread has jobs in queue
     * and wait... */
    while(bioPendingJobsOfType(BIO_LAZY_FREE)) {
        if (((mem_reported - zmalloc_used_memory()) + mem_freed) >= mem_tofree)
            break;
        usleep(1000);
    }
    return C_ERR;
}


This is where keys are evicted according to the configured policy. The first step is to add candidate keys to the pool, which is done by evictionPoolPopulate:


void evictionPoolPopulate(int dbid, dict *sampledict, dict *keydict, struct evictionPoolEntry *pool) {
    int j, k, count;
    dictEntry *samples[server.maxmemory_samples];

    count = dictGetSomeKeys(sampledict,samples,server.maxmemory_samples);
    for (j = 0; j < count; j++) {
        unsigned long long idle;
        sds key;
        robj *o;
        dictEntry *de;

        de = samples[j];
        key = dictGetKey(de);

        /* If the dictionary we are sampling from is not the main
         * dictionary (but the expires one) we need to lookup the key
         * again in the key dictionary to obtain the value object. */
        if (server.maxmemory_policy != MAXMEMORY_VOLATILE_TTL) {
            if (sampledict != keydict) de = dictFind(keydict, key);
            o = dictGetVal(de);
        }

        /* Calculate the idle time according to the policy. This is called
         * idle just because the code initially handled LRU, but is in fact
         * just a score where an higher score means better candidate. */
        if (server.maxmemory_policy & MAXMEMORY_FLAG_LRU) {
            idle = estimateObjectIdleTime(o);
        } else if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
            /* When we use an LRU policy, we sort the keys by idle time
             * so that we expire keys starting from greater idle time.
             * However when the policy is an LFU one, we have a frequency
             * estimation, and we want to evict keys with lower frequency
             * first. So inside the pool we put objects using the inverted
             * frequency subtracting the actual frequency to the maximum
             * frequency of 255. */
            idle = 255-LFUDecrAndReturn(o);
        } else if (server.maxmemory_policy == MAXMEMORY_VOLATILE_TTL) {
            /* In this case the sooner the expire the better. */
            idle = ULLONG_MAX - (long)dictGetVal(de);
        } else {
            serverPanic("Unknown eviction policy in evictionPoolPopulate()");
        }

        /* Insert the element inside the pool.
         * First, find the first empty bucket or the first populated
         * bucket that has an idle time smaller than our idle time. */
        k = 0;
        while (k < EVPOOL_SIZE &&
               pool[k].key &&
               pool[k].idle < idle) k++;
        if (k == 0 && pool[EVPOOL_SIZE-1].key != NULL) {
            /* Can't insert if the element is < the worst element we have
             * and there are no empty buckets. */
            continue;
        } else if (k < EVPOOL_SIZE && pool[k].key == NULL) {
            /* Inserting into empty position. No setup needed before insert. */
        } else {
            /* Inserting in the middle. Now k points to the first element
             * greater than the element to insert. */
            if (pool[EVPOOL_SIZE-1].key == NULL) {
                /* Free space on the right? Insert at k shifting
                 * all the elements from k to end to the right. */

                /* Save SDS before overwriting. */
                sds cached = pool[EVPOOL_SIZE-1].cached;
                memmove(pool+k+1,pool+k,
                    sizeof(pool[0])*(EVPOOL_SIZE-k-1));
                pool[k].cached = cached;
            } else {
                /* No free space on right? Insert at k-1 */
                k--;
                /* Shift all elements on the left of k (included) to the
                 * left, so we discard the element with smaller idle time. */
                sds cached = pool[0].cached; /* Save SDS before overwriting. */
                if (pool[0].key != pool[0].cached) sdsfree(pool[0].key);
                memmove(pool,pool+1,sizeof(pool[0])*k);
                pool[k].cached = cached;
            }
        }

        /* Try to reuse the cached SDS string allocated in the pool entry,
         * because allocating and deallocating this object is costly
         * (according to the profiler, not my fantasy. Remember:
         * premature optimizbla bla bla bla. */
        int klen = sdslen(key);
        if (klen > EVPOOL_CACHED_SDS_SIZE) {
            pool[k].key = sdsdup(key);
        } else {
            memcpy(pool[k].cached,key,klen+1);
            sdssetlen(pool[k].cached,klen);
            pool[k].key = pool[k].cached;
        }
        pool[k].idle = idle;
        pool[k].dbid = dbid;
    }
}


Redis randomly samples maxmemory_samples keys and computes the idle time of each. A key enters the pool when it qualifies, that is, when its idle time is larger than that of some key already in the pool (or the pool still has free slots). After the pool has been updated, the key with the largest idle time in the pool is evicted.
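
To see how the pool behaves in isolation, here is a simplified stand-alone sketch (not the Redis implementation: the pool is shrunk to 4 entries and the keys and idle times are invented). It keeps the array sorted by ascending idle time, admits a sampled key only when there is free space or its idle time beats the smallest one in the pool, and the eviction candidate is always taken from the high-idle end:

#include <stdio.h>
#include <string.h>

#define POOL_SIZE 4  /* Redis uses EVPOOL_SIZE = 16; 4 keeps the demo short */

struct entry { const char *key; unsigned long long idle; };
static struct entry pool[POOL_SIZE];   /* kept sorted by ascending idle time */
static int pool_len = 0;

/* Offer a sampled key to the pool. */
static void pool_offer(const char *key, unsigned long long idle) {
    if (pool_len == POOL_SIZE && idle <= pool[0].idle) return; /* not a better candidate */
    if (pool_len == POOL_SIZE) {   /* full: discard the entry with the smallest idle time */
        memmove(pool, pool + 1, sizeof(pool[0]) * (POOL_SIZE - 1));
        pool_len--;
    }
    int k = pool_len;
    while (k > 0 && pool[k - 1].idle > idle) {  /* shift larger entries to the right */
        pool[k] = pool[k - 1];
        k--;
    }
    pool[k].key = key;
    pool[k].idle = idle;
    pool_len++;
}

int main(void) {
    pool_offer("a", 10); pool_offer("b", 50); pool_offer("c", 5);
    pool_offer("d", 70); pool_offer("e", 30);   /* "c" (idle 5) gets pushed out */
    printf("evict: %s (idle %llu)\n", pool[pool_len - 1].key, pool[pool_len - 1].idle);
    return 0;
}

Running it prints evict: d (idle 70), the key that has gone unaccessed the longest among the sampled ones.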


estimateObjectIdleTime computes the idle time of a Redis object:


/* Given an object returns the min number of milliseconds the object was never
 * requested, using an approximated LRU algorithm. */
unsigned long long estimateObjectIdleTime(robj *o) {
    unsigned long long lruclock = LRU_CLOCK();
    if (lruclock >= o->lru) {
        return (lruclock - o->lru) * LRU_CLOCK_RESOLUTION;
    } else {
        return (lruclock + (LRU_CLOCK_MAX - o->lru)) *
                    LRU_CLOCK_RESOLUTION;
    }
}


In the first case lruclock is greater than or equal to the object's lru, so the idle time is simply the difference multiplied by the clock resolution. Because lruclock is a 24-bit value that wraps around (and may also be a slightly stale precomputed value), it can end up smaller than the object's lru, in which case the second branch adds LRU_CLOCK_MAX to account for the wrap.
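
As a quick sanity check of the wrap-around branch, here is a tiny stand-alone sketch (the timestamps are invented; the constants mirror LRU_BITS = 24 and LRU_CLOCK_RESOLUTION = 1000 ms from the Redis source):

#include <stdio.h>

#define LRU_BITS 24
#define LRU_CLOCK_MAX ((1 << LRU_BITS) - 1)   /* 16777215 */
#define LRU_CLOCK_RESOLUTION 1000             /* milliseconds per clock tick */

/* Same arithmetic as estimateObjectIdleTime(), without the robj plumbing. */
static unsigned long long idle_ms(unsigned int lruclock, unsigned int obj_lru) {
    if (lruclock >= obj_lru)
        return (unsigned long long)(lruclock - obj_lru) * LRU_CLOCK_RESOLUTION;
    /* The 24-bit clock wrapped around since the object was last touched. */
    return (unsigned long long)(lruclock + (LRU_CLOCK_MAX - obj_lru)) * LRU_CLOCK_RESOLUTION;
}

int main(void) {
    printf("%llu ms\n", idle_ms(5000, 4000));      /* no wrap: 1000000 ms */
    printf("%llu ms\n", idle_ms(100, 16777000));   /* wrapped: (100 + 215) * 1000 = 315000 ms */
    return 0;
}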

LFU

The LRU algorithm has been covered above, but consider the following access pattern:


~~~~~A~~~~~A~~~~~A~~~~A~~~~~A~~~~~A~~|
~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~|
~~~~~~~~~~C~~~~~~~~~C~~~~~~~~~C~~~~~~|
~~~~~D~~~~~~~~~~D~~~~~~~~~D~~~~~~~~~D|


Notice that under LRU eviction D is the most recently accessed key and is therefore considered the most worth keeping, even though it is actually a worse candidate than A or B. This is what led antirez to the LFU (Least Frequently Used) algorithm: judging by the access frequencies of the four keys above, the retention priority should clearly be B > A > C = D.


So how is this LFU algorithm implemented? As with LRU, the ideal approach would be to maintain a linked list and move the most recently accessed key to its head, but that would hurt access performance. As you may have noticed in the code above, the lru field of redisObject does double duty: when the policy is LFU the field is repurposed, and its 24 bits are split into two parts:


          16 bits      8 bits
     +----------------+--------+
     + Last decr time | LOG_C  |
     +----------------+--------+


The upper 16 bits hold the last decrement time, so Redis knows when the counter was last decayed; the lower 8 bits hold the counter itself (LOG_C). The core idea of LFU is that the more often, and the more recently, a key is accessed, the more likely it is to survive eviction. Two things therefore have to happen: the counter is incremented when the key is accessed, and it decays when the key has gone unaccessed for long enough. That is exactly why the field stores two values: the upper 16 bits record the time of the last decay and the lower 8 bits the current count.
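
A tiny stand-alone sketch of that bit layout (the values are hypothetical; the shift and mask mirror what updateLFU and LFUDecrAndReturn, shown below, do with the field):

#include <stdio.h>

int main(void) {
    unsigned int last_decr_minutes = 0x1234; /* hypothetical 16-bit minutes timestamp */
    unsigned int counter = 42;               /* hypothetical 8-bit access counter */

    /* Pack: upper 16 bits = last decrement time, lower 8 bits = counter. */
    unsigned int lru = (last_decr_minutes << 8) | counter;

    /* Unpack. */
    printf("ldt = 0x%x, counter = %u\n", lru >> 8, lru & 255);
    return 0;
}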


Redis 4.0 added two LFU modes to the maxmemory-policy setting:

volatile-lfu: apply LFU eviction only to keys that have an expire set


allkeys-lfu: apply LFU eviction to all keys


There are also two configuration options for tuning the LFU algorithm:


lfu-log-factor 10
lfu-decay-time 1


lfu-log-factor controls how fast the counter grows: the larger lfu-log-factor is, the more slowly the counter increases.


lfu-decay-time is a value in minutes that controls how fast the counter decays. An obvious question is whether 8 bits are enough for a counter; if every access simply added 1 they certainly would not be, but the increment is not that naive, as the code below shows.


/* Low level key lookup API, not actually called directly from commands
 * implementations that should instead rely on lookupKeyRead(),
 * lookupKeyWrite() and lookupKeyReadWithFlags(). */
robj *lookupKey(redisDb *db, robj *key, int flags) {
    dictEntry *de = dictFind(db->dict,key->ptr);
    if (de) {
        robj *val = dictGetVal(de);

        /* Update the access time for the ageing algorithm.
         * Don't do it if we have a saving child, as this will trigger
         * a copy on write madness. */
        if (!hasActiveChildProcess() && !(flags & LOOKUP_NOTOUCH)){
            if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
                // When the eviction policy is LFU, updateLFU is called here
                updateLFU(val);
            } else {
                val->lru = LRU_CLOCK();
            }
        }
        return val;
    } else {
        return NULL;
    }
}


updateLFU is really just an entry point that calls two important functions:


/* Update LFU when an object is accessed.
 * Firstly, decrement the counter if the decrement time is reached.
 * Then logarithmically increment the counter, and update the access time. */
void updateLFU(robj *val) {
    unsigned long counter = LFUDecrAndReturn(val);
    counter = LFULogIncr(counter);
    val->lru = (LFUGetTimeInMinutes()<<8) | counter;
}


First, LFUDecrAndReturn determines how much the counter should be decremented, based on the last decrement time and the configured lfu-decay-time:


/* If the object decrement time is reached decrement the LFU counter but
 * do not update LFU fields of the object, we update the access time
 * and counter in an explicit way when the object is really accessed.
 * And we will times halve the counter according to the times of
 * elapsed time than server.lfu_decay_time.
 * Return the object frequency counter.
 *
 * This function is used in order to scan the dataset for the best object
 * to fit: as we check for the candidate, we incrementally decrement the
 * counter of the scanned objects if needed. */
unsigned long LFUDecrAndReturn(robj *o) {
    // Shift right by 8 bits to get the last decrement time
    unsigned long ldt = o->lru >> 8;
    // AND with 255 to get the counter value
    unsigned long counter = o->lru & 255;
    // Compute how many decay periods have elapsed, based on lfu_decay_time
    unsigned long num_periods = server.lfu_decay_time ? LFUTimeElapsed(ldt) / server.lfu_decay_time : 0;
    if (num_periods)
        counter = (num_periods > counter) ? 0 : counter - num_periods;
    return counter;
}


Then comes the increment, via LFULogIncr:


/* Logarithmically increment a counter. The greater is the current counter value
 * the less likely is that it gets really implemented. Saturate it at 255. */
uint8_t LFULogIncr(uint8_t counter) {
    // 255 is the maximum; once it is reached the counter stops growing
    if (counter == 255) return 255;
    // Generate a random number in [0, 1]
    double r = (double)rand()/RAND_MAX;
    // Subtract the base value LFU_INIT_VAL (5), which protects
    // newly added keys from being evicted right away
    double baseval = counter - LFU_INIT_VAL;
    if (baseval < 0) baseval = 0;
    // If baseval is 0, p is 1 and the counter is always incremented.
    // Otherwise p shrinks as baseval and lfu_log_factor grow, so the
    // increment becomes less and less likely -- which answers the earlier
    // question: the counter is not simply incremented by 1 on every access
    double p = 1.0/(baseval*server.lfu_log_factor+1);
    if (r < p) counter++;
    return counter;
}


The approximate growth rate can be seen in the table below:


+--------+------------+------------+------------+------------+------------+
| factor | 100 hits   | 1000 hits  | 100K hits  | 1M hits    | 10M hits   |
+--------+------------+------------+------------+------------+------------+
| 0      | 104        | 255        | 255        | 255        | 255        |
+--------+------------+------------+------------+------------+------------+
| 1      | 18         | 49         | 255        | 255        | 255        |
+--------+------------+------------+------------+------------+------------+
| 10     | 10         | 18         | 142        | 255        | 255        |
+--------+------------+------------+------------+------------+------------+
| 100    | 8          | 11         | 49         | 143        | 255        |
+--------+------------+------------+------------+------------+------------+


In short, the larger lfu_log_factor is, the more slowly the counter changes.
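
The table above comes from the Redis documentation; the trend is easy to reproduce with a small stand-alone simulation of the probabilistic increment (a sketch, not the Redis code, assuming the counter starts at LFU_INIT_VAL = 5 as it does for a newly created key):

#include <stdio.h>
#include <stdlib.h>

#define LFU_INIT_VAL 5

/* Same probabilistic increment as LFULogIncr(), parameterized by the factor. */
static unsigned char log_incr(unsigned char counter, int factor) {
    if (counter == 255) return 255;
    double r = (double)rand() / RAND_MAX;
    double baseval = counter - LFU_INIT_VAL;
    if (baseval < 0) baseval = 0;
    double p = 1.0 / (baseval * factor + 1);
    if (r < p) counter++;
    return counter;
}

int main(void) {
    int  factors[] = {0, 1, 10, 100};
    long hits[]    = {100, 1000, 100000, 1000000};
    for (int f = 0; f < 4; f++) {
        printf("factor %3d:", factors[f]);
        for (int h = 0; h < 4; h++) {
            unsigned char c = LFU_INIT_VAL;
            for (long i = 0; i < hits[h]; i++) c = log_incr(c, factors[f]);
            printf("  %7ld hits -> %3d", hits[h], c);
        }
        printf("\n");
    }
    return 0;
}

With factor 0 the counter saturates at 255 almost immediately, while with factor 100 even a million hits leave it well below the cap, matching the shape of the table.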

Summary

To sum up, Redis implements an approximate LRU eviction policy: by adding a pool of eviction candidates and increasing the number of keys sampled each time, the behavior gets noticeably closer to true LRU. Still, as the earlier example shows, LRU does not always evict the keys we would expect, so the LFU policy was introduced later. LFU is rather clever: it reuses the lru field of the Redis object and uses the lfu-log-factor parameter to control how fast the counter grows, preventing the 8-bit counter from overflowing too early.


This article is published under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are welcome to republish or adapt it, as long as the source is credited.

Author: Nicksxs

Created: 2020-04-18

Link to this post: redis精讲系列介绍八-淘汰策略
