Analysis of the Patroni-Based High-Availability Startup Flow for PostgreSQL
What is Patroni
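In many production environments, distributed databases are widely used for their high availability, data distribution and load balancing. Patroni is a high-availability solution designed specifically for PostgreSQL: a template for HA architectures, implemented in Python. It relies on an external shared store (Kubernetes, etcd, etcd3, ZooKeeper, Consul and others) to give a PostgreSQL cluster automatic failure recovery, automatic failover, automatic backup and similar capabilities.
Main features:
1. Automatic failure detection and recovery: Patroni monitors the health of the PostgreSQL cluster. Once it detects that the primary has failed, it automatically performs recovery and promotes one of the replicas to primary.
2. Automatic failover: once Patroni has designated a new primary, it coordinates all replicas and clients so that they switch to the new primary correctly, which gives a fast, seamless failover.
3. Consistency and data integrity: Patroni pays close attention to data consistency and integrity. During a failover it makes sure that data is not lost or damaged before the new primary takes over.
4. External shared configuration store: Patroni uses an external key-value store (such as ZooKeeper, etcd or Consul) to hold the configuration and the cluster state. This keeps the configuration consistent and accessible, and lets multiple Patroni instances cooperate.
5. Support for cloud environments and physical hardware: Patroni can run in cloud environments and can also be deployed on physical hardware, so it offers a wide range of deployment options.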
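Patroni architecture
●DCS (Distributed Configuration Store): the place where the distributed configuration is stored. Supported backends include Kubernetes, etcd, etcd3, ZooKeeper and Consul; Patroni reads and writes the distributed configuration there.
●Patroni core: writes the distributed configuration into the DCS, sets the role and the PostgreSQL configuration of each PostgreSQL node, and manages the PostgreSQL life cycle.
●PostgreSQL nodes: each node follows the PostgreSQL configuration set by Patroni, takes its place in the primary/replica chain and synchronizes data through streaming replication. Together the nodes form a PostgreSQL cluster.
Patroni high-availability source code analysis
Patroni high-availability startup flow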
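Flow description:
●Load cluster information: fetch the cluster information through the API of the configured DCS. It mainly consists of:
○config: the PostgreSQL cluster ID and configuration (PostgreSQL parameters, timeout settings and so on), used for cluster validation, node rebuilds, etc.;
○leader: the time the primary was elected, its heartbeat time, the election period, the latest LSN, etc., recorded once a node has won the leader race;
○sync: the names of the primary and of the synchronous standbys, written by the primary and used to validate the synchronous node during switchover and failover;
○failover: the time of the last failover.
●Cluster state checks: validate the cluster configuration and inspect the overall cluster state and the node state, to decide how PostgreSQL should be started;
●Start PostgreSQL: initialize the PostgreSQL data directory, apply the PostgreSQL configuration derived from the cluster information, and start the server;
●Form the PostgreSQL cluster: assign primary and replica roles to the started PostgreSQL nodes and link the nodes of different roles into a complete cluster.
Patroni high-availability startup flow analysis
Load cluster information
Loading the cluster information is the first step of the high-availability flow, and it provides the information that matters most for forming the PostgreSQL cluster.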
Step 1: load the cluster information
    ......
    try:
        self.load_cluster_from_dcs()
        self.state_handler.reset_cluster_info_state(self.cluster, self.patroni.nofailover)
    except Exception:
        self.state_handler.reset_cluster_info_state(None, self.patroni.nofailover)
        raise
    ......
Load the cluster information through the DCS interface
    def load_cluster_from_dcs(self):
        cluster = self.dcs.get_cluster()
Cluster interface
    def get_cluster(self, force=False):
        if force:
            self._bypass_caches()
        try:
            cluster = self._load_cluster()
        except Exception:
            self.reset_cluster()
            raise

    @abc.abstractmethod
    def _load_cluster(self):
        """Internally this method should build Cluster object which
        represents current state and topology of the cluster in DCS.
        this method supposed to be called only by get_cluster method.
        """
Taking Kubernetes as the DCS as an example
    def _load_cluster(self):
        stop_time = time.time() + self._retry.deadline
        self._api.refresh_api_servers_cache()
        try:
            with self._condition:
                self._wait_caches(stop_time)
        ......
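The split between get_cluster and _load_cluster is a template-method pattern: the base class owns the error handling, and each backend only supplies the loading step. The following is a minimal sketch of that idea, not Patroni's implementation. The class names SketchDCS, InMemoryDCS and DCSError, and the dict-based cluster view, are assumptions made for illustration.

```python
import abc


class DCSError(Exception):
    """Hypothetical error type for DCS communication problems."""


class SketchDCS(abc.ABC):
    """Simplified stand-in for a DCS client; not Patroni's real class."""

    def __init__(self):
        self._cluster = None  # last successfully loaded cluster view

    @abc.abstractmethod
    def _load_cluster(self):
        """Build and return the current cluster view from the backend."""

    def reset_cluster(self):
        # Drop the cached view so callers never act on stale data.
        self._cluster = None

    def get_cluster(self):
        try:
            cluster = self._load_cluster()
        except Exception:
            self.reset_cluster()
            raise
        self._cluster = cluster
        return cluster


class InMemoryDCS(SketchDCS):
    """Hypothetical backend that serves the four keys from a plain dict."""

    def __init__(self, store):
        super().__init__()
        self._store = store

    def _load_cluster(self):
        if self._store is None:
            raise DCSError("backend unreachable")
        keys = ("config", "leader", "sync", "failover")
        return {key: self._store.get(key) for key in keys}


dcs = InMemoryDCS({"leader": "pg142-1013-postgresql-0"})
print(dcs.get_cluster()["leader"])
```

Keeping the cache reset in the base class means every backend gets the same behaviour on failure: the caller sees the exception and no stale cluster view is left behind.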
The cluster information is mainly held in four objects, xxx-config, xxx-leader, xxx-failover and xxx-sync, whose contents are as follows:
●xxx-config
    % kubectl get cm pg142-1013-postgresql-config -oyaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      annotations:
        config: '{"loop_wait":10,"maximum_lag_on_failover":33554432,"postgresql":{"parameters":{"archive_command":"/bin/true","archive_mode":"on","archive_timeout":"1800s","autovacuum":"on","autovacuum_analyze_scale_factor":0.02,"autovacuum_max_workers":"3","autovacuum_naptime":"5min","autovacuum_vacuum_cost_delay":"2ms","autovacuum_vacuum_cost_limit":"-1","autovacuum_vacuum_scale_factor":0.05,"autovacuum_work_mem":"128MB","backend_flush_after":"0","bgwriter_delay":"200ms","bgwriter_flush_after":"256","bgwriter_lru_maxpages":"100","bgwriter_lru_multiplier":"2","checkpoint_completion_target":"0.9","checkpoint_flush_after":"256kB","checkpoint_timeout":"5min","commit_delay":"0","constraint_exclusion":"partition","datestyle":"iso,mdy","deadlock_timeout":"1s","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"32768","fsync":"on","full_page_writes":"on","hot_standby":"on","hot_standby_feedback":"off","huge_pages":"off","idle_in_transaction_session_timeout":"600000","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","listen_addresses":"*","log_autovacuum_min_duration":"0","log_checkpoints":"on","log_connections":"off","log_disconnections":"off","log_error_verbosity":"default","log_line_prefix":"%t[%p]: [%l-1] %c %x %d %u %a %h","log_lock_waits":"on","log_min_duration_statement":"500","log_rotation_size":"0","log_statement":"none","log_temp_files":0,"log_timezone":"Asia/Shanghai","maintenance_work_mem":"32768","max_connections":"170","max_parallel_maintenance_workers":"2","max_parallel_workers":"2","max_parallel_workers_per_gather":"2","max_replication_slots":"10","max_standby_archive_delay":"30s","max_standby_streaming_delay":"30s","max_wal_senders":"10","max_wal_size":"2048","max_worker_processes":"8","old_snapshot_threshold":"-1","pg_stat_statements.max":"10000","pg_stat_statements.save":"on","pg_stat_statements.track":"all","pgaudit.log":"NONE","pgaudit.log_catalog":"on","pgaudit.log_client":"off","pgaudit.log_level":"log","pgaudit.log_parameter":"off","pgaudit.log_relation":"off","pgaudit.log_rows":"off","pgaudit.log_statement":"on","pgaudit.log_statement_once":"off","pgaudit.role":"","random_page_cost":"4","restart_after_crash":"on","synchronous_commit":"on","tcp_keepalives_count":"0","tcp_keepalives_idle":"900","tcp_keepalives_interval":"100","temp_buffers":"8MB","timezone":"Asia/Shanghai","track_activity_query_size":"1kB","track_functions":"all","track_io_timing":"off","unix_socket_directories":"/var/run/postgresql","vacuum_cost_delay":"0ms","vacuum_cost_limit":"200","wal_buffers":"2048","wal_compression":"on","wal_keep_segments":"128","wal_keep_size":"2048MB","wal_level":"replica","wal_log_hints":"on","wal_receiver_status_interval":"10s","wal_sender_timeout":"1min","wal_writer_delay":"200ms","wal_writer_flush_after":"1MB","work_mem":"4MB"},"use_pg_rewind":true,"use_slots":true},"retry_timeout":10,"synchronous_mode":true,"ttl":30}'
        initialize: "7289263672843878470"
      creationTimestamp: "2023-10-13T02:25:51Z"
      labels:
        application: spilo
        cluster-name: pg142-1013-postgresql
      name: pg142-1013-postgresql-config
      namespace: default
      resourceVersion: "22858249"
      uid: dfa64d28-e939-4bdd-8db1-a3485fa09637
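In the example above, annotations holds two parameters, config and initialize:
1. config defines the overall cluster configuration, which includes the PostgreSQL parameters and the cluster-level parameters (election wait time, maximum WAL lag allowed, whether synchronous mode is enabled, and so on);
2. initialize defines the cluster ID. This value matches the Database system identifier reported by pg_controldata, so every PostgreSQL node in the cluster has the same sys_id.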
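Because the config annotation is a JSON string, the cluster-level settings and the PostgreSQL parameters can be separated with a few lines of code. The sketch below works on a shortened copy of the annotation shown above; the helper name summarize_config is an assumption for illustration and is not part of Patroni.

```python
import json

# Shortened copy of the `config` annotation; the real one carries many more parameters.
config_annotation = (
    '{"loop_wait":10,"maximum_lag_on_failover":33554432,'
    '"postgresql":{"parameters":{"max_connections":"170","wal_level":"replica"},'
    '"use_pg_rewind":true,"use_slots":true},'
    '"retry_timeout":10,"synchronous_mode":true,"ttl":30}'
)


def summarize_config(raw):
    """Split the annotation into cluster-level settings and PostgreSQL parameters."""
    config = json.loads(raw)
    postgresql = config.get("postgresql", {})
    cluster_settings = {k: v for k, v in config.items() if k != "postgresql"}
    return cluster_settings, postgresql.get("parameters", {})


cluster_settings, pg_parameters = summarize_config(config_annotation)
print(cluster_settings)
print(pg_parameters)
```

In this example maximum_lag_on_failover is 33554432, which is 32 MiB when read as a byte count.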
The system identifier can be confirmed on a node with pg_controldata:
    root@pg142-1013-postgresql-1:/home/postgres# pg_controldata | grep "Database system identifier"
    Database system identifier: 7289263672843878470
●xxx-leader
    % kubectl get cm pg142-1013-postgresql-leader -oyaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      annotations:
        acquireTime: "2023-10-13T02:26:06.973552+00:00"
        leader: pg142-1013-postgresql-0
        optime: "67109192"
        renewTime: "2023-10-16T07:02:57.418940+00:00"
        transitions: "0"
        ttl: "30"
      creationTimestamp: "2023-10-13T02:26:07Z"
      labels:
        application: spilo
        cluster-name: pg142-1013-postgresql
      name: pg142-1013-postgresql-leader
      namespace: default
      resourceVersion: "23286847"
      uid: cb235c85-6a21-454d-8320-222205eaa77f
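The parameters under annotations above have the following meaning:
1. acquireTime: the time the cluster leader lock was acquired;
2. leader: the owner of the cluster leader lock, here the name of a PostgreSQL node;
3. optime: the latest LSN of the cluster leader as a decimal number; here 67109192, which is 0/4000148 in LSN notation;
4. renewTime: the heartbeat time of the leader lock owner; the heartbeat period corresponds to loop_wait in xxx-config;
5. transitions: the number of times the leader lock has changed hands, incremented on each switchover or failover;
6. ttl: the election timeout before a failover. If renewTime has not been updated within the TTL, an election is triggered and a new node takes the leader lock.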
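Two of these fields are easy to check by hand. The sketch below converts the decimal optime into the usual X/Y LSN notation and evaluates the TTL rule against renewTime. The function names format_lsn and lock_expired are assumptions for illustration, and the expiry rule is a simplification of the description above rather than Patroni's actual code path.

```python
from datetime import datetime, timedelta, timezone


def format_lsn(optime):
    """Render a decimal WAL position in PostgreSQL's X/Y hexadecimal notation."""
    return "{0:X}/{1:X}".format(optime >> 32, optime & 0xFFFFFFFF)


def lock_expired(renew_time, ttl, now):
    """True when the leader has not renewed the lock within ttl seconds."""
    renewed = datetime.fromisoformat(renew_time)
    return now - renewed > timedelta(seconds=ttl)


print(format_lsn(67109192))  # 0/4000148

renew_time = "2023-10-16T07:02:57.418940+00:00"
now = datetime(2023, 10, 16, 7, 3, 40, tzinfo=timezone.utc)
print(lock_expired(renew_time, 30, now))  # True: more than 30 s without a heartbeat
```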
●xxx-sync
    % kubectl get cm pg142-1013-postgresql-sync -oyaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      annotations:
        leader: pg142-1013-postgresql-1
        sync_standby: pg142-1013-postgresql-0
      creationTimestamp: "2023-10-16T06:54:39Z"
      labels:
        application: spilo
        cluster-name: pg142-1013-postgresql
      name: pg142-1013-postgresql-sync
      namespace: default
      resourceVersion: "23288352"
      uid: 1c46e63b-8b90-4fc6-9596-8e2f71fba2ab
The content above records two pieces of information:
1. leader: the name of the leader node;
2. sync_standby: the names of the synchronous standbys; multiple names are separated by commas.
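Since sync_standby is a comma-separated string, a consumer has to split it before checking whether a node may be promoted in synchronous mode. The following is a small sketch under that assumption. The helper names parse_sync_annotations and is_sync_candidate are hypothetical, and the candidate rule (the recorded leader or one of the recorded synchronous standbys) is an illustrative reading of how xxx-sync is used for validation.

```python
def parse_sync_annotations(annotations):
    """Return (leader, [sync standby names]) from the xxx-sync annotations."""
    leader = annotations.get("leader")
    raw = annotations.get("sync_standby") or ""
    standbys = [name.strip() for name in raw.split(",") if name.strip()]
    return leader, standbys


def is_sync_candidate(name, leader, standbys):
    """Illustrative rule: only the recorded leader or a recorded sync standby qualifies."""
    return name == leader or name in standbys


annotations = {
    "leader": "pg142-1013-postgresql-1",
    "sync_standby": "pg142-1013-postgresql-0",
}
leader, standbys = parse_sync_annotations(annotations)
print(leader, standbys)
print(is_sync_candidate("pg142-1013-postgresql-0", leader, standbys))
```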
●xxx-failover
    % kubectl get cm pg142-1013-postgresql-failover -oyaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      creationTimestamp: "2023-10-16T07:16:03Z"
      labels:
        application: spilo
        cluster-name: pg142-1013-postgresql
      managedFields:
      - apiVersion: v1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:labels:
              .: {}
              f:application: {}
              f:cluster-name: {}
        manager: Patroni
        operation: Update
        time: "2023-10-16T07:36:56Z"
      name: pg142-1013-postgresql-failover
      namespace: default
      resourceVersion: "23290596"
      uid: 72d50c58-bc65-4b77-8870-93d0b8f8b7a2
The content above mainly records the time of the last failover.
Cluster state checks
    if self.is_paused():
        self.watchdog.disable()
        self._was_paused = True
    else:
        if self._was_paused:
            self.state_handler.schedule_sanity_checks_after_pause()
        self._was_paused = False

    if not self.cluster.has_member(self.state_handler.name):
        self.touch_member()

    # cluster has leader key but not initialize key
    if not (self.cluster.is_unlocked() or self.sysid_valid(self.cluster.initialize)) and self.has_lock():
        self.dcs.initialize(create_new=(self.cluster.initialize is None), sysid=self.state_handler.sysid)

    if not (self.cluster.is_unlocked() or self.cluster.config and self.cluster.config.data) and self.has_lock():
        self.dcs.set_config_value(json.dumps(self.patroni.config.dynamic_configuration, separators=(',', ':')))
        self.cluster = self.dcs.get_cluster()

    if self._async_executor.busy:
        return self.handle_long_action_in_progress()

    msg = self.handle_starting_instance()
    if msg is not None:
        return msg

    # we've got here, so any async action has finished.
    if self.state_handler.bootstrapping:
        return self.post_bootstrap()

    if self.recovering:
        self.recovering = False
Check whether the cluster is paused
Pausing the cluster means that the PostgreSQL nodes in the cluster are no longer managed by Patroni: when the cluster becomes abnormal, failover and similar measures are no longer triggered. A pause is normally initiated by the user, for example to do maintenance on a single PostgreSQL node. It is triggered as follows:
    root@pg142-1013-postgresql-0:/home/postgres# patronictl list
    + Cluster: pg142-1013-postgresql (7289263672843878470) ---+---------+----+-----------+
    | Member                  | Host           | Role         | State   | TL | Lag in MB |
    +-------------------------+----------------+--------------+---------+----+-----------+
    | pg142-1013-postgresql-0 | 10.244.117.143 | Leader       | running |  3 |           |
    | pg142-1013-postgresql-1 | 10.244.165.220 | Sync Standby | running |  3 |         0 |
    +-------------------------+----------------+--------------+---------+----+-----------+
    root@pg142-1013-postgresql-0:/home/postgres# patronictl pause
    Success: cluster management is paused
    root@pg142-1013-postgresql-0:/home/postgres# patronictl list
    + Cluster: pg142-1013-postgresql (7289263672843878470) ---+---------+----+-----------+
    | Member                  | Host           | Role         | State   | TL | Lag in MB |
    +-------------------------+----------------+--------------+---------+----+-----------+
    | pg142-1013-postgresql-0 | 10.244.117.143 | Leader       | running |  3 |           |
    | pg142-1013-postgresql-1 | 10.244.165.220 | Sync Standby | running |  3 |         0 |
    +-------------------------+----------------+--------------+---------+----+-----------+
     Maintenance mode: on
The line Maintenance mode: on indicates that the cluster is now paused. The PostgreSQL processes are still alive at this point, but if one of them fails the user has to start it manually. To resume the cluster:
    root@pg142-1013-postgresql-0:/home/postgres# patronictl list
    + Cluster: pg142-1013-postgresql (7289263672843878470) ---+---------+----+-----------+
    | Member                  | Host           | Role         | State   | TL | Lag in MB |
    +-------------------------+----------------+--------------+---------+----+-----------+
    | pg142-1013-postgresql-0 | 10.244.117.143 | Leader       | running |  3 |           |
    | pg142-1013-postgresql-1 | 10.244.165.220 | Sync Standby | running |  3 |         0 |
    +-------------------------+----------------+--------------+---------+----+-----------+
     Maintenance mode: on
    root@pg142-1013-postgresql-0:/home/postgres# patronictl resume
    Success: cluster management is resumed
    root@pg142-1013-postgresql-0:/home/postgres# patronictl list
    + Cluster: pg142-1013-postgresql (7289263672843878470) ---+---------+----+-----------+
    | Member                  | Host           | Role         | State   | TL | Lag in MB |
    +-------------------------+----------------+--------------+---------+----+-----------+
    | pg142-1013-postgresql-0 | 10.244.117.143 | Leader       | running |  3 |           |
    | pg142-1013-postgresql-1 | 10.244.165.220 | Sync Standby | running |  3 |         0 |
    +-------------------------+----------------+--------------+---------+----+-----------+
The patronictl resume command resumes the cluster. After the resume, the PostgreSQL nodes in the cluster need the following handling:
1. Reconfigure the PostgreSQL parameters;
2. Based on the primary and synchronous standby names last recorded in xxx-sync, set up the synchronous replication slot information on the primary;
3. Check whether the sysid of the resumed PostgreSQL node has changed, that is, whether it still matches the initialize value last recorded in xxx-config. If it does not, the cluster cannot be resumed.
Cluster initialization check
    # cluster has leader key but not initialize key
    if not (self.cluster.is_unlocked() or self.sysid_valid(self.cluster.initialize)) and self.has_lock():
        self.dcs.initialize(create_new=(self.cluster.initialize is None), sysid=self.state_handler.sysid)

    if not (self.cluster.is_unlocked() or self.cluster.config and self.cluster.config.data) and self.has_lock():
        self.dcs.set_config_value(json.dumps(self.patroni.config.dynamic_configuration, separators=(',', ':')))
        self.cluster = self.dcs.get_cluster()
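The cluster initialization check covers two cases:
●The cluster has a leader, but initialize is missing from xxx-config. In this case the sysid of PostgreSQL on the leader node is written into xxx-config;
●The cluster has a leader, but no xxx-config information could be retrieved. In this case both the configuration and the sysid of the leader node are written into xxx-config, and the cluster information is loaded again.
The purpose of this step is to guard against the xxx-config object being deleted, which would make the replicas fail to load the cluster information.
Node state checks
Check which startup stage the current PostgreSQL process has reached
    if self._async_executor.busy:
        return self.handle_long_action_in_progress()

    msg = self.handle_starting_instance()
    if msg is not None:
        return msg
Node state checks inspect the current running state of the PostgreSQL node to decide whether a specific action is required. There are two kinds of check:
1. By the running state of PostgreSQL;
2. By the asynchronous executor (_async_executor), which tracks which stage the current node is in.
Basic operations after the node checks pass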
    # we've got here, so any async action has finished.
    if self.state_handler.bootstrapping:
        return self.post_bootstrap()

    if self.recovering:
        self.recovering = False
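Once the node state checks pass, PostgreSQL needs the following handling:
1. Post-startup operations for PostgreSQL
    def post_bootstrap(self):
        with self._async_response:
            result = self._async_response.result
        # bootstrap has failed if postgres is not running
        if not self.state_handler.is_running() or result is False:
            self.cancel_initialization()
These operations include the checkpoint check after pg_rewind, initializing the xxx-config resource in the DCS, creating the xxx-leader resource, loading the cluster information, and so on.
2. For a PostgreSQL instance in recovery, check whether pg_rewind needs to be run
    if self.recovering:
        self.recovering = False
pg_rewind brings a replica's WAL back in line with the primary's WAL. It is typically needed when the replica's WAL no longer matches the primary's after a failure, for example when a former primary rejoins the cluster after a failover.
Start PostgreSQL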
    # is data directory empty?
    if self.state_handler.data_directory_empty():
        self.state_handler.set_role('uninitialized')
        self.state_handler.stop('immediate', stop_timeout=self.patroni.config['retry_timeout'])
        # In case datadir went away while we were master.
        self.watchdog.disable()
    else:
        # check if we are allowed to join
        data_sysid = self.state_handler.sysid
        if not self.sysid_valid(data_sysid):
            # data directory is not empty, but no valid sysid, cluster must be broken, suggest reinit
            return ("data dir for the cluster is not empty, "
                    "but system ID is invalid; consider doing reinitialize")
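Start without a data directory
Starting without a data directory is the flow triggered when initializing the data directory failed, recovering the node failed, aligning the WAL failed, and in similar situations:
1. Set the role to uninitialized, so that the cluster can be reinitialized later;
2. Stop the current PostgreSQL process immediately;
3. Determine whether the current node is the primary, and voluntarily release the leader lock;
4. Run the bootstrap.
    def bootstrap(self):
        if not self.cluster.is_unlocked():  # cluster already has leader
            clone_member = self.cluster.get_clone_member(self.state_handler.name)
            member_role = 'leader' if clone_member == self.cluster.leader else 'replica'
            msg = "from {0} '{1}'".format(member_role, clone_member.name)
            ret = self._async_executor.try_run_async('bootstrap {0}'.format(msg), self.clone, args=(clone_member, msg))
            return ret or 'trying to bootstrap {0}'.format(msg)

        # no initialize key and node is allowed to be master and has 'bootstrap' section in a configuration file
        elif self.cluster.initialize is None and not self.patroni.nofailover and 'bootstrap' in self.patroni.config:
            if self.dcs.initialize(create_new=True):  # race for initialization
                self.state_handler.bootstrapping = True
                with self._async_response:
                    self._async_response.reset()
            ......
        else:
            create_replica_methods = self.get_standby_cluster_config().get('create_replica_methods', []) \
                if self.is_standby_cluster() else None
            if self.state_handler.can_create_replica_without_replication_connection(create_replica_methods):
                msg = 'bootstrap (without leader)'
                return self._async_executor.try_run_async(msg, self.clone) or 'trying to ' + msg
            return 'waiting for {0}leader to bootstrap'.format('standby ' if self.is_standby_cluster() else '')
The code above shows the possible ways to start:
1. The cluster already has a leader: the current PostgreSQL starts as a replica and synchronizes its data from the primary;
2. The cluster has no leader: the current PostgreSQL starts as the primary; in a standby cluster it starts as the leader of the standby cluster;
3. The cluster is a standby cluster without a leader: the replica is created through create_replica_methods, generally by streaming the data from the primary cluster over the replication protocol.
Start with a data directory
Starting with a data directory mainly verifies that the cluster ID and the sysid of the PostgreSQL node are consistent. The main steps are:
1. Check that the sysid of the PostgreSQL node is valid. If it is not, PostgreSQL is in a broken state and the node needs to be reinitialized;
2. Check that the cluster ID matches the sysid of the PostgreSQL node. If they differ, the node cannot join the cluster; if the cluster is paused, the node releases the leader lock it holds;
3. If the cluster has no leader, the current node re-initializes the cluster and starts with its own sysid as the new cluster ID.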
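The three checks for a non-empty data directory can be read as a small decision function. The sketch below is only an illustration of the steps listed above, not Patroni's code. The names sysid_looks_valid and decide_start, the digit-only validity rule and the returned strings are all assumptions.

```python
def sysid_looks_valid(sysid):
    """Hypothetical validity rule: a non-empty string of digits."""
    return bool(sysid) and str(sysid).isdigit()


def decide_start(data_sysid, cluster_id, cluster_has_leader, paused=False):
    """Map the sysid checks onto an outcome for a node whose data directory exists."""
    # Step 1: the local data directory must report a usable system identifier.
    if not sysid_looks_valid(data_sysid):
        return "reinitialize"
    # Step 2: a known cluster ID must match the local system identifier.
    if sysid_looks_valid(cluster_id):
        if cluster_id != data_sysid:
            return "release leader lock" if paused else "refuse to join"
        return "join"
    # Step 3: no cluster ID and no leader, so this node initializes the cluster.
    if not cluster_has_leader:
        return "initialize cluster with own sysid"
    return "join"


print(decide_start("7289263672843878470", "7289263672843878470", True))
print(decide_start("7289263672843878470", None, False))
```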
Form the PostgreSQL cluster
    try:
        if self.cluster.is_unlocked():
            ret = self.process_unhealthy_cluster()
        else:
            msg = self.process_healthy_cluster()
            ret = self.evaluate_scheduled_restart() or msg
    finally:
        # we might not have a valid PostgreSQL connection here if another thread
        # stops PostgreSQL, therefore, we only reload replication slots if no
        # asynchronous processes are running (should be always the case for the master)
        if not self._async_executor.busy and not self.state_handler.is_starting():
            create_slots = self.state_handler.slots_handler.sync_replication_slots(self.cluster,
                                                                                   self.patroni.nofailover)
            if not self.state_handler.cb_called:
                if not self.state_handler.is_leader():
                    self._rewind.trigger_check_diverged_lsn()
                self.state_handler.call_nowait(ACTION_ON_START)
            if create_slots and self.cluster.leader:
                err = self._async_executor.try_run_async('copy_logical_slots',
                                                         self.state_handler.slots_handler.copy_logical_slots,
                                                         args=(self.cluster.leader, create_slots))
                if not err:
                    ret = 'Copying logical slots {0} from the primary'.format(create_slots)
Forming the PostgreSQL cluster depends mainly on whether the cluster currently has a primary, which decides between the healthy cluster flow and the unhealthy cluster flow.
Unhealthy cluster flow
    def process_unhealthy_cluster(self):
        """Cluster has no leader key"""

        if self.is_healthiest_node():
            if self.acquire_lock():
                failover = self.cluster.failover
                if failover:
                    if self.is_paused() and failover.leader and failover.candidate:
                        logger.info('Updating failover key after acquiring leader lock...')
                        self.dcs.manual_failover('', failover.candidate, failover.scheduled_at, failover.index)
                    else:
                        logger.info('Cleaning up failover key after acquiring leader lock...')
                        self.dcs.manual_failover('', '')
                self.load_cluster_from_dcs()
            ......
        else:
            # when we are doing manual failover there is no guaranty that new leader is ahead of any other node
            # node tagged as nofailover can be ahead of the new leader either, but it is always excluded from elections
            if bool(self.cluster.failover) or self.patroni.nofailover:
                self._rewind.trigger_check_diverged_lsn()
                time.sleep(2)  # Give a time to somebody to take the leader lock
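The unhealthy cluster flow determines the candidate for the leader role. The first requirement is to find a healthy node. A node counts as healthy when the following conditions hold:
1. The PostgreSQL cluster is not paused;
2. The PostgreSQL node is not in the starting state;
3. The PostgreSQL node is allowed to fail over;
4. The lag between the WAL of the PostgreSQL node and the value cached for the cluster (the LSN last synchronized by the primary) is within the permitted range.
    def is_healthiest_node(self):
        if time.time() - self._released_leader_key_timestamp < self.dcs.ttl:
            logger.info('backoff: skip leader race after pre_promote script failure and releasing the lock voluntarily')
            return False
    ......
    def _is_healthiest_node(self, members, check_replication_lag=True):
        """This method tries to determine whether I am healthy enough to became a new leader candidate or not."""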
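The four health conditions can be expressed as one predicate. The following is a minimal sketch under stated assumptions, not the body of _is_healthiest_node: the NodeState fields and the function is_healthy_candidate are hypothetical, and the lag limit is taken from maximum_lag_on_failover in the xxx-config example (33554432).

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical snapshot of what a node knows when the leader key is missing."""
    cluster_paused: bool
    starting: bool
    nofailover: bool
    wal_position: int        # this node's current WAL position, in bytes
    cached_leader_lsn: int   # LSN last published by the primary, in bytes


def is_healthy_candidate(node, maximum_lag_on_failover):
    """Apply the four conditions in order; any failed condition disqualifies the node."""
    if node.cluster_paused:
        return False
    if node.starting:
        return False
    if node.nofailover:
        return False
    # A node that is ahead of the cached value has no lag.
    lag = max(node.cached_leader_lsn - node.wal_position, 0)
    return lag <= maximum_lag_on_failover


node = NodeState(False, False, False, wal_position=67109192, cached_leader_lsn=67109192 + 1024)
print(is_healthy_candidate(node, 33554432))
```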
If the current node is healthy, it has to race for the leader lock, because the cluster has no primary. If it fails to acquire the leader lock, it joins the cluster as a replica. If the current node is not healthy, it keeps waiting until the PostgreSQL node is healthy again and then takes part in the election.
Healthy cluster flow
    def process_healthy_cluster(self):
        if self.has_lock():
            if self.is_paused() and not self.state_handler.is_leader():
                if self.cluster.failover and self.cluster.failover.candidate == self.state_handler.name:
                    return 'waiting to become master after promote...'
            ......
        else:
            logger.debug('does not have lock')
        lock_owner = self.cluster.leader and self.cluster.leader.name
        if self.is_standby_cluster():
            return self.follow('cannot be a real primary in a standby cluster',
                               'no action. I am ({0}), a secondary, and following a standby leader ({1})'.format(
                                   self.state_handler.name, lock_owner), refresh=False)
        return self.follow('demoting self because I do not have the lock and I was a leader',
                           'no action. I am ({0}), a secondary, and following a leader ({1})'.format(
                               self.state_handler.name, lock_owner), refresh=False)
The healthy cluster flow applies when the cluster has a leader. It has two branches:
1. The current node is the primary: it renews the leader lock to keep the primary heartbeat alive and to prevent replicas from racing for the lock. If renewing the lock fails, it releases the lock immediately so that another replica can take it;
2. The current node is not the primary: it joins the cluster as a replica.
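The two branches can be sketched with an in-memory lock. This is an illustration of the described behaviour only: SketchLeaderLock, healthy_cluster_step and the returned messages are assumptions, and the real flow also handles the paused and standby-cluster cases shown in the excerpt above.

```python
class SketchLeaderLock:
    """Hypothetical in-memory stand-in for the leader key in the DCS."""

    def __init__(self, owner):
        self.owner = owner
        self.renewals = 0
        self.reachable = True  # set to False to simulate a failed DCS update

    def update(self, name):
        # Renewal succeeds only for the current owner while the store is reachable.
        if not self.reachable or self.owner != name:
            return False
        self.renewals += 1
        return True

    def release(self, name):
        if self.owner == name:
            self.owner = None


def healthy_cluster_step(name, lock):
    """One cycle of the healthy cluster flow for the node called `name`."""
    if lock.owner == name:
        if lock.update(name):
            return "leader: renewed the lock"
        lock.release(name)
        return "leader: failed to renew, released the lock"
    return "replica: following {0}".format(lock.owner)


lock = SketchLeaderLock("pg142-1013-postgresql-0")
print(healthy_cluster_step("pg142-1013-postgresql-0", lock))
print(healthy_cluster_step("pg142-1013-postgresql-1", lock))
```

Releasing the lock as soon as a renewal fails keeps the window without a primary short, since the other nodes do not have to wait for the TTL to run out.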
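Summary
To sum up, Patroni is a high-availability (HA) management tool for PostgreSQL database clusters, designed to keep the database system continuously available in the face of node failures, maintenance operations and similar challenges. Its key functions and features make it a strong high-availability solution. In many scenarios Patroni keeps a PostgreSQL cluster running smoothly and, when the cluster becomes abnormal, uses automatic failover, data synchronization and backup strategies to keep the cluster stable and available, so that applications can keep accessing their data without interruption even during a node failure or maintenance.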
References
Patroni configuration parameters: https://patroni.readthedocs.io/en/latest/patroni_configuration.html
Patroni source code, based on the 2.1.5 branch: https://github.com/zalando/patroni/tree/v2.1.5