News & Blog
Stay informed with the latest insights, trends, and updates from SolanaLink.

The Cosmos SDK provides a framework for building sovereign, application-specific blockchains, utilizing a sophisticated, structured state management system. At its foundation, the SDK relies on the IAVL store (Immutable AVL tree) to maintain cryptographic state commitment and manage primary storage operations.1 The IAVL tree structure is central to state verification, serving as the core mechanism that ensures strict determinism—the non-negotiable principle requiring every correct validator to compute an identical state transition result given identical input.2
While the IAVL tree covers the majority of conventional application state, the modular and sovereign nature of custom app-chains often introduces complexities. Application modules, particularly those supporting highly specialized or large data structures like CosmWasm smart contract blobs, may store data outside the IAVL tree for structural or performance reasons.1 A standard backup procedure focused exclusively on archiving the IAVL database (typically data/application.db under the node's home directory) is therefore incomplete for a custom application chain. If module-specific external state is not archived, the restored node's application layer will compute state that diverges from the rest of the network, and the node will halt with a consensus failure caused by an app hash mismatch.2 Identifying and securing both IAVL and non-IAVL state is the primary challenge for robust custom app-chain backups.
CometBFT (the consensus engine used by the Cosmos SDK) is a Byzantine Fault Tolerant (BFT) system in which a defined set of validators is responsible for committing new blocks.4 Validators participate by cryptographically signing and broadcasting votes (Prevote and Precommit messages) with their consensus private key, which is held by the consensus engine's private validator signer (a local key file or a remote signer) rather than by the application; the application itself communicates with CometBFT through the Application Blockchain Interface (ABCI).5
The safety of the entire network rests on the absolute guarantee that a validator identity will never sign two conflicting blocks at the same height. This guarantee is enforced by the validator's static private key in conjunction with a dynamic state file that tracks the last block signed. Loss of control or simultaneous use of these two components—the validator's identity and its signing history—leads directly to a catastrophic network failure initiated by the validator itself. Therefore, any failure in the backup or recovery control sequence that allows the same private key to potentially sign two different blocks at the same height results in an immediate and existential threat to the validator.
Validator misbehavior is governed by explicit slashing conditions, and operators must clearly distinguish between the risks of downtime and the catastrophic consequences of double signing (equivocation). If a validator suffers network connectivity issues or a crash and misses more than 95% of the last 10,000 blocks, the punishment is a minimal slash of 0.01% of bonded stake.6 This is a recoverable event. The exact window and slash fractions are on-chain parameters, so operators should confirm the values for their own chain.
Conversely, double signing, which occurs if the same validator identity signs two different blocks at the same height, is treated as the network's most severe infraction.7 Detection of this fault results in a mandatory, severe slash of 5% of the validator’s and its delegators’ bonded stake.6 Most importantly, the validator is permanently deactivated, a state known as tombstoning (jailed until 9999-12-31T23:59:59Z).7 A tombstoned validator cannot be unjailed and is permanently excluded from the active set, necessitating the creation of an entirely new validator identity to resume participation.7
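To make the disparity between the two penalties concrete, the arithmetic can be sketched as below. The function name and the stake figure are illustrative assumptions; the fractions are simply the ones quoted above, expressed in basis points (1 bps = 0.01%).

```bash
# Illustrative arithmetic only. The fractions are the figures quoted in the
# text (0.01% for downtime, 5% for double signing); real values are chain
# parameters and may differ.

# slash_amount STAKE BASIS_POINTS
# Prints the slashed amount in the same base unit as STAKE.
# 5% = 500 bps, 0.01% = 1 bps.
slash_amount() {
    stake=$1
    bps=$2
    echo $(( stake * bps / 10000 ))
}

# Example with a hypothetical 1,000,000 bonded tokens:
#   slash_amount 1000000 1      # downtime
#   slash_amount 1000000 500    # double signing
```

On those assumed numbers the downtime slash is 100 tokens and the double-sign slash is 50,000 tokens, before counting the permanent loss of the validator identity.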
The overwhelming disparity between the minimal, recoverable penalty for downtime and the devastating, permanent penalty for equivocation dictates a strict Precautionary Mandate for all backup and recovery protocols. Procedures must prioritize absolute safety, meaning zero risk of double signing, even if that requires accepting a brief period of downtime. To further secure operations against network failures that might necessitate emergency recovery, validators are strongly advised to adopt a Sentry Node Architecture, which isolates the validator node in a private network and mitigates the risk of Distributed Denial-of-Service (DDoS) attacks by routing traffic through trusted, disposable sentry nodes.4
A Cosmos SDK application’s persistent state and configuration are centralized within a dedicated home directory, typically denoted as ~/.<appd> (where <appd> is the application binary name, such as gaiad). This directory is structurally organized into the config/ subdirectory, which holds static identity and configuration files (keys, network configuration, genesis file), and the data/ subdirectory, which contains the dynamic state of the blockchain (the IAVL database) and the highly sensitive dynamic signing record.
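For operational recovery, three specific files define the validator's identity and control its signing safety: config/priv_validator_key.json (the consensus private key used to sign votes), config/node_key.json (the node's P2P identity key), and data/priv_validator_state.json (the record of the last signed height, round, and step).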
In addition to these files, the mnemonic seed phrase associated with the validator's operator wallet key—used for on-chain management transactions—must be secured through proven, independent, and secure methods.10
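A quick pre-backup check that the identity and signing files are actually present can be sketched as follows. It assumes the standard CometBFT file layout (config/priv_validator_key.json, config/node_key.json, data/priv_validator_state.json); the function name is hypothetical, and a node using a remote signer would not have a local key file.

```bash
# check_validator_files HOME_DIR
# Prints one line per missing (or empty) file and returns non-zero if any
# of the three files is absent. Paths assume the default CometBFT layout.
check_validator_files() {
    home_dir=$1
    missing=0
    for rel in config/priv_validator_key.json \
               config/node_key.json \
               data/priv_validator_state.json; do
        if [ ! -s "$home_dir/$rel" ]; then
            echo "MISSING: $rel"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (the home path is a placeholder):
#   check_validator_files "$HOME/.<appd>" || echo "do not proceed with backup"
```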
The bulk of the archive consists of the blockchain's database state, located within the data/ directory, which houses the IAVL tree (application.db) alongside the CometBFT block and state stores.11 Archiving the full database eliminates the need for State Sync or prolonged block replay, significantly accelerating recovery time. However, the size of this data can be substantial. While the database state can in principle be rebuilt from the network via lengthy block replay or State Sync,12 the three critical identity and signing-state files cannot be reproduced from the network and are therefore the highest priority for secure archival.
The critical assets and their associated risks are summarized in Table 1 (see Chapter 2.2).
This protocol outlines the directive steps to achieve a true "cold" state before archival, ensuring the integrity of the signing record.
Before initiating any maintenance, backup, or migration, the operator must verify the integrity and security of the recovery seed phrase for the validator operator wallet. This confirmation serves as the ultimate safeguard for the validator's on-chain identity.
The first mandatory step in a cold backup is the graceful termination of the node daemon. This is typically executed using system-level commands, such as systemctl stop <appd>.service. A graceful halt is vital because it guarantees that the CometBFT consensus engine commits the final signing height, round, and step to the priv_validator_state.json file before the process terminates.10 Following the stop command, the operator must confirm that the process is completely inactive.
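Once the daemon has stopped, it can be useful to record the final signing position so it can be compared against the archive and against the restored node later. The sketch below assumes the signing state file stores height, round, and step as top-level numeric fields (with height quoted as a string); the helper names are hypothetical, and it deliberately avoids a dependency on jq.

```bash
# json_number FILE KEY
# Prints the first numeric value stored under KEY, whether or not the
# number is wrapped in quotes. Intended only for flat files like the
# signing state record, not as a general JSON parser.
json_number() {
    sed -n "s/.*\"$2\"[[:space:]]*:[[:space:]]*\"\{0,1\}\([0-9][0-9]*\)\"\{0,1\}.*/\1/p" "$1" | head -n 1
}

# signing_state FILE
# Prints the last signed height, round, and step on one line.
signing_state() {
    echo "height=$(json_number "$1" height) round=$(json_number "$1" round) step=$(json_number "$1" step)"
}

# Usage after the stop command has completed (path is a placeholder):
#   signing_state "$HOME/.<appd>/data/priv_validator_state.json"
```

Running it twice a few seconds apart and seeing the same output is a cheap extra signal that nothing is still signing, although it does not replace checking the service status.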
Immediate archival of the sensitive files must commence once the node is verified as cold. The accepted method for bundling and compression is the use of the tar utility.11 The archival process must be meticulous, capturing all necessary components:
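```bash
# Run from the application home directory so archive paths stay relative.
cd ~/.<appd>
# custom_state_dir is a placeholder for any module-specific state directory.
tar -czvf validator_cold_backup_$(date +%Y%m%d_%H%M).tar.gz config data custom_state_dir
```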
The archival process must begin immediately after the daemon stops. The priv_validator_state.json file is a height-locked record; its content should not change after a clean halt, but archiving it in the same pass as the database, before anything can restart the daemon, keeps the signing record consistent with the last committed state.
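Before the archive is trusted, its contents can be listed and checked for the identity and signing files. The following is a minimal sketch, assuming the archive was created from inside the home directory so that member names are relative (config/..., data/...); the function names are hypothetical.

```bash
# archive_has ARCHIVE MEMBER
# Succeeds if MEMBER is listed in the gzip-compressed tar archive.
archive_has() {
    tar -tzf "$1" | grep -xF "$2" >/dev/null
}

# verify_backup ARCHIVE
# Prints one line per critical file that is not in the archive and
# returns non-zero if any is missing.
verify_backup() {
    archive=$1
    bad=0
    for member in config/priv_validator_key.json \
                  config/node_key.json \
                  data/priv_validator_state.json; do
        if ! archive_has "$archive" "$member"; then
            echo "NOT IN ARCHIVE: $member"
            bad=1
        fi
    done
    return "$bad"
}
```

This only confirms that the files were captured; it says nothing about database integrity, which still has to be validated by restoring the archive on a test host.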
Because the resulting archive contains the validator's private key, it represents a highly sensitive security asset. The resulting tarball must be encrypted using strong, audited encryption standards (e.g., GPG or industry-standard key management systems) and stored securely in an offsite location. While specialized tools like Restic or Cosmos Manager can automate scheduled, encrypted backups, the responsibility for securing the cold, cryptographic archive remains strictly with the validator operator.13
Application-specific blockchains are architected for sovereignty, enabling them to implement complex, custom business logic.2 This often involves module development that places data outside the standard IAVL key-value store to manage specific resource types or optimize performance.1 For example, modules handling large contract blobs may utilize dedicated file system paths or external database connections. If these external state locations are not captured during a cold backup, the recovered application will suffer from internal state inconsistency, leading to a consensus failure and non-deterministic behavior.
The Cosmos SDK addressed the issue of external state management through ADR 049, introducing State Sync Hooks.14 This mechanism allows application modules to explicitly include their non-IAVL state in a snapshot stream by utilizing SnapshotExtensionMeta and SnapshotExtensionPayload messages.16
The implementation of ADR 049 provides a critical diagnostic tool for backup strategists. If an application utilizes these hooks, it serves as a strong indication that consensus-critical state is being maintained in file system locations or external databases separate from the IAVL tree. These external locations must be explicitly identified by the operator and added to the cold backup script in Phase 2. For chains leveraging CosmWasm, for instance, confirming the backup of the specific directory containing compiled Wasm contract blobs is essential for achieving a deterministic recovery.16
To execute a complete backup, operators must review the custom application's source code (e.g., app.go or module configuration files) to determine which modules register the ExtensionSnapshotter interface. This investigation reveals the exact file paths or database connection strings used by modules that handle external state. Every identified file system path must be passed to the archival command in place of the custom_state_dir placeholder (Chapter 3.3), so that the restored node has a complete, deterministic view of the entire application state.
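The source review described above can start with a simple text search. This sketch lists Go files in a checked-out source tree that mention ExtensionSnapshotter or a RegisterExtensions call on the snapshot manager; the function name is hypothetical, and a text match is only a starting point for reading the module code, not proof that external state exists.

```bash
# find_snapshot_extensions SOURCE_DIR
# Prints, in sorted order, the Go files under SOURCE_DIR that reference
# snapshot extension registration.
find_snapshot_extensions() {
    find "$1" -type f -name '*.go' \
        -exec grep -l -E 'ExtensionSnapshotter|RegisterExtensions' {} + | sort
}

# Usage (the checkout path is a placeholder):
#   find_snapshot_extensions ./my-chain-source
```

Each file it reports should then be read to find where the module actually writes its data, since that path is what belongs in the backup.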
The safe re-activation of a validator identity is the most sensitive step, demanding a coordinated procedure that eliminates any potential for equivocation.
The recovery process begins by provisioning the target server (Server B) and ensuring all necessary dependencies and the application binary are correctly installed. The cold backup archive (.tar.gz) is then extracted to the application home directory on Server B.
This protocol is the guaranteed mechanism for zero-risk migration, prioritizing the cryptographic confirmation of Server A's inactivity over minimizing downtime.10
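One possible pre-start guard on Server B, offered as an illustration rather than as part of the protocol itself, is to refuse to start if the restored signing record is older than the height recorded on Server A at shutdown. The function name and the idea of passing the recorded height as an argument are assumptions of this sketch.

```bash
# state_not_stale STATE_FILE MIN_HEIGHT
# Succeeds only if the height in the restored signing state file is at
# least MIN_HEIGHT (the last signed height noted on the old server).
# Returns 2 if no height can be read from the file.
state_not_stale() {
    restored=$(sed -n 's/.*"height"[[:space:]]*:[[:space:]]*"\([0-9][0-9]*\)".*/\1/p' "$1" | head -n 1)
    if [ -z "$restored" ]; then
        echo "cannot read height from $1" >&2
        return 2
    fi
    [ "$restored" -ge "$2" ]
}

# Usage (path and height are placeholders):
#   state_not_stale "$HOME/.<appd>/data/priv_validator_state.json" 12345 \
#       || echo "signing state is older than expected: do not start"
```

A passing check does not prove that Server A is offline; it only catches the case where an outdated signing record was restored.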
State Sync is an alternative recovery method used when transferring the full database state is impractical or if the backed-up database is corrupted.12 This method accelerates recovery time from days to minutes by downloading snapshots from trusted peers.12
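State Sync is configured in the [statesync] section of CometBFT's config.toml. The helper below is a sketch that prints such a section from operator-supplied values; the function name is hypothetical, the RPC endpoints, trust height, and trust hash must come from sources the operator independently trusts, and the trust period shown is only an example value.

```bash
# render_statesync RPC1 RPC2 TRUST_HEIGHT TRUST_HASH
# Prints a [statesync] section for config.toml. Two RPC servers are
# passed because light-client verification expects more than one source.
render_statesync() {
    cat <<EOF
[statesync]
enable = true
rpc_servers = "$1,$2"
trust_height = $3
trust_hash = "$4"
trust_period = "168h0m0s"
EOF
}

# Usage (all four arguments are placeholders):
#   render_statesync https://rpc1.example:443 https://rpc2.example:443 1000 ABCDEF
```

The output is meant to be reviewed and merged into the existing config.toml by hand rather than appended blindly, since the file normally already contains a [statesync] section.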
The recovery procedure is incomplete until the restored node is independently verified as safely participating in consensus.
Post-recovery integrity is verified using the application CLI's status command, which provides aggregated information including NodeInfo, SyncInfo, and ValidatorInfo.17
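The sync portion of that check can be scripted. This sketch reads the status command's JSON from standard input and succeeds only when the catching_up field is false; the function name is hypothetical, and because the surrounding field names and nesting differ between SDK versions, it matches only on that one key.

```bash
# is_caught_up
# Reads status JSON on stdin; succeeds if it reports catching_up: false.
is_caught_up() {
    grep -E '"catching_up"[[:space:]]*:[[:space:]]*false' >/dev/null
}

# Usage (binary name is a placeholder; some versions print status to stderr):
#   <appd> status 2>&1 | is_caught_up && echo "node is synced"
```

A synced node is necessary but not sufficient: the validator's signing activity still has to be confirmed separately.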
If the integrity check reveals inconsistencies or if the node fails to stabilize, this suggests database corruption or state non-determinism, even if the cryptographic keys are correct. The established mitigation strategy in this scenario is to perform an unsafe-reset-all and execute Protocol 5B (State Sync) to rebuild the database state from the network's known trusted state.19 Because unsafe-reset-all also resets the signing record in priv_validator_state.json, that file must be copied aside before the reset and restored before the validator signs again.
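Since a reset of the data directory can also reset the validator's signing record, one cautious pattern is to wrap the reset so the signing state file is saved first and put back afterward. The wrapper below is a sketch with a hypothetical name; the reset command itself is supplied by the caller, because the exact subcommand path for unsafe-reset-all varies between SDK versions.

```bash
# reset_preserving_signing_state HOME_DIR RESET_COMMAND [ARGS...]
# Saves data/priv_validator_state.json, runs the given reset command,
# then restores the saved file. Returns the reset command's exit status,
# or 1 if the file could not be saved or restored.
reset_preserving_signing_state() {
    home_dir=$1
    shift
    state="$home_dir/data/priv_validator_state.json"
    saved=$(mktemp) || return 1
    cp -p "$state" "$saved" || return 1
    status=0
    "$@" || status=$?            # caller-supplied reset command
    mkdir -p "$home_dir/data" || return 1
    cp -p "$saved" "$state" || return 1
    rm -f "$saved"
    return "$status"
}

# Usage (binary name and subcommand path are placeholders):
#   reset_preserving_signing_state "$HOME/.<appd>" <appd> <reset subcommand>
```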
While integrity checks confirm synchronization, continuous, real-time monitoring of block signing activity (precommit rates) is essential to minimize the window for downtime slashing (0.01%).6 Rapid detection of a non-signing validator allows for immediate technical remediation, ensuring high availability.
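For alerting, it helps to know how much headroom remains before the downtime threshold is crossed. The sketch below computes that margin from the signed-blocks window, the minimum signed percentage, and the current missed-block count (as reported, for example, by the slashing module's signing-info query). The function name is hypothetical, and the figures in the example are the ones quoted earlier in this article, not universal values.

```bash
# missable_blocks_left WINDOW MIN_SIGNED_PERCENT MISSED
# Prints how many more blocks in the current window can be missed before
# the missed count reaches the maximum allowed.
missable_blocks_left() {
    max_missed=$(( $1 - $1 * $2 / 100 ))
    echo $(( max_missed - $3 ))
}

# Example: a 10,000-block window, 5% minimum signed, 120 blocks missed.
#   missable_blocks_left 10000 5 120
```

With those example inputs the allowed maximum is 9,500 missed blocks, leaving 9,380; an alert threshold would normally fire long before the margin gets close to zero.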
After a successful cold recovery, the recovered validator should adopt or reinforce the Sentry Node Architecture. This design protects the high-value validator node by isolating it within a secure, private network and connecting it only to trusted, publicly facing sentry nodes.4 This mitigation shifts the burden of external network-level attacks, such as DDoS, away from the core signing component, thereby bolstering the long-term security posture and preventing future forced outages.
The meticulous cold backup and recovery of a custom Cosmos SDK validator is governed by the absolute necessity of preventing double signing. The analysis underscores that validator resilience requires a comprehensive archival strategy that captures the entire deterministic application state, encompassing not only the IAVL database but also any external state managed by custom modules identified via the architecture signals of ADR 049.
The absolute priority for safe recovery is Protocol 5A: Zero-Risk Migration via Key Decoupling. This coordinated shutdown and cryptographic revocation process, which demands the permanent destruction of the private key on the old instance before activation on the new instance, is the only methodology that provides a guaranteed defense against the permanent tombstoning associated with equivocation. A successful cold recovery must therefore adhere to a security posture where temporary liveness is always sacrificed in favor of safety and identity preservation.