HybridCluster’s High Availability features emerge from the combination of our core architectural components shown below.

[Figure: simple system diagram]

The storage subsystem uses ZFS to create and manage a set of snapshots and backups that are streamed and synchronised across the cluster. The interval between these backups is typically configured in HybridCluster to be a few minutes. Any recovery therefore starts from a backup that is only a few minutes old, which keeps data loss to a minimum. Because the data is held in a distributed fashion on local storage, retrieving and restoring it and restarting the processes that use it is automated and typically takes only a few minutes.
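As a rough illustration of periodic snapshot replication, the sketch below builds the standard `zfs snapshot`, `zfs send -i` and `zfs receive` command lines for one incremental replication step. It is not HybridCluster's implementation: the snapshot naming scheme, the function names, the dataset `tank/site1` and the host `node2` are all hypothetical, and the commands are only constructed, not executed.

```python
from datetime import datetime, timezone


def snapshot_name(dataset: str, when: datetime) -> str:
    """Build a timestamped snapshot name (naming scheme is an assumption)."""
    return f"{dataset}@auto-{when.strftime('%Y%m%dT%H%M%SZ')}"


def replication_commands(dataset: str, prev_snap: str, new_snap: str, target_host: str):
    """Return the three command lines for one incremental replication step.

    The caller would run `take`, then pipe the output of `send` into
    `receive` (for example with subprocess). Nothing is executed here.
    """
    take = ["zfs", "snapshot", new_snap]
    # -i sends only the blocks that changed between the two snapshots.
    send = ["zfs", "send", "-i", prev_snap, new_snap]
    # -F rolls the target back to its latest snapshot before receiving.
    receive = ["ssh", target_host, "zfs", "receive", "-F", dataset]
    return take, send, receive


if __name__ == "__main__":
    prev = snapshot_name("tank/site1", datetime(2024, 1, 2, 3, 0, 0, tzinfo=timezone.utc))
    new = snapshot_name("tank/site1", datetime(2024, 1, 2, 3, 5, 0, tzinfo=timezone.utc))
    for cmd in replication_commands("tank/site1", prev, new, "node2"):
        print(" ".join(cmd))
```

Running a step like this every few minutes bounds the age of the newest replica to roughly the snapshot interval plus the transfer time.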

The use of ZFS also protects data from silent data corruption, bit rot, bugs in disk firmware, errors in hardware RAID systems and similar faults. Any of these can cause faults and crashes on existing systems, and avoiding them increases your system uptime and availability.

The distributed Control and Management system continuously monitors the cluster at multiple levels, detects errors, and triggers and manages the recovery processes. It monitors everything from individual processes, such as Apache and MySQL, to network connectivity between machines in the cluster, and can detect crashes and failures in all of these components.
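The monitor-then-recover loop described above can be sketched in a few lines. This is a minimal, single-process illustration under stated assumptions, not the Control and Management system itself: the `Check` type, the `run_checks` function and the example probes are hypothetical, and a real system would run such checks on many machines and coordinate the results.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    name: str                      # e.g. "apache", "mysql", "link:node1-node2"
    probe: Callable[[], bool]      # returns True when the component is healthy
    recover: Callable[[], None]    # recovery action to trigger on failure


def run_checks(checks: List[Check]) -> List[str]:
    """Run every probe once; trigger recovery for each failure.

    Returns the names of the checks that failed. A probe that raises
    is treated as a failure rather than crashing the monitor.
    """
    failed = []
    for check in checks:
        try:
            healthy = check.probe()
        except Exception:
            healthy = False
        if not healthy:
            failed.append(check.name)
            check.recover()
    return failed


if __name__ == "__main__":
    actions = []
    checks = [
        Check("apache", lambda: True, lambda: actions.append("restart apache")),
        Check("mysql", lambda: False, lambda: actions.append("restart mysql")),
    ]
    print(run_checks(checks), actions)
```

Keeping probes and recovery actions as plain callables lets the same loop cover process-level and network-level checks.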

Finally, the distributed proxy, AwesomeProxy, is used by the Control and Management system to ensure that requests are routed to the correct machine at the correct time. For example, if a particular machine in the cluster fails, requests can be buffered and then redirected to a different server once the best slave server with a local replica of the data has been automatically promoted to master.
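The buffer-and-redirect behaviour can be illustrated with a small sketch. It does not describe AwesomeProxy's real interface: the `Replica` and `BufferingProxy` names are hypothetical, and "best" is assumed here to mean the healthy replica holding the most recent snapshot, which the text above does not specify.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    host: str
    latest_snapshot: int   # hypothetical, monotonically increasing snapshot id
    healthy: bool = True


def pick_new_master(replicas: List[Replica]) -> Optional[Replica]:
    """Assumed policy: the healthy replica with the newest snapshot wins."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.latest_snapshot)


class BufferingProxy:
    def __init__(self, master: str):
        self.master: Optional[str] = master
        self.pending = deque()

    def handle(self, request: str):
        """Route to the master, or buffer while no master is available."""
        if self.master is None:
            self.pending.append(request)
            return None
        return (self.master, request)

    def mark_failed(self) -> None:
        self.master = None

    def promote(self, replicas: List[Replica]):
        """Promote the best replica and flush buffered requests to it."""
        new_master = pick_new_master(replicas)
        if new_master is None:
            return []
        self.master = new_master.host
        flushed = [(self.master, request) for request in self.pending]
        self.pending.clear()
        return flushed
```

Requests that arrive during the failover window are delayed rather than dropped, so clients see a pause instead of an error.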

A key part of HybridCluster’s Intellectual Property concerns the detection and handling of, and recovery from, errors by the Control and Management system. For example, a classic problem in a distributed system is how to recover when a network fault separates two parts of the system and the two sides are no longer in sync. HybridCluster stays available on both sides of the partition and lets you manage the consistency issue that arises when the network fault is repaired by, for example, automatically electing the most valuable data so that operations can continue, while storing the less valuable data for a network administrator to review.
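One way to picture the reconciliation step after a partition heals is sketched below. How HybridCluster actually measures the "value" of divergent data is not stated here, so the sketch assumes a simple stand-in: the side with more changes since the split is elected, and the other side's data is kept aside for review. The `Side` type and `reconcile` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Side:
    host: str
    changes_since_split: int   # assumed proxy for how valuable the data is


def reconcile(a: Side, b: Side) -> dict:
    """Elect the side with more changes; stash the other for manual review.

    Ties go to the first argument. Nothing is discarded: the losing
    side's divergent data is only set aside.
    """
    if a.changes_since_split >= b.changes_since_split:
        winner, loser = a, b
    else:
        winner, loser = b, a
    return {"master": winner.host, "stashed_for_review": loser.host}


if __name__ == "__main__":
    print(reconcile(Side("node1", 3), Side("node2", 42)))
```

The important property is that operations continue on the elected data while the other copy remains available to an administrator.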
