This work studies gradient coding (GC) in the context of distributed training with unreliable communication. We propose cooperative GC (CoGC), a novel gradient-sharing-based GC framework that leverages cooperative communication among clients. CoGC eliminates the need for dataset replication, making it communication- and computation-efficient and well suited to federated learning (FL). Under the standard GC decoding mechanism, CoGC yields strictly binary outcomes: either the global model is recovered exactly, or the recovery is meaningless, with no intermediate outcomes. This property preserves the optimality of training and provides strong resilience to client-to-server communication failures. However, because the recovery outcomes are so inflexible, the decoding mechanism may also cause communication inefficiency and hinder convergence, especially when the communication channels among clients are in poor condition. To overcome this limitation and further exploit the potential of GC matrices, we propose a complementary decoding mechanism, termed GC+, which leverages information that would otherwise be discarded upon GC decoding failures. GC+ significantly improves system reliability under unreliable communication, as full recovery of the global model becomes the dominant outcome. In addition, this work establishes a solid theoretical foundation for both CoGC and GC+. We assess system reliability through outage and convergence analyses for each decoding mechanism, along with a rigorous investigation of how outages affect the structure and performance of GC matrices. Finally, the effectiveness of CoGC and GC+ is validated through extensive simulations.
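To make the all-or-nothing decoding behaviour concrete, the following is a minimal, self-contained sketch of standard GC decoding, using the classic (n, s) = (3, 1) cyclic scheme of Tandon et al. as a stand-in; the encoding matrix B, the partition gradients, and all variable names are illustrative assumptions, not the construction used in this work.

import numpy as np

# Encoding matrix B: worker i transmits B[i] @ grads, where grads[j] is
# the gradient of data partition j. Any 2 of the 3 rows can combine to
# the all-ones vector, so one straggler is tolerated.
B = np.array([[0.5, 1.0,  0.0],
              [0.0, 1.0, -1.0],
              [0.5, 0.0,  1.0]])

grads = np.array([[1.0,  2.0],    # gradient of partition 0
                  [3.0, -1.0],    # gradient of partition 1
                  [0.5,  0.5]])   # gradient of partition 2

def decode(survivors):
    """Recover sum(grads) from the surviving coded messages, or fail outright."""
    msgs = B[survivors] @ grads                  # what the server received
    ones = np.ones(B.shape[1])
    # Find a with a @ B[survivors] = ones (least squares).
    a, *_ = np.linalg.lstsq(B[survivors].T, ones, rcond=None)
    if not np.allclose(a @ B[survivors], ones):
        return None                              # decoding fails: no partial credit
    return a @ msgs                              # exact gradient sum

print(decode([0, 2]))   # workers 1 and 3 arrive -> exact recovery of the sum
print(decode([1]))      # only worker 2 arrives  -> None (total failure)
print(np.allclose(decode([0, 2]), grads.sum(axis=0)))  # True

As the two calls illustrate, decoding either reproduces the gradient sum exactly or returns nothing usable; GC+ can be read as a way to salvage the information in the second case instead of discarding it.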