2025 (English) In: IEEE Transactions on Communications, ISSN 0090-6778, E-ISSN 1558-0857, Vol. 73, no. 12, p. 13087-13102. Article in journal (Refereed) Published
Abstract [en]
This work studies gradient coding (GC) in the context of distributed training with unreliable communication. We propose cooperative GC (CoGC), a novel gradient-sharing-based GC framework that leverages cooperative communication among clients. This approach eliminates the need for dataset replication, making it communication- and computation-efficient and suitable for federated learning (FL). Under the standard GC decoding mechanism, CoGC yields strictly binary outcomes: the global model is either recovered exactly or the recovery is meaningless, with no intermediate outcomes. This property ensures the optimality of training and provides strong resilience to client-to-server communication failures. However, because the recovery outcomes lack flexibility, the decoding mechanism may also cause communication inefficiency and hinder convergence, especially when the communication channels among clients are in poor condition. To overcome this limitation and further exploit the potential of GC matrices, we propose a complementary decoding mechanism, termed GC⁺, which leverages information that would otherwise be discarded during GC decoding failures. This approach significantly improves system reliability under unreliable communication, as full recovery of the global model dominates in GC⁺. To conclude, this work establishes solid theoretical frameworks for both CoGC and GC⁺. We assess system reliability through outage and convergence analyses for each decoding mechanism, along with a rigorous investigation of how outages affect the structure and performance of GC matrices. Finally, the effectiveness of CoGC and GC⁺ is validated through extensive simulations.
Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
Keywords
Complementary decoding mechanism, Convergence, Cooperative gradient coding, Federated learning, Secure aggregation, Straggler mitigation, Unreliable communication
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-371622 (URN)
10.1109/TCOMM.2025.3612589 (DOI)
001649704400032 ()
2-s2.0-105017454960 (Scopus ID)
Note
QC 20260123
2025-10-17 2025-10-17 2026-01-23 Bibliographically approved