In July 2025, the paper "ML-Cube: Accelerating Module-Lattice-Based Cryptography using Machine Learning Accelerators with a Memory-Less Design", written by students under the guidance of Associate Professor Fangyu Zheng (corresponding author) of the School of Cryptology, University of Chinese Academy of Sciences, was accepted by the 32nd ACM Conference on Computer and Communications Security (ACM CCS).
Extending the group's research at CHES 2024, the paper proposes the ML-Cube scheme. ML-Cube adopts a nearly memory-less approach, using the Tensor Core machine-learning accelerator to implement the module-lattice-based cryptographic algorithms ML-KEM and ML-DSA, that is, the NIST-standardized key encapsulation mechanism Kyber and digital signature algorithm Dilithium, with performance substantially ahead of existing work. The scheme's combination of "Memory-Less × Machine Learning × Module-Lattice" is also the origin of its name, ML-Cube (ML^3). This research was supported by the National Cryptologic Science Foundation general project "High-Speed Implementation of Lattice-Based Cryptography Fused with AI Accelerators" (No. 2025NCSF02005).
ACM CCS, first held in 1993, has a history of more than thirty years and is internationally recognized as a flagship conference in information security. Together with IEEE S&P, USENIX Security, and NDSS, it is ranked among the four top international academic conferences in the field, and it is a Class A conference recommended by the China Computer Federation (CCF). ACM CCS 2025 will be held on October 13-17, 2025, in Taiwan, China.
Abstract: The rapid advancement of AI technologies has led to a dramatic surge in computational demands, driving significant breakthroughs in ML accelerators. The powerful performance of these accelerators has attracted the attention of cryptography researchers, and recent studies have begun to explore their use in accelerating cryptographic operations. However, treating these accelerators as black boxes leads to high latency and strict concurrency requirements, which hinder their practical deployment.
In this paper, we go beyond the black-box treatment of ML accelerators and introduce ML-Cube (ML^3), a novel memory-less framework that leverages ML accelerators to implement module-lattice-based PQC, FIPS 203 ML-KEM, and FIPS 204 ML-DSA. The performance benefits of ML-Cube arise from our thorough analysis of ML accelerator internals. Rather than treating the accelerators as black boxes, we dissect their operating mechanisms and design tailored mathematical transformations for cryptographic acceleration. This enables memory-less (I)NTT and polynomial multiplication that minimizes external memory dependencies and reduces latency. We further address the high latency and excessive parallelism demands of traditional SIMT-based implementations by fully parallelizing both ML-KEM and ML-DSA schemes. Our experiments show that our Tensor Core-based (I)NTT achieves a 2.03×–3.56× speedup over a highly-optimized CUDA-core implementation. Moreover, our memory-less polynomial multiplication attains a 10× speedup, and the full ML-KEM reaches up to a 3.58× speedup with less than one-tenth the latency compared with the SOTA approach (CHES ’24). Additionally, our enhanced ML-DSA implementation offers a 30% to 55% throughput improvement over the previous SOTA methods (TDSC ’24) under the server-oriented model. Importantly, by confining core computations within registers, our approach inherently mitigates memory disclosure and cache-based side-channel attacks, thereby enhancing overall security.
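To give a sense of why matrix-multiply hardware fits this workload: a (cyclic) NTT over Z_q is just multiplication by a Vandermonde matrix of powers of a primitive root of unity, which is exactly the shape a Tensor Core MMA unit consumes. The toy NumPy sketch below illustrates this idea with the ML-KEM modulus q = 3329 at a toy size N = 8; the function names and toy parameters are illustrative and are not the paper's actual kernels (the real ML-KEM NTT is a negacyclic, incomplete NTT of size 256).

```python
import numpy as np

Q = 3329   # ML-KEM modulus (FIPS 203)
N = 8      # toy transform size; illustrative only

def primitive_root_of_unity(n, q):
    """Find an element of exact order n in Z_q* (n a power of two dividing q-1)."""
    for g in range(2, q):
        w = pow(g, (q - 1) // n, q)   # w^n == 1 by construction
        if pow(w, n // 2, q) != 1:    # order does not divide n/2 => order is n
            return w
    raise ValueError("no primitive n-th root of unity found")

w = primitive_root_of_unity(N, Q)

# Cyclic NTT = multiplication by the Vandermonde matrix W[i][j] = w^(i*j) mod Q.
W = np.array([[pow(w, i * j, Q) for j in range(N)] for i in range(N)],
             dtype=np.int64)
w_inv = pow(w, Q - 2, Q)              # w^{-1} mod Q (Q is prime)
W_inv = np.array([[pow(w_inv, i * j, Q) for j in range(N)] for i in range(N)],
                 dtype=np.int64)
N_inv = pow(N, Q - 2, Q)              # N^{-1} mod Q

def ntt_matmul(a):
    # One matrix-vector product mod Q: the operation an MMA unit accelerates.
    return (W @ a) % Q

def intt_matmul(a_hat):
    # Inverse transform: scaled product with the inverse Vandermonde matrix.
    return (N_inv * (W_inv @ a_hat)) % Q

a = np.arange(N, dtype=np.int64)
assert np.array_equal(intt_matmul(ntt_matmul(a)), a)   # round trip recovers a
```

Batching many such matrix products back-to-back, with operands held in registers rather than spilled to memory, is the memory-less flavor the abstract describes.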
Paper information: Tian Zhou, Fangyu Zheng, Zhuoyu Xie, Wenxu Tang, Guang Fan, Yijing Ning, Yi Bian, Jingqiang Lin, Jiwu Jing, “ML-Cube: Accelerating Module-Lattice-Based Cryptography using Machine Learning Accelerators with a Memory-Less Design”, 32nd ACM Conference on Computer and Communications Security (CCS), 2025. (CCF-A, one of the four top security conferences)