FPGA Accelerator for 3DES Algorithm Based on OpenCL: Design and Performance Analysis

1. Introduction & Overview

This paper presents the design and implementation of a high-performance Field-Programmable Gate Array (FPGA) accelerator for the Triple Data Encryption Standard (3DES) algorithm, built with the Open Computing Language (OpenCL) framework. The work is motivated by limitations of current data encryption practice in domains such as digital currency, blockchain, and cloud data security. Traditional software-based encryption suffers from slow computation, high host resource consumption, and significant power draw. FPGA implementations written in hardware description languages (HDLs) such as Verilog or VHDL perform well, but they entail long development cycles and are hard to maintain and upgrade.

The proposed solution leverages OpenCL's high-level synthesis (HLS) capabilities to bridge this gap. It aims to achieve near-hardware performance while significantly improving developer productivity and design flexibility. The accelerator employs a 48-iteration pipeline parallel structure (matching 3DES's three 16-round DES operations) and integrates multiple optimization strategies across data transmission and computation kernels to maximize throughput and energy efficiency on an Intel Stratix 10 GX2800 FPGA platform.

Key figures: peak throughput 111.8 Gb/s; 372x speedup vs. CPU (i7-9700); 9x energy efficiency vs. GPU (GTX 1080 Ti); 48 pipeline stages.

2. Technical Background & Related Work

2.1 The 3DES Algorithm

3DES (Triple DES) is a symmetric-key block cipher constructed from the Data Encryption Standard (DES). To harden DES against brute-force attacks on its 56-bit key, 3DES applies the DES cipher three times with either two or three independent keys. The variant discussed here uses three keys (K1, K2, K3), giving a nominal key length of 168 bits (the effective strength is roughly 112 bits because of meet-in-the-middle attacks), and follows the Encrypt-Decrypt-Encrypt (EDE) sequence: $\text{Ciphertext} = E_{K3}(D_{K2}(E_{K1}(\text{Plaintext})))$. This results in 48 rounds of Feistel network operations ($3 \times 16$). Despite being superseded by AES for new applications, 3DES remains widely deployed in legacy financial systems, electronic payment networks (as noted in the People's Bank of China specifications), and other areas where backward compatibility is crucial.
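The EDE composition can be sketched in a few lines. The block below is a minimal illustration and not real DES: `toy_encrypt` and `toy_decrypt` are hypothetical stand-ins (a 16-round Feistel network over 64-bit blocks with a made-up round function and key schedule) so that the example stays short and self-contained. It shows only how the three passes compose, and that setting K1 = K2 = K3 collapses the construction to a single pass, which is what makes 3DES backward compatible with single DES.

```python
MASK32 = 0xFFFFFFFF
ROUNDS = 16  # one DES pass has 16 Feistel rounds


def _round_keys(key):
    # Hypothetical key schedule: NOT the DES schedule, just 16 derived 32-bit values.
    return [(key * (2 * i + 1) + i) & MASK32 for i in range(ROUNDS)]


def _f(right, subkey):
    # Hypothetical round function standing in for the DES F(R, K).
    return ((right * 0x9E3779B1) ^ subkey) & MASK32


def _feistel(block, subkeys):
    left, right = (block >> 32) & MASK32, block & MASK32
    for k in subkeys:
        left, right = right, left ^ _f(right, k)
    # Swap the halves at the end so decryption is the same network with reversed subkeys.
    return (right << 32) | left


def toy_encrypt(block, key):
    return _feistel(block, _round_keys(key))


def toy_decrypt(block, key):
    return _feistel(block, _round_keys(key)[::-1])


def ede_encrypt(block, k1, k2, k3):
    # Ciphertext = E_K3(D_K2(E_K1(Plaintext))): 3 x 16 = 48 rounds in total.
    return toy_encrypt(toy_decrypt(toy_encrypt(block, k1), k2), k3)


def ede_decrypt(block, k1, k2, k3):
    # Plaintext = D_K1(E_K2(D_K3(Ciphertext)))
    return toy_decrypt(toy_encrypt(toy_decrypt(block, k3), k2), k1)
```

Because each pass is a Feistel network, decryption reuses the encryption datapath with the subkeys in reverse order, so the middle "decrypt" pass costs the same hardware as the two "encrypt" passes.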

2.2 OpenCL for FPGA Acceleration

OpenCL provides a standardized framework for parallel programming across heterogeneous platforms (CPUs, GPUs, FPGAs, DSPs). For FPGAs, tools like the Intel FPGA SDK for OpenCL act as HLS compilers, translating kernel code written in a C-like language into efficient hardware circuits. This abstracts away low-level HDL complexities, enabling software engineers to exploit FPGA parallelism. Prior research demonstrates its efficacy: optimized MD5 hashing achieved 6.1x speedup [4], Kuznyechik cipher reached 41 Gb/s [5], and AES implementations leveraged OpenCL's SIMD capabilities [6]. This work extends the paradigm to 3DES, focusing on architecture-aware optimizations specific to the algorithm's structure and FPGA memory hierarchy.

3. Accelerator Architecture & Design

The core innovation lies in a co-designed architecture that optimizes both data movement and computation.

3.1 Overall Pipeline Parallel Structure

The accelerator is designed around a deep pipeline that maps directly to the 48 rounds of 3DES. Each round's operations are finely pipelined, allowing a new data block to enter the pipeline every clock cycle after an initial latency. This spatial parallelism is crucial for sustaining high throughput.
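The effect of "one block per clock cycle after an initial latency" can be made concrete with a small timing model. This is a sketch under stated assumptions: the default depth of 48 assumes exactly one pipeline stage per round (the real design may be deeper, since each round is itself finely pipelined), and the 300 MHz clock in the usage note is a hypothetical value, not a figure from the paper.

```python
def pipeline_cycles(num_blocks, depth=48, initiation_interval=1):
    """Cycles needed to push num_blocks through a pipeline of the given depth.

    The first block takes `depth` cycles; every further block completes
    `initiation_interval` cycles after the previous one.
    """
    if num_blocks <= 0:
        return 0
    return depth + (num_blocks - 1) * initiation_interval


def throughput_gbps(clock_mhz, block_bits=64, blocks_per_cycle=1):
    """Steady-state throughput in Gb/s once the pipeline is full."""
    return clock_mhz * 1e6 * block_bits * blocks_per_cycle / 1e9
```

For example, `pipeline_cycles(1000)` is 1047 cycles, so the fill latency is amortized quickly, and `throughput_gbps(300)` gives 19.2 Gb/s for a single 64-bit pipeline at the assumed clock. Under this model, reaching throughput on the order of 100 Gb/s requires several blocks per cycle, which is where vectorization and compute unit replication come in.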

3.2 Data Transmission Module Optimizations

To keep the kernel from becoming memory-bound, the design implements two key strategies:

Data Storage Adjustment: Reorganizing data structures in host and device memory to enable efficient burst transfers and align with the FPGA's memory controller capabilities, reducing access overhead.
Data Bit-width Improvement: Increasing the width of data paths between memory and the kernel processing elements. This maximizes the utilization of the available external memory bandwidth, feeding the computational units more efficiently.
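A host-side view of the bit-width idea is packing several 64-bit cipher blocks into one wide memory word, so that each memory transaction delivers several blocks at once. The sketch below assumes a 512-bit word (eight blocks per word); that width and the helper names `pack_blocks` and `unpack_word` are illustrative assumptions, not details taken from the paper.

```python
BLOCK_BITS = 64
WORD_BITS = 512  # assumed memory-interface width, not taken from the paper
BLOCKS_PER_WORD = WORD_BITS // BLOCK_BITS
BLOCK_MASK = 2**BLOCK_BITS - 1


def pack_blocks(blocks):
    """Pack 64-bit blocks into 512-bit words, zero-padding the last word."""
    words = []
    for start in range(0, len(blocks), BLOCKS_PER_WORD):
        word = 0
        for lane, block in enumerate(blocks[start:start + BLOCKS_PER_WORD]):
            word |= (block & BLOCK_MASK) << (lane * BLOCK_BITS)
        words.append(word)
    return words


def unpack_word(word):
    """Split one 512-bit word back into its eight 64-bit lanes."""
    return [(word >> (lane * BLOCK_BITS)) & BLOCK_MASK
            for lane in range(BLOCKS_PER_WORD)]
```

Ten blocks pack into two words here, with the second word padded by six zero lanes; a real host program would also need to track the original block count so the padding is not treated as data.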

3.3 Algorithm Kernel Optimizations

Within the OpenCL kernel implementing the 3DES cipher, several optimizations are applied:

Instruction Stream Optimization: Restructuring the kernel code to minimize dependencies and create a continuous, efficient pipeline in the generated hardware. This involves techniques like loop unrolling and operation scheduling.
Kernel Vectorization: Using OpenCL vector data types (e.g., `uint16`) to process multiple data elements simultaneously within a single kernel instance, exploiting data-level parallelism.
Compute Unit Replication: Instantiating multiple copies of the optimized kernel (Compute Units or CUs) on the FPGA fabric. This enables processing multiple independent data streams in parallel, scaling throughput with available resources.
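Vectorization and compute unit replication both amount to deciding which hardware lane handles which block. The sketch below shows one possible assignment; the layout (contiguous chunks per compute unit, consecutive blocks per vector) and the function names are hypothetical, since the paper's actual partitioning is not described here. For orientation, OpenCL's `uint16` is a vector of sixteen 32-bit unsigned integers, i.e. 512 bits, so one such vector could hold eight 64-bit blocks.

```python
def assign_blocks(num_blocks, compute_units, lanes):
    """Map each block index to (compute_unit, work_item, lane).

    Assumed layout (hypothetical): the block stream is cut into contiguous
    chunks, one per compute unit, and each work-item consumes `lanes`
    consecutive blocks as one vector.
    """
    per_cu = -(-num_blocks // compute_units)  # ceiling division
    mapping = []
    for index in range(num_blocks):
        cu, offset = divmod(index, per_cu)
        work_item, lane = divmod(offset, lanes)
        mapping.append((cu, work_item, lane))
    return mapping


def ideal_speedup(compute_units, lanes):
    """Upper bound on speedup over one scalar pipeline, ignoring memory limits."""
    return compute_units * lanes
```

With 16 blocks, 2 compute units, and 4 lanes, block 5 lands on compute unit 0, work-item 1, lane 1. The ideal speedup of 8 is only an upper bound: in practice the product of compute units and lanes is capped by FPGA resources and by external memory bandwidth.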

4. Experimental Results & Performance

The accelerator was implemented and tested on an Intel Stratix 10 GX2800 FPGA. The headline results are:

Throughput: Achieved a peak throughput of 111.801 Gb/s.
vs. CPU (Intel Core i7-9700): The FPGA accelerator delivers a 372x performance improvement and a 644x improvement in energy efficiency.
vs. GPU (Nvidia GeForce GTX 1080 Ti): Although GPUs are strong competitors in parallel compute, the FPGA design achieves 20% higher performance and 9x better energy efficiency.
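The reported ratios imply a few figures that are not stated directly. The arithmetic below is a back-of-the-envelope sketch, assuming that "20% higher performance" means a 1.2x throughput ratio and that energy efficiency is throughput divided by power measured on the same workload. The outputs are derived estimates, not measurements from the paper.

```python
FPGA_GBPS = 111.801
CPU_SPEEDUP, CPU_EFFICIENCY_GAIN = 372, 644
GPU_SPEEDUP, GPU_EFFICIENCY_GAIN = 1.2, 9  # 1.2 assumes "20% higher" means 1.2x


def implied_baseline_gbps(fpga_gbps, speedup):
    """Baseline throughput implied by the FPGA throughput and the reported speedup."""
    return fpga_gbps / speedup


def implied_power_ratio(efficiency_gain, speedup):
    """Baseline power divided by FPGA power.

    With efficiency = throughput / power, the efficiency gain equals
    speedup * (baseline_power / fpga_power), so the power ratio is gain / speedup.
    """
    return efficiency_gain / speedup


cpu_gbps = implied_baseline_gbps(FPGA_GBPS, CPU_SPEEDUP)   # about 0.30 Gb/s
gpu_gbps = implied_baseline_gbps(FPGA_GBPS, GPU_SPEEDUP)   # about 93.2 Gb/s
cpu_power_ratio = implied_power_ratio(CPU_EFFICIENCY_GAIN, CPU_SPEEDUP)  # about 1.73
gpu_power_ratio = implied_power_ratio(GPU_EFFICIENCY_GAIN, GPU_SPEEDUP)  # 7.5
```

Read this way, most of the advantage over the CPU comes from raw throughput, while most of the advantage over the GPU comes from lower power draw.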

Chart Description (Imagined): A bar chart would vividly illustrate this comparison. The x-axis lists platforms: i7-9700 CPU, GTX 1080 Ti GPU, Stratix 10 FPGA. The left y-axis (log scale) shows Normalized Throughput, with the FPGA bar towering over the others. The right y-axis shows Energy Efficiency (Throughput per Watt), where the FPGA bar would again be the tallest, emphasizing its superior performance-per-watt characteristic, a critical metric for data center deployment.

5. Key Insights & Analyst Perspective

Core Insight: This paper isn't just about making 3DES fast; it's a blueprint for pragmatically modernizing legacy cryptographic workloads for the heterogeneous computing era. The real breakthrough is demonstrating that OpenCL-based HLS can deliver specialized hardware efficiency (beating a high-end GPU) with general-purpose programming agility. It validates the "software-defined hardware" promise for a critical, well-defined domain.

Logical Flow: The authors correctly identify the dual problem: software is too slow/inefficient, and traditional HDL is too rigid. Their solution flow is logical: 1) Choose a mature HLS framework (OpenCL SDK) for accessibility. 2) Architect a deep pipeline matching the algorithm's innate structure (48 rounds). 3) Attack both data I/O and computation bottlenecks with targeted optimizations (bit-width, vectorization, replication). The result is a balanced design that doesn't just compute quickly but also moves data efficiently—a common oversight in naive HLS projects.

Strengths & Flaws: The strength is in the compelling, quantified result: 372x over CPU and 9x energy gain over a GPU is a powerful data point for FPGA advocates. The use of a real, deployed algorithm (3DES) adds practical weight versus toy benchmarks. However, the analysis has flaws. First, it compares against a consumer GPU (GTX 1080 Ti) rather than a modern data-center GPU (e.g., A100) or a dedicated cryptographic accelerator. Second, while resource utilization is mentioned as "low," no absolute figures (LUTs, BRAM, DSPs) are provided, making cost-effectiveness hard to assess. Third, the focus on 3DES, while justified, is inherently limiting; the methodology's applicability to modern algorithms like AES-GCM or post-quantum candidates is the true test of its value, akin to how the foundational CycleGAN paper [1] provided a general framework for image-to-image translation beyond its specific examples.

Actionable Insights: For industry practitioners, this work provides a clear playbook. When facing batch-oriented, latency-tolerant, legacy cryptographic workloads (payment gateways, legacy data encryption), an OpenCL-based FPGA approach is now a proven, high-efficiency option. The priority should be to apply this methodology to AES and SHA-3 families. For researchers, the next step is to abstract these optimization patterns—pipeline-depth matching, memory access coalescing, compute unit scaling—into a semi-automated design tool or template library for cryptographic HLS, reducing the expert knowledge required. The ultimate goal should be a system as adaptable as software but with hardware performance, a direction highlighted by DARPA's DSSoC program [2] which seeks to create agile, domain-specific system-on-chip platforms.

6. Technical Deep Dive

6.1 Mathematical Formulation of DES Core

The DES round function $F(R, K)$, central to both DES and 3DES, is a Feistel function. For a 32-bit input half-block $R$ and a 48-bit round key $K$:

Expansion: $R$ is expanded to 48 bits via a fixed permutation table $E$: $E(R)$.
Key Mixing: XOR with the round key: $A = E(R) \oplus K$.
Substitution (S-boxes): The 48-bit $A$ is split into eight 6-bit chunks. Each chunk enters a different non-linear 6-to-4 bit substitution box ($S_1$ to $S_8$), producing eight 4-bit outputs concatenated into a 32-bit result $B$.
Permutation (P-box): A final fixed permutation $P$ is applied: $F(R, K) = P(B)$.

The FPGA implementation heavily optimizes steps 1, 3, and 4 (expansion, substitution, and permutation). Expansion and the P-box are fixed bit rearrangements, so in hardware they reduce to wiring. The S-boxes, traditionally implemented as lookup tables (LUTs), map naturally onto the FPGA's native LUT resources. The paper's "instruction stream optimization" likely involves flattening this sequence of operations into a highly pipelined, direct hardware datapath to minimize intermediate staging.
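The first three stages of the round function can be sketched in software as follows. This is a partial illustration: it uses the standard DES expansion table and S-box $S_1$, but omits $S_2$ to $S_8$ and the permutation $P$ to stay short, so it does not compute the full $F(R, K)$. The helper names are illustrative, not taken from the paper's kernel.

```python
# Expansion table E (1-based bit positions; bit 1 is the most significant bit of R).
E_TABLE = [32, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9,
           8, 9, 10, 11, 12, 13, 12, 13, 14, 15, 16, 17,
           16, 17, 18, 19, 20, 21, 20, 21, 22, 23, 24, 25,
           24, 25, 26, 27, 28, 29, 28, 29, 30, 31, 32, 1]

# S-box S1 as 4 rows of 16 entries; S2..S8 are omitted to keep the sketch short.
S1 = [[14, 4, 13, 1, 2, 15, 11, 8, 3, 10, 6, 12, 5, 9, 0, 7],
      [0, 15, 7, 4, 14, 2, 13, 1, 10, 6, 12, 11, 9, 5, 3, 8],
      [4, 1, 14, 8, 13, 6, 2, 11, 15, 12, 9, 7, 3, 10, 5, 0],
      [15, 12, 8, 2, 4, 9, 1, 7, 5, 11, 3, 14, 10, 0, 6, 13]]


def permute(value, table, in_bits):
    """Apply a DES-style bit-selection table to an in_bits-wide integer."""
    out = 0
    for position in table:
        out = (out << 1) | ((value >> (in_bits - position)) & 1)
    return out


def expand_and_mix(right, round_key):
    """Expansion and key mixing: A = E(R) XOR K, a 48-bit value."""
    return permute(right, E_TABLE, 32) ^ round_key


def sbox_lookup(sbox, six_bits):
    """Substitution for one 6-bit chunk.

    The outer two bits select the row, the inner four bits select the column.
    """
    row = ((six_bits >> 4) & 0b10) | (six_bits & 1)
    col = (six_bits >> 1) & 0xF
    return sbox[row][col]
```

For instance, `sbox_lookup(S1, 0b011011)` selects row 1, column 13 and returns 5. In software `permute` loops over the table, whereas in hardware the same selection is a fixed pattern of wires with no logic cost.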

6.2 Analysis Framework: A Non-Code Case Study

Consider analyzing a similar HLS-based accelerator for the SHA-256 hash function. The analyst's framework would mirror this paper's dissection:

Algorithm-Structure Mapping: SHA-256 has 64 rounds. Does the design use a 64-stage pipeline? If not, is it folded due to resource constraints?
Dataflow Bottleneck Identification: SHA-256 operates on a 512-bit message block and a 256-bit state. Is the host-to-FPGA bandwidth sufficient? Is the internal bit-width optimized for the 32-bit word operations of SHA-256?
Optimization Strategy Audit: Are the core functions (Ch, Maj, Σ0, Σ1) inlined and pipelined? Is loop unrolling applied to the round loop? Are multiple message blocks processed concurrently via compute unit replication?
Comparative Benchmarking: Compare results not just to a CPU, but to other HLS implementations [7], dedicated IP cores, and modern GPUs. Normalize for process technology and frequency.

This structured approach moves beyond reporting performance to explaining why that performance was achieved and how the design choices trade off area, speed, and power.

7. Future Applications & Research Directions

The methodology pioneered here has significant future potential:

Modern Cryptography Suite Acceleration: The immediate extension is applying these OpenCL optimization techniques to AES-GCM (for authenticated encryption), SHA-3, and ChaCha20-Poly1305. The pipeline and vectorization strategies are broadly applicable.
Post-Quantum Cryptography (PQC): NIST-standardized PQC algorithms (e.g., CRYSTALS-Kyber, CRYSTALS-Dilithium) are computationally intensive. FPGA accelerators using HLS will be vital for their practical deployment in networks. The compute-unit replication strategy is particularly relevant for lattice-based operations.
Agile, Software-Defined Security: Imagine a data center FPGA that can be dynamically reconfigured via OpenCL kernels to switch between being a 3DES accelerator for legacy banking traffic, an AES accelerator for general web traffic, and a PQC accelerator for secure updates, all based on load. This work is a step towards that vision of "cryptography as a service" in hardware.
Integration with Broader Systems: Future work should explore integrating such accelerators into frameworks like Intel's oneAPI or OpenStack, managing orchestration and scheduling alongside CPU and GPU resources, as envisioned in heterogeneous computing research from institutions like ETH Zurich's Systems Group [3].

8. References

[1] Zhu, J., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). [External Source - Foundational ML Paper]
[2] DARPA. (n.d.). Domain-Specific System on Chip (DSSoC) Program. Retrieved from https://www.darpa.mil/program/domain-specific-system-on-chip. [External Source - Research Program]
[3] ETH Zurich, Systems Group. (n.d.). Research in Heterogeneous and Parallel Computing. Retrieved from https://systems.ethz.ch/research.html. [External Source - Academic Research]
[4] From PDF: Reference on OpenCL-based MD5 acceleration.
[5] From PDF: Reference on Kuznyechik algorithm pipeline design.
[6] From PDF: Reference on AES evaluation on multi-FPGA OpenCL.
[7] From PDF: Reference on OpenCL-based SHA-1 design.
[8] From PDF: NIST DES standard reference.
[9] From PDF: People's Bank of China specification reference.
Wu, J., Zheng, B., Nie, Y., & Chai, Z. (2021). FPGA Accelerator for 3DES Algorithm Based on OpenCL. Computer Engineering, 47(12), 147-155,162. [Primary Paper]