SHA-256 Algorithm Acceleration

Introduction

We want to know whether FPGA/OpenCL and GPU/CUDA are viable ways to accelerate blockchain applications, specifically the SHA-256 hashing algorithm used during the Bitcoin mining process. To find out, we implemented the algorithm in different ways and compared their cost-benefit and performance. We examine the hash algorithm on file inputs and try to parallelize the implementation for acceleration.

Motivation/Background

In the process of Bitcoin mining, a cryptographic hash algorithm is essential to keeping Bitcoin's blockchain secure. Hash algorithms are considered the building blocks of today's cryptocurrency industry.

With Bitcoin mining, the hardware applies the SHA-256 cryptographic hash algorithm twice to the proof-of-work data. For every attempt at a new hash, the software uses a different number as the nonce until it finds a hash value lower than the target. Thus, the SHA-256 cryptographic hash algorithm is one of the most significant factors in Bitcoin mining speed.

SHA-256 Hashing Algorithm

SHA-256 is one of the most popular hashing algorithms in SHA-2, a set of cryptographic hash functions designed by the United States National Security Agency (NSA). Simply put, for a message of arbitrary length, SHA-256 generates a unique 256-bit hash, called the message digest. In most cases, it is represented as a hex string of length 64.

The SHA-256 algorithm can be divided into three modules: constant value initialization, message pre-processing, and message digest calculation.

For the constant value initialization module, there are eight initial hash values (Figure 1) and 64 hash constants (Figure 2). The eight initial hash values come from the fractional parts of the square roots of the first eight prime numbers. Similarly, the 64 hash constants come from the fractional parts of the cube roots of the first 64 prime numbers. These values are used in the message digest calculation module.

Figure 1

Figure 2

The purpose of the pre-processing module is to shape the input into the structure the message digest calculation module expects. The first step is to append padding bits (a single 1 bit followed by 0s) until the message length is congruent to 448 modulo 512. The second step is to append the original message length as a 64-bit value after the padding bits. The total length of the input will then be a multiple of 512 bits (448 + 64) after passing through the pre-processing module.

In the message digest calculation module, the first step is to break the pre-processed message into 512-bit blocks. The initial value of the message digest is H0. Passing H0 into the calculation for block 1 produces H1, and so on. After going through all n blocks, the result is Hn, the final message digest.

In our previous plan, the optimization focused on the message digest calculation module (the function void transform(const unsigned char *, unsigned int)). For the other two modules, since the operations are almost entirely serial and of low complexity, there would be no obvious speed-up from using an FPGA or GPU.

System architecture and Solutions

Given what we learned about the architectural differences between the CPU and other hardware, it is possible to use an FPGA (Field Programmable Gate Array) or a GPU to accelerate SHA-256.

Even though there are parallel circuits in the CPU's design, at the software level it can only deliver largely serial execution because of the operating system and other high-level languages.

An FPGA is a semiconductor device comprised of an array of look-up tables (LUTs). A LUT can be viewed as a small calculation unit or a simple logic gate that accepts a few inputs, computes an output, and stores it in SRAM memory cells. Routing channels run between the logic blocks and connect LUTs together, so an FPGA can work as an arbitrary combination of logic gates. As a result, an FPGA can process multiple inputs and produce multiple outputs at the same time without interference.

The GPU (Graphics Processing Unit) was originally designed for displaying images. When it processes graphics, a large number of calculations must be done at the same time across different stages. For example, in the rendering stage, the GPU computes vertex shaders, geometry shaders, and pixel shaders to generate accurate output. A GPU is comprised of many ALUs that process multiple pixels simultaneously; each ALU can be viewed as a "thread". The GPU's memory bandwidth is also high enough to keep the ALUs fed. Even though an individual ALU is not very powerful, it supports the usual add, multiply, and divide operations. Like the FPGA, the GPU is built for parallel calculation at the hardware level.

This characteristic of FPGAs and GPUs can be helpful for the SHA-256 algorithm. Replacing parts of the serial calculation in SHA-256 with hardware-supported parallel calculation would be a great improvement.

We will use the GPGPU model for both devices. A GPGPU application contains a host program and a kernel. The host program is a C++ CPU program used to handle inputs and outputs and to call kernel APIs. The kernel is a program running on the device (FPGA or GPU). For the GPU we will use CUDA; for the FPGA we will use OpenCL.

Code Analysis

After a careful examination of the SHA-256 algorithm, we found that the execution of a single SHA-256 hash cannot be parallelized because of the way the SHA transform function is implemented. This section presents the (incorrect) CUDA version of SHA-256. The OpenCL version is not provided, since it would likewise be a parallel algorithm that SHA-256 does not support.

The transform function operates on blocks of fixed size, but each new transformation requires the output of the previous one, so there is no way to run the computations in parallel. As the following code shows, the SHA_UNPACK32 and SHA_PACK32 functions in the header of the C++ version require the value from the previous output. If we use CUDA or an FPGA to do this step in parallel, the str output will not be correct.

In our code, we tried to rearrange the for loops in the transform function so that we could use as few kinds of kernels as possible. We moved the for loop that stores data into each word (which runs before the encryption loop) into kernel 1, as shown below.

Kernel 1

__global__ void kernel1(unsigned char *sub_block_d, unsigned int *w_d,
                        unsigned int *wv, unsigned int *m_h) {
    int j = threadIdx.x;

    // Threads 0-7 copy the current hash state into the working variables.
    if (j < 8) {
        wv[j] = m_h[j];
    }
    // All 16 threads unpack the 64-byte block into the first 16
    // message-schedule words.
    SHA2_PACK32(&sub_block_d[j << 2], &w_d[j]);
}

In kernel 2, we put the for loop that encrypts the words over the 64 rounds. In fact, this part has to run in order.

Kernel 2

__global__ void kernel2(unsigned int *w_d, unsigned int *wv, unsigned int *sha256_k) {
    int j = threadIdx.x;
    unsigned int t11, t12;

    // Threads 16-63 extend the message schedule. Note the dependency on
    // w_d[j - 2] and w_d[j - 7]: those entries are written by other
    // threads of the same launch, which is why this step is not correct
    // when run in parallel.
    if (j >= 16) {
        w_d[j] = SHA256_F4(w_d[j - 2]) + w_d[j - 7] + SHA256_F3(w_d[j - 15]) + w_d[j - 16];
    }

    // One compression round per thread. Every round reads and rotates the
    // same eight working variables, so the rounds must execute in order.
    t11 = wv[7] + SHA256_F2(wv[4]) + SHA2_CH(wv[4], wv[5], wv[6])
        + sha256_k[j] + w_d[j];
    t12 = SHA256_F1(wv[0]) + SHA2_MAJ(wv[0], wv[1], wv[2]);
    wv[7] = wv[6];
    wv[6] = wv[5];
    wv[5] = wv[4];
    wv[4] = wv[3] + t11;
    wv[3] = wv[2];
    wv[2] = wv[1];
    wv[1] = wv[0];
    wv[0] = t11 + t12;
}

The image above shows one iteration of the SHA-256 transformation. The working variables are sent through the different operations and then become the inputs for the next iteration. The four blue components stand for the round computation of the t11 and t12 terms shown in kernel 2.

The third kernel combines the eight working variables back into the hash state for block i.

Kernel 3

__global__ void kernel3(unsigned int *wv, unsigned int *m_h) {
    int j = threadIdx.x;
    // Add the working variables back into the hash state.
    m_h[j] += wv[j];
}

Results

SHA-256 in the C++ version:

sha256('apple'):3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b

SHA-256 in the CUDA version:

sha256('apple'):e9f9b813257194ecf7d6a1f7e1bee8ac6dbd7e80ec13bb0bba8942377b64a6c4

The CUDA digest does not match the C++ result, which confirms that the parallelized transform does not compute a correct SHA-256 hash.

Possible Solution

Since a single SHA-256 hash cannot be accelerated with parallel computing, there is no way to make the algorithm faster for one input string. However, the algorithm can still run in parallel when there are multiple input strings, and the speed-up can be measured that way. Also, for a long string, we could divide it into multiple substrings, run the hash algorithm on them in parallel, and combine the resulting digests (note that the result would no longer be the standard SHA-256 of the whole string). It is uncertain what effect putting all of the hash functions in a single kernel would have, considering that there are already different transformation and update functions. Because it takes time to copy the required data arrays into each kernel, running everything across multiple kernels could actually slow down hashing. This requires further research and implementation.

Check it out on GitHub

Resources

User:Kockmeyer
C++ sha256 function

Thanks for reading :)