tl;dr
I tracked all of my time on this project: analyzing the algorithm itself and a C++ implementation of it, implementing my own version in (non-FPGA) LabVIEW, designing a state machine, and implementing the LabVIEW FPGA version.
It took me (an experienced hobbyist LabVIEW FPGA developer) a total of around 40 hours to implement the SHA3-256 and SHA3-512 Keccak algorithms, depending, of course, on which of those tasks you count.
What is SHA-3/Keccak?
From the wikipedia page: https://en.wikipedia.org/wiki/SHA-3
“SHA-3 is a subset of the broader cryptographic primitive family Keccak”
NIST has defined six instances of the SHA-3 algorithm, two of which I have implemented for this exercise. I can easily extend my implementation to support the other two SHA3 hash widths, but I am not sure about the SHAKE algorithms, mostly because I know nothing about them. Perhaps I will do that after I finish writing this page up. Anyway, here is a list of all six instances and which ones I have implemented:
| Instance | LabVIEW FPGA Implementation Status |
| --- | --- |
| SHA3-224 | no |
| SHA3-256 | yes |
| SHA3-384 | no |
| SHA3-512 | yes |
| SHAKE128 | no |
| SHAKE256 | no |
Why?
I was looking at the Ethereum algorithm implementation, thinking about how one could accelerate it using an FPGA. Since Ethereum uses a memory-hard algorithm for mining, an FPGA would not make sense for improving mining performance, but what about other tasks, such as validating transactions or blocks or whatever it is that Ethereum does?
So I dug through the code and saw that the CPU implementation was using a library called ‘ethash’ for the calculation of the hashing algorithm:
The search algorithm gets called from line 281:
And it calls into the ethash library through this #include:
Which gets added to the package via CMake here:
Ethereum mining source code:
C/C++ Implementation of Ethash source code:
Now if you take a look at the ‘ethash-test’ project, you will see a bunch of tests that validate the source code. What I did was take the test cases from the ‘test_keccak.cpp’ file, which defines the test_text variable, and use them for my LabVIEW implementation. You can place a breakpoint inside the ‘unaligned’ test case and see it in action, including stepping through all the code that gets executed.
Open the properties of the ethash-test project and add the following argument under the Debugging->Command Arguments property to have it run just the unaligned test case: --gtest_filter=keccak.unaligned
Method
What follows are the general steps that I took, along with the time spent on each task. All numbers are based on the time I tracked and my best recollection.
Analyze C++ Implementation
5 hours
I built the C++ implementation (https://github.com/chfast/ethash) as follows:
- From a Windows command prompt:
- git clone git@github.com:chfast/ethash.git
- cd ethash
- mkdir build
- cd build
- cmake .. -G "Visual Studio 16 2019"
- From Visual Studio 2019:
- Open the solution file: ‘build\ethash.sln’
- Click Build->Build Solution
If you get any errors, make sure you have installed and configured git, cmake, and Visual Studio 2019 Community Edition.
I found the test cases in the ethash-test project, inside the file test_keccak.cpp, very useful for understanding the algorithm. I recommend putting a breakpoint inside the ‘unaligned’ test case, roughly on line 235, and then stepping through the calculation of an entire hash. Each test uses the data in the variable ‘test_text’ and calculates the 256-bit and 512-bit versions of SHA-3 over a subset of the test data, starting with a size/length of 0 all the way up to a size/length of 166.
Define Test Cases
3 hours
I took the input data for the tests from the test_keccak.cpp file from the previous step and wrote a Python script to generate exactly the same results using the pysha3 library – https://pypi.org/project/pysha3/. Then I wrote these results, along with the input data, in a binary-compatible form that can be read from a LabVIEW test harness as well as from a Vivado simulation testbench. See the code here; the two generated files are called ‘exp_results.dat’ and ‘test_data.dat’.
See the python script sha.py here: https://github.com/fpganow/keccak/blob/main/python/sha.py
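A minimal sketch of what sha.py does (my reconstruction, not the actual script: the binary layout and the placeholder test data here are assumptions, and it uses the standard-library hashlib, which has included the SHA-3 family since Python 3.6, in place of pysha3):

```python
import hashlib

# Placeholder stand-in for test_text from test_keccak.cpp -- the real
# vectors come from the ethash test suite. 170 bytes covers every length.
TEST_TEXT = b"0123456789" * 17

with open("test_data.dat", "wb") as data_f, open("exp_results.dat", "wb") as exp_f:
    for n in range(167):                       # size/length 0 up to 166
        msg = TEST_TEXT[:n]
        data_f.write(msg)                      # inputs, concatenated back to back
        # one 256-bit and one 512-bit expected digest per test case
        exp_f.write(hashlib.sha3_256(msg).digest())
        exp_f.write(hashlib.sha3_512(msg).digest())
```

The LabVIEW harness and the Vivado testbench then only need to agree on this one layout.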
Create LabVIEW Test Framework
3 hours
Then I used LabVIEW (non-FPGA) to read the two files from the previous step, loop over all of the test cases, and call a regular, non-FPGA LabVIEW VI that calculates the expected hash based on the input data and the specified SHA-3 algorithm.
Implement LabVIEW Host Version
4 hours and 30 minutes
Using the C++ version as a guide, I implemented a crude, as-simple-as-possible LabVIEW version of the algorithm, and used the test framework from the previous step to verify that it worked for each test case.
The LabVIEW host version does not use a state machine; it is just a simple implementation that sits inside a for loop. The code starts in keccak.host.top.vi:
https://github.com/fpganow/keccak/blob/main/labview/host.impl/keccak.host.top.vi
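For reference, the whole algorithm fits in a short piece of ordinary software. This is my own pure-Python sketch of Keccak-256, not the contents of the VI; it uses the legacy 0x01 padding that ethash uses and that the padding notes later on this page describe (NIST's standardized SHA3-256 differs only in using 0x06 as the padding domain byte):

```python
MASK = (1 << 64) - 1

# Round constants for the 24 rounds of Keccak-f[1600]
RC = [0x0000000000000001, 0x0000000000008082, 0x800000000000808A,
      0x8000000080008000, 0x000000000000808B, 0x0000000080000001,
      0x8000000080008081, 0x8000000000008009, 0x000000000000008A,
      0x0000000000000088, 0x0000000080008009, 0x000000008000000A,
      0x000000008000808B, 0x800000000000008B, 0x8000000000008089,
      0x8000000000008003, 0x8000000000008002, 0x8000000000000080,
      0x000000000000800A, 0x800000008000000A, 0x8000000080008081,
      0x8000000000008080, 0x0000000080000001, 0x8000000080008008]

# rho rotation offsets, indexed [x][y]
ROT = [[0, 36, 3, 41, 18], [1, 44, 10, 45, 2], [62, 6, 43, 15, 61],
       [28, 55, 25, 21, 56], [27, 20, 39, 8, 14]]

def _rol64(v, n):
    return ((v << n) | (v >> (64 - n))) & MASK

def keccak_f(S):
    """Keccak-f[1600] permutation. S is 25 64-bit lanes; lane i holds (x=i%5, y=i//5)."""
    for rnd in range(24):
        # theta: XOR each lane with the parity of two neighboring columns
        C = [S[x] ^ S[x + 5] ^ S[x + 10] ^ S[x + 15] ^ S[x + 20] for x in range(5)]
        D = [C[(x - 1) % 5] ^ _rol64(C[(x + 1) % 5], 1) for x in range(5)]
        S = [S[i] ^ D[i % 5] for i in range(25)]
        # rho (rotate) + pi (permute lanes)
        B = [0] * 25
        for x in range(5):
            for y in range(5):
                B[y + 5 * ((2 * x + 3 * y) % 5)] = _rol64(S[x + 5 * y], ROT[x][y])
        # chi: non-linear mix along each row
        S = [B[i] ^ (~B[(i + 1) % 5 + 5 * (i // 5)] & MASK & B[(i + 2) % 5 + 5 * (i // 5)])
             for i in range(25)]
        # iota: break symmetry with the round constant
        S[0] ^= RC[rnd]
    return S

def keccak_256(data: bytes) -> bytes:
    rate = 136  # bytes per block: (1600 - 2*256) / 8; it is 72 for the 512-bit variant
    padded = bytearray(data) + b"\x01"           # pad10*1, legacy Keccak domain byte
    padded += b"\x00" * (-len(padded) % rate)
    padded[-1] |= 0x80                           # final 1 bit of the padding
    S = [0] * 25
    for off in range(0, len(padded), rate):      # absorb one block at a time
        for i in range(rate // 8):
            S[i] ^= int.from_bytes(padded[off + 8 * i: off + 8 * i + 8], "little")
        S = keccak_f(S)
    return b"".join(S[i].to_bytes(8, "little") for i in range(4))  # squeeze 256 bits
```

The C++ implementation unrolls two permutation rounds per loop iteration, which is why the FPGA state machine described later counts keccak.i up to 12 rather than 24.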
Modify Test Framework to Support LabVIEW FPGA
2 hours and 30 minutes
Then I modified the test runner to also load a LabVIEW FPGA VI, send the data in using two Host-to-Target FIFOs, and receive the results via a Target-to-Host FIFO. To do this, I implemented a very bare-bones LabVIEW FPGA VI that reads the parameters in and immediately sends back an empty result of all zeros.
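Modeled in ordinary software, that harness change looks roughly like this (a hypothetical Python model of the three FIFOs and the zero-returning stub, using the FIFO names from the design; the real harness drives NI FPGA Host Interface FIFOs, which work differently):

```python
from queue import Queue

# Software stand-ins for the two Host-to-Target FIFOs and the Target-to-Host FIFO
ht_parms, ht_data, th_results = Queue(), Queue(), Queue()

def stub_fpga():
    """Bare-bones stand-in for the first FPGA VI: read the parameters,
    drain the input words, and send back an all-zero 'hash'."""
    input_size, hash_bits = ht_parms.get()
    for _ in range((input_size + 7) // 8):   # one 64-bit word per 8 input bytes
        ht_data.get()
    for _ in range(hash_bits // 64):         # 4 result words for 256-bit, 8 for 512-bit
        th_results.put(0)

def run_test(data: bytes, hash_bits: int) -> list:
    ht_parms.put((len(data), hash_bits))
    for i in range(0, len(data), 8):         # pack input bytes into little-endian words
        ht_data.put(int.from_bytes(data[i:i + 8].ljust(8, b"\x00"), "little"))
    stub_fpga()  # on hardware this runs concurrently; here the queues simply buffer
    return [th_results.get() for _ in range(hash_bits // 64)]
```

Once this plumbing passed, the stub could be replaced by the real state machine without touching the harness.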
Implement LabVIEW FPGA Version
Design State Machine
5 hours and 30 minutes
For the LabVIEW FPGA version, I decided not to write any code at first and instead spent a lot of time designing an appropriate FPGA state machine. After some thought I came up with the following states:
- Wait.for.Start
- Read.Blocks
- Read.Remainder
- Keccak
- Send.Results
Then I spent a lot of time fleshing out each state, deciding what variable names and types to use, and came up with the following description for each state:
Wait.for.Start
- Wait for valid data on HT_PARMS and set:
- hash.input_size
- hash.bits
- calculate:
- read.block_size = (1600 - hash.bits x 2) / 8
- read.block_words = read.block_size / 8
- reset to zero the counters/iterators:
- read.i
- read.block_i
- keccak.i
- stop reading from HT_PARMS FIFO
- start reading from HT_DATA FIFO only if hash.input_size > 0
- if hash.input_size >= read.block_size
- next_state = Read.Blocks
- else
- next_state = Read.Remainder
Read.Blocks
- if HT_DATA is not valid, skip execution for this clock cycle
- else
- XOR state[read.block_i] with the current Word read from HT_DATA
- increment:
- read.i += 8
- read.block_i += 1
- Now check if we are done reading a block of data which is:
- hash size of 512: block_size is 72
- hash size of 256: block_size is 136
- If read.block_i + 1 == read.block_words, meaning we just read the final word of the block:
- next_state = Keccak
- If hash.input_size - (read.i + 8) >= read.block_size then
- next_next_state = Read.Blocks
- else if hash.input_size - (read.i + 8) == 0 then
- next_next_state = Send.Results
- else
- next_next_state = Read.Remainder
Read.Remainder
- if HT_DATA is not valid and hash.input_size > 0, skip execution for this clock cycle
- else, if any of the following hold:
- HT_DATA is not valid and hash.input_size == 0
- OR (hash.input_size - read.i) == 0
- OR HT_DATA is valid
- then
- XOR state[read.block_i] with the remaining bytes, padded with zeros and with the byte immediately after the data set to 0x01, i.e.
| Remaining Bytes | Value to XOR with |
| --- | --- |
| 0 | 0x01 |
| 1 | 0x01 XX |
| 2 | 0x01 XX XX |
| … | … |
| 7 | 0x01 XX XX XX XX XX XX XX |
- then XOR state[read.block_words - 1] with 0x8000000000000000
- set next state to Keccak
- if HT_DATA is valid AND (hash.input_size - read.i) >= 8 then
- XOR state[read.block_i] with HT_DATA.Data
- read.i += 8
- read.block_i += 1
- state.next = Read.Remainder
Keccak
- modify state using round_constants and the round iterator keccak.i
- if keccak.i + 1 == 12 then
- state.next = state.after_next
- (state.after_next is either Read.Blocks or Send.Results)
- read.block_i = 0
- if state.after_next == Send.Results
- read.i = 0
- Read.HT_DATA = False
- else
- Read.HT_DATA = True
- else
- state.next = Keccak
Send.Results
- if (hash.bits == 256 AND (read.i + 1) == 4) OR (hash.bits == 512 AND (read.i + 1) == 8) then
- read.i = 0
- state.next = Wait.for.Start
- else
- read.i += 1
- state.next = Send.Results
- TH_RESULTS.Data = state[read.i]
- TH_RESULTS.Valid = True
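The next-state decisions above can be sanity-checked in plain software before touching the FPGA. Here is a minimal Python model of two of the transitions (my condensed sketch of the notes, not the shipped VI):

```python
from enum import Enum, auto

class St(Enum):
    WAIT_FOR_START = auto()
    READ_BLOCKS = auto()
    READ_REMAINDER = auto()
    KECCAK = auto()
    SEND_RESULTS = auto()

def after_start(input_size: int, hash_bits: int) -> St:
    """Wait.for.Start: choose the first reading state based on whether
    at least one full block of input is available."""
    block_size = (1600 - hash_bits * 2) // 8   # 136 bytes for SHA3-256, 72 for SHA3-512
    return St.READ_BLOCKS if input_size >= block_size else St.READ_REMAINDER

def after_keccak(keccak_i: int, state_after_next: St) -> St:
    """Keccak: stay for 12 iterations (two unrolled rounds each), then
    continue to whichever state was queued up earlier."""
    return state_after_next if keccak_i + 1 == 12 else St.KECCAK
```

Checking these little functions against the test vectors is far cheaper than debugging the equivalent logic in hardware.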
FPGA Implementation – Data Size Less Than Block Size
11 hours and 45 minutes
The way the algorithm works is that it reads the test data in one block at a time, and when there is less than one block of data remaining, it pads the data with zeros and appends a single 0x01 to the end of the data. So the first thing I did was implement the remainder logic, assuming that the incoming data is less than one block in size.
In this scenario there are three cases:
- 0 bytes remaining (for the test case of size 0)
- 1 to 7 bytes remaining
- 8 to (block size - 1) bytes remaining
This covered the first 71 test cases, and after they all passed I spent some time cleaning up the code.
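For the 1-to-7-byte case, the word to XOR into the state is built exactly as in the padding table in the Read.Remainder notes. A small Python illustration (my sketch; the VI operates on fixed-size integer words rather than Python integers):

```python
def remainder_xor_word(remaining: bytes) -> int:
    """Build the 64-bit word XORed into state[read.block_i] when fewer than
    8 input bytes remain: data bytes sit in the low (little-endian) positions,
    with the 0x01 pad byte immediately after them."""
    assert len(remaining) < 8
    return int.from_bytes(remaining + b"\x01", "little")
```

With 0 bytes remaining this degenerates to a plain 0x01, which is why the size-0 test case needs no special handling; the final XOR of 0x8000000000000000 into the last word of the block supplies the other end of the pad10*1 padding.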
FPGA Implementation – Data Size Greater Than or Equal to Block Size
5 hours and 30 minutes
Code Cleanup
3 hours
Just your regular code cleanup: re-factoring to make things cleaner, neater, and easier to read and understand. I also added some more variables to the state that is kept from one clock cycle to the next.
Totals
Total development time: 43 hours and 45 minutes (the sum of the task times listed above).
Export to Netlist
Now if you create a build specification in LabVIEW FPGA, you will see a new option, ‘Export to Netlist’. This generates a Xilinx Design Checkpoint and a VHDL wrapper for using that IP. These files are placed in a sub-directory of C:\NIFPGA\compilation. Since this tool is still in its infancy, nothing else is provided, so I wrote a Python script that finds the most recently created netlist export and copies the files to the current directory.
Create Vivado Test Bench
Next I created a Vivado project and added the .dcp and .vhd files from the previous step.
Then I wrote a testbench that reads in the exp_results.dat and test_data.dat files, runs through every single test case, and asserts all of the results.
Pass All Test Cases
I had some issues getting the test cases to pass at first, but after… RTFM… I realized that I should be placing the IP exported from LabVIEW inside of a Single-Cycle Timed Loop. LabVIEW FPGA does in fact support exporting IP that just ‘runs once’.