I spent some time analyzing the Monero cryptocurrency source code to understand the algorithm and how it works, and to see whether it can be implemented on an FPGA using LabVIEW for FPGA, our secret weapon.
I learned that the Monero “CryptoNight” algorithm has 4 steps and that step 3 does the heavy lifting: a loop of around 500k iterations that reads and writes a small section of memory, 2 megabytes in size. This size was specifically chosen to fit within the Level 3 cache of most processors, which is supposed to be what makes the algorithm “memory-hard”.
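To make the access pattern concrete, here is a minimal sketch of what a step 3 style loop looks like. The constants and the `state_index` mapping follow my reading of `slow-hash.c`, but `toy_mix` is a made-up stand-in and not the AES round or the 64x64-bit multiply the real code uses, so this will not produce real CryptoNight values. All names here (`toy_step3`, `toy_mix`, the `k...` constants) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Sizes as I understand them from slow-hash.c (treat as assumptions).
constexpr std::size_t kMemoryBytes = 1u << 21;                     // 2 MiB scratchpad
constexpr std::size_t kBlockBytes  = 16;                           // one AES-sized block
constexpr std::size_t kTotalBlocks = kMemoryBytes / kBlockBytes;   // 131072 blocks
constexpr std::size_t kIterations  = (1u << 20) / 2;               // 524288 iterations

// Map a 64-bit value to a 16-byte-aligned byte offset inside the scratchpad.
inline std::size_t state_index(std::uint64_t x) {
    return static_cast<std::size_t>(((x >> 4) & (kTotalBlocks - 1)) << 4);
}

// Stand-in mixing function. NOT the real AES round or multiply.
inline std::uint64_t toy_mix(std::uint64_t v, std::uint64_t k) {
    v ^= k;
    v *= 0x9E3779B97F4A7C15ull;
    return v ^ (v >> 29);
}

// Toy loop: every address depends on data read in the previous access,
// so the memory accesses cannot be reordered or prefetched.
std::uint64_t toy_step3(std::vector<std::uint8_t>& pad,
                        std::uint64_t a, std::uint64_t b,
                        std::size_t iterations = kIterations) {
    assert(pad.size() == kMemoryBytes);
    for (std::size_t i = 0; i < iterations; ++i) {
        std::uint64_t c;
        const std::size_t j = state_index(a);
        std::memcpy(&c, &pad[j], sizeof c);   // first read, address from a
        c = toy_mix(c, a);
        const std::uint64_t w = b ^ c;
        std::memcpy(&pad[j], &w, sizeof w);   // first write, same address

        std::uint64_t d;
        const std::size_t k = state_index(c);
        std::memcpy(&d, &pad[k], sizeof d);   // second read, address from c
        a += toy_mix(d, c);
        std::memcpy(&pad[k], &a, sizeof a);   // second write, same address
        a ^= d;
        b = c;
    }
    return a;
}
```

The point of the sketch is the dependency chain: each iteration cannot compute its next address until the previous read has returned, which is why memory latency matters so much more than bandwidth here.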
Locks are meant to be broken, codes cracked… and secrets revealed.
I am thinking: what if I put step 3 inside an FPGA and have it use Block RAM?
- Block RAM is limited on an FPGA, so this may not be worthwhile
Okay, what about DRAM?
- My FPGA may have DDR3 RAM, but other FPGAs have faster RAM. If my implementation works well on DDR3 RAM, then I can move it to another FPGA with faster RAM.
- Will an FPGA using DRAM be faster than a CPU using L3 cache, taking into account that the FPGA is the only user of its DRAM controller? What about an FPGA with multiple DRAM controllers?
Well, I know that DRAM is “slow” when compared to other types of memory, but the difference here is that the FPGA is the only user of the DRAM controller. On any operating system, there are many users, i.e. programs, processes, kernel threads. So would doing this from an FPGA make the cut? Would it make that much of a difference?
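One way to frame the question is a back-of-the-envelope bound. If step 3 is dominated by dependent memory accesses, the time per hash is roughly iterations × dependent accesses × latency per access. The sketch below assumes 524,288 iterations and two dependent reads per iteration (my reading of the loop), and it takes the latency as a parameter because I do not have measured numbers for either the CPU cache or the FPGA DRAM path. It also ignores compute time and any overlapping of several independent hashes. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed loop shape (see lead-in): not measured, just counted from the code.
constexpr std::uint64_t kStep3Iterations       = 524288;
constexpr std::uint64_t kDependentReadsPerIter = 2;

// Lower bound on seconds spent in step 3 if every dependent access
// costs `latency_ns` nanoseconds and nothing else takes any time.
constexpr double step3_seconds(double latency_ns) {
    return static_cast<double>(kStep3Iterations * kDependentReadsPerIter)
           * latency_ns * 1e-9;
}

// Upper bound on hashes per second for a single, non-overlapped hash pipeline.
constexpr double hashes_per_second(double latency_ns) {
    assert(latency_ns > 0.0);
    return 1.0 / step3_seconds(latency_ns);
}
```

For example, plugging in a hypothetical 10 ns per access gives about 10.5 ms per hash, or roughly 95 hashes per second. The real figure depends entirely on the latency the hardware actually achieves, which is exactly what the experiment is meant to find out.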
Well, there is only one way to find out. Try it out!
I have created a GitHub repository with my work so far here:
I went into the Monero C source code (https://github.com/monero-project/monero/blob/master/src/crypto/slow-hash.c#L581) and saved the following variables to a binary file just before the loop with 500k iterations starts (as of this date, lines 591 and 600):
- uint64_t a[2]
- uint64_t b[2]
- uint8_t *hp_state (<= this is the scratch pad of 2 megabytes of data)
- uint8_t *hp_state_out (same scratch pad after CryptoNight Step 3 has run)
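For readers who want to capture the same inputs, here is a minimal sketch of dumping and reloading such a snapshot. The file layout (raw `a`, then raw `b`, then the 2 MB scratchpad, all in native byte order) and the names `Step3Snapshot`, `save_snapshot`, and `load_snapshot` are my own assumptions for illustration and are not necessarily the layout used in the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

constexpr std::size_t kScratchpadBytes = 1u << 21;  // 2 MiB scratchpad

// Hypothetical container for the values captured before the loop.
struct Step3Snapshot {
    std::uint64_t a[2];
    std::uint64_t b[2];
    std::vector<std::uint8_t> scratchpad;  // contents of hp_state
};

// Write a, b and the scratchpad back to back as raw bytes.
bool save_snapshot(const std::string& path, const Step3Snapshot& s) {
    assert(s.scratchpad.size() == kScratchpadBytes);
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(s.a), sizeof s.a);
    out.write(reinterpret_cast<const char*>(s.b), sizeof s.b);
    out.write(reinterpret_cast<const char*>(s.scratchpad.data()),
              static_cast<std::streamsize>(s.scratchpad.size()));
    out.flush();
    return static_cast<bool>(out);
}

// Read the same layout back; fails if the file is too short.
bool load_snapshot(const std::string& path, Step3Snapshot& s) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    in.read(reinterpret_cast<char*>(s.a), sizeof s.a);
    in.read(reinterpret_cast<char*>(s.b), sizeof s.b);
    s.scratchpad.resize(kScratchpadBytes);
    in.read(reinterpret_cast<char*>(s.scratchpad.data()),
            static_cast<std::streamsize>(s.scratchpad.size()));
    return static_cast<bool>(in);
}
```

The scratchpad saved after the loop (`hp_state_out`) can be stored the same way, and a standalone implementation can then be checked by comparing its final scratchpad byte for byte against that file.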
I implemented a sandboxed C++ version of this code that does CryptoNight Step 3 in an isolated program that runs with the same values each time.
This C++ program works on OSX and Windows (and probably Linux). It uses Gradle as its build tool, and you can see the source code here:
I then implemented the same algorithm, based on the same source file, in LabVIEW for Windows. The values match, so we have a working C++ version and a working LabVIEW for Windows version, and now we can determine whether an FPGA version will be worth it.
Please note that the LabVIEW version is not optimized code, and I am not a LabVIEW for Windows developer, which is probably why it runs so slowly… for now. Yes, it takes over an hour to create one hash. However, I have consulted with some LabVIEW experts, and they have told me what I should do to make it faster. I will start working on that, and in the meantime, you can take a look at the ever-changing source code to see what the algorithm involves. Remember, LabVIEW code is very easy to understand, so this may be the “flow-chart” explanation of what a cryptocurrency miner looks like.
See the LabVIEW code here:
(Requires LabVIEW 2017 to view…) I will add some png versions of the code soon, but first I want to do some cleaning…