Coding Standards Matter…

I have wired up the components of my 10 Gigabit FPGA Accelerated Network card with great care, and I decided to have my “tester” application skip the lwIP stack and pass the received packet directly to the host for testing/verification purposes.

Everything was checking out: the LabVIEW code looked flawless, and the interface to the 10 Gigabit Transceiver was perfect. But for some reason I was not receiving the packets on the host.

I analyzed the code, inserted probes, and what not. Finally, while reading through the actual C++ code (MicroBlaze C++, that is), I found the bug.

A very simple bug hidden in plain sight!


// Now echo the data back out
if (XLlFifo_iTxVacancy(fifo_1)) {
    XGpio_DiscreteWrite(&gpio_2, 2, 0xF001);
    for (i = 0; i < recv_len_bytes; i++) {
        XGpio_DiscreteWrite(&gpio_2, 2, buffer[i]);

        XLlFifo_Write(fifo_1, buffer, recv_len_bytes);
    }

    XLlFifo_iTxSetLen(fifo_1, recv_len_bytes);
}


Do you see the error?  Well, neither did I, until I read the documentation for XLlFifo_Write again, for the umpteenth time… I was writing the packet’s data once per byte of the packet – (length of packet) squared bytes in total! Why? Because the single call to XLlFifo_Write writes the entire packet on each iteration of the loop.
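
For reference, the fix is simply to hoist the write out of the per-byte loop so that the packet is written to the FIFO exactly once (same variables as above; the per-byte GPIO debug writes are dropped here):

// Corrected version: XLlFifo_Write copies recv_len_bytes from buffer in a
// single call, so it must not sit inside a per-byte loop.
if (XLlFifo_iTxVacancy(fifo_1)) {
    XGpio_DiscreteWrite(&gpio_2, 2, 0xF001);

    XLlFifo_Write(fifo_1, buffer, recv_len_bytes);

    XLlFifo_iTxSetLen(fifo_1, recv_len_bytes);
}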

Anyway, I am now re-synthesizing my code, and we will see what happens when I run it in around 2 hours’ time.

Also, I added the TKEEP signal to my AXI Stream FIFO, and it worked exactly as expected, meaning that:

  • If I send 12 bytes from the LabVIEW FPGA FIFO into the MicroBlaze, it detects 12 bytes.
  • If I send 13 bytes, with the TKEEP signal being 0b0001 for the last word only and 0xF for the rest, I get 13 bytes in the MicroBlaze code.
  • If I send 14 bytes… and so on and so forth – the MicroBlaze recognizes exactly that many bytes.

However, everything was aligned to 32-bit words.
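
For reference, the relationship between a packet’s byte length and the TKEEP value of its final word on a 32-bit (four-byte-lane) stream boils down to the following C sketch (the helper name is made up, and LSB-aligned byte lanes are assumed):

#include <stdint.h>

// Last-word TKEEP for a 32-bit (4-byte-lane) AXI4-Stream, assuming the valid
// bytes are packed into the least significant lanes.
static uint8_t last_word_tkeep_32(uint32_t len_bytes)
{
    uint32_t rem = len_bytes % 4;       /* bytes valid in the final word */
    if (rem == 0)
        return 0xF;                     /* full word: all four lanes valid */
    return (uint8_t)((1u << rem) - 1);  /* e.g. 13 bytes -> 0b0001 */
}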

Maybe I will work on cleaning up and pushing some of my code to github while I wait…

AXI4 + MicroBlaze != 64-bit

The 10 Gigabit MAC/transceiver gives me 64-bit data words.  I thought I was sending and receiving 64-bit data words, but I am really only using 32 bits.  I came to this conclusion after I tried reading a 64-bit word and saw that the data was simply the same 32-bit word repeated twice.  Additionally, some random person on the internet said that the MicroBlaze data bus is 32 bits wide and you have to use some sort of data-width converter IP.

Out of luck… I don’t know how to use the converter, but I am sure there is a way to do this conversion properly using LabVIEW FPGA.  So for starters, this means I can remove my AXI4-Stream Data FIFOs and keep the two 32-bit versions.  I’ll also throw in support for TKEEP while I am at it.

So the “Receive Ethernet Frame” code from the 10 Gigabit transceiver/MAC looks like this:

I have to convert this 64-bit data stream into a 32-bit data stream before I send it into the MicroBlaze.  Here is the current/erroneous implementation:

So what do I have to do? I have to read one element from the LabVIEW FIFO – the FIFO on the left – write the upper 32 bits of the 64-bit word in one cycle, then skip the LabVIEW FIFO read on the next clock cycle and write the lower 32 bits.  Want to see the power of LabVIEW? It is 7:22 AM right now… [elevator music/Jeopardy music starts playing in the background]

Now it is 8:07 AM and I have finished refactoring this loop.  I write the upper half of each 64-bit word in one cycle, and the lower half during the next clock cycle.  I am also keeping the logic that appends an extra word containing the “EndOfGoodFrame” and “EndOfBadFrame” boolean values.  Since I am writing 32-bit words now, I am only appending one word.
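
In C terms, the data path of the refactored loop boils down to the sketch below (illustrative only; in the VI this is one 64-bit FIFO read spread over two clock cycles, and the function name is made up):

#include <stddef.h>
#include <stdint.h>

// Split each 64-bit word from the MAC into two 32-bit words, upper half first.
static void split_64_to_32(const uint64_t *in, size_t n_words, uint32_t *out)
{
    for (size_t i = 0; i < n_words; i++) {
        out[2 * i]     = (uint32_t)(in[i] >> 32);          /* cycle 0: upper half */
        out[2 * i + 1] = (uint32_t)(in[i] & 0xFFFFFFFFu);  /* cycle 1: lower half */
    }
}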

Here is the full loop:

And a close-up of Case 0 of the innermost Case Structure:

And a close-up of Case 1:

I now have to do this for the other direction – convert a LabVIEW FIFO packet to an AXI 32-bit stream. Here is the current implementation:

The signal on AXI_STR_TXD_data is a U32, and I have to collect two of these values and insert them into the FIFO on the right side.  I am going to have to think about this for a bit, but I have to get ready and go to work, so I may not finish this before leaving.
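
In the same C terms, the transmit direction needs the inverse packing: collect two consecutive U32 values and combine them into one 64-bit word (again an illustrative sketch with a made-up name, assuming an even word count and the same upper-half-first ordering as the receive path):

#include <stddef.h>
#include <stdint.h>

// Pack pairs of 32-bit words from the MicroBlaze into 64-bit words for the MAC.
static void pack_32_to_64(const uint32_t *in, size_t n_words, uint64_t *out)
{
    for (size_t i = 0; i + 1 < n_words; i += 2) {
        out[i / 2] = ((uint64_t)in[i] << 32) | in[i + 1];  /* first word = upper half */
    }
}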

Thanks and have a nice day!

Update: Okay, this is not that pretty, but here is my first-cut “20 minutes” version:

Now I have to go and get ready! But I’ll be sure to set everything to synthesize before I leave…

IP Integration Node vs CLIP

I wired up the 10 Gigabit Ethernet MAC to my MicroBlaze instance, and the MicroBlaze to my host computer, and compiled/synthesized everything.  I then turned on my “quiet” PXIe-1062Q, fired up my tester application, and it did not work…  I opened up an isolated tester – “Fpga-Mac-Top.vi” – and it worked.  I opened up the isolated MicroBlaze tester – “Fpga-MicroBlaze-Top.vi” – and nothing.  Not even a read from the GPIO.

This is quite strange… why is it not working? I spent some time looking over everything, re-generating output products, synthesizing from Vivado, bringing the design back into LabVIEW, and, long story short, I was not setting the MicroBlaze Reset to ACTIVE_LOW, whereas in all of my previous designs I had set it to ACTIVE_HIGH.  Anyway, while I wait for it to compile, I have something to say.  Which do you prefer: using an IP Integration Node or a CLIP (Component Level IP) to use a MicroBlaze processor from LabVIEW?

Well, first off, let me link to some National Instruments documentation on both:

And now let me show you some screenshots.  Here is a close-up of what using an IP Integration Node looks like (right-click to open in a new window for a larger version until I figure out how to modify this WordPress theme to be wider):

Here is a zoomed out version of this same VI:

And finally, what it looks like without an IP Integration Node, but with a CLIP (Component-Level IP):

Can you see the difference? I can… For starters, I can read the full name of each signal when using a CLIP.  Additionally, with a CLIP I can split up my nodes into separate locations, so that I can organize my VI in a much cleaner way.  And finally, since I can read the full signal name when using a CLIP node, I no longer have to hover over each signal to get its name, which removes any reason for having the comments seen in the IP Integration Node version.

Anyway, the CLIP node is my recommended method for importing Xilinx Vivado IP into LabVIEW FPGA.

Also, this code was from a project that I implemented in order to learn how to use the AXI Stream FIFO inside of LabVIEW via a MicroBlaze.  In other words, how to communicate with a MicroBlaze processor via an AXI Stream FIFO from LabVIEW FPGA.

See the source code here:

https://github.com/JohnStratoudakis/LabVIEW_Fpga/tree/master/06_MicroBlaze/03_MicroBlaze_AXI_Stream

Pros and Cons of LabVIEW FPGA

Ever since I started developing this LabVIEW FPGA project that uses a MicroBlaze soft processor to process TCP streams, I have learned a lot and can comment on the pros and cons of using LabVIEW FPGA vs using a traditional Xilinx/Altera based FPGA development approach.

For starters, LabVIEW FPGA blows every single other FPGA development system out of the water when it comes to developing prototypes.  I made a prototype for implementing a Monero miner in record time.  I don’t remember how long it took, but you can see my commit history here: https://github.com/JohnStratoudakis/CryptoCurrencies

Then I was able to implement a UDP based orderbook proof of concept, again in record time, see my commit history here: https://github.com/JohnStratoudakis/LabVIEW_Fpga/tree/master/MarketData/MarketData_02/Fpga

Then I decided that I wanted to make my orderbook support TCP/IP, which is what most market data feeds use, so I embarked on learning how to make LabVIEW FPGA play well with Xilinx Vivado.  I did not realize it at the time, but the knowledge I have gained over the past year is enough that one no longer has to live with any of the cons that LabVIEW FPGA comes with.

  • I have learned how to integrate basic VHDL/Verilog IP into a LabVIEW FPGA project.
  • I have learned how to integrate more complex Xilinx IP such as Adder/Subtractors, Fast Fourier Transforms, and AXI Stream FIFOs.
  • I have learned how to integrate an entire soft-core processor based system as well, including both the simplified MicroBlaze MCS and the more complex MicroBlaze processors developed by Xilinx.
  • Furthermore, I have been able to communicate between LabVIEW FPGA and the MicroBlaze processor via AXI Stream FIFOs and General Purpose Input/Output registers, and I have implemented interrupt handlers.

Using all of this together, I can very efficiently develop the perfect prototype that combines existing Xilinx IP, IP from opencores.org, or proprietary IP with a MicroBlaze soft-core processor, all from within LabVIEW FPGA.  This is a great risk-mitigating factor, in that one can tell whether an FPGA will be a viable solution for a particular type of problem.  Then one can choose to keep the LabVIEW FPGA implementation and scale it out, or rewrite the portions written in LabVIEW in another language such as Verilog or VHDL.

Usually, the first product that works is what makes it to market and is successful – not because it is the best, but because it is the most adaptable to change. Think evolution… think VHS, think DVDs, think of the iPod.  These products were market leaders because they got the job done right away, not later when all of the features were fully implemented.  Additionally, these products were easy to use.

Anyway, I have fully wired up the 10 Gigabit Transceiver into my MicroBlaze, and have wired the MicroBlaze to my host application, and I am anxiously waiting for the FPGA synthesis to complete so I can test it out…

10 Gigabit FPGA-based Network Card

So here is the simplest FPGA-based Network Interface Card that I know of.

This application will start Port 0 of the 10 Gigabit Network interface that is provided by the PXIe-6592R (http://www.ni.com/en-us/support/model.pxie-6592.html) board by National Instruments, and will allow you to do any of the following:

  • Check if any new ethernet frames have been received, and display the information, including the raw bytes of any such received frame
  • Send a raw ethernet frame out of Port 0

I have included the necessary code to parse and generate the following types of packets, enabling you to communicate with another computer on your network that supports:

  • Ethernet II
  • ARP
  • ICMP
  • IPv4
  • UDP

The VIs to do this are located in the directory “Tests/MAC/Protocols”; simply wire the incoming frame data into the “Parse” VIs, or write the parameters into the “Create” VIs.

How to Parse Incoming Ethernet Frames

For an example of how to parse an incoming frame, see the “Poll RX” case inside the bottom While Loop of the “MAC-Tester” VI:
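
For readers without LabVIEW handy, the Ethernet II header fields that the parse logic extracts from the raw frame bytes look like this in plain C (an illustrative sketch, not the VI itself):

#include <stdint.h>
#include <string.h>

// Ethernet II header layout, as it appears at the start of the frame data.
typedef struct {
    uint8_t  dst_mac[6];
    uint8_t  src_mac[6];
    uint16_t ethertype;   /* 0x0800 = IPv4, 0x0806 = ARP */
} eth2_header_t;

static eth2_header_t parse_eth2(const uint8_t *frame)
{
    eth2_header_t h;
    memcpy(h.dst_mac, frame, 6);
    memcpy(h.src_mac, frame + 6, 6);
    h.ethertype = (uint16_t)((frame[12] << 8) | frame[13]);  /* network byte order */
    return h;
}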

How to Create Ethernet Frames

For an example of how to create a valid outgoing Ethernet frame with a valid CRC32 on the end, see the “Transmit Packet” case inside the bottom While Loop of the “MAC-Tester” VI:

This VI calls “UDP-Create.vi” and wires the size – in bytes – and the frame data, in 64-bit words, to the transmit FIFO.
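
If you want to double-check the FCS outside of LabVIEW, the CRC-32 that goes on the end of an Ethernet frame can be computed with a plain C reference sketch like this (bitwise, not the FPGA implementation):

#include <stddef.h>
#include <stdint.h>

// Bitwise CRC-32 (IEEE 802.3) over the frame bytes, excluding the FCS itself.
static uint32_t eth_crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;  /* appended to the frame least-significant byte first */
}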

Full Source Code

See the source code on GitHub here:

https://github.com/JohnStratoudakis/LabVIEW_Fpga/tree/master/07_10_Gigabit_CLIP

See the README.md for more documentation.

Next?

Now I have to take this code and wire it up to my MicroBlaze implementation, which also sits inside the FPGA project.  The only problem right now is that I have only figured out how to configure a 32-bit FIFO, not a 64-bit FIFO.  So I can either do some sort of translation inside the FPGA, or hope to get lucky by configuring the FIFO to be 64 bits wide.  Note: by FIFO, I am referring to an AXI-Stream FIFO.

Screen Shot Generator for LabVIEW

I finished writing an application that exercises the first port of the 10 Gigabit Ethernet interface provided with the National Instruments PXIe-6592R board, and as I started taking manual screenshots via the LabVIEW “File->Print” option I began to ponder: can this be done more easily? Or dare I say it, “programmatically”?

The LabVIEW Report Generation Palette has a VI named “Easy Print VI Panel and Documentation”.  On top of its plethora of options, this VI is hard to use and proved to be unstable for my purposes.  If you want to try it in your application, see the documentation here:

http://zone.ni.com/reference/en-XX/help/371361H-01/lvreport/easy_print_panel_doc/

I ended up finding a way to manually save a PNG file of the Front Panel and the Block Diagram of a VI.  I then wrote a program that recursively generates both a front panel and a block diagram screenshot for each VI it encounters.  This makes it easy for me to quickly create and update the VI images, so that you can view the source code directly on GitHub without having to wait until you get home and open the code in LabVIEW.

See the github project here:

https://github.com/JohnStratoudakis/ScreenShotGen

Here is a screenshot of the top-level VI of the application:

10 Gigabit FPGA-based Network Code Coming Soon

I am getting really close to finishing my proof-of-concept FPGA-based network card, which is based on the National Instruments PXIe-6592 board that uses the Xilinx Kintex-7 410T FPGA and has 2 GB of DDR3 RAM.

Using the Arty Artix-7 board, I was able to make sure that the MicroBlaze code running the lwIP TCP/IP stack works fine, and I was able to use an NI example to build the 10 Gigabit Ethernet MAC part.  The only issue is that the NI code is quite complex and uses features and ideas that I have never seen before.

Nevertheless, I am iterating over some modifications to the example to allow for a LabVIEW host network stack that uses the FPGA only for sending and receiving Ethernet frames.  Once I get that working, I will just switch the connection from the LabVIEW host to the on-board MicroBlaze.

How to Multiply 64-bit Numbers in LabVIEW

What is the product of 0x9D0BF6FDAC70AB52 and 0x6408F6540A1384CB?  Well, according to LabVIEW for Windows, the answer is 0x2D90DE07C0C42206.  According to C++ on OSX (without any optimizations or use of Intel intrinsic functions), the answer is also 0x2D90DE07C0C42206.

The real answer is…  0x3D5E2BF7DCBCA6622D90DE07C0C42206.

How do you get this number? You have to use compiler intrinsics, or calculate the value yourself.  LabVIEW does not make it easy to call an Intel compiler intrinsic, so I took it upon myself to implement it.  Here is a screenshot of the implementation in LabVIEW for Windows:

To download and use this code in your project, see:

https://github.com/JohnStratoudakis/CryptoCurrencies/blob/master/Monero/lv-monero/CryptoNight-Step-3/Host-Implementation/Step-3-Multiply-U64.vi
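
For comparison, here is a plain C sketch of the same full-width multiply built from 32-bit partial products, which is essentially what the LabVIEW VI does (the function and variable names are mine, not the VI’s; compilers with a 128-bit integer type could instead just cast and multiply):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// Full 64x64 -> 128-bit multiply using four 32x32 partial products.
static void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* low  x low  */
    uint64_t p1 = a_lo * b_hi;   /* low  x high */
    uint64_t p2 = a_hi * b_lo;   /* high x low  */
    uint64_t p3 = a_hi * b_hi;   /* high x high */

    /* carry out of the low 64 bits when the cross terms are added in */
    uint64_t carry = ((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32;

    *lo = p0 + (p1 << 32) + (p2 << 32);
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + carry;
}

int main(void)
{
    uint64_t hi, lo;
    mul64x64(0x9D0BF6FDAC70AB52ULL, 0x6408F6540A1384CBULL, &hi, &lo);
    printf("0x%016" PRIX64 "%016" PRIX64 "\n", hi, lo);  /* 0x3D5E2BF7DCBCA6622D90DE07C0C42206 */
    return 0;
}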

Note: an FPGA version is coming soon, but I am busy working on something else right now.

Some Time with the Arty Artix-7 35T Digilent Board

So I wanted to implement a simple, stripped-down version of the open-source lightweight IP stack “lwIP” (https://savannah.nongnu.org/projects/lwip/) inside my LabVIEW FPGA project, so that I can handle TCP and UDP data streams.

I do not have a lot of experience with this, and I found that building such a project inside Vivado would take around 3 hours to simulate with all of the lwIP source code embedded in the ELF file.

I ended up purchasing a $99 board from Digilent that uses an Artix-7 35T FPGA: https://www.xilinx.com/products/boards-and-kits/arty.html.

On this board I was able to run and debug the lwIP source code so that I could figure out how to use it with my configuration.  I created a public GitHub repository with this source code, so if you happen to be trying to learn how to use the MicroBlaze processor with this board, check out:

https://github.com/JohnStratoudakis/artix7-35t

Enjoy, and I will now be working on integrating this lwIP source code into my LabVIEW FPGA project.

A Diversion for CryptoCurrencies

I spent some time analyzing the Monero cryptocurrency source code to understand the algorithm, how it works, and to see if it is doable with an FPGA via LabVIEW FPGA, our secret weapon.

I learned that there are 4 steps to the Monero “CryptoNight” algorithm and that step 3 is the part that does the heavy lifting, with around 500k reads and writes to a small section of memory that is 2 megabytes in size.  This section of memory was specifically selected to be a size that coincides with the size of most processors’ Level 3 caches.  This is supposed to be what makes the algorithm “memory-hard”.

Locks are meant to be broken, codes cracked… and secrets revealed.

I am thinking – what if I put step 3 inside an FPGA and have it use Block RAM?

  • Block RAM is limited on an FPGA, so this may not be worthwhile

Okay, what about DRAM?

  • My FPGA may have DDR3 RAM, but other FPGAs have faster RAM.  If my implementation works well on DDR3 RAM, then I can move it to another FPGA with faster RAM.
  • Will an FPGA using DRAM be faster than a CPU using its L3 cache – taking into account, of course, that the FPGA is the only user of this DRAM controller? What about an FPGA with multiple DRAM controllers?
Well, I know that DRAM is “slow” when compared to other types of memory, but the difference here is that the FPGA is the only user of the DRAM controller.  On any operating system there are many users – programs, processes, kernel threads.  So would doing this from an FPGA make the cut?  Would it make that much of a difference?

Well, there is only one way to find out: try it out!

I have created a GitHub repository with my work so far here:

I went into the Monero C++ source code (https://github.com/monero-project/monero/blob/master/src/crypto/slow-hash.c#L581) and saved the following variables to a binary file just before the loop with 500k iterations starts (as of this date, lines 591 and 600):

  • uint64_t a[2]
  • uint64_t b[2]
  • uint8_t *hp_state (this is the 2-megabyte scratch pad)
  • uint8_t *hp_state_out (the same scratch pad after CryptoNight Step 3 has run)

I implemented a sandboxed C++ version of CryptoNight Step 3 as an isolated program that runs with the same values each time.
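
The harness boils down to something like the following sketch (the file names and the cryptonight_step3() call are placeholders; the real step-3 loop lives in the repository):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SCRATCH_BYTES (2u * 1024u * 1024u)  /* 2 MB scratch pad */

// Read exactly len bytes from path into dst, or bail out.
static void load(const char *path, void *dst, size_t len)
{
    FILE *f = fopen(path, "rb");
    if (!f || fread(dst, 1, len, f) != len) {
        fprintf(stderr, "could not read %s\n", path);
        exit(1);
    }
    fclose(f);
}

int main(void)
{
    uint64_t a[2], b[2];
    uint8_t *scratch  = malloc(SCRATCH_BYTES);
    uint8_t *expected = malloc(SCRATCH_BYTES);

    /* Values captured from slow-hash.c just before the 500k-iteration loop */
    load("a.bin", a, sizeof a);
    load("b.bin", b, sizeof b);
    load("hp_state.bin", scratch, SCRATCH_BYTES);
    load("hp_state_out.bin", expected, SCRATCH_BYTES);

    /* cryptonight_step3(a, b, scratch);  -- placeholder for the ported loop */

    printf("%s\n", memcmp(scratch, expected, SCRATCH_BYTES) == 0
                       ? "scratch pad matches the reference output"
                       : "scratch pad differs from the reference output");
    free(scratch);
    free(expected);
    return 0;
}
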
This C++ program works on OSX and Windows (and probably Linux); it uses Gradle as its build tool, and you can see the source code here:

I then implemented the same algorithm, based on the same source file, in LabVIEW for Windows.  The values match, so we have a working C++ version and a working LabVIEW for Windows version, and now we can determine whether an FPGA version will be worth it.

Please note that the LabVIEW version is not optimized, and I am not a LabVIEW for Windows developer, which is probably why it runs so slowly… for now.  Yes, it takes over an hour to create one hash.  However, I have consulted with some LabVIEW experts, and they have told me what I should do to make it faster.  I will start working on that, and in the meantime you can take a look at the ever-changing source code to see what the algorithm involves.  Remember, LabVIEW code is very easy to understand, so this may serve as the “flow-chart” explanation of what a cryptocurrency miner looks like.

See the LabVIEW code here:
(Requires LabVIEW 2017 to view…) I will add some PNG versions of the code soon, but first I want to do some cleaning…