Revisiting My NVDLA Adaptation Project at NovuMind

Introduction to NVDLA

The NVIDIA Deep Learning Accelerator (NVDLA) is an open-source hardware project designed to accelerate deep learning inference tasks. Its flexible, scalable architecture enables developers to implement AI acceleration on various platforms, from FPGAs to ASICs. By offering a complete stack of hardware and software, NVDLA has become a cornerstone for many projects in the AI hardware space, making it an invaluable tool for prototyping and development.

Project Motivation

At NovuMind, a company focused on developing innovative convolutional chips for AI applications, we sought to benchmark our hardware against existing solutions like NVDLA. However, NovuMind’s architecture differed significantly: it lacked a traditional CPU or microcontroller, relying instead on Kernel Mode Driver (KMD) control for direct register management. To ensure a fair comparison, we adapted NVDLA to operate under similar conditions, eliminating its CPU and microcontroller to align with our unique architecture. This effort was crucial for demonstrating our chip’s performance in a controlled, apples-to-apples manner.

Technical Approach

The project’s technical focus was on modifying the NVDLA architecture to remove the embedded CPU and microcontroller and shift all register control responsibilities to the KMD. Here’s how we achieved this:

  1. Removing Embedded Controllers: We stripped the CPU and microcontroller from the NVDLA’s design, leaving a bare-bones configuration that required external control.
  2. Building Layer-Specific Register Controllers: We implemented custom KMD logic to handle direct register read/write operations for each neural network layer, such as convolution and pooling. This ensured precise management of data flow and operations.
  3. Integrating with Existing Tools: We adapted the compiler provided by NVDLA to generate configurations compatible with our KMD-driven approach. This involved modifying the software stack to streamline the translation of high-level neural network models into hardware instructions.
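To make step 2 a bit more concrete, here is a minimal sketch of what "layer-specific register control" can look like. It is written in Python only for readability (a real KMD would be C), and every name in it is hypothetical: `RegisterBus`, the register offsets, and the `program_*` helpers are made up for illustration and are not NVDLA's real register map or our actual driver code.

```python
from dataclasses import dataclass, field


@dataclass
class RegisterBus:
    """Stand-in for memory-mapped register access; it records writes instead of touching hardware."""
    log: list = field(default_factory=list)

    def write(self, offset: int, value: int) -> None:
        self.log.append((offset, value))


# Hypothetical register offsets, for illustration only (not NVDLA's real map).
CONV_KERNEL_SIZE = 0x100
CONV_STRIDE = 0x104
CONV_ENABLE = 0x108
POOL_WINDOW = 0x200
POOL_ENABLE = 0x204


def program_conv(bus: RegisterBus, kernel: int, stride: int) -> None:
    """Write the convolution parameters, then set the enable bit last."""
    bus.write(CONV_KERNEL_SIZE, kernel)
    bus.write(CONV_STRIDE, stride)
    bus.write(CONV_ENABLE, 1)


def program_pool(bus: RegisterBus, window: int) -> None:
    """Write the pooling parameters, then set the enable bit last."""
    bus.write(POOL_WINDOW, window)
    bus.write(POOL_ENABLE, 1)


# One handler per layer type keeps each register sequence isolated and testable.
LAYER_HANDLERS = {"conv": program_conv, "pool": program_pool}


def run_network(bus: RegisterBus, layers) -> None:
    """Walk the layer list (as a compiler might emit it) and program each layer in order."""
    for kind, params in layers:
        LAYER_HANDLERS[kind](bus, **params)


bus = RegisterBus()
run_network(bus, [("conv", {"kernel": 3, "stride": 1}), ("pool", {"window": 2})])
for offset, value in bus.log:
    print(f"write 0x{offset:03x} <- {value}")
```

The point of the dispatch table is modularity: each layer type owns its own register sequence, so a mistake in one handler does not leak into the others.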

Challenges and Solutions

Adapting NVDLA to this new architecture posed several challenges:

  • Compatibility with the Existing Compiler: NVDLA’s compiler was originally designed to work with its embedded CPU. We re-engineered parts of the compiler to align with our KMD-driven architecture, ensuring a smooth integration.
  • Complex Register Management: Managing registers for multiple layers directly from the KMD was intricate. We resolved this by designing modular, layer-specific logic within the KMD to simplify control and reduce errors.
  • Validation Without Native Simulation: Since NovuMind’s chip lacked native support for NVDLA’s simulation tools, we relied heavily on direct hardware testing and adjustments to ensure correctness.

Validation with Caffe

To validate our modifications, we used Caffe, a popular deep learning framework, to run neural network models on the adapted NVDLA. We focused on benchmarking tasks such as image recognition and convolution-heavy operations, analyzing performance metrics like throughput and latency. By successfully running these models, we confirmed the functionality and accuracy of our custom KMD-driven implementation.

Results and Implications

The project achieved its goal of enabling a direct comparison between NovuMind’s convolutional chip and NVDLA. Key outcomes included:

  • Performance Insights: The adapted NVDLA provided critical data on throughput and power efficiency, highlighting the strengths of our architecture in convolution-heavy workloads.
  • Scalable Design: The modifications demonstrated the flexibility of NVDLA’s open-source design, showing its potential for adaptation to unique hardware environments.
  • Future Applications: This project laid the groundwork for further innovations in AI hardware, illustrating how open-source accelerators can be customized for diverse architectures.

Conclusion

Adapting NVDLA for direct KMD control was a challenging but rewarding experience. It underscored the versatility of open-source hardware and its potential for fostering innovation in AI. At NovuMind, this project not only validated our chip’s performance but also provided valuable lessons in hardware-software co-design. As AI hardware continues to evolve, such projects will remain essential for pushing the boundaries of what’s possible.

Losing target

When I started writing this blog, my plan was to practice so that, in the next year or two, I could post more important stuff somewhere with higher influence and build a stronger background for an O1 visa.

But luckily my O1 visa application got approved earlier this year, so I no longer need to improve my writing skills for that purpose. So I am kind of losing the target.

I have written a lot of random things that are saved as drafts on this blog.

Later I will rethink what I could share, maybe small projects or some state-of-the-art learning topics, and I will also consider continuing the previous step-by-step deep dive.


Don’t want to work

Don’t want to rest

Don’t want to game

Don’t want to think

Don’t want to talk

Don’t want to do anything

Personal Software Configuration

Code Editor: Sublime

C++ compiler on Windows: MSVC (Visual Studio) in Windows Terminal

Doc/Slides Editor: Google Docs

Video Recorder/Streamer: OBS Studio

Chat & Social: Discord, Skype

Virtual Machine Software: VirtualBox

Recommended Open Source & Free Editing Software:

Picture Editor: GIMP (for JPEG; building from source is NOT recommended)

Raw Picture Editor: Darktable (for Raw)

Audio Editor: LMMS

[TTTA] Episode 2 Hello Color Triangle

Hope you get all the points from the first episode!

If you are familiar with the real DX12 Hello Triangle sample, you might have noticed that the sample I created in episode 1 was incomplete. In this episode, we’ll delve into the full “Hello Triangle” sample and explore some new concepts to understand how to draw this more complete version.

Screenshots of this new sample will be included in this blog, but feel free to download the latest version of Intel GPA and open S2-HelloColorTriangle.

From the result, we can see that the triangle is now colorful and the background has been painted blue. Also, the number of calls in the API log has changed from 1 to 4, adding 3 new calls.

In this blog series, we’ll discuss “ResourceBarrier” in detail in a future episode. For now, it’s important to understand that a ResourceBarrier in DirectX 12 is used to synchronize resource states between different GPU operations. It ensures that resources are not read or written in an inappropriate state, which is critical for correct rendering and performance optimization. Think of it as a traffic controller that manages how different parts of the GPU access memory resources.

Let’s start with “ClearRenderTargetView”.

ClearRenderTargetView is a function of the DX12 command list. In GPA we can easily observe which parameters this function needs.

RenderTargetView: A CPU descriptor handle that represents the start of the heap for the render target to be cleared

ColorRGBA[4]: An array of 4 floats that represents the color to fill the render target with

NumRects/pRects: The number of rectangles and the array of D3D12_RECT objects that specify the regions to be cleared

If pRects is NULL, ClearRenderTargetView clears the entire resource view instead.

The purpose of this function is to fill the render target with a single color. This is often used at the beginning of a new frame to clear the remnants of the previous frame’s rendering. Here, by passing nullptr as pRects, we are setting the entire frame to the color [0, 0.2, 0.4, 1], a dark blue with a bit of green mixed in. Since ClearRenderTargetView is called before “DrawInstanced”, the cleared color automatically becomes the background.
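To make the "whole view vs. rectangles" behavior easier to picture, here is a small CPU-side model in Python. This is not the D3D12 API, just a sketch of the semantics described above: `clear_render_target` and the list-of-rows "render target" are my own stand-ins, and rectangles are assumed to be (left, top, right, bottom) with right and bottom exclusive, as in D3D12_RECT.

```python
def clear_render_target(target, color, rects=None):
    """Fill `target` (a list of rows of RGBA tuples) with `color`.

    rects: optional list of (left, top, right, bottom), right/bottom exclusive.
    Passing None clears the whole target, mirroring a NULL rectangle array.
    """
    height = len(target)
    width = len(target[0]) if height else 0
    if rects is None:
        rects = [(0, 0, width, height)]
    for left, top, right, bottom in rects:
        # Clamp each rectangle to the target so out-of-range rects are harmless.
        for y in range(max(top, 0), min(bottom, height)):
            for x in range(max(left, 0), min(right, width)):
                target[y][x] = color


# A tiny 4x3 "frame" that still holds the previous frame's white pixels.
frame = [[(1.0, 1.0, 1.0, 1.0)] * 4 for _ in range(3)]
clear_render_target(frame, (0.0, 0.2, 0.4, 1.0))  # no rects: clear everything
```

After the call, every pixel in `frame` holds the clear color, and anything drawn afterwards simply overwrites parts of it, which is why the clear color ends up as the background.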

Then, let’s go to the next section, DrawInstanced. It keeps almost everything the same as the one introduced in episode 1.

Shader Source Code(SH): Same as E1

Pipeline State/Non pipeline state: Same as E1

The only difference is the input vertex buffer.

Position0: Same as E1

Color0: In this example, the 3 vertices of the triangle now have different colors. Since color values are stored in RGBA order, we can tell that:

Index0: top middle (0, 0.44, 0), is red(1, 0, 0)

Index1: bottom right (0.25, -0.44, 0), is green(0, 1, 0)

Index2: bottom left (-0.25, -0.44, 0), is blue(0, 0, 1)

Thus, when we draw this triangle on the screen, we are drawing 3 vertices with different colors. A question naturally comes up: what happens to the pixels in between? To understand this well, we need to introduce one of the most important terms in computer graphics – Rasterization.

In a word, rasterization is responsible for converting geometry/vector images into pixel information. In our case, the geometry we input is the triangle. We input position and color information of 3 vertices here as a geometry of a triangle, and as for output we will get a set of all pixels that we find belong to the geometry shape, including each pixel’s position and color.

The rasterization process involves a combination of algorithms, and since this series focuses on communication between applications and GPUs, I am not planning to introduce all the math details in this episode. For now, let’s only remember that rasterization finds all the pixels covered by the shape through an edge determination process, and interpolates each pixel’s color from the given vertex colors.
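Just to give a feeling for the "edge determination" part, here is a minimal sketch in Python using the classic edge-function test. This is a simplified CPU model under my own assumptions, not what any particular GPU does in hardware; `edge`, `covers`, and the sample grid are names I made up, and the vertex positions are the rounded ones from this sample.

```python
def edge(a, b, p):
    """Signed test: the sign tells which side of the line a->b the point p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covers(tri, p):
    """True when p is inside (or exactly on an edge of) the triangle."""
    a, b, c = tri
    e0, e1, e2 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
    # Accept either winding order: all three signs must agree.
    return (e0 >= 0 and e1 >= 0 and e2 >= 0) or (e0 <= 0 and e1 <= 0 and e2 <= 0)


# Vertex positions from this sample (x, y only).
TRIANGLE = [(0.0, 0.44), (0.25, -0.44), (-0.25, -0.44)]

# Sample a coarse grid over [-1, 1] x [-1, 1] and keep the covered points.
steps = 21
samples = [(-1 + 2 * i / (steps - 1), -1 + 2 * j / (steps - 1))
           for j in range(steps) for i in range(steps)]
covered = [p for p in samples if covers(TRIANGLE, p)]
print(f"{len(covered)} of {len(samples)} sample points are inside the triangle")
```

A real rasterizer works on pixel centers in screen space and has extra tie-breaking rules for pixels that fall exactly on an edge, but the core idea is the same three sign checks per pixel.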

Then let’s get back to the original question: why do vertices with 3 different colors give us different colors at different positions?

This is because the algorithm used for interpolation is linear interpolation, specifically barycentric interpolation in modern 3D APIs. This method calculates the contribution (weight) of each vertex to the target point. For example, if a point is closer to the red vertex, the output will be more red. Each pixel has different distances to the three vertices, resulting in varied colors.

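On the right is the process from ChatGPT for calculating the interpolated color value at position (0.1, 0.1, 0) in our case.

After calculation, we find the color in that area should be around (0.61, 0.39, 0, 1). The third weight is essentially zero because this point lies almost exactly on the edge between Index0 and Index1.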
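The calculation above can be reproduced with a few lines of Python. This is only a sketch of plain 2D barycentric interpolation using the rounded vertex positions from this sample; the function names are mine, and real GPUs additionally apply perspective correction, which does not matter here because all vertices share the same depth.

```python
def barycentric_weights(p, a, b, c):
    """Return (wa, wb, wc) with p = wa*a + wb*b + wc*c and wa + wb + wc = 1."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb


def interpolate_color(weights, colors):
    """Blend the three vertex colors channel by channel using the weights."""
    return tuple(sum(w * col[i] for w, col in zip(weights, colors))
                 for i in range(len(colors[0])))


# Positions (x, y) and RGB colors of Index0, Index1, Index2 from this sample.
POSITIONS = [(0.0, 0.44), (0.25, -0.44), (-0.25, -0.44)]
COLORS = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

weights = barycentric_weights((0.1, 0.1), *POSITIONS)
color = interpolate_color(weights, COLORS)
print("weights:", [round(w, 3) for w in weights])
print("color:  ", [round(ch, 3) for ch in color])
```

With these rounded positions, the red and green weights come out at roughly 0.61 and 0.39, and the blue weight is very slightly negative (about -0.007), which means (0.1, 0.1) sits a hair outside the right edge. For the purpose of this walkthrough, treating blue as 0 is a fine approximation.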

In a graphics pipeline, there is one more step before we get the final rendering result: the pixel shader. However, in this sample the pixel shader just passes the interpolated value through as the final result.

Here, let’s take another quick look at the final result shown at the beginning. The white dot in the middle is (0, 0, 0, 1). Thus, (0.1, 0.1, 0) is to the top right of the white dot, in Quadrant I. We can see that the color there is between red and green, and more red, which indicates that our calculation result is correct.

This concludes this episode. I hope it gives you a better understanding of this colorful triangle. Rasterization is a significant topic in computer graphics, but grasping the concept and scope of rasterization is more important than understanding the underlying mathematics. In later episodes, we will continue this learning journey, exploring increasingly complex samples and introducing more concepts step by step.

December Travel & First Camera Experience

Flight: Sacramento – Las Vegas, Dec 13th; Las Vegas – Sacramento, Dec 17th

Route:

Day1: Las Vegas Airport – Bellagio Hotel – Hell’s Kitchen – High Roller

Day2: Bellagio – Zion National Park(Night sky) – Page Town

Day3: Page Town – Upper Antelope Canyon – Lower Antelope Canyon – Horseshoe Bend – Lake Powell night sky

Day4: Page Town – The Grand Canyon – Kingman

Day5: Kingman – Hoover Dam – LAS Chinatown – The Sphere (Postcard from Earth) – Archery training

Some photos:
