Revisit of my NVDLA adaptation project at NovuMind

Introduction to NVDLA

The NVIDIA Deep Learning Accelerator (NVDLA) is an open-source hardware project designed to accelerate deep learning inference tasks. Its flexible, scalable architecture enables developers to implement AI acceleration on various platforms, from FPGAs to ASICs. By offering a complete stack of hardware and software, NVDLA has become a cornerstone for many projects in the AI hardware space, making it an invaluable tool for prototyping and development.

Project Motivation

At NovuMind, a company focused on developing innovative convolutional chips for AI applications, we sought to benchmark our hardware against existing solutions like NVDLA. However, NovuMind’s architecture differed significantly: it lacked a traditional CPU or microcontroller, relying instead on Kernel Mode Driver (KMD) control for direct register management. To ensure a fair comparison, we adapted NVDLA to operate under similar conditions, eliminating its CPU and microcontroller to align with our unique architecture. This effort was crucial for demonstrating our chip’s performance in a controlled, apples-to-apples manner.

Technical Approach

The project’s technical focus was on modifying the NVDLA architecture to remove the embedded CPU and microcontroller and shift all register control responsibilities to the KMD. Here’s how we achieved this:

  1. Removing Embedded Controllers: We stripped the CPU and microcontroller from the NVDLA’s design, leaving a bare-bones configuration that required external control.
  2. Building Layer-Specific Register Controllers: We implemented custom KMD logic to handle direct register read/write operations for each neural network layer, such as convolution and pooling, ensuring precise management of data flow and operations (a sketch of this register programming appears after this list).
  3. Integrating with Existing Tools: We adapted the compiler provided by NVDLA to generate configurations compatible with our KMD-driven approach. This involved modifying the software stack to streamline the translation of high-level neural network models into hardware instructions.
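
To make step 2 more concrete, here is a minimal sketch of what layer-specific register programming from the KMD might look like. The helper functions, register offsets, and field layouts below are hypothetical placeholders for illustration, not the actual NVDLA register map or NovuMind driver code.

```cpp
// Minimal sketch of KMD-side, layer-specific register programming.
// All offsets and field layouts are hypothetical placeholders and do not
// correspond to the real NVDLA register map.
#include <cstdint>

// Simple MMIO helpers over a mapped register window.
static inline void reg_write(volatile uint32_t* base, uint32_t offset, uint32_t value) {
    base[offset / sizeof(uint32_t)] = value;
}
static inline uint32_t reg_read(volatile uint32_t* base, uint32_t offset) {
    return base[offset / sizeof(uint32_t)];
}

// Per-layer configuration (what the compiler output would provide for one
// convolution layer).
struct ConvLayerConfig {
    uint32_t input_addr;     // DMA address of the input feature map
    uint32_t weight_addr;    // DMA address of the kernel weights
    uint32_t output_addr;    // DMA address of the output feature map
    uint16_t in_width, in_height;
    uint16_t kernel_size, stride;
};

// Hypothetical register offsets for the convolution block.
constexpr uint32_t CONV_SRC_ADDR    = 0x1000;
constexpr uint32_t CONV_WEIGHT_ADDR = 0x1004;
constexpr uint32_t CONV_DST_ADDR    = 0x1008;
constexpr uint32_t CONV_SHAPE       = 0x100C;
constexpr uint32_t CONV_KERNEL      = 0x1010;
constexpr uint32_t CONV_OP_ENABLE   = 0x1014;
constexpr uint32_t CONV_STATUS      = 0x1018;

// Program one convolution layer, kick it off, and wait for completion.
void submit_conv_layer(volatile uint32_t* mmio, const ConvLayerConfig& cfg) {
    reg_write(mmio, CONV_SRC_ADDR,    cfg.input_addr);
    reg_write(mmio, CONV_WEIGHT_ADDR, cfg.weight_addr);
    reg_write(mmio, CONV_DST_ADDR,    cfg.output_addr);
    reg_write(mmio, CONV_SHAPE,  cfg.in_width | (uint32_t(cfg.in_height) << 16));
    reg_write(mmio, CONV_KERNEL, cfg.kernel_size | (uint32_t(cfg.stride) << 8));
    reg_write(mmio, CONV_OP_ENABLE, 1);  // start the layer

    // Poll the layer-done bit; a production driver would block on an
    // interrupt instead of busy-waiting.
    while ((reg_read(mmio, CONV_STATUS) & 0x1) == 0) { }
}
```

In this sketch, pooling and the other layer types would follow the same pattern with their own register blocks, which is what keeps the per-layer logic modular.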

Challenges and Solutions

Adapting NVDLA to this new architecture posed several challenges:

  • Compatibility with the Existing Compiler: NVDLA’s compiler was originally designed to work with its embedded CPU. We re-engineered parts of the compiler to align with our KMD-driven architecture, ensuring a smooth integration.
  • Complex Register Management: Managing registers for multiple layers directly from the KMD was intricate. We resolved this by designing modular, layer-specific logic within the KMD to simplify control and reduce errors.
  • Validation Without Native Simulation: Since NovuMind’s chip lacked native support for NVDLA’s simulation tools, we relied heavily on direct hardware testing and adjustments to ensure correctness.

Validation with Caffe

To validate our modifications, we used Caffe, a popular deep learning framework, to run neural network models on the adapted NVDLA. We focused on benchmarking tasks such as image recognition and convolution-heavy operations, analyzing performance metrics like throughput and latency. By successfully running these models, we confirmed the functionality and accuracy of our custom KMD-driven implementation.

Results and Implications

The project achieved its goal of enabling a direct comparison between NovuMind’s convolutional chip and NVDLA. Key outcomes included:

  • Performance Insights: The adapted NVDLA provided critical data on throughput and power efficiency, highlighting the strengths of our architecture in convolution-heavy workloads.
  • Scalable Design: The modifications demonstrated the flexibility of NVDLA’s open-source design, showing its potential for adaptation to unique hardware environments.
  • Future Applications: This project laid the groundwork for further innovations in AI hardware, illustrating how open-source accelerators can be customized for diverse architectures.

Conclusion

Adapting NVDLA for direct KMD control was a challenging but rewarding experience. It underscored the versatility of open-source hardware and its potential for fostering innovation in AI. At NovuMind, this project not only validated our chip’s performance but also provided valuable lessons in hardware-software co-design. As AI hardware continues to evolve, such projects will remain essential for pushing the boundaries of what’s possible.

Losing target

When I started writing blogs, my plan was to prepare and practice so that, in the next year or two, I could post more substantial material somewhere with higher visibility and build a stronger background for an O-1 visa.

But luckily my O-1 visa application was approved earlier this year, so I no longer need to improve my writing skills for that purpose. I have kind of lost the target.

I have written a lot of random things saved as "drafts" on this blog.

Later I will rethink what I could share, maybe small projects or some state-of-the-art learning topics, and also consider continuing the previous step-by-step deep dive.


Don’t want to work

Don’t want to rest

Don’t want to game

Don’t want to think

Don’t want to talk

Don’t want to do anything

Personal Software Configuration

Code Editor: Sublime

C++ compiler on Windows: MSVC (Visual Studio), used from Windows Terminal

Doc/Slides Editor: Google Docs

Video Recorder/Streamer: OBS Studio

Chat & Social: Discord, Skype

Virtual Machine Software: VirtualBox

Recommended Open-Source & Free Editing Software:

Picture Editor: GIMP (for JPEG; building from source is not recommended)

Raw Picture Editor: Darktable (for Raw)

Audio Editor: LMMS

[TTTA] Episode 2 Hello Color Triangle

Hope you get all the points from the first episode!

If you are familiar with the real DX12 Hello Triangle sample, you might have noticed that the sample I created in episode 1 was incomplete. In this episode, we’ll delve into the full “Hello Triangle” sample and explore some new concepts to understand how to draw this more complete version.

Screenshots of this new sample are included in this blog, but feel free to download the latest version of Intel GPA and open S2-HelloColorTriangle yourself.

From the result, we can see that the triangle is now colorful and the background has been painted blue. Also, the API log has grown from 1 entry to 4, adding 3 new calls.

In this blog series, we’ll discuss “ResourceBarrier” in detail in a future episode. For now, it’s important to understand that a ResourceBarrier in DirectX 12 is used to synchronize resource states between different GPU operations. It ensures that resources are not read or written in an inappropriate state, which is critical for correct rendering and performance optimization. Think of it as a traffic controller that manages how different parts of the GPU access memory resources.
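
As a small illustration, here is roughly what such a barrier looks like on the application side: a transition that moves the back buffer from the "present" state into the "render target" state before drawing. This is the common pattern from the standard DX12 samples, not the exact code behind this capture, and the variable names are assumptions.

```cpp
// Sketch of a transition barrier: the back buffer goes from PRESENT to
// RENDER_TARGET so the GPU may safely render into it.
#include <d3d12.h>

void TransitionToRenderTarget(ID3D12GraphicsCommandList* commandList,
                              ID3D12Resource* backBuffer) {
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type  = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_NONE;
    barrier.Transition.pResource   = backBuffer;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_PRESENT;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_RENDER_TARGET;

    // One call can submit a batch of barriers; here we record a single one.
    commandList->ResourceBarrier(1, &barrier);
}
```

In the standard samples, a matching barrier with the two states swapped is recorded after drawing so the back buffer can be presented again.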

Let’s start with “ClearRenderTargetView“.

ClearRenderTargetView is a method of the DX12 command list. From GPA we can easily observe what parameters this function needs.

RenderTargetView: The CPU descriptor handle that represents the render target view to be cleared

ColorRGBA[4]: An array of four floats representing the clear color

NumRects/pRects: The number of, and pointer to, the D3D12_RECT rectangles to be cleared

If pRects is NULL, ClearRenderTargetView clears the entire resource view instead.

The purpose of this function is to fill the entire render target with a single color. This is often done at the beginning of a new frame to clear the remnants of the previous frame's rendering. Here, by passing nullptr for pRects, we set the entire frame to the color [0, 0.2, 0.4, 1], a mix of blue and green. Since ClearRenderTargetView comes before "DrawInstanced", the cleared color automatically becomes the background.
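
For reference, here is a minimal sketch of how this clear is typically recorded on the application side; the handle name rtvHandle is an assumption, standing in for the render target view created during swap-chain setup.

```cpp
// Clear the whole render target to the color observed in GPA.
#include <d3d12.h>

void ClearBackground(ID3D12GraphicsCommandList* commandList,
                     D3D12_CPU_DESCRIPTOR_HANDLE rtvHandle) {
    // The same [R, G, B, A] value seen in the capture: mostly blue, some green.
    const float clearColor[4] = { 0.0f, 0.2f, 0.4f, 1.0f };

    // NumRects = 0 and pRects = nullptr: clear the entire render target view.
    commandList->ClearRenderTargetView(rtvHandle, clearColor, 0, nullptr);
}
```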

Then, let’s go to the next section, DrawInstanced. It keeps almost everything the same as the one introduced in episode 1.

Shader Source Code(SH): Same as E1

Pipeline State/Non pipeline state: Same as E1

The only difference is the input vertices buffer.

Position0: Same as E1

Color0: In this example, the 3 vertices of the triangle now have different colors. Since color values are stored in RGBA order, we can tell that:

Index0: top middle (0, 0.44, 0), is red (1, 0, 0)

Index1: bottom right (0.25, -0.44, 0), is green (0, 1, 0)

Index2: bottom left (-0.25, -0.44, 0), is blue (0, 0, 1)

Thus, when we draw this triangle on the screen, we are drawing 3 vertices with different colors. A question naturally comes up: what happens to the pixels in the middle? To understand this well, we will introduce one of the most important terms in computer graphics: rasterization.

In short, rasterization is responsible for converting geometry/vector data into pixel information. In our case, the input geometry is the triangle: the position and color of its 3 vertices. As output we get the set of all pixels determined to belong to that shape, including each pixel's position and color.

The rasterization process involves a combination of algorithms, and since this series focuses on the communication between applications and GPUs, I am not planning to cover all the math details in this episode. For now, just remember that rasterization finds all the necessary pixels through an edge-determination process and interpolates each pixel's color from the given vertex colors.

Then let’s get back to the original question: Why having vertices of 3 different color will give us different colors in different position?

This is because the interpolation algorithm used is linear interpolation, specifically barycentric interpolation in modern 3D APIs. This method calculates each vertex's contribution to the target point: for example, if a point is closer to the red vertex, the output will be more red. Each pixel has different distances to the three vertices, resulting in varied colors.

On the right is the calculation process (worked out with ChatGPT) for the interpolated color at pixel position (0.1, 0.1, 0) in our case.

After calculation, we find the color in that area should be around (0.61,0.39,0,1).
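
If you want to check the numbers yourself, here is a small, self-contained C++ sketch that reproduces the calculation using barycentric weights in their signed-area form, with the vertex positions and colors listed above (alpha stays 1 everywhere, so only RGB is interpolated here).

```cpp
// Reproduce the barycentric interpolation at pixel position (0.1, 0.1).
#include <cstdio>

struct Vec2  { float x, y; };
struct Color { float r, g, b; };

// Twice the signed area of triangle (a, b, c); the factor of 1/2 cancels
// when we take ratios, so it can be skipped.
static float Cross(Vec2 a, Vec2 b, Vec2 c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

int main() {
    Vec2  A{ 0.00f,  0.44f}, B{ 0.25f, -0.44f}, C{-0.25f, -0.44f};
    Color cA{1, 0, 0}, cB{0, 1, 0}, cC{0, 0, 1};   // red, green, blue
    Vec2  P{0.1f, 0.1f};                           // the pixel we inspect

    float area = Cross(A, B, C);
    float wA = Cross(P, B, C) / area;   // weight of the red vertex
    float wB = Cross(A, P, C) / area;   // weight of the green vertex
    float wC = Cross(A, B, P) / area;   // weight of the blue vertex

    Color out{ wA * cA.r + wB * cB.r + wC * cC.r,
               wA * cA.g + wB * cB.g + wC * cC.g,
               wA * cA.b + wB * cB.b + wC * cC.b };

    // Prints approximately (0.61, 0.39, -0.01).
    std::printf("(%.2f, %.2f, %.2f)\n", out.r, out.g, out.b);
    return 0;
}
```

The tiny negative blue weight only means the sample point sits a hair outside the edge between the red and green vertices; clamped to zero, it matches the (0.61, 0.39, 0, 1) result above.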

In a graphics pipeline, before we get the final rendering result, there is one more step: the pixel shader. In our case, however, the pixel shader simply passes the interpolated value through as the final result.

Here, let’s have a quick peek of the final result in the beginning. The white dot in the middle is (0,0,0,1). Thus, (0.1, 0.1, 0) is on the top right of the white dot, in Quadrant I. We can see that the color is between red and green, and more red; which indicates that our calculation result is correct.

This concludes this episode. I hope it gives you a better understanding of this colorful triangle. Rasterization is a significant topic in computer graphics, but grasping the concept and scope of rasterization is more important than understanding the underlying mathematics. In later episodes, we will continue this learning journey, exploring increasingly complex samples and introducing more concepts step by step.

December Travel & First Camera Experience

Flight: Sacramento – Las Vegas, Dec 13th; Las Vegas – Sacramento, Dec 17th

Route:

Day1: Las Vegas Airport – Bellagio Hotel – Hell’s Kitchen – High Roller

Day2: Bellagio – Zion National Park (night sky) – Page Town

Day3: Page Town – Upper Antelope Canyon – Lower Antelope Canyon – Horseshoe Bend – Lake Powell night sky

Day4: Page Town – The Grand Canyon – Kingman

Day5: Kingman – Hoover Dam – LAS Chinatown – The Sphere (Postcard from Earth) – Archery training

Some photos:

[TTTA] Episode 1 Hello Triangle

In our inaugural episode, we'll explore how to render a single triangle on your screen. By examining a frame captured from a simple triangle-rendering application, we'll learn what needs to be provided to the operating system to render this blue triangle.

Begin by downloading “S1-HelloTriangle.zip” from the shared drive and extract its contents. Open the frame using the “Frame Analyzer” to proceed.

You’ll notice the blue triangle in the center—the outcome of our rendering. On the left, the API log lists just one item, signifying that a single “DrawInstanced” API call was responsible for rendering our triangle on the screen.

Selecting this draw call opens a new "Resource" window, showcasing three categories: Input (In), Execution (Exe), and Output (Out). These resources are crucial for understanding the data required for rendering.

We will go through these resources one by one to find out what was used in this rendering.

Starting with the Input section, we find two items. ‘B:’ denotes a buffer, followed by a unique SHA code. The term ‘VBV’ (Vertex Buffer Views) indicates that this buffer stores the vertices.

This particular buffer contains three vertices, each comprising two components: Position and Color.

Position: Three float values, each within the range [-1.0, 1.0], designating the x, y, and z coordinates. For our 2D example, all z-values are zero, with the x/y pairs forming a triangle onscreen.

Color: Four float values, each in the range [0.0, 1.0], representing the 'RGBA' color; in our case, full blue with full opacity.
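
To make the layout concrete, here is a sketch of how such a vertex buffer might be declared on the application side. The struct layout (a three-float position plus a four-float RGBA color) matches what GPA shows; the exact position values below are illustrative, not read from the capture.

```cpp
// Sketch of a vertex layout matching the VBV contents described above.
#include <DirectXMath.h>

struct Vertex {
    DirectX::XMFLOAT3 position;  // x, y, z in [-1.0, 1.0]
    DirectX::XMFLOAT4 color;     // R, G, B, A in [0.0, 1.0]
};

// Three vertices, one per corner of the triangle, all fully blue and opaque.
static const Vertex g_triangleVertices[] = {
    { {  0.00f,  0.25f, 0.0f }, { 0.0f, 0.0f, 1.0f, 1.0f } },
    { {  0.25f, -0.25f, 0.0f }, { 0.0f, 0.0f, 1.0f, 1.0f } },
    { { -0.25f, -0.25f, 0.0f }, { 0.0f, 0.0f, 1.0f, 1.0f } },
};
```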

Why are colors defined as floating-point values? When designing the interface, ideally an application shouldn't need to know what hardware it will be executed on. Thus, when rendered on an 8-bit rendering system, the resulting color is mapped to the range [0, 255]; on a 10-bit rendering system (also known as HDR10), the final range will be [0, 1023].

Why are colors and transparency defined on the vertices instead of on the surfaces/triangles? In a modern rendering system, a surface is always represented by 3 vertices, and in general the number of surfaces is larger than the number of vertices. Defining color and transparency on the vertices therefore reduces the amount of data that needs to be stored: the color and transparency values of a vertex are reused by all surfaces that share that vertex.

Geometry Input: A visual result of the input vertices, drawing a triangle in a 3D space.

That’s the end of Input section.

In execution section, to simplify this episode, we are only focusing on Arguments and Shader(SH):

ID3D12GraphicsCommandList::DrawInstanced: This is the draw command used in this rendering process; its definition can be found in the ID3D12GraphicsCommandList::DrawInstanced documentation.

VertexCountPerInstance: The number of vertices to draw for each instance.

InstanceCount: The number of instances to draw in this call. Here we have only 1 instance of the triangle.

StartVertexLocation: The index of the first vertex to read. Sometimes the vertex buffer also stores data for other draw calls; this offset locates the correct starting vertex.

StartInstanceLocation: A value added to each index before reading per-instance data from a vertex buffer.

From the DrawInstanced command’s arguments, we ascertain the intention to draw a single instance represented by three vertices.

The shader code of this rendering system is a programmable input that defines how the result is rendered.

SH: 2555136722384: The shader resource, identified by its hash

HLSL(High-Level Shader Language): A shader language developed by Microsoft

VSMain: The entry point (main function) of the vertex shader

PSMain: The entry point (main function) of the pixel shader; its output becomes the final value of a pixel

The full design and execution of the shader system is complicated and will be extended gradually in this series as the demos get more complex. For the full DX12 pipeline, see Pipelines and Shaders with DX12.
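
To give a feel for what VSMain and PSMain typically contain at this stage, here is an assumed reconstruction of the kind of pass-through HLSL used in the standard HelloTriangle sample, embedded as a C++ string the way small demos often do before handing it to the shader compiler. It is not the capture's exact shader source.

```cpp
// Pass-through shaders: the vertex shader forwards position and color, and
// the pixel shader outputs the (interpolated) color as the final pixel value.
static const char* g_shaderSource = R"hlsl(
struct PSInput
{
    float4 position : SV_POSITION;
    float4 color    : COLOR;
};

PSInput VSMain(float4 position : POSITION, float4 color : COLOR)
{
    PSInput result;
    result.position = position;
    result.color    = color;
    return result;
}

float4 PSMain(PSInput input) : SV_TARGET
{
    return input.color;
}
)hlsl";
```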

And that wraps up the Execution section! We’ll touch upon the Output results when relevant in future discussions.

To recap, we’ve covered:

  • Input: Vertex positions and color data.
  • Shader: Source code specifying VSMain and PSMain functions.

With these inputs prepared, we’re set to execute the “DrawInstanced” command:

DrawInstanced(3, 1, 0, 0)

Executing this function draws the blue triangle on the screen, achieving our rendering goal.

Thank you for joining Episode 1! Your thoughts and questions are welcome in the comments below. Stay tuned for the next installment!

Triangle-To-Triple-A Series – Intro

Step into the realm of game graphics with our comprehensive blog series, designed to empower you with a fundamental understanding of the modern graphics pipeline and the essence of graphics API calls. As we traverse the theoretical concepts, our narrative will be lightly sprinkled with practical examples, aiming to show how applications converse with the operating system to paint pixels on the screen—no heavy coding, just the pure science of graphics.

Within this series, we’ll occasionally reference Intel’s Graphics Performance Analyzers (GPA), a tool that exemplifies these concepts in action, though our focus will remain firmly on the underlying graphics principles rather than the intricacies of the tool itself. By the end of our journey, you’ll not only grasp the foundational elements of game graphics but also appreciate how they coalesce to form the breathtaking visuals in contemporary gaming.

Join us as we peel away the layers of complexity in modern graphics, clarifying the dialogue between application and machine, and illuminating the path from the genesis of a graphic call to the lush, immersive environments that define today’s gaming experiences.


All the frame/stream captures used in these blogs can be found in my shared drive:

https://drive.google.com/drive/u/0/folders/1c8vXyConNvgtM43VnboaAlXCCb8He0ZO


Hope you enjoy!

*I may use ChatGPT as an assistant when writing these blogs, but I review everything before publishing :)
