
Developing AI Applications for the Edge: A New Approach

The past few years have seen a dramatic surge in the number of hardware accelerators for running AI inference at the edge. These offerings, instead of making life easier for developers, have turned the process of developing edge AI applications into a complex, expensive, time-consuming, and frustrating experience. Complex because it requires bringing together hardware, software, business logic, and machine learning (ML) models. Expensive because developers need to invest time and money to bring up different hardware options. Time-consuming because each hardware option is accompanied by its own software stack and toolchain for porting models. And frustrating because the toolchains are still not reliable and the whole pipeline of going from a problem to a deployed ML model is highly fragmented. Before we delve into how the process can be simplified, it is worthwhile to spend some time understanding why we are in this situation in the first place.

Developer's Dilemma

To understand why developers need to spend a lot of time, money, and engineering resources on evaluating different HW options, it is instructive to look at the fundamental dilemma that edge AI application developers face: should they first develop their application software and then pick the hardware, or should they first pick the hardware and then develop the software?

Ideally, developers would like to develop the software first and defer the selection of hardware until they can evaluate the performance of different hardware options and pick the best one. Unfortunately, they cannot develop the application software first without choosing the hardware, as software and ML models are closely tied to the hardware, and there is no unified software stack that works across multiple hardware options.

At the same time, they cannot pick hardware first because it is difficult to choose hardware without knowing what performance it will give in the application. Consequently, this leaves them with no choice but to develop the application for each target hardware they want to evaluate, benchmark the performance, and then pick the optimal hardware option that caters to their needs. The development workflow is depicted in the figure below.

A quick look at the workflow tells us why the application development process is complex, expensive, time-consuming, and frustrating, as claimed earlier. At this point, you might be wondering why choosing the right hardware is such a big challenge. Surely, the hardware vendors must be providing enough information in their data sheets to allow developers to make the correct choice, right? Of course, hardware vendors provide datasheets, benchmarks, demos, and a plethora of other supporting material to establish the supremacy of their offerings. However, even armed with all this information, picking the right hardware is a highly non-trivial task.

Choosing the Right AI Hardware

We will now delve deeper into the world of HW selection and understand what makes it such an interesting and complex problem. You have no doubt come across the various terms used by HW vendors to publicize the strength of their products: Trillion Operations Per Second (TOPS), TOPS/Watt, Frames Per Second (FPS), FPS/Watt, FPS/$, FPS/TOPS (believe me, I am not making up this metric), and so on. We will see why these metrics alone cannot guide HW selection.

AI Hardware Accelerators: Where Everyone is the Best at Something

Every hardware vendor building an AI accelerator or SoC makes the same claims: they have the most powerful hardware combined with the easiest to use software. It’s like a bunch of chefs trying to outdo each other with their secret recipes. “My recipe is better than yours!” “No, mine is better!” “No, mine is the best!” 😂 But, as with recipes, it’s not always easy to pick the best one based on the number of ingredients it has. Ultimately, what matters is that the recipe tastes good and is easy to follow. Similarly, when choosing a hardware option, it’s important to ensure that it meets your requirements and is easy to use. After all, the goal is to create a delicious dish, not just a dish with a lot of ingredients. 😋

TOPS is Like a Box of Chocolates: You Never Know What You’re Going to Get

Metrics like TOPS do not translate to real application performance in a standardized way. Consider two hardware options, A and B. If A has 2x the TOPS of B, it does not necessarily mean that, for a given machine learning (ML) model, A will deliver 2x the FPS of B. In fact, this is the case even if A and B are different products from the same company. Now consider two ML models, M1 and M2, running on the same hardware. If M1 requires 2x the compute of M2, it does not necessarily mean that M2 will run at 2x the FPS of M1. This is true even if M1 and M2 are the same model architecture running at different input resolutions.
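To make the point concrete, here is a small back-of-the-envelope sketch. All the numbers in it (TOPS ratings, utilization factors, model compute) are hypothetical, not measurements of any real product; the point is only that the utilization a model actually achieves is what datasheet TOPS figures hide.

```python
# Hypothetical illustration: devices, TOPS ratings, utilization factors,
# and model compute below are made-up numbers, not real benchmarks.

def estimated_fps(tops: float, utilization: float, gops_per_frame: float) -> float:
    """FPS = usable operations per second / operations needed per frame."""
    usable_ops_per_s = tops * 1e12 * utilization
    return usable_ops_per_s / (gops_per_frame * 1e9)

model_gops = 8.2  # illustrative compute cost of one inference, in GOPs

# Device A: high peak TOPS, but memory-bound on this model -> low utilization.
# Device B: half the peak TOPS, but the model fits on-chip -> high utilization.
print("Device A:", round(estimated_fps(tops=26, utilization=0.15, gops_per_frame=model_gops)))
print("Device B:", round(estimated_fps(tops=13, utilization=0.45, gops_per_frame=model_gops)))
```

In this made-up example, the device with half the peak TOPS comes out ahead simply because its utilization on that particular model is higher, and utilization changes from model to model and from device to device.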

One might argue that normalized metrics like TOPS/Watt address the above issue. Unfortunately, this metric is not measured in a standardized way. Some vendors measure only the power consumption of the matrix multiplication unit, which is responsible for the majority of the computation, while not accounting for the power drawn by other components. Comparing SoCs with pure accelerators is another challenge, as end users are typically interested in the overall system power budget rather than the narrow scope for which the numbers are published.

FPS: The Metric That Tells You Everything and Nothing at the Same Time

You might ask, ‘What about absolute performance metrics like FPS? They measure the real performance of the hardware and should allow apples-to-apples comparisons, right?’ The reality is that while FPS is a useful metric that gives an idea of the power of the hardware, it is far from perfect. For a given hardware option, knowing the FPS for one ML model does not let you reliably estimate the FPS for other models. Even if FPS numbers for all the models you are interested in are published, there is no guarantee that you will achieve those numbers in real usage, because the conditions under which benchmarks are generated might not match the real use case. Application software load on the system can degrade ML performance, or the real use case may not be able to supply inputs at the batch sizes for which benchmarks are reported.

Moreover, real-world AI applications employ multiple models, either in series or in parallel. Performance numbers for a single model do not translate to multiple models. In fact, for some hardware options optimized for running a single model, the performance penalty of switching models can be very high. Emerging hardware options typically achieve power and performance efficiency by keeping the ML model in on-chip memory. Such hardware provides high performance for models up to a certain size but experiences a drastic performance drop once a model no longer fits on the chip.

Hardware Options: All Equal, But Some More Equal Than Others 😉

To further complicate matters, not all hardware options are created equal. Every vendor makes design choices that make a fair comparison very difficult. Some options support only integer-quantized (INT) models, while others offer the flexibility of running floating-point models. Some hardware options deliver very high performance on smaller models while suffering performance degradation on larger ones. Others are optimized for delivering high performance on a single model at a certain batch size, while still others can multiplex multiple models efficiently. So, how can one choose between hardware that is very flexible but has lower performance and hardware that is very efficient for only a small subset of ML models and use cases?

The ML Model Compatibility Conundrum 😕

The varying capabilities of different hardware options also bring another important point into focus: ML model compatibility. Anyone who has spent time porting an ML model to new hardware knows how frustrating the process can be. The trained model (typically a PyTorch or TensorFlow checkpoint) needs to be exported to a format (like ONNX or TFLite) that can be compiled for a hardware target. For hardware that can only run quantized models, there needs to be an intermediate quantization step that may require providing calibration data. These intermediate steps use tools that are not very reliable: we are in luck when they work but in deep trouble when they fail with not-so-useful error messages. Not to mention the hassle of getting all pre-processing and post-processing methods to match the original training settings so that the ported model performs as well as the trained one.
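As a concrete illustration of just the first hop in that pipeline (not any particular vendor's toolchain), here is roughly what exporting a PyTorch checkpoint to ONNX looks like. The model, file names, and input resolution below are placeholders for whatever your trained model actually uses.

```python
# Minimal sketch of the PyTorch -> ONNX export step; the model and the
# 224x224 input resolution are placeholders, not a specific deployment.
import torch
import torchvision

model = torchvision.models.mobilenet_v2(weights=None)  # stand-in for your trained model
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)  # must match the training input resolution
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)
# For integer-only targets, a further quantization pass with calibration data
# is typically required; the exact tool depends on the hardware vendor.
```

Every step after this one (quantization, compilation, matching pre- and post-processing) is where the vendor-specific toolchains come in, and that is where most of the frustration tends to live.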

Is there a better way to do things?

Those of you who made it this far have probably been nodding along and thinking to yourselves, ‘All this sounds okay, but what can be done?’ Some of you might also feel that these are the idle ramblings of a frustrated ML engineer who lost quite a few nights of sleep dealing with AI accelerators and ML models. You are not completely wrong 😊. But rather than just highlighting the difficulties, we believe there is a need to actively develop solutions that simplify the edge AI development process. Addressing the challenges outlined above calls for a new approach to developing edge AI applications.

Even without developing a solution, one can see that the right approach to solving these issues should have the following features:

  1. Easy-to-use software that allows development of complex real-world AI applications.
  2. Software that works with multiple hardware options.
  3. A mechanism that allows hardware evaluation in a short amount of time with little to no investment in buying the hardware.
  4. Reliable toolchains to optimize and port ML models to different hardware options.

And this is what we have been developing at DeGirum for the past few years. While we are still far away from solving every aspect of the problem, we believe that we have made enough progress in the past couple of years to share with the ML community at large.

This is probably a good time to introduce ourselves without turning this article into the type of marketing pitch we were critical of earlier. We are a fabless semiconductor company developing hardware and software to simplify the development of edge AI applications. A couple of years ago, we had our working test chip (codenamed Orca) and started working with prospective customers. It was at this point that we began to realize the many challenges that AI solution providers face when bringing AI into real-world applications.

When we started working on our hardware, we naively assumed (like all other hardware vendors) that powerful hardware combined with an efficient runtime is enough for solution providers to start integrating AI into their product lines. We soon realized that end users do not have the bandwidth to evaluate new hardware options, no matter how good the numbers look on paper and how easy the vendor claims the software is. While the cost of buying new hardware and the time it takes to bring it up are the obvious reasons, the pain of finding and porting suitable ML models to the hardware and the lack of tools that provide fair ways to benchmark performance also contribute to the reluctance to try new hardware.

Based on this customer feedback, we started working on ways to make evaluating our hardware easier. Towards this goal, we developed the DeGirum AI Hub, which allows users to access our hardware remotely, thereby eliminating the need to buy our hardware for the evaluation process. Combined with our Python software development kit, which we creatively named PySDK, the platform turned a process that could take a couple of weeks into one that takes under five minutes.

Developers who evaluated our hardware using our AI Hub and PySDK started asking us to extend the same convenience to other hardware options. It dawned on us that developers were viewing our AI Hub as a tool they could use to develop applications for multiple hardware options before investing in buying the hardware and evaluating its performance. At this point, we decided to make PySDK work with multiple hardware options as well as provide access to these hardware options in the cloud. We feel we are just at the beginning of an adventure and are very excited about the potential of our hardware, software, and cloud platform to simplify edge AI development.

The articles in this publication will delve deep into the features of the AI Hub and the various software APIs. We will also cover some model development and model porting related topics. While we populate our publication with these in-depth articles, here is a quick overview of our PySDK and AI Hub features:

DeGirum PySDK Features

Users can install DeGirum PySDK and get started with AI application development in under five minutes. Instructions for installation and API documentation are available at Overview — DeGirum Development Tools. Examples of how to use PySDK to develop applications are provided in the DeGirum/PySDKExamples repository on GitHub. You can even run the examples in Google Colab, which means you do not have to install any SW locally to experiment. Below are the main features of our PySDK, followed by a short usage sketch.

  1. Supports three types of inference: Hosted, AI Server, and Local. An article detailing these options will be published soon.
  2. Supports multiple hardware options: DeGirum Orca, Intel CPUs/GPUs, Nvidia GPUs and SoCs, Google Edge TPU, Arm SoCs (Raspberry Pi, etc.), and AMD CPUs.
  3. Supports advanced functionality such as tracking, slicing, and zone counting with simple-to-use APIs.
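To give a flavor of what hosted inference looks like in practice, here is a minimal sketch modeled on the patterns in the PySDKExamples repository. The zoo URL, token handling, and model name are placeholders; please check the PySDK documentation and examples for the exact connection arguments for your account.

```python
import degirum as dg  # pip install degirum

# Connect to a cloud model zoo for hosted inference.
# The zoo URL and token below are placeholders -- see the PySDK docs
# and the PySDKExamples repo for the values to use with your account.
zoo = dg.connect(dg.CLOUD, "<model zoo URL>", token="<your AI Hub token>")

# Load a model by name (placeholder name) and run inference on an image file.
model = zoo.load_model("<model name from the zoo>")
result = model("path/to/image.jpg")

print(result)  # structured detection/classification results
```

The same application code can then be pointed at an AI Server or a local accelerator by changing the connection arguments, which is the whole point of decoupling the software from the hardware choice.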

DeGirum AI Hub Features

Users can now sign up for our AI Hub at DeGirum. Below are the main features:

  1. Remote access to multiple AI HW accelerators: DeGirum Orca, Intel CPUs/GPUs, and Google Edge TPU. Support for Nvidia coming soon.
  2. Hosted compiler: Go from PyTorch checkpoint to compiled model in a single click (including quantization). SOTA model architectures such as YOLOv5 and YOLOv8 are supported.
  3. Inference in the browser: Get performance estimates for different models on different hardware right in the browser without the need to install any SW.
  4. Rich model zoos: Access to model zoos with the latest models, such as YOLOv5 and YOLOv8, trained on different datasets and compiled for different hardware options.

Closing Thoughts

We are at an exciting time, in the middle of enabling AI for a vast number of use cases at the edge. Exciting advances are happening on the hardware, software, and models fronts at breakneck speed. For application developers to keep pace with emerging technologies, there is a need for tools that shorten the time from model development to model deployment. We are very excited to be playing a part in this endeavor and hope you all can join us.
