
Rockchip Neural Networking

Rockchip’s RKNN toolkit includes examples that can serve as general-purpose image processing tools, with images originating from anywhere: a continuous GStreamer grab or any other source. Rockchip’s demonstration programs operate simply as command-line input-process-output tools: input one image, output one image. The output image duplicates the input but adds boxes and annotations for every recognised object.

The RKNN hardware does not need to access the image signal processing hardware integrated with the Rockchip system-on-chip, although that path performs best, with the lowest achievable processing latency.

Can the demos easily convert to service-oriented background processes that monitor one or more input folders receiving incoming images? After some other transfer process, be it a streaming grabber or just a simple file copy, the background process sees the new image and runs it through a preloaded Rockchip neural network. The answer is yes.

Proof of Concept

The modifications alter the usage of the basic demo as follows. The command line accepts three arguments: a model file, an input directory to monitor for incoming bitmaps, and an output directory for writing processed results.

Usage: ./rknn_ssd_demo
  -m <model>.rknn
  -i <input directory>
  -o <output directory>

The output directory is optional. Without an output directory, the program does not write output images, although it still continuously runs inputs through the tensor flow engine.

Take care when storing the resulting images: the output directory should not match the input directory, otherwise the i-node monitor will process the outputs as new inputs. The outputs can, however, live within a parent or child folder without triggering a neural processing event; the monitor only sees changes to the given image folder, not its sub-folders.

./rknn_ssd_demo \
  -m model/ssd_inception_v2_rv1109_rv1126.rknn \
  -i /tmp/in \
  -o /tmp/out &

The experiment utilises the “Single-Shot Detector (SSD) Inception” model.

Kernel I-Node Notifications

At the kernel-space/user-space boundary, the following simple C++ wrapper watches for changes to the incoming image directory.

#include <cerrno>
#include <cstdint>
#include <cstdio>
#include <string>
#include <system_error>
#include <unordered_map>
#include <vector>

extern "C" {
#include <poll.h>
#include <sys/inotify.h>
#include <unistd.h>
}

namespace sys {
class inotify {
  const int fd_;
  std::unordered_map<int, std::string> wds_;

public:
  inotify() : fd_{inotify_init1(IN_NONBLOCK)} {
    if (fd_ < 0)
      throw std::system_error(errno,
                              std::system_category());
  }

  int add_watch(const std::string &pathname,
                uint32_t mask = IN_ALL_EVENTS) {
    const int wd =
        inotify_add_watch(fd_, pathname.c_str(), mask);
    if (wd < 0)
      throw std::system_error(errno,
                              std::system_category());
    wds_.insert(std::make_pair(wd, pathname));
    return wd;
  }

  virtual ~inotify() { close(fd_); }

  int poll(short events = POLLIN, int timeout = 0) {
    pollfd fds = {fd_, events, 0};
    int rc = ::poll(&fds, 1, timeout);
    if (rc < 0)
      throw std::system_error(errno,
                              std::system_category());
    return rc == 1 ? fds.revents : 0;
  }

  struct event {
    std::string wd;
    uint32_t mask;
    std::string name;
  };

  std::vector<event> read() {
    // Align the buffer for the variable-length
    // inotify_event records the kernel writes into it.
    alignas(inotify_event) char buf[BUFSIZ];
    const auto len = ::read(fd_, buf, sizeof(buf));
    if (len < 0)
      throw std::system_error(errno,
                              std::system_category());
    std::vector<event> events;
    for (auto ptr = buf; ptr < buf + len;) {
      const auto *ev =
          reinterpret_cast<const inotify_event *>(ptr);
      events.push_back(inotify::event{
          .wd = wds_[ev->wd],
          .mask = ev->mask,
          // len is zero for events on the watched
          // directory itself; no name follows.
          .name = ev->len ? std::string(ev->name)
                          : std::string()});
      ptr += sizeof(inotify_event) + ev->len;
    }
    return events;
  }
};
} // namespace sys
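
Used on its own, the wrapper reduces to a construct-watch-poll loop. The following minimal sketch, which assumes the /tmp/in path used later in this article, prints the name of every file closed after writing inside the watched folder.

#include <iostream>

int main() {
  sys::inotify watcher;
  // Watch only for files closed after writing.
  watcher.add_watch("/tmp/in", IN_CLOSE_WRITE);
  for (;;) {
    // One-second poll timeout; go back around on timeout.
    if ((watcher.poll(POLLIN, 1000) & POLLIN) == 0)
      continue;
    for (const auto &event : watcher.read())
      std::cout << event.wd << "/" << event.name
                << std::endl;
  }
}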

Converting the demo to a service daemon requires that the code run continuously as a background process. The new daemon program has a simple preamble. It extracts the model file path from the command line, along with the watch directory for input bitmaps and the output path.

  const char *model_path = nullptr;
  const char *output_path = nullptr;

  sys::inotify inotify;
  int ch;
  while ((ch = getopt(argc, argv, "m:i:o:")) >= 0)
    switch (ch) {
    case 'm':
      model_path = optarg;
      break;
    case 'i':
      inotify.add_watch(optarg);
      break;
    case 'o':
      output_path = optarg;
      break;
    default:
      std::cerr << "Usage: " << argv[0]
        << " -m <model>.rknn "
           "-i <input directory> "
           "-o <output directory>" << std::endl;
      exit(EXIT_FAILURE);
    }
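
The preamble as shown accepts an empty command line. A production build would probably reject a missing model up front; a minimal guard placed straight after the getopt loop might read as follows (the error text is illustrative only).

  // Refuse to start without a model; without at least
  // one -i watch the poll loop below simply idles.
  if (!model_path) {
    std::cerr << "No model specified (-m)." << std::endl;
    exit(EXIT_FAILURE);
  }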

Polling for Bitmaps

The following snippet slots in between the demo’s model-loading and image-loading stages. It polls for i-node notifications with a one-second timeout, looking for “close-write” events on file names ending in .bmp.

for (;;) {
  // Drain all pending notifications before polling again
  // with the one-second timeout.
  while (inotify.poll(POLLIN, 1000) & POLLIN)
    for (const auto &event : inotify.read()) {
      // React only to bitmaps closed after writing.
      if (!ends_with(event.name, ".bmp") ||
          (event.mask & IN_CLOSE_WRITE) == 0)
        continue;
      const std::string wd_name =
          event.wd + "/" + event.name;
      const char *img_path = wd_name.c_str();
      std::cout << img_path << std::endl;
      using namespace std::chrono;
      const auto start = high_resolution_clock::now();
      {
        // load bitmap, set inputs, run, post-process,
        // draw; img and input_data come from this block
      }
      if (output_path)
        img.save((std::string(output_path) + "/" +
                  event.name).c_str());
      if (input_data)
        stbi_image_free(input_data);
    }
}
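
The snippet relies on an ends_with helper for the file-name suffix test, a helper not shown here. Since std::string only gains ends_with in C++20, one possible minimal definition looks like this.

// True when str ends with the given suffix.
static bool ends_with(const std::string &str,
                      const std::string &suffix) {
  return str.size() >= suffix.size() &&
         str.compare(str.size() - suffix.size(),
                     suffix.size(), suffix) == 0;
}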

Close-write events occur when some other program closes a file that it opened for writing. The snippet elides the per-image recognition stages (a hedged sketch follows the list):

  1. loading of the bitmap,
  2. setting up the Rockchip neural inputs,
  3. running the tensor computation on the neural processing unit (NPU),
  4. post-processing the tensor outputs and
  5. drawing the results over the original bitmap.
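
A condensed sketch of those stages might look like the following. It assumes the stb_image loader and the RKNN C API calls used by the stock demo (stbi_load, rknn_inputs_set, rknn_run, rknn_outputs_get), assumes the incoming bitmap already matches the model’s 300x300 RGB input, and elides the demo-specific post-processing and drawing; the recognise helper name is hypothetical.

// Hypothetical sketch: run one bitmap through an already
// initialised rknn_context.
static bool recognise(rknn_context ctx,
                      const char *img_path) {
  int width = 0, height = 0, channels = 0;
  // 1. Load the bitmap as 8-bit RGB.
  unsigned char *input_data =
      stbi_load(img_path, &width, &height, &channels, 3);
  if (!input_data)
    return false;

  // 2. Describe the single NHWC uint8 input tensor.
  rknn_input inputs[1] = {};
  inputs[0].index = 0;
  inputs[0].buf = input_data;
  inputs[0].size =
      static_cast<uint32_t>(width * height * 3);
  inputs[0].type = RKNN_TENSOR_UINT8;
  inputs[0].fmt = RKNN_TENSOR_NHWC;
  rknn_inputs_set(ctx, 1, inputs);

  // 3. Run the network on the NPU.
  rknn_run(ctx, nullptr);

  // 4. Fetch the SSD box and score tensors as floats.
  rknn_output outputs[2] = {};
  outputs[0].want_float = 1;
  outputs[1].want_float = 1;
  rknn_outputs_get(ctx, 2, outputs, nullptr);

  // 5. Post-process and draw over the original bitmap
  //    (demo-specific helpers elided).

  rknn_outputs_release(ctx, 2, outputs);
  stbi_image_free(input_data);
  return true;
}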

Building Demos

Rockchip’s demo programs encapsulate the specific requirements for each type of model.

Build the demos using the following Bash script. Notice that the cross-compiler toolchain path calculation presumes that the script resides within the examples folder. Adjust accordingly if not the case.

#!/bin/bash
set -x
export RV1109_TOOL_CHAIN=$(realpath \
  "$(dirname "$0")/../../../../..")/prebuilts/gcc/linux-x86/arm/gcc-arm-8.3-2019.03-x86_64-arm-linux-gnueabihf
for demo in *_demo; do
    pushd $demo
    ./build.sh
    popd
done

Pushing the Demo

Run the following ADB command from the rknn_ssd_demo example directory, found at external/rknpu/rknn/rknn_api/examples relative to the SDK root. It pushes the install directory to /userdata on the Rockchip device.

adb -H host.docker.internal push install /userdata

The -H host option overrides the default ADB server interface, which defaults to localhost. Add the option to point the client at the server hosting the device’s USB connection. In this case, the build system resides within a container where host.docker.internal resolves to the Windows container host.

Launch the Service

Change the current working directory to the push destination and run the demo as a service.

mkdir /tmp/in /tmp/out
cd /userdata/install/rknn_ssd_demo
./rknn_ssd_demo \
  -m model/ssd_inception_v2_rv1109_rv1126.rknn \
  -i /tmp/in \
  -o /tmp/out &

The command-line arguments specify the input folder and the output folder. Note that the demo program needs to launch from its installed directory, otherwise it terminates with a segmentation fault: the demo looks for coco_labels_list.txt under the model folder relative to the current working directory. For production, this limitation could be corrected by providing the path to the label list as a command-line option, as sketched below.
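
For example, a hypothetical -l option could carry the label-list path through getopt; the labels_path variable and its default are illustrative, and the stock sources would still need changing wherever they hard-code the path.

  const char *labels_path = "model/coco_labels_list.txt";
  // ...
  while ((ch = getopt(argc, argv, "m:i:o:l:")) >= 0)
    switch (ch) {
    // ... existing -m, -i and -o cases ...
    case 'l':
      labels_path = optarg; // hypothetical label list
      break;
    }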

The new service begins by initialising the Rockchip hardware and loading the SSD model, as follows.

Loading model ...
librknn_runtime version 1.7.3

Testing and Evaluation

The following tests utilise the ADB bridge over USB. USB is fast enough to approximate the capture rate of a USB camera. The R snippet below lists the functional testing tools used to exercise the recognition service, using a multi-core Windows host as a 20-node compute cluster.

#' @examples
#' \dontrun{
#' unsplash.rknn() |> imager::load.image() |> plot()
#' }
unsplash.rknn <- \(..., dim = "300x300") {
  tempfile(fileext = ".bmp") -> bmp
  url <- paste0(
    "https://source.unsplash.com/random/", dim, ".jpg")
  imager::save.image(imager::load.image(url), bmp)
  system2("adb", c("push", bmp, "/tmp/in"))
  tmp <- file.path("/tmp/out", basename(bmp))
  # Poll until the service writes the marked-up bitmap,
  # then pull it back over the local copy.
  while (system2("adb", c("pull", tmp, dirname(bmp))) != 0L) {}
  system2("adb", c("shell", "rm", tmp))
  bmp
}

tmp.out <- \(x, ext = "txt") {
  system2("adb",
          c("shell", "cat",
            file.path("/tmp/out",
              xfun::with_ext(basename(x), ext))),
          stdout = TRUE)
}

#' @examples
#' \dontrun{
#' # Process 100 random bitmaps.
#' # Load their results.
#' sfLapply(1:100, unsplash.rknn) -> bmp
#' sfLapply(bmp, tmp.csv) -> csv
#' sfLapply(bmp, tmp.csv, ext = "txt") -> txt
#' }
tmp.csv <- \(x, ext = "csv") {
  read.csv(text = tmp.out(x, ext))
}

#' Compiles a data frame for a single RKNN run.
#'
#' The data frame includes:
#' - the name of the recognised object,
#' - its left, top, right and bottom bitmap bounds,
#' - millisecond Rockchip performance, and the
#' - marked-up bitmap path.
#'
#' Requires an ADB connection to the system-on-chip.
#'
#' @examples
#' \dontrun{
#' do.call(rbind, sfLapply(bmp, unsplash))
#' }
unsplash <- \(bmp) {
  csv <- tmp.csv(bmp)
  if (nrow(csv) != 0L)
    cbind(csv, tmp.csv(bmp, "txt"), data.frame(bmp = bmp))
}

The test machine launches 20 concurrent processes as a compute cluster and applies 1000 concurrent “unsplash.rknn” jobs; each job pulls a random image from the Internet, pushes it through the “inotify” and RKNN service, then reloads the marked-up result.

library(snowfall)
sfInit(parallel = TRUE, cpus = 20L)
sfLapply(1:1000, unsplash.rknn)
load("ms.RData")
hist(ms, main = "RKNN performance (ms)")

See Figure 1: the vast majority of frames pass through the NPU in around 70 ms. See also Figure 2: the recognition performance and the compute performance are not correlated.

Conclusions

Generic processing of arbitrary ‘image frame capture’ sources is the primary advantage of the “inotify” approach. It allows for any source that can write an image to the filesystem, and for multiple sources simultaneously, since the architecture does not care where the frames originate nor how they interrelate. The order of kernel events determines the priority of recognition: first come, first served. Effectively, the approach converts the neural network into a fully abstract service with a filesystem interface.

The experiment turns out to be quite successful. The “inotify” method of interaction with the RKNN proves extremely responsive. The average latency between seeing a notification and responding with a completed recognition-marked bitmap sits around 70 milliseconds, plus or minus a few milliseconds. This amounts to a core neural detection capability of roughly 14 frames per second (1000 ms / 70 ms per frame), if the system were capable of feeding the NPU at that rate. Feeding the hardware may prove to be the bigger challenge. If it can be fed, example scenarios might include performing object recognition on multiple cameras at the same time.

This exercise only demonstrates a proof of principle. It has limitations. For one thing, the neural processing service reads and writes images through the Linux filesystem rather than shared memory. If the filesystem mount point references RAM, as it does in this exercise, the latency reduces. If further optimised to use shared memory, the copying associated with reading and writing large bitmaps through a filesystem could disappear.

Future Directions

Currently, the program outputs the symbolic information derived from each frame only as text to standard output, as in the snippet below. The data identifies each recognised object, its location within the input frame and the match likelihood as a probability.

person @ (14 123 59 211) 0.976142
bicycle @ (170 165 279 234) 0.972723
person @ (110 121 151 197) 0.968828
car @ (146 133 216 170) 0.953655
person @ (208 117 255 220) 0.953655
car @ (1 138 13 154) 0.819371
person @ (84 132 94 158) 0.665369
person @ (49 133 58 157) 0.601661
car @ (133 127 160 166) 0.568302
car @ (129 123 169 150) 0.465688

This information could also be transmitted over the MAVLink mesh. If the system knows the location and orientation of the source camera at the time that the frame was encoded, it may be possible to compute a search area when looking for specific object types.

The 14 frames-per-second rate could be increased somewhat if required. Each frame’s stages include pre- and post-processing, executed by the quad-core ARM in a single thread; one core at a time handles the steps surrounding the tensor computation, not necessarily the same core. If re-architected to support multi-threading, the recogniser service could spread those compute-bound surrounding stages over multiple threads, thereby allowing all four ARM cores to participate.
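
One possible shape for that re-architecture, sketched here with standard C++ threading primitives and making no claim to match the demo’s internals: the inotify loop pushes bitmap paths onto a shared queue, while a small pool of workers, one per ARM core, pops paths and performs the pre-processing, NPU submission and post-processing for each.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Hypothetical work queue shared between the inotify
// thread (producer) and the worker threads (consumers).
class work_queue {
  std::queue<std::string> paths_;
  std::mutex mutex_;
  std::condition_variable ready_;

public:
  void push(std::string path) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      paths_.push(std::move(path));
    }
    ready_.notify_one();
  }

  std::string pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    ready_.wait(lock, [this] { return !paths_.empty(); });
    std::string path = std::move(paths_.front());
    paths_.pop();
    return path;
  }
};

Each worker would loop over pop(), with access to the RKNN context either serialised behind its own mutex or replicated per thread, depending on what the RKNN runtime permits.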

Multi-Model

In its current form, the recognition service utilises just one model. This need not be the case, however. The service could expand to host multiple models, applied either to all incoming images or selectively to particular images on demand. The service could implement a priority queue that sequences RKNN run and load_model events according to the set of incoming images and the model applications requested for those images. Optimisation methods such as A* could predetermine the path of least resistance by plotting a network of events with the least total cost.
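
As a very loose sketch of that idea, and not something the demo implements, a cost-ordered job queue might pair each incoming image with a requested model and penalise jobs that force a model reload; the job and by_cost names are hypothetical.

#include <queue>
#include <string>
#include <vector>

// Hypothetical job record: one image, the model it should
// pass through, and a cost that penalises switching away
// from the model already resident on the NPU.
struct job {
  std::string image;
  std::string model;
  int cost; // e.g. 0 if the model is loaded, 1 otherwise
};

struct by_cost {
  bool operator()(const job &a, const job &b) const {
    return a.cost > b.cost; // lowest-cost job served first
  }
};

using job_queue =
    std::priority_queue<job, std::vector<job>, by_cost>;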