MaixPy now supports the new nncase & Kmodel V4~

What is the new nncase

nncase v0.2 is the new nncase; it generates kmodel V4 (the old one generates kmodel V3).
Kmodel V4 supports more ops and multiple outputs, and is written in modern C++ 17.
The new nncase & Kmodel V4 support more NN models,
but they also cost more memory (an extra 360KB of RAM),
and they are still in development, so some ops are not implemented as well as in V3.
We converted the same mobilenet tflite model to both V3 and V4 and ran them:
V3 costs 37ms, while V4 costs 111ms.
This is because V4's MATMUL is not implemented on the KPU, but on the CPU.

You are welcome to take part in improving the new nncase!

Download nncase

Or you can download it from CI.

Use nncase

For example, to convert facedetect.tflite & mbnet75.tflite:
put some sample pictures into the dataset dir.
input-mean & input-std depend on your training.

./ncc compile facedetect.tflite fd.kmodel -i tflite --dataset image --input-mean 0.5 --input-std 0.5 --max-allocator-solve-secs 60 --calibrate-method l2 -v

./ncc compile mbnet75.tflite mbnet75.kmodel -i tflite --dataset image --input-mean 0.5 --input-std 0.5 --max-allocator-solve-secs 60 --calibrate-method l2 -v
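For reference, here is my understanding of what --input-mean and --input-std do (a sketch, assuming ncc first scales the pixel to [0,1] and then applies (x - mean) / std; verify against your own training preprocessing):

```python
def normalize(pixel_0_255, mean=0.5, std=0.5):
    """Sketch of ncc's input normalization (assumed, not from the ncc docs)."""
    x = pixel_0_255 / 255.0   # scale the raw 8-bit pixel to [0,1]
    return (x - mean) / std   # apply the mean/std given on the ncc command line

print(normalize(0))      # black pixel  -> -1.0
print(normalize(255))    # white pixel  ->  1.0
```

With mean 0.5 and std 0.5, the input range becomes [-1, 1], which matches the common mobilenet training setup.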

And you can run inference on the PC to test the conversion:
put verification pictures into the dataset dir, and it will generate results in the output dir.

./ncc infer mbnet75.kmodel output --dataset input --input-mean 0.5 --input-std 0.5

It generates a binary file, which you can read via Python:

import numpy as np
import struct

f = open('mb.bin', 'rb')
# 1000 float32 values, 4 bytes each
data = struct.unpack('f' * 1000, f.read(4 * 1000))
f.close()
print(np.argmax(data))  # index of the highest-scoring class

Run Kmodel V4 in MaixPy

Burn the newest MaixPy firmware in the attachment (we will upload it after the Spring Festival).
Burn the kmodel to the right place.
Run the demo py code in the attachment.

You will notice V4 needs an extra init API:

kpu.set_outputs(task, output_idx, w,h,ch) 

You need to specify the model's output shape in kmodel V4 (in kmodel V3, we had a method to infer the output shape automatically).

It takes 5 args:

  1. task: the kpu task
  2. output_idx: the index of the output; V4 supports multiple outputs.
  3~5. w, h, ch: width, height, and channel of the output; if the output has fewer than 3 dims, just write 1,h,ch or 1,1,ch.
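As an illustration only (this helper is not part of the MaixPy API), padding an arbitrary output shape to the (w, h, ch) triple described above can be sketched as:

```python
def to_whc(shape):
    """Pad an output shape to 3 dims by prepending 1s, as described above.
    E.g. a flat 1000-class output (1000,) becomes (1, 1, 1000)."""
    dims = list(shape)
    if len(dims) > 3:
        raise ValueError("output has more than 3 dims")
    while len(dims) < 3:
        dims.insert(0, 1)
    return tuple(dims)

print(to_whc((1000,)))      # (1, 1, 1000)
print(to_whc((7, 125)))     # (1, 7, 125)
print(to_whc((7, 7, 125)))  # (7, 7, 125)
```

So a classifier with a flat 1000-class output would be registered as kpu.set_outputs(task, 0, 1, 1, 1000).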

When you run the kpu.forward method, it will check whether the output shape has been set.

After the forward operation, you can get a single output result the same way as in V3:

fmap = kpu.forward(task, img)

If your model has multiple outputs, you need to use this API to get each result:

fmap=kpu.get_output(task, output_idx)

And we fixed the kpu.deinit bug, so you can load/unload models normally.

Multi-output demo

(Added on 2020.1.29)
This is a demo to explain how to get multiple outputs in kmodel V4.
Here we have a model with 8 outputs in its graph:

When you use nncase to convert it to a kmodel, you will get a printout like this:

0	input/ImageInput	1x3x64x64
0	l_eye_up_x/l_eye_up_x	1
1	l_eye_up_y/l_eye_up_y	1
2	l_eye_down_x/l_eye_down_x	1
3	l_eye_down_y/l_eye_down_y	1
4	r_eye_up_y/r_eye_up_y	1
5	r_eye_down_y/r_eye_down_y	1
6	up_mouth_y/up_mouth_y	1
7	down_mouth_y/down_mouth_y	1

It indicates the index of every output; you can use the index to get the corresponding output in MaixPy.
For example, fmap=kpu.get_output(task_ld, 1) gets the left eye's y-axis position.

Details and demo code are in the attachment. (2.7 MB)


This is the first temporary version of MaixPy that supports Kmodel V4; please test it and give feedback on any problems you have met.
Just replying in this post is OK.

attachment download link:


Good news!

The new nncase kpu is not using DMA for KPU upload; is it fixed in your release?

Hi, it is not a TODO, but a workaround for a BUG.
The K210 has a bug: DMA access to cache memory will cause a hang with very small probability,
so the DMA method is disabled.
Do you have time to help improve the new nncase? Many ops need optimization.

Who will fix the DMA bug? (Is it a software or a hardware issue?) Were your earlier models, before KModel 4, using memcpy for KPU upload, or were they using DMA? With the new nncase there is a 20-27ms overhead per frame for a 320x240 image.

Regarding nncase improvement, I can contribute; please let me know how to proceed. You have my e-mail address, so you can mail me directly.

The DMA bug is a hardware issue.

Can you provide me with the OLD mobilenet Kmodel which executes in 37ms? I have downloaded mobilenet_0x300000.kfpkg from

Is this that model? This one takes 60ms, and it has fewer layers than the recent mobilenet model generated with the new nncase.

Try this one in the package:

You are right, but this model doesn't have any matrix multiplication at all:

layer 0 [K210Conv]: 0.230000 ms
layer 1 [K210Conv]: 0.037000 ms
layer 2 [K210Conv]: 0.703000 ms
layer 3 [K210Conv]: 0.066000 ms
layer 4 [K210Conv]: 2.730000 ms
layer 5 [K210Conv]: 0.039000 ms
layer 6 [K210Conv]: 1.392000 ms
layer 7 [K210Conv]: 0.039000 ms
layer 8 [K210Conv]: 2.778000 ms
layer 9 [K210Conv]: 0.027000 ms
layer 10 [K210Conv]: 1.475000 ms
layer 11 [K210Conv]: 0.026000 ms
layer 12 [K210Conv]: 2.932000 ms
layer 13 [K210Conv]: 0.021000 ms
layer 14 [K210Conv]: 1.718000 ms
layer 15 [K210Conv]: 0.021000 ms
layer 16 [K210Conv]: 1.718000 ms
layer 17 [K210Conv]: 0.022000 ms
layer 18 [K210Conv]: 1.718000 ms
layer 19 [K210Conv]: 0.021000 ms
layer 20 [K210Conv]: 1.718000 ms
layer 21 [K210Conv]: 0.021000 ms
layer 22 [K210Conv]: 1.718000 ms
layer 23 [K210Conv]: 0.021000 ms
layer 24 [K210Conv]: 3.417000 ms
layer 25 [K210Conv]: 0.030000 ms
layer 26 [K210Conv]: 4.590000 ms
layer 27 [Dequantize]: 1.621000 ms
layer 28 [GAP]: 1.094000 ms
layer 29 [Quantize]: 0.113000 ms
layer 30 [K210AddPad]: 0.080000 ms
layer 31 [K210Conv]: 4.164000 ms
layer 32 [K210RemovePad]: 0.049000 ms
layer 33 [Dequantize]: 0.045000 ms
layer 34 [Softmax]: 0.918000 ms

Can you point out which layer is the matrix multiplication?

The mobilenet which has the matrix multiplication has more layers:

KPUConv2D: 0.234
KPUConv2D: 0.04
KPUConv2D: 0.706
KPUConv2D: 0.069
KPUConv2D: 2.733
KPUConv2D: 0.042
KPUConv2D: 1.395
KPUConv2D: 0.041
KPUConv2D: 2.78
KPUConv2D: 0.03
KPUConv2D: 1.478
KPUConv2D: 0.029
KPUConv2D: 2.936
KPUConv2D: 0.024
KPUConv2D: 1.721
KPUConv2D: 0.024
KPUConv2D: 1.721
KPUConv2D: 0.024
KPUConv2D: 1.721
KPUConv2D: 0.024
KPUConv2D: 1.721
KPUConv2D: 0.024
KPUConv2D: 1.72
KPUConv2D: 0.024
KPUConv2D: 3.42
KPUConv2D: 0.033
KPUConv2D: 7.523
Dequantize: 1.145
Reduce: 15.703
Quantize: 0.034
QuantizedMatMul: 65.238
Dequantize: 0.034
Reduce: 0.42
Quantize: 0.005
QuantizedBinary: 0.849
QuantizedBinary: 0.836
Dequantize: 0.032
Unary: 0.617
Quantize: 0.041
Reduce: 0.411
Quantize: 0.005
QuantizedBinary: 0.926
Dequantize: 0.033

43 layers; are you sure the earlier converter converted the model correctly?

Hi, the MATMUL ops can be converted to conv ops:

layer 29 [Quantize]: 0.113000 ms
layer 30 [K210AddPad]: 0.080000 ms
layer 31 [K210Conv]: 4.164000 ms
layer 32 [K210RemovePad]: 0.049000 ms
layer 33 [Dequantize]: 0.045000 ms

Those layers are equivalent to MATMUL and run on the KPU, much faster than on the CPU.
The new nncase hasn't implemented this optimization yet.
I still suggest using kmodel V3 for normal models (faster, smaller, and it can run from flash).
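To see why the two forms are interchangeable, here is a small numpy sketch (illustrative only, not nncase's actual optimization pass) showing that a fully-connected MATMUL over a C-dim vector equals a 1x1 convolution over a 1x1xC feature map:

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out = 8, 4
x = rng.standard_normal(C_in)           # feature vector, e.g. after global average pooling
W = rng.standard_normal((C_out, C_in))  # fully-connected weight matrix

# MATMUL form: y = W @ x
y_matmul = W @ x

# 1x1 conv form: treat x as a 1x1 spatial map with C_in channels and
# convolve it with C_out kernels of shape 1x1xC_in.
fmap = x.reshape(1, 1, C_in)             # H=1, W=1, C=C_in
kernels = W.reshape(C_out, 1, 1, C_in)   # C_out kernels of 1x1xC_in
y_conv = np.array([(fmap * k).sum() for k in kernels])

print(np.allclose(y_matmul, y_conv))  # True
```

This is why the KPU, which only accelerates convolutions, can still execute the final classifier layer once the converter rewrites it this way.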

@Zepan Thank you for the explanation; I am trying to understand this conversion process. I was wondering whether it had to do with the KPU-aware opcode not being used.

@Zepan Is the original training code for the facedetect model in the package still unavailable? It works oh so remarkably well :slight_smile: It would be very interesting to try training it with other datasets.

It is based on yolov2; you can refer to:


Nice, thank you very much! I hadn't seen this tutorial before; it seems to be quite complete. Thank you for the reference!


Great news about support for KMODEL V4! I love the new operations being available, and multi-output will come in handy.
I tested classification models and they work as expected with the 0.5.0-31 firmware.

But I couldn't run a yolo v2 detection model, unfortunately. This architecture used to work fine when converted to kmodel v3. You can check the architecture in this Colab notebook.

When I export the Keras model to KMODEL v4 and try to run it on a Maixduino (0.5.0-31 firmware), this is the error I get:

Additionally, the example script you uploaded throws an error in the check_key() function. I suppose it is not necessary (the model is not generated with my device's unique key), so I commented it out.

Could you please find the time to have a look at the issue?

thank you very much


Anything wrong?

Well, yes. As it says, there is not enough memory to fit the model. The largest model I could successfully run with the mpython firmware is 1.9 MB. I can stretch it a bit more, but not to 6 MB. The K210 only has 5.5 MB of RAM available.

I found the source of my problem: despite not being documented anywhere, yolo networks also have to have their output shape explicitly set after loading the model. If anyone encounters the same problem, have a look at this example script.

Hi all, I am new here.

I'm trying to make this network work with the K210.
I got their trained network in Keras (.h5 + .json), then I was able to convert it to TensorFlow (.pb), then to tflite, and then to kmodel using Maix_Toolbox and ncc (ver 0.2).

Here are the files in all formats (4.6 MB)

If I use ncc compile with --inference-type float, I obtain a kmodel that gives really good results with ncc infer. If I compile with uint8, the kmodel is not as good, but this is not the real problem.
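For anyone wondering why the uint8 results degrade, here is a generic sketch of the rounding error introduced by 8-bit affine quantization (illustrative only; ncc's calibration scheme differs in its details):

```python
import numpy as np

# Round-trip a tensor through 8-bit affine quantization and back.
x = np.linspace(-1.0, 1.0, 11)
scale = (x.max() - x.min()) / 255.0        # one step of the 8-bit grid
zero_point = np.round(-x.min() / scale)    # uint8 value that represents 0.0

q = np.clip(np.round(x / scale + zero_point), 0, 255).astype(np.uint8)
x_hat = (q.astype(np.float32) - zero_point) * scale

print(np.max(np.abs(x - x_hat)))  # small but nonzero quantization error
```

Good calibration data (the --dataset dir) helps ncc pick ranges that keep this error small, which is why the sample pictures matter.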

The problem is that when I try to use the model with the following Python script, the program freezes without giving any error.

import sensor, image, lcd, time
import KPU as kpu

lcd.init()
sensor.reset()                    # standard sensor init (elided in the original post)
sensor.set_pixformat(sensor.RGB565)
sensor.set_framesize(sensor.QVGA)
sensor.set_windowing((224, 224))
sensor.run(1)

task = kpu.load("/sd/new_model.kmodel")
a = kpu.set_outputs(task, 0, 1, 1, 1)  # added for V4
clock = time.clock()
while True:                        # main loop (lost in the original paste)
    img = sensor.snapshot()
    a = kpu.forward(task, img)
    fmap = kpu.get_output(task, 0)
    p = fmap[0]                    # single scalar output
    a = lcd.display(img, oft=(0,0))
    lcd.draw_string(0, 224, "%s:%.2f"%("p", p))
a = kpu.deinit(task)

If I use the float kmodel, the program doesn't freeze; it runs at 0.3 fps but always gives 1.0 as output, while with ncc infer it gives near 0.0 for one kind of input and near 1.0 for the other kind, as it should.

I attach two images of the two kind: