MaixPy: Run Big Models from Flash!

Why do we need to run the model from flash?

The K210 has limited RAM: 6 MB general-purpose plus 2 MB KPU memory.
The MaixPy firmware costs 1~2 MB, and the picture buffer costs another 0.5~1 MB,
so the remaining RAM is tight for an AI model.
Sometimes we have to choose a minimum-function MaixPy build just to fit a bigger AI model.

Now we have a new choice: run the model directly from flash!
This new MaixPy version supports running AI models from flash without loading them into RAM, so you have more RAM for normal work, and the model size is limited only by the flash size.

New API

task = kpu.load_flash(model_addr, is_dual_buf, batch_size, spi_speed)
  1. model_addr: the flash address where your model is stored. Note that you need to flip the model's endianness first; use convert_le.py to convert a normal model. Only V3 models are supported for now.
  2. is_dual_buf: 0 = single buffer, which uses less RAM but runs slower; 1 = dual buffer, which uses more RAM but runs faster.
  3. batch_size: when dual buffer is chosen, you need to set the load batch size. Suggested values are 0x4000~0x10000; test to find the best value for your model.
  4. spi_speed: while the flash runner is active, the flash is temporarily switched to high-speed mode; set the SPI speed you want here. The value should be <= 80000000.

After loading, you can use the normal kpu.forward to run inference on your model, as in the sketch below.
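
A minimal end-to-end sketch, assuming a V3 kmodel (already endian-flipped with convert_le.py) burned at flash address 0x300000 and tiger.jpg stored on SPIFFS; the address, batch size, and file name here are example values, not fixed ones:

import image
import KPU as kpu

# ASSUMPTION: endian-flipped V3 kmodel burned at flash address 0x300000
task = kpu.load_flash(0x300000, 1, 0x8000, 80000000)  # dual buf, 0x8000 batch

img = image.Image("/flash/tiger.jpg")  # test picture on SPIFFS
img.pix_to_ai()                        # refresh the AI buffer for a file-loaded image
fmap = kpu.forward(task, img)          # same call as for a RAM-loaded model
plist = fmap[:]
print("label idx =", plist.index(max(plist)))

kpu.deinit(task)                       # free the runner when done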

Test it

Here is the test script (in the attachment below).
You need to burn the two kmodels first and put tiger.jpg on the SPIFFS,
then set the CPU frequency to 480 MHz and run the script. The result is:

(481, 398)	#CPU freq & KPU freq
ram run model test
label idx=292
load 421 ms, forward 40 ms

flash run model test (single buf)
SPI freq 80166666 Hz
label idx=292
load 2 ms, forward 106 ms

flash run model test (dual buf)
SPI freq 80166666 Hz
label idx=292
load 2 ms, forward 83 ms
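
For reference, a rough sketch of how these timings can be measured (the model address and image path here are assumptions; the attached script is the authoritative version):

import time
import image
import KPU as kpu

img = image.Image("/flash/tiger.jpg")
img.pix_to_ai()

# RAM run: time the model load and the forward pass separately
t = time.ticks_ms()
task = kpu.load(0x200000)              # ASSUMPTION: kmodel burned at 0x200000
load_ms = time.ticks_ms() - t

t = time.ticks_ms()
fmap = kpu.forward(task, img)
forward_ms = time.ticks_ms() - t

plist = fmap[:]
print("label idx=%d" % plist.index(max(plist)))
print("load %d ms, forward %d ms" % (load_ms, forward_ms))
kpu.deinit(task)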

Conclusion

You can see the normal RAM run spends a long time loading the model but is very fast at inference.
The single-buffer flash run costs 2.65X the inference time (106 ms vs 40 ms),
and the dual-buffer flash run costs about 2X (83 ms vs 40 ms).

When using QSPI PSRAM (133 MHz), the dual-buffer run costs 53 ms, about 1.3X.
We will also test OSPI PSRAM (133 MHz * 2); it should come in around 45 ms, about 1.1X.
So it is possible to run models up to the flash size (16 MB) without losing too much speed.

MaixPy Flash Runner.zip (6.2 MB)
http://dl.sipeed.com/MAIX/MaixPy/release/temp/MaixPy%20Flash%20Runner.zip

I was wondering if load_flash is compatible with YOLO networks and anchor boxes?

Re-recognition does not work; the second run fails with:
ValueError: [MAIXPY]kpu: check img format err!

For now you have to reload the model on every iteration, like this:

while 1:
    task = kpu.load_flash(0x200000, 1, 0xC000, 80000000)
    # ... run kpu.forward(task, img) here ...
    a = kpu.deinit(task)

This is how things are for now, but it only takes about 2 ms of overhead to reload from flash (I'm talking about MobileNet).

After some research and trial and error, I came to the conclusion that it does, except you must hardcode the
anchor boxes somehow and run your YOLOv3 inference entirely through kpu.forward. After that you
have to parse the huge output tensor manually: in the case of YOLO with 1 detection label you get a tensor shaped (7, 7, 30), which ends up giving you a flat array of 1470 elements.
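
For illustration, a rough sketch of that manual parsing for a YOLOv2-style region output with 1 class and 5 anchors; the anchor values, memory layout, and threshold are all assumptions that you must match to your own network:

import math

ANCHORS = [(1.08, 1.19), (3.42, 4.41), (6.63, 11.38), (9.42, 5.11), (16.62, 10.52)]  # assumed
GRID = 7
BOX_LEN = 6       # x, y, w, h, objectness, 1 class score -> 5 anchors * 6 = 30 channels
THRESHOLD = 0.5

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode(flat):
    # flat: list of 7*7*30 = 1470 raw values, e.g. from kpu.forward(task, img)[:]
    # ASSUMPTION: values laid out as [grid row][grid col][anchor][box fields]
    boxes = []
    for gy in range(GRID):
        for gx in range(GRID):
            for a in range(len(ANCHORS)):
                base = ((gy * GRID + gx) * len(ANCHORS) + a) * BOX_LEN
                tx, ty, tw, th, tobj, tcls = flat[base:base + BOX_LEN]
                score = sigmoid(tobj) * sigmoid(tcls)
                if score < THRESHOLD:
                    continue
                cx = (gx + sigmoid(tx)) / GRID           # box center, 0..1
                cy = (gy + sigmoid(ty)) / GRID
                w = ANCHORS[a][0] * math.exp(tw) / GRID  # box size, 0..1
                h = ANCHORS[a][1] * math.exp(th) / GRID
                boxes.append((cx, cy, w, h, score))
    return boxes  # a real pipeline would also run non-max suppression on these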