IMAX2/3/4 Applications

crypto/sha256, fft/fft, filter/filter (general filters, super-resolution, frame interpolation, depth-image generation, etc.), llama/llama (llama-v2), mm_cnn_lf/cnn, mm_cnn_lf/cnn3d, mm_cnn_lf/gather (discrete stencil: light-field rendering), mm_cnn_lf/gdepth (discrete stencil: light-field depth image), mm_cnn_lf/inv (matrix inversion), mm_cnn_lf/mm (dense matrix multiplication), rsim/rsim (normal MNIST/CIFAR10/CNN), sort/sort (pipelined sort), spgemm/test022 (SpGEMM), spgemm/test024 (sparse matrix compression), ssim/ssim (stochastic MNIST/CIFAR10/CNN), stencil/stencil (various stencil computations, degree=1,2,3), stringsearch/search (string search), tsim/tsim (multithread MNIST/CIFAR10/CNN), vsim/vsim (GGML), vbgmm, graph-cnn, graph-attention, U-net

IMAX2/3/4 Docs/Tutorials

Download IMAX2/3/4

Introduction to IMAX3: Amazing Dataflow-Centric Gen4-CGLA(non-CGRA) (CGLA:Coarse Grained Linear Array)

Introductory slides with synthesizable notes

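0.Understanding computers unconventionally (0.Trailer) (Japanese edition) 0.IMAX3 begins(0.Trailer)
1.Understanding computers unconventionally (1.Where should collected data be placed?) (Japanese edition) 1.IMAX3 begins(1.Where is the best location to save data?)
2.Understanding computers unconventionally (2.Is there a right way to place data?) (Japanese edition) 2.IMAX3 begins(2.Is there a manner to put data?)
3.Understanding computers unconventionally (3.What is computation, anyway?) (Japanese edition) 3.IMAX3 begins(3.What do you mean by calculation?)
4.Understanding computers unconventionally (4.Is it better to push, or to wait?) (Japanese edition) 4.IMAX3 begins(4.Should I push? Should I wait?)
5.Understanding computers unconventionally (5.What should I study to earn a salary?) (Japanese edition)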

Expert slides with synthesizable notes

0.Let's start Gen3-CGLA(non-CGRA)
1.Introduction
2.Image filters basic
3.Image filters advanced
4.Image filters professional
5.Machine Learning
6.High-degree stencil computation
7.Inverse matrix
8.Sparse matrix and Sorting
9.Hash, FFT and String search
10.High-speed compiler
11.Three level sophisticated loop
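12.Scalability (Japanese edition) 12.Scalability
13.HW/SW codesign (Japanese edition) 13.HW/SW codesign
0-13.Short summary(#1-#13) (Japanese edition) 0-13.Short summary(#1-#13)
0-13.Long summary(#1-#13) (Japanese edition) 0-13.Long summary(#1-#13)
14.Difference from CPU/Vector (Japanese edition) 14.Difference from CPU/Vector
15.How the software-controlled cache works (Japanese edition) 15.Software-controlled cache memory
16.Compatibility with chiplets
17.Dataflow flexibility and optimization guidelines
18.Mapping of 4-dimensional array computation
19.Getting chat.py to run on IMAX3
20.CGLA Amidakuji (ladder lottery) (Japanese edition) 20.Decision Tree
21.Project practicum
22.Enabling host cache memory
23.LLAMA
24.Types of dataflow and mapping
25.More LLM
26.Catalog for startups
27.Summary of patents
28.A reviewer's musings
29.The hands and internals of the freely connectable basic UNIT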

Petalinux 2024.1 IMAX2 Kit for basic CGLA

ZU19EG (16 units) ... Vivado project is included.

  1. linux# zcat ZU19EG-step4000-20241111.img.gz | dd bs=64k of=/dev/mmcblk0 (16GB SDcard)
  2. linux# mount /dev/mmcblk0p2 /mnt
  3. linux# replace root-password in /mnt/etc/shadow
  4. linux# umount /mnt
  5. zu19eg# insert SDcard
  6. zu19eg# boot from SDcard (dhcp)
  7. linux% ssh -Y [email protected] (Xwindow)
  8. zu19eg% zcat proj-arm64.tgz|tar xpf -
  9. zu19eg% cd proj-arm64/sample/mm_cnn_lf
  10. zu19eg% make -f Makefile-zynq.emax6+dma mm-zynq.emax6+dma-16st (how to make)
  11. zu19eg% sudo proj-arm64/sample/mm_cnn_lf/mm-zynq.emax6+dma-16st (matrix-mult)
  12. passwd: temppwd
  13. localhost:11.0: Cannot open display (if this X11 error appears, copy the X authority file to root as in steps 14-15, then retry)
  14. zu19eg% cp ~/.Xauthority /tmp/111
  15. zu19eg% sudo cp /tmp/111 /root/.Xauthority
  16. zu19eg% sudo proj-arm64/sample/mm_cnn_lf/mm-zynq.emax6+dma-16st (retry)
  17. <<<ORIG>>>
  18. usec: ARM:2098589 DRAIN:0 CONF:0 REGV:0 RANGE:0 LOAD:0 EXEC:0 total:2098589 (usec)
  19. <<<IMAX>>>
  20. usec: ARM:426 DRAIN:1224 CONF:105 REGV:1041 RANGE:663 LOAD:14861 EXEC:24324 total:42647 (usec)
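
The two usec lines above can be compared directly. As a rough sketch (the helper names total_usec, speedup and breakdown are hypothetical and not part of the kit), the following shell functions extract the total: field from each line, compute the ORIG/IMAX ratio, and print each phase as a share of the IMAX total. It assumes the output format shown in steps 18 and 20.

```shell
# Sample lines copied from steps 18 and 20 above.
orig='usec: ARM:2098589 DRAIN:0 CONF:0 REGV:0 RANGE:0 LOAD:0 EXEC:0 total:2098589 (usec)'
imax='usec: ARM:426 DRAIN:1224 CONF:105 REGV:1041 RANGE:663 LOAD:14861 EXEC:24324 total:42647 (usec)'

# Print the value of the "total:" field (microseconds) of one usec line.
total_usec() {
  printf '%s\n' "$1" | sed -n 's/.*total:\([0-9][0-9]*\).*/\1/p'
}

# Print the ratio total(first line) / total(second line), one decimal place.
speedup() {
  awk -v a="$(total_usec "$1")" -v b="$(total_usec "$2")" \
      'BEGIN { printf "%.1f\n", a / b }'
}

# Print every KEY:VALUE phase of one usec line as a percentage of its total.
breakdown() {
  printf '%s\n' "$1" | awk '{
    for (i = 1; i <= NF; i++) {
      n = split($i, kv, ":")
      if (n == 2 && kv[1] == "total") total = kv[2]
    }
    for (i = 1; i <= NF; i++) {
      n = split($i, kv, ":")
      if (n == 2 && kv[1] != "total" && kv[1] != "usec")
        printf "%s %.1f%%\n", kv[1], 100 * kv[2] / total
    }
  }'
}

speedup "$orig" "$imax"
breakdown "$imax"
```

For the figures above the ratio comes out at about 49.2, and LOAD and EXEC together account for most of the IMAX run.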

ZCU102+VU440 (64/128/192/256/512 units /single lane) ... Vivado project is included.

  1. vu440# connect with zcu102 (see figure)
  2. vu440# write VU440-step4000-20221020-V24.1-78.125+78.125+48+260+130+48-CRYPTO-SPU.bin to SDcard
  3. vu440# insert SDcard
  4. linux# zcat ZCU102-step4000-20201010.img.gz | dd bs=64k of=/dev/mmcblk0 (16GB SDcard)
  5. linux# mount /dev/mmcblk0p2 /mnt
  6. linux# replace root-password in /mnt/etc/shadow
  7. linux# umount /mnt
  8. zcu102# insert SDcard
  9. zcu102# boot from SDcard (dhcp)
  10. linux% ssh -Y [email protected] (Xwindow)
  11. zcu102% zcat proj-arm64.tgz|tar xpf -
  12. zcu102% cd proj-arm64/sample/mm_cnn_lf
  13. zcu102% make -f Makefile-zynq.emax6+dma mm-zynq.emax6+dma (how to make)
  14. zcu102% sudo proj-arm64/sample/mm_cnn_lf/mm-zynq.emax6+dma (matrix-mult)
  15. passwd: temppwd
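
Step 6 (replacing the root password in /mnt/etc/shadow) amounts to rewriting the second colon-separated field of the root entry. A minimal sketch, assuming a password hash has already been generated (for example with openssl passwd -6, available in OpenSSL 1.1.1 or later). The function name set_root_hash and the sample hash are hypothetical, and the demo edits a temporary file rather than a real shadow file.

```shell
# Replace the password-hash field (2nd field) of the root entry.
# usage: set_root_hash SHADOW_FILE NEW_HASH   (hypothetical helper)
set_root_hash() {
  awk -F: -v OFS=: -v h="$2" '$1 == "root" { $2 = h } { print }' "$1" > "$1.new" || return 1
  # Overwrite in place so the file keeps its owner and permissions.
  cat "$1.new" > "$1" && rm -f "$1.new"
}

# Demo on a throwaway file; on the SD card the target would be /mnt/etc/shadow.
demo=$(mktemp)
printf '%s\n' 'root:*:19000:0:99999:7:::' 'daemon:*:19000:0:99999:7:::' > "$demo"
set_root_hash "$demo" '$6$examplesalt$examplehash'   # placeholder, not a valid hash
cat "$demo"
```

Only the root line changes; all other entries are copied through untouched.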

Petalinux 2024.1 IMAX3 Kit for professional CGLA

VMK180 (32 units) ... Vivado project is included.

  1. linux# zcat alice139-step4000.img.gz | dd bs=64k of=/dev/mmcblk0 (32GB SDcard)
  2. linux# mount /dev/mmcblk0p2 /mnt
  3. linux# replace root-password in /mnt/etc/shadow
  4. linux# umount /mnt
  5. vmk180# insert SDcard
  6. vmk180# boot from SDcard (dhcp)
  7. linux% ssh -Y [email protected] (Xwindow)
  8. vmk180% zcat proj-arm64.tgz|tar xpf -
  9. vmk180% cd proj-arm64/sample/mm_cnn_lf
  10. vmk180% make -f Makefile-acap.emax7+dma mm-acap.emax7+dma-32st (how to make)
  11. vmk180% sudo proj-arm64/sample/mm_cnn_lf/mm-acap.emax7+dma-32st (matrix-mult)
  12. passwd: temppwd
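
The image-writing step (step 1) gives no feedback on whether the card holds what was written. One possible check, sketched here under assumptions, is to read the same number of bytes back and compare checksums. The helper name write_and_verify is hypothetical, gzip -dc is used as an equivalent of zcat, and the demo writes to an ordinary temporary file standing in for /dev/mmcblk0. On a real card the read-back may be served from the page cache, so re-inserting the card before verifying is the more reliable test.

```shell
# Write a gzip-compressed image to TARGET, then compare checksums.
# usage: write_and_verify IMAGE_GZ TARGET   (hypothetical helper)
write_and_verify() {
  img=$1
  target=$2
  size=$(( $(gzip -dc "$img" | wc -c) ))       # uncompressed size in bytes
  gzip -dc "$img" | dd bs=64k of="$target" conv=notrunc 2>/dev/null || return 1
  sync
  want=$(gzip -dc "$img" | cksum)              # checksum of the image
  got=$(head -c "$size" "$target" | cksum)     # checksum of what was written
  [ "$want" = "$got" ]
}

# Demo with a small generated image and a regular file as the "card".
work=$(mktemp -d)
awk 'BEGIN { for (i = 0; i < 20000; i++) print "IMAX demo image line", i }' |
  gzip -c > "$work/demo.img.gz"
if write_and_verify "$work/demo.img.gz" "$work/fake-sdcard"; then
  echo "verified"
else
  echo "MISMATCH"
fi
```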

VMK180 (32 units x2 lanes) ... Vivado project is included.

  1. linux# zcat alice135-step4200-master.img.gz | dd bs=64k of=/dev/mmcblk0 (32GB SDcard for the master board)
  2. linux# zcat alice137-step4200-slave-img.gz | dd bs=64k of=/dev/mmcblk0 (second 32GB SDcard, for the slave board)
  3. linux# mount /dev/mmcblk0p2 /mnt
  4. linux# replace root-password in /mnt/etc/shadow
  5. linux# umount /mnt
  6. vmk180# connect two boards w/ QSFP28-AOC cable
  7. vmk180# insert SDcard
  8. vmk180# boot from SDcard (dhcp)
  9. linux% ssh -Y [email protected] (Xwindow)
  10. vmk180% zcat proj-arm64.tgz|tar xpf -
  11. vmk180% cd proj-arm64/sample/mm_cnn_lf
  12. vmk180% make -f Makefile-acap.emax7+dma mm-acap.emax7+dma-32st (how to make)
  13. vmk180% sudo proj-arm64/sample/mm_cnn_lf/mm-acap.emax7+dma-32st (matrix-mult)
  14. vmk180% sudo proj-arm64/sample/test/test025-acap.emax7+dma-32st (dual matrix-mult)
  15. vmk180% cd proj-arm64/sample/tsim (MNIST/CIFAR10)
  16. vmk180% sudo ./tsim-acap.emax7+dma-32st -x -i -r -I0 -C1 -F1 (MNIST conv1+fc inference)
  17. vmk180% sudo ./tsim-acap.emax7+dma-32st -x -t -I0 -C1 -F1 (MNIST conv1+fc training)
  18. vmk180% sudo ./tsim-acap.emax7+dma-32st -x -i -r -I0 -C3 -F1 (MNIST conv3+fc inference)
  19. vmk180% sudo ./tsim-acap.emax7+dma-32st -x -t -I0 -C3 -F1 (MNIST conv3+fc training)
  20. vmk180% sudo ./tsim-acap.emax7+dma-32st -x -i -r -I1 -C6 -F2 (CIFAR10 conv6+fc2 inference)
  21. vmk180% sudo ./tsim-acap.emax7+dma-32st -x -t -I1 -C6 -F2 (CIFAR10 conv6+fc2 training)
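
Steps 16-21 follow a regular pattern: three network configurations, each run once for inference and once for training. Judging only from the annotations in those steps, -I appears to select the dataset (0 for MNIST, 1 for CIFAR10), -C the number of convolution layers and -F the number of fully connected layers; that reading is an assumption. The sketch below only prints the six command lines (a dry run); the function name tsim_commands is hypothetical.

```shell
TSIM=./tsim-acap.emax7+dma-32st

# Print the tsim command lines of steps 16-21 without running them.
tsim_commands() {
  for cfg in "0 1 1" "0 3 1" "1 6 2"; do     # "-I -C -F" values per configuration
    set -- $cfg                              # split into $1 $2 $3
    for mode in "-i -r" "-t"; do             # inference first, then training
      echo "sudo $TSIM -x $mode -I$1 -C$2 -F$3"
    done
  done
}

tsim_commands
```

Piping the output to sh on the board would run the commands in the listed order.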

VPK180 (64 units x2 lanes)

  1. linux# zcat alice120-step4800-master.img.gz | dd bs=64k of=/dev/mmcblk0 (32GB SDcard)
  2. linux# mount /dev/mmcblk0p2 /mnt
  3. linux# replace root-password in /mnt/etc/shadow
  4. linux# umount /mnt
  5. vpk180# insert SDcard
  6. vpk180# boot from SDcard (dhcp)
  7. linux% ssh -Y [email protected] (Xwindow)
  8. vpk180% zcat proj-arm64.tgz|tar xpf -
  9. vpk180% cd proj-arm64/sample/mm_cnn_lf
  10. vpk180% make -f Makefile-acap.emax7+dma mm-acap.emax7+dma (how to make)
  11. vpk180% sudo proj-arm64/sample/mm_cnn_lf/mm-acap.emax7+dma (matrix-mult)
  12. vpk180% cd proj-arm64/sample/tsim (MNIST/CIFAR10)
  13. vpk180% sudo ./tsim-acap.emax7+dma -x -i -r -I0 -C1 -F1 (MNIST conv*1+fc inference)
  14. vpk180% sudo ./tsim-acap.emax7+dma -x -t -I0 -C1 -F1 (MNIST conv*1+fc training)
  15. vpk180% sudo ./tsim-acap.emax7+dma -x -i -r -I0 -C3 -F1 (MNIST conv*3+fc inference)
  16. vpk180% sudo ./tsim-acap.emax7+dma -x -t -I0 -C3 -F1 (MNIST conv*3+fc training)
  17. vpk180% sudo ./tsim-acap.emax7+dma -x -i -r -I1 -C6 -F2 (CIFAR10 conv6+fc2 inference)
  18. vpk180% sudo ./tsim-acap.emax7+dma -x -t -I1 -C6 -F2 (CIFAR10 conv6+fc2 training)
  19. vpk180% sudo ./vsim-acap.emax7+dma gptneox -m /home/nakashim/.cformers/models/OpenAssistant/oasst-sft-1-pythia-12b/int4_fixed_zero --prompt "50278 12092 2 0 50281" --seed 42 --threads 1 --n_predict 100 --top_k 20 --top_p 0.95 --temp 0.85 --repeat_last_n 64 --repeat_penalty 1.3 (GGML)
  20. vpk180% sudo ./llama-cli-acap.emax7+dma -t 4 -s 1 -fa -m ~/.llama/model/rinna-youri-7b-instruction-gguf/rinna-youri-7b-instruction-q2_K.gguf -p "Prime numbers smaller than ten" -n 32 (LLAMA-v2)

VPK180 (64 units x8 lanes)

  1. linux# zcat alice120-step4800-master.img.gz | dd bs=64k of=/dev/mmcblk0 (32GB SDcard for the master board)
  2. linux# zcat alice122-step4800-slave1.img.gz | dd bs=64k of=/dev/mmcblk0 (32GB SDcard for slave board 1)
  3. linux# zcat alice124-step4800-slave2.img.gz | dd bs=64k of=/dev/mmcblk0 (32GB SDcard for slave board 2)
  4. linux# zcat alice126-step4800-slave3.img.gz | dd bs=64k of=/dev/mmcblk0 (32GB SDcard for slave board 3)
  5. linux# mount /dev/mmcblk0p2 /mnt
  6. linux# replace root-password in /mnt/etc/shadow
  7. linux# umount /mnt
  8. vpk180# connect four boards w/ QSFPDD-DAC cable
  9. vpk180# insert SDcard
  10. vpk180# boot from SDcard (dhcp)
  11. linux% ssh -Y [email protected] (Xwindow)
  12. vpk180% zcat proj-arm64.tgz|tar xpf -
  13. vpk180% cd proj-arm64/sample/mm_cnn_lf
  14. vpk180% make -f Makefile-acap.emax7+dma mm-acap.emax7+dma (how to make)
  15. vpk180% sudo proj-arm64/sample/mm_cnn_lf/mm-acap.emax7+dma (matrix-mult)
  16. vpk180% sudo proj-arm64/sample/test/test025-acap.emax7+dma (dual matrix-mult)
  17. vpk180% cd proj-arm64/sample/tsim (MNIST/CIFAR10)
  18. vpk180% sudo ./tsim-acap.emax7+dma -x -i -r -I0 -C1 -F1 (MNIST conv*1+fc inference)
  19. vpk180% sudo ./tsim-acap.emax7+dma -x -t -I0 -C1 -F1 (MNIST conv*1+fc training)
  20. vpk180% sudo ./tsim-acap.emax7+dma -x -i -r -I0 -C3 -F1 (MNIST conv*3+fc inference)
  21. vpk180% sudo ./tsim-acap.emax7+dma -x -t -I0 -C3 -F1 (MNIST conv*3+fc training)
  22. vpk180% sudo ./tsim-acap.emax7+dma -x -i -r -I1 -C6 -F2 (CIFAR10 conv6+fc2 inference)
  23. vpk180% sudo ./tsim-acap.emax7+dma -x -t -I1 -C6 -F2 (CIFAR10 conv6+fc2 training)
  24. vpk180% sudo ./tsim-acap.emax7+dma -x -i -r -I1 -C6 -F2 -M16 (CIFAR10 multi-lane)
  25. vpk180% sudo ./vsim-acap.emax7+dma gptneox -m /home/nakashim/.cformers/models/OpenAssistant/oasst-sft-1-pythia-12b/int4_fixed_zero --prompt "50278 12092 2 0 50281" --seed 42 --threads 2 --n_predict 100 --top_k 20 --top_p 0.95 --temp 0.85 --repeat_last_n 64 --repeat_penalty 1.3 (GGML)
  26. vpk180% sudo ./llama-cli-acap.emax7+dma -t 4 -s 8 -fa -m ~/.llama/model/rinna-youri-7b-instruction-gguf/rinna-youri-7b-instruction-q2_K.gguf -p "Prime numbers smaller than ten" -n 32 (LLAMA-v2)
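
Steps 1-4 write four different images to the same device name, which presumably means one SD card per board, swapped between commands. A dry-run sketch of that plan follows; it only prints the commands and writes nothing. The function name flash_plan is hypothetical.

```shell
# Print one image-writing command per SD card for the four-board setup.
# usage: flash_plan DEVICE   (hypothetical helper; dry run only)
flash_plan() {
  n=0
  for img in alice120-step4800-master.img.gz \
             alice122-step4800-slave1.img.gz \
             alice124-step4800-slave2.img.gz \
             alice126-step4800-slave3.img.gz; do
    n=$((n + 1))
    echo "card $n: zcat $img | dd bs=64k of=$1"
  done
}

flash_plan /dev/mmcblk0
```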

Petalinux 2024.1 IMAX4 Kit for Intel servers

PCI-e(VPK120)+VPM180 (64 units x8/x16 lanes) ... Vivado project is included.

  • IMAX4 170MHz, 512 units, 20480 operations / 4 cycles, 512KB-cache/unit
  • each unit has:32-load/8-store, quad-sparse-load, 3-cascaded octa-int/media, octa-single-float FMA, 32-stochastic FMA, Dual addr-synchronizer
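
The headline figures in the first bullet can be turned into per-unit and aggregate numbers with a little arithmetic. This is a back-of-the-envelope sketch, assuming that every unit is busy on every cycle (a peak, not a measured rate) and that KB means 1024 bytes; the variable and function names are hypothetical.

```shell
units=512                # units, from the spec line above
ops_per_4cycles=20480    # operations per 4 cycles, from the spec line above
clock_mhz=170            # clock frequency in MHz
cache_kb_per_unit=512    # cache per unit in KB

imax4_summary() {
  echo "ops per unit per 4 cycles: $((ops_per_4cycles / units))"
  echo "peak Mops/s: $((ops_per_4cycles * clock_mhz / 4))"
  echo "total cache MB: $((units * cache_kb_per_unit / 1024))"
}

imax4_summary
```

Under those assumptions each unit contributes 40 operations per 4 cycles, the peak is roughly 870 G operations per second, and the aggregate cache is 256 MB.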