IMAX2/3/4 Applications
crypto/sha256, fft/fft, filter/filter (一般フィルタ,超解像,フレーム補間,距離画像生成等), llama/llama (llama-v2), mm_cnn_lf/cnn, mm_cnn_lf/cnn3d, mm_cnn_lf/gather (離散ステンシル:Lightfieldレンダリング), mm_cnn_lf/gdepth (離散ステンシル:Lightfield距離画像), mm_cnn_lf/inv (逆行列), mm_cnn_lf/mm (密行列積), rsim/rsim (normal MNIST/CIFAR10/CNN), sort/sort (パイプラインソート), spgemm/test022 (SpGEMM), spgemm/test024 (疎行列圧縮), ssim/ssim (stochastic MNIST/CIFAR10/CNN), stencil/stencil (degree=1,2,3各種ステンシル計算), stringsearch/search (文字列検索), tsim/tsim (multithread MNIST/CIFAR10/CNN), vsim/vsim (GGML), vbgmm, graph-cnn, graph-attention, U-net
IMAX2/3/4 Docs/Tutorials
Download IMAX2/3/4
- IMAX2 document(jpn) IMAX2 document(eng)
- IMAX3 document(jpn) IMAX3 document(eng)
- IMAX4 document(jpn) IMAX4 document(eng)
- IMAX2/3/4 all-in-one kit including document, compiler, simulator, examples, FPGA bin-files, and Vivado-projects (26GB in total)
- IMAX2/3/4 suppremental kit for CentOS (300MB)
Introduction to IMAX3: Amazing Dataflow-Centric Gen4-CGLA(non-CGRA) (CGLA:Coarse Grained Linear Array)
Introductive slides with synthesizable notes
Expertized slides with synthesizable notes
Petalinux 2024.1 IMAX2 Kit for basic CGLA
ZU19EG (16 units) ... Vivado project is included.
- IMAX2 250MHz, 16 units, 640 operations / 4 cycles, 128KB-cache/unit
- each unit has:32-load/8-store, quad-sparse-load, 3-cascaded octa-int/media, octa-single-float FMA, 32-stochastic FMA, Dual addr-synchronizer
- proj-arm64/fpga/README-ZU19EG
- proj-arm64/fpga/ZU19EG-step4000-20241111_IP.tgz
- proj-arm64/fpga/ZU19EG-step4000-20241111.tgz
- proj-arm64/fpga/ZU19EG-step4000-20241111.img.gz
- linux# zcat ZU19EG-step4000-20241111.img.gz | dd bs=64k of=/dev/mmcblk0 (16GB SDcard)
- linux# mount /dev/mmcblk0p2 /mnt
- linux# replace root-password in /mnt/etc/shadow
- linux# umount /mnt
- zu19eg# insert SDcard
- zu19eg# boot from SDcard (dhcp)
- linux% ssh -Y [email protected] (Xwindow)
- zu19eg% zcat proj-arm64.tgz|tar xpf -
- zu19eg% cd proj-arm64/sample/mm_cnn_lf
- zu19eg% make -f Makefile-zynq.emax6+dma mm-zynq.emax6+dma-16st (how to make)
- zu19eg% sudo proj-arm64/sample/mm_cnn_lf/mm-zynq.emax6+dma-16st (matrix-mult)
- passwd: temppwd
- localhost:11.0: Cannot open display
- zu19eg% cp ~/.Xauthority /tmp/111
- zu19eg% sudo cp /tmp/111 /root/.Xauthority
- zu19eg% sudo proj-arm64/sample/mm_cnn_lf/mm-zynq.emax6+dma-16st (retry)
- <<<ORIG>>>
- usec: ARM:2098589 DRAIN:0 CONF:0 REGV:0 RANGE:0 LOAD:0 EXEC:0 total:2098589 (usec)
- <<<IMAX>>>
- usec: ARM:426 DRAIN:1224 CONF:105 REGV:1041 RANGE:663 LOAD:14861 EXEC:24324 total:42647 (usec)
ZCU102+VU440 (64/128/192/256/512 units /single lane) ... Vivado project is included.
- IMAX2 130MHz, 64-512 units, 2560-20480 operations / 4 cycles, 64KB-cache/unit
- each unit has:32-load/8-store, quad-sparse-load, 3-cascaded octa-int/media, octa-single-float FMA, 32-stochastic FMA, Dual addr-synchronizer
- proj-arm64/fpga/README-ZCU102
- proj-arm64/fpga/README-VU440
- proj-arm64/fpga/ZCU102-step4000-20201010.img.gz
- proj-arm64/fpga/VU440-step4000-20221020.tgz
- proj-arm64/fpga/VU440-step4000-20221020-V24.1-78.125+78.125+48+260+130+48-CRYPTO-SPU.bin
- vu440# connect with zcu102 (see figure)
- vu440# write VU440-step4000-20221020-V24.1-78.125+78.125+48+260+130+48-CRYPTO-SPU.bin to SDcard
- vu440# insert SDcard
- linux# zcat ZCU102-step4000-20201010.img.gz | dd bs=64k of=/dev/mmcblk0 (16GB SDcard)
- linux# mount /dev/mmcblk0p2 /mnt
- linux# replace root-password in /mnt/etc/shadow
- linux# umount /mnt
- zcu102# insert SDcard
- zcu102# boot from SDcard (dhcp)
- linux% ssh -Y [email protected] (Xwindow)
- zcu102% zcat proj-arm64.tgz|tar xpf -
- zcu102% cd proj-arm64/sample/mm_cnn_lf
- zcu102% make -f Makefile-zynq.emax6+dma mm-zynq.emax6+dma (how to make)
- zcu102% sudo proj-arm64/sample/mm_cnn_lf/mm-zynq.emax6+dma (matrix-mult)
- passwd: temppwd
Petalinux 2024.1 IMAX3 Kit for professional CGLA
VMK180 (32 units) ... Vivado project is included.
- IMAX3 180MHz, 32 units, 1280 operations / 4 cycles, 512KB-cache/unit
- each unit has:32-load/8-store, quad-sparse-load, 3-cascaded octa-int/media, octa-single-float FMA, 32-stochastic FMA, Dual addr-synchronizer
- proj-arm64/fpga/README-VMK180
- proj-arm64/fpga/VMK180-step4000-20241130_IP.tgz
- proj-arm64/fpga/VMK180-step4000-20241130.tgz
- proj-arm64/fpga/alice139-step4000.img.gz
- linux# zcat alice139-step4000.img.gz | dd bs=64k of=/dev/mmcblk0 (32GB SDcard)
- linux# mount /dev/mmcblk0p2 /mnt
- linux# replace root-password in /mnt/etc/shadow
- linux# umount /mnt
- vmk180# insert SDcard
- vmk180# boot from SDcard (dhcp)
- linux% ssh -Y [email protected] (Xwindow)
- vmk180% zcat proj-arm64.tgz|tar xpf -
- vmk180% cd proj-arm64/sample/mm_cnn_lf
- vmk180% make -f Makefile-acap.emax7+dma mm-acap.emax7+dma-32st (how to make)
- vmk180% sudo proj-arm64/sample/mm_cnn_lf/mm-acap.emax7+dma-32st (matrix-mult)
- passwd: temppwd
VMK180 (32 units x2 lanes) ... Vivado project is included.
- IMAX3 180MHz, 64 units, 2560 operations / 4 cycles, 512KB-cache/unit
- each unit has:32-load/8-store, quad-sparse-load, 3-cascaded octa-int/media, octa-single-float FMA, 32-stochastic FMA, Dual addr-synchronizer
- proj-arm64/fpga/README-VMK180
- proj-arm64/fpga/VMK180-step4200-MASTER.tgz
- proj-arm64/fpga/VMK180-step4200-SLAVE.tgz
- proj-arm64/fpga/alice135-step4200-master.img.gz
- proj-arm64/fpga/alice137-step4200-slave.img.gz
- linux# zcat alice135-step4200-master.img.gz | dd bs=64k of=/dev/mmcblk0 (32GB SDcard)
- linux# zcat alice137-step4200-slave-img.gz | dd bs=64k of=/dev/mmcblk0 (32GB SDcard)
- linux# mount /dev/mmcblk0p2 /mnt
- linux# replace root-password in /mnt/etc/shadow
- linux# umount /mnt
- vmk180# connect two boards w/ QSFP28-AOC cable
- vmk180# insert SDcard
- vmk180# boot from SDcard (dhcp)
- linux% ssh -Y [email protected] (Xwindow)
- vmk180% zcat proj-arm64.tgz|tar xpf -
- vmk180% cd proj-arm64/sample/mm_cnn_lf
- vmk180% make -f Makefile-acap.emax7+dma mm-acap.emax7+dma-32st (how to make)
- vmk180% sudo proj-arm64/sample/mm_cnn_lf/mm-acap.emax7+dma-32st (matrix-mult)
- vmk180% sudo proj-arm64/sample/test/test025-acap.emax7+dma-32st (dual matrix-mult)
- vmk180% cd proj-arm64/sample/tsim (MNIST/CIFAR10)
- vmk180% sudo ./tsim-acap.emax7+dma-32st -x -i -r -I0 -C1 -F1 (MNIST conv1+fc inference)
- vmk180% sudo ./tsim-acap.emax7+dma-32st -x -t -I0 -C1 -F1 (MNIST conv1+fc training)
- vmk180% sudo ./tsim-acap.emax7+dma-32st -x -i -r -I0 -C3 -F1 (MNIST conv3+fc inference)
- vmk180% sudo ./tsim-acap.emax7+dma-32st -x -t -I0 -C3 -F1 (MNIST conv3+fc training)
- vmk180% sudo ./tsim-acap.emax7+dma-32st -x -i -r -I1 -C6 -F2 (CIFAR10 conv6+fc2 inference)
- vmk180% sudo ./tsim-acap.emax7+dma-32st -x -t -I1 -C6 -F2 (CIFAR10 conv6+fc2 training)
VPK180 (64 units x2 lanes)
- IMAX3 170MHz, 128 units, 5120 operations / 4 cycles, 512KB-cache/unit
- each unit has:32-load/8-store, quad-sparse-load, 3-cascaded octa-int/media, octa-single-float FMA, 32-stochastic FMA, Dual addr-synchronizer
- proj-arm64/fpga/README-VPK180
- proj-arm64/fpga/VPK180-step4000-20240930_IP.tgz
- proj-arm64/fpga/VPK180-step4000-20240930.tgz
- proj-arm64/fpga/alice120-step4800-master.img.gz
- linux# zcat alice120-step4800-master.img.gz | dd bs=64k of=/dev/mmcblk0 (32GB SDcard)
- linux# mount /dev/mmcblk0p2 /mnt
- linux# replace root-password in /mnt/etc/shadow
- linux# umount /mnt
- vpk180# insert SDcard
- vpk180# boot from SDcard (dhcp)
- linux% ssh -Y [email protected] (Xwindow)
- vpk180% zcat proj-arm64.tgz|tar xpf -
- vpk180% cd proj-arm64/sample/mm_cnn_lf
- vpk180% make -f Makefile-acap.emax7+dma mm-acap.emax7+dma (how to make)
- vpk180% sudo proj-arm64/sample/mm_cnn_lf/mm-acap.emax7+dma (matrix-mult)
- vpk180% cd proj-arm64/sample/tsim (MNIST/CIFAR10)
- vpk180% sudo ./tsim-acap.emax7+dma -x -i -r -I0 -C1 -F1 (MNIST conv*1+fc inference)
- vpk180% sudo ./tsim-acap.emax7+dma -x -t -I0 -C1 -F1 (MNIST conv*1+fc training)
- vpk180% sudo ./tsim-acap.emax7+dma -x -i -r -I0 -C3 -F1 (MNIST conv*3+fc inference)
- vpk180% sudo ./tsim-acap.emax7+dma -x -t -I0 -C3 -F1 (MNIST conv*3+fc training)
- vpk180% sudo ./tsim-acap.emax7+dma -x -i -r -I1 -C6 -F2 (CIFAR10 conv6+fc2 inference)
- vpk180% sudo ./tsim-acap.emax7+dma -x -t -I1 -C6 -F2 (CIFAR10 conv6+fc2 training)
- vpk180% sudo ./vsim-acap.emax7+dma gptneox -m /home/nakashim/.cformers/models/OpenAssistant/oasst-sft-1-pythia-12b/int4_fixed_zero --prompt "50278 12092 2 0 50281" --seed 42 --threads 1 --n_predict 100 --top_k 20 --top_p 0.95 --temp 0.85 --repeat_last_n 64 --repeat_penalty 1.3 (GGML)
- vpk180% sudo ./llama-cli-acap.emax7+dma -t 4 -s 1 -fa -m ~/.llama/model/rinna-youri-7b-instruction-gguf/rinna-youri-7b-instruction-q2_K.gguf -p "Prime numbers smaller than ten" -n 32 (LLAMA-v2)
VPK180 (64 units x8 lanes)
- IMAX3 170MHz, 512 units, 20480 operations / 4 cycles, 512KB-cache/unit
- each unit has:32-load/8-store, quad-sparse-load, 3-cascaded octa-int/media, octa-single-float FMA, 32-stochastic FMA, Dual addr-synchronizer
- proj-arm64/fpga/README-VPK180
- proj-arm64/fpga/VPK180-step4800-MASTER.tgz
- proj-arm64/fpga/VPK180-step4800-SLAVE.tgz
- proj-arm64/fpga/alice120-step4800-master.img.gz
- proj-arm64/fpga/alice122-step4800-slave1.img.gz
- proj-arm64/fpga/alice124-step4800-slave2.img.gz
- proj-arm64/fpga/alice126-step4800-slave3.img.gz
- linux# zcat alice120-step4800-master.img.gz | dd bs=64k of=/dev/mmcblk0 (32GB SDcard)
- linux# zcat alice122-step4800-slave1.img.gz | dd bs=64k of=/dev/mmcblk0 (32GB SDcard)
- linux# zcat alice124-step4800-slave2.img.gz | dd bs=64k of=/dev/mmcblk0 (32GB SDcard)
- linux# zcat alice126-step4800-slave3.img.gz | dd bs=64k of=/dev/mmcblk0 (32GB SDcard)
- linux# mount /dev/mmcblk0p2 /mnt
- linux# replace root-password in /mnt/etc/shadow
- linux# umount /mnt
- vpk180# connect four boards w/ QSFPDD-DAC cable
- vpk180# insert SDcard
- vpk180# boot from SDcard (dhcp)
- linux% ssh -Y [email protected] (Xwindow)
- vpk180% zcat proj-arm64.tgz|tar xpf -
- vpk180% cd proj-arm64/sample/mm_cnn_lf
- vpk180% make -f Makefile-acap.emax7+dma mm-acap.emax7+dma (how to make)
- vpk180% sudo proj-arm64/sample/mm_cnn_lf/mm-acap.emax7+dma (matrix-mult)
- vpk180% sudo proj-arm64/sample/test/test025-acap.emax7+dma (dual matrix-mult)
- vpk180% cd proj-arm64/sample/tsim (MNIST/CIFAR10)
- vpk180% sudo ./tsim-acap.emax7+dma -x -i -r -I0 -C1 -F1 (MNIST conv*1+fc inference)
- vpk180% sudo ./tsim-acap.emax7+dma -x -t -I0 -C1 -F1 (MNIST conv*1+fc training)
- vpk180% sudo ./tsim-acap.emax7+dma -x -i -r -I0 -C3 -F1 (MNIST conv*3+fc inference)
- vpk180% sudo ./tsim-acap.emax7+dma -x -t -I0 -C3 -F1 (MNIST conv*3+fc training)
- vpk180% sudo ./tsim-acap.emax7+dma -x -i -r -I1 -C6 -F2 (CIFAR10 conv6+fc2 inference)
- vpk180% sudo ./tsim-acap.emax7+dma -x -t -I1 -C6 -F2 (CIFAR10 conv6+fc2 training)
- vpk180% sudo ./tsim-acap.emax7+dma -x -i -r -I1 -C6 -F2 -M16 (CIFAR10 multi-lane)
- vpk180% sudo ./vsim-acap.emax7+dma gptneox -m /home/nakashim/.cformers/models/OpenAssistant/oasst-sft-1-pythia-12b/int4_fixed_zero --prompt "50278 12092 2 0 50281" --seed 42 --threads 2 --n_predict 100 --top_k 20 --top_p 0.95 --temp 0.85 --repeat_last_n 64 --repeat_penalty 1.3 (GGML)
- vpk180% sudo ./llama-cli-acap.emax7+dma -t 4 -s 8 -fa -m ~/.llama/model/rinna-youri-7b-instruction-gguf/rinna-youri-7b-instruction-q2_K.gguf -p "Prime numbers smaller than ten" -n 32 (LLAMA-v2)
Petalinux 2024.1 IMAX4 Kit for Intel servers
PCI-e(VPK120)+VPM180 (64 units x8/x16 lanes) ... Vivado project is included.
- IMAX4 170MHz, 512 units, 20480 operations / 4 cycles, 512KB-cache/unit
- each unit has:32-load/8-store, quad-sparse-load, 3-cascaded octa-int/media, octa-single-float FMA, 32-stochastic FMA, Dual addr-synchronizer