# Building ExecuTorch Android Demo for Llama Running on MediaTek
This tutorial covers the end-to-end workflow for running Llama 3 8B Instruct inference on MediaTek AI accelerators on an Android device.
More specifically, it covers:
1. Export and quantization of Llama models against the MediaTek backend.
2. Building and linking the libraries required to run inference on-device for the Android platform using MediaTek AI accelerators.
3. Loading the needed model files onto the device and using the Android demo app to run inference.

Verified on macOS and Linux CentOS (model export), with Python 3.10 and Android NDK 26.3.11579264.
Phone verified: MediaTek Dimensity 9300 (D9300) chip.

## Prerequisites
* Download and link the Buck2 build, Android NDK, and MediaTek ExecuTorch libraries from the MediaTek Backend Readme ([link](https://github.com/pytorch/executorch/tree/main/backends/mediatek/scripts#prerequisites)).
* A device with the MediaTek Dimensity 9300 (D9300) chip.
* The desired Llama 3 model weights. You can download them from Hugging Face ([example](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)).
* Download the NeuroPilot Express SDK from the [MediaTek NeuroPilot Portal](https://neuropilot.mediatek.com/resources/public/npexpress/en/docs/npexpress):
  - `libneuronusdk_adapter.mtk.so`: This universal SDK contains the implementation required for executing target-dependent code on the MediaTek chip.
  - `libneuron_buffer_allocator.so`: This utility library allocates the DMA buffers necessary for model inference.
  - `mtk_converter-8.8.0.dev20240723+public.d1467db9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl`: This library preprocesses the model into a MediaTek representation.
  - `mtk_neuron-8.2.2-py3-none-linux_x86_64.whl`: This library converts the model to binaries.

## Setup ExecuTorch
In this section, we set up the ExecuTorch repo with Conda for environment management. Make sure Conda is available on your system (or follow the instructions [here](https://anaconda.org/anaconda/conda) to install it). The commands below were run on Linux (CentOS).

Create a Conda environment:
```
conda create -yn et_mtk python=3.10.0
conda activate et_mtk
```

Check out the ExecuTorch repo and sync submodules:
```
git clone https://github.com/pytorch/executorch.git
cd executorch
git submodule sync
git submodule update --init
```

Install dependencies:
```
./install_requirements.sh
```

## Setup Environment Variables
### Download Buck2 and make executable
* Download Buck2 from the official [Release Page](https://github.com/facebook/buck2/releases/tag/2024-02-01).
* Decompress the archive and make the binary executable:
```
zstd -cdq "<downloaded_buck2_file>.zst" > "<path_to_store_buck2>/buck2" && chmod +x "<path_to_store_buck2>/buck2"
```

### Set Environment Variables
```
export BUCK2=path_to_buck/buck2 # the buck2 executable created above
export ANDROID_NDK=path_to_android_ndk
export NEURON_BUFFER_ALLOCATOR_LIB=path_to_buffer_allocator/libneuron_buffer_allocator.so
export NEURON_USDK_ADAPTER_LIB=path_to_usdk_adapter/libneuronusdk_adapter.mtk.so
export ANDROID_ABIS=arm64-v8a
```
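Before moving on, it can be worth confirming that each variable points at a real file or directory, since a missing path typically surfaces much later as a build failure. A minimal sanity-check sketch, assuming the variables above are exported in the current shell:
```
# Confirm the toolchain and MediaTek libraries are where the build expects them.
"$BUCK2" --version                                   # should print the Buck2 version
test -d "$ANDROID_NDK" && echo "NDK: OK"
test -f "$NEURON_BUFFER_ALLOCATOR_LIB" && echo "buffer allocator: OK"
test -f "$NEURON_USDK_ADAPTER_LIB" && echo "USDK adapter: OK"
```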
## Export Llama Model
MTK currently supports exporting Llama 3.

### Set up Environment
1. Follow the ExecuTorch environment setup instructions on the [Getting Started](https://pytorch.org/executorch/stable/getting-started-setup.html) page.
2. Set up the MTK ahead-of-time (AoT) environment:
```
# Ensure that you are inside the executorch/examples/mediatek directory
pip3 install -r requirements.txt

pip3 install mtk_neuron-8.2.2-py3-none-linux_x86_64.whl
pip3 install mtk_converter-8.8.0.dev20240723+public.d1467db9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
```

This was tested with transformers 4.40 and numpy 1.23. If you do not have these versions, install them with:
```
pip install transformers==4.40
pip install numpy==1.23
```

### Running Export
Prior to exporting, place the `config.json`, the relevant tokenizer files, and the `.bin` or `.safetensors` weight files in `examples/mediatek/models/llm_models/weights`.

Here is an export example ([details](https://github.com/pytorch/executorch/tree/main/examples/mediatek#aot-flow)):
```
cd examples/mediatek
# num_chunks=4, num_tokens=128, cache_size=512
source shell_scripts/export_llama.sh llama3 "" "" "" alpaca.txt
```

Three main sets of files are generated:
* num_chunks * 2 `.pte` files: half are for prompt processing and the other half for token generation. Generation `.pte` files are denoted by "1t" in the file name.
* Token embedding bin file: located in the weights folder where `config.json` is placed (`examples/mediatek/models/llm_models/weights/<model_name>/embedding_<model_name>_fp32.bin`).
* Tokenizer file: the `tokenizer.model` file.

Note: the export flow can take around 2.5 hours (and 114 GB of RAM for num_chunks=4) to complete. (Results may vary depending on hardware.)

Before continuing, make sure to modify the tokenizer, token embedding, and model paths in `examples/mediatek/executor_runner/run_llama3_sample.sh`.

### Deploy
First, make sure your Android phone's chipset is compatible with this demo (MediaTek Dimensity 9300 (D9300)). Once the generated model, tokenizer, and runner are ready, push them and the `.so` files to the device before running via the shell.

```
adb shell mkdir -p /data/local/tmp/et-mtk/  # or any other directory name
adb push embedding_<model_name>_fp32.bin /data/local/tmp/et-mtk
adb push tokenizer.model /data/local/tmp/et-mtk
adb push <exported_prompt_model_0>.pte /data/local/tmp/et-mtk
adb push <exported_prompt_model_1>.pte /data/local/tmp/et-mtk
...
adb push <exported_prompt_model_n>.pte /data/local/tmp/et-mtk
adb push <exported_gen_model_0>.pte /data/local/tmp/et-mtk
adb push <exported_gen_model_1>.pte /data/local/tmp/et-mtk
...
adb push <exported_gen_model_n>.pte /data/local/tmp/et-mtk
```
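The MediaTek shared libraries and the runner script also need to be on the device. Below is a hedged sketch of pushing them and launching the sample script edited earlier; the runner binary name `mtk_llama_executor_runner` and its build-output location are assumptions, so verify them against your actual build output:
```
# Push the MediaTek shared libraries alongside the models.
adb push libneuron_buffer_allocator.so /data/local/tmp/et-mtk
adb push libneuronusdk_adapter.mtk.so /data/local/tmp/et-mtk
# Push the runner binary (name and path assumed) and the edited sample script, then run it.
adb push <path_to_built_runner>/mtk_llama_executor_runner /data/local/tmp/et-mtk
adb push examples/mediatek/executor_runner/run_llama3_sample.sh /data/local/tmp/et-mtk
adb shell "cd /data/local/tmp/et-mtk && sh run_llama3_sample.sh"
```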
## Populate Model Paths in Runner

The MediaTek runner (`examples/mediatek/executor_runner/mtk_llama_runner.cpp`) contains the logic implementing the function calls that come from the Android app.

**Important!** Currently the model paths are set at the runner level. Modify the values in `examples/mediatek/executor_runner/llama_runner/llm_helper/include/llama_runner_values.h` to set the model paths, tokenizer path, embedding file path, and other metadata.


## Build AAR Library

Next, we need to build and compile the MediaTek backend and the MediaTek Llama runner. When `NEURON_BUFFER_ALLOCATOR_LIB` is set, the script builds the MediaTek backend.
```
sh build/build_android_llm_demo.sh
```

**Output**: This generates an `.aar` file that is already imported into the directory the Android app expects. It will live in `examples/demo-apps/android/LlamaDemo/app/libs`.

If you unzip the `.aar` file or open it in Android Studio, verify that it contains the following files related to the MediaTek backend:
* libneuron_buffer_allocator.so
* libneuronusdk_adapter.mtk.so
* libneuron_backend.so (generated during the build)

## Run Demo

### Alternative 1: Android Studio (Recommended)
1. Open Android Studio and select "Open an existing Android Studio project" to open `examples/demo-apps/android/LlamaDemo`.
2. Run the app (^R). This builds and launches the app on the phone.

### Alternative 2: Command line
Without the Android Studio UI, we can invoke Gradle directly to build the app. Set the Android SDK path, then run Gradle:
```
export ANDROID_HOME=<path_to_android_sdk_home>
pushd examples/demo-apps/android/LlamaDemo
./gradlew :app:installDebug
popd
```
If the app runs successfully on your device, you should see something like the following:

<p align="center">
<img src="https://raw.githubusercontent.com/pytorch/executorch/refs/heads/main/docs/source/_static/img/opening_the_app_details.png" style="width:800px">
</p>

Once you've loaded the app on the device:
1. Open the Settings in the app.
2. Select MediaTek from the Backend dropdown.
3. Click the "Load Model" button. This loads the models from the runner.

## Reporting Issues
If you encounter any bugs or issues while following this tutorial, please file an issue on [GitHub](https://github.com/pytorch/executorch/issues/new).
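Device logs usually make a report much easier to act on. A minimal sketch for capturing them while you reproduce the problem, so the resulting file can be attached to the issue:
```
# Clear the log buffer, reproduce the issue in the app, then dump the buffer to a file.
adb logcat -c
# ... reproduce the problem on the device ...
adb logcat -d > llama_demo_logcat.txt
```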