# Building ExecuTorch Android Demo for Llama running on MediaTek
This tutorial covers the end-to-end workflow for running Llama 3 8B Instruct inference on MediaTek AI accelerators on an Android device.
More specifically, it covers:
1. Exporting and quantizing Llama models for the MediaTek backend.
2. Building and linking the libraries required for on-device inference on Android using MediaTek AI accelerators.
3. Loading the needed model files onto the device and using the Android demo app to run inference.

Verified on macOS, Linux CentOS (model export), Python 3.10, Android NDK 26.3.11579264.
Phone verified: MediaTek Dimensity 9300 (D9300) chip.

## Prerequisites
* Download and link the Buck2 build, Android NDK, and MediaTek ExecuTorch libraries from the MediaTek Backend Readme ([link](https://github.com/pytorch/executorch/tree/main/backends/mediatek/scripts#prerequisites)).
* A device with the MediaTek Dimensity 9300 (D9300) chip
* Desired Llama 3 model weights. You can download them from Hugging Face ([example](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)).
* Download the NeuroPilot Express SDK from the [MediaTek NeuroPilot Portal](https://neuropilot.mediatek.com/resources/public/npexpress/en/docs/npexpress) (see the example layout after this list):
  - `libneuronusdk_adapter.mtk.so`: This universal SDK contains the implementation required for executing target-dependent code on the MediaTek chip.
  - `libneuron_buffer_allocator.so`: This utility library is designed for allocating DMA buffers necessary for model inference.
  - `mtk_converter-8.8.0.dev20240723+public.d1467db9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl`: This library preprocesses the model into a MediaTek representation.
  - `mtk_neuron-8.2.2-py3-none-linux_x86_64.whl`: This library converts the model to binaries.
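
After downloading, you should end up with something like the following files on disk; the directory name below is just an example, so keep them wherever is convenient.
```
# Example layout of the downloaded MediaTek artifacts (the directory name is arbitrary).
ls ~/mtk-sdk
# libneuronusdk_adapter.mtk.so
# libneuron_buffer_allocator.so
# mtk_converter-8.8.0.dev20240723+public.d1467db9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
# mtk_neuron-8.2.2-py3-none-linux_x86_64.whl
```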

## Setup ExecuTorch
In this section, we set up the ExecuTorch repo using Conda for environment management. Make sure Conda is available on your system (or follow the instructions to install it [here](https://anaconda.org/anaconda/conda)). The commands below were run on Linux (CentOS).

Create a Conda environment
```
conda create -yn et_mtk python=3.10.0
conda activate et_mtk
```

Check out the ExecuTorch repo and sync submodules
```
git clone https://github.com/pytorch/executorch.git
cd executorch
git submodule sync
git submodule update --init
```
Install dependencies
```
./install_requirements.sh
```
## Setup Environment Variables
### Download Buck2 and make executable
* Download Buck2 from the official [Release Page](https://github.com/facebook/buck2/releases/tag/2024-02-01)
* Create the buck2 executable
```
zstd -cdq "<downloaded_buck2_file>.zst" > "<path_to_store_buck2>/buck2" && chmod +x "<path_to_store_buck2>/buck2"
```

### Set Environment Variables
```
export BUCK2=path_to_buck/buck2 # path to the Buck2 executable created above
export ANDROID_NDK=path_to_android_ndk
export NEURON_BUFFER_ALLOCATOR_LIB=path_to_buffer_allocator/libneuron_buffer_allocator.so
export NEURON_USDK_ADAPTER_LIB=path_to_usdk_adapter/libneuronusdk_adapter.mtk.so
export ANDROID_ABIS=arm64-v8a
```
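
Before moving on, it can help to sanity-check that these variables point at real files. A minimal check, using the placeholder paths from the snippet above:
```
# Quick sanity check of the environment variables set above (paths are the placeholders from this tutorial).
"$BUCK2" --version                                            # should print the Buck2 version
ls "$ANDROID_NDK/build/cmake/android.toolchain.cmake"         # the NDK's CMake toolchain file should exist
ls "$NEURON_BUFFER_ALLOCATOR_LIB" "$NEURON_USDK_ADAPTER_LIB"  # both MediaTek .so files should exist
```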

## Export Llama Model
MTK currently supports exporting Llama 3.

### Set up Environment
1. Follow the ExecuTorch environment setup instructions on the [Getting Started](https://pytorch.org/executorch/stable/getting-started-setup.html) page
2. Set up the MTK AoT environment
```
# Ensure that you are inside the executorch/examples/mediatek directory
pip3 install -r requirements.txt

pip3 install mtk_neuron-8.2.2-py3-none-linux_x86_64.whl
pip3 install mtk_converter-8.8.0.dev20240723+public.d1467db9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
```
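
As a quick check that the two MediaTek wheels installed into the active Conda environment, you can try importing them; the Python module names are assumed to match the wheel names, so adjust if they differ:
```
# Assumes the module names match the wheel names (mtk_neuron, mtk_converter); adjust if they differ.
python3 -c "import mtk_neuron, mtk_converter; print('MediaTek AoT packages imported OK')"
```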

This was tested with transformers version 4.40 and numpy version 1.23. If you do not have these versions, install them with the following commands:
```
pip install transformers==4.40

pip install numpy==1.23
```
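
You can confirm the pinned versions are the ones actually in use with:
```
# Print the installed versions to confirm the pins above took effect.
python3 -c "import transformers, numpy; print(transformers.__version__, numpy.__version__)"
```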

### Running Export
Prior to exporting, place the `config.json`, relevant tokenizer files, and `.bin` or `.safetensors` weight files in `examples/mediatek/models/llm_models/weights`.

Here is an export example ([details](https://github.com/pytorch/executorch/tree/main/examples/mediatek#aot-flow)):
```
cd examples/mediatek
# num_chunks=4, num_tokens=128, cache_size=512
source shell_scripts/export_llama.sh llama3 "" "" "" alpaca.txt
```

Three main sets of files are generated (a quick way to check for them is sketched below):
* num_chunks * 2 `.pte` files: half are for prompt processing and the other half are for generation. Generation `.pte` files are denoted by “1t” in the file name.
* Token embedding bin file: located in the weights folder where `config.json` is placed (`examples/mediatek/models/llm_models/weights/<model_name>/embedding_<model_name>_fp32.bin`)
* Tokenizer file: the `tokenizer.model` file

Note: The model export flow can take about 2.5 hours (and roughly 114 GB of RAM for num_chunks=4) to complete. Results may vary depending on hardware.
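
Once the export finishes, a quick way to confirm all of the expected artifacts exist (treat the exact output locations as assumptions for your run):
```
# Run from examples/mediatek after the export completes.
find . -name "*.pte"   # expect num_chunks prompt .pte files plus num_chunks "1t" generation .pte files
find . -name "embedding_*_fp32.bin" -o -name "tokenizer.model"
```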

Before continuing, make sure to modify the tokenizer, token embedding, and model paths in `examples/mediatek/executor_runner/run_llama3_sample.sh`.

### Deploy
First, make sure your Android phone’s chipset is compatible with this demo (MediaTek Dimensity 9300 (D9300)). Once the model, tokenizer, and runner files are ready, push them and the `.so` files to the device before running the runner via the shell.

```
adb shell mkdir -p /data/local/tmp/et-mtk/ # or any other directory name
adb push embedding_<model_name>_fp32.bin /data/local/tmp/et-mtk
adb push tokenizer.model /data/local/tmp/et-mtk
adb push <exported_prompt_model_0>.pte /data/local/tmp/et-mtk
adb push <exported_prompt_model_1>.pte /data/local/tmp/et-mtk
...
adb push <exported_prompt_model_n>.pte /data/local/tmp/et-mtk
adb push <exported_gen_model_0>.pte /data/local/tmp/et-mtk
adb push <exported_gen_model_1>.pte /data/local/tmp/et-mtk
...
adb push <exported_gen_model_n>.pte /data/local/tmp/et-mtk
```
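
To confirm everything landed on the device, list the target directory (the directory name matches the one used above):
```
# Verify the pushed files are present on the device.
adb shell ls -lh /data/local/tmp/et-mtk/
```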

## Populate Model Paths in Runner

The MediaTek runner (`examples/mediatek/executor_runner/mtk_llama_runner.cpp`) implements the function calls that come from the Android app.

**Important!** Currently the model paths are set at the runner level. Modify the values in `examples/mediatek/executor_runner/llama_runner/llm_helper/include/llama_runner_values.h` to set the model paths, tokenizer path, embedding file path, and other metadata.

## Build AAR Library

Next, we need to build and compile the MediaTek backend and the MediaTek Llama runner. When `NEURON_BUFFER_ALLOCATOR_LIB` is set, the script builds the MediaTek backend.
```
sh build/build_android_llm_demo.sh
```

**Output**: This generates an `.aar` file and places it in the directory the Android app expects: `examples/demo-apps/android/LlamaDemo/app/libs`.

If you unzip the `.aar` file or open it in Android Studio, verify that it contains the following files related to the MediaTek backend (a command-line check is sketched after the list):
* libneuron_buffer_allocator.so
* libneuronusdk_adapter.mtk.so
* libneuron_backend.so (generated during build)
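
One way to check from the command line; the `.aar` filename here is an assumption, so use whatever file the build script actually placed in `app/libs`:
```
# List the native libraries bundled in the generated AAR (the filename is an assumption; use the one in app/libs).
unzip -l examples/demo-apps/android/LlamaDemo/app/libs/executorch.aar | grep "\.so"
```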

## Run Demo

### Alternative 1: Android Studio (Recommended)
1. Open Android Studio and select “Open an existing Android Studio project” to open examples/demo-apps/android/LlamaDemo.
2. Run the app (^R). This builds and launches the app on the phone.

### Alternative 2: Command line
Without the Android Studio UI, we can run Gradle directly to build the app. We need to set the Android SDK path and invoke Gradle.
```
export ANDROID_HOME=<path_to_android_sdk_home>
pushd examples/demo-apps/android/LlamaDemo
./gradlew :app:installDebug
popd
```
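
As an optional check that the debug build actually installed, you can search the device's package list; the package name fragment below is an assumption, so check the app's `applicationId` in its Gradle config if nothing matches:
```
# The "llama" fragment is an assumption about the demo app's applicationId; adjust if needed.
adb shell pm list packages | grep -i llama
```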
If the app runs successfully on your device, you should see something like the following:

<p align="center">
<img src="https://raw.githubusercontent.com/pytorch/executorch/refs/heads/main/docs/source/_static/img/opening_the_app_details.png" style="width:800px">
</p>

Once you've loaded the app on the device:
1. Click on Settings in the app
2. Select MediaTek from the Backend dropdown
3. Click the "Load Model" button. This will load the models from the runner

## Reporting Issues
If you encounter any bugs or issues following this tutorial, please file a bug/issue on [GitHub](https://github.com/pytorch/executorch/issues/new).