InferrLM is a mobile application that brings LLMs and SLMs directly to your Android or iOS device and lets your device act as a local server. Cloud-based models such as Claude, Gemini, and ChatGPT are also supported, and local models support file attachments with RAG.
If you want to support me and the development of this project, you can donate to me through Ko-fi.
Demos of using the Apple Foundation Model, and of downloading MLX models from Hugging Face and running them on-device.
- Core local inference through llama.cpp with support for GGUF models on both Android and iOS.
- Local MLX inference for Apple Silicon devices.
- Seamless integration with cloud-based models from OpenAI, Google (Gemini), and Anthropic. Remote models are optional and require your own API keys along with a registered InferrLM account.
- Customizable base URLs for OpenAI-compatible providers such as OpenRouter, Groq, Ollama, LM Studio, and Together AI, so you can point the app at alternative API endpoints.
- Apple Foundation Model support on devices that support Apple Intelligence.
- Vision support through multimodal models with their corresponding projector (mmproj) files. You can read more about them here.
- Built-in camera lets you capture pictures directly within the app and send them to models.
- RAG (Retrieval-Augmented Generation) support for enhanced document understanding and context-aware responses.
- File attachment support with a built-in document extractor that runs OCR locally on every page of your documents and sends the extracted text to local models.
- Document ingestion system that processes and indexes your files for efficient retrieval during conversations.
- Native file upload support for the remote models.
- Built-in HTTP server that exposes REST APIs for accessing your models from any device on your local network. The server can be started from the Server tab. Share your InferrLM chat interface with computers, tablets, or other devices through a URL.
- Full API documentation is available HERE and at the server homepage.
- A command-line interface tool is available at github.com/sbhjt-gr/InferrLM-CLI that demonstrates how to build applications on top of the InferrLM API.
- Download manager that fetches models directly from Hugging Face. A curated list of models optimized for edge devices is available in the Models -> "Download Models" tab.
- Downloaded models appear in the chat screen model selector and in the "Stored Models" tab inside the "Models" tab.
- Import models from local storage or download directly from URLs.
- Messages support editing, regeneration, copy functionality and markdown rendering.
- Fast native markdown rendering with math support, powered by react-native-nitro-markdown, a C++-based renderer built on the Nitro Modules bridge.
- Dedicated branching support on each chat bubble lets you fork the conversation from any message, preserving the original thread so you can explore alternate directions without losing prior context.
- Code generated by the models is rendered in code blocks with copy-to-clipboard support.
- Chat history management with the ability to pin conversations.
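The per-message branching described above can be sketched as a simple message tree in which each message keeps a reference to its parent. This is an illustrative data structure only, not InferrLM's actual schema; all names here are hypothetical.

```typescript
// Illustrative sketch of per-message conversation branching.
// Not InferrLM's implementation; the shape and names are hypothetical.
interface ChatMessage {
  id: number;
  parentId: number | null; // null for the root message of a conversation
  role: "user" | "assistant";
  content: string;
}

class ConversationTree {
  private messages = new Map<number, ChatMessage>();
  private nextId = 1;

  // Append a message under a parent (or start a new root thread).
  add(parentId: number | null, role: ChatMessage["role"], content: string): number {
    const id = this.nextId++;
    this.messages.set(id, { id, parentId, role, content });
    return id;
  }

  // Walk parent links to reconstruct the thread ending at `leafId`.
  thread(leafId: number): ChatMessage[] {
    const out: ChatMessage[] = [];
    let cur = this.messages.get(leafId);
    while (cur) {
      out.unshift(cur);
      cur = cur.parentId === null ? undefined : this.messages.get(cur.parentId);
    }
    return out;
  }
}
```

Forking is just calling `add()` with an earlier message's id: the original leaf keeps its own chain of parent links, so neither branch loses context.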
If you want to contribute, or just want to run it locally, follow the guide below. Your work and modifications must adhere to our LICENSE.
- Node.js (>= 16.0.0, < 23.0.0)
- npm or yarn
- Expo CLI
- Android Studio (for Android development)
- Xcode (for iOS development)
1. Clone the repository

   ```bash
   git clone https://github.com/sbhjt-gr/InferrLM
   cd InferrLM
   ```

2. Install dependencies

   ```bash
   yarn install
   ```

3. Set up environment variables

   Configure your API keys and Firebase settings. The list of variables is available in app.config.json.

4. Run on device or emulator

   ```bash
   # For Android
   npx expo run:android

   # For iOS
   npx expo run:ios
   ```
InferrLM includes a built-in HTTP server that exposes your models through an OpenAI-compatible API, so any device on your local network can access them. This lets you integrate InferrLM with other applications, scripts, or services.
- Open the InferrLM app
- Navigate to the Server tab
- Toggle the server switch to start it
- The server URL will be displayed (typically http://YOUR_DEVICE_IP:8889)
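Once the server is up, any OpenAI-compatible client on the LAN can talk to it. The sketch below assumes the standard OpenAI-style `/v1/chat/completions` route and no authentication on the local network; the server's own documentation (at its homepage) is authoritative for the exact routes, and the model name is a placeholder.

```typescript
// Minimal sketch of calling an InferrLM server from another device on the LAN.
// Assumption: the chat route follows the OpenAI convention (/v1/chat/completions).
const BASE_URL = "http://YOUR_DEVICE_IP:8889";

function buildChatRequest(model: string, prompt: string) {
  return {
    url: `${BASE_URL}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

async function chat(model: string, prompt: string): Promise<string> {
  const { url, init } = buildChatRequest(model, prompt);
  const res = await fetch(url, init);
  const data = await res.json();
  // OpenAI-style responses carry the reply in choices[0].message.content.
  return data.choices[0].message.content;
}
```

Replace `YOUR_DEVICE_IP` with the address shown in the Server tab, and pass the name of a model loaded on the device.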
The InferrLM-CLI tool is a terminal-based client that connects to your InferrLM server and provides an interactive chat interface directly from your command line. This serves as both a functional tool and a reference implementation for developers who want to build applications using the InferrLM REST API.
The CLI is built with React and Ink to provide a basic terminal UI with features like streaming responses, conversation history, and an interactive setup flow. You can find the complete source code and installation instructions at github.com/sbhjt-gr/InferrLM-CLI.
To get started with the CLI, make sure your InferrLM server is running on your mobile device, then install the CLI tool and follow the setup instructions provided in its repository.
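The streaming responses the CLI displays typically arrive as server-sent events in the OpenAI style, i.e. `data: <json>` lines ending with `data: [DONE]`. A minimal parser sketch under that assumption (the server's documentation is authoritative for the actual wire format):

```typescript
// Sketch: extract text deltas from an OpenAI-style SSE stream chunk.
// Assumes "data: <json>" lines terminated by "data: [DONE]", as used by
// OpenAI-compatible streaming endpoints.
function parseSseChunk(chunk: string): string {
  let text = "";
  for (const line of chunk.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue;
    const payload = trimmed.slice(5).trim();
    if (payload === "[DONE]") break;
    const json = JSON.parse(payload);
    // Each streamed chunk carries an incremental delta for choice 0.
    text += json.choices?.[0]?.delta?.content ?? "";
  }
  return text;
}
```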
Once the server is running, you can access the complete API documentation by opening the server URL in any web browser. The documentation includes:
- Chat and completion endpoints
- Model management operations
- RAG and embeddings APIs
- Server configuration and status
For detailed API reference, see the REST API Documentation.
This project is distributed under the AGPL-3.0 License. Please read it here. Any modifications must adhere to the rules of this LICENSE.
Contributions are welcome! You can find issues in the issues tab or raise new ones and start your work.
Read our Contributing Guide for detailed contribution guidelines, code standards, and best practices.
- Framework: React Native 0.81 with Expo 54 (New Architecture)
- App language: TypeScript, JavaScript
- iOS native modules: Swift
- Android native modules: Kotlin
- Inference engine: C, C++
- Navigation: React Navigation
- Database: OP-SQLite, Expo SQLite
- llama.cpp - The underlying engine for running local GGUF models on both Android and iOS.
- mlx-swift-lm - Swift library for running MLX language models on Apple Silicon, powering the MLX inference backend on iOS.
- inferrlm-llama.rn - The customized React Native adapter that bridges llama.cpp. Forked from llama.rn and self-hosted so that llama.cpp can be updated more frequently.
- @inferrlm/react-native-mlx - Apple Silicon MLX inference engine for iOS, providing optimized on-device performance via the Nitro Modules bridge. Forked and maintained from react-native-nitro-mlx.
- react-native-nitro-markdown - Native C++ markdown renderer for React Native, used for fast chat message rendering.
- react-native-rag + @langchain/textsplitters - RAG implementation for React Native that powers the document retrieval and ingestion features using LangChain.
- react-native-ai - The adapter that provides access to the Apple Foundation Model through its Swift API.
- If you think another project deserves a mention here, please let me know.
Star this repository if you find it useful!


