Building AI-Powered Image Vector Search on iPhone
As someone deeply involved in the world of AI, I’ve always been fascinated by the potential of machine learning to solve real-world problems. Despite not being an iOS developer by trade, I recently found myself embarking on an ambitious project that combined cutting-edge AI technology with mobile app development. The catalyst? My wife’s fashion brand and her ongoing struggle with image management on the iPhone.
The Spark of an Idea
Like many small business owners, my wife relies heavily on WhatsApp to communicate with her clients. However, she often complained about the time-consuming, entirely manual process of searching her phone for specific product images to send to clients. As I listened to her frustrations, a lightbulb went off in my head: what if I could leverage AI to create a text-based image search tool specifically for her needs?
The concept was simple: build an app that would allow her to quickly find relevant product photos using natural language queries. The execution, however, would prove to be anything but straightforward.
Technical Challenges and Breakthroughs
To bring this idea to life, I needed to overcome several significant hurdles:
1. Finding a Vector DB that works on mobile, specifically iOS.
2. Converting a neural network model to run on CoreML, Apple’s machine learning framework.
3. Building a SwiftUI app from scratch, despite my limited iOS development experience.
Turso and Vector Search
I started searching around for databases that could work on the iPhone and even asked on Reddit, but without much success. The project became possible when Turso announced vector search support. Turso, a database provider, had created libSQL, a fork of SQLite that adds powerful new features to this well-established database. I also happen to know the founders, Glauber and Pekka, so I could be an early adopter and ask questions whenever I ran into issues.
Turso’s vector implementation currently offers two vector search modes (a minimal schema sketch follows this list):
1. Exact search using cosine distance / similarity (vector_distance_cos)
2. Approximate nearest-neighbor search using a DiskANN-based index (vector_top_k)
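For context, here is a rough sketch, based on Turso's libSQL vector documentation, of a table with a 512-dimensional embedding column and a DiskANN index over it. The table and index names match the queries shown later in this post, but treat the exact syntax as an assumption to double-check against the docs:
// Sketch of a possible schema for 512-dimensional CLIP embeddings (illustrative, not the app's actual schema)
let createPhotosTable = """
CREATE TABLE IF NOT EXISTS photos (
    photo_id  TEXT,
    file_name TEXT,
    full_path TEXT,
    embedding F32_BLOB(512)   -- 512-dimensional 32-bit float vector
);
"""
// DiskANN-backed approximate nearest-neighbor index used by vector_top_k()
let createVectorIndex = """
CREATE INDEX IF NOT EXISTS photos_idx ON photos (libsql_vector_idx(embedding));
"""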
Image Similarity Search
I had been playing around with image similarity search for another personal project I had recently built called PetEnchente.
In May 2024, Brazil suffered the worst floods in its history, and over 20,000 animals were separated from their owners. Many temporary shelters were set up to rescue the animals, and they created Instagram pages to give the rescues visibility. This was all done in good will, but it left pet owners with the complicated task of figuring out which pages to check and going through all of them (over 100).
To help owners reunite with their pets, some friends and I created a site that indexes all these Instagram pages and lets owners upload a picture to find similar animals. I might do a post on this later to explain the technical details; let me know if you are interested.
For PetEnchente I used a mix of models (DINOv2, YOLOv8 and custom models), and that was when I first heard about CLIP (Contrastive Language-Image Pre-training), a powerful multimodal model developed by OpenAI.
The Power of CLIP
CLIP’s ability to generate similar embeddings for related text and image pairs was exactly what I needed. To make things even better, I found a fine-tuned version by LAION called CLIP-ViT-B-32-laion2B-s34B-b79K, trained on the 2-billion-sample English subset of the LAION-5B dataset, an astounding amount of data!
The Development Process
Coming up with the idea was the easy part; now let's put it all together.
Compiling Turso for iOS
The first step was to compile Turso's SQLite fork for iOS so I could use vector search on device. This wasn't a documented use case, so I had to figure it out by myself; here is the step-by-step process so you don't have to waste time.
Before starting, make sure you have Xcode and the command line developer tools installed.
xcode-select --install
Clone the source code of the project
At the moment, the vector functionality is only available in the beta branch "vector"; it will land in the main branch once it becomes GA.
git clone --branch vector --depth 1 https://github.com/tursodatabase/libsql.git
Build SQLite for Mac
Now run the command below to configure and build the project for macOS. This will generate the SQLite amalgamation file (sqlite3.c) and the sqlite3.h header file we will need to link the library in Xcode:
cd libsql/libsql-sqlite3 && ./configure && make
Build the sqlite libs for iOS:
The build for the iOS Simulator is different from the one for a real device, so you need both if you want to develop on the simulator and deploy to a real device. For my project I used a real device so I could test on my actual photo library, since the simulator only ships with a handful of sample photos.
# Run these commands from the directory where you cloned libsql
SRC_DIR=$(pwd)/libsql/libsql-sqlite3
DST_DIR="${SRC_DIR}/Turso_ios/sqlite"
XCODE_PATH=$(xcode-select --print-path)
SDK_VERSION=$(xcrun --sdk iphoneos --show-sdk-version)
# Clean any previous build and recreate the destination folder
rm -rf ${DST_DIR}
mkdir -p ${DST_DIR}
# Copy sqlite3.h into the Xcode project folder
cp ${SRC_DIR}/sqlite3.h ${DST_DIR}/sqlite3.h
# Build for iOS (arm64)
xcrun --sdk iphoneos clang -arch arm64 -isysroot $(xcrun --sdk iphoneos --show-sdk-path) -dynamiclib ${SRC_DIR}/sqlite3.c -o ${DST_DIR}/libsqlite3_arm64.dylib -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_ENABLE_COLUMN_METADATA -install_name @rpath/libsqlite3_arm64.dylib
# Build for iOS Simulator (arm64)
xcrun --sdk iphonesimulator clang -arch arm64 -isysroot $(xcrun --sdk iphonesimulator --show-sdk-path) -dynamiclib ${SRC_DIR}/sqlite3.c -o ${DST_DIR}/libsqlite3_sim_arm64.dylib -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_ENABLE_COLUMN_METADATA -install_name @rpath/libsqlite3_sim_arm64.dylib
At the end of these steps you should have the built libraries inside the destination folder (DST_DIR):
libsqlite3_arm64.dylib (for iPhone)
libsqlite3_sim_arm64.dylib (for iOS Simulator)
sqlite3.h
Setting up Xcode to use SQLite
Add the newly generated files to your Xcode project, copying them in.
Add an Objective-C file with any name to the project. It can be deleted later, but adding it prompts Xcode to offer to create a bridging header file, which is what we actually need.
Accept the prompt to create the bridging header, then delete the Objective-C file you just created.
Add the following to the bridging header file to import the SQLite C headers (the file should be called something like <project-name>-Bridging-Header.h):
#import <sqlite3.h>
Your project should now look something like this:
Now check that everything is correctly set up in Xcode.
The SQLite libs we built from source should appear under your Target > Build Phases
In your target's General tab, make sure the library's Embed option is set to "Embed & Sign"
Now you are ready to use it from Swift. I created a GitHub repo with all of this set up for you, including an example of calling the C code from Swift ready to go. Just remember to rebuild the SQLite libs to get the latest features, and to use the main branch once GA becomes available.
I've also provided a basic class in the repo so you can call the lib from Swift using the C headers.
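To give a rough idea of what that looks like, here is a minimal illustrative sketch (not the repo's actual class; names like VectorDB are mine, and error handling is reduced to prints) of opening a database and reading rows through the C API:
import Foundation

// Minimal illustrative wrapper around the libSQL build we just linked.
final class VectorDB {
    private var db: OpaquePointer?

    init?(path: String) {
        // Opens (or creates) the database file at the given path
        guard sqlite3_open(path, &db) == SQLITE_OK else {
            print("Unable to open database: \(String(cString: sqlite3_errmsg(db)))")
            return nil
        }
    }

    deinit { sqlite3_close(db) }

    // Runs a query and returns the first column of every row as a String
    func queryStrings(_ sql: String) -> [String] {
        var statement: OpaquePointer?
        var rows: [String] = []
        guard sqlite3_prepare_v2(db, sql, -1, &statement, nil) == SQLITE_OK else {
            print("Failed to prepare: \(String(cString: sqlite3_errmsg(db)))")
            return rows
        }
        defer { sqlite3_finalize(statement) }
        while sqlite3_step(statement) == SQLITE_ROW {
            if let text = sqlite3_column_text(statement, 0) {
                rows.append(String(cString: text))
            }
        }
        return rows
    }
}

// Quick sanity check that the custom build is linked:
// print(String(cString: sqlite3_libversion()))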
Converting CLIP to CoreML
The next challenge was to convert the CLIP model to run on CoreML. This process involved several steps. Luckily, I found an example of converting CLIP to CoreML in a similar project and adapted it to LAION's fine-tuned model. There is also good documentation on Apple's site.
It basically involved tracing the original model with PyTorch and converting it to CoreML.
1. Set up a Python environment:
Make sure to use Python 3.10 for this; later versions aren't currently compatible with coremltools.
Bonus tip: I use $(which python) when installing pip packages in conda environments. This makes sure the package ends up in the active conda environment, which has saved me a lot of headaches.
conda create -n clip python=3.10
conda activate clip
$(which python) -m pip install coremltools
$(which python) -m pip install transformers
conda install pytorch::pytorch==2.2.0 torchvision torchaudio -c pytorch
$(which python) -m pip install git+https://github.com/openai/CLIP.git
2. Trace and convert the text model:
import torch
import coremltools as ct
import numpy as np
from transformers import CLIPTextModelWithProjection, CLIPTokenizerFast

model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"

# Load the text encoder (with projection head) and its tokenizer
model = CLIPTextModelWithProjection.from_pretrained(model_id, return_dict=False)
tokenizer = CLIPTokenizerFast.from_pretrained(model_id)
model.eval()

# Trace the model with an example input
example_input = tokenizer("Black Cat", return_tensors="pt")
example_input = example_input.data['input_ids']
traced_model = torch.jit.trace(model, example_input)

max_seq_length = 76

# Convert the traced model to a CoreML ML Program
text_encoder_model = ct.convert(
    traced_model,
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.iOS16,
    compute_precision=ct.precision.FLOAT32,
    inputs=[ct.TensorType(name="prompt",
                          shape=[1, max_seq_length],
                          dtype=np.int32)],
    outputs=[ct.TensorType(name="embOutput", dtype=np.float32),
             ct.TensorType(name="embOutput2", dtype=np.float32)],
)
text_encoder_model.save("TextCLIPEncoder.mlpackage")
3. Trace and convert the image model:
from transformers import CLIPVisionModelWithProjection, CLIPProcessor
from PIL import Image

# Load the vision encoder (with projection head) and the image processor
model = CLIPVisionModelWithProjection.from_pretrained(model_id, return_dict=False)
processor = CLIPProcessor.from_pretrained(model_id)
model.eval()

# Trace the model with an example image
img = Image.open("cat.jpeg")
example_input = processor(images=img, return_tensors="pt")
example_input = example_input['pixel_values']
traced_model = torch.jit.trace(model, example_input)

# Fold CLIP's normalization (mean/std) into CoreML's built-in image preprocessing
bias = [-processor.image_processor.image_mean[i] / processor.image_processor.image_std[i] for i in range(3)]
scale = 1.0 / (processor.image_processor.image_std[0] * 255.0)
image_input_scale = ct.ImageType(name="colorImage",
                                 color_layout=ct.colorlayout.RGB,
                                 shape=example_input.shape,
                                 scale=scale, bias=bias,
                                 channel_first=True,)

# Convert the traced model to a CoreML ML Program
image_encoder_model = ct.convert(
    traced_model,
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT32,
    minimum_deployment_target=ct.target.iOS16,
    inputs=[image_input_scale],
    outputs=[ct.TensorType(name="embOutput", dtype=np.float32),
             ct.TensorType(name="embOutput2", dtype=np.float32)],
)
image_encoder_model.save("ImageCLIPEncoder.mlpackage")
4. Compile the models for use with Xcode and add them to the project:
xcrun coremlc compile TextCLIPEncoder.mlpackage ./CompiledModels
xcrun coremlc compile ImageCLIPEncoder.mlpackage ./CompiledModels
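Once the compiled .mlmodelc bundles are added to the app, they can be loaded with the CoreML API. Below is a rough, hypothetical sketch of running the text encoder: the input name "prompt", the output name "embOutput" and the [1, 76] Int32 shape come from the conversion script above, while the bundled file name and the tokenization step (a Swift port of CLIP's BPE tokenizer) are assumptions:
import Foundation
import CoreML

// Rough sketch: load the compiled text encoder and produce a 512-d embedding.
// Assumes the token IDs were produced elsewhere (e.g. a Swift CLIP BPE tokenizer).
func textEmbedding(tokenIDs: [Int32]) throws -> [Float] {
    // coremlc produces TextCLIPEncoder.mlmodelc from TextCLIPEncoder.mlpackage (assumption)
    guard let url = Bundle.main.url(forResource: "TextCLIPEncoder", withExtension: "mlmodelc") else {
        throw NSError(domain: "CLIPSearch", code: 1)
    }
    let config = MLModelConfiguration()
    config.computeUnits = .all // let CoreML pick CPU, GPU or Neural Engine
    let model = try MLModel(contentsOf: url, configuration: config)

    // The converter declared an input named "prompt" with shape [1, 76] of Int32
    let input = try MLMultiArray(shape: [1, 76], dataType: .int32)
    for i in 0..<76 {
        input[i] = NSNumber(value: i < tokenIDs.count ? tokenIDs[i] : 0) // zero-pad (assumption)
    }

    let features = try MLDictionaryFeatureProvider(dictionary: ["prompt": input])
    let output = try model.prediction(from: features)

    // "embOutput" is the embedding output declared in the conversion script
    guard let emb = output.featureValue(for: "embOutput")?.multiArrayValue else { return [] }
    return (0..<emb.count).map { Float(truncating: emb[$0]) }
}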
Putting It All Together
With the core components in place, the next challenge was integrating everything into a functional iOS app. This involved:
1. Using Apple’s APIs to read all the photos on the device, which proved to be quite challenging, especially when dealing with asynchronous operations and callbacks (see the sketch after this list).
2. Implementing the search functionality using Turso’s vector search capabilities.
3. Creating a user-friendly interface with SwiftUI.
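For step 1, a rough sketch of enumerating the photo library with PhotoKit might look like the following; authorization handling is simplified, and the indexImage closure is a hypothetical hook where the CLIP image encoder and the database insert would go:
import Photos
import UIKit

// Rough sketch of step 1: enumerate every photo in the library and hand each one
// to an indexing closure (which would run the CLIP image encoder and store the
// resulting embedding in the database).
func indexAllPhotos(indexImage: @escaping (String, UIImage) -> Void) {
    PHPhotoLibrary.requestAuthorization(for: .readWrite) { status in
        guard status == .authorized || status == .limited else { return }

        let assets = PHAsset.fetchAssets(with: .image, options: nil)
        let manager = PHImageManager.default()
        let options = PHImageRequestOptions()
        options.isNetworkAccessAllowed = true        // allow pulling originals from iCloud
        options.deliveryMode = .highQualityFormat

        assets.enumerateObjects { asset, _, _ in
            // CLIP ViT-B/32 works on 224x224 inputs, so a small thumbnail is enough
            manager.requestImage(for: asset,
                                 targetSize: CGSize(width: 224, height: 224),
                                 contentMode: .aspectFill,
                                 options: options) { image, _ in
                if let image = image {
                    indexImage(asset.localIdentifier, image)
                }
            }
        }
    }
}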
The resulting app works as follows: when the user types a text query, it is converted into a 512-dimensional vector embedding and, after a brief debounce period, the vector search runs automatically. This embedding is then compared to the pre-indexed 512-dimensional image embeddings stored in the on-device database, and the closest matches are displayed almost instantly.
Here is an example of what this looks like in SQL:
"SELECT
photo_id,
file_name,
full_path
FROM
photos
ORDER BY
(1-vector_distance_cos(embedding, '\(strEmbeddings)')) desc
LIMIT \(k);"
Or, to use the DiskANN index:
"SELECT
photo_id
FROM
vector_top_k('photos_idx', '\(strEmbeddings)', \(k))
JOIN
photos
ON
photos.rowid = id;"
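On the Swift side, the debounced flow described above can be sketched roughly like this with Combine (SearchViewModel and the steps in the comments are illustrative, not the app's actual code):
import Combine
import Foundation

// Sketch of the search flow: the query text is debounced, encoded into an
// embedding, and then used to run one of the SELECT statements shown above.
final class SearchViewModel: ObservableObject {
    @Published var query = ""
    @Published var results: [String] = []   // e.g. photo identifiers or paths

    private var cancellables = Set<AnyCancellable>()

    init() {
        $query
            .debounce(for: .milliseconds(300), scheduler: DispatchQueue.main)
            .removeDuplicates()
            .filter { !$0.isEmpty }
            .sink { [weak self] text in
                self?.search(text)
            }
            .store(in: &cancellables)
    }

    private func search(_ text: String) {
        // 1. Tokenize the query and run the CLIP text encoder to get a 512-d embedding
        // 2. Serialize it as a string like "[0.12, -0.03, ...]" for interpolation into the SQL
        // 3. Execute one of the two queries above and publish the returned rows
    }
}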
The Results
After indexing over 10,000 images on my iPhone (a process that took around 15 minutes, at 0.1–0.2 s per image), I was amazed by the app's performance. Image searches were lightning-fast, returning relevant results in milliseconds.
Not everything is perfect: I still need to improve my iOS skills, and the image neural net used for indexing takes 45–60 s to load. Also, after the first 10k images, indexing slowed to 2–3 s per image once the app started pulling photos from iCloud.
What truly impressed me was the model's ability to understand and find images based on complex concepts. It could even identify photos related to my favorite football team, Grêmio, demonstrating a level of contextual understanding that went beyond simple keyword matching.
I plan to add extra features as time allows to further improve my wife's workflow:
moving search results to specific albums
saving smart filters
deduplicating similar images
searching for images by similarity
filtering by metadata (dates, album, etc.)
Give me more ideas in the comments!
Final Thoughts
This project was born out of a desire to help my wife work more efficiently and my own wish to try out on-device AI and vector search. As AI and machine learning models become more sophisticated and efficient, we're entering an era where ever more powerful, context-aware search capabilities can be put directly into users' hands.
For app developers, vector DBs combined with ML models open up a world of possibilities. The ability to perform advanced semantic search without relying on cloud services not only enhances privacy but also enables offline functionality, making apps more resilient and user-friendly.
Last month, Apple announced a lot of AI features for iOS 18 that are probably powered by a proprietary vector search solution under the hood. One of them is very similar to this project and allows searching images by text. It was announced after I started building this project, but my goal here was to learn how to build an AI mobile app anyway.
On-device vector search and AI-powered features are poised to revolutionize the mobile experience. For developers willing to dive in and experiment, the possibilities are truly exciting. Who knows? The next game-changing app idea might just come from trying to solve a simple, everyday problem.