I recently got some time to try out [Modal](https://modal.com), a serverless platform that offers GPUs. I was searching for a way to run inference and fine-tune models easily and cost-effectively (since I am GPU poor). There were a couple of options during my research, such as [Replicate](https://replicate.com/) and friends, [Huggingface Inference Endpoints](https://huggingface.co/docs/inference-endpoints/main/en/index), etc., but I was drawn to Modal for a couple of reasons.

# Actually Serverless

Modal is a *completely* serverless Python platform. This is different from a scale-to-zero model like Replicate's, where you create and upload a container, select a GPU, and call out to it through an API endpoint that can be rate-limited (if you *are* calling an endpoint for LLM inference, I encourage you to [[Go into AI with Go|explore using Go]]!). In Modal, a native Python function call becomes a remote container call with just an annotation; no serialization/deserialization logic or platform-specific SDK is needed. This makes Modal really easy to adopt, as you can do anything you want as long as the parameters passed between functions are supported by cloudpickle. That means I can use a Huggingface Pipeline, TGI, VLLM, Ollama, whatever I want, as long as I encapsulate the functionality in a Python function.

In addition, a function can scale out across inputs by simply provisioning more containers. This is one of the biggest draws: I can set a concurrency limit, call `.map` over a list of function inputs, and Modal will automatically create the needed number of containers, do the work, and return the *ordered* list of results.

# Cost Effective

For my use cases, a platform is cost effective when I only pay for the compute I actually use. Modal allows exactly that: I can run a Python program locally, provision compute in Modal on demand, and once the program exits, all provisioned compute terminates immediately. This is especially useful for experimentation, running evaluations, and offline batched inference to maximize GPU utilization. Later in this post, I'll show you how to run the test split of the GSM8K benchmark against Mistral 7B Instruct v0.2 for well under a dollar in a couple of minutes of wall-clock time, fanned out across multiple GPUs.

If you are not doing something like offline batched inference, you still scale out from zero and back down to zero, with only a minor cost incurred in the form of idle compute, which is *configurable*. If you want to pay for 5 minutes of idle compute to avoid cold starts, you can.

# Some cons

While Modal is great for the most part, and I'm definitely impressed, it does have some minor cons at present.

## Cost

First, the most glaring con is cost. A simple comparison of on-demand GPU prices from a few different platforms as of the time of writing:

| Per-hour cost (USD) | Modal | Runpod | Lambda Labs |
| ---- | ---- | ---- | ---- |
| T4 | 0.59 | - | - |
| L4 | 1.05 | 0.44 | - |
| A10G | 1.10 | - | - |
| A100 40 GB | 3.73 | - | 1.29 |
| A100 80 GB PCIe | 5.59 | 1.89 | - |
| H100 PCIe | 7.65 | 3.89 | 2.49 |

It may seem odd that I described Modal as cost effective given the table above, but in practice, most workloads are bursty, and if you tune your Python programs, you can maximize GPU usage and minimize execution time. Much of the time during experimentation, such as prompt engineering, you only run the GPU for seconds at a time, while the rest of the time it sits idle, which on a rented instance is wasted cost.
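As a rough, hypothetical back-of-the-envelope illustration using the prices above: an experiment that keeps an A10G busy for a total of 10 minutes a day costs about (10 / 60) × $1.10 ≈ $0.18 per day on Modal, whereas keeping even the cheapest instance in the table (Runpod's L4 at $0.44/hour) rented around the clock costs 24 × $0.44 ≈ $10.56 per day, most of it spent idling.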
Unless you have sustained, *stable* traffic, Modal is likely to be far more cost effective.

## Log retention

Log retention is really short: 24 hours, after which logs are permanently deleted. This is a minor inconvenience, as realistically you would export logs to some external stack like ELK, Splunk, etc., but during experimentation it would be nice to have a longer retention window to review the results and logs of previous runs.

# Benchmarking

I didn't start from scratch: I took the [example of running Gemma 7B using VLLM](https://modal.com/docs/examples/vllm_gemma) and adapted it to run Mistral 7B Instruct v0.2 against the GSM8K benchmark. I recommend running the example first and playing around with it before following along here. The entire benchmark lives in a single `main.py` file.

## Prerequisites

Part of the script runs locally on your machine and part runs in Modal. To run it, we need some dependencies installed for the local part of the script:

```
pip install modal datasets transformers jinja2
```

## Make the container image

We want to run the Mistral 7B Instruct v0.2 model on VLLM. Here is the first part of the `main.py` file:

```python
import copy
import os
import time

# Some necessary imports to set up our serverless app
from modal import Image, Stub, enter, gpu, method

# The directory that the model will be downloaded to from HuggingFace
MODEL_DIR = "/model"
# The HuggingFace repo id for the model
BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
# Specifying to Modal the GPU we want to use
GPU_CONFIG = gpu.A100(count=1, memory=40)


# A function that runs on Modal when building our custom container image.
# We want to cache the model weights in the image so our cold start time is reduced
def download_model_to_folder():
    from huggingface_hub import snapshot_download
    from transformers.utils import move_cache

    os.makedirs(MODEL_DIR, exist_ok=True)

    snapshot_download(
        BASE_MODEL,
        local_dir=MODEL_DIR,
    )
    move_cache()


# The spec for our custom container image
image = (
    Image.from_registry(
        "nvidia/cuda:12.2.0-devel-ubuntu22.04", add_python="3.10"
    )
    .pip_install(
        "vllm==0.3.2",
        "huggingface_hub==0.20.3",
        "hf-transfer==0.1.5",
        "torch==2.1.2",
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
    .run_function(
        download_model_to_folder,
        timeout=60 * 20,
    )
)
```

The next part of `main.py` defines what we want to run on Modal's infrastructure. We are creating a Modal class with lifecycle hooks to run our model on VLLM:

```python
# Define our stub to label our ephemeral apps in Modal
stub = Stub("gsm8k-demo")


# Annotation to let Modal know this needs to run on their infrastructure
@stub.cls(
    gpu=GPU_CONFIG,  # Use the GPU config we defined above
    timeout=60 * 5,  # Time the function out after 5 minutes
    container_idle_timeout=60 * 5,  # Shut the container down after 5 idle minutes.
    # For our experiment this doesn't matter, as the modal CLI will take care of
    # terminating workers as soon as our script finishes.
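    # (Note: concurrency_limit below is what lets the benchmark fan out; since the
    # test set splits into fewer than ten batches, every batch can get its own
    # container, and the whole run should take roughly the wall time of one batch.)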
    concurrency_limit=10,  # Don't run more than 10 concurrent containers
    image=image,  # Use the container image we defined above
    retries=3,  # In case of problems or a timeout, retry this many times
)
class Model:
    def __init__(self):
        self.llm = None

    # Run this function once on container startup
    @enter()
    def load_model(self):
        from vllm import LLM

        # Create the LLM engine to do batched inference
        self.llm = LLM(
            MODEL_DIR,
            max_model_len=16752,
            max_context_len_to_capture=16752,
            kv_cache_dtype="fp8_e5m2",
        )

    # The function that our script calls for batched inference
    @method()
    def generate(self, prompts: list[str], stop_seqs: list[str] = None) -> dict:
        import time

        from vllm import RequestOutput, SamplingParams

        sampling_params = SamplingParams(
            temperature=0.75,
            top_p=1,
            max_tokens=800,
            presence_penalty=1.15,
            stop=stop_seqs,
        )

        # Get the time before generating
        start_time = time.time()
        # Generate the outputs from the LLM engine
        results: list[RequestOutput] = self.llm.generate(
            prompts, sampling_params, use_tqdm=False
        )
        # Get the time after generating
        end_time = time.time()
        execution_time = end_time - start_time

        # Count the number of input tokens
        input_tokens = 0
        # Count the number of output tokens
        output_tokens = 0
        for output in results:
            input_tokens += len(output.prompt_token_ids)
            output_tokens += len(output.outputs[0].token_ids)

        return {
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "output": results[0].outputs[0].text,
            "execution_time": execution_time,
        }
```

For the final part, we define the entrypoint into our benchmark script:

```python
# This annotation lets the modal CLI know to call this function to run the script
@stub.local_entrypoint()
def main():
    from datasets import load_dataset
    from transformers import AutoTokenizer

    # Get the start time of the script
    start_time = time.time()

    # Load the benchmark dataset
    dataset = load_dataset("gsm8k", name="main", split="test")

    messages = []

    # Load the model tokenizer so we can use its chat template
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    stop_seqs = [tokenizer.eos_token]

    # Number of few-shot examples to prepend to each input
    k = 8

    # Convert the few-shot examples to dicts representing a chat conversation
    for i in range(k):
        question = dataset[i]["question"]
        answer = dataset[i]["answer"]
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})

    # Remove the few-shot examples from the benchmark
    dataset = dataset[k:]

    # Batch size to pass to the workers.
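    # (Tradeoff: a larger batch size means fewer containers, each with more prompts
    # for VLLM to chew through; a smaller one fans the work out across more
    # containers, up to the concurrency_limit set above.)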
    # This is the ideal number I found for A100 GPUs after testing a few different values
    batch_size = 200

    # Instantiate the Modal class
    model = Model()

    # Convert the conversation dicts to prompt strings using the tokenizer's chat
    # template, and construct the list of batched prompt strings
    inputs = []
    prompts = []
    for i in range(len(dataset["question"])):
        question = dataset["question"][i]
        # Copy the few-shot conversation
        completion = copy.deepcopy(messages)
        # Append the actual word problem we want the LLM to solve
        completion.append({"role": "user", "content": question})
        # Convert the conversation to the model's prompt format using the tokenizer's chat template
        prompts.append(tokenizer.apply_chat_template(completion, tokenize=False))
        # Once we have accumulated enough prompt strings for a single batch, append it to the list of inputs
        if len(prompts) == batch_size or i == len(dataset["question"]) - 1:
            inputs.append((prompts, stop_seqs))
            prompts = []

    # The special starmap function maps our batched inputs onto multiple containers
    # and returns the results in the same order
    responses = list(model.generate.starmap(inputs))

    # Report the stats
    total_GPU_execution_time = 0
    total_input_tokens = 0
    total_output_tokens = 0
    for response in responses:
        total_GPU_execution_time += response["execution_time"]
        total_input_tokens += response["input_tokens"]
        total_output_tokens += response["output_tokens"]

    print(f"total GPU execution time: {total_GPU_execution_time}")
    print(f"total input tokens: {total_input_tokens}")
    print(f"total output tokens: {total_output_tokens}")
    print(f"script execution time: {time.time() - start_time}")
```

To run the above, just execute `modal run main.py`. Running it, I get this output:

```text
total GPU execution time: 355.26616978645325
total input tokens: 2101425
total output tokens: 221878
script execution time: 97.08975577354431
```

You can see that the total GPU execution time is greater than the actual script execution time, since we utilized multiple GPUs concurrently to process the benchmark. It cost me $0.74 to run the script.

For comparison, using an inference platform like [Together.xyz](https://www.together.ai/) to call the exact same model with a maximum of 100 concurrent calls took about 52 seconds and cost $0.4646606. Obviously, in this case, calling an optimized LLM inference platform is both cheaper and faster; the tradeoff is that you are boxed into their API. As of the time of writing, you cannot run a custom model, there is no support for providing images as part of the prompt to a vision-language model like Qwen-VL, you are subject to rate limits, there is no support for structured-output libraries like [outlines](https://outlines-dev.github.io/outlines/welcome/), and you cannot batch inputs to the API. With Modal, there are no such limitations, and you get a drastically simpler programming model. On [Together.xyz](https://www.together.ai/), in order to complete the benchmark as fast as possible, I had to write a Golang program that used a bounded number of goroutines, channels, and wait groups to concurrently send requests and collect responses from the API, significantly adding to the boilerplate needed to run the benchmark. When experimenting, this additional code slows development velocity down, or worse, introduces bugs that eat into time that could be spent running more experiments.

### Sidenote

Just by changing the GPU config and batch size, I was able to significantly reduce the script execution time and make it cheaper!
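(Rough intuition for why this helps, assuming GSM8K's usual test split of 1,319 problems: after dropping the 8 few-shot examples, a batch size of 175 splits the remaining 1,311 questions into 8 batches instead of 7, since 1311 / 175 rounds up to 8 while 1311 / 200 rounds up to 7, so one more container runs in parallel, and the H100 also works through each batch faster than the A100 did.)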
It now costs $0.64. Here is the output of the script:

```text
total GPU execution time: 155.90221166610718
total input tokens: 2101425
total output tokens: 219306
script execution time: 42.53459310531616
```

Change the following lines of code:

```python
# Specifying the GPU we want to use; change this to an H100
GPU_CONFIG = gpu.H100(count=1)

# Batch size to pass to the workers. Reduce this to utilize more workers concurrently
batch_size = 175
```

# Conclusion

I will definitely be adding Modal to my toolbox when experimenting with models and prompt engineering. Due to its serverless nature, running experiments is cheap, and the programming model makes it simple to try out different things. As a general serverless platform, it also lends itself well to offline batch jobs. I don't believe it is cost effective for fine-tuning jobs, however, as those are stable workloads, and the cost comparison table above shows many other options that offer on-demand GPUs at much cheaper prices.