Setting Up Your Own LLM with Ollama on AWS Using Nvidia GPU
Chapter 1: Introduction to Ollama and AWS
If you're looking to establish your own LLM, Ollama stands out as an excellent choice. It supports private data, facilitates a straightforward Retrieval Augmented Generation (RAG) setup, and offers GPU compatibility on AWS. Plus, you can get it operational in just a few minutes with a variety of models at your disposal.
Instead of diving into what an LLM is—a topic well-covered by many—I will concentrate on the essential steps to get Ollama operational on an AWS instance.
To begin, you need an AWS account and access to GPU-based instances. This access is not granted by default, so you'll need to submit a service quota (limit) increase request for the instance type. Specifically, you'll be asking for access to G-type instances.
In the limit increase request, select:
- Service Limit Increase
- EC2 Instances
- Primary instance type: All G instances
- Limit name: Instance Limit
- New limit value: 10
- Region: the region you will use
For your request, here's a sample use case description:
Hi, AWS,
I'm working on a GPU project for learning, and I'd like to increase my vCPU count to 10 for the US East (N. Virginia) region.
Please increase it as soon as you can.
Thank you.
Once submitted, you can typically expect a response within about two hours.
Section 1.1: Creating Your AWS Instance
With access granted, you can now create your instance. Navigate to the EC2 Dashboard and initiate the process to create a new instance. Here is the information I used:
- OS: Amazon Linux
- AMI: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.0.1
- Instance type: g4dn.xlarge
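If you prefer the command line over the console, a roughly equivalent launch command is sketched below; the AMI ID, key name, and security group ID are hypothetical placeholders you would replace with your own values:
aws ec2 run-instances --image-id ami-xxxxxxxxxxxxxxxxx --instance-type g4dn.xlarge --key-name my-key --security-group-ids sg-xxxxxxxx --count 1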
Make sure to configure your key pair for SSH access. If you haven't set one up yet, now's the time. Also establish the inbound access rules on the security group: SSH on port 22 and Ollama remote access on port 11434.
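If you manage the security group from the CLI instead of the console, the two ingress rules look something like the following (sg-xxxxxxxx is a hypothetical group ID; consider restricting the CIDR to your own IP rather than 0.0.0.0/0):
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 11434 --cidr 0.0.0.0/0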
Keep in mind that a running instance incurs costs: approximately $0.60 per hour, which works out to roughly $14 per day, or somewhere around $430 per month if left running continuously. Since I don't keep the instance running constantly, I only start it as needed.
Now you can SSH into your instance. Obtain the public IP of the running instance (e.g., 172.10.10.148) and connect using:
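A typical command, assuming a key pair saved as my-key.pem (a hypothetical name; use your own) and the AMI's default ec2-user account:
ssh -i my-key.pem ec2-user@172.10.10.148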
Section 1.2: Installing Ollama Software
After connecting, you can install the software and configure the server. Switch to a root shell with sudo:
$ sudo bash
Create a workspace and install the software:
mkdir ollamawork
cd ollamawork
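The Ollama install itself is a one-line script from the project's site (check ollama.com for the current version of the command):
curl -fsSL https://ollama.com/install.sh | sh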
Once installed, start the Ollama service:
service ollama start
To check the loaded LLMs (which will initially be none), run:
ollama list
Now let's load a model:
ollama pull llama2
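If you want a quick interactive test before touching the API, you can chat with the model directly in the terminal (type /bye to exit the session):
ollama run llama2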
After pulling the model, you can list again to confirm it's available. To test the API, execute:
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt":"Why is the sky blue?",
"stream":false
}'
You should receive a response confirming everything is functioning properly.
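The reply is a JSON object whose response field contains the generated text. If jq is available on the instance (install it first if it isn't), you can pull out just that field:
curl -s http://localhost:11434/api/generate -d '{"model": "llama2", "prompt": "Why is the sky blue?", "stream": false}' | jq -r .response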
To verify that your instance has GPU support, run:
lspci
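The full lspci output is long; filtering for NVIDIA is enough to confirm the g4dn instance's Tesla T4 GPU is visible:
lspci | grep -i nvidia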
Additionally, check the NVIDIA driver and CUDA version:
nvidia-smi -q | head
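To confirm the model actually runs on the GPU, you can watch utilization while a prompt is being generated (assuming the standard watch utility is installed, which it is on most Linux distributions):
watch -n 1 nvidia-smi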
To view the logs for Ollama, use:
journalctl -u ollama
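To follow the log live while you send test requests, add the follow flag:
journalctl -u ollama -f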
Chapter 2: Configuring Remote Access
At this point, your server is up and running, but it is only accessible locally. To enable remote access, we need to modify the Ollama server settings to listen on an external IP.
First, check which addresses it is currently listening on (by default, Ollama binds only to localhost, 127.0.0.1):
netstat -a | grep 11434
Next, configure the server to listen on all network interfaces by adding an override to the systemd unit. Open the override editor with:
systemctl edit ollama.service
Then add the following configuration and save:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
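systemctl edit stores that snippet as a drop-in override, typically under /etc/systemd/system/ollama.service.d/override.conf, so you can confirm it was written with:
cat /etc/systemd/system/ollama.service.d/override.conf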
After saving the changes, reload the service:
systemctl daemon-reload
systemctl restart ollama
Run the netstat command again to verify it’s listening on the 0.0.0.0 IP address:
netstat -a | grep 11434
Now, from an external computer, you can access the API using the instance's external IP:
curl http://172.10.10.148:11434/api/generate -d '{
"model": "llama2",
"prompt":"Why is the sky blue?",
"stream":false
}'
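You can also ask the remote server which models it has available via the tags endpoint:
curl http://172.10.10.148:11434/api/tags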
Summary
Congratulations! You now have your Ollama LLM server operational. You can explore additional models as they become available on the Ollama site.
Once Ollama is running, you’ll be ready to set up your Retrieval Augmented Generation (RAG) using LangChain, which I’ll discuss in a future post.
Thank you for reading, and feel free to connect with me!
The first video provides an expert guide on installing Ollama LLM with GPU on AWS in just 10 minutes, offering step-by-step instructions and insights.
The second video showcases how to deploy any open-source LLM with Ollama on an AWS EC2 + GPU in 10 minutes, covering models like Llama-3.1 and Gemma-2.