Setting Up Your Own LLM with Ollama on AWS Using Nvidia GPU

Chapter 1: Introduction to Ollama and AWS

If you're looking to establish your own LLM, Ollama stands out as an excellent choice. It supports private data, facilitates a straightforward Retrieval Augmented Generation (RAG) setup, and offers GPU compatibility on AWS. Plus, you can get it operational in just a few minutes with a variety of models at your disposal.

Instead of diving into what an LLM is—a topic well-covered by many—I will concentrate on the essential steps to get Ollama operational on an AWS instance.

To begin, you need an AWS account and access to GPU-based instances. This access is not granted by default, so you'll need to open the Service Quotas console and request a limit increase for G-type instances.

Find the quota increase request form and select:

  • Service Limit Increase
  • EC2 Instances
  • Primary Instance Type: All G instances
  • Limit name: Instance Limit
  • New limit value: 10
  • Region: the region you will use

For your request, here's a sample use case description:

Hi, AWS,

I'm working on a GPU project for learning, and I'd like to increase my vCPU count to 10 for region US East [Northern Virginia].

Please increase it as soon as you can.

Thank you.

Once submitted, you can typically expect a response within about two hours.
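If you'd rather not click through the console, the same request can be made with the AWS CLI. This is only a sketch: the quota code shown is what I believe corresponds to "Running On-Demand G and VT Instances", so confirm it first with the lookup command before submitting.

# look up the quota code for G-type instances (the name filter is an assumption; adjust if it returns nothing)
aws service-quotas list-service-quotas --service-code ec2 --query "Quotas[?contains(QuotaName, 'G and VT')]"

# request the increase (replace L-DB2E81BA if the lookup shows a different code)
aws service-quotas request-service-quota-increase --service-code ec2 --quota-code L-DB2E81BA --desired-value 10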

Section 1.1: Creating Your AWS Instance

With access granted, you can now create your instance. Navigate to the EC2 Dashboard and initiate the process to create a new instance. Here is the information I used:

  • OS: Amazon Linux
  • AMI: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.0.1
  • Instance type: g4dn.xlarge

Make sure to configure your SSH key pair; if you haven't set one up yet, now's the time. Also add inbound security group rules for SSH on port 22 and for Ollama's remote API on port 11434.
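If you prefer to script this part, a rough AWS CLI equivalent looks like the sketch below. The security group ID, AMI ID, and key name are placeholders to replace with your own, and restricting port 11434 to your own IP instead of 0.0.0.0/0 is safer if the instance will handle anything sensitive.

# open SSH and the Ollama API port on the instance's security group (placeholder group ID)
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 11434 --cidr 0.0.0.0/0

# launch the GPU instance (look up the AMI ID of the Deep Learning OSS Nvidia Driver AMI in your region)
aws ec2 run-instances --image-id ami-0123456789abcdef0 --instance-type g4dn.xlarge --key-name my-key --security-group-ids sg-0123456789abcdef0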

Launching your new instance will incur costs: at approximately $0.60 per hour for a g4dn.xlarge, that's roughly $14 per day, or on the order of $430 per month if left running around the clock. Since I don't keep the instance running constantly, I only start it as needed.

Now you can SSH into your instance. Obtain the public IP of the running instance (e.g., 172.10.10.148) and connect using:

ssh -i your-key.pem ec2-user@172.10.10.148

Section 1.2: Installing Ollama Software

After connecting, you can install the software and configure the server. Switch to the root user with sudo:

sudo bash

Create a workspace and install the software:

mkdir ollamawork

cd ollamawork
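From here, install Ollama with its documented Linux install script, which sets up the binary and the systemd service:

curl -fsSL https://ollama.com/install.sh | sh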

Once installed, start the Ollama service:

service ollama start

To check the loaded LLMs (which will initially be none), run:

ollama list

Now let's load a model:

ollama pull llama2

After pulling the model, you can list again to confirm it's available. To test the API, execute:

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

You should receive a response confirming everything is functioning properly.
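If you only want the generated text rather than the full JSON payload, you can pipe the output through jq (assuming it is installed on the instance), or skip curl entirely and query the model from the terminal:

# print just the "response" field of the JSON reply (requires jq)
curl -s http://localhost:11434/api/generate -d '{"model": "llama2", "prompt": "Why is the sky blue?", "stream": false}' | jq -r '.response'

# or test interactively without the API
ollama run llama2 "Why is the sky blue?"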

To verify that your instance has GPU support, run:

lspci

Additionally, check for the CUDA driver:

nvidia-smi -q | head
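Two small variations are handy here: filtering lspci down to the GPU entry, and watching utilization live while a request is being processed (assuming the watch utility is installed):

lspci | grep -i nvidia

watch -n 1 nvidia-smi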

To view the logs for Ollama, use:

journalctl -u ollama
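Adding the follow flag keeps the log streaming while you send test requests:

journalctl -u ollama -f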

Chapter 2: Configuring Remote Access

At this point, your server is up and running, but it is only accessible locally. To enable remote access, we need to modify the Ollama server settings to listen on an external IP.

First, check which IPs it is currently listening to:

netstat -a | grep 11434

Next, configure the server to listen on all network interfaces. Open an override for the service unit with:

systemctl edit ollama.service

and add the following lines:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"

After saving the changes, reload systemd and restart the service:

systemctl daemon-reload

systemctl restart ollama
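You can also confirm the override took effect by printing the merged unit definition, which should include the Environment line from the drop-in:

systemctl cat ollama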

Run the netstat command again to verify it’s listening on the 0.0.0.0 IP address:

netstat -a | grep 11434

Now, from an external computer, you can access the API using the instance's external IP:

curl http://172.10.10.148:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
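The other API endpoints work the same way remotely; for example, listing the models installed on the server:

curl http://172.10.10.148:11434/api/tags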

Summary

Congratulations! You now have your Ollama LLM server operational. You can explore additional models as they become available on the Ollama site.

Once Ollama is running, you’ll be ready to set up your Retrieval Augmented Generation (RAG) using LangChain, which I’ll discuss in a future post.

Thank you for reading, and feel free to connect with me!

The first video provides an expert guide on installing Ollama LLM with GPU on AWS in just 10 minutes, offering step-by-step instructions and insights.

The second video showcases how to deploy any open-source LLM with Ollama on an AWS EC2 + GPU in 10 minutes, covering models like Llama-3.1 and Gemma-2.
