Category: Blog

  • Building a Simple RAG System with Ollama: Retrieval-Augmented Generation Explained

    What is RAG?

    Retrieval-Augmented Generation (RAG) is a powerful technique in AI that combines the strengths of information retrieval and generative models. Instead of relying solely on a language model’s pre-trained knowledge, RAG first retrieves relevant information from a knowledge base and then uses that context to generate more accurate and up-to-date responses.

    Why is RAG important? Traditional language models can hallucinate or provide outdated information. RAG addresses this by grounding responses in real, external data, making AI systems more reliable for tasks like question-answering, chatbots, and knowledge assistants.

    How RAG Works

    RAG typically involves two main steps:

    1. Retrieval: Search for relevant documents or data chunks based on the user’s query.
    2. Generation: Feed the retrieved information as context to a language model, which generates a response.

    Common tools for RAG include vector databases (e.g., Pinecone, Chroma) for efficient similarity search, and embedding models that convert text into vectors. To keep this example simple, we will use a plain JSON file instead of a vector database.

    Our Simple RAG Example

    In this post, we’ll build a basic RAG system using Ollama, a local AI platform. Our example is a contact lookup chatbot: users can ask questions like “What’s Peter’s phone number?” and the system retrieves the relevant contact info before generating a response.

    We are going to extend the example from this blog post. You can download the full source code here: GitHub

    Key Components

    • Data: A JSON file with 100+ contacts (name, phone, email, address).
    • Embeddings: We use Ollama’s mxbai-embed-large model to convert contact details into vectors.
    • Similarity Search: Cosine similarity to find the best-matching contact.
    • Generation: Ollama’s llama3 model generates responses using the retrieved contact as context.
    • Intent Detection: A simple keyword-based check to decide if a query needs RAG or just general chat.

    Step-by-Step Implementation

    Prepare Data:

    • Create contacts.json with contact details or use the example data provided with the source code.
    • Run build_embeddings.js to generate embeddings for each contact using the full text (name + phone + email + address).
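    As a sketch of what build_embeddings.js does (illustrative only — the helper names here are assumptions, and the actual script in the repo may differ):

```javascript
// Sketch of the embedding step (illustrative — the script in the repo may differ).
// Combine all contact fields into the text that gets embedded.
function contactToText(contact) {
  return `${contact.name} ${contact.phone} ${contact.email} ${contact.address}`;
}

// Request an embedding vector from the local Ollama server.
// Uses Ollama's /api/embeddings endpoint and the global fetch in Node 18+.
async function embedText(text) {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "mxbai-embed-large", prompt: text })
  });
  const data = await res.json();
  return data.embedding; // an array of floats
}

// build_embeddings.js would loop over contacts.json, call embedText for
// each contact, and write the vectors next to the contact data.
```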

    Server Setup (server.js):

    • Handle POST requests to /rag.
    • Detect if the query is contact-related (using keywords like “phone”, “email”).
    • If yes: Generate query embedding, find the most similar contact via cosine similarity, build context, and call the LLM.
    • If no: Direct LLM call without context.
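    The intent check can be as simple as a keyword lookup. A sketch (the actual keyword list in the repo may differ):

```javascript
// Keyword-based intent detection — a sketch; the real keyword list may differ.
const CONTACT_KEYWORDS = ["phone", "email", "address", "contact", "number"];

function isContactQuery(query) {
  const q = query.toLowerCase();
  return CONTACT_KEYWORDS.some((kw) => q.includes(kw));
}

// isContactQuery("What's Peter's phone number?") → true  (use RAG)
// isContactQuery("Tell a joke.")                 → false (plain LLM call)
```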

    Cosine Similarity:

    • Measures how similar two vectors are (from −1 to 1, where 1 means identical direction; embeddings of related text typically score close to 1).
    • Used to rank contacts by relevance to the query.
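    A plain JavaScript implementation of cosine similarity is only a few lines:

```javascript
// Cosine similarity between two equal-length vectors:
// dot product divided by the product of the vector lengths.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// cosineSimilarity([1, 0], [1, 0]) → 1 (identical direction)
// cosineSimilarity([1, 0], [0, 1]) → 0 (orthogonal)
```

    Ranking contacts then means computing this score between the query embedding and every stored contact embedding, and picking the highest.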

    Frontend (index.html, main.js):

    • Simple chat interface.
    • Sends queries to /rag and displays responses.

    Running the Example

    1. Install Ollama and pull models: ollama pull mxbai-embed-large and ollama pull llama3.
    2. Run node build_embeddings.js to prepare data.
    3. Start server: node server.js.
    4. Open http://localhost:8000 and chat!

    Example queries:

    • “What’s Anna’s email?” → Retrieves Anna’s contact and generates a response.
    • “Tell a joke.” → General LLM response.

    Conclusion

    This example demonstrates RAG’s core principles in under 200 lines of code. It’s not production-ready (no error handling, security, JSON instead of a DB), but perfect for learning.

    RAG bridges the gap between retrieval and generation, making AI more factual and context-aware. Our contact chatbot shows how easy it is to implement with local tools like Ollama.

    Full code: GitHub

  • How to Create an Azure AI Foundry Resource

    Azure AI Foundry is Microsoft’s unified environment for building, testing, and deploying AI applications and agents. It brings together a model catalog, prompt‑engineering tools, evaluation workflows, deployment management, and governance in one place. Developers use it to prototype conversational agents, automate internal processes, integrate AI into existing applications, or run large‑scale inference workloads without managing infrastructure.

    A major advantage is the direct access to a wide range of models, including OpenAI’s GPT models, Anthropic’s Claude, and many specialized foundation models. You can test them interactively in the browser, configure deployments, and expose them as APIs for your own applications.

    You can try all of this with a free Azure trial subscription. The included credits allow you to create resources, deploy models, and experiment with Azure AI Foundry at no cost — ideal for learning, prototyping, and building your first AI‑powered tools.

    Step 1: Create a model and project

    1. Open ai.azure.com and sign in.
    2. Click on Model catalog.
    3. Search for GPT‑4.1 mini (or any other model you like).
    4. Open the model’s detail page.
    5. Click Use this model.
    6. Choose “Create new project” when prompted and enter a project name.
    7. Choose:
      • Subscription
      • Resource group
      • Resource name
      • Region
    8. Confirm creation.

    Azure will provision the resource in the background and create your model. Once ready, you will land on an overview page.

    Step 2: Use the API Key in Your Code Project

    On the right side of the overview you will find code examples showing how to integrate the model into your projects.

    On the left side you will find the URL of the API endpoint and the API key we are going to use in our example.

    You could, for example, create a simple Node.js application using the AzureOpenAI client:

    • Install Node.js if you have not already
    • Create a file package.json and copy the following snippet to install all needed dependencies:
    {
      "type": "module",
      "dependencies": {
        "openai": "latest"
      }
    }
    • Run npm install
    • Create a new .js file, e.g. foundryTest.js
    • Copy the following code and insert the URL and API key of your Foundry resource
    import { AzureOpenAI } from "openai";
    
    const endpoint = "<your API endpoint>";
    const apiKey = "<your API key>";
    
    const apiVersion = "2024-04-01-preview";
    const deployment = "gpt-4.1-mini";
    
    export async function main() {
    
        const options = { endpoint, apiKey, deployment, apiVersion }
        const client = new AzureOpenAI(options);
    
        const response = await client.chat.completions.create({
            messages: [
                { role: "system", content: "You are a helpful assistant." },
                { role: "user", content: "I am going to Paris, what should I see?" }
            ]
        });
    
        if (response?.error) {
            throw response.error;
        }
        console.log(response.choices[0].message.content);
    }
    
    main().catch((err) => {
        console.error("The sample encountered an error:", err);
    });
    • Run node foundryTest.js

    The code sends the hardcoded request “I am going to Paris, what should I see?” to your Foundry resource and authenticates with your API key. You will see the model’s response in the console.

    A Note on API Keys

    API keys are sensitive secrets that grant full access to your Azure AI resources. Anyone who obtains your key can run requests against your deployment, which may generate unexpected costs or allow unauthorized use of your models. For that reason, API keys must never be shared publicly, posted in screenshots, or committed to GitHub repositories. Always store them in environment variables, secret managers, or encrypted configuration files, and rotate them immediately if you suspect they may have leaked.
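    For the Node.js example above, that means replacing the hardcoded endpoint and apiKey with values read from the environment. A small sketch (the variable names are a common convention, not an SDK requirement):

```javascript
// Load the endpoint and API key from environment variables instead of
// hardcoding them. Set them before starting your app, e.g.:
//   export AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"
//   export AZURE_OPENAI_API_KEY="<your key>"
function loadConfig(env = process.env) {
  const endpoint = env.AZURE_OPENAI_ENDPOINT;
  const apiKey = env.AZURE_OPENAI_API_KEY;
  if (!endpoint || !apiKey) {
    // Fail fast instead of sending requests with missing credentials.
    throw new Error("Missing AZURE_OPENAI_ENDPOINT or AZURE_OPENAI_API_KEY");
  }
  return { endpoint, apiKey };
}
```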

    Summary

    Azure AI Foundry makes it easy to explore modern AI models, deploy them as APIs, and integrate them into your applications. With the free Azure trial, you can experiment with GPT‑4.1 mini and build your first AI‑powered tools without upfront cost. Just pick a model, create a project, deploy it, copy your API key, and start coding.

  • What Is Agentic AI?

    Agentic AI is emerging as a key concept in the next generation of software development. Instead of simply responding to prompts, agentic systems can take initiative, break down tasks, make decisions, and interact with tools or codebases autonomously. This shifts AI from a passive assistant to an active collaborator—one that can analyze projects, modify files, generate code, and maintain complex systems with far less manual effort.

    This blog post explains what agentic AI is, how it works, and why it matters specifically for software engineering.

    Understanding Agentic AI

    Agentic AI refers to systems that can act independently toward a defined goal. Unlike traditional language models, which only generate text based on a prompt, agentic systems combine several capabilities:

    • Goal‑oriented behavior
    • Task decomposition
    • Tool use
    • Memory and context management
    • Autonomous decision‑making

    This makes agentic AI more similar to a junior developer or automation system than a simple chatbot.

    How Agentic AI Differs from Standard LLMs

    A conventional LLM responds to a single prompt, has no persistent memory, and cannot take actions or modify files. An agentic AI system, by contrast, can:

    • receive a goal
    • analyze the project
    • decide which files to inspect
    • propose or apply changes
    • evaluate whether the goal is met
    • iterate until the task is complete

    This transforms AI from a text generator into an active problem‑solver.

    Core Components of Agentic AI

    Planning

    The agent determines what steps are required to achieve the goal.

    Tool Use

    Agents can call external tools such as file editors, compilers, linters, test runners, or APIs.

    Memory

    Agents maintain short‑term or long‑term memory to track progress and context.

    Reflection

    Agents evaluate their own output and adjust their approach.

    Why Agentic AI Matters for Software Development

    Software development naturally involves multi‑step reasoning, interacting with tools, modifying files, and maintaining consistency across a codebase. Agentic systems can support developers by:

    • automating code changes
    • understanding project‑wide structure
    • providing continuous assistance
    • reducing repetitive manual work

    This leads to faster iteration and more efficient workflows.

    Examples of Agentic AI in Modern Development Tools

    IDE‑Integrated Agents

    Tools like VS Code extensions or Xcode’s new agentic features allow agents to inspect project structure, apply code changes, fix build errors, and generate new components.

    DevOps and CI Agents

    Agents can analyze pipelines, update configurations, or validate deployments.

    Codebase Maintenance

    Agents can scan for outdated dependencies, unused code, or inconsistent patterns and propose fixes.

    Summary

    Agentic AI represents the next step in AI‑assisted software development. Instead of simple prompt‑response interactions, agentic systems can plan, act, use tools, and modify code autonomously. This enables faster development cycles, automated maintenance, and deeper integration with IDEs and local workflows.

  • Xcode Introduces Built‑In Agentic AI Tools


    Apple’s latest release of Xcode introduces integrated agentic AI development tools, marking a significant shift in how developers can build, analyze, and maintain applications across Apple platforms. With Xcode 26.3, AI agents from Anthropic and OpenAI can now operate directly inside the IDE, assisting with tasks that range from analyzing project structure to autonomously modifying files and generating code.

    Apple describes this as agentic coding, a workflow in which coding agents can break down tasks, make decisions based on the project architecture, and collaborate throughout the entire development lifecycle. 

    What’s new in Xcode 26.3

    The update expands Xcode’s existing AI capabilities by allowing agents such as Claude Agent and OpenAI’s Codex to take action inside the IDE rather than simply offering suggestions. These agents can analyze entire projects, update configurations, fix compile errors, and help developers iterate more quickly. This represents a move from traditional prompt‑based assistance toward more autonomous, goal‑oriented development support. 

    Availability

    The updated version of Xcode is now available as a free download in the Mac App Store, making these new AI‑driven development features accessible to all Apple developers. You can find it here:

    Xcode in the Mac App Store

  • What Are LLMs and How Do They Perform on Consumer Hardware?

    Running large language models (LLMs) locally has become a realistic option for developers who want privacy, predictable costs, and full control over their AI workflows. Tools like Ollama, LM Studio, and mlx‑based models on Apple Silicon make it possible to run capable models directly on a laptop or compact desktop machine.

    This article provides an overview of how local LLMs work, what the key concepts mean, and what you can expect from consumer hardware such as the Mac mini M4.

    What a Large Language Model Actually Is

    A large language model is a statistical system trained on large text datasets. It predicts the next token in a sequence, where a token is a small unit of text (roughly 3–4 characters on average). Everything an LLM does—writing code, explaining errors, generating documentation—comes from this next‑token prediction process.

    At runtime, the model does not “look up” answers. It performs a sequence of matrix multiplications using its internal parameters (weights). These weights encode patterns learned during training.

    Model Size: What “7B”, “13B”, or “70B” Means

    Model sizes are usually expressed in billions of parameters:

    • 3B–7B: fast chat, simple coding tasks, lightweight agents. Runs on almost any modern machine.
    • 13B: more coherent reasoning, better coding support. Needs more RAM and bandwidth.
    • 30B–70B: high‑quality reasoning, strong coding performance. Requires high memory bandwidth and large RAM/VRAM.

    A parameter is a single floating‑point value. More parameters generally mean better reasoning and more context understanding, but also higher memory usage and slower inference.

    How this works at runtime

    When you send a prompt to a local model:

    1. The model loads its weights into RAM (or VRAM).
    2. The prompt is converted into tokens.
    3. The model processes these tokens through its layers.
    4. It predicts the next token.
    5. The new token is appended to the input.
    6. Steps 3–5 repeat until the output is complete.
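    The steps above can be sketched in a few lines of JavaScript, with a stand-in function playing the role of the model:

```javascript
// Toy sketch of the autoregressive loop above; predictNextToken stands in
// for the model's forward pass (the matrix multiplications over its weights).
function generate(promptTokens, predictNextToken, maxTokens = 16, stopToken = "<eos>") {
  const tokens = [...promptTokens];         // the prompt is already tokenized
  for (let i = 0; i < maxTokens; i++) {
    const next = predictNextToken(tokens);  // steps 3–4: run the layers, predict one token
    if (next === stopToken) break;          // stop when the model emits an end token
    tokens.push(next);                      // step 5: append and repeat
  }
  return tokens;
}

// A trivial stand-in "model" that emits "a" until three tokens exist:
const out = generate(["hi"], (t) => (t.length < 3 ? "a" : "<eos>"));
// out is ["hi", "a", "a"]
```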

    The speed of this loop depends on:

    • memory bandwidth
    • CPU/GPU architecture
    • quantization level
    • model size
    • context length

    Apple Silicon performs well here because of its unified memory architecture and high bandwidth.

    The Mac mini M4 is a strong machine for local AI development. Even the base model offers:

    • high memory bandwidth
    • efficient matrix multiplication hardware
    • unified memory (shared between CPU and GPU)
    • excellent performance per watt


    Practical Model Sizes on a Mac mini M4:

    • Llama 3.1 8B (~4–5 GB): runs smoothly. Good for chat and basic coding.
    • Llama 3.1 12B (~7–8 GB): runs smoothly. Better reasoning, solid coding.
    • Qwen 2.5 Coder 7B (~4–5 GB): runs smoothly. Strong coding performance.
    • Qwen 2.5 Coder 14B (~8–10 GB): good. Slower but usable for coding tasks.
    • Llama 3.1 70B (~35–40 GB): not practical. Too large for local RAM limits.

    Beyond Apple Silicon systems, several other current hardware options handle local LLMs effectively. AMD’s recent desktop processors, such as the Ryzen 7000 and 9000 series, offer strong CPU‑side inference performance thanks to high core counts and substantial memory bandwidth, making them suitable for running 7B–13B models in quantized formats.

    For users who prefer GPU acceleration, mid‑range NVIDIA cards like the RTX 4060, 4070, or 4070 Ti provide stable throughput for models in the 7B–30B range, depending on available VRAM. An RTX 4060 with 8 GB VRAM is well suited for 7B models in FP16 or larger models in 4‑bit quantization, while a 4070 with 12 GB VRAM can handle 13B models at higher precision with comfortable performance. These configurations give developers a broad set of options for running local models, depending on whether they prioritize CPU inference, GPU acceleration, or a balance of both.

    Local Models vs. Cloud Models: What you should know

    Local models offer:

    • Privacy: your code never leaves your machine
    • Predictable cost: no API billing
    • Offline capability
    • Full control over model versions and behavior

    Cloud models still lead in:

    • complex reasoning
    • long‑context tasks
    • multi‑modal workflows
    • raw performance

    For many developers, a hybrid workflow makes sense: local models for everyday coding and cloud models for complex tasks.

    Summary

    Local LLMs have reached a point where they are practical for everyday development tasks. With tools like Ollama and Continue and modern hardware such as the Mac mini, you can run capable models without relying on cloud services.

    Key points:

    • LLMs predict tokens using learned parameters.
    • Model size affects quality and hardware requirements.
    • A Mac mini M4 handles 7B–12B models very well.
    • Local models are ideal for private, cost‑controlled development workflows.

  • Build a local AI coding agent with VS Code, Continue and Ollama

    Building a local coding assistant is a practical way to keep your data private and avoid recurring AI subscription costs. If your hardware is capable of running local language models—such as an Apple Silicon machine—you can integrate them directly into Visual Studio Code using the Continue extension.

    Prerequisites:

    • Visual Studio Code
    • the Continue extension from the VS Code marketplace
    • Ollama installed and running locally
    • at least one model installed


    I use a Mac mini M4 as my local AI environment. Models in the 7B–12B range run reliably on this hardware and provide good responsiveness for development tasks. This includes models such as Llama 3.1 8B, Qwen 2.5 Coder 7B, and Mistral 7B.

    Installing Continue in Visual Studio Code

    1. Open Visual Studio Code.
    2. Go to the Extensions panel.
    3. Search for “Continue”.
    4. Install the extension
    5. Reload the editor if prompted.

    After installation, a new sidebar icon labeled “Continue” appears in the Activity Bar.

    Preparing Ollama

    If you have not yet installed Ollama, you can check out my guide here.

    Before connecting Continue to Ollama, verify that Ollama is installed and running:

    ollama run llama3.1

    If the model loads and responds, the local AI server is active.

    Connecting Continue to Ollama

    Continue stores its settings in a local configuration file. The extension creates it automatically the first time you open the sidebar.

    To configure Ollama:

    1. Open the Continue sidebar.
    2. Click the settings icon in the top-right corner.
    3. Navigate to “Configs” / “Local Config”.
    4. Add a model entry pointing to the local Ollama server.

    A minimal configuration looks like this:

    name: Local Config
    version: 1.0.0
    schema: v1
    models:
      - name: Qwen2.5-Coder 7B
        provider: ollama
        model: qwen2.5-coder:7b
        roles:
          - autocomplete
          - chat
          - edit
          - apply

    Using the Chat Window

    The chat window is the main interface for interacting with your local model. It supports several useful features:

    Asking Questions About Your Code

    You can ask the model to explain a function, summarize a file, or describe how a module works. Continue automatically includes the relevant file context when you reference it. When you type your question and send it with Ctrl/Cmd + Enter, Continue will automatically add the active file as context.

    Generating or Refactoring Code

    You can request new code or improvements to existing code:

    “Refactor this function for readability.”
    “Generate a TypeScript interface for this JSON structure.”

    Switching Models

    The model dropdown at the top of the chat panel allows you to switch between installed Ollama models instantly. This is useful when comparing output quality or performance.

    Inline Editing Actions

    Continue also supports inline actions directly in the editor:

    1. Select a block of code.
    2. Press Cmd+I (macOS) or Ctrl+I (Windows/Linux).
    3. Choose an action such as “Explain”, “Refactor”, or “Add Comments”.

    The model processes only the selected code and returns the result in a new editor tab or inline, depending on the action.

    This workflow is efficient for small, focused tasks.

    Continue Quickstart

    The Continue extension includes a small quickstart Python file that demonstrates how the extension works. You can find it in the Continue settings (inside the chat window) under “Help” / “Quickstart”.

    It contains a few code examples and instructions on how Continue can work with them.

    Summary

    The Continue extension provides a clean and flexible way to use local Ollama models inside Visual Studio Code. Installation is straightforward, configuration requires only a few lines in a JSON file, and the chat interface integrates naturally into the development workflow. With a capable machine such as the Mac mini M4, local models offer fast responses and a private, cost‑free alternative to cloud‑based assistants.

  • Building a minimal Ollama Chat in pure HTML & JavaScript

    In our previous tutorial, we set up a local Ollama instance: How to Install Ollama locally and run your first model

    In this tutorial, we’re going to build a super‑simple chat app using plain HTML and JavaScript. We’ll walk through how the chat logic works, how messages are sent to Ollama, and how the UI updates in real time.

    You can download the sample app here: https://github.com/agentic-ai-info/simple-ollama-chat

    Overview

    Our minimal chat app consists of:

    • index.html — a simple UI with a message list, input field and send button
    • main.js — the actual chat logic (sending prompts, receiving responses, updating the UI)
    • server.js — a tiny Node server that serves the static files and proxies requests to Ollama

    The entire app runs locally and communicates with a local Ollama instance.

    The Chat Logic (main.js)

    Let’s break down the important parts: At the top of the file, we grab references to the DOM elements we need:

    const messagesEl = document.getElementById("messages");
    const promptEl = document.getElementById("prompt");
    const sendBtn = document.getElementById("sendBtn");
    const modelEl = document.getElementById("model");

    These give us access to:

    • the chat message container
    • the text input
    • the send button
    • the model selector

    Displaying Messages

    Whenever the user or the model sends a message, we append it to the chat window:

    function addMessage(text, role) {
        const div = document.createElement("div");
        div.className = "msg " + role;
        div.textContent = text;
        messagesEl.appendChild(div);
        messagesEl.scrollTop = messagesEl.scrollHeight;
    }

    This function:

    • creates a new <div>
    • assigns it a CSS class (user or llm)
    • inserts the message text
    • scrolls the chat window to the bottom


    Sending a Message to Ollama

    The core of the chat app is the sendMessage() function:

    async function sendMessage() {
        const prompt = promptEl.value.trim();
        // isSending is a module-level flag that blocks overlapping requests
        if (!prompt || isSending) return;
    
        const model = modelEl.value.trim() || "llama3";
    
        addMessage(prompt, "user");
        promptEl.value = "";
        promptEl.focus();
    }

    Here we:

    • read the user’s input
    • prevent double‑sending
    • display the user message immediately
    • clear the input field


    Then we send the request to our proxy endpoint:

    const response = await fetch("/proxy/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
            model: model,
            messages: [{ role: "user", content: prompt }],
            stream: false
        })
    });
    • We call /proxy/api/chat instead of talking to Ollama directly (to prevent CORS problems)
    • The request body matches Ollama’s chat API format
    • stream: false keeps things simple (no streaming yet)


    Once Ollama replies, we parse the JSON and extract the model’s message:

    const data = await response.json();
    const content = data.message?.content ?? JSON.stringify(data);
    addMessage(content, "llm");

    The Minimal Node Server (server.js)

    Our ‘server’ does two things:

    1. Serves the static files: It delivers index.html, main.js, and any CSS files to the browser.

    2. Acts as a reverse proxy to Ollama: Normally the browser is not allowed to call Ollama (http://localhost:11434) directly because of CORS restrictions.

    So we forward all requests under /proxy/* to Ollama:

    • stripping unnecessary browser headers
    • keeping the request clean
    • avoiding CORS issues entirely

    To run the app, just call

    node server.js

    Conclusion

    This tiny Ollama chat app is an example of how far you can get with just:

    • HTML
    • pure JavaScript
    • a minimal Node server

    No frameworks, no build tools, no dependencies.

    If you want to extend it, here are some natural next steps:

    • add streaming responses
    • store chat history
    • support multiple models
    • add a nicer UI

    But even in its minimal form, this setup gives you a fully functional local LLM chat interface that’s easy to understand.
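    As a starting point for the first of those extensions: with stream: true, Ollama sends newline-delimited JSON chunks instead of one final object. A hedged sketch (real code should also buffer lines that are split across chunks):

```javascript
// Ollama streams NDJSON: one JSON object per line, each carrying a bit of text.
// Extract the text content from a chunk of streamed lines.
function extractContent(ndjsonChunk) {
  let text = "";
  for (const line of ndjsonChunk.split("\n")) {
    if (!line.trim()) continue;
    const obj = JSON.parse(line);
    text += obj.message?.content ?? "";
  }
  return text;
}

// Reading the stream in main.js could then look like this (not invoked here):
async function streamChat(model, prompt, onToken) {
  const response = await fetch("/proxy/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }], stream: true })
  });
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    onToken(extractContent(decoder.decode(value)));
  }
}
```

    The onToken callback would append each fragment to the current chat bubble, so the reply appears word by word instead of all at once.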

  • How to set up GitHub Copilot Chat in VS Code (Step‑by‑Step Guide)

    GitHub Copilot Chat is one of the easiest ways to get AI assistance directly inside Visual Studio Code. Whether you want help writing code, generating project templates, or understanding errors, Copilot Chat integrates seamlessly into your workflow – and it’s going to be your new best friend on the next projects.

    In this quick guide, you’ll learn how to:

    • install Visual Studio Code
    • create a GitHub account
    • install the GitHub Copilot Chat extension
    • send your first AI prompt inside VS Code
    • understand GitHub Copilot pricing and the free tier

    1. Install Visual Studio Code

    Visual Studio Code (VS Code) is a lightweight, fast, and free code editor from Microsoft.

    Go to the official download page: https://code.visualstudio.com

    Choose your operating system:

    • Windows
    • macOS
    • Linux

    Install VS Code using the standard installer.

    Once installed, launch the editor.

    2. Create a GitHub Account

    GitHub Copilot requires a GitHub account. You can start right away with a free GitHub Copilot account: https://github.com/signup

    Follow the steps to:

    • enter your email
    • choose a username
    • set a password
    • verify your account

    Once done, you’re ready to connect GitHub to VS Code.

    3. Install the GitHub Copilot Chat Extension

    GitHub Copilot Chat is available as an official extension inside VS Code.

    You can install it directly from your GitHub settings page: https://github.com/settings/copilot

    There you can activate Copilot and follow the instructions to connect it to VS Code.


    Alternatively, you can install via VS Code Marketplace

    1. Open VS Code
    2. Click the Extensions icon on the left sidebar
    3. Search for: GitHub Copilot Chat
    4. Click Install

    After installation VS Code will ask you to sign in with GitHub.

    Confirm the login and authorize VS Code to access your GitHub account.

    4. Send Your First Message in Copilot Chat

    Once the extension is installed, you’ll see a new Copilot Chat icon in the sidebar.

    Open the chat

    • Click Copilot Chat
    • A chat window appears on the right side of VS Code

    Send your first prompt

    Try something like:

    Create a minimal HTML/JS WebPage project.

    Copilot will generate:

    • an index.html file
    • a basic JavaScript file
    • optional CSS
    • instructions on how to run the project

    You can accept or modify the suggestions and let Copilot insert the files directly into your workspace.

    5. GitHub Copilot Pricing (Including Free Tier)

    GitHub Copilot offers several plans depending on your needs. You can start with a free account, which currently offers:

    • 2,000 code completions per month
    • 50 Copilot Chat messages per month
    • Access to GPT‑4o and Claude 3.5 Sonnet models

    GitHub also provides a free Copilot plan for:

    • verified students
    • teachers
    • maintainers of popular open‑source projects

    This includes access to Copilot Chat.

    Paid Plans

    • Copilot Individual: Monthly subscription with full access to Copilot Chat, code completions, and inline suggestions, starting at $10/month
    • Copilot Business / Enterprise: For teams, with additional security and policy controls.

  • How to Install Ollama locally and run your first model

    Running large language models (LLMs) locally has never been easier. Ollama provides a lightweight, fast, and privacy‑friendly way to run models like Llama 3, Mistral, Phi‑3, Gemma, and many others directly on your machine — without sending data to the cloud.

    In this guide, you’ll learn:

    • how to install Ollama
    • how to verify the installation
    • how to download and run your first model
    • how to send your first chat message
    • optional: how to use the local Ollama API

    Let’s get started.

    1. What Is Ollama?

    Ollama is a local runtime for LLMs that focuses on simplicity and performance. It provides:

    • one‑command model downloads
    • automatic GPU acceleration (if available)
    • a built‑in chat interface
    • a local REST API
    • support for many open‑source models

    It’s ideal for developers, researchers, and anyone who wants to experiment with AI locally.

    2. Installing Ollama

    Ollama supports macOS, Windows, and Linux. Installation takes only a minute.

    Windows Installation

    1. Download the Windows installer from the official website: https://ollama.com/download
    2. Run the .exe file
    3. Follow the setup wizard
    4. After installation, Ollama is available in PowerShell or Command Prompt

    macOS Installation

    1. Download the macOS installer from the official website: https://ollama.com/download
    2. Open the .dmg file
    3. Drag Ollama into your Applications folder
    4. Launch Ollama once to initialize the background service

    Linux Installation

    Run the official install script:

    curl -fsSL https://ollama.com/install.sh | sh

    This installs:

    • the Ollama daemon
    • the command‑line interface
    • system services

    3. Verify That Ollama Is Installed

    Open your terminal (macOS/Linux) or PowerShell (Windows) and run:

    ollama --version


    If you see a version number, everything is installed correctly.

    4. Download and Run Your First Model

    Ollama downloads models automatically when you run them for the first time.

    For example, to run Llama 3:

    ollama run llama3

    What happens now:

    • Ollama downloads the model
    • The model starts running locally
    • A chat prompt appears

    5. Send Your First Chat Message

    Once the model is running, you’ll see a prompt and can type anything, for example:

    >>> Hello! How are you?

    The model will respond immediately.

    To exit the chat:

    • type /bye
    • or press Ctrl + C

    6. List and Remove Installed Models

    To see which models are currently installed:

    ollama list


    If you want to free up disk space:

    ollama rm llama3

    7. Using the Ollama API (Optional)

    Ollama exposes a local API at:

    http://localhost:11434


    You can send requests using curl:

    curl http://localhost:11434/api/generate -d '{
      "model": "llama3",
      "prompt": "Tell me something about Agentic AI."
    }'

    This is perfect for integrating Ollama into:

    • Python scripts
    • Web apps
    • Backend services
    • Automation workflows

    Conclusion

    Ollama makes it incredibly easy to run powerful AI models locally. With just a few commands, you can:

    • install the runtime
    • download models
    • chat with them
    • integrate them into your own applications

    If you’re exploring AI, building prototypes, or experimenting with local LLMs, Ollama is one of the best tools to start with.