Implementing RAG with Azure Foundry, .NET backend and PostgreSQL vector database

In our last blog post we created a super simple RAG example with Ollama, a JSON file, and JavaScript. In this post we take a more 'realistic' approach.

In this example we will build a small demo app using Azure Foundry, a .NET backend, a PostgreSQL database for embeddings, and Docker Desktop to run everything.

You can download the sample code here: https://github.com/agentic-ai-info/AzureRAG

The Core Purpose of RAG

Think of RAG as a two-step process:

  1. Retrieve the most relevant pieces of your data for a user question.
  2. Generate an answer using those retrieved snippets as context.

This means you can ask questions like:

  • “How do I configure feature X?”
  • “What are the supported integration limits?”
  • “Which steps are required after installation?”

…and the model answers based on your embedded documents, not generic internet-style guesses.

For customer-facing scenarios, this is huge: you get more precise answers, better consistency, and easier control over what information the model uses.

Demo App Overview

Before we can start with the demo application you will need a few prerequisites:

Get the sources here: https://github.com/agentic-ai-info/AzureRAG

You will also need Docker Desktop to run the containers: https://www.docker.com/products/docker-desktop/

To run the data import script you will need Python.

And you will need an Azure Foundry resource where you can deploy two models. Before you can run the sample code you will have to create two endpoints in Azure Foundry. You can follow this blog post to get started: How to Create an Azure AI Foundry Resource

The demo uses two Azure Foundry endpoints and two models:

  • Chat model: e.g. gpt-4.1-mini
  • Embedding model: e.g. text-embedding-ada-002

You need to add your model endpoints and the API key to a .env file; you can copy .env.template as a starting point.
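As an illustration only, a filled-in .env might look like the following. The variable names below are hypothetical; check .env.template in the repo for the real ones:

```
# Hypothetical variable names -- use the names from .env.template, not these.
AZURE_CHAT_ENDPOINT=https://<your-resource>.openai.azure.com
AZURE_EMBEDDING_ENDPOINT=https://<your-resource>.openai.azure.com
AZURE_API_KEY=<your-key>
```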

Our demo app consists of the following parts:

  • ASP.NET API backend
    Exposes endpoints to store embeddings and query the system.
  • PostgreSQL with pgvector
    Stores document chunks and vectors, and performs nearest-neighbor search.
  • Azure Foundry client integration
    Calls Azure-hosted models for embeddings and answer generation.
  • Chunking script
    A Python script reads a text file, chunks it, and sends chunks to the API for embedding/storage.

High-level flow

  1. A document (for example user documentation) is split into chunks.
  2. Each chunk is converted into a vector embedding.
  3. The text + metadata + vector are stored in PostgreSQL.
  4. A user question is embedded the same way.
  5. Vector search finds the nearest chunks.
  6. Those chunks are passed as context to the chat model.
  7. The chat model returns a final answer.

That is the RAG loop in action.
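The seven steps above can be sketched in a few lines of Python. Here, embed, vector_search, and chat are stand-ins for the real Azure Foundry and pgvector calls, not functions from the repo:

```python
# Hypothetical sketch of the RAG loop; embed(), vector_search() and chat()
# stand in for the real Azure Foundry / pgvector calls.

def answer(question, embed, vector_search, chat, top_k=3):
    """Retrieve the nearest chunks for a question, then generate a grounded answer."""
    query_vector = embed(question)                # step 4: embed the question
    chunks = vector_search(query_vector, top_k)   # step 5: nearest-neighbor search
    context = "\n\n".join(chunks)                 # step 6: pass chunks as context
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return chat(prompt)                           # step 7: final answer
```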

How to Run the Demo in 3 Steps

If you want to try this RAG pipeline yourself, here is the fastest path:

1) Start the stack

docker compose up --build -d

This starts PostgreSQL (with pgvector) and the ASP.NET API.
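As a rough illustration of what the compose stack contains (the repo's actual docker-compose.yml may differ in service names, images, and ports), such a setup could be described like this:

```yaml
# Hypothetical sketch -- the real docker-compose.yml in the repo may differ.
services:
  db:
    image: pgvector/pgvector:pg16   # PostgreSQL with the pgvector extension
    environment:
      POSTGRES_PASSWORD: example
    ports:
      - "5432:5432"
  api:
    build: .                        # the ASP.NET API project
    env_file: .env                  # Azure Foundry endpoints and API key
    ports:
      - "5001:8080"
    depends_on:
      - db
```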

2) Import the demo knowledge base

python3 scripts/embed_file.py scripts/demo-data.txt --source demo-data

This reads the demo text, chunks it, creates embeddings via Azure Foundry, and stores vectors in Postgres.
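A minimal chunking sketch might look like the following; the chunk size and overlap here are illustrative values, not necessarily what embed_file.py actually uses:

```python
# Illustrative chunking sketch; chunk_size and overlap are not the
# script's real values, just reasonable defaults for a demo.

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows so that context
    is not lost at chunk boundaries."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks
```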

3) Ask a question

curl -X POST http://localhost:5001/query \
  -H 'Content-Type: application/json' \
  -d '{"question":"What is the best month to travel?"}'

You should get an answer grounded in the imported document (e.g. spring: April–June, and early autumn: September–October).

API Calls to Azure Foundry in This Demo

Both Azure Foundry calls can use the same API key, as long as both model deployments live in the same Azure resource.
Authentication is sent via the api-key header.

1) Embeddings API call

This call is used in two places:

  • when ingesting document chunks
  • when embedding the user’s question for retrieval

Request pattern

  • POST .../embeddings?...
  • JSON body contains input (the text to embed)

The solution contains a small Python script you can use to embed any text file you want. A small, fictional travel guide (demo-data.txt) is also attached and can be used to test the solution.

Response pattern

  • data[0].embedding returns the numeric vector
  • with text-embedding-ada-002, that vector has 1536 dimensions
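Assuming the standard Azure OpenAI-style endpoint layout (deployment in the URL, api-key header, input in the body), the embeddings call can be sketched in Python. The function name, deployment name, and injectable post parameter are illustrative, not from the repo:

```python
# Illustrative sketch of the embeddings call; deployment/endpoint values
# are placeholders. `post` is injectable so the function can be tested
# without a live Azure resource.

def embed_text(text, endpoint, deployment, api_key, post=None):
    """POST text to the embeddings deployment and return data[0].embedding."""
    if post is None:
        import requests  # third-party: pip install requests
        post = requests.post
    url = f"{endpoint}/openai/deployments/{deployment}/embeddings"
    response = post(
        url,
        params={"api-version": "2023-05-15"},
        headers={"api-key": api_key, "Content-Type": "application/json"},
        json={"input": text},
    )
    response.raise_for_status()
    return response.json()["data"][0]["embedding"]
```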

What does a 1536-dimensional vector mean?

The model represents each text chunk as a point in a 1536‑dimensional mathematical space. A higher‑dimensional vector gives the model more “room” to encode nuance. But that does not automatically mean “better RAG”.

  • Smaller (e.g. 384, 512): fast, cheap, small index, good for short texts; less semantic nuance
  • Medium (768–1536): strong general-purpose semantic quality; larger index, slower search
  • Large (2048–4096+): more expressive, better for long/complex texts; much heavier compute, diminishing returns


text-embedding-ada-002 at 1536 dimensions became popular because it hit a strong balance of semantic quality, speed, cost, and compatibility with vector databases.

The resulting vector is stored in pgvector and later used for similarity search.
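To build intuition for what "similarity search" means here, this small Python sketch ranks stored chunks by cosine similarity, which is the measure behind pgvector's cosine-distance operator (`<=>`). It is an in-memory illustration, not how the demo queries the database:

```python
import math

# In-memory illustration of nearest-neighbor search; pgvector does the
# equivalent ranking in SQL via its cosine-distance operator (<=>).

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest_chunks(query_vec, stored, top_k=3):
    """stored: list of (chunk_text, vector) pairs; returns the top_k closest chunks."""
    ranked = sorted(stored,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]
```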

2) Chat Completions API call

This call is used after retrieval, to generate the final grounded answer.

Request pattern

  • POST .../chat/completions?...
  • JSON body contains messages, typically:
    • a system instruction
    • a user message that includes both retrieved context and the question

Response pattern

  • choices[0].message.content contains the generated answer

In short

  • Embeddings endpoint: finds the most relevant context
  • Chat endpoint: writes the final answer using that context
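Putting the two patterns together, the grounded chat call could be sketched like this in Python. The function name, API version, and injectable post parameter are illustrative, not the repo's actual implementation:

```python
# Illustrative sketch of the grounded chat call; endpoint/deployment/API
# version are placeholders. `post` is injectable for offline testing.

def generate_answer(question, context_chunks, endpoint, deployment, api_key, post=None):
    """Send retrieved context plus the question to the chat deployment
    and return choices[0].message.content."""
    if post is None:
        import requests  # third-party: pip install requests
        post = requests.post
    messages = [
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": "Context:\n" + "\n\n".join(context_chunks)
                                    + "\n\nQuestion: " + question},
    ]
    url = f"{endpoint}/openai/deployments/{deployment}/chat/completions"
    response = post(
        url,
        params={"api-version": "2024-02-01"},
        headers={"api-key": api_key, "Content-Type": "application/json"},
        json={"messages": messages},
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```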

Why This Matters in Real Projects

RAG is one of the most practical ways to put LLMs into production without fine-tuning:

  • You can update knowledge by updating documents (not retraining models).
  • You can keep answers aligned with your own product/domain wording.
  • You can add metadata, filtering, and source traceability.

For support teams, onboarding portals, or technical documentation assistants, this architecture is often the fastest path to value.

Final Notes: Cost and Data Responsibility

Two important reminders before trying this:

  1. Azure Foundry costs money
    Every embeddings and chat call consumes tokens/resources.
    Monitor usage and set budgets/alerts. Doing the embeddings for this example and sending a few test questions cost me exactly €0.01 in my Azure subscription. Interestingly, building the code itself used about 3% of my monthly Copilot request quota — so writing the code was actually more expensive than running it.
  2. Be careful with data and code
    Never send sensitive data blindly to APIs and avoid publishing secrets (API keys, endpoints with credentials, internal data) in repos or logs.
