Connecting to an external Vector Store#

In this tutorial, you will learn how to connect to an external Vector Store using an Inline Python Agent Tool. The example uses MongoDB Atlas Vector Search, but the approach applies to other vector stores as well: you only need to replace the MongoDB-specific code with the equivalent code for the vector store you want to use.

Prerequisites#

Because we’ll be creating an Inline Python Tool, it helps to first read the tutorial on how these tools work. That tutorial is not mandatory, but it covers the basics of Inline Python Tools.

  • Dataiku >= 14.2

  • MongoDB Atlas Cluster

    • Use the pre-seeded sample database if the database is empty

    • Have a username/password with read access to the database

    • A Vector Search Index

  • Python environment with pymongo installed as an additional package
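As an illustration of the Vector Search Index prerequisite, the sketch below builds an index definition for a single vector field. The field name `embedding` matches the code used later in this tutorial; the dimension count of 1536 and the `cosine` similarity are assumptions, so use the output size of your own embedding model. The helper name `vector_index_definition` is hypothetical.

```python
import json

SUPPORTED_SIMILARITIES = {"cosine", "dotProduct", "euclidean"}


def vector_index_definition(path, num_dimensions, similarity="cosine"):
    """Build an Atlas Vector Search index definition for one vector field."""
    if similarity not in SUPPORTED_SIMILARITIES:
        raise ValueError(f"Unsupported similarity function: {similarity}")
    if num_dimensions < 1:
        raise ValueError("num_dimensions must be a positive integer")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,                     # Field that stores the embedding
                "numDimensions": num_dimensions,  # Must equal the embedding size
                "similarity": similarity,
            }
        ]
    }


# Assumption: 1536 dimensions. Replace with your model's embedding size.
definition = vector_index_definition("embedding", 1536)
print(json.dumps(definition, indent=2))
```

The resulting JSON can be used when creating the index in Atlas, whichever way you prefer to create it.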

Introduction#

In this tutorial, we’ll look at how to use MongoDB Atlas Vector Search in an Agent Tool. To keep the demonstration simple, the data is stored in MongoDB Atlas.

You’ll see references to movies and a movie index in this tutorial. This is because we’re using the pre-seeded database by MongoDB: a collection of movies and their respective information. You, of course, don’t need to use the pre-seeded database, so feel free to adapt to your needs.

Generate Embeddings#

Although a Vector Search Index is already listed as a prerequisite, the embeddings it relies on deserve a separate mention. You need to generate embeddings for the data you intend to search, and you must create them with the same embedding model that the agent tool uses to embed queries. Otherwise, the search will fail or return irrelevant results.

Hint

If you haven’t created the embeddings yet, now is the time to do so. If you are unsure how, the MongoDB Documentation site has a tutorial that walks through the process.

The embeddings must be stored in the collection that the MongoDB Vector Search index covers; this step is mandatory.
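A mismatch between embedding models often shows up as a difference in vector length. The sketch below is one way to catch that early, before the query reaches MongoDB. The helper name `check_query_vector` and the expected size of 4 are purely illustrative; the real value is the number of dimensions configured in your index.

```python
def check_query_vector(query_vector, expected_dimensions):
    """Return the vector unchanged if it is usable with the index, else raise."""
    if len(query_vector) != expected_dimensions:
        raise ValueError(
            f"Query vector has {len(query_vector)} dimensions, "
            f"but the index expects {expected_dimensions}. "
            "Check that the same embedding model is used for documents and queries."
        )
    if not all(isinstance(value, (int, float)) for value in query_vector):
        raise TypeError("Query vector must contain only numbers")
    return query_vector


# Illustrative values only: a 4-dimensional vector checked against a 4-dimensional index.
vector = check_query_vector([0.12, -0.03, 0.88, 0.41], expected_dimensions=4)
print(len(vector))
```

Note that a matching length does not prove that the models are identical; it only rules out the most obvious mismatch.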

Setting up the Inline Python Tool#

To set up the Inline Python Tool, go to the GenAI menu and click Agent Tools. Click the New Agent Tool button in the top-right corner and select Inline Python. Give the tool a suitable name, such as MongoDB-Vector-Search, and click Create.

Creating Agent Tool

The generated Inline Python Tool will have a prefilled code structure that we’ll use to create our tool.

Selecting your Code Environment#

Next, the tool needs a Code Environment with the pymongo package installed.

Once you have created this Code Environment, select it in the Settings tab of your Inline Python Tool, and then click Save.
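If you want to confirm that the selected Code Environment really provides pymongo, a quick check like the following can be run in that environment. This is a minimal sketch using only the Python standard library; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Prints True only when pymongo is available in the active environment.
print("pymongo available:", is_installed("pymongo"))
```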

Testing the Inline Python Tool#

Before testing, make sure the tool contains the complete code. For reference, the full code is listed below.

mongodb-vector-search.py
import dataiku
from dataiku.llm.agent_tools import BaseAgentTool
from pymongo import MongoClient

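# Connection settings. Replace the bracketed placeholders with your own values.
MONGO_URI = "mongodb+srv://[username]:[password]@[cluster]"
MONGO_DB = "movies"
MONGO_COLLECTION = "movies_embeddings"
MONGO_INDEX = "movie_index"  # Name of the Vector Search index
TEXT_FIELD = "title"  # Field returned to the agent
VECTOR_FIELD = "embedding"  # Field that stores the embeddings
NUM_CANDIDATES = 100  # Nearest neighbors considered before the limit is applied
EMBEDDING_MODEL_ID = "internal-embedding-id"  # Same model as used for the stored embeddings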

class MongoVectorSearchTool(BaseAgentTool):
    def set_config(self, config, plugin_config):
        # Not needed for this script
        pass

    def get_descriptor(self, tool):
        return {
            "description": "Semantic search over a MongoDB Atlas Vector Search collection",
            "inputSchema": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "User question or search text",
                    },
                    "k": {
                        "type": "integer",
                        "description": "Number of results to return",
                        "default": 5,
                    },
                },
                "required": ["query"],
            },
        }

    def invoke(self, input, trace):
        args = input["input"]
        query_text = args["query"]
        k = args.get("k", 5)

        # Embed the query with the same model that was used for the stored documents
        client = dataiku.api_client()
        project = client.get_default_project()
        llm = project.get_llm(EMBEDDING_MODEL_ID)

        emb_query = llm.new_embeddings()
        emb_query.add_text(query_text)
        emb_resp = emb_query.execute()
        query_vector = emb_resp.get_embeddings()[0]

        pipeline = [
            {
                "$vectorSearch": {
                    "index": MONGO_INDEX,
                    "path": VECTOR_FIELD,
                    "queryVector": query_vector,
                    # numCandidates must not be smaller than limit
                    "numCandidates": max(NUM_CANDIDATES, k),
                    "limit": k,
                }
            },
            {
                # Keep only the text field and the similarity score
                "$project": {
                    TEXT_FIELD: 1,
                    "score": {"$meta": "vectorSearchScore"},
                }
            },
        ]

        # Close the connection once the query has run
        with MongoClient(MONGO_URI) as mongo:
            coll = mongo[MONGO_DB][MONGO_COLLECTION]
            raw = list(coll.aggregate(pipeline))

        results = [
            {
                "text": doc.get(TEXT_FIELD, ""),
                "score": doc["score"],
            }
            for doc in raw
        ]

        return {"output": results}
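The listing above keeps the username and password directly in `MONGO_URI`. As a possible alternative, the sketch below assembles the URI from separate values and percent-encodes the credentials, which is needed when they contain characters such as `@` or `/`. Reading the values from environment variables is only an assumption for illustration, as are the variable names, the fallback values, and the helper name `build_mongo_uri`; use whatever secret storage your instance provides.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(username, password, cluster_host):
    """Assemble a mongodb+srv URI with percent-encoded credentials."""
    return (
        f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}"
        f"@{cluster_host}"
    )


# Hypothetical variable names and fallback values, for illustration only.
uri = build_mongo_uri(
    os.environ.get("MONGO_USER", "reader"),
    os.environ.get("MONGO_PASSWORD", "p@ss/word"),
    os.environ.get("MONGO_CLUSTER", "cluster0.example.mongodb.net"),
)
```

The resulting string can then be passed to `MongoClient` in place of the hard-coded constant.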

It is time to test the tool. If everything is configured correctly, you should get a response almost immediately.

On the right side of the screen in your Inline Python Tool, there is a text field. Enter a query that matches your data using the format below, and then click the Run Test button.

{
   "input": {
      "query": "Give me movies about trains",
      "k": 5
   },
   "context": {}
}
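The test payload has the same shape as the `input` argument that the tool's `invoke` method receives. The sketch below shows one way the arguments could be extracted and validated before running the search. The helper name `parse_tool_input` and the upper bound of 100 results are assumptions, not part of the tool above.

```python
def parse_tool_input(payload, default_k=5, max_k=100):
    """Extract the query text and result count from a tool input payload."""
    args = payload["input"]

    query = args.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("'query' must be a non-empty string")

    k = args.get("k", default_k)
    # bool is a subclass of int, so reject it explicitly
    if isinstance(k, bool) or not isinstance(k, int) or k < 1:
        raise ValueError("'k' must be a positive integer")

    # Cap k so that a single call cannot request an excessive number of results
    return query.strip(), min(k, max_k)


payload = {
    "input": {"query": "Give me movies about trains", "k": 5},
    "context": {},
}
query_text, k = parse_tool_input(payload)
print(query_text, k)
```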

You should see the result below the Run Test button in the Tool Output tab.

[
  {
    "text": "The Great Train Robbery",
    "score": 0.4930661916732788
  },
  {
    "text": "Shanghai Express",
    "score": 0.44809791445732117
  },
  {
    "text": "Now or Never",
    "score": 0.43981125950813293
  },
  {
    "text": "The Iron Horse",
    "score": 0.4387389123439789
  },
  {
    "text": "City Lights",
    "score": 0.4223432242870331
  }
]

This area also gives you access to the Logs and to the Tool Descriptor, which shows the result of what you defined in the get_descriptor method.

Conclusion#

You’ve now connected to a remote MongoDB Vector Search index using an Inline Python Tool. You can use this approach, and others like it, to connect to external sources that are not configured in your Dataiku instance or that Dataiku does not support out of the box.

You can now use this Inline Python tool to query your external Vector Store throughout Dataiku.

Reference documentation#

Classes#

  • dataikuapi.DSSClient(host[, api_key, ...]): Entry point for the DSS API client.

  • dataikuapi.dss.llm.DSSLLMEmbeddingsQuery(...): A handle to interact with an embedding query.

  • dataikuapi.dss.llm.DSSLLMEmbeddingsResponse(...): A handle to interact with an embedding query result.

Functions#

  • get_llm(llm_id): Get a handle to interact with a specific LLM.

  • new_embeddings([text_overflow_mode]): Create a new embedding query.

  • get_embeddings(): Retrieve vectors resulting from the embeddings query.

  • add_text(text): Add text to the embedding query.

  • execute(): Run the embedding query.