”工欲善其事,必先利其器。“—孔子《论语.录灵公》
首页 > 编程 > 从头开始构建一个小型矢量存储

从头开始构建一个小型矢量存储

发布于2024-11-06
浏览:792

With the evolving landscape of generative AI, vector databases are playing crucial role in powering generative AI applications. There are so many vector databases currently available that are open source such as Chroma, Milvus along with other popular proprietary vector databases such as Pinecone, SingleStore. You can read the detailed comparison of different vector databases on this site.

But, have you ever wondered how these vector databases work behind the scenes?

A great way to learn something is to understand how things work under the hood. In this article, we will be building a tiny in-memory vector store "Pixie" from scratch using Python with only NumPy as a dependency.

Building a tiny vector store from scratch


Before diving into the code, let's briefly discuss what a vector store is.

What is a vector store?

A vector store is a database designed to store and retrieve vector embeddings efficiently. These embeddings are numerical representations of data (often text but could be images, audio etc.) that capture semantic meaning in a high-dimensional space. The key feature of a vector store is their ability to perform efficient similarity searches, finding the most relevant data points based on their vector representations. Vector stores can be used in many tasks such as:

  1. Semantic search
  2. Retrieval augmented generation (RAG)
  3. Recommendation system

Let's code

In this article, we are going to create a tiny in-memory vector store called "Pixie". While it won't have all the optimizations of a production-grade system, it will demonstrate the core concepts. Pixie will have two main functionalities:

  1. Storing document embeddings
  2. Performing similarity searches

Setting up the vector store

First, we'll create a class called Pixie:

import numpy as np
from sentence_transformers import SentenceTransformer
from helpers import cosine_similarity


class Pixie:
    def __init__(self, embedder) -> None:
        self.store: np.ndarray = None
        self.embedder: SentenceTransformer = embedder
  1. First, we import numpy for efficient numerical operations and storing multi-dimensional arrays.
  2. We will also import SentenceTransformer from sentence_transformers library. We are using SentenceTransformer for embeddings generation, but you could use any embedding model that converts text to vectors. In this article, our primary focus will be on vector store itself, and not on embeddings generation.
  3. Next, we'll initialize Pixie class with an embedder. The embedder can be moved outside of the main vector store but for simplicity purposes, we'll initialize it inside the vector store class.
  4. self.store will hold our document embeddings as a NumPy array.
  5. self.embedder will hold the embedding model that we'll use to convert documents and queries into vectors.

Ingesting documents

To ingest documents/data in our vector store, we'll implement the from_docs method:

def from_docs(self, docs):
        self.docs = np.array(docs)
        self.store = self.embedder.encode(self.docs)
        return f"Ingested {len(docs)} documents"

This method does few key things:

  1. It takes a list of documents and stores them as a NumPy array in self.docs.
  2. It uses the embedder model to convert each document into a vector embedding. These embeddings are stored in self.store.
  3. It returns a message confirming how many documents were ingested. The encode method of our embedder is doing the heavy lifting here, converting each text document into a high-dimensional vector representation.

Performing similarity search

The heart of our vector store is the similarity search function:

def similarity_search(self, query, top_k=3):
        matches = list()
        q_embedding = self.embedder.encode(query)
        top_k_indices = cosine_similarity(self.store, q_embedding, top_k)
        for i in top_k_indices:
            matches.append(self.docs[i])
        return matches

Let's break this down:

  1. We start by creating an empty list called matches to store our matches.
  2. We encode the user query using the same embedder model we used for ingesting the documents. This ensures that the query vector is in the same space as our document vectors.
  3. We call a cosine_similarity function (which we'll define next) to find the most similar documents.
  4. We use the returned indices to fetch the actual documents from self.docs.
  5. Finally, we return the list of matching documents.

Implementing cosine similarity

import numpy as np


def cosine_similarity(store_embeddings, query_embedding, top_k):
    dot_product = np.dot(store_embeddings, query_embedding)
    magnitude_a = np.linalg.norm(store_embeddings, axis=1)
    magnitude_b = np.linalg.norm(query_embedding)

    similarity = dot_product / (magnitude_a * magnitude_b)

    sim = np.argsort(similarity)
    top_k_indices = sim[::-1][:top_k]

    return top_k_indices

This function is doing several important things:

  1. It calculates the cosine similarity using the formula: cos(θ) = (A · B) / (||A|| * ||B||)
  2. First, we calculate the dot product between the query embeddings and all document embeddings in the store.
  3. Then, we compute the magnitudes (Euclidean norms) of all vectors.
  4. Lastly, we sort the found similarities and return the indices of the top-k most similar documents. We are using cosine similarity because it measures the angle between vectors, ignoring their magnitudes. This means it can find semantically similar documents regardless of their length.
  5. There are other similarity metrics that you can explore such as:
    1. Euclidean distance
    2. Dot product similarity

You can read more about cosine similarity here.

Piecing everything together

Now that we have built all the pieces, let's understand how they work together:

  1. When we create a Pixie instance, we provide it with an embedding model.
  2. When we ingest documents, we create vector embeddings for each document and store them in self.store.
  3. For a similarity search:
    1. We create an embedding for the query.
    2. We calculate cosine similarity between the query embeddings and all document embeddings.
    3. We return the most similar documents. All the magic happens inside the cosine similarity calculation. By comparing the angle between vectors rather than their magnitude, we can find semantically similar documents even if they use different words or phrasing.

Seeing it in action

Now let's implement a simple RAG system using our Pixie vector store. We'll ingest a story document of a "space battle & alien invasion" and then ask questions about it to see how it generates an answer.

import os
import sys
import warnings

warnings.filterwarnings("ignore")

import ollama
import numpy as np
from sentence_transformers import SentenceTransformer

current_dir = os.path.dirname(os.path.abspath(__file__))
root_dir = os.path.abspath(os.path.join(current_dir, ".."))
sys.path.append(root_dir)

from pixie import Pixie


# creating an instance of a pre-trained embedder model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# creating an instance of Pixie vector store
pixie = Pixie(embedder)


# generate an answer using llama3 and context docs
def generate_answer(prompt):
    response = ollama.chat(
        model="llama3",
        options={"temperature": 0.7},
        messages=[
            {
                "role": "user",
                "content": prompt,
            },
        ],
    )
    return response["message"]["content"]


with open("example/spacebattle.txt") as f:
    content = f.read()
    # ingesting the data into vector store
    ingested = pixie.from_docs(docs=content.split("\n\n"))
    print(ingested)

# system prompt
PROMPT = """
    User has asked you following question and you need to answer it based on the below provided context. 
If you don't find any answer in the given context then just say 'I don't have answer for that'. 
In the final answer, do not add "according to the context or as per the context". 
You can be creative while using the context to generate the final answer. DO NOT just share the context as it is.

    CONTEXT: {0}
    QUESTION: {1}

    ANSWER HERE:
"""

while True:
    query = input("\nAsk anything: ")
    if len(query) == 0:
        print("Ask a question to continue...")
        quit()

    if query == "/bye":
        quit()

    # search similar matches for query in the embedding store
    similarities = pixie.similarity_search(query, top_k=5)
    print(f"query: {query}, top {len(similarities)} matched results:\n")

    print("-" * 5, "Matched Documents Start", "-" * 5)
    for match in similarities:
        print(f"{match}\n")
    print("-" * 5, "Matched Documents End", "-" * 5)

    context = ",".join(similarities)
    answer = generate_answer(prompt=PROMPT.format(context, query))
    print("\n\nQuestion: {0}\nAnswer: {1}".format(query, answer))

    continue

Here is the output:

Ingested 8 documents

Ask anything: What was the invasion about?
query: What was the invasion about?, top 5 matched results:

----- Matched Documents Start -----
Epilogue: A New Dawn
Years passed, and the alliance between humans and Zorani flourished. Together, they rebuilt what had been lost, creating a new era of exploration and cooperation. The memory of the Krell invasion served as a stark reminder of the dangers that lurked in the cosmos, but also of the strength that came from unity. Admiral Selene Cortez retired, her name etched in the annals of history. Her legacy lived on in the new generation of leaders who continued to protect and explore the stars. And so, under the twin banners of Earth and Zorani, the galaxy knew peace—a fragile peace, hard-won and deeply cherished.

Chapter 3: The Invasion
Kael's warning proved true. The Krell arrived in a wave of bio-mechanical ships, each one bristling with organic weaponry and shields that regenerated like living tissue. Their tactics were brutal and efficient. The Titan Fleet, caught off guard, scrambled to mount a defense. Admiral Cortez's voice echoed through the corridors of the Prometheus. "All hands to battle stations! Prepare to engage!" The first clash was catastrophic. The Krell ships, with their organic hulls and adaptive technology, sliced through human defenses like a knife through butter. The outer rim colonies fell one by one, each defeat sending a shockwave of despair through the fleet. Onboard the Prometheus, Kael offered to assist, sharing Zorani technology and knowledge. Reluctantly, Cortez agreed, integrating Kael’s insights into their strategy. New energy weapons were developed, capable of piercing Krell defenses, and adaptive shields were installed to withstand their relentless attacks.

Chapter 5: The Final Battle
Victory on Helios IV was a much-needed morale boost, but the war was far from over. The Krell regrouped, launching a counter-offensive aimed directly at Earth. Every available ship was called back to defend humanity’s homeworld. As the Krell armada approached, Earth’s skies filled with the largest fleet ever assembled. The Prometheus led the charge, flanked by newly built warships and the remaining Zorani vessels that had joined the fight. "This is it," Cortez addressed her crew. "The fate of our species depends on this battle. We hold the line here, or we perish." The space above Earth turned into a maelstrom of fire and metal. Ships collided, energy beams sliced through the void, and explosions lit up the darkness. The Krell, relentless and numerous, seemed unbeatable. In the midst of chaos, Kael revealed a hidden aspect of Zorani technology—a weapon capable of creating a singularity, a black hole that could consume the Krell fleet. It was a desperate measure, one that could destroy both fleets. Admiral Cortez faced an impossible choice. To use the weapon would mean sacrificing the Titan Fleet and potentially Earth itself. But to do nothing would mean certain destruction at the hands of the Krell. "Activate the weapon," she ordered, her voice heavy with resolve. The Prometheus moved into position, its hull battered and scorched. As the singularity weapon charged, the Krell ships converged, sensing the threat. In a blinding burst of light, the weapon fired, tearing the fabric of space and creating a black hole that began to devour everything in its path.

Chapter 1: The Warning
It began with a whisper—a distant signal intercepted by the outermost listening posts of the Titan Fleet. The signal was alien, unlike anything the human race had ever encountered. For centuries, humanity had expanded its reach into the cosmos, colonizing distant planets and establishing trade routes across the galaxy. The Titan Fleet, the pride of Earth's military might, stood as the guardian of these far-flung colonies.Admiral Selene Cortez, a seasoned commander with a reputation for her sharp tactical mind, was the first to analyze the signal. As she sat in her command center aboard the flagship Prometheus, the eerie transmission played on a loop. It was a distress call, but its origin was unknown. The message, when decoded, revealed coordinates on the edge of the Andromeda Sector. "Set a course," Cortez ordered. The fleet moved with precision, a testament to years of training and discipline.

Chapter 4: Turning the Tide
The next battle, over the resource-rich planet of Helios IV, was a turning point. Utilizing the new technology, the Titan Fleet managed to hold their ground. The energy weapons seared through Krell ships, and the adaptive shields absorbed their retaliatory strikes. "Focus fire on the lead ship," Cortez commanded. "We break their formation, we break their spirit." The flagship of the Krell fleet, a massive dreadnought known as Voreth, was targeted. As the Prometheus and its escorts unleashed a barrage, the Krell ship's organic armor struggled to regenerate. In a final, desperate maneuver, Cortez ordered a concentrated strike on Voreth's core. With a blinding flash, the dreadnought exploded, sending a ripple of confusion through the Krell ranks. The humans pressed their advantage, driving the Krell back.
----- Matched Documents End -----


Question: What was the invasion about?
Answer: The Krell invasion was about the Krell arriving in bio-mechanical ships with organic weaponry and shields that regenerated like living tissue, seeking to conquer and destroy humanity.

Building a tiny vector store from scratch

We have successfully built a tiny in-memory vector store from scratch by using Python and NumPy. While it is very basic, it demonstrates the core concepts such as vector storage, and similarity search. Production grade vector stores are much more optimized and feature-rich.

Github repo: Pixie

Happy coding, and may your vectors always point in the right direction!

版本声明 本文转载于:https://dev.to/vivekalhat/building-a-tiny-vector-store-from-scratch-59ep?1如有侵犯,请联系[email protected]删除
最新教程 更多>
  • Python中何时用"try"而非"if"检测变量值?
    Python中何时用"try"而非"if"检测变量值?
    使用“ try“ vs.” if”来测试python 在python中的变量值,在某些情况下,您可能需要在处理之前检查变量是否具有值。在使用“如果”或“ try”构建体之间决定。“ if” constructs result = function() 如果结果: 对于结果: ...
    编程 发布于2025-07-21
  • 如何在JavaScript对象中动态设置键?
    如何在JavaScript对象中动态设置键?
    在尝试为JavaScript对象创建动态键时,如何使用此Syntax jsObj['key' i] = 'example' 1;不工作。正确的方法采用方括号: jsobj ['key''i] ='example'1; 在JavaScript中,数组是一...
    编程 发布于2025-07-21
  • 在JavaScript中如何并发运行异步操作并正确处理错误?
    在JavaScript中如何并发运行异步操作并正确处理错误?
    同意操作execution 在执行asynchronous操作时,相关的代码段落会遇到一个问题,当执行asynchronous操作:此实现在启动下一个操作之前依次等待每个操作的完成。要启用并发执行,需要进行修改的方法。 第一个解决方案试图通过获得每个操作的承诺来解决此问题,然后单独等待它们: co...
    编程 发布于2025-07-21
  • JavaScript计算两个日期之间天数的方法
    JavaScript计算两个日期之间天数的方法
    How to Calculate the Difference Between Dates in JavascriptAs you attempt to determine the difference between two dates in Javascript, consider this s...
    编程 发布于2025-07-21
  • 在GO中构造SQL查询时,如何安全地加入文本和值?
    在GO中构造SQL查询时,如何安全地加入文本和值?
    在go中构造文本sql查询时,在go sql queries 中,在使用conting and contement和contement consem per时,尤其是在使用integer per当per当per时,per per per当per. 在GO中实现这一目标的惯用方法是使用fmt.spr...
    编程 发布于2025-07-21
  • PHP阵列键值异常:了解07和08的好奇情况
    PHP阵列键值异常:了解07和08的好奇情况
    PHP数组键值问题,使用07&08 在给定数月的数组中,键值07和08呈现令人困惑的行为时,就会出现一个不寻常的问题。运行print_r($月份)返回意外结果:键“ 07”丢失,而键“ 08”分配给了9月的值。此问题源于PHP对领先零的解释。当一个数字带有0(例如07或08)的前缀时,PHP将...
    编程 发布于2025-07-21
  • Async Void vs. Async Task在ASP.NET中:为什么Async Void方法有时会抛出异常?
    Async Void vs. Async Task在ASP.NET中:为什么Async Void方法有时会抛出异常?
    在ASP.NET async void void async void void void void void的设计无需返回asynchroncon而无需返回任务对象。他们在执行过程中增加未偿还操作的计数,并在完成后减少。在某些情况下,这种行为可能是有益的,例如未期望或明确预期操作结果的火灾和...
    编程 发布于2025-07-21
  • 如何高效地在一个事务中插入数据到多个MySQL表?
    如何高效地在一个事务中插入数据到多个MySQL表?
    mySQL插入到多个表中,该数据可能会产生意外的结果。虽然似乎有多个查询可以解决问题,但将从用户表的自动信息ID与配置文件表的手动用户ID相关联提出了挑战。使用Transactions和last_insert_id() 插入用户(用户名,密码)值('test','test...
    编程 发布于2025-07-21
  • 为什么使用Firefox后退按钮时JavaScript执行停止?
    为什么使用Firefox后退按钮时JavaScript执行停止?
    导航历史记录问题:JavaScript使用Firefox Back Back 此行为是由浏览器缓存JavaScript资源引起的。要解决此问题并确保在后续页面访问中执行脚本,Firefox用户应设置一个空功能。 警报'); }; alert('inline Alert')...
    编程 发布于2025-07-21
  • Go语言如何动态发现导出包类型?
    Go语言如何动态发现导出包类型?
    与反射软件包中的有限类型的发现能力相反,本文探讨了在运行时发现所有包装类型(尤其是struntime go import( “ FMT” “去/进口商” ) func main(){ pkg,err:= incorter.default()。导入(“ time”) ...
    编程 发布于2025-07-21
  • 如何解决AppEngine中“无法猜测文件类型,使用application/octet-stream...”错误?
    如何解决AppEngine中“无法猜测文件类型,使用application/octet-stream...”错误?
    [ appengine静态文件mime type override ,静态文件处理程序偶尔可以覆盖正确的mime类型,在错误消息中导致错误消息:“无法猜测mimeType for for File [使用[File]。 application/application/octet-stream ....
    编程 发布于2025-07-21
  • 如何在Chrome中居中选择框文本?
    如何在Chrome中居中选择框文本?
    选择框的文本对齐:局部chrome-inly-ly-ly-lyly solument 您可能希望将文本中心集中在选择框中,以获取优化的原因或提高可访问性。但是,在CSS中的选择元素中手动添加一个文本 - 对属性可能无法正常工作。初始尝试 state)</option> < op...
    编程 发布于2025-07-21
  • Python高效去除文本中HTML标签方法
    Python高效去除文本中HTML标签方法
    在Python中剥离HTML标签,以获取原始的文本表示Achieving Text-Only Extraction with Python's MLStripperTo streamline the stripping process, the Python standard librar...
    编程 发布于2025-07-21
  • 在程序退出之前,我需要在C ++中明确删除堆的堆分配吗?
    在程序退出之前,我需要在C ++中明确删除堆的堆分配吗?
    在C中的显式删除 在C中的动态内存分配时,开发人员通常会想知道是否需要手动调用“ delete”操作员在heap-exprogal exit exit上。本文深入研究了这个主题。 在C主函数中,使用了动态分配变量(HEAP内存)的指针。当应用程序退出时,此内存是否会自动发布?通常,是。但是,即使在这...
    编程 发布于2025-07-21
  • 为什么PYTZ最初显示出意外的时区偏移?
    为什么PYTZ最初显示出意外的时区偏移?
    与pytz 最初从pytz获得特定的偏移。例如,亚洲/hong_kong最初显示一个七个小时37分钟的偏移: 差异源利用本地化将时区分配给日期,使用了适当的时区名称和偏移量。但是,直接使用DateTime构造器分配时区不允许进行正确的调整。 example pytz.timezone(...
    编程 发布于2025-07-21

免责声明: 提供的所有资源部分来自互联网,如果有侵犯您的版权或其他权益,请说明详细缘由并提供版权或权益证明然后发到邮箱:[email protected] 我们会第一时间内为您处理。

Copyright© 2022 湘ICP备2022001581号-3