」工欲善其事,必先利其器。「—孔子《論語.錄靈公》
首頁 > 程式設計 > 從頭開始建立一個小型向量存儲

從頭開始建立一個小型向量存儲

發佈於2024-11-06
瀏覽:987

With the evolving landscape of generative AI, vector databases are playing crucial role in powering generative AI applications. There are so many vector databases currently available that are open source such as Chroma, Milvus along with other popular proprietary vector databases such as Pinecone, SingleStore. You can read the detailed comparison of different vector databases on this site.

But, have you ever wondered how these vector databases work behind the scenes?

A great way to learn something is to understand how things work under the hood. In this article, we will be building a tiny in-memory vector store "Pixie" from scratch using Python with only NumPy as a dependency.

Building a tiny vector store from scratch


Before diving into the code, let's briefly discuss what a vector store is.

What is a vector store?

A vector store is a database designed to store and retrieve vector embeddings efficiently. These embeddings are numerical representations of data (often text but could be images, audio etc.) that capture semantic meaning in a high-dimensional space. The key feature of a vector store is their ability to perform efficient similarity searches, finding the most relevant data points based on their vector representations. Vector stores can be used in many tasks such as:

  1. Semantic search
  2. Retrieval augmented generation (RAG)
  3. Recommendation system

Let's code

In this article, we are going to create a tiny in-memory vector store called "Pixie". While it won't have all the optimizations of a production-grade system, it will demonstrate the core concepts. Pixie will have two main functionalities:

  1. Storing document embeddings
  2. Performing similarity searches

Setting up the vector store

First, we'll create a class called Pixie:

import numpy as np
from sentence_transformers import SentenceTransformer
from helpers import cosine_similarity


class Pixie:
    def __init__(self, embedder) -> None:
        self.store: np.ndarray = None
        self.embedder: SentenceTransformer = embedder
  1. First, we import numpy for efficient numerical operations and storing multi-dimensional arrays.
  2. We will also import SentenceTransformer from sentence_transformers library. We are using SentenceTransformer for embeddings generation, but you could use any embedding model that converts text to vectors. In this article, our primary focus will be on vector store itself, and not on embeddings generation.
  3. Next, we'll initialize Pixie class with an embedder. The embedder can be moved outside of the main vector store but for simplicity purposes, we'll initialize it inside the vector store class.
  4. self.store will hold our document embeddings as a NumPy array.
  5. self.embedder will hold the embedding model that we'll use to convert documents and queries into vectors.

Ingesting documents

To ingest documents/data in our vector store, we'll implement the from_docs method:

def from_docs(self, docs):
        self.docs = np.array(docs)
        self.store = self.embedder.encode(self.docs)
        return f"Ingested {len(docs)} documents"

This method does few key things:

  1. It takes a list of documents and stores them as a NumPy array in self.docs.
  2. It uses the embedder model to convert each document into a vector embedding. These embeddings are stored in self.store.
  3. It returns a message confirming how many documents were ingested. The encode method of our embedder is doing the heavy lifting here, converting each text document into a high-dimensional vector representation.

Performing similarity search

The heart of our vector store is the similarity search function:

def similarity_search(self, query, top_k=3):
        matches = list()
        q_embedding = self.embedder.encode(query)
        top_k_indices = cosine_similarity(self.store, q_embedding, top_k)
        for i in top_k_indices:
            matches.append(self.docs[i])
        return matches

Let's break this down:

  1. We start by creating an empty list called matches to store our matches.
  2. We encode the user query using the same embedder model we used for ingesting the documents. This ensures that the query vector is in the same space as our document vectors.
  3. We call a cosine_similarity function (which we'll define next) to find the most similar documents.
  4. We use the returned indices to fetch the actual documents from self.docs.
  5. Finally, we return the list of matching documents.

Implementing cosine similarity

import numpy as np


def cosine_similarity(store_embeddings, query_embedding, top_k):
    dot_product = np.dot(store_embeddings, query_embedding)
    magnitude_a = np.linalg.norm(store_embeddings, axis=1)
    magnitude_b = np.linalg.norm(query_embedding)

    similarity = dot_product / (magnitude_a * magnitude_b)

    sim = np.argsort(similarity)
    top_k_indices = sim[::-1][:top_k]

    return top_k_indices

This function is doing several important things:

  1. It calculates the cosine similarity using the formula: cos(θ) = (A · B) / (||A|| * ||B||)
  2. First, we calculate the dot product between the query embeddings and all document embeddings in the store.
  3. Then, we compute the magnitudes (Euclidean norms) of all vectors.
  4. Lastly, we sort the found similarities and return the indices of the top-k most similar documents. We are using cosine similarity because it measures the angle between vectors, ignoring their magnitudes. This means it can find semantically similar documents regardless of their length.
  5. There are other similarity metrics that you can explore such as:
    1. Euclidean distance
    2. Dot product similarity

You can read more about cosine similarity here.

Piecing everything together

Now that we have built all the pieces, let's understand how they work together:

  1. When we create a Pixie instance, we provide it with an embedding model.
  2. When we ingest documents, we create vector embeddings for each document and store them in self.store.
  3. For a similarity search:
    1. We create an embedding for the query.
    2. We calculate cosine similarity between the query embeddings and all document embeddings.
    3. We return the most similar documents. All the magic happens inside the cosine similarity calculation. By comparing the angle between vectors rather than their magnitude, we can find semantically similar documents even if they use different words or phrasing.

Seeing it in action

Now let's implement a simple RAG system using our Pixie vector store. We'll ingest a story document of a "space battle & alien invasion" and then ask questions about it to see how it generates an answer.

import os
import sys
import warnings

warnings.filterwarnings("ignore")

import ollama
import numpy as np
from sentence_transformers import SentenceTransformer

current_dir = os.path.dirname(os.path.abspath(__file__))
root_dir = os.path.abspath(os.path.join(current_dir, ".."))
sys.path.append(root_dir)

from pixie import Pixie


# creating an instance of a pre-trained embedder model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# creating an instance of Pixie vector store
pixie = Pixie(embedder)


# generate an answer using llama3 and context docs
def generate_answer(prompt):
    response = ollama.chat(
        model="llama3",
        options={"temperature": 0.7},
        messages=[
            {
                "role": "user",
                "content": prompt,
            },
        ],
    )
    return response["message"]["content"]


with open("example/spacebattle.txt") as f:
    content = f.read()
    # ingesting the data into vector store
    ingested = pixie.from_docs(docs=content.split("\n\n"))
    print(ingested)

# system prompt
PROMPT = """
    User has asked you following question and you need to answer it based on the below provided context. 
If you don't find any answer in the given context then just say 'I don't have answer for that'. 
In the final answer, do not add "according to the context or as per the context". 
You can be creative while using the context to generate the final answer. DO NOT just share the context as it is.

    CONTEXT: {0}
    QUESTION: {1}

    ANSWER HERE:
"""

while True:
    query = input("\nAsk anything: ")
    if len(query) == 0:
        print("Ask a question to continue...")
        quit()

    if query == "/bye":
        quit()

    # search similar matches for query in the embedding store
    similarities = pixie.similarity_search(query, top_k=5)
    print(f"query: {query}, top {len(similarities)} matched results:\n")

    print("-" * 5, "Matched Documents Start", "-" * 5)
    for match in similarities:
        print(f"{match}\n")
    print("-" * 5, "Matched Documents End", "-" * 5)

    context = ",".join(similarities)
    answer = generate_answer(prompt=PROMPT.format(context, query))
    print("\n\nQuestion: {0}\nAnswer: {1}".format(query, answer))

    continue

Here is the output:

Ingested 8 documents

Ask anything: What was the invasion about?
query: What was the invasion about?, top 5 matched results:

----- Matched Documents Start -----
Epilogue: A New Dawn
Years passed, and the alliance between humans and Zorani flourished. Together, they rebuilt what had been lost, creating a new era of exploration and cooperation. The memory of the Krell invasion served as a stark reminder of the dangers that lurked in the cosmos, but also of the strength that came from unity. Admiral Selene Cortez retired, her name etched in the annals of history. Her legacy lived on in the new generation of leaders who continued to protect and explore the stars. And so, under the twin banners of Earth and Zorani, the galaxy knew peace—a fragile peace, hard-won and deeply cherished.

Chapter 3: The Invasion
Kael's warning proved true. The Krell arrived in a wave of bio-mechanical ships, each one bristling with organic weaponry and shields that regenerated like living tissue. Their tactics were brutal and efficient. The Titan Fleet, caught off guard, scrambled to mount a defense. Admiral Cortez's voice echoed through the corridors of the Prometheus. "All hands to battle stations! Prepare to engage!" The first clash was catastrophic. The Krell ships, with their organic hulls and adaptive technology, sliced through human defenses like a knife through butter. The outer rim colonies fell one by one, each defeat sending a shockwave of despair through the fleet. Onboard the Prometheus, Kael offered to assist, sharing Zorani technology and knowledge. Reluctantly, Cortez agreed, integrating Kael’s insights into their strategy. New energy weapons were developed, capable of piercing Krell defenses, and adaptive shields were installed to withstand their relentless attacks.

Chapter 5: The Final Battle
Victory on Helios IV was a much-needed morale boost, but the war was far from over. The Krell regrouped, launching a counter-offensive aimed directly at Earth. Every available ship was called back to defend humanity’s homeworld. As the Krell armada approached, Earth’s skies filled with the largest fleet ever assembled. The Prometheus led the charge, flanked by newly built warships and the remaining Zorani vessels that had joined the fight. "This is it," Cortez addressed her crew. "The fate of our species depends on this battle. We hold the line here, or we perish." The space above Earth turned into a maelstrom of fire and metal. Ships collided, energy beams sliced through the void, and explosions lit up the darkness. The Krell, relentless and numerous, seemed unbeatable. In the midst of chaos, Kael revealed a hidden aspect of Zorani technology—a weapon capable of creating a singularity, a black hole that could consume the Krell fleet. It was a desperate measure, one that could destroy both fleets. Admiral Cortez faced an impossible choice. To use the weapon would mean sacrificing the Titan Fleet and potentially Earth itself. But to do nothing would mean certain destruction at the hands of the Krell. "Activate the weapon," she ordered, her voice heavy with resolve. The Prometheus moved into position, its hull battered and scorched. As the singularity weapon charged, the Krell ships converged, sensing the threat. In a blinding burst of light, the weapon fired, tearing the fabric of space and creating a black hole that began to devour everything in its path.

Chapter 1: The Warning
It began with a whisper—a distant signal intercepted by the outermost listening posts of the Titan Fleet. The signal was alien, unlike anything the human race had ever encountered. For centuries, humanity had expanded its reach into the cosmos, colonizing distant planets and establishing trade routes across the galaxy. The Titan Fleet, the pride of Earth's military might, stood as the guardian of these far-flung colonies.Admiral Selene Cortez, a seasoned commander with a reputation for her sharp tactical mind, was the first to analyze the signal. As she sat in her command center aboard the flagship Prometheus, the eerie transmission played on a loop. It was a distress call, but its origin was unknown. The message, when decoded, revealed coordinates on the edge of the Andromeda Sector. "Set a course," Cortez ordered. The fleet moved with precision, a testament to years of training and discipline.

Chapter 4: Turning the Tide
The next battle, over the resource-rich planet of Helios IV, was a turning point. Utilizing the new technology, the Titan Fleet managed to hold their ground. The energy weapons seared through Krell ships, and the adaptive shields absorbed their retaliatory strikes. "Focus fire on the lead ship," Cortez commanded. "We break their formation, we break their spirit." The flagship of the Krell fleet, a massive dreadnought known as Voreth, was targeted. As the Prometheus and its escorts unleashed a barrage, the Krell ship's organic armor struggled to regenerate. In a final, desperate maneuver, Cortez ordered a concentrated strike on Voreth's core. With a blinding flash, the dreadnought exploded, sending a ripple of confusion through the Krell ranks. The humans pressed their advantage, driving the Krell back.
----- Matched Documents End -----


Question: What was the invasion about?
Answer: The Krell invasion was about the Krell arriving in bio-mechanical ships with organic weaponry and shields that regenerated like living tissue, seeking to conquer and destroy humanity.

Building a tiny vector store from scratch

We have successfully built a tiny in-memory vector store from scratch by using Python and NumPy. While it is very basic, it demonstrates the core concepts such as vector storage, and similarity search. Production grade vector stores are much more optimized and feature-rich.

Github repo: Pixie

Happy coding, and may your vectors always point in the right direction!

版本聲明 本文轉載於:https://dev.to/vivekalhat/building-a-tiny-vector-store-from-scratch-59ep?1如有侵犯,請聯絡[email protected]刪除
最新教學 更多>
  • PHP與C++函數重載處理的區別
    PHP與C++函數重載處理的區別
    作為經驗豐富的C開發人員脫離謎題,您可能會遇到功能超載的概念。這個概念雖然在C中普遍,但在PHP中構成了獨特的挑戰。讓我們深入研究PHP功能過載的複雜性,並探索其提供的可能性。 在PHP中理解php的方法在PHP中,函數超載的概念(如C等語言)不存在。函數簽名僅由其名稱定義,而與他們的參數列表無關...
    程式設計 發佈於2025-05-12
  • 為什麼PYTZ最初顯示出意外的時區偏移?
    為什麼PYTZ最初顯示出意外的時區偏移?
    與pytz 最初從pytz獲得特定的偏移。例如,亞洲/hong_kong最初顯示一個七個小時37分鐘的偏移: 差異源利用本地化將時區分配給日期,使用了適當的時區名稱和偏移量。但是,直接使用DateTime構造器分配時區不允許進行正確的調整。 example pytz.timezone(&#...
    程式設計 發佈於2025-05-12
  • Go語言垃圾回收如何處理切片內存?
    Go語言垃圾回收如何處理切片內存?
    Garbage Collection in Go Slices: A Detailed AnalysisIn Go, a slice is a dynamic array that references an underlying array.使用切片時,了解垃圾收集行為至關重要,以避免潛在的內存洩...
    程式設計 發佈於2025-05-12
  • 同實例無需轉儲複製MySQL數據庫方法
    同實例無需轉儲複製MySQL數據庫方法
    在同一實例上複製一個MySQL數據庫而無需轉儲在同一mySQL實例上複製數據庫,而無需創建InterMediate sqql script。以下方法為傳統的轉儲和IMPORT過程提供了更簡單的替代方法。 直接管道數據 MySQL手動概述了一種允許將mysqldump直接輸出到MySQL cli...
    程式設計 發佈於2025-05-12
  • Spark DataFrame添加常量列的妙招
    Spark DataFrame添加常量列的妙招
    在Spark Dataframe ,將常數列添加到Spark DataFrame,該列具有適用於所有行的任意值的Spark DataFrame,可以通過多種方式實現。使用文字值(SPARK 1.3)在嘗試提供直接值時,用於此問題時,旨在為此目的的column方法可能會導致錯誤。 df.withCo...
    程式設計 發佈於2025-05-12
  • 如何克服PHP的功能重新定義限制?
    如何克服PHP的功能重新定義限制?
    克服PHP的函數重新定義限制 但是,PHP工具腰帶中有一個隱藏的寶石:runkit擴展。它使您能夠靈活地重新定義函數。 runkit_function_renction_rename() runkit_function_redefine() //重新定義'this'以返回“新和...
    程式設計 發佈於2025-05-12
  • 如何使用node-mysql在單個查詢中執行多個SQL語句?
    如何使用node-mysql在單個查詢中執行多個SQL語句?
    Multi-Statement Query Support in Node-MySQLIn Node.js, the question arises when executing multiple SQL statements in a single query using the node-mys...
    程式設計 發佈於2025-05-12
  • JavaScript計算兩個日期之間天數的方法
    JavaScript計算兩個日期之間天數的方法
    How to Calculate the Difference Between Dates in JavascriptAs you attempt to determine the difference between two dates in Javascript, consider this s...
    程式設計 發佈於2025-05-12
  • 在GO中構造SQL查詢時,如何安全地加入文本和值?
    在GO中構造SQL查詢時,如何安全地加入文本和值?
    在go中構造文本sql查詢時,在go sql queries 中,在使用conting and contement和contement consem per時,尤其是在使用integer per當per當per時,per per per當per. [&​​&&&&&&&&&&&&&&&默元組方法在...
    程式設計 發佈於2025-05-12
  • 如何使用Python有效地以相反順序讀取大型文件?
    如何使用Python有效地以相反順序讀取大型文件?
    在python 中,如果您使用一個大文件,並且需要從最後一行讀取其內容,則在第一行到第一行,Python的內置功能可能不合適。這是解決此任務的有效解決方案:反向行讀取器生成器 == ord('\ n'): 緩衝區=緩衝區[:-1] ...
    程式設計 發佈於2025-05-12
  • 為什麼HTML無法打印頁碼及解決方案
    為什麼HTML無法打印頁碼及解決方案
    無法在html頁面上打印頁碼? @page規則在@Media內部和外部都無濟於事。 HTML:Customization:@page { margin: 10%; @top-center { font-family: sans-serif; font-weight: ...
    程式設計 發佈於2025-05-12
  • 如何有效地選擇熊貓數據框中的列?
    如何有效地選擇熊貓數據框中的列?
    在處理數據操作任務時,在Pandas DataFrames 中選擇列時,選擇特定列的必要條件是必要的。在Pandas中,選擇列的各種選項。 選項1:使用列名 如果已知列索引,請使用ILOC函數選擇它們。請注意,python索引基於零。 df1 = df.iloc [:,0:2]#使用索引0和1 ...
    程式設計 發佈於2025-05-12
  • 如何在無序集合中為元組實現通用哈希功能?
    如何在無序集合中為元組實現通用哈希功能?
    在未訂購的集合中的元素要糾正此問題,一種方法是手動為特定元組類型定義哈希函數,例如: template template template 。 struct std :: hash { size_t operator()(std :: tuple const&tuple)const {...
    程式設計 發佈於2025-05-12
  • MySQL中如何高效地根據兩個條件INSERT或UPDATE行?
    MySQL中如何高效地根據兩個條件INSERT或UPDATE行?
    在兩個條件下插入或更新或更新 solution:的答案在於mysql的插入中...在重複鍵更新語法上。如果不存在匹配行或更新現有行,則此功能強大的功能可以通過插入新行來進行有效的數據操作。如果違反了唯一的密鑰約束。 實現所需的行為,該表必須具有唯一的鍵定義(在這種情況下為'名稱'...
    程式設計 發佈於2025-05-12
  • 如何使用不同數量列的聯合數據庫表?
    如何使用不同數量列的聯合數據庫表?
    合併列數不同的表 當嘗試合併列數不同的數據庫表時,可能會遇到挑戰。一種直接的方法是在列數較少的表中,為缺失的列追加空值。 例如,考慮兩個表,表 A 和表 B,其中表 A 的列數多於表 B。為了合併這些表,同時處理表 B 中缺失的列,請按照以下步驟操作: 確定表 B 中缺失的列,並將它們添加到表的...
    程式設計 發佈於2025-05-12

免責聲明: 提供的所有資源部分來自互聯網,如果有侵犯您的版權或其他權益,請說明詳細緣由並提供版權或權益證明然後發到郵箱:[email protected] 我們會在第一時間內為您處理。

Copyright© 2022 湘ICP备2022001581号-3