In this post, I will present a project that implements semantic search with a chat interface for Swedish legislation and regulations. This project utilizes an OpenAI’s embedding model and a chat completion model to provide accurate and contextual responses to user queries. The project leverages embeddings and a vector database to enable efficient and accurate search results with references to your own data. In the following sections, I will go into the aspects of the different components and their significance in the project.

The project’s implementation involves the following steps:

    1. Scraping Swedish legislation data
    1. Creating embeddings
    1. Setting up a vector database,
    1. Populating the vector database with embeddings
    1. Setup a context generation systetem that builds up a context around a query that is used as input to a chat completion model.

The setup can easily be extended to handle type of data that you want to make searchable with natural language queries.

The user will specify a question, the query is embedded and compared against the embedded contexts stored in the vector database. The nearest matches are extracted and used as context for a chat completion call to answer the initial query.

For the code see github

I have used the embedding model text-embedding-ada-002 and gpt-3.5-turbo (I do not have API-access to gpt-4) for the chat completion part. Both models seems to struggle a bit with Swedish and hallucinations are a recurring problem. gpt-3.5-turbo has also a short context length (compared to gpt-4) of around 4,000 tokens, which makes the size of the context that I can add smaller than for a model with a larger context length. In those cases where gpt-3 fails I have tested gpt-4 in the web interface, gpt-4 produces much fewer hallucinations, and the response follows directly from the added context.

Embeddings

Embedding is a powerful tool for representing complex and high-dimensional data in a more compact and efficient manner. In this project, embeddings are used to convert the text data of Swedish legislation and regulations into vectors containing floats. By doing so, we can compare and measure the similarity between different pieces of legislations and queries in a efficient way. An OpenAI’s embedding model is used to generate these embeddings.

Vector databases

A vector database is a specialized storage database designed to handle high-dimensional data, such as embeddings. It setup efficient storage, indexing, and retrieval of these vectors and accompanying data while having the ability to perform similarity-based searches. I have used a Weaviate vector database to store the embeddings. The use of a vector database allows for quick retrieval of relevant text when the user supply a free-text search query.

Hierarchical Navigable Small World (HNSW)

The Hierarchical Navigable Small World (HNSW) algorithm is an efficient graph-based search algorithm for approximate nearest neighbor search in high-dimensional spaces. It has gained popularity due to its ability to handle large-scale datasets while providing fast and accurate search results. The key concepts behind the HNSW algorithm are the small-world graph and its hierarchical structure. A small-world graph is a graph where most vertices are not neighbors with each other, but the neighbors of any vertex in the graph are, with high likelihood, a neighbor of each other. Because of this, these types of graphs normally have a small number of degrees of separation, meaning that vertices are typically reachable from another with few steps $L$. The typical distance $L$ increases with $L \propto \log N$, where N is the number of vertices of the graph. HNSW extends small-world graphs to make the edges separated across multiple different hierarchical layers. Where the top layers have few vertices with long edges, and further down you go, the more vertices exist with short edges. During a search in an HNSW graph, we start at the top layer and traverse the edges in a greedy fashion until we have found the local minimum. We then start the search anew at the layer below, starting at the current vertex. This zoom-in search continues until we have reached the lowest layer.

It should be known that HNSW, as well as any approximate nearest neighbors (ANN) methods, are just approximate; you will not get perfect recall that you would get with the otherwise untractable (with large datasets) K-Nearest Neighbor.

After the legislation and regulations have been populated in the vector database, we can find the nearest matches to a given query using the HNSW algorithm. The nearest matches are then used as context for a chat completion call to answer the initial query.

Examples

For setting up the different classes to enable the search see the notebook

from search import SwedishLegislationSearch
law_search = SwedishLegislationSearch(
    embedd_function=EmbedderSwedishLegislation.embedd_single,
    vector_db=vector_db,
)
law_search.query("Vilket ansvar har polisen under ett krigstillstånd? Svara detaljerat", limit=25, print_debug=True)

Example Query

Here’s an example query and the system’s response:

Query:

Vilket ansvar har polisen under ett krigstillstånd? Svara detaljerat?

Context:

…. …. ….

Lag (1943:881) om polisens ställning under krig||varsmakten. Lag (2014:581). 3 § Har upphävts genom lag (2014:581). 4 § Har upphävts genom lag (1980:587). 5 § Regeringen meddelar närmare föreskrifter om tillämpningen av denna lag. Lag (2014:581). Lag (1979:1088) om gränsövervakningen i krig m.m.||ten ska bedrivas, 2. vilka polismän, särskilt förordnade passkontrollanter och tulltjänstemän som ska avdelas för verksamheten och vilka militära enheter som ska medverka i denna, 3. hur personalen ska utbildas och utrustas, 4. vilken mate Lag (1979:1088) om gränsövervakningen i krig m.m.||atta åtgärder. Om Försvarsmakten, Polismyndigheten eller Tullverket ska besluta om fortsatta åtgärder ska ärendet i stället överlämnas dit. Lag (2014:684). 8 § Om en skärpning av kontrollen av persontrafiken till eller från utlandet med visst Förordning (1982:756) om Försvarsmaktens ingripanden vid kränkningar av Sveriges territorium under fred och neutralitet, m.m. (IKFN-förordning)||smyndigheten. Om Polismyndigheten begär det, ska personalen hållas kvar i avvaktan på vidare åtgärder. Om det är nödvändigt får Försvarsmakten tillgripa vapenmakt. Det som nu har sagts gäller inte ambulansluftfartyg. Förordning (2014:1213). 54

…. ….

Response:

Enligt Lag (1943:881) om polisens ställning under krig är en polisman skyldig att delta i rikets försvar i den omfattning regeringen föreskriver. En polisman som enligt särskilda bestämmelser ska delta i rikets försvar tillhör under krig Försvarsmakten. Polismyndigheten kan begära medverkan av Polismyndigheten om polisiär medverkan är nödvändig för att avvärja ett fientligt angrepp.

Enligt Kungörelse (1958:262) om tillämpning av lagen (1943:881) om polisens ställning under krig kan polismän avdelas för att fullgöra nödvändiga civila försvarsuppgifter eller medverkan allvarligt skulle äventyra den allmänna ordningen och säkerheten. När den begärda uppgiften har slutförts får Försvarsmakten inte förfoga över polismännen.

Polismyndigheten har även ansvar för gränsövervakningen i krig enligt Lag (1979:1088) om gränsövervakningen i krig m.m. Polismyndigheten, Tullverket, Försvarsmakten och Kustbevakningen ska samordna övervakningen av trafiken över rikets gränser.

Vidare kan polisen använda skjutvapen om gruppen, avdelningen eller någon annan utsätts för ett så allvarligt angrepp eller hot om angrepp att vapen måste användas omedelbart enligt Kungörelse (1969:84) om polisens användning av skjutvapen.

Referenser:

  • Lag (1943:881) om polisens ställning under krig
  • Kungörelse (1958:262) om tillämpning av lagen (1943:881) om polisens ställning under krig
  • Lag (1979:1088) om gränsövervakningen i krig m.m.
  • Kungörelse (1969:84) om polisens användning av skjutvapen

Conclusion

In conclusion, this project show how you can combine embeddings, vector databases, and chat completion models, to create an efficient semantic search engine for your own data. By using a chat interface, the system allows users to interact naturally making your data searchable with natural language.

Although the current setup uses gpt-3.5-turbo, the potential for even better results with gpt-4 is clear, with fewer hallucinations and more accurate responses based on the provided context.

The approach presented can be easily adapted and extended to handle other types of data, making it a versatile solution for different applications.