Managing Permissions for Vector Embeddings

Ever since ChatGPT burst onto everybody's radar in November 2022, enterprises have almost universally demanded a way to use this technology safely on their own data. How OpenAI, Google, Anthropic, and other model providers use that data has been discussed ad nauseam, but one thing that is often underexplored is how companies can protect information that is access controlled within their organization as they connect it to large language models.

Each individual and team at a company has access to different shared documents and spreadsheets in the company OneDrive, Google Drive, or Box, as well as different access to data in operational tools like Salesforce, Jira, etc.

Before connecting these data sources to productivity apps that generate embeddings from them, enterprises have to consider how to represent both the access controls and the privacy or compliance requirements associated with each source, since the embeddings themselves, when stored or searched, should respect the permissions and privacy policies of the underlying data.

This guide covers how to think about security, compliance, and permissions for embeddings: when and why embeddings should be treated as being as sensitive as the underlying data itself, when embedding permissions actually matter, and the best ways to implement them.

What are vector embeddings and why are they sensitive?

First and foremost, it's crucial for those in charge of securing these systems to understand what vector embeddings are and the potential security considerations around them. Vector embeddings translate complex data such as words, sentences, and even entire documents into a list of numbers; the length of this list is the 'dimensionality' of the vector. Each number in the list captures some aspect of the text. For example, the first number might tell you how friendly the tone is (with 1 being very rude and 10 being very friendly), the second how readable the text is, and so on. High-dimensional vectors can capture a significant amount of the meaning and context of a text.

Here's a very simplified example with 4 words, each represented as a list of 5 numbers (i.e., an embedding with dimensionality 5):

| Word  | Is living | Is an animal | Can bark | Is large | Found outdoors |
|-------|-----------|--------------|----------|----------|----------------|
| Dog   | 1         | 1            | 1        | 0        | 1              |
| Cat   | 1         | 1            | 0        | 0        | 1              |
| Tree  | 1         | 0            | 0        | 1        | 1              |
| House | 0         | 0            | 0        | 1        | 1              |

In this table, we can see that Dog and Cat are quite similar (the only difference is that one can bark and the other can't). With this (obviously very primitive) embedding model, Tree is more similar to House than to Cat or Dog, but Dog is still closer to Tree than to House.

This model is extremely primitive for three reasons: firstly, it only has 5 "dimensions"; secondly, each dimension can only be 1 or 0; and thirdly, we are only embedding one word at a time. In reality, modern vector embeddings can have 1,000+ dimensions, with the highest-ranked open source embedding model using 1024 dimensions and OpenAI's embedding models using 1536. Each dimension can take on many possible values (instead of just 1 or 0), and real dimensions are rarely as simple to interpret as "is an animal".

This way, an 'embedding' (which is just the technical name for this list of numbers) can capture how similar two pieces of text are: you simply check how similar the two lists of numbers are, just as "dog" and "cat" are nearby in the example above, indicating their similarity.
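
To make this concrete, here's a minimal Python sketch, using the toy vectors from the table above, of one common way to compare embeddings: cosine similarity. A higher score means the vectors, and hence the texts they represent, are more alike.

```python
import math

# The toy 5-dimensional embeddings from the table above.
embeddings = {
    "dog":   [1, 1, 1, 0, 1],
    "cat":   [1, 1, 0, 0, 1],
    "tree":  [1, 0, 0, 1, 1],
    "house": [0, 0, 0, 1, 1],
}

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means identical direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["dog"], embeddings["cat"]))    # ~0.87: very similar
print(cosine_similarity(embeddings["dog"], embeddings["house"]))  # ~0.35: much less similar
```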

As dimensionality increases, so does the extent to which the original meaning of the text is captured. This means that high-dimensional embeddings of very sensitive data should be treated as almost as sensitive as the data itself, since they capture much of its meaning.

The top-performing embedding models on Hugging Face (a popular repository of AI models) range from 384 to 1536 dimensions (i.e., the number of traits by which each piece of text is measured).

What are the main security concerns that enterprises have around managing embeddings?

There are a variety of concerns associated with managing embeddings, including how best to propagate permissions from source systems, how to audit and monitor usage, and how to enforce privacy and compliance. Some of this follows standard industry best practice, but some of it requires understanding the nuances of embeddings.

Mirroring source system permissions

The most significant security risk associated with vector embeddings lies in the fact that while traditional text data lives in carefully access-controlled systems like OneDrive, Google Drive, email, Confluence, etc., embeddings are typically generated programmatically by scripts or apps that connect to that data. Someone needs access to the data to generate the embedding, but once it is generated, nothing enforces that the embedding, wherever it is saved, is subject to the same security constraints as the original data. When using open source data connectors like LangChain or LlamaIndex, it's therefore extremely easy to take highly sensitive data and accidentally make it available to anyone at your organization.

Similarly, many systems like Microsoft Office have a notion of "password protected" documents. Simply dumping all your documents into a vector database attached to an LLM that anyone can access risks throwing out the window all of the permissions you've carefully curated over months. What enterprises typically want is to mirror the source permissions from the raw data when storing the embeddings, or, if they are transforming the data before embedding it, to make intentional choices about how the transformed data ought to be access controlled.
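
To illustrate, here's a hedged sketch of what mirroring source permissions at ingestion time might look like. `EmbeddingRecord`, `embed_text`, and the document dictionaries are hypothetical stand-ins for your actual embedding model and source-system connector, not any particular library's API.

```python
from dataclasses import dataclass

@dataclass
class EmbeddingRecord:
    doc_id: str
    vector: list[float]
    allowed_principals: set[str]      # ACL mirrored from the source system
    classification: str = "internal"  # sensitivity label inherited from the source

def ingest(documents, embed_text):
    """Embed each document and copy its source-system ACL onto the embedding."""
    records = []
    for doc in documents:
        records.append(EmbeddingRecord(
            doc_id=doc["id"],
            vector=embed_text(doc["text"]),
            # The crucial step: the embedding inherits the document's ACL,
            # rather than defaulting to "visible to everyone".
            allowed_principals=set(doc["acl"]),
            classification=doc.get("classification", "internal"),
        ))
    return records
```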

What's more, it's typically unwise to ask developers of internal applications to reimplement the security and permissions models of all of this data in every single application they build: as the number of applications increases, the probability of data leaking to the wrong users approaches 100%. Moreover, while many companies that hold language data do offer APIs exposing the relevant permissions and access control data (such as Google and Microsoft), not all do (Notion, for example). Since enforcing embedding permissions is therefore about more than just making the right API calls, defining the permissions once, where they are API accessible, helps ensure there is a single, programmatically accessible source of truth for who should have access to each embedding.

Instead of having each application mirror these permissions itself, a single enterprise embeddings store, containing all the relevant embeddings alongside their data classifications and permissions, lets developers access that data in a permissions-aware way across every application they build. That both reduces the attack surface and increases developer productivity: instead of every application reimplementing the same logic, each with some probability of error, you centralize that logic and relieve each developer of the burden of writing permissions code that isn't key to the value their application creates.
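
Building on the hypothetical `EmbeddingRecord` above, here is a minimal sketch of what such a centralized, permissions-aware store could look like. A production system would typically push this filter down into a real vector database's metadata filtering rather than scanning an in-memory list.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class PermissionAwareStore:
    """A central embeddings store that never returns results the user can't see."""

    def __init__(self, records):
        self.records = records  # list of EmbeddingRecord, as built by ingest()

    def search(self, query_vector, user, top_k=5):
        # Enforce the mirrored ACL before ranking, not after, so a result
        # the user shouldn't see can never leak into the response.
        visible = [r for r in self.records if user in r.allowed_principals]
        visible.sort(key=lambda r: cosine_similarity(query_vector, r.vector),
                     reverse=True)
        return visible[:top_k]
```

The key design choice is that the ACL check lives inside the store's search path, so no individual application can forget to apply it.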

Auditing and monitoring

Regular auditing and monitoring of both the training process (for companies actually training their own embedding models) and the actual usage of embeddings is crucial to ensuring that ongoing security and privacy requirements are actually being met. Once developer keys and tool access are handed out, verifying on an ongoing basis that the expectations set up front (about whether customer data or PII can be used with LLMs, for instance) are enforced is critical to maintaining confidence that employee training and policies are actually working. It also helps the company keep tabs on the state of its embeddings and ensure they aren't inadvertently capturing sensitive information. Data being embedded should be classified according to the company's classification system, and, typically, the embeddings should inherit the same classifications or sensitivity labels as the original data. It is equally important to monitor who is using this data across all the applications that touch it (enterprise search models, translation models, and so on).
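
As one concrete illustration of this kind of monitoring, here's a hedged sketch that wraps the hypothetical store from the previous example so every query leaves an audit trail. The log fields and file format are assumptions for illustration, not a standard.

```python
import json
import time

def audited_search(store, query_vector, user, audit_log_path="embedding_audit.log"):
    """Run a permission-aware search and record who asked and what was returned."""
    results = store.search(query_vector, user)
    entry = {
        "timestamp": time.time(),
        "user": user,
        "returned_doc_ids": [r.doc_id for r in results],
        # Recording the sensitivity labels returned makes it easy to spot
        # users or applications touching data above their expected tier.
        "classifications": sorted({r.classification for r in results}),
    }
    with open(audit_log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return results
```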

Proper data governance

Good data governance practices are essential for securing vector embeddings, or any form of data for that matter. This covers who has access to the data, how consistency is maintained across the systems the data travels through, how the data is classified and stored, and how data protection laws are complied with. Sticking to best practices in data governance will actually accelerate your ability to build applications on top of this data, because it gives your data teams the confidence to experiment without fearing a significant breach or compliance mistake.

One nuance of these principles as they apply to embeddings is that privacy and compliance requirements may vary with the fidelity of the embeddings themselves. If the embeddings are high-dimensional and LLMs (or "decoder" models) can infer the underlying data that generated them with high accuracy, then the privacy and compliance requirements on the embedded data should more or less match those on the underlying data. When the embeddings have fewer dimensions, there may be room to treat them as containing less "private" or personal information than the underlying source system. Ultimately, there is no simple equation to determine this: knowing exactly which parts of the original data are stored with high fidelity within a 1,000-dimension vector can be extremely tricky, and is still an area of active research.

Consider creating a single source of truth for your vector permissions

Securing vector embeddings is a complex process that involves considering potential threats, securing the data pipeline, employing privacy-enhancing techniques, monitoring regularly, sanitizing data, and ensuring good data governance. By combining these strategies effectively, companies can ensure the security of their vector embeddings and maintain the trust and confidence of their users.

While it may seem like a daunting task, security and privacy should always be at the forefront of an enterprise's considerations, especially when dealing with something as inherently complex and data-driven as AI models and vector embeddings. Not only is it beneficial from a purely business standpoint, but it's a crucial part of respecting and protecting user data. For those interested, Credal.ai offers the fastest way to empower all of your developers with a single integrated store of all your business's language data and associated embeddings, respecting the permissions in each case.
