How we built a system to create clinical taxonomies from healthcare policies using embeddings, clustering, and LLMs.
At NOF1, we're always looking for ways to use AI in our workflows to cut down research and synthesis time. We also work primarily with text data, which means we can't always lean on typical statistical methods for our analysis.
In this article, we walk you through how we built a system to create clinical taxonomies from healthcare policies using embeddings, clustering, and a bit of LLM magic.
Imagine you have a set of healthcare policies, and you need to group them in a way that makes sense—like sorting them into themed categories. For instance:
This grouping can come in handy in different healthcare-related scenarios, such as:
… and, really, many workflows outside healthcare as well.
Initially, we thought, "Why not just ask an LLM to do this?" But, as we commonly find, that's not enough on its own. Here's what we ran into:
We then tried breaking the policies into smaller chunks. This helped a bit, but we ended up with themes that had to be reworked several times, which again leads to loss of information and unpredictable results.
Embeddings are an underrated gem of LLM technology: they essentially turn text into numbers. This step converts the text of your policies into numerical vectors that sit close together when the contexts are similar. That opens up a lot of ways for us to find similar policies.
In the figure below, we show the distance matrix of the embeddings of 10 policies. The darker the color, the more similar the policies. You'll notice that the distances already show some interesting groupings. For example, the policies related to hip and knee surgeries are closer to each other than to the other policies, and the same holds for the mental health policies.
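To make the distance matrix concrete, here is a minimal sketch in plain Python. The policy names and the 3-dimensional vectors are made up for illustration (real embeddings come from an embedding model and have hundreds or thousands of dimensions), and cosine distance is an assumption about the metric, since the article does not say which one was used.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def distance_matrix(vectors):
    """Pairwise cosine distances; entry [i][j] compares vectors i and j."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]

# Hypothetical toy "embeddings" (illustrative values, not model output).
toy_embeddings = {
    "Total hip replacement": [0.9, 0.1, 0.0],
    "Knee arthroplasty": [0.8, 0.2, 0.1],
    "Depression treatment": [0.1, 0.9, 0.2],
    "Anxiety disorder therapy": [0.0, 0.8, 0.3],
}

names = list(toy_embeddings)
matrix = distance_matrix([toy_embeddings[n] for n in names])

# Print one row per policy; smaller numbers mean more similar policies.
for name, row in zip(names, matrix):
    print(f"{name:26s}", " ".join(f"{d:.2f}" for d in row))
```

In this toy data, the two orthopedic policies end up much closer to each other than to the two mental health policies, which is the same kind of pattern the figure shows.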

If you want to read more about embeddings, see the OpenAI and Gemini documentation.
Now that we have the numbers, we need to group them. We used K-clustering (a version of K-means) to do just that. It's like sorting items into a fixed number of buckets based on how similar they are.
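For readers who want to see the bucketing idea in code, below is a minimal sketch of standard K-means (Lloyd's algorithm) in plain Python. It is not the exact variant described above: the function names, the random initialization, the seed, and the use of squared Euclidean distance are all assumptions made for illustration, and the vectors are made-up toy values.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def k_means(vectors, k, max_iters=100, seed=0):
    """Cluster vectors into k buckets; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a bucket is empty
                centroids[c] = mean_vector(members)
    return labels, centroids

# Hypothetical toy vectors: two orthopedic-like, two mental-health-like.
points = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]
labels, centroids = k_means(points, k=2)
print(labels)
```

The number of buckets `k` has to be chosen up front, which is what "fixed buckets" means in practice. The cluster ids themselves are arbitrary, so only which items share an id is meaningful.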

Why It Works:
The final step is where we bring in the LLM again, but this time with a much simpler ask. Instead of expecting it to group policies, we let it focus on summarizing each cluster. You just feed the model a group of policies and ask, "What's a good theme for these?" This makes the task much simpler for the LLM and leads to clearer, more accurate results.
This cuts down on confusion and ensures the themes make sense based on the groupings.
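The labeling step can be sketched as follows. This is an illustration under stated assumptions, not the production prompt: `ask_llm` is a hypothetical placeholder for whatever function sends a prompt to your model and returns its text reply, the prompt wording is invented apart from the quoted question, and `fake_llm` is a stub so the example runs without any API access.

```python
from collections import defaultdict

def group_by_cluster(policies, labels):
    """Map each cluster label to the list of policies assigned to it."""
    clusters = defaultdict(list)
    for policy, label in zip(policies, labels):
        clusters[label].append(policy)
    return dict(clusters)

def build_theme_prompt(cluster_policies):
    """Build one prompt asking for a theme for a single cluster."""
    bullet_list = "\n".join(f"- {p}" for p in cluster_policies)
    return (
        "The following healthcare policies were grouped together:\n"
        f"{bullet_list}\n"
        "What's a good theme for these? Answer in one sentence."
    )

def label_clusters(policies, labels, ask_llm):
    """Ask the model for one theme per cluster; returns {label: theme}."""
    clusters = group_by_cluster(policies, labels)
    return {
        label: ask_llm(build_theme_prompt(members))
        for label, members in clusters.items()
    }

def fake_llm(prompt):
    """Stub standing in for a real model call (hypothetical)."""
    count = sum(1 for line in prompt.splitlines() if line.startswith("- "))
    return f"(theme for {count} policies)"

# Hypothetical policy titles and cluster labels from the previous step.
policies = [
    "Total hip replacement",
    "Knee arthroplasty",
    "Depression treatment",
    "Anxiety disorder therapy",
]
cluster_labels = [0, 0, 1, 1]

themes = label_clusters(policies, cluster_labels, fake_llm)
print(themes)
```

Because each call only sees one cluster, the model never has to decide the grouping itself. Swapping `fake_llm` for a real client call is the only change needed to try this against an actual model.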
Here is an example of the output of this step:

| Theme | Policies |
|---|---|
| This policy cluster outlines non-invasive treatment protocols for managing pain and musculoskeletal conditions using both pharmaceutical (ibuprofen) and rehabilitative (physical and occupational therapy) approaches. | |
| This cluster of policies focuses on emergency cardiovascular care, outlining guidelines and protocols for managing acute heart conditions such as heart attacks and cardiac arrests. | |
| This cluster of policies focuses on orthopedic joint surgeries, specifically addressing surgical interventions for hip and knee conditions. | |
| This cluster of policies focuses on comprehensive treatment guidelines and management strategies for chronic physical and behavioral/mental health conditions. | |
NOF1 is a payer policy intelligence platform across clinical and reimbursement policies. Our goal is to transform the way payers, providers, and other stakeholders navigate the complex landscape of healthcare policy and, in turn, the way healthcare is delivered.
A competitive intelligence platform with over 10K clinical policies across payers, UM vendors, and CMS. Designed for payers to assess their policy positioning and rapidly research alignment and differences relative to peers.
An EMR platform that allows providers to understand clinical policy requirements at point of care, drastically improving documentation quality and compliance while reducing unnecessary denials.
APIs that allow the retrieval of clinical policies and criteria in machine-readable form for integration into your enterprise software.
To learn more, please reach out to ahmed@nofone.io.