Deploying LLM Inference Endpoints & Optimizing Output with RAG
In this blog post, guest blogger Martin Bald, Sr. Manager DevRel and Community at Microsoft Partner Wallaroo.AI, will go through the steps to easily operationalize LLM models and put in place measures that help ensure model integrity. He will touch on the staples of security, privacy, and compliance, showing how to avoid outputs such as toxicity and hallucinations using RAG (Retrieval-Augmented Generation).
Introduction
With the emergence of GenAI and services associated with it such as ChatGPT, enterprises are feeling the pressure to jump on the GenAI train and make sure they are not left behind in the AI adoption stampede.
AI adoption has been a bumpy ride for many organizations due to underestimating the time, effort, and cost it typically takes to get effective, reliable, and robust LLMs into production.
LLM Deployment in Wallaroo
LLM models can range in size from a few hundred megabytes to hundreds of gigabytes, and often need GPU resources. Because of this, it's important to configure the LLM production environment to ensure model function and performance, particularly latency and output accuracy.
Pre-production work for LLM model development and testing gives an understanding of the system requirements needed to deploy to production for optimal performance. For example, the standard Llama 3 8B or Llama 3 70B models would need at least one GPU. You could also take advantage of a quantized LLM. Quantization reduces the size of the LLM by lowering the precision of its weights, mapping values to a smaller set of discrete values. This makes the model more efficient in memory usage, CPU requirements, and compute speed without giving up accuracy on a specific known task.
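As a brief illustration of working with a quantized model, the sketch below loads a quantized GGUF build with the llama-cpp-python library and runs a short completion on CPU. The file name and parameter values are illustrative assumptions, not part of the Wallaroo workflow itself.
from llama_cpp import Llama
# hypothetical path to a quantized (4-bit) GGUF model file
llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # assumed filename
    n_ctx=2048,   # context window size
    n_threads=8,  # CPU threads to use
)
# run a short completion on CPU
result = llm("Q: What is quantization in one sentence? A: ", max_tokens=64, stop=["Q:"])
print(result["choices"][0]["text"])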
Aside from this, deploying LLMs to inference endpoints in Wallaroo is the same as for any other model framework such as CV, Forecasting or custom Arbitrary Python models.
Let’s look at deploying an LLM to production using the Wallaroo SDK and the following process. This example leverages the llamacpp library.
For brevity we will skip the steps of importing and uploading the model. You can go through the process in this LLM Deploy Tutorial link.
LLM Deployment
LLMs are deployed via the Wallaroo SDK through the following process:
After the model is uploaded, get the LLM model reference from Wallaroo.
Create or use an existing Wallaroo pipeline and assign the LLM as a pipeline model step.
Set the deployment configuration to assign the resources, including the number of CPUs, amount of RAM, etc., for the LLM deployment.
Deploy the LLM with the deployment configuration.
LLMs previously uploaded to Wallaroo can be retrieved without re-uploading them via the Wallaroo SDK method wallaroo.client.Client.get_model(name: String, version: String), which takes the following parameters:
name: The name of the model.
version: (Optional) The model version to retrieve.
When called with only the model name, get_model retrieves the most recent model version in the current workspace that matches the provided name; pass a version to retrieve a specific model version (a sketch of version-specific retrieval follows the code below).
The following demonstrates retrieving an uploaded LLM and storing it in the variable llm_model.
Once the model is imported and uploaded, we create our pipeline and add the LLM as a pipeline step as seen in the code below.
import wallaroo
# connect with the Wallaroo client
wl = wallaroo.Client()
# retrieve the previously uploaded LLM by name
llm_model = wl.get_model(name=model_name)
# create the pipeline and add the LLM as a pipeline model step
llm_pipeline = wl.build_pipeline("llama-pipeline")
llm_pipeline.add_model_step(llm_model)
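If a specific model version is needed rather than the most recent one, it can be requested explicitly through the version parameter. A minimal sketch; the version string here is a placeholder:
# retrieve a specific version of a previously uploaded model
llm_model = wl.get_model(name=model_name, version="<model-version-id>")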
LLMs are deployed via Wallaroo pipelines. Wallaroo pipelines are created in the current user's workspace with the Wallaroo SDK wallaroo.client.Client.build_pipeline(pipeline_name: String) method. This creates a pipeline in the user's current workspace with the provided pipeline_name and returns a wallaroo.pipeline.Pipeline, which can be saved to a variable for other commands.
Pipeline names are unique within a workspace; using the build_pipeline method within a workspace where another pipeline with the same name exists will connect to the existing pipeline.
Once the pipeline reference is stored to a variable, LLMs are added to the pipeline as a pipeline step with the method wallaroo.pipeline.Pipeline.add_model_step(model_version: wallaroo.model_version.ModelVersion).
The code example below demonstrates creating a pipeline and adding a model version as a pipeline step.
# create the pipeline
llm_pipeline = wl.build_pipeline('sample-llm-pipeline')
# add the LLM as a pipeline model step
llm_pipeline.add_model_step(llm_model)
Next, before deploying the LLM, a deployment configuration is created. This sets how the cluster’s resources are allocated for the LLM’s exclusive use. Depending on the model needs you can allocate CPU or GPU and memory resources for optimized model performance while keeping cloud costs in check.
In the example in the code below, we will build the deployment configuration with 32 CPUs and 40 Gi RAM allocated to the LLM. Once the deployment configuration is set, the pipeline is deployed with that deployment configuration.
# allocate 0.5 CPU and 2Gi to the engine, and 32 CPUs and 40Gi to the LLM itself
deployment_config = DeploymentConfigBuilder() \
    .cpus(0.5).memory('2Gi') \
    .sidekick_cpus(llm_model, 32) \
    .sidekick_memory(llm_model, '40Gi') \
    .build()
# deploy the pipeline with the deployment configuration
llm_pipeline.deploy(deployment_config)
With the model deployed, we can check the LLM deployment status via the wallaroo.pipeline.Pipeline.status() method. The call and its output are shown below; the status shows as Running.
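llm_pipeline.status()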
{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.124.6.17',
   'name': 'engine-77b97b577d-hh8pn',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'llama-pipeline',
      'status': 'Running',
      'version': '57fce6fd-196c-4530-ae92-b95c923ee908'}]},
   'model_statuses': {'models': [{'name': 'llama3-instruct-8b',
      'sha': 'b92b26c9c53e32ef8d465922ff449288b8d305dd311d48f48aaef2ff3ebce2ec',
      'status': 'Running',
      'version': 'a3d8e89c-f662-49bf-bd3e-0b192f70c8b6'}]}}],
 'engine_lbs': [{'ip': '10.124.6.16',
   'name': 'engine-lb-767f54549f-gdqqd',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.124.6.19',
   'name': 'engine-sidekick-llama3-instruct-8b-234-788f9fd979-5zdxj',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}
Inference
With the LLM deployed, the model is ready to accept inference requests through wallaroo.pipeline.Pipeline.infer, which accepts either a pandas DataFrame or an Apache Arrow table. The example below submits a pandas DataFrame and returns the output in the same format; an Arrow example follows it.
import pandas as pd
# build the input as a pandas DataFrame and run the inference
data = pd.DataFrame({'text': ['Summarize what LinkedIn is']})
result = llm_pipeline.infer(data)
# extract the generated text from the returned DataFrame
result["out.generated_text"][0]
'LinkedIn is a social networking platform designed for professionals and businesses to
connect, share information, and network. It allows users to create a profile
showcasing their work experience, skills, education, and achievements. LinkedIn is
often used for:\n\n1. Job searching: Employers can post job openings, and job seekers
can search and apply for positions.\n2. Networking: Professionals can connect with
colleagues, clients, and industry peers to build relationships and stay informed about
industry news and trends.\n3. Personal branding: Users can showcase their skills,
expertise, and achievements to establish themselves as thought leaders in their
industry.\n4. Business development: Companies can use LinkedIn to promote their
products or services, engage with customers, and build brand awareness.\n5. Learning
and development: LinkedIn offers online courses, tutorials, and certifications to help
professionals upskill and reskill.\n\nOverall, LinkedIn is a powerful tool for
professionals to build their professional identity, expand their network, and advance
their careers.'
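The same endpoint also accepts Apache Arrow tables, which can be more efficient for larger batches. Below is a minimal sketch assuming the same 'text' input column; with an Arrow input, the result comes back in Arrow format as well.
import pyarrow as pa
# build the same request as an Apache Arrow table
table = pa.table({'text': ['Summarize what LinkedIn is']})
result = llm_pipeline.infer(table)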
That's it! This model will run continuously and produce relevant, accurate, unbiased generated text without ever needing any monitoring or updating, right? Not by a long shot.
RAG LLMs In Wallaroo
When LLMs are deployed to production, the output generated is based on the model's training data at that point in time. The model takes the user input and generates a text response based on the information it was trained on. As time goes by, the model gradually goes out of date, which can result in inaccurate generated text, hallucinations, bias, and so on. So how do you make your LLMs accurate, relevant, and free of bias and hallucinations without having to constantly retrain the model?
Enter RAG. Retrieval-Augmented Generation (RAG) is one method that helps LLMs to produce more accurate and relevant outputs, effectively overcoming some of the limitations inherent in their training data. This not only enhances the reliability of the generated content but also ensures that the information is up to date, which is critical for maintaining and enhancing user trust and delivering accurate responses while adapting to changing information.
RAG works by improving the accuracy and reliability of generative AI models by allowing the LLM to reference an authoritative knowledge base outside of its training data sources before generating a response.
RAG is also a good alternative to fine tuning the model. Fine tuning tends to be expensive because of its intensive resource consumption and also produces diminishing returns on accuracy when compared to RAG. There are use cases for when to go with fine tuning, but we’ll save that for another blog.
Let's take a simple example of RAG. I'm a soccer (football) fan, and I like to think I know which team won which championship, cup, etc.
Let's say that my soccer knowledge is an LLM, and I was asked which men's teams have won the most European Champions League (UCL) titles since the competition started in 1955. Now, if I'm relying on my memory (never a good thing in my case), the generated text for this query would be "Real Madrid with 11 titles."
That input query and generated text process would look like the diagram below (Fig 1).
Fig 1.
My answer of Real Madrid with 11 UCL trophies is incorrect. There are a couple of reasons for this:
I’m using my memory, but I cannot remember all the winners and might not have kept up with the game for a few years, so it’s a confident guess at best.
I didn’t take time to check authoritative sources to verify my answer.
The outcome is that I generate an answer that I think is correct but is not. This is where you begin to see situations such as hallucinations or bias.
To fix this without retraining our model (my memory), we can introduce an authoritative source or sources, which is exactly what RAG does. So, when I come up with the answer of Real Madrid and 11 titles, before responding with the generated text I stop to check an authoritative source. This data source tells me that the correct answer is Real Madrid with 15 titles.
When we use a RAG LLM, we create an authoritative source for our model that is up to date and can quickly incorporate the latest data, providing accurate, up-to-date responses.
This final section will go through the code examples to successfully deploy a RAG LLM to production with Wallaroo and help generate text outputs that are accurate and relevant to the user.
We will look at an example of using RAG with your LLM inference endpoints. The RAG LLM process takes the following steps:
Input text first passes through the feature extractor model, which outputs the embedding: a list of floats that the RAG LLM uses to query the database for its context.
Both the embedding and the original input text are passed to the RAG LLM.
The RAG LLM queries the vector-indexed database for the context from which to build its response. As discussed above, this context helps prevent hallucinations by providing guidelines that the RAG LLM uses to construct its response.
Once finished, the response is submitted as the generated text as seen in Fig 2 below.
Fig 2.
Feature Extractor Details
The first step in setting up RAG is the Feature Extractor, seen in Fig 2 above. The feature extractor performs two functions:
Passes the input text to the RAG LLM.
Converts the input text into the embedding that the RAG LLM uses to query the database for the proper context.
The code snippet below demonstrates the predict function that receives the input data, tokenizes it, and then extracts the embeddings from the model. The embeddings are then normalized and returned alongside the original input text.
In our two-step pipeline, this output is then passed to the RAG LLM.
(Note that the code example is Arbitrary Python code, which you can learn more about in this BYOP (Bring Your Own Predict) tutorial.)
def _predict(self, input_data: InferenceData):
    # numpy (np), torch, and InferenceData are imported at the module level of the BYOP model
    inputs = input_data["text"].tolist()
    texts = np.array([str(x) for x in input_data["text"]])
    # tokenize the input text
    encoded_inputs = self.model["tokenizer"](
        inputs, padding=True, truncation=True, return_tensors="pt"
    )
    # run the feature extractor and take the embedding of the first token
    with torch.no_grad():
        model_output = self.model["model"](**encoded_inputs)
    sentence_embeddings = model_output[0][:, 0]
    # normalize the embeddings
    sentence_embeddings = torch.nn.functional.normalize(
        sentence_embeddings, p=2, dim=1
    )
    embeddings = np.array(
        [sentence_embeddings[i].cpu().numpy() for i in range(len(inputs))]
    )
    # return the embeddings alongside the original input text
    return {"embedding": embeddings, "text": texts}
Next we will view the details of the RAG LLM itself.
The following sample RAG LLM, packaged as a BYOP framework model, performs the following:
Receives the input query text and the embedding generated by the Feature Extractor model.
Queries the MongoDB Atlas database vector index using the embedding. This example retrieves the 10 documents most similar to the input.
Uses the returned documents as context to generate the response to the input query.
The BYOP predict function shown below processes the request: it queries the vector index for context and then generates the response with the RAG LLM.
def _predict(self, input_data: InferenceData):
    # `client` is a MongoDB Atlas client configured elsewhere in the BYOP model (see the sketch below)
    db = client.sample_mflix
    collection = db.movies
    generated_texts = []
    prompts = input_data["text"].tolist()
    embeddings = input_data["embedding"].tolist()
    for prompt, embedding in zip(prompts, embeddings):
        # query the vector index for the documents most similar to the embedding
        query_results = collection.aggregate(
            [
                {
                    "$vectorSearch": {
                        "queryVector": embedding,
                        "path": "plot_embedding_hf",
                        "numCandidates": 50,
                        "limit": 10,
                        "index": "PlotSemanticSearch",
                    }
                }
            ]
        )
        # combine the returned documents into the context for this prompt
        context = " ".join([result["plot"] for result in query_results])
        # generate the response from the prompt and the retrieved context
        result = self.model(
            f"Q: {prompt} C: {context} A: ",
            max_tokens=512,
            stop=["Q:", "\n"],
            echo=False,
        )
        generated_texts.append(result["choices"][0]["text"])
    return {"generated_text": np.array(generated_texts)}
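The predict function above assumes that a MongoDB Atlas connection (the client object) has already been established elsewhere in the BYOP model. A minimal sketch of what that setup might look like with pymongo; the connection string is a placeholder:
from pymongo import MongoClient
# placeholder Atlas connection string; in practice this would come from configuration or a secret
atlas_uri = "mongodb+srv://<user>:<password>@<cluster>.mongodb.net/?retryWrites=true&w=majority"
client = MongoClient(atlas_uri)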
This example demonstrates a quantized version of Llama V2 Chat that leverages the llamacpp library.
We will skip over the model upload steps but if you would like to go through them, they are in the RAG LLM Tutorial.
Deploying the RAG LLM
As mentioned, the following example assumes that the two models are already uploaded and saved to the following variables (a short retrieval sketch follows the list):
bge: The Feature Extractor that generates the embedding for the RAG LLM.
rag_llm: The RAG LLM that uses the embedding to query the vector database index, and uses that result as the context to generate the text.
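For reference, a minimal sketch of retrieving those two models with the SDK; the model names here are placeholders for whatever names were used at upload:
bge = wl.get_model(name="bge-feature-extractor")  # feature extractor (name assumed)
rag_llm = wl.get_model(name="rag-llm")            # RAG LLM (name assumed)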
Now that the models are uploaded, they are deployed in a Wallaroo pipeline through the following process:
Define the deployment configuration: This sets what resources are applied to each model on deployment. For more details, see Deployment Configuration.
Add the feature extractor model and RAG LLM as model steps: This sets the structure where the feature extractor model converts the request to a vector, which is used as the input by the RAG LLM to generate the final response.
Deploy the models: This step allocates resources to the feature extractor and LLM. At this point, the models are ready for inference requests.
Next we will set the deployment configuration for both the Feature Extractor and the RAG LLM. We have flexibility here to deploy the models to the hardware configurations that optimize their performance, and these can be adjusted as required based on attributes including model size, throughput, latency, and performance requirements. Note that deployment configuration changes do not impact the Wallaroo inference endpoint (including its name and URL), so production deployments are not interrupted.
In this example we will deploy the following configuration.
# 1 CPU and 2Gi for the engine, 4 CPUs/3Gi for the feature extractor, and 4 CPUs/6Gi for the RAG LLM
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(bge, 4) \
    .sidekick_memory(bge, '3Gi') \
    .sidekick_cpus(rag_llm, 4) \
    .sidekick_memory(rag_llm, '6Gi') \
    .build()
Next we will add the feature extractor model and the RAG LLM as pipeline steps.
We create the pipeline with the wallaroo.client.Client.build_pipeline method, then add each model as a pipeline step with wallaroo.pipeline.Pipeline.add_model_step, with the feature extractor as the first step.
This sets the stage for the feature extractor model to provide its outputs as the inputs for the RAG LLM.
pipeline = wl.build_pipeline("byop-rag-llm-bge-v1")
# the feature extractor runs first, and its output feeds the RAG LLM
pipeline.add_model_step(bge)
pipeline.add_model_step(rag_llm)
Everything is now set and we deploy the models through the wallaroo.pipeline.Pipeline.deploy(deployment_config) method, providing the deployment configuration we set earlier. This assigns the resources from the cluster to the model’s exclusive use.
Once the deployment is complete, the RAG LLM is ready for inference requests.
pipeline.deploy(deployment_config=deployment_config)
Inference
Finally, we are ready to run a test inference. Inference requests are submitted either as pandas DataFrames or Apache Arrow tables.
The following example shows submitting a pandas DataFrame with the query to suggest an action movie. The response is returned as a pandas DataFrame, and we extract the generated text from there.
data = pd.DataFrame({"text": ["Suggest me an action movie, including its name"]})
result = pipeline.infer(data)
print(result['out.generated_text'].values[0])
Conclusion
In this blog we have seen how to easily deploy LLMs to production inference endpoints, and how to implement a RAG LLM that gives our model an authoritative source. RAG enhances the reliability of the generated text, helps ensure it stays up to date, and guards against issues such as hallucinations and toxicity, helping to avoid potential risks and safeguard accurate and relevant outputs.
If you would like to try these examples yourself, you can access the LLM tutorials and request a demo at the links below.
Wallaroo LLM Operations Docs: https://docs.wallaroo.ai/wallaroo-llm/
Request a Demo: https://wallaroo.ai/request-a-demo/