Overview
NVIDIA NIM inference microservices integrate closely with AWS managed services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker to enable the deployment of generative AI models at scale. Part of NVIDIA AI Enterprise, available in the AWS Marketplace, NVIDIA NIM is a set of easy-to-use microservices designed to accelerate the deployment of generative AI. These prebuilt containers support a broad spectrum of generative AI models, from open-source community models to NVIDIA AI Foundation models and custom models. NIM microservices deploy with a single command and integrate into generative AI applications through industry-standard APIs and just a few lines of code. Engineered for seamless generative AI inference at scale, NIM lets generative AI applications be deployed anywhere.
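As a minimal sketch of the "few lines of code" integration, the snippet below sends a chat completion request to a NIM microservice through its OpenAI-compatible HTTP API. The endpoint URL and model ID are assumptions for a locally running Llama 3.1 8B Instruct NIM; adjust both to match your deployment.

```python
import requests

# Assumed endpoint for a locally running NIM microservice; NIM containers
# expose an OpenAI-compatible API, conventionally on port 8000.
NIM_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "meta/llama-3.1-8b-instruct",  # example model ID; match your NIM
    "messages": [
        {"role": "user", "content": "Summarize what NVIDIA NIM does in one sentence."}
    ],
    "max_tokens": 128,
}

response = requests.post(NIM_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])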

Benefits
Performance
As part of the NVIDIA AI Enterprise suite of software, NIM undergoes exhaustive tuning to ensure a high-performance configuration for each model. With NIM, throughput and latency improve significantly. For example, the NVIDIA Llama 3.1 8B Instruct NIM has achieved 2.5x higher throughput, 4x faster time to first token (TTFT), and 2.2x faster inter-token latency (ITL) compared to the best open-source alternatives.
Stats
[Charts: faster TTFT and faster ITL on Llama 3.1 8B Instruct with NIM on versus NIM off]
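TTFT is the time from sending a request to receiving the first generated token; ITL is the average gap between subsequent tokens. The sketch below shows one way to measure both against a streaming OpenAI-compatible endpoint; the base URL and model ID are assumptions for a local NIM deployment, and each streamed chunk is treated as an approximation of one token.

```python
import time
from openai import OpenAI

# Assumed local NIM endpoint; NIM serves an OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

start = time.perf_counter()
token_times = []

stream = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # example model ID
    messages=[{"role": "user", "content": "Explain TTFT and ITL briefly."}],
    stream=True,
)
for chunk in stream:
    # Record the arrival time of each non-empty streamed chunk (~one token).
    if chunk.choices and chunk.choices[0].delta.content:
        token_times.append(time.perf_counter())

if not token_times:
    raise RuntimeError("no tokens received")

ttft = token_times[0] - start
# Inter-token latency: mean gap between consecutive streamed chunks.
itl = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
print(f"TTFT: {ttft:.3f}s  ITL: {itl:.4f}s")
```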
Features
Prebuilt containers
NIM offers a variety of prebuilt containers and Helm charts that include optimized generative AI models, and it integrates seamlessly with Amazon EKS to deliver high-performance, cost-optimized model-serving infrastructure, as in the sketch below.
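When serving NIM behind Amazon EKS, a readiness check against the container is a common building block for probes and smoke tests. Below is a minimal sketch; it assumes the NIM service is reachable at the given URL (for example, via a port-forward or ClusterIP service) and that the container exposes its documented readiness endpoint at /v1/health/ready.

```python
import sys
import requests

# Assumed service URL; NIM containers expose a readiness endpoint
# at /v1/health/ready that returns 200 once the model is loaded.
HEALTH_URL = "http://localhost:8000/v1/health/ready"

try:
    resp = requests.get(HEALTH_URL, timeout=5)
    ready = resp.status_code == 200
except requests.RequestException:
    ready = False

print("NIM ready" if ready else "NIM not ready")
sys.exit(0 if ready else 1)
```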
Standardized APIs
Simplify the development, deployment, and scaling of generative AI applications with industry-standard APIs for building powerful copilots, chatbots, and generative AI assistants on AWS. Because these APIs are compatible with standard deployment processes, teams can update applications quickly and easily.
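Because the APIs follow the de facto OpenAI standard, an existing application can often be repointed at a NIM endpoint by changing only its base URL. A short sketch, assuming a local NIM endpoint and an example model ID:

```python
from openai import OpenAI

# The only NIM-specific pieces are the base URL and model ID (both assumed
# values here); the rest is standard OpenAI-client application code.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# List the models this NIM endpoint serves, then chat with one of them.
for model in client.models.list():
    print(model.id)

reply = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Draft a one-line greeting for a support copilot."}],
)
print(reply.choices[0].message.content)
```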
Model support
Deploy custom generative AI models that are fine-tuned to specific industries or use cases. NIM supports generative AI use cases across multiple domains, including large language models (LLMs), vision language models (VLMs), and models for speech, images, video, 3D, drug discovery, medical imaging, and more.
Domain-specific
NIM includes domain-specific NVIDIA CUDA libraries and specialized code, covering areas such as speech, language, and video processing.
Inference engines
Optimized with NVIDIA Triton Inference Server, TensorRT, TensorRT-LLM, and PyTorch, NIM maximizes throughput and decreases latency, reducing the cost of running inference workloads as they scale.