Opinion
How to manage AI operations efficiently when you have 150 models in production
"Organizations can build on their successes by prioritizing responsible AI and leveraging purpose-built platform engineering to enable a centralized operating model," writes Vultr CMO Kevin Cochrane
Enterprises are increasingly moving from the experimental phase of AI to deploying multiple models in production, a pivotal shift in how AI technologies are integrated into business operations.
The number of models actively used within an organization says much about a company’s AI maturity level. According to a new study of 1,000 IT leaders and practitioners conducted by S&P Global Market Intelligence and commissioned by Vultr, enterprises – from those at the highest levels of AI maturity to those at more aspirational stages – have deployed, on average, 158 models in production. That number is expected to grow by more than 10% within the next year.
According to the same research, 80% of survey respondents anticipate adopting AI across most business functions within two years. At the same time, 85% of survey respondents say they will move more models to edge environments to conduct more training and AI inference for low latency and high performance.
AI: The Bolder the Aspiration, the Bigger the Hurdles
Enterprises deploying AI across functions, departments, and models face numerous challenges. Those most frequently cited by survey respondents include the following:
- A lack of robust infrastructure, standardized processes, and automated workflows needed to manage the complexity of model lifecycles
- Data integration and quality issues
- Lack of processes for maintaining model accuracy and handling model drift
- Difficulty achieving compliance with internal security and privacy policies and with regulatory standards
Additionally, organizations must address cost management, bridge internal talent and skills gaps, and effectively allocate resources to sustain their AI strategies.
The Centralized AI Operating Model
Enterprises can scale AI operations more efficiently using a tried-and-true operating model: develop and train models centrally, fine-tune them regionally, and deploy and monitor them locally. It works as follows:
- Model development starts in an AI Center of Excellence (centralized hub) housing the organization’s top data science team.
- Open-source models from public registries form the foundation of the enterprise’s AI model inventory. These models are trained on proprietary company data, thereby creating proprietary models.
- Proprietary models are containerized and stored in a private registry housing the entire inventory of the enterprise’s models.
- Model development continues with fine-tuning on localized data to account for regional characteristics and data governance requirements.
- Data science teams set up Kubernetes clusters in edge locations to deploy the containerized AI models.
- AI engineers store additional relevant data they wish to exclude from the core training data as embeddings in vector databases (see the sketch after this list).
- AI operations culminate in model deployment and inference in edge environments.
- Data science teams leverage observability tools to continuously monitor model performance and correct any instances of drift or bias.
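To make the vector-database step concrete, here is a minimal sketch, assuming a small FAISS index and an off-the-shelf sentence-transformers embedding model: regional documents kept out of the core training data are encoded, stored, and retrieved at inference time. The model name, sample documents, and index type are illustrative placeholders, not a prescription for any particular stack.

```python
# Minimal sketch: store region-specific context as embeddings and retrieve it at
# inference time. Model name, documents, and index type are illustrative only.
import numpy as np
import faiss                                         # vector index library
from sentence_transformers import SentenceTransformer

# Hypothetical regional documents kept out of the core training set
documents = [
    "Refunds in the EU must be processed within 14 days.",
    "Regional support hours are 08:00-18:00 CET.",
    "Data for EU customers must remain in EU data centers.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")    # placeholder embedding model
vectors = encoder.encode(documents).astype("float32")

index = faiss.IndexFlatL2(vectors.shape[1])          # exact L2 search, fine for small sets
index.add(vectors)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents closest to the query embedding."""
    q = encoder.encode([query]).astype("float32")
    _, ids = index.search(q, k)
    return [documents[i] for i in ids[0]]

print(retrieve("How quickly do we have to issue a refund?"))
```

In production, a managed vector database and the organization’s own embedding model would replace the toy index and sample documents, but the flow – encode, store, retrieve at inference time – stays the same.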
Proactive, Responsible AI vs. Toeing the Regulatory Line
Proactively embedding responsible AI practices – including end-to-end model observability and robust data governance – requires clear assignment of roles and responsibilities across stakeholders, privacy-compliant data management, and governance councils to align practices organization-wide. This includes:
- High-quality data rules with traceability of data lineage and automated data quality checks.
- Model governance that includes bias testing, ongoing monitoring, enforcement of ethical AI principles (fairness, transparency, privacy, etc.), automated model validation, drift detection, and compliance checks (a drift-check sketch follows this list).
- Proper security and privacy, including data access controls, encryption, and privacy-enhancing techniques such as differential privacy and federated learning.
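As a hedged illustration of what an automated drift check might look like in such a governance pipeline, the snippet below compares a feature’s recent production distribution against its training-time reference with a two-sample Kolmogorov-Smirnov test and flags a significant shift. The data, significance threshold, and follow-up action are placeholders.

```python
# Minimal sketch of a scheduled drift check: compare the live distribution of a
# feature against its training-time reference and flag significant shifts.
# Threshold and feature data are illustrative placeholders.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # training-time snapshot
current = rng.normal(loc=0.4, scale=1.0, size=5_000)      # recent production data

def check_drift(reference: np.ndarray, current: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True if the two samples differ significantly (possible drift)."""
    statistic, p_value = ks_2samp(reference, current)
    drifted = p_value < alpha
    print(f"KS statistic={statistic:.3f}, p={p_value:.3g}, drift={drifted}")
    return drifted

if check_drift(reference, current):
    # In a real pipeline this would raise an alert or trigger retraining.
    print("Feature distribution shifted; schedule review or retraining.")
```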
Responsible AI via Platform Engineering
Purpose-built platform engineering can automate the provisioning of the necessary tools that enable the observability and governance that underpin responsible AI. This approach includes:
- Self-service access with integrated observability: Platform engineering’s fundamental value is empowering each machine learning engineer and data scientist to configure their ideal development environment, including self-service access to AI/ML infrastructure that includes GPUs, CPUs, and vector databases.
- Curated templates with built-in governance and observability: Organizations can ensure compliance with data privacy, ethical standards, and regulatory requirements – all while streamlining development and deployment processes – through vetted templates for common AI/ML workflows that include observability and governance features.
- Automated workflows with observability checks: Intelligent automation streamlines the AI development lifecycle from testing to deployment while checking for model drift, bias, and ethical AI usage, all with reduced manual oversight (a bias-check sketch follows this list).
- Internal red team to probe for vulnerabilities: Dedicated teams test and tune models before moving them to production to eliminate errors and biases.
- Centralized management and continuous monitoring: A centralized observability framework provides a unified view of all AI models across the organization, while constant monitoring maintains model accuracy and effectiveness over time.
- Collaboration and feedback loops: End-to-end observability facilitates structured feedback loops among data scientists, engineers, and stakeholders, aligning models with evolving business objectives, regulatory requirements, and ethical considerations.
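As one possible shape for the automated bias check mentioned above, this sketch computes per-group selection rates for a binary classifier and fails the pipeline step when the demographic parity gap exceeds a chosen tolerance. The groups, predictions, and tolerance are hypothetical.

```python
# Minimal sketch of a pipeline gate that checks demographic parity on a binary
# classifier's predictions. Groups, predictions, and tolerance are placeholders.
import numpy as np

predictions = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])          # model outputs (1 = approve)
groups      = np.array(["a", "a", "a", "b", "b", "b", "b", "a", "b", "a"])

def demographic_parity_gap(preds: np.ndarray, groups: np.ndarray) -> float:
    """Difference between the highest and lowest per-group selection rates."""
    rates = [preds[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

TOLERANCE = 0.2   # illustrative threshold; set per policy and use case
gap = demographic_parity_gap(predictions, groups)
print(f"Selection-rate gap across groups: {gap:.2f}")
if gap > TOLERANCE:
    raise SystemExit("Bias check failed: selection-rate gap exceeds tolerance.")
```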
Organizations can build on their successes by prioritizing responsible AI and leveraging purpose-built platform engineering to enable a centralized operating model, further feed their AI ambitions, and navigate the complexities of advanced AI deployment.
The author is CMO of Vultr, a full-stack cloud compute and cloud GPU provider