Yann Bilissor
20 October 2022

Smart Business / MLSecOps: how should security drive a Machine Learning project?


Working with data in an AI or Machine Learning project raises several security questions that every company should be able to answer. In this article, we will approach security from different angles. We won't go into much detail about MLOps itself, focusing instead on the place of security when designing a Machine Learning solution. Here are the main questions we will try to answer:

  • What are the key pillars of security in a modern Cloud AI architecture?
  • What are the main stages of a Machine Learning industrialization project, or MLOps?
  • What security features should always be implemented in a Machine Learning solution?
  • How can business value be secured over time?




MLSecOps is a new discipline that brings together MLOps and security in an agile way. While MLOps aims to bring DevOps practices to the AI world, MLSecOps reminds us why we should always think smart, adapt quickly and act fast when interacting with company data, internally or externally.


Cloud AI Architecture: Let’s talk PPT


PPT stands for "People, Process and Technology". It should always be addressed when building any (Cloud) architecture or designing the target operating model of a platform. The question is: why should PPT matter when talking about security? Mainly because most company activities are performed, or at least triggered, by people, following processes and relying on technology for efficiency. Each of these three components can therefore be a source of security breaches, and each should be addressed at its own level.




You should clearly identify the roles and responsibilities of any person or object that will interact with your platform, so that risks can be measured and judged acceptable or not. For instance, allowing a data scientist to perform more actions over a given scope than they actually require can be critical. In Azure, for example, the Contributor role is too broad for critical environments such as production. The most common profiles working on an AI project are Data Engineer, AI Engineer, MLOps Engineer and Data Scientist. You should express their responsibilities as the actions they are, or are not, allowed to perform over a specific scope over time. Technologies and processes will help you enforce this, but first you must design it.




Choose your technologies carefully, because they will help you deliver the business value you are looking for while respecting the security constraints you or your company have designed. Modern technologies, in a Cloud context or not, should provide at least two security components:

  • Authentication: telling the platform who you are
  • Authorization: defining what you can or cannot do on the platform


With these, you can control virtually any interaction that people or objects have with the technologies you have put in place. In Microsoft Azure, for instance, you can rely on Azure Active Directory to provide that security layer across your platform. The AI solutions available on this cloud provider, such as Azure Machine Learning or Azure Databricks, all support it, at different maturity levels admittedly, but they do. So use it!
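To make the authorization side concrete, here is a minimal sketch of an explicit permission design, where each role gets an allow-list of (action, scope) pairs instead of a broad Contributor-style role. The role names, actions and scopes are purely illustrative assumptions, not an Azure AD or RBAC API:

```python
# Illustrative authorization sketch: grants are explicit (action, scope) pairs.
# Anything not explicitly granted is denied by default.
ROLE_PERMISSIONS = {
    "data_scientist": {("read", "dev/datasets"), ("train", "dev/compute")},
    "mlops_engineer": {("deploy", "prod/endpoints"), ("read", "prod/logs")},
}

def is_allowed(role: str, action: str, scope: str) -> bool:
    """Return True only if the (action, scope) pair is explicitly granted."""
    return (action, scope) in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("data_scientist", "train", "dev/compute"))     # True
print(is_allowed("data_scientist", "deploy", "prod/endpoints")) # False
```

The deny-by-default behavior is the point: a data scientist can train in development but cannot deploy to production, which is exactly the kind of scoping the Contributor role fails to express.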

Logging is also a key security component that should be present at every layer of your architecture. It is very useful when the "Men in Black" from the audit department come in and ask for it. At a minimum, you should be able to know:

  • Who logged in, what they did and how the system reacted.
  • Whether people or objects are respecting the security design you have put in place.
  • Whether the expected value is still being delivered over time.
  • Whether your business value can be reverse engineered, and how damaging that would be for the company. For instance, an external tool can approximate your Machine Learning model's predictions by forcing it to reveal its underlying function. You can counter this either at the model building stage (by introducing noise, for example) or by adding rate limiting when exposing the model through an API.


Here are the three main layers you should consider carefully when approaching security topics:


Smart Business - 3 main layers


Since you are exposing your data through applications hosted on infrastructure, each layer is a key component of the core architecture you are designing.




People or objects interact with technologies through processes (or the lack of them). Controlling how those interactions happen already removes some risk of security failures. For example, many questions arise when talking about a Data Science team:

  • How can Data Scientists discover company data?
  • How can they access the company database or data lake?
  • What data can they see, or not see, in that database?
  • How can they consume data and share information with the company?
  • How can they bring business value to external customers?


Writing down all these questions will force you to define and design the processes of interaction between people and technologies.

Smart Business - interactions between People and Technologies


MLSecOps: Top 5 technical challenges you have to deal with


MLOps is a recent discipline that relies on DevOps practices to build and run Machine Learning solutions. Unlike traditional applications, Machine Learning applications are tricky to deliver because they mainly rely on data, and as you might know, data changes constantly. You therefore have to check frequently that the business value your ML model brings is still present over time. Here is a typical Machine Learning journey:

Smart Business - MLOps main steps



Now let's see why security should always drive the way you design a Machine Learning solution.


Gather Data


Providing the data that will be used to train a Machine Learning model is critical in terms of security. As a security architect, always bring answers to the following questions:

  • Who is asking for data? This will help you identify roles and responsibilities.
  • What kind of data are they asking for? This will help you identify the scope, and therefore the potential technical and business impacts.
  • What can they do with that data? This will help you estimate the level of compliance the project team or Data Scientists must meet before releasing any solution. For instance, are they manipulating personal data to provide business value?
  • How often should data be provided? If it is not a one-shot export, a technical process will be needed to deliver the data; you should identify that process and know how to secure it.
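These four questions can be captured as a structured record before any data is handed over, so every grant leaves a reviewable trace. This is only a sketch; the field names are assumptions, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class DataAccessRequest:
    """Illustrative record of a data request, mirroring the questions above."""
    requester: str       # who is asking (an authenticated identity)
    dataset_scope: str   # what data, e.g. "sales/2022/orders"
    purpose: str         # what they will do with it (drives the compliance level)
    contains_pii: bool   # personal data triggers a stricter review
    refresh: str         # "one-shot" or a recurring schedule to secure

req = DataAccessRequest(
    requester="alice@contoso.example",
    dataset_scope="sales/2022/orders",
    purpose="churn model training",
    contains_pii=True,
    refresh="weekly",
)
print(req.contains_pii)  # True: route this request through compliance review
```

A recurring `refresh` value is the signal that a one-shot export won't do and that a repeatable, secured delivery process must be designed.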




Train the Model


This is where the AI magic happens, so be careful here as well: training is a resource-heavy step that transforms data into information using complex algorithms running on (big) computers. You must control which computers will process company data, and how. In a Cloud context, for instance, make sure the virtual machines or clusters running this workload are secure enough. Here are items to check:

  1. Are the VMs running the ML workload network protected? This prevents external access and potential data leakage.
  2. Which identities (users, groups of users or applications) can access those VMs or clusters? This prevents undesired actions or access to the data these VMs manipulate.
  3. What other system dependencies (storage, log files, APIs…) are involved at this stage? This helps you identify and manage inbound and outbound traffic.




Package the Model


Training Machine Learning algorithms is important, but packaging the resulting model can be quite complicated, because you must know the target packaging format and how it works. There are tons of formats Data Scientists can choose from, but your responsibility is to assess the risks the chosen one presents. The most popular are pickle, ONNX, zip… and each has different specifications. The most used is the pickle format (.pkl), because it is just a piece of executable code serialized to disk or other storage, which can easily be run again.

But this means that if that piece of code contains a security breach, you will bring it to production unless you look at it. SAST and DAST solutions can start to help, but remember: everything here is "lines of code" mixed with "data" that brings value, so you must control it!
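To see why pickle deserves this scrutiny, here is a short demonstration: loading a pickle executes code, and a restricted unpickler with an allow-list is one way to control what gets loaded. The allow-list below is illustrative; a real one would cover exactly the classes your model needs:

```python
import io
import pickle

class Malicious:
    # __reduce__ tells pickle how to rebuild an object; an attacker can make
    # it call any callable. Here it is a harmless print instead of os.system.
    def __reduce__(self):
        return (print, ("arbitrary code ran during unpickling!",))

payload = pickle.dumps(Malicious())
pickle.loads(payload)  # prints the message: loading a pickle executes code

class SafeUnpickler(pickle.Unpickler):
    # Refuse everything except an explicit allow-list of harmless globals.
    ALLOWED = {("builtins", "list"), ("builtins", "dict")}

    def find_class(self, module, name):
        if (module, name) not in self.ALLOWED:
            raise pickle.UnpicklingError(f"blocked: {module}.{name}")
        return super().find_class(module, name)

try:
    SafeUnpickler(io.BytesIO(payload)).load()
    blocked = False
except pickle.UnpicklingError as exc:
    blocked = True
    print(exc)  # blocked: builtins.print
```

The same payload that silently ran code through `pickle.loads` is rejected by the restricted unpickler, which is why loading pickles from untrusted sources without such controls is dangerous.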




Deploy the Model


You have trained and packaged your Machine Learning model; you are all set to deploy. What does that mean? At this stage you must choose the compute type your model will run on. It can be a VM, a compute cluster, a PaaS component… Each deployment target has its own constraints, but here are the questions I always ask:

  • Where is the packaged model stored? This raises security questions about the model storage location (for instance, the Model Registry in Azure ML or Databricks).
  • How often will it change (approximately, of course)? This will help you define the level of automation, and therefore the technologies and processes to secure, for frequently loading and unloading models.
  • How will the running environment load the model? This helps you identify inbound and outbound traffic. By traffic, I mean source and destination locations, the network protocol used, and the status (allowed or denied).
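One simple control over model storage and loading is integrity verification: record a digest of the packaged model at registration time and check it again before loading. A minimal sketch, assuming the expected digest is published out of band by the registry:

```python
import hashlib
import os
import tempfile

def sha256_of(path: str) -> str:
    """Hash the packaged model file so the serving environment can verify it
    was not tampered with between the registry and the deployment target."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Illustrative flow with a stand-in model file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pkl") as f:
    f.write(b"fake model bytes")
    model_path = f.name

expected = sha256_of(model_path)          # recorded at registration time
assert sha256_of(model_path) == expected  # verified again at load time
os.unlink(model_path)
print("model integrity verified")
```

Combined with the pickle precautions above, this ensures the runtime only ever executes the exact artifact that was reviewed and registered.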


One of the most popular deployment formats out there is the Docker image. It provides flexibility, consistency, resiliency, and many other adjectives describing how good it is to use. But a Docker image relies on an operating system image, so if you just use a default one, you might face security issues at different levels.

For instance, the Python packages used to build the model might not be authorized by company standards, or the company certificate might not be used when exposing the API through basic Flask or FastAPI Docker images, and so on. You must control this as well.




Expose the Model


Last but not least, at this step you want to share your model, internally or externally. There are different ways of exposing a Machine Learning model to users or other applications, but in all cases, you want your model to run on data. Here are examples of questions you must ask:

  • Who can call your model, and how? This will help you define the concept of "identity" in your context. Remember: an identity is anything that can be authenticated, so it can be a user, a group of users or an application. The way the model is called also matters. For instance, passing a JSON payload or a binary file to a model through an API does not raise the same security concerns. So control it!
  • How often is a given identity requesting something from the model, and how impactful is it? This will help you identify potential reverse engineering attacks and keep your business value safe.
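The second question can be backed by a simple monitor that flags identities whose request volume over a time window looks like systematic probing of the model. The threshold and names below are illustrative assumptions, not a product feature:

```python
import time
from collections import defaultdict, deque

class QueryRateMonitor:
    """Flags identities whose prediction-request volume within a sliding
    time window exceeds a threshold, a cheap first signal of extraction."""

    def __init__(self, max_calls: int, window_sec: float):
        self.max_calls = max_calls
        self.window = window_sec
        self.calls = defaultdict(deque)  # identity -> recent call timestamps

    def record(self, identity: str, now=None) -> bool:
        """Record one call; return True if the identity exceeds the threshold."""
        now = time.monotonic() if now is None else now
        q = self.calls[identity]
        q.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_calls

monitor = QueryRateMonitor(max_calls=100, window_sec=60)
# Simulate one identity calling every 0.1s: flagged once past 100 calls.
flags = [monitor.record("app-123", now=t * 0.1) for t in range(150)]
print(flags[99], flags[120])  # False True
```

Flagging is only the detection half; the response (throttling the identity, alerting, or revoking access) belongs to the processes you designed earlier.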


One last thing


Remember, Machine Learning is a discipline that builds information from company data using pieces of code mostly written in Python or R. It is precious, so you must control many aspects, such as:

  • The way internal or external people will interact with the platform, by exploring and designing the notion of identity at company level. Once again, we call identity any object that can be authenticated, such as a user, a group of users or an application.
  • The tools and technologies they will use to build that business value. Tools such as Azure Machine Learning or Azure Databricks are a good fit for Machine Learning workloads, so use them if you are in an Azure context, for instance.
  • A Docker image is a good deployment format but can bring many security breaches. Don't blindly use the default image with default settings: control the packages installed on it and the users' access levels as well. A good practice is to rebuild a company image with all requirements already set up (certificates, repository connections…).
  • The search for more business value is what drives companies to build complex algorithms to better understand data, so you must make sure the targeted value is still present, and that no one can easily reverse engineer the model you are serving. Control inputs and behaviors to detect and qualify threats as early as possible. Also, use drift detection techniques to catch the model drifting whenever the underlying data or concepts change.
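As a sketch of the drift idea in the last point, here is a crude mean-shift check between training data and live data. Real systems use richer statistical tests (PSI, Kolmogorov-Smirnov); this only shows the principle, and the threshold is an assumption:

```python
import statistics

def mean_shift_drift(train, live, z_threshold=3.0) -> bool:
    """Crude drift check: flag when the live mean sits more than
    z_threshold training standard deviations from the training mean."""
    mu = statistics.mean(train)
    sigma = statistics.pstdev(train) or 1e-9  # avoid division by zero
    z = abs(statistics.mean(live) - mu) / sigma
    return z > z_threshold

train = [10.0, 11.0, 9.0, 10.5, 9.5]
print(mean_shift_drift(train, [10.2, 9.8, 10.1]))   # False: close to training
print(mean_shift_drift(train, [25.0, 26.0, 24.0]))  # True: clear shift
```

A check like this, run on each batch of live inputs, turns "is the value still present?" from a vague worry into a monitored signal that can trigger retraining.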


Cellenza supports businesses with Cloud security issues. Do you want to learn more about our “Cloud Security” offer? Contact us!



