Content Assessment: Cybersecurity Challenges for Artificial Intelligence: Considering the AI Lifecycle
Information - 95%
Insight - 95%
Relevance - 95%
Objectivity - 95%
Authority - 100%
A short percentage-based assessment of the qualitative benefit of the recently published European Union Agency for Cybersecurity (ENISA) report on cybersecurity challenges for artificial intelligence.
Editor’s Note: The European Union Agency for Cybersecurity, ENISA, is the Union’s agency dedicated to achieving a high common level of cybersecurity across Europe. In December of 2020, ENISA published the report AI Cybersecurity Challenges – Threat Landscape for Artificial Intelligence. The report presents the Agency’s active mapping of the AI cybersecurity ecosystem and its Threat Landscape. As part of the report, a generic lifecycle reference model for AI is provided to allow for a structured and methodical approach to understanding the different facets of AI. This generic AI lifecycle may be beneficial for legal, business, and information security professionals in the eDiscovery ecosystem beginning to consider cybersecurity and its relationship with AI.
AI Cybersecurity Challenges – European Union Agency for Cybersecurity
Report Extract on AI Lifecycle Shared with Permission*
AI Lifecycle Phases
Figure – AI Lifecycle Generic Reference Model
In this section, we provide a short definition for each stage of the AI Lifecycle and recap the individual steps it involves (“Phase in a Nutshell”).
Business Goal Definition
Prior to carrying out any AI application/system development, it is important that the user organization fully understand the business context of the AI application/system and the data required to achieve the AI application’s business goals, as well as the business metrics to be used to assess the degree to which these goals have been achieved.
Business Goal Definition Phase in a Nutshell: Identify the business purpose of the AI application/system. Link the purpose with the question to be answered by the AI model to be used in the application/system. Identify the model type based on the question.
Data Collection/Ingestion
Data Ingestion is the AI lifecycle stage where data is obtained from multiple sources (raw data may be of any form, structured or unstructured) to compose multi-dimensional data points, called vectors, for immediate use or for storage in order to be accessed and used later. Data Ingestion lies at the basis of any AI application. Data can be ingested directly from its sources in a real-time, continuous fashion, also known as streaming, or by importing data batches, where data is imported periodically in large macro-batches or in small micro-batches.
Different ingestion mechanisms can be active simultaneously in the same application, synchronizing or decoupling batch and stream ingestion of the same data flows. Ingestion components can also specify data annotation, i.e., whether ingestion is performed with or without metadata (data dictionary, or ontology/taxonomy of the data types). Often, access control operates during data ingestion, modeling the privacy status of the data (personal/non-personal data), choosing suitable privacy-preserving techniques, and taking into account the achievable trade-off between privacy impact and analytic accuracy. Compliance with the applicable EU privacy and data protection legal framework needs to be ensured in all cases.
The privacy status assigned to data is used to define the AI application Service Level Agreement (SLA) in accordance with the applicable EU privacy and data protection legal framework, including, among other things, the possibility of inspection/auditing by competent regulatory authorities (such as Data Protection Authorities). It is important to remark that, in ingesting data, an IT governance conflict may arise. On the one hand, data is compartmentalized by its owners in order to ensure access control and privacy protection; on the other hand, it must be integrated in order to enable analytics. Often, different policies and policy rules apply to items of the same category. For multimedia data sources, access protocols may even follow a Digital Rights Management (DRM) approach where proof-of-hold must first be negotiated with license servers. It is the responsibility of the AI application designer to make sure that ingestion is done respecting the data providers’ policies on data usage and the applicable EU privacy and data protection legal framework.
Data Collection/Ingestion Phase in a Nutshell: Identify the input (dynamic) data to be collected and the corresponding context metadata. Organize ingestion according to the AI application requirements, importing data in a stream, batch or multi-modal fashion.
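As an illustration of the batch and micro-batch ingestion described above, the following Python sketch groups a record stream into micro-batches; the records, field names, and batch size are hypothetical, not taken from the report.

```python
from itertools import islice

def ingest_micro_batches(stream, batch_size):
    """Group a (possibly unbounded) record stream into micro-batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        yield batch

# Each record pairs a data value with context metadata, as the phase suggests.
records = [{"value": v, "source": "sensor-1"} for v in range(7)]
batches = list(ingest_micro_batches(records, batch_size=3))
print([len(b) for b in batches])  # → [3, 3, 1]
```

The same generator works unchanged whether the stream is a finite batch import or a long-running feed, which is why streaming and batch ingestion can share plumbing in one application.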
Data Validation/Exploration
Data Exploration is the stage where insights start to be drawn from ingested data. While it may be skipped in some AI applications where the data is well understood, it is usually a very time-consuming phase of the AI lifecycle. At this stage, it is important to understand the type of data that was collected. A key distinction must be drawn between the different possible types of data, with numerical and categorical being the most prominent categories, alongside multimedia data (e.g., image, audio, video, etc.). Numerical data lends itself to plotting and allows for computing descriptive statistics and verifying whether the data fits simple parametric distributions like the Gaussian one. Missing data values can also be detected and handled at the exploration stage. Categorical variables are those that have two or more categories but no intrinsic order. If a variable has a clear ordering, it is considered an ordinal variable.
Data Validation/Exploration in a Nutshell: Verify whether the data fit a known statistical distribution, either by component (mono-variate distributions) or as vectors (multi-variate distribution). Estimate the corresponding statistical parameters.
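A minimal Python illustration of this exploration step, using purely illustrative values, computes descriptive statistics on a numerical column and counts its missing values:

```python
import statistics

# Hypothetical ingested numerical column with one missing value (None).
column = [4.1, 3.9, None, 4.4, 4.0, 3.8]

observed = [x for x in column if x is not None]   # drop missing values
missing = len(column) - len(observed)

mean = statistics.mean(observed)     # descriptive statistics on the
stdev = statistics.stdev(observed)   # observed values only

print(f"missing={missing}, mean={mean:.2f}, stdev={stdev:.2f}")
# → missing=1, mean=4.04, stdev=0.23
```

Plotting a histogram of `observed` against a Gaussian with these estimated parameters would be the natural next step when checking the fit to a parametric distribution.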
Data Pre-processing
The data pre-processing stage employs techniques to cleanse, integrate and transform the data. This process aims at improving data quality, which will improve the performance and efficiency of the overall AI system by saving time during the analytic models’ training phase and by promoting better quality of results. Specifically, the term data cleaning designates techniques to correct inconsistencies, remove noise and anonymize/pseudonymize data.
Data integration puts together data coming from multiple sources, while data transformation prepares the data for feeding an analytic model, typically by encoding it in a numerical format. A typical encoding is one-hot encoding used to represent categorical variables as binary vectors. This encoding first requires that the categorical values be mapped to integer values. Then, each integer value is represented as a binary vector that is all zero values except the position of the integer, which is marked with a 1.
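The two-step one-hot encoding described above can be sketched in Python (the category values are illustrative):

```python
def one_hot(values):
    # Step 1: map each categorical value to an integer index.
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    # Step 2: represent each integer as a binary vector that is all
    # zeros except for a 1 at the integer's position.
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

# Categories sort to ['blue', 'green', 'red'], so 'red' maps to position 2.
print(one_hot(["red", "green", "red", "blue"]))
# → [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```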
Once converted to numbers, data can be subject to further types of transformation: re-scaling, standardization, normalization, and labeling. At the end of this process, a numerical data set is obtained, which will be the basis for training, testing and evaluating the AI model.
Since having a large enough dataset is one of the key success factors when properly training a model, it is common to apply data augmentation techniques to training datasets that are too small, increasing both the quantity of data and the diversity of scenarios covered. Data augmentation usually consists in applying transformations known to be label-preserving, i.e., the model should not change its output (namely, its prediction) when presented with the transformed data items. For instance, a training dataset can be extended with scaled or rotated versions of images already in that dataset; when processing text, a word can be replaced by a synonym. Even when the training dataset is large enough, data augmentation can improve the final trained model, and in particular its robustness to benign perturbations. One task where data augmentation is used by default is image classification, where data can be augmented by applying, for instance, translations, rotations and blurring filters.
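The synonym-replacement augmentation mentioned above can be sketched in Python; the synonym table here is hand-written and purely illustrative (a real pipeline would draw on a lexical resource):

```python
# Hypothetical synonym table; illustrative only.
SYNONYMS = {"quick": "fast", "unhappy": "sad"}

def augment_by_synonym(sentence):
    """Label-preserving augmentation: produce one variant per
    replaceable word, swapping it for its synonym."""
    words = sentence.split()
    variants = []
    for i, w in enumerate(words):
        if w in SYNONYMS:
            variants.append(" ".join(words[:i] + [SYNONYMS[w]] + words[i + 1:]))
    return variants

print(augment_by_synonym("the quick fox was unhappy"))
# → ['the fast fox was unhappy', 'the quick fox was sad']
```

Each variant keeps the original label, so the augmented items can simply be appended to the training set.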
Data pre-processing in a Nutshell: Convert ingested data to a metric (numerical) format, integrate data from different sources, handle missing/null values by interpolation, densify to reduce data sparsity, de-noise, filter outliers, change representation interval, anonymize/pseudonymize data, augment data.
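As a small Python illustration of two of the transformations listed above, the following sketch re-scales a column to the [0, 1] interval and standardizes it to zero mean and unit standard deviation (the values are illustrative):

```python
import statistics

data = [2.0, 4.0, 6.0, 8.0]

# Min-max re-scaling to the [0, 1] interval.
lo, hi = min(data), max(data)
rescaled = [(x - lo) / (hi - lo) for x in data]

# Standardization: zero mean, unit (sample) standard deviation.
mu, sigma = statistics.mean(data), statistics.stdev(data)
standardized = [(x - mu) / sigma for x in data]

print([round(x, 2) for x in rescaled])  # → [0.0, 0.33, 0.67, 1.0]
```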
Feature Selection
Feature Selection (in general, feature engineering) is the stage where the number of components or features (also called dimensions) composing each data vector is reduced, by identifying the components believed to be the most meaningful for the AI model. The result is a reduced dataset, as each data vector has fewer components than before. Besides the reduction in computational cost, feature selection can yield more accurate models.
Additionally, models built on top of lower-dimensional data are more understandable and explainable. This stage can also be embedded in the model building phase (for instance when processing image or speech data), to be discussed in the next section.
Feature selection in a Nutshell: Identify the dimensions of the data set that account for a global parameter, e.g., the overall variance of the labels. Project the data set along these dimensions, discarding the others.
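A minimal sketch of variance-based feature selection in Python, assuming a small illustrative data set: keep the k highest-variance dimensions and discard the rest.

```python
import statistics

# Each row is a data vector; columns are features (dimensions).
rows = [[1.0, 5.0, 0.1],
        [2.0, 5.0, 0.1],
        [3.0, 5.0, 0.2],
        [4.0, 5.0, 0.1]]

def select_features(rows, k):
    """Keep the k columns with the highest variance; drop the others."""
    n_cols = len(rows[0])
    variances = [statistics.pvariance([r[c] for r in rows])
                 for c in range(n_cols)]
    keep = sorted(range(n_cols), key=lambda c: variances[c], reverse=True)[:k]
    keep.sort()  # preserve the original column order
    return [[r[c] for c in keep] for r in rows]

print(select_features(rows, k=1))  # → [[1.0], [2.0], [3.0], [4.0]]
```

Column 1 is constant and column 2 barely varies, so only column 0 survives; real feature engineering would of course weigh variance against relevance to the labels.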
AI Model Selection
This stage performs the selection/building of the best AI model or algorithm for analyzing the data. It is a difficult task, often subject to trial and error. Based on the business goal and the type of available data, different types of AI techniques can be used. The three commonly identified major categories are supervised learning, unsupervised learning, and reinforcement learning models. Supervised techniques deal with labeled data: the AI model is used to learn the mapping between input examples and the target outputs.
Supervised models can be designed as Classifiers, whose aim is to predict a class label, and Regressors, whose aim is to predict a numerical value function of the inputs. Here some common algorithms are Support Vector Machines, Naïve Bayes, Hidden Markov Model, Bayesian networks, and Neural Networks.
Unsupervised techniques use unlabelled training data to describe and extract relations from it, with the aim of organizing it into clusters, highlighting associations within the input data space, summarizing the distribution of the data, or reducing data dimensionality (this topic was already addressed as a preliminary step for data preparation in the section on feature selection). Reinforcement learning maps situations to actions, by learning behaviors that will maximize a desired reward function.
While the type of training data, labeled or not, is key for the type of technique to be used and selected, models may also be built from scratch (although this is rather unlikely), with the data scientist designing and coding the model using the inherent software engineering techniques, or built by composing existing methods. It is important to remark that model selection (namely, choosing the model adapted to the data) may trigger further transformation of the input data, as different AI models require different numerical encodings of the input data vectors.
Generally speaking, selecting a model also includes choosing its training strategy. In the context of supervised learning for example, training involves computing (a learning function of) the difference between the model’s output when it receives each training set data item D as input, and D’s label. This result is used to modify the model in order to decrease the difference.
Many training algorithms for error minimization are available, most of them based on gradient descent. Training algorithms have their own hyperparameters, including the function to be used to compute the model error (e.g., mean squared error), and the batch size, i.e., the number of labeled samples to be fed to the model to accumulate a value of the error to be used for adapting the model itself.
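As an illustrative sketch of gradient descent with a mean-squared-error loss and a batch-size hyper-parameter, the following Python snippet fits a one-parameter linear model; the data and hyper-parameter values are hypothetical:

```python
# Training data generated from y = 2x; training should recover w ≈ 2.
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]

w = 0.0                # model parameter (weight), learned from data
learning_rate = 0.01   # training hyper-parameter
batch_size = 2         # hyper-parameter: samples accumulated per update

for epoch in range(200):
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradient of the mean squared error w.r.t. w over this batch.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad  # step against the gradient

print(round(w, 3))  # → 2.0
```

Note how the error function (mean squared error here), the learning rate, and the batch size are all fixed before training begins; they shape the descent but are not themselves learned.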
AI Model Selection in a Nutshell: Choose the type of AI model suitable for the application. Encode the data input vectors to match the model’s preferred input format.
AI Model Training
Having selected an AI model, which in the context of this reference model mostly refers to a Machine Learning (ML) model, the training phase of the AI system commences. In the context of supervised learning, the selected ML model must go through a training phase, where internal model parameters such as weights and biases are learned from the data. This allows the model to gain understanding of the data being used and thus become more capable of analyzing it. Again, training involves computing (a function of) the difference between the model’s output when it receives each training set data item D as input, and D’s label. This result is used to modify the model in order to decrease the difference between the inferred result and the desired result, and thus progressively leads to more accurate, expected results.
The training phase will feed the ML model with batches of input vectors and will use the selected learning function to adapt the model’s internal parameters (weights and biases) based on a measure (e.g., linear, quadratic, log loss) of the difference between the model’s output and the labels. Often, the available data set is partitioned at this stage into a training set, used for setting the model’s parameters, and a test set, where evaluation criteria (e.g., error rate) are only recorded in order to assess the model’s performance outside the training set. Cross-validation schemes randomly partition a data set multiple times into a training and a test portion of fixed sizes (e.g., 80% and 20% of the available data) and then repeat the training and validation phases on each partition.
AI Model Training in a Nutshell: Apply the selected training algorithm with the appropriate parameters to modify the chosen model according to the training data. Validate the model training on the test set according to a cross-validation strategy.
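The train/test split and cross-validation partitioning described above can be sketched in Python; the data set here is just a list of indices, and the 80/20 split mirrors the example in the text:

```python
import random

random.seed(0)
points = list(range(100))   # stand-in for a labeled data set
random.shuffle(points)

# 80% / 20% train-test split, as in the text's example.
split = int(0.8 * len(points))
train, test = points[:split], points[split:]

def k_fold(items, k):
    """Yield (training, validation) pairs; each item appears in
    exactly one validation fold across the k rounds."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        yield training, validation

sizes = [(len(tr), len(va)) for tr, va in k_fold(train, k=5)]
print(sizes)  # each round: 64 training items, 16 validation items
```

The held-out `test` portion stays untouched throughout; only the folds of `train` rotate between training and validation roles.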
AI Model Tuning
Model tuning usually overlaps with model training, since tuning is usually considered part of the training process. We opted to separate the two stages in the AI lifecycle to highlight the differences in terms of functional operation, although it is most likely that in the majority of AI systems they will both be part of the training process.
Certain parameters define high-level concepts about the model, such as its learning function or modality, and cannot be learned from the input data. These special parameters, often called hyper-parameters, need to be set up manually, although they can under certain circumstances be tuned automatically by searching the model parameters’ space. This search, called hyper-parameter optimization, is often performed using classic optimization techniques like Grid Search, but Random Search and Bayesian optimization can also be used. It is important to remark that the Model Tuning stage uses a special data set (often called the validation set), distinct from the training and test sets used in the previous stages. An evaluation phase can also be considered to estimate the output’s limits and to assess how the model would behave in extreme conditions, for example, by using wrong/unsafe data sets. It should also be noted that, depending on the number of hyper-parameters to be adjusted, trying all possible combinations may simply not be feasible.
AI Model Tuning in a Nutshell: Apply model adaptation to the hyper-parameters of the trained AI model using a validation data set, according to deployment conditions.
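A minimal Grid Search sketch in Python, assuming a hypothetical validation-error function; a real search would train a model per combination and score it on the validation set:

```python
from itertools import product

def validation_error(learning_rate, batch_size):
    """Toy, hand-made error surface with a minimum at (0.1, 32);
    a stand-in for training + validation of a real model."""
    return (learning_rate - 0.1) ** 2 + ((batch_size - 32) / 32) ** 2

# The grid: every combination of these hyper-parameter values is tried.
grid = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [16, 32, 64]}

best = min(product(grid["learning_rate"], grid["batch_size"]),
           key=lambda combo: validation_error(*combo))
print(best)  # → (0.1, 32)
```

With 3 values per hyper-parameter the grid has 9 combinations; since the grid size grows multiplicatively with each added hyper-parameter, exhaustive search quickly becomes infeasible, which is exactly the limitation noted above.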
Transfer Learning
In this phase, the user organization sources a pre-trained and pre-tuned AI model and uses it as a starting point for further training to achieve faster and better convergence. This is commonly the case when little data is available for training. It should be noted that all steps described above (tuning, testing, etc.) also apply to transfer learning. Moreover, since transfer learning usually serves as the starting point of the training algorithm, it can be considered part of the model training phase. To ensure wider scope, we treat transfer learning as a distinct phase in the AI lifecycle presented here.
Transfer Learning in a Nutshell: Source a pre-trained AI model in the same application domain, and apply additional training to it, as needed to improve its in-production accuracy.
Model Deployment
A Machine Learning model will bring knowledge to an organization only when its predictions become available to final users. Deployment is the process of taking a trained model and making it available to the users.
Model Deployment in a Nutshell: Generate an in-production incarnation of the model as software, firmware or hardware. Deploy the model incarnation to edge or cloud, connecting in-production data flows.
Model Maintenance
After deployment, AI models need to be continuously monitored and maintained to handle concept changes and potential concept drifts that may arise during their operation. A concept change happens when the meaning of an input to the model (or of an output label) changes, e.g., due to modified regulations. A concept drift occurs when the change is not drastic but emerges slowly. Drift is often due to sensor encrustment, i.e., slow evolution over time in sensor resolution (the smallest detectable difference between two values) or in the overall representation interval.

A popular strategy to handle model maintenance is window-based relearning, which relies on recent data points to build an ML model. Another useful technique for AI model maintenance is back testing. In most cases, the user organization knows what happened in the aftermath of the AI model adoption and can compare model predictions to reality. This highlights concept changes: if an underlying concept switches, organizations see a decrease in performance.

Another way of detecting concept drifts may involve statistically characterizing the input dataset used for training the AI model, so that this training dataset can be compared to the current input data in terms of statistical properties. Significant differences between the datasets may indicate potential concept drifts requiring a relearning process, even before the output of the system is significantly affected. In this way, retraining/relearning processes, which may be potentially time- and resource-consuming, can be carried out only when required instead of periodically, as in the above-mentioned window-based relearning strategies. Model maintenance also reflects the need to monitor the business goals and assets that might evolve over time and accordingly influence the model itself.
Model Maintenance in a Nutshell: Monitor the ML inference results of the deployed AI model, as well as the input data received by the model, in order to detect possible concept changes or drifts. Retrain the model when needed.
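One way to realize the statistical drift check described above is sketched below in Python; the data, threshold, and scoring rule are illustrative assumptions, not prescriptions from the report:

```python
import statistics

def drift_score(train_sample, live_sample):
    """Absolute difference between the two means, measured in units of
    the training sample's standard deviation (a crude drift signal)."""
    mu = statistics.mean(train_sample)
    sigma = statistics.stdev(train_sample)
    return abs(statistics.mean(live_sample) - mu) / sigma

train = [10.0, 10.5, 9.5, 10.2, 9.8]     # inputs seen at training time
stable = [10.1, 9.9, 10.3]               # in-production inputs, no drift
drifted = [12.9, 13.2, 13.1]             # in-production inputs, drifted

THRESHOLD = 3.0  # hypothetical trigger for retraining
print(drift_score(train, stable) > THRESHOLD)   # → False
print(drift_score(train, drifted) > THRESHOLD)  # → True
```

Only when the score crosses the threshold would a (potentially costly) relearning process be launched, rather than retraining on a fixed schedule.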
Business Understanding
Building an AI model is often expensive and always time-consuming. It poses several business risks, including failing to have a meaningful impact on the user organization as well as missing in-production deadlines after completion. Business understanding is the stage at which companies that deploy AI models gain insight into the impact of AI on their business and try to maximize the probability of success.
Business Understanding in a Nutshell: Assess the value proposition of the deployed AI model. Estimate (before deployment) and verify (after deployment) its business impact.
ENISA Report – AI Cybersecurity Challenges*
*Shared with permission under Creative Commons – Attribution 4.0 International (CC BY 4.0) – license.