Specializing network analysis to detect anomalous insider actions

Collaborative information systems (CIS) enable users to coordinate efficiently over shared tasks in complex distributed environments. For flexibility, they provide users with broad access privileges, which, as a side-effect, leave such systems vulnerable to various attacks. Some of the more damaging malicious activities stem from internal misuse, where users are authorized to access system resources. A promising class of insider threat detection models for CIS focuses on mining access patterns from audit logs. However, current models are limited in that they assume organizations have significant resources to generate labeled cases for training classifiers, or assume the user has committed a large number of actions that deviate from "normal" behavior. In lieu of the previous assumptions, we introduce an approach that detects when specific actions of an insider deviate from expectation in the context of collaborative behavior. Specifically, in this paper, we introduce a specialized network anomaly detection model, or SNAD, to detect such events. This approach assesses the extent to which a user influences the similarity of the group of users that access a particular record in the CIS. From a theoretical perspective, we show that the proposed model is appropriate for detecting insider actions in dynamic collaborative systems. From an empirical perspective, we perform an extensive evaluation of SNAD with the access logs of two distinct environments: the patient record access logs of a large electronic health record system (6,015 users, 130,457 patients and 1,327,500 accesses) and the editing logs of Wikipedia (2,394,385 revisors, 55,200 articles and 6,482,780 revisions). We compare our model with several competing methods and demonstrate that SNAD is significantly more effective: on average it achieves 20-30% greater area under an ROC curve.


Introduction
The popularity of collaborative information systems (CIS) has exploded over the past decade. CIS are now considered to be critical infrastructure in a wide range of application domains. They are utilized in popular Web 2.0 applications, such as wikis, dynamic bookmarking, social networking, and groupware [1]. CIS have also transitioned beyond public systems and have become central to environments that handle personal or strategic knowledge [2], such as healthcare operations [3,4] and intelligence-related activities [5,6].
The escalated use of CIS is due, in part, to their ability to offer several major benefits in comparison to their predecessors. First, there is evidence that they can increase the

A local network approach to anomalous access detection
We recognize that detailed and well-defined semantics about the users and data in a CIS are not always available. As such, our primary goal is to develop an analytic framework to detect anomalous accesses by leveraging the inherent collaborative nature of the users in a CIS. We hypothesize that if a user is a threat, the similarity of the network of users will be higher when the user is suppressed from the network. Based on this hypothesis, we build a model for each subject (e.g., patient's medical record) in the form of a subnetwork of users (e.g., the set of authenticated healthcare workers); i.e., the specialized network. Our model then assesses whether the similarities of the network with and without the user are significantly different. In contrast to [28], we provide justification for certain modeling decisions and the similarity measures employed.

An empirical evaluation of the capabilities of our method
Beyond developing the model, we perform an empirical investigation with the access logs of two distinct CIS. The first consists of the access logs from a large restricted-access EHR system. The second consists of the editing logs from a publicly-accessible online wiki. In the context of these real domains, we simulate intruding behavior in several manners that are indicative of various known illicit actions. Our results illustrate that when a specialized network is intruded upon, its similarity often decreases sufficiently to detect the intrusion. Additionally, and perhaps more importantly, we demonstrate that relatively simple data mining techniques are more effective than complex network decomposition methods for this specific detection problem. Beyond the material in [28], this paper provides a more comprehensive analysis of the competing anomaly detection techniques and variants of the specialized network analysis approach.
The remainder of the paper is organized as follows. In Section 2, we review related research in methodologies for preventing and detecting insider threats. Next, in Section 3, we introduce our model, which is called specialized network anomaly detection, or SNAD. Then, in Section 4, we report on our empirical study and highlight specific findings. Finally, in Section 5 we discuss the merits, as well as limitations, of our approach and in Section 6 we conclude the paper with a summary of the work.

Related Work
Prior research into insider threats in collaborative environments can be roughly partitioned into two types: prevention and detection.

Prevention of Insider Threats
The prevention of insider threats requires a combination of policy and technical approaches. From a policy perspective, it is recommended that organizations perform various duties, such as assessing the trustworthiness of potential employees (or group members), clearly documenting procedures, and providing periodic security awareness training. The focus of our paper is on technical approaches, but we refer the reader to [29] for an excellent summary of policies and insider threat management.
From a technical perspective, organizations are encouraged to enforce separation of duties and limit privileges in a need-to-know manner. In this regard, formal access control frameworks (e.g., [19,20]), which are designed to prevent illicit accesses from authenticated users by appropriately defining (and restricting) permissions, have been proposed. To enable efficient and effective management of users and their privileges, users are often assigned to roles; i.e., role-based access control (RBAC). Beyond the basic RBAC model, certain access control frameworks address team [23] and context-enhanced scenarios [22,24,25,30]. For instance, [25,31] demonstrated that RBAC models can include logical contextual rules for expressing the relationship between a user and a subject. This situation-based access control model, or SitBAC, enables formal representation of access scenarios to subjects as an ontology of entities involved in data access, in the form of i) the subject, ii) the user, iii) the task, iv) the legal authorization and v) their relationships. Though more specific than RBAC, contextual access control models do not readily capture the dynamic relationships among users, the hallmark of CIS.
Moreover, if an organization is to apply RBAC-like models, it must define roles. These may be derived through either a top-down or bottom-up approach. In the top-down sense, roles can be engineered through scenario-driven interviews with members of an organization to address the expected needs of an organization [32,33]. Given the scale and complexity of modern organizations, it has been suggested that roles may be derived through an alternative bottom-up manner, via data mining techniques collectively referred to as role mining [34,35]. The mined roles are subsequently adjudicated by expert security engineers and administrators. For instance, [34] proposed a method to discover RBAC roles by extracting patterns from an organization's database of users' permission lists. These types of role discovery techniques are scalable, but, to the best of our knowledge, they have yet to be applied to dynamic systems. It is not clear that they are sufficiently flexible to model stable role relationships.

Detection of Insider Threats
In a CIS, users tend to function as dynamic teams [36], which makes it difficult to differentiate between normal and abnormal accesses based solely on roles or permissions. As a result, technologies offered by industry tend to focus less on insider threats than on external vulnerabilities [17,37]. To overcome this deficiency, insider threat detection methods have been proposed to supplement access control frameworks. Insider threat detection can be categorized into two types: supervised and unsupervised learning strategies.

Supervised Methods
Supervised methods assume that there are examples of insider threats available for building a classification tool. These examples are often provided through documented evidence or derived from expert knowledge. Of particular note to our investigations, there have been several recent approaches proposed for insider threat detection in the context of EHRs. [27], for instance, described an approach that assumes most accesses to EHRs occur for a valid clinical or operational reason. Based on this belief, the method filters out insider actions when any of three general types of "explanations" are observed: i) direct, ii) group, and iii) consultation. The remaining actions are subsequently considered interesting for investigation by a human.
As an alternative, [26] recently proposed a machine learning approach to suspicious access detection based on interviews with the privacy officials of several healthcare organizations. In this approach, access events are defined over a collection of twenty-six features about the subjects (i.e., patients) and users (i.e., care providers). Privacy officials then label a set of suspicious and non-suspicious access events using an iterative refinement process. Based on these events, a classification model is trained (logistic regression and support vector machines were specifically applied in the study) to predict which events in a test set of EHR accesses were suspicious. [26] demonstrated that machine learning approaches were more effective detectors than the strict rule-based approaches the privacy officials tend to use in their current investigations. However, this model assumes that experts are aware of the features associated with insider threats and that they are readily available.

Unsupervised Methods
Unsupervised learning methods model inherent patterns in a system, so that, in the context of insider threat detection, anomalies can be pinpointed. In collaborative systems, certain data structures and theories have shown promise. These include techniques rooted in behavioral modeling and community detection [38][39][40][41]. SNAD falls into this category and here we review several related techniques.
In the set of methods most related to SNAD [39,40,42], the system is initially represented as the adjacency matrix of a bipartite graph with subjects as rows and users as columns. The cells of the matrix are filled with values representative of the affinity of a user to the subject in question. The premise of all of these methods is that typical users are likely to form and function as neighborhoods, such that the likelihood a strangely behaving user will be characterized by the neighborhoods is low.
The main difference between these methods is how such neighborhoods are discovered and applied for anomaly detection. In [40], for instance, a method to detect anomalies in networked systems is leveraged. Specifically, for each vertex of interest (e.g., user), the method performs a random walk over the graph to derive proximity to other vertices (e.g., other users), which are integrated into a neighborhood. The method identifies anomalies as vertices that are sufficiently distant from the neighborhood.
Instead of a random walk, [42] proposes a spectral method to detect anomalous instances. In this method, a spectral decomposition is applied to the adjacency matrix to derive communities in the form of eigenvectors, which are weighted by the strength of their corresponding eigenvalues. Each column in the original adjacency matrix is projected onto the new communities of users. Columns that are distant from the communities (weighted by the eigenvalues) are considered to be anomalies.
A social networking perspective is offered in [39] to explicitly focus on collaborative systems. In this method, the adjacency matrix is subsequently transformed into a social network composed of vertices from only one of the classes; i.e., the users. The edges of the network are weighted based on the degree to which users access common vertices of the other class; i.e., the patients. Then, like [42], the method applies a decomposition of the graph to derive communities. Anomalies are detected by comparing each user to its nearest neighbors in the network with respect to the communities.
SNAD differs from the current set of methods in several significant ways. First, none of these methods distinguish between the actions of a user. Instead, they assume all actions committed by an anomalous user are suspicious. Such techniques are relevant when a user's account has been compromised or the user is performing a significant number of actions outside of their normal routine, but their application in our scenario would lead to a gross inflation in the number of false alarms. Second, all of the methods use a global perspective of the system, which assumes that distant relations influence local behavior. While this may be true when considering aggregate behavior, we hypothesize that the global view is more error-prone than a more localized view when attempting to detect subtle illicit actions, such as the access (or amendment) of a single subject in the CIS.
Though all of these methods were designed for anomalous user detection, the methods in [40] and [39] cannot be directly applied to a local network. Consider that the random walk model in [40] was designed to walk over the entire network. As a result, a user would receive the same anomaly score regardless of the local network in which they are being investigated. The method in [39] suffers from a similar problem since it does not distinguish between users and their actions. In contrast, the method in [42] can be adapted for anomalous access detection because it can infer communities from a local network based on a space which is different from that of the global network. As a result, we compare our method to a variation of the spectral method of [42] as described in the following section.

Intruding Access Detection Model
This section introduces our model, which we call specialized network anomaly detection (SNAD). The approach is dubbed "specialized" because it focuses on a local view of the information system (as opposed to a global view) that is conditioned on specific subjects. We begin with a high-level overview of SNAD and then delve into the details of the particular methods it incorporates.
SNAD functions under the premise that normal and abnormal accesses will have sufficiently different influence on the similarity of the users in an access network. As depicted in Figure 1, SNAD can be represented as two general components: 1) Similarity Measurement (SNAD-SM), which feeds into 2) Anomaly Evaluation (SNAD-AE).
The SNAD-SM component extracts networks of users from the access logs of a CIS. More specifically, this component constructs a local access network for each subject. It then calculates the similarity of the users' access patterns in the network. Rather than focus on the individual features of the users or the subjects, SNAD aims for a more general representation to model the social behavior in the system by constructing and measuring the similarity of users' access networks.
The SNAD-AE component evaluates each access by comparing the similarity of an access network to that of its subnetwork. More specifically, SNAD-AE measures the similarity of the users that access a particular subject. This value is then compared to the similarity of a subnetwork that suppresses one of the network's users. If the similarities of the network and subnetwork are significantly different, then SNAD claims the suppressed user's access was an anomaly.
For reference, Table 1 summarizes the variables and notation used throughout the paper.

Access Network Construction
The SNAD-SM component transforms the CIS access logs into networks. The transformation begins by constructing a bipartite graph of the users and subjects that interact during a particular time period. Figure 2(a) depicts an example with six users and seven subjects modeled as vertices. Note that an edge indicates that a user accessed the subject's record.
Based on the graph, we define a local access network as follows. Let $S = \{s_1, \ldots, s_m\}$ and $U = \{u_1, \ldots, u_n\}$ be the sets of subjects and users, respectively. We define $U_{s_i}$ as the set of users that accessed $s_i$ in a certain time period (e.g., one day) and define $Net_{s_i}$ as a complete graph over $U_{s_i}$, where the weight between a user pair is their similarity (defined below). For simplicity, we use cardinality $|\cdot|$ to represent the number of elements in a set. Figure 2(a) depicts an example where $U_{s_3} = \{u_1, u_2, u_4, u_5, u_6\}$ and the corresponding complete graph is $Net_{s_3}$, which is depicted in Figure 2(e).
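The construction of the per-subject user sets and the edges of the corresponding complete graphs can be sketched as follows. This is a minimal illustration, not the authors' implementation: the log format (a list of user and subject pairs for one time window), the toy data, and the function names `users_per_subject` and `local_network_edges` are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical access log for one time window: (user, subject) pairs.
access_log = [
    ("u1", "s1"), ("u1", "s3"),
    ("u2", "s2"), ("u2", "s3"),
    ("u4", "s2"), ("u4", "s3"),
    ("u1", "s3"),  # a repeated access; the set below collapses it
]

def users_per_subject(log):
    """Map each subject s_i to U_{s_i}, the set of users that accessed it."""
    accessed_by = defaultdict(set)
    for user, subject in log:
        accessed_by[subject].add(user)
    return dict(accessed_by)

def local_network_edges(users):
    """Return the unordered user pairs (edges) of the complete graph over `users`."""
    return list(combinations(sorted(users), 2))

U = users_per_subject(access_log)
edges_s3 = local_network_edges(U["s3"])
print(U["s3"], edges_s3)
```

Edge weights are left out here because they depend on the user similarity measure introduced later in this section.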

User Modeling
Initially, we represent the subject-user bipartite graph as a binary matrix $SU$, as depicted in Figure 2(b). $SU(i, j) = 1$ if user $u_j$ accesses subject $s_i$, and 0 otherwise. For reference, we represent $u_i$ as the column vector of subject accesses, denoted $U_i$. We use a binary matrix because in certain CIS, multiple components of the system record an access simultaneously. As a result, the number of user accesses to a particular subject may be artificially inflated. Moreover, the degree to which inflation exists is nonuniform across the components of the system. For instance, in an EHR, a user may access different components of a patient's medical record, such as laboratory reports, progress notes, or consult requests. These components may be accessed at different rates by different users and not all patients may have each type of note in their record.
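Table 1 Variables and notation used throughout the paper.
$U$: The set of users in the CIS.
$u_j s_i$: An access of user $u_j$ to subject $s_i$.
$U_{s_i}$: The set of users that access subject $s_i$.
$SU$: A binary matrix of subjects and users, the size of which is $m \times n$. If $u_i$ accesses $s_j$, $SU(j, i) = 1$, else $SU(j, i) = 0$.
$PC'$: A matrix created from $SU$ or $SU\_IDF$, the size of which is $l \times n$, where $l$ is the number of selected principal components.
$\lambda_{total}$: The sum of the $l$ eigenvalues.
$\lambda PC'$: The matrix obtained by weighting each row $k$ of $PC'$ by $\lambda_k / \lambda_{total}$.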
Chen et al. Security Informatics 2012, 1:5 http://www.security-informatics.com/content/1/1/5

As a consequence, models based on raw values could be biased toward users that work with the system more heavily or toward subjects that have more information recorded in their records. Prior research in social network analysis (e.g., [43]) suggests it is important to represent the affinity that a user has toward a particular subject when assessing the similarity of a group. There are various aspects of a user's relationship to subjects that could be leveraged for measuring similarity. To mitigate bias and develop a generic approach, we focus our attention on the number of subjects a user accessed. Using this feature, we employ the inverse document frequency (IDF) model, popularized by information retrieval systems and shown to be effective for weighting the affinity of individuals to subjects in friendship networks [43]. IDF captures the affinity of a user to a subject relative to all subjects in the system. As such, the IDF transformation is defined as:

$$SU\_IDF(i, j) = SU(i, j) \times \log \frac{m}{B \times U_j} \qquad (1)$$

where $B = [1, 1, \ldots, 1]$ is a vector of length $m$. Figure 2(c) provides an example of this transformation. After the IDF transformation, the binary matrix $SU$ is converted into the IDF matrix $SU\_IDF$. We use the column vector $IDF\_U_i = SU\_IDF[:, i]$ to represent $u_i$.

Figure 2 An illustrative example for the SNAD model. This figure illustrates the workflow of the SNAD model. First, a bipartite graph of user-subject accesses is constructed (part a), which is subsequently represented as a binary matrix (part b). This matrix is then transformed using "inverse document frequency" (part c), from which the similarities between pairs of users are computed (part d) and from which a local access network associated with a subject $s_3$ is built (part e). The similarities of the local access network and its five subnetworks are calculated (part f) and finally, the scores for each access associated with subject $s_3$ are assessed (part g).
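The IDF weighting can be illustrated with a short sketch. It assumes that each access of user $u_j$ is weighted by the logarithm of $m$ divided by the number of subjects that $u_j$ accessed, which is one reading consistent with the vector $B$ of ones described above; the natural logarithm, the zero weight for users with no accesses, the toy matrix, and the name `idf_transform` are assumptions of this example rather than details confirmed by the text.

```python
import math

def idf_transform(SU):
    """Weight each binary access by log(m / number of subjects the user accessed).

    SU is a list of m rows (subjects), each holding n binary entries (users).
    """
    m = len(SU)
    n = len(SU[0])
    # counts[j] plays the role of B x U_j: the number of subjects user j accessed.
    counts = [sum(SU[i][j] for i in range(m)) for j in range(n)]
    return [
        [SU[i][j] * math.log(m / counts[j]) if counts[j] else 0.0 for j in range(n)]
        for i in range(m)
    ]

# Hypothetical toy matrix: user 0 accesses all four subjects, user 1 only the first.
SU_example = [
    [1, 1],
    [1, 0],
    [1, 0],
    [1, 0],
]
SU_IDF_example = idf_transform(SU_example)
print(SU_IDF_example)
```

Under this assumed weighting, a user who accesses every subject receives a weight of zero everywhere, while a user who accesses few subjects receives a large weight on each of them.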
Relationships, or similarity, between pairs of users can be mined from their access vectors. The cosine similarity [44] is a measure that has been applied successfully in various domains to measure the similarity of objects in vector form. Following this reasoning, we compute the similarity of users $u_i$ and $u_j$ via the cosine of their IDF-transformed vectors:

$$Sim(u_i, u_j) = \frac{IDF\_U_i \cdot IDF\_U_j}{\|IDF\_U_i\| \, \|IDF\_U_j\|} \qquad (2)$$

Figure 2(d) provides an example of user pair similarities.
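A minimal sketch of the cosine computation on two access vectors is given below. The vectors are hypothetical, and returning 0 when either vector is all zeros is a convention assumed here, since the text does not say how that case is handled.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumed convention for an empty access vector
    return dot / (norm_x * norm_y)

# Hypothetical weighted access vectors for three users over three subjects.
same = cosine_similarity([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])
disjoint = cosine_similarity([0.5, 0.5, 0.0], [0.0, 0.0, 1.2])
partial = cosine_similarity([1.0, 0.0, 0.0], [1.0, 1.0, 0.0])
print(same, disjoint, partial)
```

Users with identical access patterns score close to 1, and users with no subjects in common score 0.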

Access Network Measurement
We hypothesize that if an insider wanders into a local network, then the network's similarity will decrease. To investigate this hypothesis we need to develop an appropriate similarity measure for an access network. Different subjects have distinct local access networks. In order to compare similarities across these local networks, we define the similarity of an access network as the average similarity of all user pairs:

$$Sim(Net_{s_k}) = \frac{\sum_{u_i, u_j \in U_{s_k},\, i < j} Sim(u_i, u_j)}{|U_{s_k}| \, (|U_{s_k}| - 1) / 2} \qquad (3)$$

where $Sim(u_i, u_j)$ is the similarity of users $u_i$ and $u_j$ and $|U_{s_k}|$ is the number of users in $Net_{s_k}$. When this value is high, the users are close to each other, such that they have a strong collaborative relationship with respect to subject $s_k$.
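The average pairwise similarity of a local access network can be sketched as follows. The user vectors are hypothetical stand-ins for IDF-transformed columns, the function names are illustrative, and returning 0 for a network with fewer than two users is an assumption of this example.

```python
import math
from itertools import combinations

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumed convention for an empty access vector
    return dot / (norm_x * norm_y)

def network_similarity(vectors, users):
    """Average cosine similarity over all unordered pairs of `users`."""
    pairs = list(combinations(sorted(users), 2))
    if not pairs:
        return 0.0  # assumed convention: no pairs, no similarity
    total = sum(cosine_similarity(vectors[a], vectors[b]) for a, b in pairs)
    return total / len(pairs)

# Hypothetical access vectors: u1 and u2 overlap fully, u3 shares nothing.
vectors = {
    "u1": [1.0, 1.0, 0.0],
    "u2": [1.0, 1.0, 0.0],
    "u3": [0.0, 0.0, 1.0],
}
sim_close = network_similarity(vectors, {"u1", "u2"})
sim_mixed = network_similarity(vectors, {"u1", "u2", "u3"})
print(sim_close, sim_mixed)
```

In this toy case, adding the dissimilar user `u3` drops the network similarity from about 1 to about one third, which is the effect the hypothesis relies on.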

Access Measurement and Anomaly Detection
SNAD-SM provides a measure of similarity for an access network; however, to leverage such measures for anomaly detection, we need a formal approach to determine when a particular access is anomalous in the access network. In this regard, SNAD-AE evaluates each user's access in a network by calculating how the similarity of the network changes after the suppression of the user. SNAD-AE assumes that intruding accesses will lower the similarity of a network at a greater rate than a typical access.
We evaluate each access of a local network through similarity changes of the access network and its subnetworks as follows: where $Net_{s_{ij}}$ is the network $Net_{s_i}$ without user $u_j$. The larger the value of $Score(u_j s_i)$, the greater the likelihood that access $u_j s_i$ is anomalous. Note that this approach assumes that scores are centered. We empirically demonstrate that this is the case for our datasets in Appendix A.
As an example, $Net_{s_3}$ in Figure 2(e) consists of five users who accessed $s_3$. The process of access score calculation for every access associated with $s_3$ is depicted in Figure 3. In $Net_{s_3}$, the expectation is that if $u_j s_3$ is an intrusion, $Score(u_j s_3)$ will be larger than the score obtained when a typical user is suppressed. Similarities of access networks and their subnetworks are depicted in Figure 2(f). Figure 2(g) reports the scores (Equation 4) for all accesses involved with subject $s_3$. These scores were calculated from access network $Net_{s_3}$. The larger the score, the greater the probability the access is an intrusion. For the five accesses associated with network $Net_{s_3}$, $u_1$ and $u_6$ have scores (0.05 and 0.16, respectively) that are larger than those of $u_2$, $u_4$ and $u_5$. If we rank the scores and claim the highest as an anomaly, $u_6 s_3$ will be implicated by SNAD. Turning back to the $SU$ matrix, it can be seen that $u_2$, $u_4$, and $u_5$ access common subjects, whereas $u_6$ only has $s_3$ in common. Except for $s_2$, $s_3$ and $s_6$, $u_1$ has no common subjects with $u_2$, $u_4$ and $u_5$.
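The suppression-based scoring can be sketched as below. The exact form of the score is not reproduced in this text, so the sketch assumes the simplest reading of the description: the score of an access is the similarity of the subnetwork without the user minus the similarity of the full network. The toy vectors are hypothetical and do not correspond to the data in Figure 2.

```python
import math
from itertools import combinations

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumed convention for an empty access vector
    return dot / (norm_x * norm_y)

def network_similarity(vectors, users):
    """Average cosine similarity over all unordered pairs of `users`."""
    pairs = list(combinations(sorted(users), 2))
    if not pairs:
        return 0.0  # assumed convention: no pairs, no similarity
    return sum(cosine_similarity(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

def access_scores(vectors, users):
    """ASSUMED score: Sim(network without the user) - Sim(full network)."""
    full = network_similarity(vectors, users)
    return {u: network_similarity(vectors, set(users) - {u}) - full for u in users}

# Hypothetical vectors over four subjects; the last subject is the shared one.
vectors = {
    "u1": [1.0, 1.0, 0.0, 1.0],
    "u2": [1.0, 1.0, 0.0, 1.0],
    "u3": [1.0, 0.0, 0.0, 1.0],
    "u6": [0.0, 0.0, 1.0, 1.0],  # shares only the last subject with the others
}
scores = access_scores(vectors, set(vectors))
suspect = max(scores, key=scores.get)
print(scores, suspect)
```

With this reading, suppressing the poorly connected user raises the network similarity, so that user receives the largest (positive) score, while suppressing a well-connected user lowers it and yields a negative score.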

Spectral Anomaly Detection Model
Though SNAD may appear to be a simplistic model, we find it is more appropriate for access-level insider threat detection in CIS than more sophisticated competitors. As evidence, we will compare SNAD to a well-regarded competitor, spectral anomaly detection [42], which we summarize here.
This model first decomposes the binary matrix SU (or IDF matrix SU_IDF) into its principal components. The result of the decomposition is a principal component matrix PC, where each row represents a principal component, and each column is a vector of real values that represent the affinity of a user to each principal component. Each vector is associated with an eigenvalue, which is proportional to the amount of variance each vector captures in the system.
The spectral anomaly detection method then computes how user relationships to subjects are related to the most significant principal components. To accomplish this computation, the model selects the $l$ principal components with the largest eigenvalues $\lambda_k$, which leads to a new principal component matrix $PC'$. The decomposed matrix $PC'$ is then transformed into matrix $\lambda PC'$, where $\lambda PC'[k, :] = (\lambda_k / \lambda_{total}) \times PC'[k, :]$. We use a column vector $C_i$ to model $u_i$ on the selected $l$ principal components. Then, the average similarity of an access network is defined as: Finally, and similar to SNAD, the spectral model computes an access score for each user by measuring the change of distance in the access network after suppressing the user. As an example, access $u_6 s_3$ is scored by the spectral model as
We apply the spectral anomaly detection model on both the pre- and post-IDF transformed SU matrices and refer to these models as Spectral-Binary and Spectral-IDF, respectively.
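The component selection and eigenvalue weighting step can be illustrated in isolation. The sketch below assumes the eigenvalues and the principal component matrix have already been computed by some decomposition routine; it only selects the `l` components with the largest eigenvalues and scales each selected row by its share of the selected eigenvalue mass. The function name and the toy numbers are hypothetical.

```python
def weight_components(eigenvalues, PC, l):
    """Select the l components with the largest eigenvalues and scale row k by
    lambda_k / lambda_total, where lambda_total sums the selected eigenvalues."""
    order = sorted(range(len(eigenvalues)), key=lambda k: eigenvalues[k], reverse=True)[:l]
    lam_total = sum(eigenvalues[k] for k in order)
    return [[eigenvalues[k] / lam_total * value for value in PC[k]] for k in order]

# Hypothetical decomposition output: three components (rows) over two users (columns).
eigenvalues = [1.0, 3.0, 0.5]
PC = [
    [1.0, 1.0],
    [2.0, 4.0],
    [9.0, 9.0],
]
weighted = weight_components(eigenvalues, PC, l=2)
print(weighted)
```

Each column of the weighted matrix then serves as the representation of one user on the selected components.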

Datasets
For evaluation, we utilize datasets from CIS in two distinct domains: healthcare and online wikis. The first dataset corresponds to the real access logs of the Vanderbilt University Medical Center (VUMC) EHR system. This system has been in operation for over a decade and is well-ingrained in healthcare operations. The logs document when an authenticated VUMC employee accessed a patient's record. We refer to this set of access transactions as the EHR dataset. This dataset contains 1,327,500 accesses, 6,015 users, and 130,457 patients and was collected over 30 weeks during the year 2006. Further details and analysis of this dataset can be found in [36].
The second dataset corresponds to the wiki-talk logs from Wikipedia [45]. In this dataset, each registered user has a "talk" page that they and other users can edit in order to communicate and discuss updates to various articles on Wikipedia. The logs cover the years 2002 to 2008. They contain 6,482,780 revisions, 2,394,385 users and 55,200 articles. For this study, we analyze the revisions documented over 50 weeks during the year 2007. We refer to this set of transactions as the Wiki dataset.
In the EHR dataset, we refer to patient records as subjects, and user views of the records as accesses. Similarly, in the Wiki dataset, we refer to articles as subjects and user revisions as accesses. Summary statistics regarding the two datasets are provided in Table 2.

Detection Models
In this work, we evaluate four anomaly detection models. The first two are variants of SNAD, while the other two are variants of the spectral anomaly detection model. Specifically, we consider whether the IDF transformation influences anomaly detection performance. As such, we evaluate both models using the raw binary matrix (SU) and the IDF-transformed matrix (SU_IDF). We refer to these models as SNAD-Binary, SNAD-IDF, Spectral-Binary, and Spectral-IDF.

Experimental Design
The datasets do not document which (if any) accesses were intrusions. As such, to conduct a controlled evaluation, we injected simulated actions into the logs (i.e., changed 0's to 1's in the SU access matrix).
For this study, we use three scenarios to assess the intrusion detection rate under various settings:
1. Accesses Per User: We select a user at random, inject between 1 and 100 new subject accesses, and execute the detection model. This process is repeated 15 times per week.
2. Users Per Access Load: We investigate how the number of intruding users influences the detection rate. We select a set of users and inject three intruding accesses into each. We perform this analysis over the range of 2 to 20 intruding users.
3. Diverse Setting: We emulate a more realistic environment by allowing for a variety of simultaneous intruding users and actions. Specifically, we inject a set of random subject accesses, between 1 and 100, into a random set of users, between 1 and 20.
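The injection of simulated accesses into the binary matrix can be sketched as follows. This is an illustration under assumptions: the helper `inject_accesses`, its in-place update of the matrix, and the seeded random generator are choices made for this example and are not taken from the study.

```python
import random

def inject_accesses(SU, user, k, rng):
    """Flip up to k zero entries in column `user` of SU to 1 (in place).

    Returns the sorted row indices (subjects) that were injected, which serve
    as the ground-truth positives when the detector is evaluated.
    """
    candidates = [i for i in range(len(SU)) if SU[i][user] == 0]
    chosen = rng.sample(candidates, min(k, len(candidates)))
    for i in chosen:
        SU[i][user] = 1
    return sorted(chosen)

# Hypothetical 4-subject, 2-user matrix; user 0 originally accessed only subject 0.
SU = [
    [1, 0],
    [0, 0],
    [0, 1],
    [0, 0],
]
injected = inject_accesses(SU, user=0, k=2, rng=random.Random(0))
print(injected, SU)
```

The three scenarios then differ only in how many users are selected and how many accesses are injected per user.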
Each of these scenarios is simulated on a per week basis. We measure the performance of the models using the receiver operating characteristic (ROC) curve [46]. This is a characterization of the true positive rate versus the false positive rate for a binary classifier as its discrimination threshold is varied. The area under the ROC curve (AUC) reflects the relationship between sensitivity and specificity for a given test. Sensitivity (also called recall in some fields) measures the proportion of actual positives which are correctly identified as such (e.g., the percentage of simulated accesses which are correctly identified as anomalous). Specificity measures the proportion of negatives which are correctly identified (e.g., the percentage of real accesses which are correctly identified as normal). A higher AUC indicates better performance. In the first two simulation settings, we report on the average AUC per simulation configuration.
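For reference, the AUC of a set of access scores can be computed without tracing the curve explicitly, using its equivalence to the probability that a randomly chosen positive (simulated access) receives a higher score than a randomly chosen negative (real access), with ties counted as one half. The sketch below uses that pairwise formulation; the scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for a simulated (intruding) access, 0 for a real access.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in positives
        for q in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical anomaly scores for four accesses, two of which were injected.
auc_example = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(auc_example)
```

The pairwise form is quadratic in the number of accesses, so it is suited to small illustrations; a rank-based computation would be preferable at the scale of the datasets used here.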

Complexity and Runtime Analysis
We analyzed the complexity of SNAD from both a theoretical and an empirical perspective.
From a theoretical perspective, the complexity of SNAD is defined as follows. Let $m$ and $n$ be the number of subjects and users in the CIS, respectively. For every subject, SNAD constructs a local network, which is accomplished in $O(m)$ time. For each local network, SNAD computes the similarity for all pairs of users, which consumes $O(\log m \times (\log n)^2)$ time. It is anticipated that the number of subjects accessed by a user is significantly smaller than $m$, and thus the time to compute the similarity of a pair of users is $O(\log m)$. Since there are $(\log n)^2$ pairs, the complexity of SNAD is $O(m \times \log m \times (\log n)^2)$.
The average runtimes of the four models for the EHR and Wiki datasets are reported in Table 3. The runtime is the average number of seconds per scored access, calculated over all experiments and settings in this paper. We highlight three points based on the table. First, all models have a smaller runtime on the Wiki dataset than on the EHR dataset. This is because, for the same number of accesses, Wiki has a smaller number of local networks than EHR. In other words, models have to be executed over more local networks in EHR than in Wiki. Second, the SNAD models have smaller runtimes than the spectral models. This is because spectral analysis of the local network is more time consuming than the similarity calculation of the local network. Third, the SNAD and spectral models have nearly the same runtime when executed over their binary or IDF matrices. This is because the time required for transforming the binary matrix to the IDF matrix is negligible.

Similarity of Real Access Networks
Figure 4 depicts the distributions of access network similarity in the EHR and Wiki datasets for an arbitrary week (similar results were observed with other weeks). Notably, these environments capture different social phenomena. In the EHR dataset, for instance, the majority of access networks are small in size. And, as shown in the upper plot of Figure 4, the similarity approaches zero as the network size grows. This suggests that when a user is suppressed from a large network in the healthcare setting, the average similarity changes little. The main driving factor of this phenomenon is that large access networks in the EHR system tend to be varied in their user composition (i.e., complex dynamic teams of care providers).
In contrast, the lower plot of Figure 4 indicates that Wiki users in large access networks are relatively similar. This implies that when an intruder joins a large sized access network in the wiki world, the average similarity of the new access network could greatly decrease.
These observations suggest that SNAD may not be appropriate for a large access network whose average similarity is small. In this case, removing the intruder from the access network is expected to have little influence on the average similarity.
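To make this observation concrete, the following is a minimal sketch of how suppressing one user changes the average pairwise similarity of an access network. It assumes cosine similarity over user access vectors and a toy network; the function names and the data are hypothetical and are not taken from the SNAD implementation.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length access vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def average_similarity(vectors):
    """Mean pairwise cosine similarity; 0.0 for networks with fewer than 2 users."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def suppression_effect(network, user):
    """Change in average similarity when `user` is removed from the network."""
    rest = [vec for name, vec in network.items() if name != user]
    return average_similarity(rest) - average_similarity(list(network.values()))

# Hypothetical access network: three similar users and one outlier.
network = {
    "u1": [1, 1, 0, 0],
    "u2": [1, 1, 0, 0],
    "u3": [1, 1, 1, 0],
    "intruder": [0, 0, 0, 1],
}
for name in network:
    print(name, round(suppression_effect(network, name), 3))
```

In this toy network, removing the outlier raises the average similarity, while removing any of the similar users lowers it. If the remaining users were themselves dissimilar, as in the large EHR networks, every user's effect would be close to zero and the outlier would be hard to distinguish.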

Accesses Per User
In the first experiment, we investigate how the number of intrusions committed by a single user influences detection. Figures 5 and 6 depict the average AUC and one standard deviation of the detection models as a function of the number of simulated intrusions. SNAD-IDF has an AUC equal to or larger than that of both spectral models and SNAD-Binary. There are only two points at which the spectral models and SNAD-IDF were equivalent (3 intruding accesses in the EHR dataset and 5 intruding accesses in the Wiki dataset). Additionally, unlike the spectral models, the AUC of both SNAD variants tends to increase with the number of accesses. When the insider has only one simulated access, SNAD-IDF's average AUC is nearly 0.65, compared to 0.59 for its nearest spectral decomposition competitor, Spectral-Binary, in the EHR dataset. When the number of simulated accesses is 30, SNAD-IDF's AUC reaches 0.9 in both datasets. SNAD-IDF is also more effective than SNAD-Binary, which indicates that the affinity of a user to a subject relative to all subjects is an important factor in detecting strange insider actions. For the spectral models, however, neither Spectral-Binary nor Spectral-IDF is a clear winner across both datasets.
Table 3 Average runtime (seconds per access) of the four models for the datasets.
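The difference between the binary and IDF representations can be pictured with a small sketch. It assumes one plausible reading of "affinity of a user to a subject relative to all subjects": each user's binary row is scaled by log(S / s_u), where S is the total number of subjects and s_u is the number of subjects the user accessed. This weighting is an assumption made for illustration; the weighting actually used by SNAD-IDF is the one given in the model definition and may differ.

```python
import math

def idf_weight(binary_rows):
    """Scale each user's binary access row by log(S / s_u).

    S is the total number of subjects and s_u the number of subjects the
    user accessed (assumed weighting, for illustration only).
    """
    weighted = []
    for row in binary_rows:
        accessed = sum(row)
        n_subjects = len(row)
        weight = math.log(n_subjects / accessed) if accessed else 0.0
        weighted.append([weight * cell for cell in row])
    return weighted

# Hypothetical binary matrix: rows are users, columns are subjects.
binary = [
    [1, 0, 0, 0],  # focused user: one subject
    [1, 1, 1, 0],  # broad user: three subjects
]
for row in idf_weight(binary):
    print([round(x, 3) for x in row])
```

Under this assumed weighting, a user who touches few subjects carries more weight on each of them than a user who touches many, and a user who touches every subject carries no weight at all. The transformation is a single pass over the matrix, which is consistent with the negligible runtime difference between the binary and IDF variants.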

Users Per Access Load
In this experiment, we investigate how the number of intruding insiders influences detection. We fix the number of simulated accesses to 3. The results are depicted in Figures 7 and 8, which show that the AUC of all models increases with the number of intruding users for the EHR dataset, but that only the AUC of the SNAD variants increases for the Wiki dataset. Nonetheless, both SNAD variants greatly outperform the spectral models at all evaluation points: SNAD-IDF exhibits an AUC that is 20-30% higher, on average, than the spectral models, and 5-10% higher than SNAD-Binary. We suspect this is because the insiders with simulated accesses substantially amend the local access network, but have little influence on the global network, on which the spectral approach depends. The implication is that indirect relations, which are critical to the discovery of intruding user behavior in general [39,42], may be less important than direct relations in the detection of specific intruding accesses. However, we recognize that a more detailed investigation, perhaps with more datasets, is necessary before such a conjecture can be confirmed.
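For readers who wish to reproduce the evaluation metric, the AUC of a set of scored accesses can be computed by pairwise comparison (the Mann-Whitney formulation): it is the fraction of (intruding, real) pairs in which the intruding access receives the higher anomaly score, with ties counted as one half. The sketch below assumes that higher scores mean "more suspicious"; the scores and labels are made up for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney).

    scores: anomaly scores, higher means more suspicious (assumed convention).
    labels: 1 for a simulated intruding access, 0 for a real access.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical scores for six accesses, two of which are simulated intrusions.
example_scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
example_labels = [1, 0, 1, 0, 0, 0]
print(auc(example_scores, example_labels))  # 6 of 8 pairs ranked correctly: 0.75
```

The quadratic loop is adequate for a sketch; for logs with millions of accesses, a sort-based rank computation gives the same value in O(n log n).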

Diverse Insider Setting
In the third experiment, we injected a random number of accesses into a random set of user vectors. Figures 9 and 10 provide a comparison of the ROC curves of the detection models for both datasets. It is apparent that SNAD-IDF performs better than both of the spectral models and SNAD-Binary at every operating point. Table 4 summarizes the average AUC scores (+/- one standard deviation) of the anomaly detection models in this setting. The results indicate that SNAD-IDF achieves the highest AUC, 0.83 and 0.91 in the EHR and Wiki datasets, respectively. This translates into AUC scores that are 10-20% higher, on average, than those of the spectral models.
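The injection step in this setting can be sketched as follows. The sketch assumes a binary user-by-subject matrix and treats an intruding access as flipping a zero entry to one for a randomly chosen user; the function name, its parameters, and the toy matrix are hypothetical, and the actual simulation procedure may differ in how users and subjects are sampled.

```python
import random

def inject_accesses(matrix, rng, max_users, max_accesses):
    """Flip randomly chosen zero entries to one for randomly chosen users.

    Returns a new matrix and the list of injected (user, subject) pairs,
    which serve as the positive labels when scoring the detectors.
    """
    injected = []
    result = [row[:] for row in matrix]  # leave the input untouched
    n_users = rng.randint(1, min(max_users, len(result)))
    for u in rng.sample(range(len(result)), n_users):
        free = [s for s, cell in enumerate(result[u]) if cell == 0]
        k = min(rng.randint(1, max_accesses), len(free))
        for s in rng.sample(free, k):
            result[u][s] = 1
            injected.append((u, s))
    return result, injected

# Hypothetical 3-user by 4-subject access matrix.
base = [
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
]
mixed, intrusions = inject_accesses(base, random.Random(7), max_users=2, max_accesses=2)
print(intrusions)
```

Passing an explicit `random.Random` instance keeps each trial reproducible, which makes it straightforward to report an average AUC and standard deviation over repeated runs.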

Discussion
SNAD is an unsupervised learning model, based on social network analysis, for detecting anomalous accesses in CIS. SNAD uses a "local" view of a social network in that, for each subject, it considers only the direct relationships of the users that access the subject. This is a simpler approach than methods based on a "global" perspective, such as neighborhood formation, community detection, and spectral decomposition models. Though the latter models may invoke heavier computational machinery, our empirical investigations with real-world data from the healthcare and online wiki domains suggest that they are less effective for detecting specific behavioral deviations. Our results suggest the local view is significantly more adept at determining when a user has wandered into a network of users associating with a particular subject.
Figure 9 ROC curves for the detection models in a diverse setting for the EHR dataset. This figure summarizes the ROC scores for the four detection models, as applied to the EHR dataset, when the number of intruders and the quantity of intruding accesses are randomly generated.
Despite the development of a more robust CIS access anomaly detection model, this study has several limitations, which we point out as a guide for future research on this topic.
First, the results of the experiments outline the scope and context in which SNAD is applicable. SNAD appears to be suited to environments where the access networks exhibit high similarity. Additionally, the performance of SNAD seems to increase as a function of the number of illicit insiders and the quantity of suspicious accesses they execute. However, SNAD may not be suited for large networks in environments where the users collaborate in physical settings, such as healthcare. Our results suggest that in these environments, the similarity of the users in local networks is relatively low. We suspect that this is due to the ad hoc, team-based nature of dynamic organizations that function in the real world. For instance, in a hospital setting, there are often over 100 job titles or specialties, and large teams are often constructed based on who from which specialty is available to work. In contrast, in the online Wikipedia setting, it appears that large teams are not limited by specialty, but by who is interested in collaborating over a subject. We believe further investigation is necessary before any concrete conclusions can be drawn about such behavior. In particular, users may have different roles, such as administrators and gateways, which may influence their relationships to other users, and thus some of the accesses associated with these users will be falsely detected as anomalous by SNAD. In this paper, we do not consider the impact of this type of noise on our SNAD models, but believe it is a fruitful direction for future research.
Additionally, SNAD accounts for the relationship between users and subjects, but neglects the semantics of the relation. For instance, SNAD does not model the intention of a user while executing an action, which serves as the foundation of the recent model proposed in [27]. A CIS is often mission-oriented, such that the semantics of the users and subjects are informative. Consider an EHR system, in which patients are assigned diagnoses and procedures, while users are affiliated with various departments and assigned certain roles within a healthcare organization. Rather than treat each user and patient equally, we believe that detection sensitivity could be improved by integrating such information into the network modeling process.
Finally, SNAD was evaluated on only one type of attack, i.e., when a user issues an intruding access randomly. Yet, in real systems, there may be many types of attacks [27], some of which are more complex and require different simulation methods.

Conclusion
In this paper we proposed a specialized network anomaly detection model, or SNAD, to discover anomalous actions in collaborative information systems (CIS). SNAD differs from previous insider threat detection techniques in that it is engineered to assess specific event-related actions as opposed to global patterns. The foundation of SNAD is an efficient unsupervised learning method, such that it can be deployed in real systems. We evaluated our technique against several competitors, based on spectral decomposition, with real EHR access and Wiki "talk" logs. The empirical results demonstrate that SNAD exhibits better performance than its competitors in almost every assessed scenario. Nonetheless, there are limitations of SNAD, such as a requirement for highly similar behavior of all users accessing a specific subject, which may be difficult to realize in large networks in the physical world. We believe that extending the model to account for the semantics of the users and subjects will improve its effectiveness in such cases and anticipate extending SNAD in this direction.

A Validity of Similarity Measure
The anomaly detection measure proposed for SNAD assumes that the "similarity" scores are approximately distributed around a well-centered mean. To verify whether this is the case, we modeled the distribution of access scores for an arbitrary week of the EHR and the Wiki datasets. The distributions of scores are depicted in Figure 11 and Figure 12. From the figures, we observe that the distribution of access scores in both the EHR and Wiki datasets is close to a Laplace distribution, which is centered.
Figure 11 Density function of access score in an arbitrary week of the EHR dataset. In this figure, the upper plot corresponds to the access scores on access networks of various sizes in the EHR dataset. The lower plot corresponds to the distribution of the access scores.
Chen et al. Security Informatics 2012, 1:5 http://www.security-informatics.com/content/1/1/5
Figure 13 Cumulative distribution function of access scores in an arbitrary week of the EHR. This figure reports the frequency of access scores, as a cumulative distribution function, for one week of the EHR dataset. In this figure, the x-axis corresponds to the access scores. The y-axis corresponds to the fraction of the scores with value less than a specific score X_i.
Figure 14 Cumulative distribution function of access scores in an arbitrary week of the Wiki. This figure reports the frequency of access scores, as a cumulative distribution function, for one week of the Wiki dataset. In this figure, the x-axis corresponds to the access scores. The y-axis corresponds to the fraction of the scores with value less than a specific score X_i.
To assess the relationship between the distributions of access scores and a Laplace distribution, we created a random dataset from an ideal Laplace distribution. Our simulation has two parameters: location a and scale b. The location a is set to 0, and the scale b is set to 0.05. In the simulation, we generated the cumulative distribution functions of the ideal Laplace distribution and of the access scores, which are shown in Figure 13 and Figure 14.
We calculated the correlation between the distributions via a Pearson coefficient, r. The r value is 0.866 in the EHR dataset and 1 in the Wiki dataset, which suggests that the real access score distributions are highly correlated with the Laplace distribution. The relation between the real and Laplace distributions can be observed in Figure 13 and Figure 14.
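The comparison described above can be sketched in a few lines. The sketch evaluates the ideal Laplace cumulative distribution function with location a = 0 and scale b = 0.05, builds an empirical cumulative distribution function from a set of scores, and computes the Pearson coefficient between the two over a grid of score values. The real access scores are not available here, so the `scores` list is a synthetic stand-in drawn from a Laplace distribution; the grid, the sample size, and all function names are assumptions made for illustration.

```python
import math
import random

def laplace_cdf(x, a=0.0, b=0.05):
    """CDF of a Laplace distribution with location a and scale b."""
    if x < a:
        return 0.5 * math.exp((x - a) / b)
    return 1.0 - 0.5 * math.exp(-(x - a) / b)

def empirical_cdf(samples, x):
    """Fraction of samples less than or equal to x."""
    return sum(1 for s in samples if s <= x) / len(samples)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def sample_laplace(rng, a=0.0, b=0.05):
    """Laplace sample as the scaled difference of two unit exponentials."""
    return a + b * (rng.expovariate(1.0) - rng.expovariate(1.0))

rng = random.Random(0)
# Stand-in for one week of access scores (synthetic, not the real data).
scores = [sample_laplace(rng) for _ in range(2000)]

grid = [i / 100 for i in range(-30, 31)]          # score values from -0.3 to 0.3
ideal = [laplace_cdf(x) for x in grid]            # ideal Laplace CDF
observed = [empirical_cdf(scores, x) for x in grid]
r = pearson(ideal, observed)
print(round(r, 3))
```

Because the stand-in scores are themselves Laplace distributed, r is close to 1 in this sketch; with real access scores, a lower value such as the 0.866 reported for the EHR dataset would indicate a looser fit.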