Estimating the sentiment of social media content for security informatics applications

Glass, Kristin; Colbaugh, Richard

doi:10.1186/2190-8532-1-3

Research
Open access
Published: 27 February 2012

Security Informatics volume 1, Article number: 3 (2012)

Abstract

Inferring the sentiment of social media content, for instance blog posts and forum threads, is both of great interest to security analysts and technically challenging to accomplish. This paper presents two computational methods for estimating social media sentiment which address the challenges associated with Web-based analysis. Each method formulates the task as one of text classification, models the data as a bipartite graph of documents and words, and assumes that only limited prior information is available regarding the sentiment orientation of any of the documents or words of interest. The first algorithm is a semi-supervised sentiment classifier which combines knowledge of the sentiment labels for a few documents and words with information present in unlabeled data, which is abundant online. The second algorithm assumes existence of a set of labeled documents in a domain related to the domain of interest, and leverages these data to estimate sentiment in the target domain. We demonstrate the utility of the proposed methods by showing they outperform several standard techniques for the task of inferring the sentiment of online movie and consumer product reviews. Additionally, we illustrate the potential of the methods for security informatics by estimating regional public opinion regarding two events: the 2009 Jakarta hotel bombings and 2011 Egyptian revolution.

1. Introduction

There is increasing recognition that the Web represents a valuable source of security-relevant intelligence and that computational analysis offers a promising way of dealing with the problem of collecting and analyzing data at Web scale [e.g., [1–4]]. As a consequence, tools and algorithms have been developed which support various security informatics objectives [3, 4]. To cite a specific example, we have recently shown that blog network dynamics can be exploited to provide reliable early warning for a class of extremist-related, real-world protest events [5].

Monitoring social media to spot emerging issues and trends and to assess public opinion concerning topics and events is of considerable interest to security professionals; however, performing such analysis is technically challenging. The opinions of individuals and groups are typically expressed as informal communications and are buried in the vast, and largely irrelevant, output of millions of bloggers and other online content producers. Consequently, effectively exploiting these data requires the development of new, automated methods of analysis [3, 4]. Although helpful computational analytics have been derived for traditional forms of written content, less has been done to develop techniques that are well-suited to the particular characteristics of the content found in social media.

This paper considers one of the central problems in the new field of social media analytics: deciding whether a given document, such as a blog post or forum thread, expresses positive or negative opinion toward a particular topic. The informal nature of social media content poses a challenge for language-based sentiment analysis. While statistical learning-based methods often provide good performance in unstructured settings like this [e.g., [6–13]], obtaining the required labeled instances of data, such as a collection of "exemplar" blog posts of known sentiment polarity, is usually an expensive and time-consuming undertaking.

We present two new computational methods for inferring sentiment orientation of social media content which address these challenges. Each method formulates the task as one of text classification, models the data as a bipartite graph of documents and words, and assumes that only limited prior information is available regarding the sentiment orientation of any of the documents or words of interest. The first algorithm adopts a semi-supervised approach to sentiment classification, combining knowledge of the sentiment polarity for a few documents and a small lexicon of words with information present in a corpus of unlabeled documents; note that such unlabeled data are readily obtainable in online applications. The second algorithm assumes existence of a set of labeled documents in a domain related to the domain of interest, and provides a procedure for transferring the sentiment knowledge contained in these data to the target domain. We demonstrate the utility of the proposed algorithms by showing they outperform several standard methods for the task of inferring the sentiment polarity of online reviews of movies and consumer products. Additionally, we illustrate the potential of the methods for security informatics through two case studies in which sentiment analysis of Arabic, Indonesian, and Danish (language) blogs is used to estimate regional public opinion regarding the 2009 Jakarta hotel bombings and 2011 Egyptian revolution.

2. Preliminaries

We approach the task of estimating the sentiment orientations of a collection of documents as a text classification problem. Each document of interest is represented as a "bag of words" feature vector x∈ℜ^|V|, where the entries of x reflect some measure of the frequency with which the words in the vocabulary set V appear in the document. For example, the elements of x can be simple word-counts, or x can be normalized in various ways [6]. (In this paper the elements x_i of x are defined to indicate the presence (x_i = 1) or absence (x_i = 0) of the corresponding words in the document; however, specifying x using word-counts yields similar results.) We wish to learn a vector c∈ℜ^|V| such that the classifier orient = sign(c^Tx) accurately estimates the sentiment orientation of document x, returning +1 (-1) for documents expressing positive (negative) sentiment about the topic of interest.

Knowledge-based classifiers leverage prior domain information to construct the vector c. One way to obtain such a classifier is to assemble lexicons of positive words V⁺⊆V and negative words V^-⊆V, and then to set c_i= +1 if word i∈V⁺, c_i= -1 if i∈V^-, and c_i=0 if i is not in either lexicon; this classifier simply sums the positive and negative sentiment words in the document and assigns document orientation accordingly. While this scheme can provide acceptable performance in certain settings, it is unable to improve its performance or adapt to new domains, and it is usually labor-intensive to construct lexicons which are sufficiently complete to enable useful sentiment classification performance to be achieved.
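To make the lexicon-only scheme concrete, the following Python sketch builds the vector c from two small word sets and applies orient = sign(c^Tx) to a binary bag-of-words vector. The toy vocabulary, the whitespace tokenizer, and the function names are illustrative assumptions and are not taken from the original study.

```python
def lexicon_classifier(vocabulary, positive_words, negative_words):
    """Build c: +1 for positive lexicon words, -1 for negative ones, 0 otherwise."""
    return [1 if word in positive_words else -1 if word in negative_words else 0
            for word in vocabulary]

def presence_vector(document, vocabulary):
    """Binary bag-of-words vector: x_i = 1 if vocabulary word i occurs in the document."""
    tokens = set(document.lower().split())  # naive whitespace tokenizer (assumption)
    return [1 if word in tokens else 0 for word in vocabulary]

def orient(c, x):
    """sign(c^T x); a zero score is returned as 0, i.e. the lexicon is undecided."""
    score = sum(ci * xi for ci, xi in zip(c, x))
    return (score > 0) - (score < 0)

# Hypothetical six-word vocabulary with a four-word sentiment lexicon.
vocabulary = ["good", "great", "bad", "awful", "movie", "plot"]
c = lexicon_classifier(vocabulary, {"good", "great"}, {"bad", "awful"})
x = presence_vector("great movie with a good plot", vocabulary)
print(orient(c, x))
```

A document containing no lexicon words receives a score of zero, which illustrates why an incomplete lexicon limits this kind of classifier.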

Alternatively, learning-based methods attempt to generate the classifier vector c from examples of positive and negative sentiment. To obtain a learning-based classifier, one can begin by assembling a set of n_l labeled documents {(x_i, d_i)}, where d_i∈{+1, -1} is the sentiment label for document i. The vector c then can be learned through "training" with the set {(x_i, d_i)}, for instance by solving the following set of equations for c:

[X^{T} X + γ I_{|V|}] c = X^{T} d,

(1)

where matrix X∈ℜ^nl×|V| has document vectors for rows, d∈ℜ^nl is the vector of document labels, I_|V| denotes the |V|×|V| identity matrix, and γ ≥ 0 is a constant; this corresponds to regularized least squares (RLS) learning [14]. Many other strategies can be used to compute c, including Naïve Bayes (NB) statistical inference [6]. Learning-based classifiers have the potential to improve their performance and to adapt to new situations, but standard methods for realizing these capabilities require that fairly large training sets of labeled documents be obtained and this is usually expensive.
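As an illustration of the RLS equations (1), the sketch below solves the regularized normal equations for a toy data set with NumPy. The four-document matrix, the value of γ, and the function name are assumptions chosen only for demonstration; a dense solver is used for brevity.

```python
import numpy as np

def rls_classifier(X, d, gamma=1.0):
    """Solve (X^T X + gamma * I) c = X^T d for the classifier vector c."""
    n_words = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(n_words), X.T @ d)

# Toy data (assumption): 4 labeled documents over a 3-word vocabulary.
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]], dtype=float)
d = np.array([1, 1, -1, -1], dtype=float)

c = rls_classifier(X, d, gamma=0.1)
predictions = np.sign(X @ c)  # orient = sign(c^T x) for each training document
print(c, predictions)
```

In this toy example the third word appears equally often in positive and negative documents, so its learned weight is zero.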

Sentiment analysis of social media content for security informatics applications is often characterized by the existence of only modest levels of prior knowledge regarding the domain of interest, reflected in the availability of a few labeled documents and small lexicon of sentiment-laden words, and by the need to rapidly learn and adapt to new domains. As a consequence, standard knowledge-based and learning-based sentiment analysis methods are typically ill-suited for security informatics. In order to address this challenge, the sentiment analysis methods developed in this paper enable limited labeled data to be combined with readily available "auxiliary" information to produce accurate sentiment estimates. More specifically, the first proposed method is a semi-supervised algorithm [e.g., [9, 10]] which leverages a source of supplementary data which is abundant online: unlabeled documents and words. Our second algorithm is a novel transfer learning method [e.g., [11]] which permits the knowledge present in data that has been previously labeled in a related domain (say online movie reviews) to be transferred to a new domain (e.g., reviews of consumer products).

Each of the algorithms proposed in this paper assumes the availability of a modest lexicon of sentiment-laden words. This lexicon is encoded as a vector w∈ℜ^|Vl|, where V_l = V⁺∪V^- is the sentiment lexicon and the entries of w are set to +1 or -1 according to the polarity of the corresponding words. The development of the algorithms begins by modeling the problem data as a bipartite graph G_b of documents and words (see Figure 1). It is easy to see that the adjacency matrix A for graph G_b is given by

A = [\begin{matrix} 0 & X \\ X^{T} & 0 \end{matrix}]

(2)

where the matrix X∈ℜ^n×|V| is constructed by stacking the document vectors as rows, and each '0' is a matrix of zeros. In both the semi-supervised and transfer learning algorithms, integration of labeled and "auxiliary" data is accomplished by exploiting the relationships between documents and words encoded in the bipartite graph model. The basic idea is to assume that, in the bipartite graph G_b, positive documents will tend to be connected to (contain) positive words, and analogously for negative documents/words.
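The block structure of the adjacency matrix (2) can be sketched in a few lines of NumPy. The two-document matrix below is a made-up example, and the function name is hypothetical; document vertices are indexed first and word vertices second.

```python
import numpy as np

def bipartite_adjacency(X):
    """Adjacency matrix of the document-word bipartite graph: [[0, X], [X^T, 0]]."""
    n_docs, n_words = X.shape
    return np.block([[np.zeros((n_docs, n_docs)), X],
                     [X.T, np.zeros((n_words, n_words))]])

# Toy data (assumption): 2 documents over a 3-word vocabulary.
X = np.array([[1, 1, 0],
              [0, 1, 1]], dtype=float)
A = bipartite_adjacency(X)

# Vertex degrees: for documents, the number of distinct words they contain;
# for words, the number of documents in which they appear.
degrees = A.sum(axis=1)
print(A)
print(degrees)
```

Because edges only join documents to words, the two diagonal blocks are zero and A is symmetric.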

3. Semi-Supervised Sentiment Analysis

We now derive our first sentiment estimation algorithm for social media content. Consider the common situation in which only limited prior knowledge is available about the way sentiment is expressed in the domain of interest, in the form of small sets of documents and words for which sentiment labels are known, but where abundant unlabeled documents can be easily collected (e.g., via Web crawling). In this setting it is natural to adopt a semi-supervised approach, in which labeled and unlabeled data are combined and leveraged in the analysis process. In what follows we present a novel bipartite graph-based approach to semi-supervised sentiment analysis.

Assume the initial problem data consists of a corpus of n documents, of which n_l << n are labeled, and a modest lexicon V_l of sentiment-laden words, and suppose that this label information is encoded as vectors d∈ℜ^nl and w∈ℜ^|Vl|, respectively. Let d_est∈ℜⁿ be the vector of estimated sentiment orientations for the documents in the corpus, and define the "augmented" classifier c_aug = [d_est^T c^T]^T∈ℜ^n+|V| which estimates the polarity of both documents and words. Note that the quantity c_aug is introduced for notational convenience in the subsequent development and is not directly employed for classification. More specifically, in the proposed methodology we learn c_aug, and therefore c, by solving an optimization problem involving the labeled and unlabeled training data, and then use c to estimate the sentiment of any new document of interest with the simple linear classifier orient = sign(c^Tx). We refer to this classifier as semi-supervised because it is learned using both labeled and unlabeled data. Assume for ease of notation that the documents and words are indexed so the first n_l elements of d_est and |V_l| elements of c correspond to the labeled data.

We wish to learn an augmented classifier c_aug with the following three properties: 1.) if a document is labeled, then the corresponding entry of d_est should be close to this ±1 label; 2.) if a word is in the sentiment lexicon, then the corresponding entry of c should be close to this ±1 sentiment polarity; and 3.) if there is an edge X_ij of G_b that connects a document x and a word v∈V and X_ij possesses significant weight, then the estimated polarities of x and v should be similar. These objectives are encoded in the following minimization problem:

min_{c_{aug}} c_{aug}^{T} L c_{aug} + β_{1} \sum_{i = 1}^{n_{l}} {(d_{est,i} - d_{i})}^{2} + β_{2} \sum_{i = 1}^{|V_{l}|} {(c_{i} - w_{i})}^{2}

(3)

where L = D - A is the graph Laplacian matrix for G_b, with D the diagonal degree matrix for A (i.e., D_ii = ∑_j A_ij), and β₁, β₂ are nonnegative constants. Minimizing (3) enforces the three properties we seek for c_aug, with the second and third terms penalizing "errors" in the first two properties. To see that the first term enforces the third property, observe that this expression is a sum of components of the form X_ij(d_est,i - c_j)². The constants β₁, β₂ can be used to balance the relative importance of the three properties.
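The claim that the Laplacian term is a sum of components X_ij(d_est,i - c_j)² can be checked numerically. The sketch below compares the quadratic form c_aug^T L c_aug with the explicit edge sum for a random binary matrix; the matrix size, the random seed, and the variable names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random binary document-word matrix (assumption: 4 documents, 6 words).
X = (rng.random((4, 6)) < 0.5).astype(float)
n, m = X.shape

# Bipartite adjacency matrix and graph Laplacian L = D - A.
A = np.block([[np.zeros((n, n)), X], [X.T, np.zeros((m, m))]])
L = np.diag(A.sum(axis=1)) - A

# Arbitrary polarity estimates for documents and words.
d_est = rng.standard_normal(n)
c = rng.standard_normal(m)
c_aug = np.concatenate([d_est, c])

quadratic_form = c_aug @ L @ c_aug
edge_sum = np.sum(X * (d_est[:, None] - c[None, :]) ** 2)
print(np.isclose(quadratic_form, edge_sum))
```

The two quantities agree because each document-word edge contributes its weight times the squared difference of the polarities at its endpoints.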

The c_aug which minimizes the objective function (3) can be obtained by solving the following set of linear equations:

[\begin{matrix} L_{11} + β_{1} I_{n_{l}} & L_{12} & L_{13} & L_{14} \\ L_{21} & L_{22} & L_{23} & L_{24} \\ L_{31} & L_{32} & L_{33} + β_{2} I_{|V_{l}|} & L_{34} \\ L_{41} & L_{42} & L_{43} & L_{44} \end{matrix}] c_{aug} = [\begin{matrix} β_{1} d \\ 0 \\ β_{2} w \\ 0 \end{matrix}]

(4)

where the L_ij are matrix blocks of L of appropriate dimension. The system (4) is sparse because the data matrix X is sparse, and therefore large-scale problems can be solved efficiently. Note that in situations where the set of available labeled documents and words is very limited, say fewer than a couple hundred documents, sentiment classifier performance can be improved by replacing L in (4) with the normalized Laplacian L_n = D^{-1/2} L D^{-1/2}, or with a power of this matrix L_n^k (for k a positive integer). The Appendix of this paper demonstrates that replacing L with L_n^k serves to "smooth" the polarity estimates assigned to the vertices of G_b, thereby reducing the possibility for over-fitting and increasing the capability for generalization.

We summarize this discussion by sketching an algorithm for learning the proposed semi-supervised classifier:

Algorithm SS

1. Construct the set of equations (4), optionally replacing the graph Laplacian L with L_n^k.
2. Solve equations (4) for c_aug = [d_est^T c^T]^T (for instance using the Conjugate Gradient method).
3. Estimate the sentiment orientation of any new document x of interest as: orient = sign(c^Tx).
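The three steps of Algorithm SS can be sketched as follows. This is a minimal dense NumPy illustration under stated assumptions, not the implementation used in the study: the function name, the toy corpus, and the use of a direct solver are all illustrative, and a realistic corpus would call for sparse matrices and an iterative solver such as conjugate gradient. As in the text, labeled documents are assumed to occupy the first rows of X and lexicon words the first columns.

```python
import numpy as np

def algorithm_ss(X, d, w, beta1=0.1, beta2=0.5, k=None):
    """Sketch of Algorithm SS; returns (d_est, c).

    X : (n, |V|) document-word matrix, labeled documents in the first len(d) rows.
    d : +1/-1 labels of the labeled documents.
    w : +1/-1 polarities of the lexicon words, stored in the first len(w) columns.
    k : if given, the Laplacian L is replaced by the power L_n^k of the
        normalized Laplacian (assumes every vertex has nonzero degree).
    """
    n, m = X.shape
    n_l, v_l = len(d), len(w)
    A = np.block([[np.zeros((n, n)), X], [X.T, np.zeros((m, m))]])
    degree = A.sum(axis=1)
    L = np.diag(degree) - A
    if k is not None:
        scale = 1.0 / np.sqrt(degree)
        L = np.linalg.matrix_power(scale[:, None] * L * scale[None, :], k)
    # Step 1: assemble system (4); penalties act only on labeled entries.
    penalty = np.zeros(n + m)
    penalty[:n_l] = beta1
    penalty[n:n + v_l] = beta2
    rhs = np.zeros(n + m)
    rhs[:n_l] = beta1 * np.asarray(d, dtype=float)
    rhs[n:n + v_l] = beta2 * np.asarray(w, dtype=float)
    # Step 2: solve for c_aug = [d_est; c].
    c_aug = np.linalg.solve(L + np.diag(penalty), rhs)
    return c_aug[:n], c_aug[n:]

# Toy corpus (assumption). Vocabulary order: good, bad, plot, acting, boring, fun.
X = np.array([[1, 0, 0, 1, 0, 1],   # labeled +1: good, acting, fun
              [0, 1, 1, 0, 1, 0],   # labeled -1: bad, plot, boring
              [0, 0, 1, 1, 0, 1],   # unlabeled:  plot, acting, fun
              [0, 0, 1, 1, 1, 0]],  # unlabeled:  plot, acting, boring
             dtype=float)
d_est, c = algorithm_ss(X, d=[1, -1], w=[1, -1])

# Step 3: classify a new document containing "acting" and "fun".
new_doc = np.array([0, 0, 0, 1, 0, 1], dtype=float)
print(np.sign(c @ new_doc))
```

In this toy corpus the word "fun" is not in the lexicon, yet it acquires a positive weight because it co-occurs with a positively labeled document, which is the propagation effect the bipartite graph model is meant to provide.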

The utility of Algorithm SS is now examined through a case study involving a standard sentiment analysis task: estimation of the sentiment polarity of online movie reviews (an exercise which is known to be difficult).

4. Case Study One: Movie Reviews

This case study examines the performance of Algorithm SS for the problem of estimating sentiment of online movie reviews. The data used in this study is a publicly available set of 2000 movie reviews, 1000 positive and 1000 negative, collected from the Internet Movie Database and archived at the website [15]. The Lemur Toolkit [16] was employed to construct the data matrix X and vector of document labels d from these reviews. A lexicon of ~1400 domain-independent sentiment-laden words was obtained from [17] and employed to build the lexicon vector w.

This study compares the movie review orientation classification accuracy of Algorithm SS with that of three other schemes: 1.) lexicon-only, in which the lexicon vector w is used as the classifier as summarized in Section 2, 2.) a classical NB classifier obtained from [18], and 3.) a well-tuned version of the RLS classifier (1). Algorithm SS is implemented with the following parameter values: β₁ = 0.1, β₂ = 0.5, and k = 10. A focus of the investigation is evaluating the extent to which good sentiment estimation performance can be achieved even if only relatively few labeled documents are available for training; thus we examine training sets which incorporate a range of numbers of labeled documents: n_l = 50, 100, 150, 200, 300, 400, 600, 800, 1000.

Sample results from this study are depicted in Figure 2. Each data point in the plots represents the average of ten trials. In each trial, the movie reviews are randomly partitioned into 1000 training and 1000 test documents, and a randomly selected subset of training documents of size n_l is "labeled" (i.e., the labels for these reviews are made available to the learning algorithms). As shown in Figure 2, Algorithm SS outperforms the other three methods. Note that, in particular, the accuracy obtained with the proposed approach is significantly better than the other techniques when the number of labeled training documents is small. It is expected that this property of Algorithm SS will be of considerable value in security informatics applications that involve social media data.
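The evaluation protocol described above (random train/test partition, a random labeled subset of size n_l, accuracy averaged over ten trials) can be sketched in plain Python. The function `average_accuracy`, the `train_fn` interface, and the toy keyword rule are hypothetical stand-ins introduced for illustration; they do not reproduce the learners or data used in the study.

```python
import random

def average_accuracy(documents, labels, n_labeled, train_fn,
                     n_trials=10, n_train=1000, seed=0):
    """Average test accuracy over random partitions of the corpus.

    train_fn(labeled_docs, labeled_labels, unlabeled_pool) must return a
    function mapping a document to a predicted label (hypothetical interface).
    """
    rng = random.Random(seed)
    indices = list(range(len(documents)))
    accuracies = []
    for _ in range(n_trials):
        rng.shuffle(indices)
        train_idx, test_idx = indices[:n_train], indices[n_train:]
        labeled_idx = rng.sample(train_idx, n_labeled)
        predict = train_fn([documents[i] for i in labeled_idx],
                           [labels[i] for i in labeled_idx],
                           [documents[i] for i in train_idx])  # labels withheld
        correct = sum(predict(documents[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / n_trials

def keyword_rule(labeled_docs, labeled_labels, unlabeled_pool):
    """Trivial stand-in learner that ignores its training data."""
    return lambda doc: 1 if "good" in doc else -1

# Toy corpus (assumption): 20 perfectly separable documents.
documents = ["good film"] * 10 + ["bad film"] * 10
labels = [1] * 10 + [-1] * 10
acc = average_accuracy(documents, labels, n_labeled=4,
                       train_fn=keyword_rule, n_trials=10, n_train=10)
print(acc)
```

Passing the full training pool to the learner alongside the small labeled subset mirrors the setting in which a semi-supervised method can exploit unlabeled documents while a purely supervised one cannot.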

5. Transfer Learning Sentiment Analysis

This section develops the second proposed sentiment estimation algorithm for social media content. Many security informatics applications are characterized by the presence of limited labeled data for the domain of interest but ample labeled information for a related domain. For instance, an analyst may wish to ascertain the sentiment of online discussions about an emerging topic of interest, and may have in hand a large set of labeled examples of positive and negative posts regarding other topics (e.g., from previous studies). In this setting it is natural to adopt a transfer learning approach, in which knowledge concerning the way sentiment is expressed in one domain, the so-called source domain, is transferred to permit sentiment estimation in a new target domain. In what follows we present a new bipartite graph-based approach to transfer learning sentiment analysis.

Assume that the initial problem data consists of a corpus of n = n_T + n_S documents, where n_T is the (small) number of labeled documents available for the target domain of interest and n_S >> n_T is the number of labeled documents from some related source domain; in addition, suppose that a modest lexicon V_l of sentiment-laden words is known. Let this label data be encoded as vectors d_T∈ℜ^nT, d_S∈ℜ^nS, and w∈ℜ^|V|, respectively. Denote by d_T,est∈ℜ^nT, d_S,est∈ℜ^nS, and c∈ℜ^|V| the vectors of estimated sentiment orientations for the target and source documents and the words, and define the augmented classifier as c_aug = [d_S,est^T d_T,est^T c^T]^T ∈ ℜ^n+|V|. Note that the quantity c_aug is introduced for notational convenience in the subsequent development and is not directly employed for classification.

In what follows we derive an algorithm for learning c_aug, and therefore c, by solving an optimization problem involving the labeled source and target training data, and then use c to estimate the sentiment of any new document of interest via the simple linear classifier orient = sign(c^Tx). This classifier is referred to as transfer learning-based because c is learned, in part, by transferring knowledge about the way sentiment is expressed from a domain which is related to (but need not be identical to) the domain of interest.

We wish to learn an augmented classifier c_aug with the following four properties: 1.) if a source document is labeled, then the corresponding entry of d_S,est should be close to this ±1 label; 2.) if a target document is labeled, then the corresponding entry of d_T,est should be close to this ±1 label, and the information encoded in d_T should be emphasized relative to that in the source labels d_S; 3.) if a word is in the sentiment lexicon, then the corresponding entry of c should be close to this ±1 sentiment polarity; and 4.) if there is an edge X_ij of G_b that connects a document x and a word v∈V and X_ij possesses significant weight, then the estimated polarities of x and v should be similar.

The four objectives listed above may be realized by solving the following minimization problem:

min_{c_{aug}} c_{aug}^{T} L c_{aug} + β_{1} {∥d_{S,est} - k_{S} d_{S}∥}^{2} + β_{2} {∥d_{T,est} - k_{T} d_{T}∥}^{2} + β_{3} {∥c - w∥}^{2}

(5)

where L = D - A is the graph Laplacian matrix for G_b, as before, and β₁, β₂, β₃, k_S, and k_T are nonnegative constants. Minimizing (5) enforces the four properties we seek for c_aug. More specifically, the second, third, and fourth terms penalize "errors" in the first three properties, and choosing β₂ > β₁ and k_T > k_S favors target label data over source labels. To see that the first term enforces the fourth property, note that this expression is a sum of components of the form X_ij (d_T,est,i - c_j)² and X_ij (d_S,est,i - c_j)². The constants β₁, β₂, β₃ can be used to balance the relative importance of the four properties.

The c_aug which minimizes the objective function (5) can be obtained by solving the following set of linear equations:

[\begin{matrix} L_{11} + β_{1} I_{n_{S}} & L_{12} & L_{13} \\ L_{21} & L_{22} + β_{2} I_{n_{T}} & L_{23} \\ L_{31} & L_{32} & L_{33} + β_{3} I_{|V|} \end{matrix}] c_{aug} = [\begin{matrix} β_{1} k_{S} d_{S} \\ β_{2} k_{T} d_{T} \\ β_{3} w \end{matrix}]

(6)

where the L_ij are matrix blocks of L of appropriate dimension. The system (6) is sparse because the data matrix X is sparse, and therefore large-scale problems can be solved efficiently. In applications with very limited labeled data, sentiment classifier performance can be improved by replacing L in (6) with the normalized Laplacian L_n or with a power of this matrix L_n^k.
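The step from the objective (5) to the linear system (6) is a standard first-order optimality argument, which can be written out as follows. The derivation assumes that L is symmetric (true for L = D - A) and that w has the same length as c, as the norm in (5) requires; the symbol J for the objective is introduced here only for convenience.

```latex
\begin{aligned}
J(c_{aug}) &= c_{aug}^{T} L\, c_{aug}
  + \beta_1 \lVert d_{S,est} - k_S d_S \rVert^2
  + \beta_2 \lVert d_{T,est} - k_T d_T \rVert^2
  + \beta_3 \lVert c - w \rVert^2, \\[4pt]
\tfrac{1}{2}\,\nabla J(c_{aug}) &= L\, c_{aug}
  + \begin{bmatrix}
      \beta_1 \,(d_{S,est} - k_S d_S) \\
      \beta_2 \,(d_{T,est} - k_T d_T) \\
      \beta_3 \,(c - w)
    \end{bmatrix} = 0, \\[4pt]
\Longrightarrow\quad
\bigl( L + \operatorname{diag}(\beta_1 I,\ \beta_2 I,\ \beta_3 I) \bigr)\, c_{aug}
  &= \begin{bmatrix}
      \beta_1 k_S\, d_S \\
      \beta_2 k_T\, d_T \\
      \beta_3\, w
    \end{bmatrix}.
\end{aligned}
```

Since L is positive semidefinite and the penalty terms are convex quadratics, J is convex, so any solution of this stationarity condition is a global minimizer; when β₁, β₂, β₃ are all strictly positive the coefficient matrix is positive definite and the solution is unique.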

Note that developing systematic methods for characterizing how similar the source and target domains must be to enable useful transfer learning, or for selecting an appropriate source domain given a target set of interest, remain open research problems. Some helpful guidance for these tasks is provided in [12].

We summarize the above discussion by sketching an algorithm for constructing the proposed transfer learning classifier:

Algorithm TL

1. Construct the set of equations (6), optionally replacing the graph Laplacian L with L_n^k.
2. Solve equations (6) for c_aug = [d_S,est^T d_T,est^T c^T]^T.
3. Estimate the sentiment orientation of any new document x of interest as: orient = sign(c^Tx).
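A minimal dense NumPy sketch of the three steps of Algorithm TL is given below. It is an illustration under stated assumptions rather than the implementation used in the study: the function name and toy reviews are made up, w is taken to be a length-|V| vector that is ±1 on lexicon words and 0 elsewhere, the unnormalized Laplacian is used, and a direct solver replaces the sparse methods a realistic corpus would need.

```python
import numpy as np

def algorithm_tl(X_S, d_S, X_T, d_T, w,
                 betas=(1.0, 3.0, 5.0), k_S=0.5, k_T=1.0):
    """Sketch of Algorithm TL; returns the word-polarity vector c.

    X_S, X_T : source and target document-word matrices over one vocabulary.
    d_S, d_T : +1/-1 labels of the source and target documents.
    w        : length-|V| lexicon vector, 0 for non-lexicon words (assumption).
    """
    beta1, beta2, beta3 = betas
    X = np.vstack([X_S, X_T])          # source documents first, then target
    n_S, n_T = X_S.shape[0], X_T.shape[0]
    n, m = X.shape
    A = np.block([[np.zeros((n, n)), X], [X.T, np.zeros((m, m))]])
    L = np.diag(A.sum(axis=1)) - A
    # Step 1: assemble system (6).
    penalty = np.concatenate([np.full(n_S, beta1),
                              np.full(n_T, beta2),
                              np.full(m, beta3)])
    rhs = np.concatenate([beta1 * k_S * np.asarray(d_S, dtype=float),
                          beta2 * k_T * np.asarray(d_T, dtype=float),
                          beta3 * np.asarray(w, dtype=float)])
    # Step 2: solve for c_aug = [d_S_est; d_T_est; c] and keep c.
    c_aug = np.linalg.solve(L + np.diag(penalty), rhs)
    return c_aug[n:]

# Toy data (assumption). Vocabulary: good, bad, sturdy, flimsy, crisp, blurry.
X_S = np.array([[1, 0, 1, 0, 0, 0],    # source +1: good, sturdy
                [0, 1, 0, 1, 0, 0]],   # source -1: bad, flimsy
               dtype=float)
X_T = np.array([[1, 0, 0, 0, 1, 0],    # target +1: good, crisp
                [0, 1, 0, 0, 0, 1]],   # target -1: bad, blurry
               dtype=float)
w = np.array([1, -1, 0, 0, 0, 0], dtype=float)
c = algorithm_tl(X_S, [1, -1], X_T, [1, -1], w)

# Step 3: classify a new target-domain document containing only "crisp".
new_doc = np.array([0, 0, 0, 0, 1, 0], dtype=float)
print(np.sign(c @ new_doc))
```

With β₂ > β₁ and k_T > k_S, the target-specific word "crisp" ends up with a larger weight than the source-specific word "sturdy", reflecting the intended emphasis on target labels.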

The utility of Algorithm TL is now examined through a case study involving online consumer product reviews.

6. Case Study Two: Product Reviews

This case study examines the performance of Algorithm TL for the problem of estimating sentiment of online product reviews. The data used in this study is a publicly available set of 1000 reviews of electronics products, 500 positive and 500 negative, and 1000 reviews of kitchen appliances, 500 positive and 500 negative, collected from Amazon and archived at the website [19]. The Lemur Toolkit [16] was employed to construct the data matrix X and vectors of document labels d_S and d_T from these reviews. A lexicon of 150 domain-independent sentiment-laden words was constructed manually and employed to form the lexicon vector w.

This study compares the product review sentiment classification accuracy of Algorithm TL with that of four other strategies: 1.) lexicon-only, in which the lexicon vector w is used as the classifier as summarized in Section 2, 2.) a classical NB classifier obtained from [18], 3.) a well-tuned version of the RLS classifier (1), and 4.) Algorithm SS. Algorithm TL is implemented with the following parameter values: β₁ = 1.0, β₂ = 3.0, β₃ = 5.0, k_S = 0.5, k_T = 1.0, and k = 5. A focus of the investigation is evaluating the extent to which the knowledge present in labeled reviews from a related domain, here kitchen appliances, can be transferred to a new domain for which only limited labeled data is available, in this case electronics. Thus we assume that all 1000 labeled kitchen reviews are available to Algorithm TL (the only algorithm which is designed to exploit this information), and examine training sets which incorporate a range of numbers of labeled documents from the electronics domain: n_T = 20, 50, 100, 200, 300, 400.

Sample results from this study are depicted in Figure 3. Each data point in the plots represents the average of ten trials. In each trial, the electronics reviews are randomly partitioned into 500 training and 500 test documents, and a randomly selected subset of reviews of size n_T is extracted from the 500 labeled training instances and made available to the learning algorithms. As shown in Figure 3, Algorithm TL outperforms the other four methods. Note that, in particular, the accuracy obtained with the transfer learning approach is significantly better than the other techniques when the number of labeled training documents in the target domain is small. It is expected that the ability of Algorithm TL to exploit knowledge from a related domain to quickly learn an effective sentiment classifier for a new domain will be of considerable value in security informatics applications involving social media data.

7. Case Study Three: Jakarta Hotel Bombings

On 17 July 2009 the JW Marriott and Ritz-Carlton Hotels in Jakarta, Indonesia were hit by suicide bombing attacks within five minutes of each other. A little over a week later, on 26 July 2009, a document claiming responsibility for the attacks and allegedly written by N.M. Top was posted on the blog [20]; see Figure 4 for a screenshot of a portion of this blog post. In subsequent discussions we will refer to this post as the "Top post" for convenience, with the understanding that the authorship of the post is uncertain. At this time, senior U.S. intelligence and security officials expressed interest in understanding sentiment in the region regarding the bombings and the alleged claim of responsibility by a well-known extremist [personal communications, Senior U.S. Intelligence and Security Officials, July 2009]. Among other things, officials felt that characterizing this sentiment might provide insight into Indonesian public opinion concerning violent extremist organizations.

To enable a preliminary assessment along these lines, we collected two sets of social media data related to the Top post: 1.) the ~3000 comments made to the post during the two week period immediately following the post, and 2.) several hundred posts made to other Indonesian language blogs in which the Top post was discussed. We manually labeled the sentiment of a small subset of these documents, and also translated into Indonesian the generic sentiment lexicon used in Case Study One (in this paper all language translation was performed using the tool available at http://translate.google.com). Observe that this approach to constructing a sentiment lexicon is far from perfect. However, because our proposed algorithms employ several sources of information to estimate the sentiment of content, it is expected that they will exhibit robustness to imperfections in any single data source. This study therefore offers the opportunity to explore the utility of a very simple approach to multilingual sentiment analysis: translate a small lexicon of sentiment-laden words into the language of interest and then apply Algorithm SS or Algorithm TL directly within that language (treating words as tokens). The capability to perform automated, multilingual content analysis is of substantial interest in many security-related applications.

We implemented Algorithm SS to estimate the sentiment expressed in the corpus of comments made to the Top post [20] and in the set of related discussions posted at other blogs. Sample sentiment estimation results for the comments made to the Top post are shown at the top of Figure 5. These comments are almost universally negative, condemning both the bombings and the justification for the bombings given in the Top post. Manual examination of a subset of the comments confirms the results provided by Algorithm SS. For example, a typical negative comment reads, in part (translated from the Indonesian):

" ... a savage kind - Noordin M. Top does not deserve to live in this world, he only claims to serve Islam but actually he is in truth a disbeliever. You are all cowards, you terrorists reversing Islam, and Noordin Top is just a stupid terrorist who escaped Malaysia just to inflict violence and hatred ... ."

Interestingly, by slightly reformulating the classifier it is possible to discover that while almost all comments are negative, there is a thread of comments which puts forth various conspiracies associated with the origins of the bombings and/or the Top post. The following comment is illustrative of this theme:

" ... the only explanation for the sophistication of the attacks is that the bombers were aided by the CIA or Mossad."

Sample results obtained through sentiment analysis of relevant posts made to other blogs are presented at the bottom of Figure 5. These posts also express largely negative sentiments about the Top post, although they are not as consistently negative as are the comments made directly on the Top post blog site. It should be noted that the "neutral" posts depicted in the plot at the bottom of Figure 5 are mainly news articles. Again, manual evaluation of a subset of these posts confirms the results obtained via automated classification with Algorithm SS.

8. Case Study Four: Egyptian Revolution

Beginning on 25 January 2011, a popular uprising swept across Egypt in the form of massive demonstrations and rallies, labor strikes in various sectors, and violent clashes between protestors and security forces, ultimately leading to the resignation of Egyptian President Hosni Mubarak on 11 February. National security analysts and officials have expressed interest in understanding public sentiment regarding the Egyptian revolution generally and Mubarak specifically, especially 1.) in the weeks before the protests and 2.) for different regions of the globe.

To enable a preliminary assessment along these lines, we collected three sets of blog posts which are related to Egyptian unrest and Mubarak and were posted during the two week period immediately before the protests began on 25 January: 1.) 100 Arabic posts, 2.) 100 Indonesian posts, and 3.) 100 Danish posts. We manually labeled the sentiment of a small subset of these documents, and translated into the appropriate language the generic sentiment lexicon used in Case Study One for implementation in this study (using the translation tool available at http://translate.google.com). Observe that this approach to constructing a sentiment lexicon is far from perfect. However, because lexical information is only one of the data sources used in the proposed approaches to sentiment estimation, it is expected that overall performance will exhibit robustness to imperfections in any one source of data; indeed, this study provides a simple test of this expected robustness.

We used Algorithm SS to estimate the sentiment expressed in the three sets of blog posts noted above, classifying the posts as either 'negative' or 'positive/neutral'. The analysis reveals that, while the sentiment expressed by the bloggers in the sample is largely negative toward Mubarak, the fraction of negative posts varies by post language (and thus possibly by geographic region). In particular, as shown in Figure 6, Arabic language posts are the most negative, followed by Indonesian posts, with the Danish posts in our sample actually being slightly more positive/neutral than negative. Manual inspection of a subset of the posts confirms the results provided by Algorithm SS.

9. Summary

Sentiment analysis of social media content for security informatics applications is often characterized by the existence of only modest levels of prior knowledge regarding the domain of interest and the need to rapidly adapt to new domains; consequently, standard content analysis methods typically perform poorly in this setting. This paper presents two new computational methods for inferring the sentiment expressed in social media which address these challenges. Each method formulates the task as one of text classification, models the data as a bipartite graph of documents and words, and enables prior knowledge concerning the sentiment orientation of documents or words of interest to be effectively combined with "auxiliary" information to produce accurate sentiment estimates. The first proposed method is a semi-supervised algorithm that leverages a source of supplementary data which is abundant online: unlabeled documents and words. The second algorithm is a novel transfer learning method which permits the knowledge present in data that has been previously labeled in a related domain to be transferred to a new domain. We demonstrate the utility of the proposed algorithms by showing they outperform several standard methods for the task of inferring the sentiment polarity of online reviews of movies and consumer products. Additionally, we illustrate the potential of the methods for security informatics by estimating regional public opinion regarding two security-relevant events: the 2009 Jakarta hotel bombings and the 2011 Egyptian revolution.

The proposed algorithms are complementary in that they exploit different sources of auxiliary information (each of which is frequently available in security informatics applications). For instance, Algorithm SS is able to extract useful information from unlabeled documents and words, which are typically abundant online. This capability is particularly valuable in applications for which it is difficult or expensive to acquire any form of labeled data. Alternatively, if previous analysis has produced labeled information in a domain related to the current domain of interest, Algorithm TL can be used to effectively leverage this related data; observe that this situation is common in intelligence analysis settings.

Appendix

As summarized in Section 3, solving the optimization problem

min_{c_{aug}} c_{aug}^{T} L_{n} c_{aug} + β_{1} \sum_{i = 1}^{n_{l}} {(d_{est,i} - d_{i})}^{2} + β_{2} \sum_{i = 1}^{|V_{l}|} {(c_{i} - w_{i})}^{2}

balances three objectives:

clustering: reducing c_aug^T L_n c_aug ensures that the polarity estimates for documents, d_est,i, and words, c_j, are assigned so as to reduce the magnitude of terms of the form
$X_{ij} {(\frac{d_{est,i}}{D_{1,ii}^{1 / 2}} - \frac{c_{j}}{D_{2,jj}^{1 / 2}})}^{2}$

where D_1,ii^1/2 = (∑_j X_ij)^1/2 and D_2,jj^1/2 = (∑_i X_ij)^1/2;

prior knowledge on documents: if document i is labeled d_i then the polarity estimate d_est,i should be close to this label;
prior knowledge on words: if word i is labeled w_i then the polarity estimate c_i should be close to this label.
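
To make the trade-off among these three objectives concrete, the following Python sketch solves the quadratic problem in closed form on a toy example. It is an illustration under stated assumptions rather than the implementation used in this paper: the function names, the 4 × 4 document-word matrix, and the choice β_1 = β_2 = 1 are all hypothetical, and L_n is taken to be the normalized Laplacian of the bipartite graph with adjacency matrix [[0, X], [X^T, 0]]. Setting the gradient of the objective to zero gives the linear system (L_n^k + B) c_aug = B y, where B is a diagonal matrix holding β_1 or β_2 at labeled entries and y holds the labels; the optional exponent k anticipates the smoothed variant discussed in the remainder of this Appendix.

```python
import numpy as np


def normalized_bipartite_laplacian(X):
    """Return L_n = I - D^{-1/2} A D^{-1/2}, with A = [[0, X], [X^T, 0]].

    Assumes every document (row) and word (column) of X has a nonzero entry.
    """
    n, v = X.shape
    A = np.block([[np.zeros((n, n)), X], [X.T, np.zeros((v, v))]])
    deg = A.sum(axis=1)  # document degrees D_1,ii followed by word degrees D_2,jj
    return np.eye(n + v) - A / np.sqrt(np.outer(deg, deg))


def estimate_polarity(X, doc_labels, word_labels, beta1=1.0, beta2=1.0, k=1):
    """Minimize c^T L_n^k c + beta1 * sum (d_est_i - d_i)^2 + beta2 * sum (c_j - w_j)^2.

    doc_labels and word_labels map a document or word index to a label (+1 or -1).
    Returns the document estimates d_est and the word estimates c.
    """
    n, v = X.shape
    L_k = np.linalg.matrix_power(normalized_bipartite_laplacian(X), k)
    penalty = np.zeros(n + v)  # diagonal of B
    target = np.zeros(n + v)   # label vector y (zero where unlabeled)
    for i, d in doc_labels.items():
        penalty[i], target[i] = beta1, d
    for j, w in word_labels.items():
        penalty[n + j], target[n + j] = beta2, w
    # Stationarity condition of the quadratic objective: (L_n^k + B) c_aug = B y.
    c_aug = np.linalg.solve(L_k + np.diag(penalty), penalty * target)
    return c_aug[:n], c_aug[n:]


# Hypothetical toy data: 4 documents x 4 words; documents 0 and 3 and
# words 0 and 3 carry labels, the remaining nodes are unlabeled.
X = np.array([[2, 1, 0, 0],
              [1, 2, 1, 0],
              [0, 1, 2, 1],
              [0, 0, 1, 2]], dtype=float)
d_est, c_est = estimate_polarity(X, {0: +1, 3: -1}, {0: +1, 3: -1})
print(np.sign(d_est))
```

In this toy case the unlabeled documents 1 and 2 inherit the polarity of the labeled documents and words to which they are most strongly connected. The linear system has a unique solution provided each connected component of the graph contains at least one labeled node.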

It is shown in this paper that this procedure enables accurate sentiment classifiers to be learned. However, in situations in which very little labeled data is available, this approach can produce numerous isolated polarity clusters around labeled instances on the graph G_b, resulting in over-fitted solutions with little power for generalization. Here we show that replacing the term ϕ = c_aug^T L_n c_aug with ϕ_k = c_aug^T L_n^k c_aug, where k is a positive integer, smooths the sentiment polarity estimates on G_b and resolves this difficulty.

Note first that L_n can be expressed as

L_{n} = \sum_{i = 1}^{n + | V |} λ_{i} z_{i} z_{i}^{T},

where (λ_i, z_i) are eigenvalue-eigenvector pairs for L_n, and similarly that L_n^k can be written

L_{n}^{k} = \sum_{i = 1}^{n + | V |} λ_{i}^{k} z_{i} z_{i}^{T} .

Next observe that the quantity

ϕ = c_{aug}^{T} L_{n} c_{aug} = \sum_{i} \sum_{j} X_{ij} {(\frac{d_{est,i}}{D_{1,ii}^{1 / 2}} - \frac{c_{j}}{D_{2,jj}^{1 / 2}})}^{2}

measures the smoothness of the document-word polarity assignment specified by c_aug. If the eigenvalues λ_i are ordered so that 0 = λ₁ ≤ λ₂ ≤ ... ≤ λ_n+|V| then, because z_j^T L_n z_j = λ_j, it is seen that the eigenvectors z_i of L_n are ordered by their smoothness. Then, since {z₁, ..., z_n+|V|} is an orthonormal basis (L_n is symmetric), c_aug can be expanded as c_aug = ∑_i α_i z_i and ϕ becomes ϕ = ∑_i λ_i α_i².

Analogously, ϕ_k = c_aug^T L_n^k c_aug can be written

\begin{gathered} ϕ_{k} = {(\sum_{i} α_{i} z_{i})}^{T} (\sum_{j} λ_{j}^{k} z_{j} z_{j}^{T}) (\sum_{m} α_{m} z_{m}) \\ = \sum_{i} λ_{i}^{k} α_{i}^{2}, \end{gathered}

where the cross terms vanish because z_i^T z_j = 0 for i ≠ j and z_i^T z_i = 1.

It follows that minimizing ϕ_k instead of ϕ results in smoother polarity specifications on G_b, because the former imposes a more aggressive penalization of polarity assignments that include the larger (and less smooth) eigenvalue terms.
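
The spectral identity ϕ_k = ∑_i λ_i^k α_i² can be checked numerically. The short Python sketch below does so for a small, randomly generated document-word matrix; the matrix, its dimensions, the random seed, and the helper names are assumptions made purely for illustration and are not data from this paper. The normalized Laplacian is built as in the definition of L_n for the bipartite graph with adjacency matrix [[0, X], [X^T, 0]].

```python
import numpy as np

rng = np.random.default_rng(0)
n, v = 5, 6  # hypothetical numbers of documents and words
X = rng.integers(1, 4, size=(n, v)).astype(float)  # strictly positive counts

# Normalized Laplacian of the bipartite document-word graph.
A = np.block([[np.zeros((n, n)), X], [X.T, np.zeros((v, v))]])
deg = A.sum(axis=1)
L_n = np.eye(n + v) - A / np.sqrt(np.outer(deg, deg))

# eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
lam, Z = np.linalg.eigh(L_n)

c_aug = rng.standard_normal(n + v)  # arbitrary polarity assignment
alpha = Z.T @ c_aug                 # coefficients in c_aug = sum_i alpha_i z_i


def phi_direct(k):
    """phi_k computed from its definition, c_aug^T L_n^k c_aug."""
    return float(c_aug @ np.linalg.matrix_power(L_n, k) @ c_aug)


def phi_spectral(k):
    """phi_k computed from the eigen-expansion, sum_i lam_i^k alpha_i^2."""
    return float(np.sum(lam ** k * alpha ** 2))


for k in (1, 2, 3):
    print(k, phi_direct(k), phi_spectral(k))
```

The two columns printed for each k agree to numerical precision. Because the eigenvalues of a normalized Laplacian lie in the interval [0, 2], raising them to the power k shrinks the weights of components with λ_i < 1 and enlarges those with λ_i > 1, which is the sharper relative penalty on the less smooth components described above.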

References

[1] US Committee on Homeland Security and Government Affairs, Violent Extremism, the Internet, and the Homegrown Terrorism Threat 2008.
[2] Bergin A, Osman S, Ungerer C, Yasin N: Countering Internet Radicalization in Southeast Asia. ASPI Special Report 2009.
[3] Chen H, Yang C, Chau M, Li S (Eds): Intelligence and Security Informatics. Lecture Notes in Computer Science, Springer, Berlin; 2009.
[4] Proc 2010 IEEE International Conference on Intelligence and Security Informatics, Vancouver. 2010.
[5] Colbaugh R, Glass K: Early warning analysis for social diffusion events. Proc 2010 IEEE International Conference on Intelligence and Security Informatics, Vancouver 2010.
[6] Pang B, Lee L: Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2008, 2: 1–135. 10.1561/1500000011
[7] Dhillon I: Co-clustering documents and words using bipartite spectral graph partitioning. In Proc ACM International Conference on Knowledge Discovery and Data Mining. San Francisco; 2001.
[8] Kim S, Hovy E: Determining the sentiment of opinions. Proc International Conference on Computational Linguistics 2004.
[9] Sindhwani V, Melville P: Document-word co-regularization for semi-supervised sentiment analysis. Proc 2008 IEEE International Conference on Data Mining, Pisa 2008.
[10] Colbaugh R, Glass K: Estimating sentiment orientation in social media for intelligence monitoring and analysis. Proc 2010 IEEE International Conference on Intelligence and Security Informatics, Vancouver 2010.
[11] Pan S, Yang Q: A survey on transfer learning. IEEE Trans Knowledge and Data Engineering 2010, 22: 1345–1359.
[12] Blitzer J, Dredze M, Pereira F: Biographies, bollywood, boom-boxes, and blenders: Domain adaptation for sentiment classification. Proc 45th Annual Meeting of the ACL, Prague 2007.
[13] He J, Liu Y, Lawrence R: Graph-based transfer learning. In Proc 18th ACM Conference on Information and Knowledge Management. Hong Kong; 2009.
[14] Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. Second edition. Springer, New York; 2009.
[15] http://www.cs.cornell.edu/People/pabo/movie-review-data/, accessed 2009.
[16] http://www.lemurproject.org/, accessed Dec 2009.
[17] Ramakrishnan G, Jadhav A, Joshi A, Chakrabarti S, Bhattacharyya P: Question answering via Bayesian inference on lexical relations. In Proc 41st Annual Meeting of the ACL. Sapporo; 2003.
[18] http://www.borgelt.net/bayes.html, accessed Dec. 2009.
[19] http://www.cs.jhu.edu/~mdredze/, accessed Dec. 2010.
[20] http://www.mediaislam-bushro.blogspot.com, accessed Aug. 2009.

Acknowledgements

This work was supported by the U.S. Department of Defense and the Laboratory Directed Research and Development Program at Sandia National Laboratories. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Author information

Authors and Affiliations

Institute for Complex Additive Systems Analysis, New Mexico Institute of Mining and Technology, Socorro, USA
Kristin Glass
Analytics and Cryptography Department, Sandia National Laboratories, Albuquerque, USA
Richard Colbaugh

Corresponding author

Correspondence to Kristin Glass.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

KG and RC designed the research, KG and RC developed the computational algorithms, KG conducted the empirical tests, and RC wrote the paper. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Glass, K., Colbaugh, R. Estimating the sentiment of social media content for security informatics applications. Secur Inform 1, 3 (2012). https://doi.org/10.1186/2190-8532-1-3

Received: 02 September 2011
Accepted: 27 February 2012
Published: 27 February 2012
DOI: https://doi.org/10.1186/2190-8532-1-3
