This section develops the second proposed sentiment estimation algorithm for social media content. Many security informatics applications are characterized by the presence of limited labeled data for the domain of interest but ample labeled data for a related domain. For instance, an analyst may wish to ascertain the sentiment of online discussions about an emerging topic of interest, and may have in hand a large set of labeled examples of positive and negative posts regarding other topics (e.g., from previous studies). In this setting it is natural to adopt a transfer learning approach, in which knowledge concerning the way sentiment is expressed in one domain, the so-called source domain, is transferred to permit sentiment estimation in a new target domain. In what follows we present a new bipartite graph-based approach to transfer learning for sentiment analysis.
Assume that the initial problem data consists of a corpus of n = n_T + n_S documents, where n_T is the (small) number of labeled documents available for the target domain of interest and n_S >> n_T is the number of labeled documents from some related source domain; in addition, suppose that a modest lexicon V_l of sentiment-laden words is known. Let this label data be encoded as vectors d_T ∈ ℜ^{n_T}, d_S ∈ ℜ^{n_S}, and w ∈ ℜ^{|V|}, respectively. Denote by d_T,est ∈ ℜ^{n_T}, d_S,est ∈ ℜ^{n_S}, and c ∈ ℜ^{|V|} the vectors of estimated sentiment orientations for the target documents, the source documents, and the words, and define the augmented classifier as c_aug = [d_S,est^T d_T,est^T c^T]^T ∈ ℜ^{n+|V|}. Note that the quantity c_aug is introduced for notational convenience in the subsequent development and is not directly employed for classification.
In what follows we derive an algorithm for learning c_aug, and therefore c, by solving an optimization problem involving the labeled source and target training data, and then use c to estimate the sentiment of any new document x of interest via the simple linear classifier orient = sign(c^T x). This classifier is referred to as transfer learning-based because c is learned, in part, by transferring knowledge about the way sentiment is expressed from a domain which is related to (but need not be identical to) the domain of interest.
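As a small concrete illustration of this classification rule, the Python sketch below applies orient = sign(c^T x) to a document represented as a bag-of-words (or tf-idf) vector over the vocabulary V; the function and variable names are illustrative and not part of the original formulation.

```python
import numpy as np

def classify_document(c, x):
    """Estimate the sentiment orientation of one document.

    c : length-|V| vector of learned word sentiment polarities
    x : length-|V| bag-of-words (or tf-idf) feature vector for the document
    Returns +1 (positive) or -1 (negative); 0 can occur only if c^T x = 0.
    """
    return int(np.sign(c @ x))
```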
We wish to learn an augmented classifier c_aug with the following four properties: 1.) if a source document is labeled, then the corresponding entry of d_S,est should be close to this ±1 label; 2.) if a target document is labeled, then the corresponding entry of d_T,est should be close to this ±1 label, and the information encoded in d_T should be emphasized relative to that in the source labels d_S; 3.) if a word is in the sentiment lexicon, then the corresponding entry of c should be close to this ±1 sentiment polarity; and 4.) if an edge of Gb connecting a document x and a word v ∈ V has significant weight X_ij, then the estimated polarities of x and v should be similar.
The four objectives listed above may be realized by solving the following minimization problem:
min_{c_aug} F(c_aug) = c_aug^T L c_aug + β_1 ‖d_S,est − d_S‖² + β_2 ‖d_T,est − d_T‖² + β_3 ‖c − w‖²    (5)
where L = D − A is the graph Laplacian matrix for Gb, as before, with the edges connecting source (target) documents to words weighted by k_S X_ij (k_T X_ij), and β_1, β_2, β_3, k_S, and k_T are nonnegative constants. Minimizing (5) enforces the four properties we seek for c_aug. More specifically, the second, third, and fourth terms penalize "errors" in the first three properties, and choosing β_2 > β_1 and k_T > k_S favors target label data over source labels. To see that the first term enforces the fourth property, note that this expression is a sum of components of the form k_T X_ij (d_T,est,i − c_j)² and k_S X_ij (d_S,est,i − c_j)². The constants β_1, β_2, β_3 can be used to balance the relative importance of the four properties.
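To make this construction concrete, the sketch below assembles the Laplacian L = D − A of the bipartite document-word graph Gb in sparse form, folding the constants k_S and k_T into the source and target document-word edge weights as assumed in the reconstruction of (5) above. The use of scipy.sparse and the function name bipartite_laplacian are illustrative choices, not prescribed by the text.

```python
import numpy as np
import scipy.sparse as sp

def bipartite_laplacian(X_S, X_T, k_S, k_T):
    """Build L = D - A for the bipartite graph Gb linking documents to words.

    X_S : (n_S x |V|) source document-term matrix
    X_T : (n_T x |V|) target document-term matrix
    k_S, k_T : nonnegative scalars weighting source and target edges
               (k_T > k_S emphasizes the target domain).
    Rows/columns are ordered [source docs, target docs, words], matching c_aug.
    """
    W = sp.vstack([k_S * sp.csr_matrix(X_S), k_T * sp.csr_matrix(X_T)])  # scaled doc-word weights
    A = sp.bmat([[None, W], [W.T, None]], format="csr")                  # adjacency of Gb
    D = sp.diags(np.asarray(A.sum(axis=1)).ravel())                      # degree matrix
    return (D - A).tocsr()
```

With this weighting, the quadratic form c_aug^T L c_aug expands into exactly the weighted components k_S X_ij (d_S,est,i − c_j)² and k_T X_ij (d_T,est,i − c_j)² discussed above.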
The c_aug which minimizes the objective function (5) can be obtained by solving the following set of linear equations:
[ L_11 + β_1 I      L_12              L_13           ] [ d_S,est ]   [ β_1 d_S ]
[ L_21              L_22 + β_2 I      L_23           ] [ d_T,est ] = [ β_2 d_T ]    (6)
[ L_31              L_32              L_33 + β_3 I   ] [ c       ]   [ β_3 w   ]
where the L_ij are matrix blocks of L of appropriate dimension and I denotes an identity matrix of compatible size. The system (6) is sparse because the data matrix X is sparse, and therefore large-scale problems can be solved efficiently. In applications with very limited labeled data, sentiment classifier performance can be improved by replacing L in (6) with the normalized Laplacian L_n or with a power L_n^k of this matrix.
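Under the same assumptions, a sparse solver for (6) might look as follows; it takes the label vectors d_S, d_T, and w with zero entries for unlabeled items, and builds on the bipartite_laplacian helper sketched earlier. To use the variant just mentioned, one would pass the normalized Laplacian L_n (or a power of it) in place of L.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def solve_tl_system(L, d_S, d_T, w, beta1, beta2, beta3):
    """Solve (6) in the compact form (L + B) c_aug = B y, with
    B = diag(beta1*I, beta2*I, beta3*I) and y = [d_S; d_T; w].

    Entries of d_S, d_T, w are +/-1 for labeled items and 0 otherwise.
    Returns (d_S_est, d_T_est, c)."""
    n_S, n_T, n_V = len(d_S), len(d_T), len(w)
    B = sp.diags(np.concatenate([np.full(n_S, beta1),
                                 np.full(n_T, beta2),
                                 np.full(n_V, beta3)]))
    y = np.concatenate([d_S, d_T, w])
    c_aug = spsolve((L + B).tocsc(), B @ y)   # sparse direct solve of (6)
    return c_aug[:n_S], c_aug[n_S:n_S + n_T], c_aug[n_S + n_T:]
```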
Note that developing systematic methods for characterizing how similar the source and target domains must be to enable useful transfer learning, or for selecting an appropriate source domain given a target set of interest, remains an open research problem. Some helpful guidance for these tasks is provided in [12].
We summarize the above discussion by sketching an algorithm for constructing the proposed transfer learning classifier:
Algorithm TL
1. Construct the set of equations (6), possibly replacing the graph Laplacian L with L_n^k.
2. Solve equations (6) for c_aug = [d_S,est^T d_T,est^T c^T]^T.
3. Estimate the sentiment orientation of any new document x of interest as orient = sign(c^T x).
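A toy end-to-end run of Algorithm TL, using the helper functions sketched earlier (bipartite_laplacian and solve_tl_system), is given below; all data, dimensions, and parameter values are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_S, n_T, n_V = 200, 10, 50                                     # many source labels, few target labels
X_S = rng.random((n_S, n_V)) * (rng.random((n_S, n_V)) < 0.1)   # sparse source document-term matrix
X_T = rng.random((n_T, n_V)) * (rng.random((n_T, n_V)) < 0.1)   # sparse target document-term matrix
d_S = rng.choice([-1.0, 1.0], size=n_S)                         # +/-1 source labels
d_T = rng.choice([-1.0, 1.0], size=n_T)                         # +/-1 target labels
w = np.zeros(n_V)
w[:5], w[5:10] = 1.0, -1.0                                      # small +/- sentiment lexicon

# Step 1: construct the system (6), emphasizing target data (k_T > k_S)
L = bipartite_laplacian(X_S, X_T, k_S=1.0, k_T=5.0)
# Step 2: solve (6) for c_aug = [d_S,est; d_T,est; c]
_, _, c = solve_tl_system(L, d_S, d_T, w, beta1=1.0, beta2=5.0, beta3=1.0)
# Step 3: classify a new document x via orient = sign(c^T x)
x_new = X_T[0]
orient = int(np.sign(c @ x_new))
```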
The utility of Algorithm TL is now examined through a case study involving online consumer product reviews.