As explained in the previous section, we have conducted a number of different experiments to understand to what extent timeprints can be used for author identification and for alias matching. In this section we describe the individual experiments in further detail and present the obtained results.
Experiments on author identification
In our first experiment, we have varied the number of potential authors from 200 to 1000 in steps of 200, as explained above. First, we randomly divided each user's posts among his or her five sub-users. We then used tenfold cross-validation in which a Naive Bayes classifier and a support vector machine were trained and evaluated, and compared the accuracies obtained by the two classifiers. The results from this experiment are reported in Fig. 3 using the labels NBRand and SVMRand. Next, we repeated the same experiment, but this time the posts were distributed sequentially among the five sub-users rather than randomly. These results are reported in Fig. 3 using the labels NBSeq and SVMSeq.

Why study the effect of how the sub-users are constructed? The explanation is that it is hard to know exactly how someone would make use of several aliases. Would they first make a post using one alias, then switch to a second, and so on? Would they first write a large number of posts using one account before switching to the next? The answer is probably neither, and the exact behaviour most likely differs between individuals and between the purposes for which multiple aliases are used in the first place. For this reason, the sequential and random splits of posts are intended as two extreme points: the sequential division of posts into sub-users is intended to give an upper bound on the classification accuracy that can be expected when using timeprints, while the random division is more challenging for the classifier and can be seen as a lower bound on the classification accuracy that can be obtained. In reality, we would expect the accuracy to lie somewhere between these upper and lower bounds, depending on the individual and what she uses the multiple aliases for.
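The two ways of constructing sub-users can be sketched as follows; the function and parameter names are our own illustration, not taken from the implementation used in the experiments:

```python
import random

def split_into_sub_users(posts, n_sub_users=5, mode="random", seed=None):
    """Split one user's posts into n_sub_users artificial sub-users.

    mode="sequential" keeps the original posting order (the easier,
    upper-bound setting), while mode="random" shuffles the posts first
    (the harder, lower-bound setting).
    """
    posts = list(posts)
    if mode == "random":
        random.Random(seed).shuffle(posts)
    size = len(posts) // n_sub_users
    return [posts[i * size:(i + 1) * size] for i in range(n_sub_users)]
```

A real user with multiple aliases would most likely fall somewhere between these two extremes, which is exactly why both settings are reported.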
As can be seen when studying the results in Fig. 3, both classifiers perform well on the classification task, especially the SVM classifier. For 200 users the SVM classifier is achieving \(100\,\%\) accuracy on the sequentially generated data and the NB classifier is not far away from this result either. On larger problem instances, the SVM classifier is consistently outperforming the NB classifier with approximately 5–20\(\,\%\) higher accuracy, but this comes with a price. The training and evaluation phase of the NB classifier took a few minutes while the last steps took days to perform for the SVM classifier on the standard computer we used for the experiments, due to the large number of classes.
As expected, higher accuracies are achieved when distributing posts sequentially rather than randomly among the sub-users. The reason is that features such as Month will have very similar relative frequencies among sub-users corresponding to the same user when the posts are divided sequentially. In a setting where people use several accounts sequentially (such as when having a “discussion” between two or more alter egos) the sequential approach makes sense, while it is probably less realistic for more ordinary use of multiple aliases. For this reason, we distribute posts among the sub-users randomly in the rest of the experiments, even though this is probably closer to a lower bound on the accuracy.
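To make the role of a feature such as Month concrete, a timeprint entry of this kind is simply the relative frequency of posts with a given time property. A minimal sketch of the Month feature group (our own illustration, assuming the relative-frequency encoding described earlier) is:

```python
from collections import Counter
from datetime import datetime

def month_features(timestamps):
    """Relative frequency of posts in each calendar month, one of the
    feature groups in a timeprint. The classifier compares sub-users
    on such relative frequencies."""
    counts = Counter(t.month for t in timestamps)
    n = len(timestamps)
    return [counts.get(m, 0) / n for m in range(1, 13)]
```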
In this experiment, the SVM selects the correct user in more than nine out of ten cases, both with the random and the sequential approach to creating the timeprints. As can be seen, the accuracy remains above 90 % for the SVM classifier even when the number of users is increased to 1000. The results are somewhat worse for the NB classifier, but are still impressive given the simple nature of the classifier. These results imply that time features can be very useful for author identification when large amounts of data are available. They are significantly higher than those obtained for author identification with textual (stylometric) features on a forum dataset reported in [11]. It should however be noted that the two sets of experiments use different forum datasets, which makes a fair comparison of the results difficult.
In order to find out which of the time-based features are most important for the achieved classification performance, we have applied information gain, an entropy-based feature selection method [30]. The results vary somewhat depending on the number of users the measure is applied to, but in general the attributes related to Period of Day (such as Night and Morning) receive the highest average ranks, followed by Type of Day (i.e., weekend or weekday). After these follow Months and Days, while the set of attributes that seems least useful is Hour of Day.
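Information gain ranks each feature by how much knowing its (discretised) value reduces the entropy of the class label. A minimal sketch of the standard entropy-based formulation (our own illustration, not the exact tooling used in the experiments):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Information gain H(Y) - H(Y | X) of a discretised feature X
    with respect to the class labels Y."""
    n = len(labels)
    groups = {}
    for y, x in zip(labels, feature_values):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - conditional
```

Features are then sorted by this score: a feature that perfectly separates the classes receives the full class entropy as its gain, while an uninformative feature receives a gain of zero.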
An important part of the explanation for the decrease in accuracy when the number of potential authors is increased is obviously that the classifiers have more candidates to choose among, but a contributing factor may also be that there is less data for the users further down the list (since they are ordered by their number of posts). To get a better understanding of the impact the number of posts has on the results, we have in a second experiment modified the original dataset so that we start out by randomly selecting only 100 posts for each of the top-200 users. Since each user is decomposed into five sub-users, this means that each timeprint vector is built from only 20 posts. The number of users is kept fixed at 200 while we increase the number of randomly selected posts in steps of 100 until reaching 1400 posts (since the 200th user has written a total of 1484 posts). In effect, this experiment simulates a setting where only a restricted amount of data is available for creating the timeprints. When adjusting the experiment in this manner, the results shown in Fig. 4 were obtained.
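The limited-data setting can be sketched as follows (the names are our own; the training and evaluation of the classifiers is as in the first experiment):

```python
import random

def limited_data_sample(user_posts, n_posts, n_sub_users=5, rng=None):
    """Randomly keep n_posts of a user's posts and split them evenly
    into n_sub_users sub-users, so that each timeprint is built from
    n_posts // n_sub_users posts."""
    rng = rng or random.Random(0)
    sampled = rng.sample(list(user_posts), n_posts)
    size = n_posts // n_sub_users
    return [sampled[i * size:(i + 1) * size] for i in range(n_sub_users)]

# Post counts used in the experiment: 100, 200, ..., 1400.
post_counts = list(range(100, 1401, 100))
```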
As can be seen in this figure, the accuracy increases for both classifiers as the number of posts is increased, and the SVM classifier consistently performs better than the NB classifier. When only 100 posts are randomly selected for each of the 200 users, the accuracy is below 20 % for both classifiers. When the number of posts is increased to 500, the SVM classifier reaches an accuracy above 80 %, while the NB classifier is just above 55 %. The SVM classifier exceeds 95 % accuracy when the number of posts is around 800 and then continues to increase slowly. The figure gives a good idea of how much data is needed to reach a specific accuracy level for the classifiers, but note that although the shape of the curves is similar for other numbers of users, the exact values will differ.
Experiment on alias matching
In our alias matching experiment, we have selected the top-4000 users from the year 2007 and randomly split each user's posts into two equally sized sub-users, as explained in “Experimental setup”. For each sub-user we have then created three feature vectors from its posts: (1) a timeprint vector, (2) a stylometric-based vector, and (3) a combination of the timeprint and the stylometric-based feature vectors. The results from this experiment are summarized in Fig. 5.

As can be seen, both time-based and stylometric-based features work very well for a limited number of users, but as the number of users increases, the decrease in performance is less steep for the timeprints than for the stylometric-based features. Looking at the top-1 rankings (with top-3 rankings within parentheses), time alone yields \(100\,\%\) (\(100\,\%\)) for 500 users, \(89.6\,\%\) (\(94.4\,\%\)) for 2000 users, and \(70.6\,\%\) (\(79.0\,\%\)) for 4000 users. The corresponding results for stylometric features are \(98.2\,\%\) (\(99.2\,\%\)) for 500 users, \(71.4\,\%\) (\(78.5\,\%\)) for 2000 users, and \(44.6\,\%\) (\(51.7\,\%\)) for 4000 users. Combining time-based and stylometric-based features seems to give better performance than using either feature set on its own, and this effect seems to grow as the number of users increases. More specifically, the combination yields \(100\,\%\) (\(100\,\%\)) for 500 users, \(93.0\,\%\) (\(96.6\,\%\)) for 2000 users, and \(75.4\,\%\) (\(82.8\,\%\)) for 4000 users. Based on these results, we conclude that time-based features are very powerful on their own for alias matching, and that combining them with stylometric-based features allows for even (statistically significantly) better results.
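The ranking step of the alias matching can be sketched as follows; the similarity measure (here cosine) and the combination of feature sets by concatenation are our assumptions for illustration, not necessarily the exact choices made in the experiment:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(query, candidates, k=3):
    """Return the indices of the k candidate sub-users most similar to
    the query sub-user, most similar first. A top-1 (top-3) hit means
    the true matching sub-user is the first (among the first three)."""
    order = sorted(range(len(candidates)),
                   key=lambda i: cosine(query, candidates[i]),
                   reverse=True)
    return order[:k]

def combine(time_vec, style_vec):
    """One simple way to combine the two feature sets: concatenation."""
    return list(time_vec) + list(style_vec)
```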