Collective Aspects of Privacy in the Twitter social network

With growing participation on social media, privacy concerns have risen to unprecedented levels. It has become extremely important to give individuals full control over their private information. Popular mobile applications integrated with Online Social Networks (OSNs) can access users’ private information, such as their contact lists. This might allow OSNs to create shadow profiles of non-users from the data of existing users. We test this hypothesis for the first time on Twitter and evaluate how well the location and biographical vector of a user can be predicted from the information given by friends who created a Twitter profile before that user.

Dataset

To get an unbiased dataset, we collected 1,017 random Twitter users, which we call ego users, using a random digit search method. We obtained their metadata and filtered out spam users and celebrities by keeping only accounts whose follower-to-friend ratio lies between 0.1 and 10. To maintain homogeneity, we kept only users who have English set as the language of their Twitter account, and then obtained their timelines (up to 3,200 tweets). We identified the users mentioned at least 4 times by an ego user and used these links as an approximation of the underlying social network between Twitter users that is revealed when users share their contact lists through mobile phone apps. This yielded a dataset of 68,447 alter users.
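The two filtering rules above can be sketched in a few lines of Python. This is only an illustration: the function names, the inclusive bounds, and the handling of accounts with zero friends are our assumptions here, not details taken from the paper.

```python
from collections import Counter


def keep_user(followers, friends, low=0.1, high=10.0):
    """Return True if the follower-to-friend ratio lies within [low, high]."""
    if friends == 0:
        # The ratio is undefined; dropping such accounts is an assumption.
        return False
    return low <= followers / friends <= high


def alters_from_mentions(mentions, min_mentions=4):
    """Return the accounts that one ego user mentioned at least `min_mentions` times.

    `mentions` is a flat list of mentioned screen names taken from the
    ego user's timeline (hypothetical input format).
    """
    counts = Counter(mentions)
    return {user for user, n in counts.items() if n >= min_mentions}
```

For example, `alters_from_mentions(["a"] * 4 + ["b"] * 3)` keeps only `"a"`, since `"b"` falls below the threshold of 4 mentions.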

User Analysis

We identified the location of our users from their geotagged tweets and from the location they provided in their Twitter bio. We normalized these locations using Google’s Geocoding API and identified the city, state, and country pertaining to each location. This way, we were able to locate 630 ego users and 38,936 alter users in our dataset. We further mapped each location to a unique set of geo-coordinates.

Figure 1 shows the locations of users in the dataset. Users come from a wide variety of countries but are generally located in countries where Twitter adoption is high, and users in similar locations are more strongly connected to each other.

Figure 1: User locations in the Twitter social network

We processed the Twitter bio of each user by removing stop words and converting the tokens into stems. We considered only users who had at least 3 tokens in their bio, which left 49,576 alters and 676 ego users. On this text, we applied a pre-trained 100-dimensional Doc2Vec model and then reduced each vector to its two most informative dimensions with Principal Component Analysis.

The Twitter API also provides the source of each tweet, which identifies how the tweet was produced. We mark all the alters that produced at least one tweet with the source “Twitter for iPhone” or “Twitter for Android” as “disclosing alters”, since they used a mobile application that accesses mobile contact lists. This way we obtain 934 ego users and 53,724 alters, which amounts to 78% of our dataset.
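The marking rule is simple enough to show as a short sketch. We assume here that the source of each tweet has already been reduced to the plain client name; the raw field returned by the API may wrap that name in an HTML link, which would need to be stripped first. The function name is hypothetical.

```python
# Client names that mark a user as "disclosing", as listed in the text.
MOBILE_SOURCES = {"Twitter for iPhone", "Twitter for Android"}


def is_disclosing(tweet_sources):
    """Return True if at least one tweet came from an official mobile client.

    `tweet_sources` is an iterable of plain client names, one per tweet
    (an assumed, pre-cleaned input format).
    """
    return any(source in MOBILE_SOURCES for source in tweet_sources)
```

A user with no tweets, or with tweets only from other clients, is treated as non-disclosing under this rule.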

Shadow Profile Problem

Figure 2: Twitter data and the shadow profiles problem

For each ego user, we identified the preceding alters, that is, the alters that had joined Twitter before the ego user. Some of these alters disclosed their contact lists (red) and others did not (blue); see Figure 2. The shadow profile problem consists of inferring the personal information of the ego user based only on the information given by disclosing preceding alters, ignoring all data from non-disclosing preceding alters and from alters that joined Twitter after the ego user.
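As a rough sketch, the selection of usable alters for one ego user could look as follows. The record layout (dictionaries with `id`, `joined`, and `disclosing` keys) is a hypothetical format chosen for illustration.

```python
from datetime import date


def disclosing_preceding_alters(ego_joined, alters):
    """Return the ids of alters that joined before the ego user and are disclosing.

    `alters` is a list of dicts with the hypothetical keys
    'id', 'joined' (a date), and 'disclosing' (a bool).
    """
    return [
        alter["id"]
        for alter in alters
        if alter["joined"] < ego_joined and alter["disclosing"]
    ]


# Small usage example with made-up records.
example_alters = [
    {"id": "a1", "joined": date(2010, 5, 1), "disclosing": True},   # kept
    {"id": "a2", "joined": date(2010, 5, 1), "disclosing": False},  # non-disclosing
    {"id": "a3", "joined": date(2014, 1, 1), "disclosing": True},   # joined later
]
selected = disclosing_preceding_alters(date(2012, 3, 15), example_alters)
```

Only `a1` survives both conditions in this example; everything else is ignored by the predictors.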

To predict the location of ego users, we took the locations of all disclosing alters and identified the most frequent city among them, i.e., the modal predictor. We used this location as the unsupervised prediction of location, to be compared against the ground-truth location of the ego user. We evaluated the quality of the prediction by measuring the Haversine distance in km between the predicted point and the ground truth, which is our error distance. We predicted the biographical vector of each ego user as the average vector of its disclosing alters and evaluated this prediction through the cosine similarity between the predicted and ground-truth vectors. A high similarity therefore means a high accuracy of the predictor.
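A minimal sketch of both predictors and their evaluation measures, using only the Python standard library. The function names, the Earth radius constant, and the tie-breaking behaviour of the modal predictor are our assumptions for illustration.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def modal_city(cities):
    """Most frequent city among the alters; ties go to the city seen first."""
    return Counter(cities).most_common(1)[0][0]


def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

The location error for one ego user is then `haversine_km` between the coordinates of `modal_city(...)` and the true coordinates, and the biography score is `cosine_similarity(mean_vector(...), true_vector)`.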

We evaluated both predictors against a Random Null Model, which took a uniform random sample of all users for prediction. For each prediction, we generated 100 instances of the Null Model and took the average result over those 100 predictions.
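One way to read the Null Model is sketched below. We assume that each instance draws as many users, uniformly at random and without replacement, as the ego user has disclosing alters, and that the same predictor and scoring function are then applied. The sample size rule and all names are assumptions, not details confirmed by the text above.

```python
import random
from statistics import mean


def null_model_score(all_users, sample_size, predict, score, n_instances=100, seed=None):
    """Average score of `predict` over random samples drawn from all users.

    predict: maps a list of sampled users to a prediction.
    score:   maps a prediction to a number (an error or a similarity).
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_instances):
        sample = rng.sample(all_users, sample_size)  # uniform, without replacement
        results.append(score(predict(sample)))
    return mean(results)
```

Passing the predictor and the scoring function as arguments lets the same routine serve both the location error and the biography similarity.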

Location Prediction

In Figure 3, the left panel shows the cumulative distribution function (CDF) of the prediction error of user locations when using only the data of disclosing alters. Black lines indicate empirical errors and the blue line depicts the errors in the Null Model, revealing that empirical errors (median = 68.7 km) are much lower than the Null Model errors (median = 6308.9 km). The right panel shows the regression profile of the empirical error versus the number of disclosing alters on Twitter. The line shows the model estimate and the shaded area its standard error. Prediction error decreases with the number of disclosing alters.

To make a stronger case for a realistic scenario, we used the fact that not all Twitter users have the Twitter mobile application installed or have given it access to their contacts. We therefore made predictions given a probability ρ that a user shares his/her data. For each alter, we drew a random number between 0 and 1 and compared it with our selected probability ρ, keeping the alter only if the number was below ρ. For a particular ρ, this left roughly ρ × 100% of the alters available for prediction.
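The sampling step can be sketched as follows, with each alter kept independently with probability ρ. The function name and the optional `rng` argument are our additions for reproducibility.

```python
import random


def sample_disclosing(alters, rho, rng=None):
    """Keep each alter independently with probability `rho`."""
    rng = rng if rng is not None else random.Random()
    # rng.random() is uniform on [0, 1), so rho = 1 keeps every alter
    # and rho = 0 keeps none.
    return [alter for alter in alters if rng.random() < rho]
```

Because each alter is drawn independently, the kept share is only approximately ρ for any single sample, which is why the results are summarized over many samples per value of ρ.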

Figure 4: Location prediction quality as a function of disclosure tendency and number of friends

In Figure 4, the left panel shows the median error of location prediction in 1000 samples for each value of ρ ∈ [0.1, 0.9]. The median error approaches the value of the error when ρ = 1, using all alters, which is 72 km. The inset shows the error of the Null Model, which is several orders of magnitude larger than the error of shadow profiles. The right panel shows stratified regression lines of median error as a function of the number of alters in the samples, revealing that error decreases with the number of alters for the different values of ρ.

Biographic Vector Prediction

Figure 5: Biography cosine similarity in predictions as a function of disclosure tendency and number of friends

In Figure 5, the left panel shows the median cosine similarity of predictions and of the Null Model in 1000 samples for each value of ρ. The cosine similarity of predictions outperformed the Null Model for ρ > 0.2 and increased with ρ. The right panel shows the regression analysis of cosine similarity versus the number of friends on Twitter, revealing a trend of growing similarity with the number of friends.

The error level for shadow profiles of location (68.7 km) is comparable to error levels using full information, which are typically between 28.3 km and 57.2 km.

Our results demonstrate that even if as little as 30% of your network disclosed their information, your private information could be inferred with significant accuracy.

Limitations of our study:

Historical audit using future data as ground truth
Using the mention network to determine friendship links
Biographical vectors don’t allow a straightforward interpretation of user interests

The implications of our results are clear: individuals do not have full control over their privacy. The decisions of other people mediate an individual’s decision not to share information with online services, which means that we cannot conceive of online privacy as a purely individual phenomenon that can be reduced to the choices of one person.

Please find the full paper, accepted at EPJ Data Science Journal 2018, here for a detailed description of our work. This is joint work with Dr. David Garcia, Amod Agrawal, and PK.