A Picture is Worth 32.33 Words: Importance of Analyzing Images on Online Social Media

Do you remember the last time you rushed, or saw anyone rush, to get an “autograph” of a famous personality? No, right? Because those days are long gone. Today’s generation believes in taking a selfie instead. And why not? Digital media is forever, or at least it can easily outlive a piece of paper with an autograph! There is an explosion of data generated on Online Social Media (OSM): we see 422,340 tweets on Twitter, 3.3 million updates on Facebook, and 55,555 pictures uploaded on Instagram every 60 seconds [1]. A large fraction of these updates is now images / pictures; one analysis shows that 1.8 billion photos are shared on Facebook, Instagram, Flickr, Snapchat, and WhatsApp every day [2]. Updates with images have also been found to increase engagement; for example, [3] reports 18% more clicks, 89% more favorites, and 150% more retweets when a tweet has an image compared to text-only updates. Another article reports that 93% of the most engaging posts on Facebook have an image [4]. Researchers are also studying what makes an image popular on networks like Flickr [5].

In the last few years, many academic papers and real-world technologies have looked at this growth of content and analyzed it; most of them analyze only the textual part of the content. Here is a non-comprehensive list of publications from some of the top-tier conferences in this space; all of these papers look at content generated in English [6 – 20]. Some researchers are also studying the sentiment and textual characteristics of non-English content on OSM [21 – 27]; the languages covered include Farsi and Hindi.

I have been curious for a little while now about non-textual content on OSM; some of my recent interest has been in looking at images and videos on OSM. I recently had my student Sonal Goel investigate images on OSM; she completed her Masters thesis, “Image Search for Improved Law and Order: Search, Analyse, Predict image spread on Twitter”, in which she predicted the virality of images on OSM using tweets from multiple events. Prateek Dewan, my Ph.D. student, and I have been playing around with the broader topic of images and OSM. We believe that the inferences drawn from textual analysis can differ from those drawn from analyzing the images in the same posts. For example, textual analysis of the kind done for Hurricane Sandy [28] and the Boston Marathon [29] could classify posts with images (along with text) as legitimate, whereas analyzing the images themselves may show them to be fake. Below is a fake image which went viral during Sandy; textual analysis of the posts carrying this image could have leaned towards credible content.

Sentiment analysis of OSM content is used to gauge the pulse of citizens, customers, etc., and to make decisions based on it. Sometimes the sentiment of the textual content is very different from that of the images posted with it. The image below was posted with the text “Thank you Piers Morgan for speaking truth. #PrayForParis #MuslimsStandWithParis” [30]. Text analysis will give a positive / neutral sentiment, while the content of the image attached to the post is negative. We found other examples to substantiate this point: posts where the text is negative and the image is more positive [33, 34], and a post where the text is positive and the image is more negative [35].

Just to test our hypothesis about how much information is spread through images, we analyzed some events for which we have been collecting data. The table below shows data for 9 events; we see that, on average, about 20 – 25% of the content has only images without text. Most of the analysis done today, which uses only textual content, will miss this information. In one of the events that we are analyzing now, we were able to extract text from 8,200 images; these images were posted on OSM with no text. To understand the amount of text that is shared through images, we got images annotated and, using Tesseract OCR [31], we were able to get 1,030,471 words from 31,869 images.
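The figure in the title of this post follows from the last two counts above. Here is a minimal Python sketch of that arithmetic; the helper name `words_per_image` is only for illustration, and the OCR step itself (running Tesseract on each image) is not shown.

```python
def words_per_image(total_words: int, total_images: int) -> float:
    """Average number of OCR-extracted words per image."""
    if total_images <= 0:
        raise ValueError("total_images must be positive")
    return total_words / total_images


# Counts reported in the post: words recovered by OCR, and images processed.
avg = words_per_image(1_030_471, 31_869)
print(f"{avg:.2f}")  # prints 32.33
```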

The “with text” column refers to the number of posts containing the “message” field as returned by the Graph API. This field contains the status / text message posted by the user. The “with image” column represents the number of posts where the “type” of the post is “photo.” Facebook automatically determines this “type” while a user is composing a post. This field is assigned to ALL posts, and can take one of the following values: link, status, photo, video, offer [32]. This makes the “text and image” column the intersection of the previous two columns. Similarly, “image and no text” is a subset of the “with image” column, and “text and no image” is a subset of the “with text” column. All values in parentheses in the table are percentages of the total posts for that event.

Event	Total posts	Posts with text	Posts with image	Posts with text and image	Posts with image and no text	Posts with text and no image
AirAsia flight missing 2014	22,820	6,868 (30)	10,192 (44)	538 (2)	9,654 (42)	6,330 (28)
Cricket world cup 2015	20,960	17,217 (82)	7,463 (36)	5,756 (27)	1,707 (8)	11,416 (54)
Ebola outbreak 2014	67,453	28,030 (42)	12,386 (18)	1,553 (2)	10,833 (16)	26,477 (39)
Euro cup 2016	109,189	77,355 (71)	61,119 (56)	40,518 (37)	20,601 (19)	36,837 (34)
Wimbledon 2015	111,417	80,469 (72)	52,756 (47)	37,862 (34)	14,894 (13)	42,607 (38)
Paris attacks 2015	131,548	78,803 (60)	75,277 (57)	32,861 (25)	41,416 (32)	45,942 (35)
Malaysian MH17 crash 2014	22,490	5,270 (23)	2,947 (13)	316 (1)	2,631 (12)	4,954 (22)
IPL8 cricket 2015	48,329	31,526 (65)	19,116 (40)	9,251 (19)	9,865 (20)	22,275 (46)
Gaza unrest 2015	31,537	10,142 (32)	6,157 (20)	1,716 (5)	4,441 (14)	8,426 (27)
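The column definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to produce the table: each post is assumed to be a dict shaped like a Graph API post object, with an optional “message” field and a “type” field; the function name `tally_posts` and the sample posts are hypothetical.

```python
from collections import Counter


def tally_posts(posts):
    """Count posts per table column from Graph-API-like post dicts."""
    counts = Counter()
    for post in posts:
        has_text = "message" in post             # post carries a "message" field
        has_image = post.get("type") == "photo"  # post "type" is "photo"
        counts["total"] += 1
        counts["with_text"] += has_text
        counts["with_image"] += has_image
        counts["text_and_image"] += has_text and has_image
        counts["image_no_text"] += has_image and not has_text
        counts["text_no_image"] += has_text and not has_image
    return counts


# Hypothetical sample posts, for illustration only.
sample = [
    {"type": "photo", "message": "Stay safe"},
    {"type": "photo"},
    {"type": "status", "message": "Thoughts with everyone"},
    {"type": "link"},
]
counts = tally_posts(sample)
for column, n in counts.items():
    # Percentages are relative to the total number of posts, as in the table.
    print(f"{column}: {n} ({100 * n / counts['total']:.0f})")
```

Because the last three columns are defined from the first two, “with text” should equal “text and image” plus “text and no image”, and likewise for “with image”; that gives a quick sanity check on any row.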

Given this growth of images and pictures on OSM, and the relatively little work done on topics related to OSM and images, there is great scope for contributing in this domain. There are full-fledged, dedicated traditional conferences like the IEEE International Conference on Computer Vision (ICCV), the International Conference on Machine Learning (ICML), and the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) which look at images. Some knowledge transfer is needed from these classic domains to OSM. It may also be the case that, in the past, image analysis was not as advanced as it is now; advancements in image analysis, including neural networks, now make it possible to do some really cool image analysis which would have been difficult or impossible earlier. Given the large amount of data on OSM, and with advanced image analysis techniques, we should be able to answer some very exciting research questions.

Some specific topics and problems that I think will be interesting in this space of OSM and images (these are just my random thoughts, and the list is non-comprehensive):

Spread of untrustworthy information / misinformation on OSM through images
Leakage of personal information like current location, etc. through images on OSM
Leakage of sensitive information like DOB, gender, etc. through images on OSM

If you are interested in keeping updated about our activities at Precog, you can visit our website or our Facebook page. If you have any suggestions or ideas to explore in this direction, feel free to write to me.

Acknowledgements: I thank my brilliant students Prateek Dewan, Niharika Sachdeva, Indira Sen, Kushagra Singh, Megha Arora, Hemank Lamba, and Varun Bharadhwaj for helping put together the thoughts, numbers, and analysis in this post. Thanks to all members of the Precog group, where the idea of studying images and trying it out from different perspectives started.

References

[1] http://www.smartinsights.com/internet-marketing-statistics/happens-online-60-seconds/
[2] http://www.businessinsider.com/were-now-posting-a-staggering-18-billion-photos-to-social-media-every-day-2014-5
[3] http://www.adweek.com/socialtimes/twitter-images-study/493206
[4] http://www.socialbakers.com/blog/1749-photos-make-up-93-of-the-most-engaging-posts-on-facebook
[5] https://people.csail.mit.edu/khosla/papers/www2014_khosla.pdf
[6] Pollyanna Gonçalves, Matheus Araújo, Fabrício Benevenuto, and Meeyoung Cha. 2013. Comparing and combining sentiment analysis methods. In Proceedings of the first ACM conference on Online social networks (COSN ’13). ACM, New York, NY, USA, 27-38. DOI=http://dx.doi.org/10.1145/2512938.2512951
[7] Tomer Simon, Avishay Goldberg, Limor Aharonson-Daniel, Dmitry Leykin, and Bruria Adini. Twitter in the Cross Fire: The Use of Social Media in the Westgate Mall Terror Attack in Kenya. PLoS ONE.
[8] Saritha SK, Devshriroy D (2013) Semantic Orientation of Sentiment Analysis on Social Media. International Journal of Computers & Technology 11 (4) 2401–2409.
[9] Munmun De Choudhury, Scott Counts, and Eric Horvitz. 2013. Predicting Postpartum Changes in Emotion and Behavior via Social Media. In Proc. CHI ’13.
[10] Munmun De Choudhury, Scott Counts, Eric J Horvitz, and Aaron Hoff. 2014. Characterizing and Predicting Postpartum Depression from Shared Facebook Data. In Proc. CSCW ’14. ACM, 626–638.
[11] Munmun De Choudhury, Andres Monroy-Hernandez, and Gloria Mark. 2014. “Narco” Emotions: Affect and Desensitization in Social Media during the Mexican Drug War. In Proc. CHI ’14. ACM.
[12] Satarupa Guha, Tanmoy Chakraborty, Samik Datta, Mohit Kumar, Vasudeva Varma. TweetGrep: Weakly Supervised Joint Retrieval and Sentiment Analysis of Topical Tweets. In the proceedings of ICWSM 2016.
[13] Soroush Vosoughi, Deb Roy. A Semi-Automatic Method for Efficient Detection of Stories on Social Media. In the proceedings of ICWSM 2016.
[14] David Alvarez-Melis, Martin Saveski. Topic Modeling in Twitter: Aggregating Tweets by Conversations. In the proceedings of ICWSM 2016.
[15] Tim Althoff, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky. How to Ask for a Favor: A Case Study on the Success of Altruistic Requests. In the proceedings of ICWSM 2014.
[16] Efthymios Kouloumpis, Theresa Wilson, and Johanna Moore. 2011. Twitter Sentiment Analysis: The Good the Bad and the OMG! (ICWSM ’11)
[17] Alexander Pak and Patrick Paroubek. 2010. Twitter as a Corpus for Sentiment Analysis and Opinion Mining. In LREC, vol. 10, pp. 1320-1326.
[18] Aliaksei Severyn and Alessandro Moschitti. Twitter sentiment analysis with deep convolutional neural networks. Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2015.
[19] Cícero Nogueira dos Santos and Maira Gatti. Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. COLING. 2014.
[20] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification. In ACL (1), pp. 1555-1565. 2014.
[21] Vaziripour, Elham, Christophe Giraud-Carrier, and Daniel Zappala. Analyzing the Political Sentiment of Tweets in Farsi. Tenth International AAAI Conference on Web and Social Media. 2016.
[22] Peng, Nanyun, Yiming Wang, and Mark Dredze. Learning Polylingual Topic Models from Code-Switched Social Media Documents. ACL (2). 2014.
[23] Weerkamp, Wouter, Simon Carter, and Manos Tsagkias. How people use twitter in different languages. (2011): 1-2.
[24] Volkova, Svitlana, Theresa Wilson, and David Yarowsky. Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media. EMNLP. 2013.
[25] Anupam Jamatia, Bjorn Gambäck, and Amitava Das. 2015. Part-of-Speech Tagging for Code-Mixed English-Hindi Twitter and Facebook Chat Messages. Proceedings of Recent Advances in Natural Language Processing, page 239.
[26] Sujan Kumar Saha, Partha Sarathi Ghosh, Sudeshna Sarkar, and Pabitra Mitra. 2008. Named Entity Recognition in Hindi using Maximum Entropy and Transliteration. Research journal on Computer Science and Computer Engineering with Applications, pp. 33–41.
[27] Ayush Kumar, Sarah Kohail, Asif Ekbal, and Chris Biemann. 2015. IIT-TUDA: System for sentiment analysis in indian languages using lexical acquisition. Mining Intelligence and Knowledge Exploration, pages 684–693.
[28] Gupta, A., Lamba, H., Kumaraguru, P., and Joshi, A. Faking Sandy: Characterizing and Identifying Fake Images on Twitter during Hurricane Sandy. 2nd International Workshop on Privacy and Security in Online Social Media (PSOSM), in conjunction with the 22nd International World Wide Web Conference (WWW) (2013).
[29] Gupta, A., Lamba, H., and Kumaraguru, P. $1.00 per RT #BostonMarathon #PrayForBoston: Analyzing Fake Content on Twitter. IEEE APWG eCrime Research Summit (eCRS), 2013.
[30] https://www.facebook.com/americanmuslims1/photos/a.809524959106862.1073741828.527806667278694/990645217661501/?type=1&theater
[31] https://github.com/tesseract-ocr/tesseract
[32] https://developers.facebook.com/docs/graph-api/reference/v2.7/post#read
[33] https://www.facebook.com/ChristianChronicle/photos/a.83579936833.99565.11127431833/10153806013491834/?type=3&theater
[34] https://www.facebook.com/roberta.metsola/photos/a.406836966100205.1073741826.406824526101449/839065439544020/?type=3&theater
[35] https://www.facebook.com/516601545154233/photos/a.519535361527518.1073741828.516601545154233/574613042686416/?type=3&theater

PK