Tuesday, April 15, 2014

Social Media Mining with R

Danneman, N. and Heimann, R. (2014). Social Media Mining with R. Kindle Edition, file size: 1414 KB.

One trend in the analysis of human intelligence holds that cultural products, language among them, not only mediate our thoughts but also mould our representation of reality (Bruner, 1991). According to Bruner, this view argues that an individual's working intelligence can only be understood by taking into account the

"reference books, notes, computer programs and data bases, or most important of all, the network of friends, colleagues, or mentors on whom [this working intelligence] leans for help and advice" (p. 3).
If we look beyond intelligence as a narrow academic skill, based mostly on book learning and test taking (Wikipedia), and define it instead as
"a broader and deeper capability for comprehending our surroundings -- 'catching on,' 'making sense' of things, or 'figuring out' what to do" (ibid.),
then one way to understand and predict how people create their reality is to investigate those sources of information and, when possible, the network of friends.

The WWW, with its 1.7 billion pages (worldwidewebsize.com), has become that source of information and a "place" to investigate the networks of friends on whom people lean for help and advice today. The WWW provides not only physical and tangible products but also ideas and perspectives. Baby Boomers and Millennials, the latter being people born between 1977 and 1995, search the Web in equal measure before making decisions (Bazaarvoice, 2012). While 71% of Baby Boomers use the information they collect to buy in stores, Millennials, who are constantly connected and highly dependent on social media, prefer to buy online (52%) (ibid.). In fact, Millennials have more confidence in people's opinions about brands than in the information provided by companies (ibid.). The Web, in short, is mentoring people in making sense of things, figuring out what to do, and catching on. Consequently, to understand, serve, and predict individuals' working intelligence today, we need to mine the WWW.

This is not an easy task. Although there is a lot of information about mining text in general, and the Web in particular, it tends to focus on algorithms and model development, which makes it highly specialized and out of reach for most people. This is where Social Media Mining with R, written by Nathan Danneman and Richard Heimann, comes into play. The book offers some insights into the theory behind mining the Web, specifically social media; suggests free, open source software for doing it, R; and presents three case studies from which we can grasp the procedure, all in only 120 pages.

In three of the six chapters, the authors provide a short background on their approach to social data mining. Among the topics discussed is why the Web is an extraordinary source of opinionated social data, meaning data generated by people, or by their interactions, in which they express their sentiments, evaluations and opinions. This data is produced in real time and on a large scale. Following the tradition of social science, the authors intend to use social media data to ask and answer questions about individual and group level behavior. Danneman and Heimann are aware of the pitfalls and failures of social media data; consequently, they devote an entire chapter to discussing them. They also illustrate the differences between social media data and the traditional social data commonly used in social science. Despite its limitations, they advocate for social media data, used with creativity, curiosity and a healthy dose of skepticism, because it is available and can help answer the vast majority of emerging questions related to business, politics, and social life (p. 55), questions for which there is in fact no traditional social data.

In chapters 2 and 3, the authors introduce us to R and teach us how to collect tweets using the package twitteR. Regrettably, the procedure they provide for obtaining OAuth authorization for Twitter does not work, and it generates the same error message that has lately been posted on various R blogs and e-mail lists. This does not mean that readers won't eventually figure out how to harvest tweets, but it is not going to be easy. In any case, it is worth reading chapter 3 to get an idea of the possible outcomes of this analysis.

In chapters 5 and 6, the authors give us a framework for coping with social media data. Chapter 5 covers the fundamentals of extracting sentiments, as well as the theoretical foundations of the techniques applied in chapter 6. Three methods are suggested: two unsupervised ones (processes that do not need previously labeled data to generate an outcome), namely a lexicon-based sentiment approach and Item Response Theory for Text Scaling (ITS), and a supervised one, a Naïve Bayes Classifier. The first method consists of counting the opinion words in a subset of data from a particular source (p. 60). The ITS approach takes people's previous opinions on a given topic and, according to the sentiments they have used, locates them, or their documents, on a continuous scale that represents the author's sentiment toward the topic under study (p. 63). As for the Naïve Bayes Classifier, it is used to classify new observations, in this case opinionated data, based on existing data.
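The lexicon-based method, counting opinion words, is simple enough to sketch in a few lines. What follows is a minimal illustration in Python rather than the book's R code; the word lists, the sample sentences and the function name lexicon_score are all assumptions made up for this example, not the lexicon or data the authors use.

```python
import re

# Toy opinion lexicon: these word lists are illustrative assumptions,
# not the lexicon used in the book.
POSITIVE = {"good", "strong", "growth", "improved", "gain"}
NEGATIVE = {"bad", "weak", "decline", "worse", "loss"}


def lexicon_score(text):
    """Return (number of positive words) - (number of negative words)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    return pos - neg


# Hypothetical sentences written for this sketch.
docs = [
    "Retail sales improved and hiring showed strong growth.",
    "Manufacturing was weak, with a decline in new orders.",
    "Conditions were mixed: a gain in exports but a loss in housing.",
]
scores = [lexicon_score(doc) for doc in docs]
print(scores)  # prints [3, -2, 0]
```

A score above zero suggests a positive tone and a score below zero a negative one. A real analysis would rely on a much larger published opinion lexicon and would probably normalize by document length.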

Finally, chapter 6 puts into practice everything discussed in chapter 5. The authors use two different sources of social media data as case studies: the Beige Book (Summary of Commentary on Current Economic Conditions), published by the Federal Reserve Board (FRB), and 4000 tweets hashtagged #prolife and #prochoice. The authors apply the lexicon-based sentiment approach to the Beige Book, and ITS and the Naïve Bayes Classifier to the tweets. Danneman and Heimann take us by the hand and guide us step by step through each of these methods, to the point that we can even calculate how much RAM we may need depending on our data. All the code, which can be downloaded from the publisher's web page, is fully explained, so that we can "see" what we are doing. I downloaded the code, but R complains with a message indicating that the R sources provided use a deprecated function or command.
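To give a feeling for what classifying new observations based on existing data means, here is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. It is not the authors' R code: the training messages, the labels and the function names (tokenize, train, predict) are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z#']+", text.lower())


def train(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict(text, model):
    """Return the class with the highest log posterior for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label in sorted(class_counts):
        # Log prior: share of training documents with this label.
        logp = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothed word likelihood.
            count = word_counts[label][token] + 1
            logp += math.log(count / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Hypothetical labeled messages standing in for hand-labeled tweets.
training = [
    ("great phone love the camera", "positive"),
    ("love this great battery", "positive"),
    ("terrible screen hate the battery", "negative"),
    ("hate this terrible service", "negative"),
]
model = train(training)
print(predict("love the camera", model))
print(predict("hate the screen", model))
```

The "existing data" here is the small labeled training list; every new message is assigned the label whose word frequencies make it most probable. With thousands of tweets the procedure is the same, only the counts are larger.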

I recommend this book to anybody who wants to start on the fascinating task of mining social data and capturing the reality created by people in almost real time. You can read it in 4-6 hours and, depending on how quickly you catch up with R, replicate the case studies in two or three days. Complete beginners will have some problems, though.

References

Bazaarvoice. (2012). Talking to Strangers: Millennials Trust People Over Brands. In: http://resources.bazaarvoice.com/rs/bazaarvoice/images/201202_Millennials_whitepaper.pdf

Bruner, J. (1991). The Narrative Construction of Reality. In: http://www.semiootika.ee/sygiskool/tekstid/bruner.pdf

Wikipedia. Intelligence. In: http://en.wikipedia.org/wiki/Intelligence

The size of the World Wide Web (The Internet). (2014). In: http://worldwidewebsize.com/