Mining Twitter using Python – Part 1


Twitter is a massive social networking shopping mall: there is a huge audience out there that you can analyze to get the insights you are looking for. If you ever want to research or study something, Twitter should be on your list of places to look for data.
There is so much you can do with Twitter data. For instance, you can determine relationships between users, stalk your ex or your current partner, and much more. This data can be used to carry out some interesting analysis or research inside the social media universe. I’ve read about an app called Fire Me: it collects and analyzes recent tweets, finds the most horrendous things people have tweeted about their jobs, and uses an algorithm to pick out the users who have a strong chance of getting fired.
To be honest, there is lots and lots of noisy or irrelevant data inside the Twitter universe. To really make the most of your Twitter data, you first have to figure out what you are trying to find out or study: you need a clearly defined objective, and you need to clean the data at your end. Organisations have been mining Twitter data to understand categories, consumer mindset and needs, and potential market trends; the list is too long to cover in full. There are companies that correlate Twitter data with their other data sources to predict the future and even to get insights about their current business. All of this is very statistical and very mathematical.
Today we will be focusing on mining Twitter data using Python, and tweepy was the obvious candidate. But before using the tweepy API, we have to create a Twitter app to get our hands on some keys.
First we will create an app here. It looks something like this:
[Image: CreatingNewApp]
Click on Create New App to create an app on Twitter, then fill in the following fields.
[Image: AppFields]
1. Name – A name for your app, up to 32 characters.
2. Description – Whatever your app does; for this tutorial we will use something lame. The description can be 10 to 200 characters long.
3. Website – Put your website in this field. It is supposed to be your publicly accessible home page; for personal use, just enter anything you want in the form http://www.websitename.domain
4. Callback URL – This is beyond the scope of this tutorial. Just to give you a heads-up, it has to do with authenticating users if you allow them to log in to your app.
At the end of the page there is the developer agreement. Tick the I agree check-box and click on Create twitter application.
Once the application is created, the page should redirect you to your application dashboard.
[Image: AppDashboard]
Click on the “Keys and Access Tokens” tab on the dashboard and you will find your Consumer Key and Consumer Secret.
[Image: ConsumerKeys]
You’ve got your Consumer Key and Consumer Secret, and now it’s time to generate your Access token and Access token secret. Scroll down a bit and you’ll find a button which says Create my access token.
[Image: CreateAccessToken]
After clicking on the button you will have your Access token and Access token secret.
[Image: GeneratedAccessToken]
Now we have all the keys to access our app via Python. Let’s do it!
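Since these four values are secrets, you may prefer not to paste them straight into a script. Here is a minimal sketch, assuming you export them as environment variables first. The variable names (TWITTER_CONSUMER_KEY and so on) and the load_twitter_keys helper are made up for this illustration; neither Twitter nor tweepy requires them.

```python
import os

# Hypothetical environment variable names; pick whatever names you like.
KEY_NAMES = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_keys(environ=None):
    """Return the four keys as a dict, or raise if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in KEY_NAMES if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in KEY_NAMES}
```

You could then call load_twitter_keys() once at the top of a script and use the returned values wherever the key strings are needed.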
As I mentioned before, we will be using the tweepy API to access our app.
To install tweepy, just use the following command on Windows or Linux:

pip install tweepy 

If you have difficulty installing it with this command, there are other ways.
Download the whl file from the following link and use the following command:

pip install [directory-to-downloaded-whl-package]

Or you can install it from source:

git clone https://github.com/tweepy/tweepy.git
cd tweepy/
pip install .

Once tweepy has been installed, we can write a simple program like this to access our home timeline:

import tweepy

# Fill in the keys from the "Keys and Access Tokens" tab
CONSUMER_KEY = ""
CONSUMER_SECRET = ""
ACCESS_TOKEN = ""
ACCESS_TOKEN_SECRET = ""

auth_name = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth_name.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth_name)

tweets = api.home_timeline()
for tweet in tweets:
    print(tweet.text)

This is a very simple program that gets the 20 most recent tweets and retweets posted by the user (in this case, me) and by my friends, i.e. the people I follow.
Let’s break down the code.


auth_name = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)

# This returns an OAuthHandler instance built from the consumer key and consumer secret.
# Think of it as a handle on the Twitter application. So far we have only created the
# instance; to use the application we supply it with the access tokens in the next step.

auth_name.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# To access the API features we have to create an API instance by passing in the authentication.

api = tweepy.API(auth_name)

# Once we have created the API instance we are free to do anything we want with it.
# In the example above we used the API instance to get the 20 most recent tweets.


tweets = api.home_timeline()

# The above method returns a tweepy.models.ResultSet.
# Let's analyze a single item of this result set.

tweet_one = tweets[0]  # Get the first item

type(tweet_one)  # This returns tweepy.models.Status

So in the above code we are iterating over the tweepy.models.Status objects and printing the text of each one.

for tweet in tweets:
    print(tweet.text)


We can convert the Status object into a dictionary directly


tweets = api.home_timeline()

tweet = tweets[0]  # Get the very first tweepy.models.Status object

tweet_json = tweet._json

type(tweet_json)  # This returns the dict type

Once a status has been converted to a dictionary (parsed JSON), you can access its data by key.
The following fields of the Status object are the important ones to look at:

favorite_count
entities
user
geo
coordinates

To access them from the dictionary we use the normal syntax:
tweet_json['entities']
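Not every key is filled in for every tweet; geo and coordinates, for instance, are null unless the tweet is geotagged. Below is a small sketch of defensive access, using a hand-written dictionary that only imitates the shape of tweet._json. The values are invented for illustration, a real payload has many more fields, and get_nested is a hypothetical helper name.

```python
# Hand-written stand-in for tweet._json; all values are made up.
sample_tweet = {
    "id": 1,
    "text": "Learning #python with tweepy",
    "favorite_count": 3,
    "geo": None,
    "coordinates": None,
    "entities": {"hashtags": [{"text": "python"}], "urls": []},
    "user": {"screen_name": "example_user"},
}


def get_nested(data, *keys, default=None):
    """Walk nested dictionaries, returning default if any key is missing or None."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or current.get(key) is None:
            return default
        current = current[key]
    return current


print(get_nested(sample_tweet, "user", "screen_name"))
print(get_nested(sample_tweet, "coordinates", "coordinates"))
print(get_nested(sample_tweet, "favorite_count", default=0))
```

The second call returns the default instead of raising an error, which is handy when looping over many tweets where only a few carry location data.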

We will be focusing on 2 values present in tweet_json, i.e. entities and user.
entities
tweet_json['entities'] returns a dictionary with the following keys:
1. symbols
2. user_mentions
3. hashtags
4. urls
The hashtags are what we are interested in.

tweet_json['entities']['hashtags']  # To access the hashtags used

To print the hashtags we will use the following code:

hashtags = tweet_json['entities']['hashtags']

for hashtag in hashtags:
    print(hashtag['text'])
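The same loop extends naturally to a whole batch of tweets. Here is a rough sketch that tallies hashtags with collections.Counter. count_hashtags is a hypothetical helper, and the sample dictionaries stand in for the tweet._json values you would collect from a timeline.

```python
from collections import Counter


def count_hashtags(tweet_dicts):
    """Count hashtags case-insensitively across dictionaries shaped like tweet._json."""
    counts = Counter()
    for tweet_json in tweet_dicts:
        hashtags = tweet_json.get("entities", {}).get("hashtags", [])
        for hashtag in hashtags:
            counts[hashtag["text"].lower()] += 1
    return counts


# Made-up sample data in place of real tweets.
sample_tweets = [
    {"entities": {"hashtags": [{"text": "Python"}, {"text": "tweepy"}]}},
    {"entities": {"hashtags": [{"text": "python"}]}},
    {"entities": {"hashtags": []}},
]

for tag, count in count_hashtags(sample_tweets).most_common():
    print(tag, count)
```

Lower-casing the text is a design choice for this example, so that #Python and #python are counted together; drop it if case matters for your study.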

user
user = tweet_json['user']  # This returns the user details
This is used to access the details of the user whose status we are analyzing.
user has the following attributes:

follow_request_sent -> True or False, whether we have sent that user a follow request
id -> The unique Twitter ID of the user
profile_background_image_url_https -> The https URL of the background image of the user
verified -> True or False, whether the account is verified (the blue tick)
entities -> Only contains URLs
followers_count -> Number of followers the user has
statuses_count -> Number of statuses posted by the user
friends_count -> Number of accounts the user follows
description -> The bio text of the user
location -> Location, if provided
following -> True or False, whether we are following the user
screen_name -> The screen name of the user
profile_image_url -> URL of the profile image of the user
name -> The name given by the user while creating the Twitter profile

So many attributes, right? Well, to be honest there are more attributes, and you have to explore them yourself 😉
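As a small illustration of putting a few of these attributes to work, the sketch below condenses a user dictionary into a one-line summary. summarize_user is a hypothetical helper and the sample values are invented; only the key names come from the list above.

```python
def summarize_user(user):
    """Build a short summary from a dictionary shaped like tweet_json['user']."""
    followers = user.get("followers_count", 0)
    friends = user.get("friends_count", 0)
    # Followers per followed account; guard against dividing by zero.
    ratio = followers / friends if friends else float(followers)
    badge = " (verified)" if user.get("verified") else ""
    return "@{}{}: {} followers, following {}, ratio {:.2f}".format(
        user.get("screen_name", "unknown"), badge, followers, friends, ratio
    )


# Invented sample values.
sample_user = {
    "screen_name": "example_user",
    "verified": False,
    "followers_count": 150,
    "friends_count": 100,
}

print(summarize_user(sample_user))
```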
To retrieve more tweets from your home timeline, use the following code.

tweets = api.home_timeline(count=200)

This will retrieve the 200 most recent tweets from your timeline. You cannot retrieve more than 200 tweets in a single call.
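The 200 cap applies to each call. If you need to go further back before we get to cursors, one approach is to call the method repeatedly and pass max_id so that each call starts below the oldest tweet already seen. The sketch below assumes api is the tweepy.API instance created earlier and that home_timeline accepts the count and max_id parameters of the underlying Twitter endpoint. fetch_older is a hypothetical helper name, and Twitter still limits how far back a home timeline reaches.

```python
def fetch_older(api, pages=3, per_page=200):
    """Collect up to pages * per_page tweets by paging backwards with max_id."""
    collected = []
    max_id = None
    for _ in range(pages):
        kwargs = {"count": per_page}
        if max_id is not None:
            kwargs["max_id"] = max_id
        batch = api.home_timeline(**kwargs)
        if not batch:
            break  # Nothing older is available.
        collected.extend(batch)
        # max_id is inclusive, so step just below the oldest tweet in this batch.
        max_id = min(tweet.id for tweet in batch) - 1
    return collected
```

With a real API instance you would call something like tweets = fetch_older(api, pages=3), keeping the number of pages small to stay within the rate limits.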
To stalk some user’s tweets, use the following code.

tweets = api.user_timeline(screen_name="", count=200)

for tweet in tweets:
    json_data = tweet._json
    do_someprocessing(json_data)


do_someprocessing(json_data) is the algorithm you write yourself, i.e. whatever you intend to do with the data.
Again, the maximum number of tweets you can retrieve in one call is 200.
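What do_someprocessing does is entirely up to you. Purely as an illustration, here is one possible version that keeps a handful of fields from each tweet. The choice of fields and the layout of the returned dictionary are assumptions made for this example, not anything tweepy prescribes.

```python
def do_someprocessing(json_data):
    """Reduce a tweet dictionary (tweet._json) to the few fields we care about."""
    entities = json_data.get("entities", {})
    return {
        "id": json_data.get("id"),
        "text": json_data.get("text", ""),
        "author": json_data.get("user", {}).get("screen_name"),
        "favorites": json_data.get("favorite_count", 0),
        "hashtags": [tag["text"] for tag in entities.get("hashtags", [])],
    }


# Invented sample in place of a real tweet._json.
sample = {
    "id": 42,
    "text": "Hello #world",
    "favorite_count": 5,
    "user": {"screen_name": "example_user"},
    "entities": {"hashtags": [{"text": "world"}]},
}

print(do_someprocessing(sample))
```

Returning a flat dictionary like this makes it easy to collect the results in a list and write them to a CSV file or a database later.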
Want to update your status? Here’s how you do it:

tweet_instance = api.update_status(status="Some Status")

update_status returns a tweepy.models.Status object, so you can analyze it in the same way as before.
Suppose you want to reply to the first status update in your timeline.

tweet = api.home_timeline()[0]._json  # Gets the first tweet as a dictionary
user = tweet['user']
api.update_status(status="some status @" + user['screen_name'], in_reply_to_status_id=tweet['id'])

Here we mention the @ screen name of the user we are replying to and pass the ID of the tweet we are replying to. Twitter ignores in_reply_to_status_id unless the status text mentions the author of that tweet.
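If you reply in more than one place, it may help to assemble the arguments in a small function first. This is only a sketch: build_reply is a hypothetical helper, and it simply returns the keyword arguments that the update_status call above expects.

```python
def build_reply(tweet_json, message):
    """Return keyword arguments for api.update_status that reply to tweet_json."""
    screen_name = tweet_json["user"]["screen_name"]
    return {
        # The mention goes first so the text reads as a reply to that user.
        "status": "@{} {}".format(screen_name, message),
        "in_reply_to_status_id": tweet_json["id"],
    }


# Invented sample in place of api.home_timeline()[0]._json.
sample = {"id": 123, "user": {"screen_name": "example_user"}}
reply_kwargs = build_reply(sample, "thanks for sharing!")
print(reply_kwargs)
# With a real API instance: api.update_status(**reply_kwargs)
```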
To remove the most recent tweet posted by me:


tweet = api.user_timeline(screen_name="MyUserName")[0]._json
f = api.destroy_status(tweet['id'])

You can even retweet the people you are following. I will retweet only the most recent tweet, but you could run a continuous loop so that whenever a person you follow tweets, your program retweets it almost immediately :)

tweet = api.user_timeline(screen_name="some_user")[0]._json
f = api.retweet(tweet['id'])
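The continuous loop mentioned above could look roughly like the following. It is a sketch under a few assumptions: api is the tweepy.API instance from earlier, user_timeline is given the since_id parameter so that only tweets newer than the last one seen come back, and the names retweet_new and watch as well as the polling interval are made up for the example. Polling too often will run into Twitter's rate limits, so the interval is deliberately generous.

```python
import time


def retweet_new(api, screen_name, last_seen_id=None):
    """Retweet tweets newer than last_seen_id and return the newest id seen."""
    kwargs = {"screen_name": screen_name}
    if last_seen_id is not None:
        kwargs["since_id"] = last_seen_id
    tweets = api.user_timeline(**kwargs)
    # Timelines come newest first; retweet oldest first to keep the order.
    for tweet in reversed(tweets):
        api.retweet(tweet.id)
    if tweets:
        return max(tweet.id for tweet in tweets)
    return last_seen_id


def watch(api, screen_name, interval_seconds=120):
    """Poll forever; interval_seconds is an arbitrary example value."""
    # Start from the newest existing tweet so old ones are not retweeted.
    existing = api.user_timeline(screen_name=screen_name, count=1)
    last_seen_id = existing[0].id if existing else None
    while True:
        last_seen_id = retweet_new(api, screen_name, last_seen_id)
        time.sleep(interval_seconds)
```

Keeping the single polling step in retweet_new separate from the endless loop in watch makes the step easy to try once by hand before letting it run unattended.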

To get some user details:

user = api.get_user(screen_name="username")

# Then access the user attributes to get whatever details you require


To follow someone:


user = api.create_friendship(screen_name="UserYouWantToFollow")

To unfollow someone:

user = api.destroy_friendship(screen_name="username")

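To check if user A follows user B:

val = api.exists_friendship("UserIDA", "UserIDB")

This returns True or False. Note that this call relies on the friendships/exists endpoint, which version 1.1 of the Twitter API no longer provides, so it may not be available in recent tweepy releases.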
This is just an introduction to the Twitter API; there is a lot more to explore, and we will be exploring it a lot. In the next tutorial we will cover the Cursor part of tweepy. The reason Cursor is awesome is that, as I mentioned above, the methods we have used so far only retrieve 200 records per call. But suppose I am following a band I like and I’d love to get their old tweets. That is not possible with a single call to the methods above, so in this case we will use a tweepy Cursor. The next tutorial will also include hashtag mining and even streaming the live feed from Twitter :)
There is a lot you can do with the tweepy API. Explore and have fun 😉
