My heart's in Accra » Studying Twitter and the Moldovan protests
At some point on Friday, we hit peak tweet density: 410 of 100,000 tweets included the #pman tag. Had I been scraping results by iterating 100,000 tweets at a time, I would have had four pages of new results, and my script only looks at the first page, so I'd be dropping results. If I ran the script again, I'd try to figure out the maximum tweet density by looking for the moment when the meme was most hyped, do a back-of-the-envelope calculation of an optimum step size, and then halve it; that would probably have me using steps of 20,000 for this data set.
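As a minimal sketch of that back-of-the-envelope arithmetic (in Python; the original script's language isn't shown, and the function and parameter names here are mine): if only the first page of results, 100 tweets, is read per query, the ID window has to be small enough that it rarely holds more than 100 tagged tweets, and then shrunk further for safety. Note this particular rounding lands near 12,000 rather than the post's looser figure of 20,000.

```python
def step_size(window, matches, per_page=100, safety=2):
    """Estimate a max_id step so one page of results can cover a window's matches.

    window   - width of the sampled tweet-ID block (e.g. 100,000)
    matches  - tagged tweets observed in that block at peak density
    per_page - results returned on a single search page
    safety   - shrink the estimate (halving it, as the post suggests)
    """
    # Widest window in which ~per_page matches would fit on a single page:
    optimum = window * per_page // matches
    return optimum // safety

# At the observed peak of 410 #pman tweets per 100,000 IDs:
print(step_size(100_000, 410))  # roughly 12,000
```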
[Figure: Density of tweets charted against blocks of 100,000 tweets]
http://search.twitter.com/search?max_id=1511783811&page=2&q=%23pman&rpp=100

Picking apart the URL:

- max_id=1511783811 - Only return results up to tweet #1511783811 in the database
- page=2 - Hand over the second page of results
- q=%23pman - The query is for the string #pman, encoded to escape the hash
- rpp=100 - Give the user 100 results per page

While you can manipulate these variables to your heart's content, you can't get more than 100 results per page. And if you retrieve 100 results per page, your results will stop at around 15 pages - the engine, by default, wants to give you only 1500 results on any search. This makes sense from a user perspective - it's pretty rare that you actually want to read the last 1500 posts that mention the fail whale - but it's a pain in the ass for researchers.

What you need to do is figure out the approximate tweet ID that was current when the phenomenon you're studying was taking place. If you're a regular twitterer, go to your personal timeline, find a tweet you posted on April 7th, and click on the date to get the ID of the tweet. In the early morning (GMT) of the 7th, the ID for a new tweet was roughly 1468000000 - the URL http://search.twitter.com/search?max_id=1468000000&q=%23pman&rpp=100 retrieves the first four tweets to use the tag #pman, including our Ur-tweet:

evisoft: neata, propun sa utilizam tag-ul #pman pentru mesajele din piata marii adunari nationale

My Romanian's a little rusty, but Vitalie Eşanu appears to be suggesting we use the tag #pman - short for Piata Marii Adunari Nationale, the main square in Chisinau where the protests were slated to begin - for posts about the protests. His post is timestamped 4:40am GMT, suggesting that there was at least some discussion of promoting the protests on Twitter before protesters took to the streets. Now the key is to grab URLs from Twitter, increasing the max_id variable in steps so that we're getting all results from the st
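The URL construction and the max_id stepping can be sketched like this (a Python sketch under assumptions: the old search.twitter.com API shown above has long since been retired, and the function names here are mine, not the original script's):

```python
from urllib.parse import urlencode

# The old search API endpoint dissected above (now retired).
BASE = "http://search.twitter.com/search"

def search_url(max_id, query="#pman", page=1, rpp=100):
    """Build a search URL like the one picked apart above.

    urlencode escapes the hash in "#pman" to %23pman automatically.
    """
    params = {"max_id": max_id, "page": page, "q": query, "rpp": rpp}
    return BASE + "?" + urlencode(params)

def max_id_steps(start_id, end_id, step):
    """Yield the max_id values to query, walking the tweet-ID range in fixed steps."""
    max_id = start_id + step
    while max_id < end_id + step:
        yield max_id
        max_id += step

# One query URL, anchored at the ID current early on April 7th:
print(search_url(1468000000))
# A scrape would then fetch search_url(m) for each m in, say,
# max_id_steps(1468000000, 1511783811, 20_000).
```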