Downloading the Results of a Twitter Search using PowerShell and SQL (part 3)

Part one of this series is here and part two is here. You’ll need the SQL scripts from part two in order for the code below to work but you can skip part one unless you’re curious to see the evolution of this project.

First off, I found a bug in my code. A unique contraint violation may be thrown if the exact same tweet is tweeted at the exact same time by two different users. I didn’t code to handle this because I thought it was an incredibly unlikely scenario but given the popularity of automatic tweeting – eg, “Share this on Twitter” buttons – it seems like it is a scenario which can and does happen.

Anyway, I decided to create a different constraint. Here is the code to drop the old one and add the new one.


use twitter
go

/*
a web app which auto tweets for a user 
might cause two identical tweets 
at the same second
*/

if exists (select 1 from sys.indexes where name = N'uq_tblTweet_tweet_tweetDate')
	alter table tblTweet drop constraint uq_tblTweet_tweet_tweetDate

alter table tblTweet
add constraint uq_tblTweet_tweet_tweetDate_twitterUserId unique (tweet, tweetDate, twitterUserId) 


Secondly, I added some paramters to the script. The parameter -searchString takes a single string to search on. Or you can specify a file path as an argument to the -inputFile parameter and it will go through each of the strings and do a search on those strings.

If you use both parameters it will do the -searchString first and then the -inputFile

I needed to update the proc which moves data from the staging table to the data tables so that the staging table doesn’t grow indefinitely. I like the data to hang around for a bit in case I need to debug some error so I don’t want to truncate the table every day. I decided to add a paramater to the proc so that it deletes data whose import date (not the twitter date but the date the data was imported) is more than a certain number of days old. The number of days is passed as an argument on the command line of the PowerShell script call. The default is 30 days for both the PowerShell script and the stored proc and the paramter is named -delAgedStagingData. If you pass a 0 it will not delete any data

Here is the updated proc:


use twitter
go

if object_id(N'uspLoadTwitterDataFromStaging') is not null
	drop procedure uspLoadTwitterDataFromStaging

go

create procedure uspLoadTwitterDataFromStaging
/*
If data is oldet than @deleteDataAge in days, delete it
default = 30
if 0 then don't delete data
*/
@delAgedStagingData int = 30 
as
begin


	insert into tblTwitterUser (twitterUserName)
	select twitterUser
	from tblTwitterStaging
	except
	select twitterUserName
	from tblTwitterUser


	/*
	I could prime this table in a config step 
	but I want the whole thing to be driven by user configurable files in the end product
	or parameters on the command line

	Anyway, the PowerShell script is responsible for inserting the search string into the staging table
	*/

	insert into tblSearchString (searchString)
	select searchString
	from tblTwitterStaging
	except 
	select searchString
	from tblSearchString


	--there is some weird date stuff going on - salvage what you can!
	update tblTwitterStaging
	set twitterdate = convert(datetime,left(twitterdate,10))
	where isdate(twitterdate) = 0


	insert into tblTweet(tweet,tweetDate,twitterUserId)
	select distinct stg.tweet, convert(datetime,stg.twitterDate), u.twitterUserId
	from tblTwitterStaging stg
	inner join tblTwitterUSer u
	on stg.twitterUser = u.twitterUserName

	except select tweet,tweetDate,twitterUserId
	from tblTweet


	insert into tblTweetBySearchString(tweetId,searchStringId)
	select t.tweetId, s.searchStringId
	from tblTwitterStaging stg
	inner join tblTweet t
	on stg.tweet = t.tweet
	inner join tblSearchString s
	on stg.searchString = s.searchString
	except select tweetId,searchStringId
	from tblTweetBySearchString





	if @delAgedStagingData <> 0
		
		delete tblTwitterStaging where datediff(day,importDate,getdate()) > @delAgedStagingData
	
		
end

I also found one case where the published string came back with a format that SQL server couldn’t handle. If I dropped off the time then it could convert. That causes a bit of data loss but I figure it’s better than losing the record and it seems like the situation is pretty rare.

Thirdly, I decided for ease of use to do the URL encoding in the PowerShell script. So when you pass in a string you simply type the search string as you would into any search box. So, if you want to search for “i am the cloud” you can just pass that string as a paramter.

Here is the updated PowerShell script.



param([string]$searchString = "", [string]$inputFile = "", [int]$delAgedStagingData = 30)
Write-Host "searchString: $searchString"
Write-Host "inputFile: $inputFile"
Write-Host "delAgedStagingData: $delAgedStagingData"


[System.Reflection.Assembly]::LoadWithPartialName("System.Web") | out-null
$searchString = [System.Web.HttpUtility]::UrlEncode($searchString)



Function getSearchPage ($searchString, $page)
{
	"searchString = $searchString"
	"page = $page"


	$req = "http://search.twitter.com/search.atom?q=$searchString&show_user=1&rpp=100&since_id=1&page=$page"
	
	
	"`$req = $req"
	
	([xml](new-object net.webclient).DownloadString($req)).feed.entry |
	%{

		#replace problematic characters in input
		$title = $_.title -replace ("`t","``t")
		$title = $title -replace ("'","''")
		$title = $title -replace ("`n","``n")
		$title = $title -replace ("`r","``r")
		$title = $title -replace ("`$","``$")
		$title = $title -replace ("`"","`"`"")
		$published = $_.published -replace ("`t"," ")
		$published = $published -replace ("'","''")
		$published = $published -replace ("`n","``n")
		$published = $published -replace ("`"","`"`"")
		
		
		
		
		$splitData = $title.split(":")
		$twitterUser = $splitData[0]
		
		$tweet = $title.substring($twitterUser.length,$title.length - $twitterUser.length)
		if ($tweet.length -ge 2)
		{$tweet = $tweet.substring(2,$tweet.length -2)}
		
		

		
		#turn input into sql statement
		if ($tweet.length -gt 0){
			$sql = "insert into twitter.dbo.tblTwitterStaging 
			(twitterUser,tweet, twitterDate,searchString) 
			select '$twitterUser','$tweet' ,'$published','$searchString'" 
			
			sqlcmd -E -Q $sql 
			

		}

	} 

	
}

Function getLast10SearchPages ($searchString)
{
	for ($i=10; $i -ge 1; $i--)
	{



		$i

		getSearchPage $searchString $i



	} 
}


#process command line searhc string first
if ($searchString -ne '') {getLast10SearchPages ($searchString)}

#process any search strings in the file
if ($inputFile  -ne '') {Get-Content $inputFile | Foreach-Object {getLast10SearchPages ([System.Web.HttpUtility]::UrlEncode($_))}}



sqlcmd -E -Q "exec twitter.dbo.uspLoadTwitterDataFromStaging @delAgedStagingData = $delAgedStagingData"



I’ve been calling the file twitter_search_to_db.ps1 and assuming that you call yours the same thing here are some sample calls. I created a text file called search_strings.txt which sits in the same directory in which I’m running my script. Here is what is saved in search_strings.txt:

#powershell
#sqlhelp
i am the cloud
death cab for cutie
florence + the machine
mary timony
Jolie Holland

Here is the call with a search string:

.\twitter_search_to_db.ps1 -searchString "J. Vernon McGee"

Here is a call with a file parameter:

.\twitter_search_to_db.ps1 -inputFile ".\search_strings.txt" -delAgedStagingData 30

And here is a call with both:

.\twitter_search_to_db.ps1 -searchString "J. Vernon McGee" -inputFile ".\search_strings.txt" 

Notice that the search string is stored URL encoded in the tables:

select * from twitter.dbo.tblSearchString

I figure that this is closer to what is actually required by the program. It should be fairly safe for use in SQL and over the web. It’s up to the interface to transform user input into something computer friendly. You could store both inside but I try to avoid repetivive data and since the unencoded data should be fairly easy to get from the encoded data it seems silly to have both. You might end up with difficult to find bugs if you have both.

I would recommend you add this as a scheduled task. That’s fairly easy to do. You can find details on that here along with some other nice PowerShell tricks.

I hope to play around a bit with displaying the results but my plan for my next post is probably going to be some pure T-SQL stuff. I feel like I could go on and on with this project. I love PowerShell but my deeper interest in T-SQL is starting to pull me away. So I’m going to take a brief PowerShell vaction.

Advertisements
This entry was posted in PowerShell, SQL and tagged , , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s