Downloading the Results of a Twitter Search using PowerShell and SQL (part 2)

I’ve expanded a little on my Twitter project. (See part 1 here.) I adjusted the PowerShell script to page through the data going back as far as Twitter holds the data. There is still a lot to be done.

Left to be done:

  • allow for dynamic search strings instead of hard coding
  • read search strings from file or command line
  • better structure in the script

But today I want to focus on the data model. I pasted the commented DDL statements below


--assuming you have created a database for this project. I'm calling mine simply twitter
use twitter 
go

/*

staging - dump all the strings in hear then clean them with T-SQL

*/


create table tblTwitterStaging(
	staging_row_id int identity
	/*
	I'm making the twitterUser and tweet columns much larger than
	the twitter app allows because this is a staging DB and I want to be sure that
	I can handle any mess that twitter sends me
	
	As a general rule, I try to code as defesnively as the situation allows
	
	
	*/
	,twitterUser nvarchar(1000)
	,tweet nvarchar(1000) 
	
	--use nvarchar to avoid differences in formatting between their system and SQL server.
	,twitterDate nvarchar(25) 
	
	--the actual date the row is inserted into the staging table
	,importDate datetime 
	constraint df_tblTwitterStaging_importDate default getdate()
	
	--search string sent to twitter
	,searchString nvarchar(1000) 
	
)


/*

Since a user can have multiple Tweets across many searches it seems good to give them a table

*/

create table tblTwitterUser(
	twitterUserId int identity
	--code defensively - more than twitter will allow
	,twitterUserName nvarchar(450) not null 
)


alter table tblTwitterUser
add constraint pk_tblTwitterUser primary key (twitterUserId)

/*
I can't see how twitter could allow dups in user name
So let's put that in our data model

We may be able to use knowledge like this later for query optimization and so forth
*/
alter table tblTwitterUser
add constraint uq_tblTwitterUser_twitterUserName unique (twitterUserName) 

	
/*

I'm going to scale this out to track multiple searches

*/


create table tblSearchString(
	searchStringId int identity
	,searchString nvarchar(450)
)		


alter table tblSearchString
add constraint pk_tblSearchString primary key (searchStringId)

--We control the search strings and it doesn't make any sense to have multiple, identical search strings
alter table tblSearchString
add constraint uq_tblSearchString_searchString unique (searchString)



create table tblTweet(
	tweetId int identity
	,twitterUserId int
	--once again, much longer than we'd ever expect a tweet to be but coding defensively
	,tweet nvarchar(400) 
	,tweetDate datetime
)

alter table tblTweet
add constraint pk_tblTweet primary key (tweetId)

/*
I don't care about duplicate tweets
if the same tweet shows up in more than one search it will tie back to this one
*/
alter table tblTweet
add constraint uq_tblTweet_tweet_tweetDate unique (tweet, tweetDate) 

--every tweet ties to a user
alter table tblTweet
add constraint fk_tblTweet_tblTwitterUser foreign key (twitterUserId) references tblTwitterUser(twitterUserId)

/*
each search will return zero or more tweets

The same tweet could be returned by more than one search string
So this relationship needs its own table
*/
create table tblTweetBySearchString(
	tweetId int not null
	,searchStringId int not null
)		

alter table tblTweetBySearchString
add constraint pk_tblTweetBySearchString primary key (tweetId, searchStringId)

alter table tblTweetBySearchString
add constraint fk_tblTweetBySearchString_tblTweet foreign key (tweetId) references tblTweet(tweetId)

alter table tblTweetBySearchString
add constraint fk_tblTweetBySearchString_tblSearchString foreign key (searchStringId) references tblSearchString(searchStringId)

go

/*

And here we have the actual results

I'm using a view so that a client application will not break if the table schemas change

*/
create view vwTweetsBySearchString
as
select  s.searchString, u.twitterUserName,t.tweet, t.tweetDate

from tblTweetBySearchString t_by_s

inner join tblSearchString s
on t_by_s.searchStringId = s.searchStringId

inner join tblTweet t
on t_by_s.tweetId = t.tweetId

inner join tblTwitterUser u
on t.twitterUserId = u.twitterUserId

Here is the proc we’ll use to move the data from staging to the tables we’ll actually query

use twitter
go

if object_id(N'uspLoadTwitterDataFromStaging') is not null
	drop procedure uspLoadTwitterDataFromStaging

go

create procedure uspLoadTwitterDataFromStaging
as
begin


	insert into tblTwitterUser (twitterUserName)
	select twitterUser
	from tblTwitterStaging
	except
	select twitterUserName
	from tblTwitterUser


	/*
	I could prime this table in a config step 
	but I want the whole thing to be driven by user configurable files in the end product
	or parameters on the command line

	Anyway, the PowerShell script is responsible for inserting the search string into the staging table
	*/

	insert into tblSearchString (searchString)
	select searchString
	from tblTwitterStaging
	except 
	select searchString
	from tblSearchString




	insert into tblTweet(tweet,tweetDate,twitterUserId)
	select distinct stg.tweet, convert(datetime,stg.twitterDate), u.twitterUserId
	from tblTwitterStaging stg
	inner join tblTwitterUSer u
	on stg.twitterUser = u.twitterUserName

	except select tweet,tweetDate,twitterUserId
	from tblTweet


	insert into tblTweetBySearchString(tweetId,searchStringId)
	select t.tweetId, s.searchStringId
	from tblTwitterStaging stg
	inner join tblTweet t
	on stg.tweet = t.tweet
	inner join tblSearchString s
	on stg.searchString = s.searchString
	except select tweetId,searchStringId
	from tblTweetBySearchString






end

And here is the PowerShell script to get the data into the data staging table. Like I said before there is a lot to be done but it pages through the data and puts the data into our staging table using sqlcmd.




for ($i=10; $i -ge 1; $i--)
{
	
	
	$i
	$req = "http://search.twitter.com/search.atom?q=%23sqlhelp&show_user=1&rpp=100&since_id=1&page=$i"
	
	([xml](new-object net.webclient).DownloadString($req)).feed.entry |
	%{

		#replace problematic characters in input
		$title = $_.title -replace ("`t","``t")
		$title = $title -replace ("'","''")
		$title = $title -replace ("`n","``n")
		$title = $title -replace ("`r","``r")
		$title = $title -replace ("`"","`"`"")
		$published = $_.published -replace ("`t"," ")
		$published = $published -replace ("'","''")
		$published = $published -replace ("`n","``n")
		$published = $published -replace ("`"","`"`"")
		
		
		

		
		$splitData = $title.split(":")
		$twitterUser = $splitData[0]
		
		$tweet = $title.substring($twitterUser.length,$title.length - $twitterUser.length)
		if ($tweet.length -ge 2)
		{$tweet = $tweet.substring(2,$tweet.length -2)}
		
		

		
		#turn input into sql statement
		if ($tweet.length -gt 0){
			$sql = "insert into twitter.dbo.tblTwitterStaging 
			(twitterUser,tweet, twitterDate,searchString) 
			select '$twitterUser','$tweet' ,'$published','%23sqlhelp'" 
		
			#insert into staging table
			sqlcmd -E -Q $sql
		}

	} 
} 

sqlcmd -E -Q "exec twitter.dbo.uspLoadTwitterDataFromStaging"

And here is a view that our downstream systems will use to view the results

create view vwTweetsBySearchString
as
select  s.searchString, u.twitterUserName,t.tweet, t.tweetDate

from tblTweetBySearchString t_by_s

inner join tblSearchString s
on t_by_s.searchStringId = s.searchStringId

inner join tblTweet t
on t_by_s.tweetId = t.tweetId

inner join tblTwitterUser u
on t.twitterUserId = u.twitterUserId

In the next installment in the series I’m hoping to build out the PowerShell script to accept multiple search strings. I may play with the T-SQL a bit as that is the sort of thing I never get tired of doing.

Thoughts? Feature requests?

Advertisements
This entry was posted in PowerShell, SQL and tagged , , , , , , , , , . Bookmark the permalink.

One Response to Downloading the Results of a Twitter Search using PowerShell and SQL (part 2)

  1. Pingback: Downloading the Results of a Twitter Search using PowerShell and SQL (part 3) | SELECT STUFF FROM SQL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s