Datafeed Scraping

Filed Under (Excel, Regular Expressions, Scraping, VBA) by zl1n0x on 11-11-2008

Tagged Under : , , ,

This is the first post in my Learn to Scrape series.  We’re going to rip content from merchant websites using their affiliate datafeed and Microsoft Excel.  You might want to do something like this if you are setting up a price comparison style store that is populated from the content that merchants provide within their datafeeds.  One thing that I have noticed with these is that the quality of the content for products is often poor, and due to the ease with which you can set up these sites virtually every merchant you choose will already have many identical looking affiliate sites with the exact same content appearing in the SERP’s.  This is a way to differentiate your site with more relevent information regarding the products you are promoting.

For this example we will use the following tools:

  • Our affiliate store will be based on Affilistore.  This is a free datafeed solution that is probably the easiest package to get started with
  • We will scrape data from the merchant sites using Microsoft Excel.  If you dont have a copy Microsoft have a free 60 day trial.
  • We will use merchant datafeeds from Affiliate Window, a UK based affiliate network.  You could just as easily use datafeeds from Commission Junction or any other source, just so long as you can get the info in CSV format.
  • We will use Regular Expressions to extract the data from the page
  • Once we have scraped the information you can import it from your Excel spreadsheet to a database of your choice.

One caveat before we begin.  Many datafeeds contain hundreds or even thousands of products.  The technique we use is to loop through the products deep link, request the page, then extract the result from the response.  You need to be careful when doing this as some merchants will recognise it as bot traffic and may not like what you are doing.  You should take the following precautions:

  • Dont use the affiliate networks deep link, as this contains your affiliate id.  Use the direct url to the product.
  • Dont scrape during high traffic times.  Most merchants will handle the traffic just fine, but if you get unlucky and bring their site down with a constant stream of requests during their busiest hour you will have one pissed off merchant on your hands (assuming they can trace it back to you).
  • If you are really paranoid then find a good proxy and do your scraping through this.

OK, so thats the preliminaries out of the way, now for the fun stuff.  The first thing we will do is select a merchant that we want to promote.  For this example I will use a UK based book merchant Borders.co.uk from Affiliate Window.  Download the datafeed in CSV using a pipe delimiter as product descriptions often have commas which will bugger up Excel.  The datafeed is massive, 200,000+ records, so I’m deleting all but 4.  Lets do a comparison between Borders and Affilistore.

This is the Borders page for The Dark Tower.  In the product details section we have a nice long synopsis, then some technical details describing the book in a block on the right including the ISBN, number of pages, category and publication date.

The Dark Tower Page

This is the Affilistore page for the same product.  We get a brief description of the book and no technical details.

The Dark Tower Affilistore Page

Wouldn’t it be nice to have those technical details on our site?  Yes it would.  So lets do it.  The first thing to do is to import the datafeed into Excel.  Open Excel and create a new blank workbook.  Save it do disk as a Macro enabled workbook if you are using 2007.  Select the option to Import data from text and follow the wizards instructions to load it in.

Excel datafeed import step 1

Select the correct delimter

Excel Datafeed Import Step 2

Right, so now you’ve got the datafeed loaded in Excel you should create a new vba macro. Microsoft have gone and hidden the Developer tab in 2007 because they think that most of their users are too stupid to need it, which is probably true if you consider the amount of people using MySpace, but anyways this is how you get it to display - right click anywhere in the area at the top where the save icon is. Select “Customize Quick Access Toolbar” - Popular - check the option that says “Show Developer Tab in the Ribbon”

So go to the developer tab and click the Visual Basic icon to create a new macro. We’ll put the macro in “ThisWorkbook”. We’ll be using some external components for this so you will need to make a reference to Microsoft Scripting Runtime, Microsoft XML (v6.0, v5.0, whatever version you have listed) and Microsoft VBScript Regular Expressions 5.5 (or whatever version you have). You’ll find the dialog to select these under Tools - References.

Once you’ve done this add the following code (if the code is clipped in your browser click the “view plain” link)

Option Explicit

Public Sub Scrape()
Dim i As Integer
Dim j As Integer

'sends the web request and gets the response
Dim xmlReq As MSXML2.XMLHTTP60
Set xmlReq = New MSXML2.XMLHTTP60

'we use this to extract the response
Dim reg As VBScript_RegExp_55.RegExp
Set reg = New VBScript_RegExp_55.RegExp

'stores matches to our regex patterns
Dim matches As VBScript_RegExp_55.MatchCollection

Dim synopsis, publisher, publicationDate, category As String
Dim numPages As Integer

For i = 2 To 5 'change these numbers to suit your spreadsheet

'get the page from using our xmlhttp object
xmlReq.Open "GET", Replace(Sheet1.Cells(i, 19), "?utm_campaign=affiliates&utm_source=awin&utm_medium=affiliate-link", ""), False
xmlReq.send

If xmlReq.Status = 200 Then

'initialise our variables
synopsis = ""
publisher = ""
publicationDate = ""
category = ""
numPages = 0

'get the synopsis
reg.Pattern = "<div class=""withBigLetter"">(.*)</div></div>"

Set matches = reg.Execute(xmlReq.responseText)

If matches.Count = 1 And matches.Item(0).SubMatches.Count = 1 Then
synopsis = matches.Item(0).SubMatches.Item(0)
End If

'get the books details
reg.Pattern = "<div class=""title"">(.*)</div>\r\n[\S*\s*]<div class=""entry"">(.*)</div>"
reg.Global = True

Set matches = reg.Execute(xmlReq.responseText)

If matches.Count > 0 And matches.Item(0).SubMatches.Count = 2 Then

For j = 0 To matches.Count - 1

If matches.Item(j).SubMatches.Item(0) = "Publisher" Then
publisher = Replace(Replace(matches.Item(j).SubMatches.Item(1), "<h3>", ""), "</h3>", "")
ElseIf matches.Item(j).SubMatches.Item(0) = "Publication date" Then
publicationDate = matches.Item(j).SubMatches.Item(1)
ElseIf matches.Item(j).SubMatches.Item(0) = "Number of pages" Then
numPages = CInt(matches.Item(j).SubMatches.Item(1))
ElseIf matches.Item(j).SubMatches.Item(0) = "Category" Then
category = matches.Item(j).SubMatches.Item(1)
End If

Next j

'the category is surrounded by an anchor tag, so we'll extract the text and remove the link
reg.Pattern = "<[^>]+>"
category = reg.Replace(category, "")

'populate merchant_id field
Sheet2.Cells(i, 1) = Sheet1.Cells(i, 1)

'populate newtork_id field
Sheet2.Cells(i, 2) = Sheet1.Cells(i, 3)

'populate merchants product id field
Sheet2.Cells(i, 3) = Sheet1.Cells(i, 4)

'now add the scraped info
Sheet2.Cells(i, 4) = synopsis
Sheet2.Cells(i, 5) = publisher
Sheet2.Cells(i, 6) = publicationDate
Sheet2.Cells(i, 7) = numPages
Sheet2.Cells(i, 8 ) = category

End If

End If

Next i

End Sub

Most of this is actually pretty straightforward, the trickiest part is coming up with the regular expressions to get the bits of data that you are interested in. My go-to reference for regular expressions is at WebReference, in particular I use the Regex rules page. I also use the Regex tester at RegexpLib. The thing to remember is that you are writing throwaway scripts to scrape data - they dont have to be elegant so if you have to write hacky shit to get the job done fast then do it.

The first pattern pulls out the synopsis which is found in a div with class “withBigLetter”. The (.*) returns any text in its own match group that we can access later in VBA

reg.Pattern = "<div class=""withBigLetter"">(.*)</div></div>"

The second pattern does much the same thing, except you will notice the two sets of (.*) meaning we will have the content returned in 2 groups, representing the key value pairs in the table on the right of the synpopsis.

reg.Pattern = "<div class=""title"">(.*)</div>\r\n[\S*\s*]<div class=""entry"">(.*)</div>"

The third pattern matches any html tags - we use this to strip the unwanted anchor tag from the category name

reg.Pattern = "<[^>]+>"

On running this sheet 2 of your spreadsheet should populate with the data from the merchants site. Its a simple process then to load this into your database and modify your site to put these extra details on the screen for your products, something that we will cover using Affilistore in the next post in this series.

Stumble it!

Comments:

Total 2 Comments posted Datafeed Scraping

Post a comment