Nov 11
2008Datafeed Scraping
Filed Under (Excel, Regular Expressions, Scraping, VBA) by zl1n0x on 11-11-2008
Tagged Under : Excel, Regular Expressions, Scraping, VBA
This is the first post in my Learn to Scrape series. We’re going to rip content from merchant websites using their affiliate datafeed and Microsoft Excel. You might want to do something like this if you are setting up a price comparison style store that is populated from the content that merchants provide within their datafeeds. One thing that I have noticed with these is that the quality of the content for products is often poor, and due to the ease with which you can set up these sites virtually every merchant you choose will already have many identical looking affiliate sites with the exact same content appearing in the SERP’s. This is a way to differentiate your site with more relevent information regarding the products you are promoting.
For this example we will use the following tools:
- Our affiliate store will be based on Affilistore. This is a free datafeed solution that is probably the easiest package to get started with
- We will scrape data from the merchant sites using Microsoft Excel. If you dont have a copy Microsoft have a free 60 day trial.
- We will use merchant datafeeds from Affiliate Window, a UK based affiliate network. You could just as easily use datafeeds from Commission Junction or any other source, just so long as you can get the info in CSV format.
- We will use Regular Expressions to extract the data from the page
- Once we have scraped the information you can import it from your Excel spreadsheet to a database of your choice.
One caveat before we begin. Many datafeeds contain hundreds or even thousands of products. The technique we use is to loop through the products deep link, request the page, then extract the result from the response. You need to be careful when doing this as some merchants will recognise it as bot traffic and may not like what you are doing. You should take the following precautions:
- Dont use the affiliate networks deep link, as this contains your affiliate id. Use the direct url to the product.
- Dont scrape during high traffic times. Most merchants will handle the traffic just fine, but if you get unlucky and bring their site down with a constant stream of requests during their busiest hour you will have one pissed off merchant on your hands (assuming they can trace it back to you).
- If you are really paranoid then find a good proxy and do your scraping through this.
OK, so thats the preliminaries out of the way, now for the fun stuff. The first thing we will do is select a merchant that we want to promote. For this example I will use a UK based book merchant Borders.co.uk from Affiliate Window. Download the datafeed in CSV using a pipe delimiter as product descriptions often have commas which will bugger up Excel. The datafeed is massive, 200,000+ records, so I’m deleting all but 4. Lets do a comparison between Borders and Affilistore.
This is the Borders page for The Dark Tower. In the product details section we have a nice long synopsis, then some technical details describing the book in a block on the right including the ISBN, number of pages, category and publication date.

This is the Affilistore page for the same product. We get a brief description of the book and no technical details.

Wouldn’t it be nice to have those technical details on our site? Yes it would. So lets do it. The first thing to do is to import the datafeed into Excel. Open Excel and create a new blank workbook. Save it do disk as a Macro enabled workbook if you are using 2007. Select the option to Import data from text and follow the wizards instructions to load it in.

Select the correct delimter

Right, so now you’ve got the datafeed loaded in Excel you should create a new vba macro. Microsoft have gone and hidden the Developer tab in 2007 because they think that most of their users are too stupid to need it, which is probably true if you consider the amount of people using MySpace, but anyways this is how you get it to display - right click anywhere in the area at the top where the save icon is. Select “Customize Quick Access Toolbar” - Popular - check the option that says “Show Developer Tab in the Ribbon”
So go to the developer tab and click the Visual Basic icon to create a new macro. We’ll put the macro in “ThisWorkbook”. We’ll be using some external components for this so you will need to make a reference to Microsoft Scripting Runtime, Microsoft XML (v6.0, v5.0, whatever version you have listed) and Microsoft VBScript Regular Expressions 5.5 (or whatever version you have). You’ll find the dialog to select these under Tools - References.
Once you’ve done this add the following code (if the code is clipped in your browser click the “view plain” link)
Option Explicit Public Sub Scrape() Dim i As Integer Dim j As Integer 'sends the web request and gets the response Dim xmlReq As MSXML2.XMLHTTP60 Set xmlReq = New MSXML2.XMLHTTP60 'we use this to extract the response Dim reg As VBScript_RegExp_55.RegExp Set reg = New VBScript_RegExp_55.RegExp 'stores matches to our regex patterns Dim matches As VBScript_RegExp_55.MatchCollection Dim synopsis, publisher, publicationDate, category As String Dim numPages As Integer For i = 2 To 5 'change these numbers to suit your spreadsheet 'get the page from using our xmlhttp object xmlReq.Open "GET", Replace(Sheet1.Cells(i, 19), "?utm_campaign=affiliates&amp;amp;amp;amp;amp;amp;utm_source=awin&amp;amp;amp;amp;amp;amp;utm_medium=affiliate-link", ""), False xmlReq.send If xmlReq.Status = 200 Then 'initialise our variables synopsis = "" publisher = "" publicationDate = "" category = "" numPages = 0 'get the synopsis reg.Pattern = "<div class=""withBigLetter"">(.*)</div></div>" Set matches = reg.Execute(xmlReq.responseText) If matches.Count = 1 And matches.Item(0).SubMatches.Count = 1 Then synopsis = matches.Item(0).SubMatches.Item(0) End If 'get the books details reg.Pattern = "<div class=""title"">(.*)</div>\r\n[\S*\s*]<div class=""entry"">(.*)</div>" reg.Global = True Set matches = reg.Execute(xmlReq.responseText) If matches.Count > 0 And matches.Item(0).SubMatches.Count = 2 Then For j = 0 To matches.Count - 1 If matches.Item(j).SubMatches.Item(0) = "Publisher" Then publisher = Replace(Replace(matches.Item(j).SubMatches.Item(1), "<h3>", ""), "</h3>", "") ElseIf matches.Item(j).SubMatches.Item(0) = "Publication date" Then publicationDate = matches.Item(j).SubMatches.Item(1) ElseIf matches.Item(j).SubMatches.Item(0) = "Number of pages" Then numPages = CInt(matches.Item(j).SubMatches.Item(1)) ElseIf matches.Item(j).SubMatches.Item(0) = "Category" Then category = matches.Item(j).SubMatches.Item(1) End If Next j 'the category is surrounded by an anchor tag, so we'll extract the text and remove the link reg.Pattern = "<[^>]+>" category = reg.Replace(category, "") 'populate merchant_id field Sheet2.Cells(i, 1) = Sheet1.Cells(i, 1) 'populate newtork_id field Sheet2.Cells(i, 2) = Sheet1.Cells(i, 3) 'populate merchants product id field Sheet2.Cells(i, 3) = Sheet1.Cells(i, 4) 'now add the scraped info Sheet2.Cells(i, 4) = synopsis Sheet2.Cells(i, 5) = publisher Sheet2.Cells(i, 6) = publicationDate Sheet2.Cells(i, 7) = numPages Sheet2.Cells(i, 8 ) = category End If End If Next i End Sub
Most of this is actually pretty straightforward, the trickiest part is coming up with the regular expressions to get the bits of data that you are interested in. My go-to reference for regular expressions is at WebReference, in particular I use the Regex rules page. I also use the Regex tester at RegexpLib. The thing to remember is that you are writing throwaway scripts to scrape data - they dont have to be elegant so if you have to write hacky shit to get the job done fast then do it.
The first pattern pulls out the synopsis which is found in a div with class “withBigLetter”. The (.*) returns any text in its own match group that we can access later in VBA
reg.Pattern = "<div class=""withBigLetter"">(.*)</div></div>"
The second pattern does much the same thing, except you will notice the two sets of (.*) meaning we will have the content returned in 2 groups, representing the key value pairs in the table on the right of the synpopsis.
reg.Pattern = "<div class=""title"">(.*)</div>\r\n[\S*\s*]<div class=""entry"">(.*)</div>"
The third pattern matches any html tags - we use this to strip the unwanted anchor tag from the category name
reg.Pattern = "<[^>]+>"
On running this sheet 2 of your spreadsheet should populate with the data from the merchants site. Its a simple process then to load this into your database and modify your site to put these extra details on the screen for your products, something that we will cover using Affilistore in the next post in this series.
Stumble it!


[...] has written here how they use the supplied vendor feed and advanced use of Excel to scrape for particular fields [...]
WHOA!!!!
Now that’s a lot of information. I’ve never done datafeeds. Looks complicated.