Saturday, 26 January 2008
Parsing Morrisons
I've just finished spidering and parsing Morrisons. It came out at 360 stores; I have no idea whether that is correct, but hopefully it is about right. It is interesting how much difference the construction of a site's web pages makes. Morrisons was quite tricky because it doesn't have any kind of browsable index, so a query needed to be written for each postal district, which produced lots of duplicates that then had to be filtered out. Contrast this with a site like Tesco or Waitrose, which links directly to each store page, making the spidering very simple.
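As a rough illustration of the duplicate filtering described above, here is a minimal Python sketch. It is not the code used for this site: the function name, the dictionary layout, and the choice of store name plus postcode as the identity key are all assumptions made for the example.

```python
def dedupe_stores(results_by_district):
    """Merge per-district search results, keeping the first copy of each store.

    results_by_district is a hypothetical structure: a dict mapping a postal
    district to the list of store dicts that the query for it returned.
    """
    seen = set()
    unique = []
    for stores in results_by_district.values():
        for store in stores:
            # Assumed identity key: normalised name plus postcode. If the
            # site exposes a stable store ID, that would be a better key.
            key = (
                store["name"].strip().lower(),
                store["postcode"].replace(" ", "").upper(),
            )
            if key not in seen:
                seen.add(key)
                unique.append(store)
    return unique
```

Neighbouring districts tend to return the same nearby stores, so keying on a normalised value rather than the raw text avoids counting a store twice just because of spacing or capitalisation.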
Very nice! Any ETA on Sainsburys?
Thanks. Sainsbury's is working, but some stores are missing as I was too conservative when guessing URLs. I'm planning to re-scrape, as all the data is from January.
Great! Also, it would be nice if the initial postcode input box could handle postcodes with the middle space left out.
Good tip. Comments like that will force me to start using some sort of bug tracker ;)
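One possible way to handle the postcode suggestion above, sketched in Python. It relies on the fact that the inward part of a full UK postcode is always the last three characters, so the space can be re-inserted even when the user omits it. The function name is made up for this example, and the pattern is a simplified shape check rather than a complete validation of every real postcode.

```python
import re

# Simplified shape of a full UK postcode with whitespace removed (an
# assumption for this sketch; it does not check that the postcode exists).
_COMPACT = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]?[0-9][A-Z]{2}$")


def normalise_postcode(raw):
    """Return the postcode as 'OUTWARD INWARD', or None if it looks invalid."""
    compact = re.sub(r"\s+", "", raw).upper()
    if not _COMPACT.match(compact):
        return None
    # The inward code is always the final three characters.
    return compact[:-3] + " " + compact[-3:]
```

With this, inputs such as `sw1a1aa` and `SW1A 1AA` both normalise to the same string before the store lookup runs.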