Saturday, 26 January 2008

Parsing Morrisons

Just finished spidering and parsing Morrisons. It came out at 360 stores; I have no idea if this is correct, but hopefully it is about right. It is interesting how much difference the construction of a site's web pages makes. Morrisons was quite tricky as it didn't have any type of browsable index, so a query needed to be written for each postal district, which resulted in lots of duplicates that needed to be filtered. Contrast this with a site like Tesco or Waitrose, which linked directly to each page, making the spidering very simple.
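For anyone curious about the duplicate filtering, here is a minimal sketch of the idea in Python. It is not the actual scraper code: the function name `dedupe_stores`, the dictionary layout and the choice of name plus postcode as the identifying key are all assumptions made for illustration, and the sample stores in the usage note are made up.

```python
def dedupe_stores(results_by_district):
    """Merge per-district search results, keeping the first copy of each store.

    results_by_district maps a postal district (e.g. "LS1") to the list of
    store records returned by the query for that district. Neighbouring
    districts tend to return the same stores, hence the duplicates.
    """
    seen = set()
    unique = []
    for stores in results_by_district.values():
        for store in stores:
            # Assumption: a store is identified by its name and postcode,
            # compared case-insensitively and ignoring spaces in the postcode.
            key = (
                store["name"].strip().lower(),
                store["postcode"].replace(" ", "").upper(),
            )
            if key not in seen:
                seen.add(key)
                unique.append(store)
    return unique
```

Calling it with something like `{"LS1": [store_a], "LS2": [store_a, store_b]}` should give back a list with `store_a` and `store_b` once each. A real site may need a different key, for example a store id taken from the page URL.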

4 comments:

  1. Very nice! Any ETA on Sainsburys?

  2. Thanks. Sainsbury's is working, but some stores are missing as I was too conservative when guessing URLs. I'm planning to re-scrape, as all the data is from January.

  3. Great! Also, would be nice if the initial postcode input box could handle postcodes with the middle space left out.

  4. Good tip. Comments like that will force me to start using some sort of bug tracker ;)

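On the postcode suggestion in comment 3, one possible way to accept input with the middle space left out is sketched below. It relies on the fact that the second half of a UK postcode is always three characters, so the space can be put back before the last three. The function name `normalise_postcode` is hypothetical, and this is not how the site's input box is actually implemented.

```python
def normalise_postcode(raw):
    """Return a UK postcode in upper case with a single middle space.

    Accepts input with or without the space, e.g. "ls28by" or "LS2 8BY".
    This only fixes the layout; it does not check that the postcode exists.
    """
    compact = "".join(raw.split()).upper()
    # Without the space, a full UK postcode is 5 to 7 characters long.
    if not 5 <= len(compact) <= 7:
        raise ValueError("not a full UK postcode: %r" % raw)
    # The second half is always three characters, so split before it.
    return compact[:-3] + " " + compact[-3:]
```

With this, `normalise_postcode("ls28by")` and `normalise_postcode("LS2 8BY")` both come out as `"LS2 8BY"`, so the rest of the lookup only ever sees one form.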
