Some sites, including Betfair, appear to be employing some means to prevent scraping.
When I scrape the prices from Betfair once per second, they only update after five seconds. On another site, dynamic data appears static when scraped.
Anybody know how they do this and how we get round it?
If you want to scrape live Betfair prices, you need to emulate a normal browser login. The login requirement was added when data charges were brought in. You have to log in to identify yourself so that your number of requests can be correctly tallied.
If a site behaves differently when scraped then your browser emulation
probably isn't convincing enough. Check that you're sending the right
headers and handling cookies properly.
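For what it's worth, here is a rough sketch of the "right headers plus cookies" idea in Python, using only the standard library. The header values and the function name `make_browser_like_opener` are my own placeholders, not anything a particular site requires; copy the real values from what your own browser sends.

```python
import http.cookiejar
import urllib.request

# Hypothetical header values; replace them with the ones your browser really sends.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.5",
}


def make_browser_like_opener():
    """Return (opener, jar): an opener that sends browser-like headers
    and stores any cookies the server sets in the jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Replace the default "Python-urllib" User-agent with our own header set.
    opener.addheaders = list(BROWSER_HEADERS.items())
    return opener, jar
```

Every request made with `opener.open(url)` then reuses the same cookie jar, so a session cookie set during login is sent back on later requests, much as a browser would do.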
Thanks, that explains Betfair. As for the other site, my code works fine on another PC but not on my own, and it's unlikely that my PC has been blocked in any way.
"dynamic data appears static" sounds more like a caching issue. It could be caused by many factors. Try adding anti-caching headers. If you are adding time-based parameters to the request, check the time, time zone, etc. For more clues you'd need to compare the full exchange of headers over several requests from the browser, from your code, and from your code on the other PC.
Not sure what the problem is, but by stopping and restarting my program, it somehow retrieves dynamic data, but that's impractical.
A workaround I found is to include a random number in the URL. That'll do me until I find something better.
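In case it helps anyone else, the random-number workaround looks roughly like this in Python. The parameter name `_rnd` and the function name `add_cache_buster` are made up for illustration; it assumes the server simply ignores query parameters it doesn't recognise.

```python
import random
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_cache_buster(url, param="_rnd"):
    """Append a random query parameter so each request URL is unique
    and a cache keyed on the URL cannot answer with a stored copy."""
    parts = urlsplit(url)
    # Keep the existing query parameters and add the random one at the end.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, str(random.randint(0, 10**9))))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Call it on the URL just before each request, so every fetch gets a fresh value.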
That's definitely a caching problem. What language/libraries are you using? Is the other PC running the same operating system?
You're not automating a web browser component are you?
I'd agree with birchy. I used to have problems like that when using Flash + PHP to scrape sites: the program just returned cached data unless I added random numbers to the URL to make each request unique.
I'm using Visual Basic Express and also MS Access on Vista. PC where it works has XP.
I've tried ActiveX Data Objects, MS HTML, MS XML 6, WinHTTP 5.
Not web browser component.
Data refreshes okay with IE.
Are you using the "Pragma: no-cache" header? If not, try it.
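For reference, a minimal Python sketch of sending the anti-caching headers with each request. "Pragma: no-cache" is the older HTTP/1.0 form and "Cache-Control: no-cache" is the HTTP/1.1 equivalent, so sending both is a common belt-and-braces approach. The function name `build_uncached_request` is just a placeholder, and whether this helps depends on where the stale copy is coming from.

```python
import urllib.request

# Ask caches along the way not to serve a stored response without revalidating.
NO_CACHE_HEADERS = {
    "Cache-Control": "no-cache",  # HTTP/1.1
    "Pragma": "no-cache",         # HTTP/1.0 fallback
}


def build_uncached_request(url):
    """Build (but do not send) a request carrying anti-caching headers."""
    return urllib.request.Request(url, headers=NO_CACHE_HEADERS)
```

Pass the result to `urllib.request.urlopen(...)`, or to an opener with a cookie jar, to actually fetch the page.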
I'm no longer scraping Betfair, but I remember that I always used the main loader URL and not the one they use for refreshing:
http://uk.site.sports.betfair.com/betting/api/json/getBootstrapData.do?mi=xxxxx
Also note that I never used any of the URL parameters. And do make sure that you're not using the "prevcache" parameter...although that should be obvious...