
Topic: Web scraping in Java or C++

posts 1–8 of 8
Page 1
member
124 posts
Now that i'm a full-blown Linux user, all my old VB6 code has become semi-obsolete. Not that it's a major problem, because after several years of botting against betfair, i have finally accepted defeat (sort of). On the plus side, moving to Linux has given me that extra push to learn C/C++ and Java, something i've been meaning to do for a couple of years but never bothered with because coding whatever i wanted in VB6 was always the easy option.

I've done a little Java and understand the concept of classes and the whole OO thing, however i've not previously had a specific project to keep me interested. To cut a long story short, i need to do some web scraping in order to gather some info from several websites. I'd like to use some "native" Java functions and i'm not keen on using any 3rd party libraries because Java is already very bloaty. What i'm asking for is some advice and/or source code on the most simple and efficient way to:
  1. download the HTML source from a given URL.
  2. parse the HTML. In VB6, i wrote a simple function like this: GetSubstring(tag1, tag2) which returned the string in between the given tags/strings
I'm also interested in C++ code to achieve the same. I'm using Eclipse for Java and Code::Blocks for my C++ programming.
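For step 2, a helper along the lines of the VB6 GetSubstring(tag1, tag2) can be sketched with nothing but java.lang.String. This is only an illustration: the class name SubstringScraper, the extra source parameter and the choice to return null when a tag is missing are assumptions, not a port of the original function.

```java
public class SubstringScraper {

    /**
     * Returns the text between the first occurrence of startTag and the
     * next occurrence of endTag after it, or null if either tag is missing.
     * (Returning null on a miss is an assumption; the VB6 original may differ.)
     */
    public static String getSubstring(String source, String startTag, String endTag) {
        int start = source.indexOf(startTag);
        if (start < 0) {
            return null;                       // opening tag not found
        }
        start += startTag.length();            // skip past the opening tag
        int end = source.indexOf(endTag, start);
        if (end < 0) {
            return null;                       // closing tag not found
        }
        return source.substring(start, end);
    }

    public static void main(String[] args) {
        // Made-up HTML snippet, purely for illustration.
        String html = "<html><title>Example odds page</title></html>";
        System.out.println(getSubstring(html, "<title>", "</title>"));
    }
}
```

Plain string searching like this is fragile if the page layout changes, but it needs no 3rd party code.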

Thanks in advance.
member
25 posts

Can't help with the C++/Java scraping, sorry, but I just wondered how you were managing with online poker under linux. I remember you were keen on the betfair site a month or so ago; have you had any luck running their software under linux?

member
30 posts

This should get you started in Java:
http://parthian-shot.blogspot.com/2007/09/html-screen-scraping-easy-way.html

If you don't really want to use an HTML parsing library (what'll you save, a few hundred KB?) get a string from the connection's input stream - everything up to that point is 'native'.
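A rough sketch of the "string from the connection's input stream" idea, using only java.net and java.io. The class name PageFetcher, the 10 second timeouts and the UTF-8 decoding are all assumptions; a real page may declare a different charset, and some sites will want a User-Agent header or cookies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class PageFetcher {

    /** Downloads whatever the given URL returns and hands it back as one String. */
    public static String fetch(String address) throws IOException {
        URLConnection conn = URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(10_000);   // assumed timeouts, in milliseconds
        conn.setReadTimeout(10_000);

        StringBuilder page = new StringBuilder();
        // Assumes the page is UTF-8; check the Content-Type header if it is not.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                page.append(buffer, 0, n);
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // example.com is a placeholder address, not one of the sites discussed here.
        String html = fetch("http://example.com/");
        System.out.println(html.length() + " characters downloaded");
    }
}
```

The returned string can then be searched with indexOf/substring or passed to a parsing library.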

If you change your mind and decide to use someone else's parsing code, there are plenty of options:
http://java-source.net/open-source/html-parsers

And, if you need cookie handling:
http://blogs.sun.com/CoreJavaTechTips/entry/cookie_handling_in_java_se
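For reference, a small sketch of cookie handling with the java.net.CookieManager class that ships with Java SE 6 and later. The class name CookieSetup, the ACCEPT_ALL policy and the sample cookie are assumptions for illustration; the linked article may take a different approach.

```java
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpCookie;
import java.net.URI;

public class CookieSetup {

    /**
     * Installs a JVM-wide cookie handler so that later URLConnection/
     * HttpURLConnection requests store and resend cookies automatically.
     */
    public static CookieManager install() {
        // null store = default in-memory store; ACCEPT_ALL is an assumed policy.
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    public static void main(String[] args) {
        CookieManager manager = install();

        // Normally cookies arrive via Set-Cookie headers; one is added by hand
        // here (placeholder site and value) just to show how to inspect the store.
        manager.getCookieStore().add(
                URI.create("http://example.com/"),
                new HttpCookie("session", "abc123"));

        for (HttpCookie cookie : manager.getCookieStore().getCookies()) {
            System.out.println(cookie.getName() + "=" + cookie.getValue());
        }
    }
}
```

Calling install() once before the first request should be enough for a login session cookie to carry over to later page fetches.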

member
124 posts
Thanks, that should be enough to get me rolling. :)
member
70 posts

Anyone done any scraping against betdaq? I noticed today that the login call doesn't seem to go over a secure connection ...??

member
4 posts

Looks that way, doesn't it? Without digging around, I'd guess that it relies on JavaScript to redirect the post to a secure address.
Why screen scrape though? The Betdaq API is free.

member
70 posts

Hmm ... well I surfed via a web proxy and it seemed to send my username and password over http ... plain for all to see.

Scraping still has its occasional uses...

member
124 posts
Dunno what the situation is now, but several months ago i decided that if i was gonna go scraping again, i'd go to WBX because of the cheaper commission. I'm getting more fun out of poker at the moment. I actually feel in control of my own odds and the 30% rakeback makes it a whole lot sweeter.