
birchy

Betfair Elite

Posts: 591 Member Since: May 11, 2008

Lead

November 19, 2008 17:46:53


Well it's no secret that i'm a big fanboy of scraping rather than using the API. Other than the unrestricted access, fast response times, reliability, less frequent updates, etc, etc, it's useful to have some decent scraping code in your toolbox in order to extract data from various websites. Regular readers know that i'm dabbling in Java because it is multi platform, although i DO think it is a bloaty language.

Soooo, what this thread is all about is scraping. Which languages/libraries make scraping EASY? The obvious requirements are:
1) Ubuntu compatible
2) Fast download of HTML
3) Good HTML parsing library
4) Automatic/Easy handling of cookies
5) Easy implementation of GZIP decompression (optional but preferred)

Following this, some simple source code/tutorial would be useful for the "Code Snippets" section...
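
As a rough starting point for that kind of snippet, here is a minimal Python 3 sketch covering requirements 2, 4 and 5 with nothing but the standard library. The helper names (make_opener, decode_body, fetch) and the User-Agent string are made up for illustration, and it assumes the page is UTF-8 unless told otherwise; treat it as a sketch rather than a finished downloader.

```python
import gzip
import http.cookiejar
import urllib.request


def make_opener():
    """Return (opener, jar); the opener stores and resends cookies automatically."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Ask the server for a gzip-compressed response (hypothetical User-Agent).
    opener.addheaders = [
        ("Accept-Encoding", "gzip"),
        ("User-Agent", "scraper-sketch/0.1"),
    ]
    return opener, jar


def decode_body(raw, content_encoding, charset="utf-8"):
    """Decompress the body if the server gzipped it, then decode it to text."""
    if (content_encoding or "").lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset, errors="replace")


def fetch(opener, url, timeout=10):
    """Download url and return its HTML as text (needs network access)."""
    with opener.open(url, timeout=timeout) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)
```

Calling make_opener() once and reusing the same opener for every fetch() keeps the cookies from earlier responses, which is what a login-then-scrape session needs.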

myrddin

bot addict

Posts: 56 Member Since:June 2, 2008

#1 [url]

November 19, 2008 19:58:29

perl is pretty easy for scraping, especially for 'old school' html websites (although it's not so hot for web 2.0 - type sites, but perhaps these are trickier in most languages).
I use the 'standard' perl libraries for this, most of which have been around for donkey's years. Nearly all of them ship with a standard perl install, and the ones that don't are available on CPAN. They tick all five of your boxes, so I thought I'd throw them up here.

Internet related libraries I use (for scraping AND betfair SOAP stuff) are:

LWP::UserAgent
LWP::Debug
LWP::Simple
HTTP::Request
HTTP::Cookies
XML::Simple
XML::XPath
HTML::TokeParser

Throw in the generic database library DBI and that is pretty much my entire 'toolkit'. Something along the lines of 'Gaffa Tape, WD40, Hammer (optional)', but it works for me.


denp

bot addict

Posts: 65 Member Since:June 14, 2008

#2 [url]

November 19, 2008 20:53:15

i use Python for this kind of stuff, with the SGMLParser. Some easy examples at http://www.boddie.org.uk/python/HTML.html. Python seems to be the preferred scripting language on Ubuntu. Get Eclipse and PyDev. Or IPython is a pretty neat shell with auto complete, command history, persistent environment variables between shells etc.
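
For anyone on Python 3, where the sgmllib module that provided SGMLParser no longer exists, the standard library's html.parser.HTMLParser works along the same lines: subclass it and override the handler methods. A minimal sketch, with LinkCollector and extract_links as made-up names:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased, whatever the page used.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

The same pattern extends to other tags by adding branches to handle_starttag, or by overriding handle_data to pick up the text between tags.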

Not sure what the DIY tools parallel is for python. Perhaps myrddin can help out.


birchy

Betfair Elite

Posts: 591 Member Since:May 11, 2008

#3 [url]

November 19, 2008 20:55:30

myrddin:
Although i've heard of it, i've never done or seen ANY Perl code before so have just Googled it. It looks very similar to PHP. Is it any good for producing reasonable GUIs and are there any decent IDEs with auto-complete (yes, i'm a lazy coder)?

denp:
I've looked at Python several times in the past because i've read that it's easy to learn and quite powerful but i can honestly say that i just don't "get" it. I'm fluent in VB6 and i understand the basics of C, C++, Java and PHP...yet Python totally confuses me.

www.bespokebots.com

"This time next year Rodney, we'll be millionaires!"


myrddin

bot addict

Posts: 56 Member Since:June 2, 2008

#4 [url]

November 19, 2008 22:01:07

The standard way to create GUIs in perl is to use the Tk toolkit, although I couldn't say, hand on heart, that it is either 'good' or 'reasonable' - 'adequate' would be as far as I would go, and certainly nowhere near as easy as VB5 (never tried VB6). Like you, I've never really 'got' Python, but I understand that GUIs and Widgets are pretty straightforward, certainly better than perl.
Eclipse seems to be the default IDE on Ubuntu, and is good for pretty much any language - definitely worth learning if you are going to stick with linux.
Personally I use emacs in a terminal rather than an IDE. It does all that syntax highlighting, debugging and autocomplete nonsense, as well as running a MySQL client and web browser. It's a steep learning curve though, and bugger all use if you want a GUI.
I'm happier with text based interfaces (ncurses if I'm feeling flash), mainly because they are simpler to use over a network and you don't need to run XWindows on your bot machine, but I realise this sort of bare bones approach isn't everyone's cup of tea.

Denp: like I say, I don't really know a lot about Python, but from what I hear it would probably be something very practical, efficient and well-engineered - perhaps one of those Bahco adjustable spanners, or a good German pair of pliers?


birchy

Betfair Elite

Posts: 591 Member Since:May 11, 2008

#5 [url]

November 20, 2008 14:54:10

Nope, still can't get my head around this Python malarkey. I know that Gambas has built in libraries for everything i asked for but documentation and sample code are non-existent, so you're only guessing at what to do. Shame really, because it looks a fine language for Linux users, especially with its VB6 style drag n' drop GUI builder. It seems that they're constantly adding new features and fixing bugs but rarely document anything. I know that writing documentation is pretty boring but it is essential for anyone who's not involved with developing the language.

So how about Java? I know there's loads of libraries available...too many, in fact...so i really don't know which i should be looking at. Are there any libraries that are designed specifically for scraping? I suppose what would be ideal is a complete web browser backbone. Now there's a thought...can i rob anything from Firefox??


inksmithy

Botting Guru

Posts: 184 Member Since:June 1, 2008

#6 [url]

November 20, 2008 16:31:26

birchy, the set of libraries I've attached to this post is designed to let the developer grab an html page and scrape it. I think I have it on another post, but it may have been one which didn't upload for some reason.
It's pretty simple to use; I'll post some sample code in the next post.



Click here to view the attachment

Slashdot. It's like Digg on slow, but sensible.


inksmithy

Botting Guru

Posts: 184 Member Since:June 1, 2008

#7 [url]

November 20, 2008 16:33:53


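import java.net.HttpURLConnection;
import java.net.URL;

import org.htmlparser.beans.StringBean;

// "Scraper" is just a placeholder class name so the method compiles.
public class Scraper
{
    // Fetches the page and returns its contents as a String, or null on failure.
    private String grabhtml()
    {
        String mt = null;
        try
        {
            URL url = new URL("http://proxy1.info/nph-index.cgi/011110A/http/www.goals365.com/feed/soccer/index.php");
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setRequestMethod("GET");
            // StringBean reads the page from the connection and returns its textual contents.
            StringBean bean = new StringBean();
            bean.setConnection(connection);
            mt = bean.getStrings();
            // System.out.println(mt);
        }
        catch (Exception e)
        {
            // Report the problem; the method then returns null.
            System.out.println(e.getMessage());
        }
        return mt;
    }
}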


Once you have used the StringBean, there are more tools in the library which will help you strip the html tags away and leave you with the data you want to play with. I haven't really got too far into it because I haven't done much scraping, but I have certainly used this code before.

I wish I could say I wrote this code, but I didn't. I know we swapped a lot of emails about java and so on so I may have mentioned it to you and it may even have been you who told me about the code to start with. I don't think so though, I'm pretty sure it was someone else from the bdp forum.

Find the documentation here.


birchy

Betfair Elite

Posts: 591 Member Since:May 11, 2008

#9 [url]

November 20, 2008 19:31:51

inksmithy:
What do i do with your sample? Bearing in mind that i'm not a Java expert, all i see is 2 folders named "bin" and "lib" which contain various libs and Jar files. Where's the project or class file?

Must admit i quite fancy Perl or PHP or perhaps Ruby or Python (which now makes slightly more sense). But then again, i'm thinking that perhaps i should stick to a syntax with which i'm familiar and try to get some sense out of the French Gambas developers...

rubyguru:
Ok, that hpricot lib might be good enough to convince me to learn Ruby. I like my "lazy coder" IDEs with auto-complete and a "run" button so that i can quickly write > test > rewrite > test > write > debug > rewrite....what would you recommend for a Linux box?


inksmithy

Botting Guru

Posts: 184 Member Since:June 1, 2008

#10 [url]

November 20, 2008 19:54:59

birchy, have a look at the examples on the website I linked in the second post about the libraries I posted. There are a few examples of how to use it. Like I said, I haven't really played with it much, although from what I see, it is a pretty good Java parser.

As regards Ruby, my best recollection is that NetBeans can do all that Ruby stuff as well; you just need the right plugins installed unless you already have the complete edition. See "Netbeans with Ruby".

I believe Eclipse may also offer the same sort of features too.


denp

bot addict

Posts: 65 Member Since:June 14, 2008

#11 [url]

November 23, 2008 20:53:29

Birchy - sounds like the battle is lost, but here's a general tutorial on python that might get you started:
http://diveintopython.org/toc/index.html

And once you've come to accept that indentation is part of the syntax, and that you need an __init__.py file in any new package, it all comes a lot more easily.


birchy

Betfair Elite

Posts: 591 Member Since:May 11, 2008

#12 [url]

December 1, 2008 17:04:00

While searching for HTML parsers, i found THIS benchmark test. Judging by the final times, i'd say that libxml is most certainly worth using...and (of course), it is available as a Gambas component. The only downside is that the Gambas component is poorly documented, however i have the option to SHELL libxml directly...so now i need to find some usage examples.

And yes, i know i'm burying my head in the sand...but i really can't justify learning a totally new language/syntax when all i am doing is writing an occasional wonky app...although i AM thankful to all of you for your help.


birchy

Betfair Elite

Posts: 591 Member Since:May 11, 2008

#13 [url]

December 2, 2008 14:44:13

Well i've found out how to implement LIBXML2 in Gambas...but i'm getting blasted with validation errors, even from a simple page like http://www.google.co.uk. Even the W3C Validator throws 60 errors for that page. Soooo, can any of you guys confirm which languages/parsers actually work on "wild" html?
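
One way round validation errors like these is an event-driven tokenizer instead of a validating parser, since a tokenizer never insists on well-formed markup. A rough sketch using Python 3's standard html.parser (so not libxml2 or Gambas; CellCollector and table_rows are hypothetical names) that copes with unclosed table cells:

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect table cell text without requiring well-formed markup."""

    def __init__(self):
        super().__init__()
        self.rows = []      # list of rows, each a list of cell strings
        self._cell = None   # text pieces of the cell being read, or None

    def _close_cell(self):
        # Store the cell currently being read, if there is one.
        if self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_cell()
            self.rows.append([])
        elif tag in ("td", "th"):
            self._close_cell()      # a new cell implicitly ends an unclosed one
            if not self.rows:
                self.rows.append([])
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._close_cell()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_rows(html):
    """Return the text of every table cell, grouped by row."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    parser._close_cell()    # flush a cell left open at end of input
    return parser.rows
```

Tags the sketch doesn't know about (bold, links, fonts) are simply ignored, so their text still ends up in the surrounding cell.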

Perhaps my "old fashioned" method of manually parsing the html in VB6 wasn't so bad after all...


birchy

Betfair Elite

Posts: 591 Member Since:May 11, 2008

#15 [url]

December 3, 2008 14:32:22

Yeah, Gambas has everything VB6 has (and more...though documentation is piss poor), but i thought it was about time i learnt how to deal with html/xml in DOM format or at least something more elegant than manually pulling out substrings based on the strings at each end of it. I don't even know regex either. I mean, is it *better* for parsing strings? How does it handle very specific things like text within variable size tables, etc?

Don't laugh, but i'm STILL pondering which language(s) to tackle. I have a few ideas for some very basic tools that i could poke up on my website...but for commercial use, i'm gonna need a solid language that is both multi platform AND difficult to decompile/crack....which is obviously NOT Java or any other managed/framework type language. I could learn C/C++. Hell, i've got no less than 5 books on the subject, yet whenever i sit down to learn it again, i always get fed up of having to write about 10x more lines of code than i need to. So yeah, i've spent several days googling around and seem to keep coming back to either Java or the 3 P's. I suppose that realistically, with just about everything electronic having a JVM nowadays, Java is the way forward. But i hate the bloat. I really, really, hate the bloat....yet i guess that at least 80% of PCs will have a JVM installed...so it's hard to completely dismiss. For the time being, i'm just tinkering with my own little ideas, so i need something quick n' dirty...with (or without) a GUI...and without having to install a server. So i'm back to the 3 P's:
PHP5-CLI
Perl
Python

Next Google search is for IDEs. And no, i don't fancy Eclipse. I'll need a JVM to run that, so i may as well learn Java. At the moment, Python keeps shouting "pick me, pick me!", but (like Java), it seems to have taken the whole Object Orientated thing a step too far. So that leaves me PHP or Perl.... :)


myrddin

bot addict

Posts: 56 Member Since:June 2, 2008

#16 [url]

December 4, 2008 01:26:10

I guess you've googled hundreds of comparisons between the various languages over the last couple of weeks, and if you can't find any deal-breaking differences between them I suggest you just try them all - write a few simple apps in each and see which one floats your boat.
I was attracted to C because of all the stuff I'd read about it over the years, but when I actually started using it I found it a bit of a grind. Perl was just one of the scripting languages I tried when I moved to linux (php and ruby were the others), and I fell in love with it early on, just through using it. I'm not saying you will feel the same way, but perhaps one of them will just make you think "yes, that's the one". Unless you want to learn a specific language for employment reasons, why not go for the one you actually look forward to coding in?

I think the idea of abstracted HTML/XML handling code is great in theory, the problem is that most real-world websites don't obey the rules. I find I have to write new bits of code for each site, and things are prone to break without warning when the webmaster tweaks the formatting every five minutes (I'm looking at you, Sporting Life results geezer!). Now we all know that sort of stuff should be done in the css rather than the html, but 'should be' and 'is' are very different things. Half my web parsing code is made up of sanity checks to make sure the data I'm getting out is what I'm expecting - when it fails I need to re-code.
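
To illustrate the sanity-check idea, here is a minimal Python 3 sketch. It assumes, purely for the example, that each scraped row is a [name, odds] pair and that odds are decimal odds, which are always above 1.0; check_rows is a made-up name.

```python
def check_rows(rows):
    """Return a list of problems; an empty list means the data looks sane."""
    problems = []
    if not rows:
        problems.append("no rows found")
    for n, row in enumerate(rows, start=1):
        # A changed page layout usually shows up as a wrong cell count first.
        if len(row) != 2:
            problems.append(f"row {n}: expected 2 cells, got {len(row)}")
            continue
        name, odds = row
        if not name.strip():
            problems.append(f"row {n}: empty name")
        try:
            if float(odds) <= 1.0:
                problems.append(f"row {n}: odds {odds!r} not above 1.0")
        except ValueError:
            problems.append(f"row {n}: odds {odds!r} is not a number")
    return problems
```

If the returned list is non-empty, the scraper can stop and log the problems rather than pass bad data downstream, which is the cue to go and re-code for the new page layout.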

There are libraries for perl (and I assume all the other languages) which are fairly low-level and just parse the html tokens from a page. This still makes for labour-intensive code, but it's a lot better than using string-handling functions on the raw text.

As for regexes - all I can say is try them. All languages have them nowadays (although obviously Perl's is still the best :-). I read pages about them without really 'getting' them, but once you use them a few times you realize how much more flexible, robust and powerful they are than simple string-handling functions. I used to use Access VBA a lot ten years ago, and parsing data from web pages and VAX or Stratus report files was always a headache using string functions - with regexes it would have been a tiny fraction of the work.
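
A small side-by-side may make the difference concrete. This is only a sketch in Python 3: the sample markup, the class names "runner" and "odds", and the helpers between and runners are all invented for the example. The first function is the substring-between-two-markers approach; the regex does the same job but shrugs off extra attributes, upper-case tags and stray whitespace.

```python
import re

SAMPLE = """
<table>
<tr><td class="runner">Red Rum</td><td class="odds">5.2</td></tr>
<TR><TD class="runner" width="80"> Desert Orchid </TD>
    <TD class="odds">11</TD></TR>
</table>
"""


def between(text, start, end):
    """String-function approach: return the text between two exact markers."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    return None if j == -1 else text[i:j]


# One runner cell followed by one odds cell, however the tags are decorated.
ROW = re.compile(
    r'<td[^>]*class="runner"[^>]*>\s*(.*?)\s*</td>'
    r'\s*<td[^>]*class="odds"[^>]*>\s*([\d.]+)\s*</td>',
    re.IGNORECASE | re.DOTALL,
)


def runners(html):
    """Regex approach: return (name, odds) for every runner row in the page."""
    return [(name, float(odds)) for name, odds in ROW.findall(html)]
```

Here between(SAMPLE, 'class="runner">', '</td>') only ever finds the first runner, and the second row would need different markers altogether, while runners(SAMPLE) picks up both rows however many there are.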

Oh, and all IDEs suck - emacs ftw :-)


birchy

Betfair Elite

Posts: 591 Member Since:May 11, 2008

#18 [url]

December 4, 2008 09:04:27

inkysmithy: You ARE pulling my pisser...right? ;)

myrddin: I totally agree with *everything* you said, apart from: "all IDEs suck". Having worked a 12hr shift last night, i had time away from the pc/google to reflect upon my findings...

PHP

I like PHP. I like the syntax, and i love the fact it is well documented, has plenty of source code and has plenty of built-in functions. I very nearly dismissed everything else, BUT it was the worst (by a long way) of the three P's in every benchmark test i saw, it's not *really* designed for stand-alone apps and it isn't (yet) included as part of a standard Ubuntu install. IDEs and compilers exist but seem a bit weak, although using Geany with the CLI was a nice experience (yes FRED, you can code n' run n' code n' run n' debug n' start again...)

PERL
I quite fancied this for all the reasons i like PHP, plus the benefit that it is part of a standard Ubuntu install, doesn't require a server, and is good for stand-alone apps. The only thing that put me off was the lack of any decent GUI builders. Ok, there's wxPerl and wxDesigner (supports many other languages as well), but i couldn't find a great amount of info regarding usage. Once i delved deeper into the syntax, i must admit that some of it was just plain weird. And then i read something about some of the built-in functions only working on *nix and that's where i stopped reading...

PYTHON
Yes, i know i've been fighting it, but Google has brought me so much more info about the language in a much faster time than the other two. It seems to be a highly active project with all its libraries and C code. wxPython is well documented and seems easy to use. The syntax seems shorter and sweeter than PHP or Perl. The built-in functions are designed for multi-platform usage. Python seems to be the most popular scripting language for *nix systems. There are a load of IDEs/Editors...of which, BOA was horrible (hate the whole Gimp style multi window thing), Geany seems ok but no auto-complete for Python (although i need to investigate this), PIDA just opened up with a blank page and wouldn't let me type any code, and i'm currently looking at (and liking) SPE.
And the MAIN reason i've plumped for Python? I discovered THIS section of the massive library.

Now i bet you lot are glad i've made my mind up (at last!). :D


inksmithy

Botting Guru

Posts: 184 Member Since:June 1, 2008

#19 [url]

December 4, 2008 21:19:17

birchy wrote: "inkysmithy: You ARE pulling my pisser...right? ;)"

Oh I dunno birchy - if nothing else it's a coding exercise!

I saw a writeup of Malbolge in a Linux Format mag - apparently it's the only language which is humanly impossible to code in. You actually have to write a program which will convert your program into Malbolge source code, then run the generated code through the interpreter. It is insane. Apparently the guy who came up with Malbolge realised he had made it too hard to use, so he came up with another one called "Dis" which is only extraordinarily difficult.

From another perspective, thanks for the link to those python libs - I have always liked the look of python and I reckon I'm going to get started on it soon, if for no other reason than to kickstart my coding - I seem to have fallen into a bit of a rut with it at the moment. Not looking forward to learning concurrency in python, although knowing my luck it will be so much easier than Java's version that I will think I have wasted my time struggling with ThreadPools and Executors.
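
For comparison with Java's ThreadPools and Executors, here is roughly what the same idea looks like in Python 3 using the standard concurrent.futures module. It is only a sketch: fetch_all is a made-up name, and fetch stands for whatever download function you already have.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL on a pool of worker threads.

    Returns a dict mapping each URL to its result, or to the exception
    that fetch raised for it, so one bad page doesn't sink the batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the downloads overlap.
        futures = {url: pool.submit(fetch, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results
```

Threads suit scraping because the workers spend most of their time waiting on the network, so a handful of them can keep several downloads in flight at once.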


birchy

Betfair Elite

Posts: 591 Member Since:May 11, 2008

#20 [url]

December 5, 2008 17:44:13

Yeah, i must admit that Python has surprised me a little. At first glance, i was thinking "WTF" but the Dive Into Python book didn't help on that one as they start off by totally bypassing the "hello world" stuff...which for me made it look more complicated than it actually is. And ironically, it is an excellent book. There are several chapters covering http stuff and html/xml parsing.

Still trying to decide on an editor/IDE. I can't seem to find a happy medium. All i *really* require is a fancy text editor with:
1) syntax highlighting
2) autocomplete/suggest...INCLUDING completion for imported libraries (not many offer this...other than SPE?)
3) a run/execute button (faster than having to run a terminal every time)

As i suggested earlier, i quite like Geany. The only let down (for Python coding) is the auto complete. The whole idea of autocomplete is that it should support all the libraries...so that you KNOW what options are available and what parameters are required...yes, almost like having an instant manual. At the moment, only SPE seems to fit my needs but is a bit overkill. Having said that, i couldn't make head nor tail of EMACS....perhaps you could enlighten me on how to set it up? I've also looked at gedit plugins but there doesn't seem to be any autocompleters?
