
birchy

Betfair Elite

Posts: 593 Member Since:May 11, 2008

#41 [url]

December 26, 2008 20:09:11

By far the best editor for Python is Editra. I've tried all the fancy IDEs, but most of them are weak when it comes to code completion and call tips. Editra actually executes the code in order to get the correct completions. It may sound a bit CPU-intensive, but it's actually no worse than running code in the interactive shell. Should you decide to install it, don't bother with the Ubuntu repos as they're out of date. Download the latest version from the website and follow the simple install guide HERE. There's also a useful forum, but I think the most pleasing aspect is the plugin engine. You can write your own plugins in Python...should you actually need any specific functionality that isn't already included.

As for scraping, I've gone round and round in circles. urllib works OK for quick n' dirty scraping. urllib2 gives you more control, so you can add headers to enable gzip, user agent spoofing, etc. You start hitting problems when you want to use https. The most useful-looking thing I've found so far is THIS scraping tutorial, which (I think) I posted before. The downside is that it uses two external libraries, but the http side appears to auto-handle cookies and gzip decompression, and it uses BeautifulSoup on the parsing side of things, which is (apparently) excellent at handling 'wild' html. Both libraries can be installed via Synaptic.
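
For anyone wanting to try the header approach, here is a rough sketch of requesting gzip and sending a browser-like user agent. It is written for Python 3, where urllib2's request handling lives in `urllib.request`. The function names, the user-agent string and the URL in the usage note are placeholders made up for illustration; none of it comes from the tutorial mentioned above.

```python
import gzip
import urllib.request

# Hypothetical user-agent string; substitute whatever browser you want to mimic.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url):
    """Build a request that asks for gzip and sends a browser-like user agent."""
    return urllib.request.Request(
        url,
        headers={"Accept-Encoding": "gzip", "User-Agent": USER_AGENT},
    )


def decode_body(raw, content_encoding):
    """Decompress the body only if the server says it was sent gzipped."""
    if (content_encoding or "").lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def fetch(url):
    """Fetch a page and return its body as (decompressed) bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)
```

Calling `fetch("http://www.example.com/")` should hand back the page as bytes. Checking the Content-Encoding header before decompressing matters because a server is free to ignore the gzip request and reply with plain text.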

Not really sure how Python 3 will affect all our code; I'm kinda hoping they add some decent web libs, such as a scraping library and a SOAP library. Just out of interest, how are you doing SOAP stuff with betfair/betdaq? Dive Into Python suggests SOAPpy, but I read somewhere that it's now a dead project. Google suggested ZSI, but I don't really need SOAP just yet...although I'll be using betdaq when I do.

www.bespokebots.com

"This time next year Rodney, we'll be millionaires!"


birchy

Betfair Elite

Posts: 593 Member Since:May 11, 2008

#42 [url]

December 27, 2008 15:19:34

HTML parsing is becoming my Achilles heel. As stated previously, in VB6 I simply used string functions to grab the text I needed by specifying the text either side of it. Although it can get a bit ugly and has to be carefully written to avoid unexpected results, it DOES work. Having looked at a number of HTML parsing libs, I came across THIS site, which does some performance testing on various libraries. The basic outcome is that lxml absolutely annihilates all the others (including BeautifulSoup) and is also capable of handling badly formed 'tag soup'...so that would be my library of choice.
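
The "text either side" technique is easy to carry over to Python. What follows is only a sketch: `text_between` is a made-up helper name, and the sample markup is invented for illustration rather than taken from any real page.

```python
def text_between(source, before, after, start=0):
    """Return the text found between two marker strings.

    Returns None if either marker is missing, so a changed page layout
    shows up as an obvious failure instead of a silently wrong slice.
    """
    begin = source.find(before, start)
    if begin == -1:
        return None
    begin += len(before)
    end = source.find(after, begin)
    if end == -1:
        return None
    return source[begin:end]


# Invented snippet of markup, purely for illustration.
page = '<td class="odds">5/2</td><td class="odds">7/1</td>'
first = text_between(page, '<td class="odds">', "</td>")
```

Returning None when a marker is missing is one way of dealing with the "unexpected results" problem: the caller has to check for it, but never gets half a page back by accident.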

Problems occur when you're dealing with heavy JavaScript pages. The 'old fashioned' manual methods can handle any format, any text, ANY string, which makes them much more flexible. In particular, I'm attempting to parse the Sporting Life 'Meetings' page, which builds its links from JavaScript. Does anyone know if this can be parsed via a standard library?

The more I think about it, the more I see the need for a very specific library. Perhaps I should dig out my C book and pack all of my http and parsing needs into a single library?

www.bespokebots.com

"This time next year Rodney, we'll be millionaires!"


myrddin

bot addict

Posts: 56 Member Since:June 2, 2008

#43 [url]

December 27, 2008 18:12:23

Not sure what you mean, birchy, but the way I deal with the page is:

* Loop over all the div tags, if the class is "race_idx_hdr", get the meeting name

* Until the next "race_idx_hdr", loop through the div tags with class "racecard_link"

* Get the race details and pull the race number ($SportingLifeId) out of the JavaScript href with a regexp

* Get the page "http://horses.sportinglife.com/Racecards/0,12495,".$SportingLifeId.",00.html"

* Parse the desired info out of the racecard page.

As far as I can tell, the racecard url always looks like the above, with the only variable being the race number which I call $SportingLifeId, so no need to mess with the actual JavaScript at all.
BTW, if you need a specific date for the meetings index, rather than just today, the url is of the form "http://horses.sportinglife.com/Meetings/0,12496,30-12-2008,00.html", and they keep about a year's worth.
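
The steps above can be sketched roughly as follows, in Python rather than Perl. Only the two class names and the two URL patterns come from the post. The sample markup, the shape of the JavaScript href (`openCard(...)`) and the function names are all guesses made up for illustration, so the regular expressions would need adjusting against the real page.

```python
import re
from datetime import date

MEETINGS_URL = "http://horses.sportinglife.com/Meetings/0,12496,{day},00.html"
RACECARD_URL = "http://horses.sportinglife.com/Racecards/0,12495,{race_id},00.html"

# ASSUMPTION: invented markup. The real layout and href format may differ.
SAMPLE_PAGE = """
<div class="race_idx_hdr">Kempton</div>
<div class="racecard_link"><a href="javascript:openCard('101');">12:30</a></div>
<div class="racecard_link"><a href="javascript:openCard('102');">1:05</a></div>
<div class="race_idx_hdr">Wetherby</div>
<div class="racecard_link"><a href="javascript:openCard('201');">12:50</a></div>
"""

# Picks out the two kinds of div in page order, with their contents.
DIV_RE = re.compile(r'<div class="(race_idx_hdr|racecard_link)">(.*?)</div>', re.S)
# Pulls the first run of digits out of a javascript: href.
RACE_ID_RE = re.compile(r'javascript:[^"]*?(\d+)')


def meetings_url(day):
    """Meetings index URL for a given date, using the dd-mm-yyyy form."""
    return MEETINGS_URL.format(day=day.strftime("%d-%m-%Y"))


def racecard_urls(page):
    """Map each meeting name to the racecard URLs listed under it."""
    meetings = {}
    current = None
    for css_class, body in DIV_RE.findall(page):
        if css_class == "race_idx_hdr":
            current = body.strip()          # new meeting header
            meetings[current] = []
        elif current is not None:
            match = RACE_ID_RE.search(body)  # race number from the href
            if match:
                url = RACECARD_URL.format(race_id=match.group(1))
                meetings[current].append(url)
    return meetings
```

With the invented sample, `racecard_urls(SAMPLE_PAGE)` gives two racecard URLs under Kempton and one under Wetherby, and `meetings_url(date(2008, 12, 30))` reproduces the dated URL quoted above.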


birchy

Betfair Elite

Posts: 593 Member Since:May 11, 2008

#44 [url]

December 27, 2008 19:03:22

myrddin:
Yes, that's exactly what I did in VB6 and Gambas, but I was trying to get into less 'manual' ways of parsing html. After hours of Googling, it seems that the only way to handle JavaScript is to use something like SpiderMonkey, which executes the JS. Earlier in this thread, it seemed to be only you and me that had been using manual parsing...and now that I've investigated deeper, our 'old skool' methods are actually pretty solid.

I know that you're a big fan of Perl, so I'm wondering how you deal with things like https, cookies, gzip, etc.? Are there any modules that can auto-handle all this stuff? I've seen Mechanize mentioned in several places...can that be manipulated to auto-handle cookies and gzip?

EDIT:
Incidentally, I'm thinking about writing a more generic string parsing library for Python. The basic idea is to use string functions as we currently do, but add some tag recognition...in particular, it would be handy if we had a function that could return a chunk of html based on the opening tag. The matching closing tag would be automatically calculated within the function. This would be pretty useful for chopping up tables and suchlike. Performance-wise, I've discovered that Python (and Perl) can use inline C code, which should greatly improve parsing speeds.
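
A rough idea of what such a function might look like, as a sketch only. `tag_block` is a hypothetical name, and the approach (counting nested tags of the same name until the depth returns to zero) is just one way to do it. It does not try to cope with comments, scripts or a '>' inside an attribute value.

```python
import re


def tag_block(html, open_at):
    """Return the chunk of html for the tag that opens at index open_at,
    up to and including its matching closing tag (None if unbalanced)."""
    opening = re.match(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>", html[open_at:])
    if opening is None:
        return None
    name = opening.group(1)
    # Matches opening and closing tags with the same name as the first tag.
    same_tag = re.compile(r"<(/?)%s\b[^>]*>" % re.escape(name), re.I)
    depth = 0
    for tag in same_tag.finditer(html, open_at):
        if tag.group(1):                       # closing tag
            depth -= 1
        elif not tag.group(0).endswith("/>"):  # opening tag, not self-closing
            depth += 1
        if depth == 0:
            return html[open_at:tag.end()]
    return None


# Invented markup with a nested table, purely for illustration.
html = ("<div><table><tr><td><table><tr><td>inner</td></tr></table>"
        "</td></tr></table><p>after</p></div>")
outer = tag_block(html, html.find("<table"))
```

Here `outer` runs from the first `<table>` to the second `</table>`, so the nested table stays inside the chunk instead of cutting it short.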

Perhaps I'm slightly disillusioned, but Perl seems more complete than Python? The more I read about Python, the more peculiarities I discover...

www.bespokebots.com

"This time next year Rodney, we'll be millionaires!"


myrddin

bot addict

Posts: 56 Member Since:June 2, 2008

#45 [url]

December 27, 2008 21:29:03

Yeah, the LWP and HTTP modules I listed in the second post of this thread handle all the gzip/https/cookie-type stuff, and the HTML::TokeParser module is pretty much the low-level tag-handling parser you are planning on writing in Python - like I say, still pretty handraulic, but a lot quicker than raw string functions.

I've never used WWW::Mechanize, but a quick look suggests it is based on LWP::UserAgent, so it would include all those gzip/https/cookie handlers. It (Mechanize) looks pretty useful for navigating sites - thanks for bringing it to my attention.

I can't believe there aren't equivalent libraries floating around for Python, although they do say that the greatest strength of Perl is the number of good-quality modules available through CPAN.


birchy

Betfair Elite

Posts: 593 Member Since:May 11, 2008

#46 [url]

December 27, 2008 22:01:51

Python has Mechanize. I'm looking at it now.

EDIT:
Looks quite sweet in Python:
_______________________________________________________________________________________________________

from mechanize import Browser

browser = Browser()
html = browser.open('http://www.somewebsite.com/').read()
_______________________________________________________________________________________________________

www.bespokebots.com

"This time next year Rodney, we'll be millionaires!"


denp

bot addict

Posts: 65 Member Since:June 14, 2008

#47 [url]

December 30, 2008 15:07:38

birchy - I installed the Editra editor today - thanks for the pointers. The integrated Python shell is neat, but does it do anything over what you'd get by opening a separate shell?

The autocomplete is good - but perhaps a bit scary? It executes the code to find out what the completions should be - couldn't that cause some problems? :-o

I'm back in PyDev + Eclipse for now. A full interactive debugger and limited autocomplete is a better combination for me.


birchy

Betfair Elite

Posts: 593 Member Since:May 11, 2008

#48 [url]

December 30, 2008 16:45:56

I'm assuming you have Editra 0.4.28? I've played around with various settings to customize it to how I want it. Most of the editors I tried (IDLE, for instance) were interactive shells, so they're a pain in the arse when you're trying to write a timed loop or a function or class. Perhaps I've misunderstood the whole Python and shell malarkey...shells ARE only used to quickly test code snippets...aren't they? I just wanted a nice code n' run style editor with code completion and Editra seems to fit that bill. Code completion is far from perfect, but I've recently had a thread running on the Ubuntu forums and it turns out that none of the other editors/IDEs have better code completion than Editra...and that includes all the commercial ones. I did try PyDev, but it doesn't offer a great deal more than Editra, considering it uses about 50x more hard drive space.

Onwards and upwards...I've only got 3 days of my holiday left! :(

www.bespokebots.com

"This time next year Rodney, we'll be millionaires!"


denp

bot addict

Posts: 65 Member Since:June 14, 2008

#49 [url]

December 30, 2008 17:11:52

Right, 0.4.28. Yes, I agree it's pretty handy for the code n' run situation. I guess I'm suffering from Eclipse institutionalisation.


birchy

Betfair Elite

Posts: 593 Member Since:May 11, 2008

#50 [url]

December 30, 2008 17:23:32

There's nothing wrong with Eclipse (or Netbeans); they're nice tools, but I'm an old stick-in-the-mud and can't for the life of me see why Java apps are always so big. I know that modern hard drives are massive, CPUs are faster and RAM is bigger and faster, but (to me, at least) programming seems to be taking a step backwards regarding file sizes and execution speeds. There's obviously a direct relationship between execution speed and development time, and that is the argument for modern languages such as Python. Yes, it might be 100x slower than an equivalent C program, but the trade-off is that you can develop the program faster and with fewer lines of code. So it's swings and roundabouts really. The way I'M going to be using Python is as a front-end for gluing together various bits of C code and libraries...and I think that's what it's designed for. :)


EDIT:
Just thought...I didn't try Netbeans for Python coding...has anyone tried it?

www.bespokebots.com

"This time next year Rodney, we'll be millionaires!"


kvajnto

rookie botter

Posts: 1 Member Since:September 30, 2008

#51 [url]

December 31, 2008 08:07:20

ActiveState Komodo is a nice IDE for Python programming, not free though. There are versions for all major OSs.

Checking out psyco is a must if you are serious about your Python coding. It optimizes your code before it's run. Even though I personally find that list comprehensions make the code a lot easier to read, in all the tests I've made (granted, not THAT many) psyco-optimized loops are just as fast or faster.


birchy

Betfair Elite

Posts: 593 Member Since:May 11, 2008

#52 [url]

December 31, 2008 14:20:22

Yeah, I tried the free version of Komodo, but the code completion wasn't up to Editra's standard. I tried Netbeans after posting the above. I installed the bare minimum (JVM and Netbeans 6.5 - Python only) and the whole lot was over 200 MB. They seem to have code completion seriously wrong. Let's say you have a self-defined object and your editor can't find any completions for it...most of them will offer nothing, which is fair enough. Netbeans, on the other hand, offers you every single completion in the whole Python library. So instead of saying 'sorry mate, can't find anything', they suggest about 5000 totally irrelevant completions.

I started a thread on the Ubuntu forums regarding Python code completions and it seems that even the commercial editors/IDEs haven't progressed any further than some of the free ones. If you have Komodo IDE, I'd appreciate it if you could try the little exercise in post #7 of that thread: Code Completions.

Regarding Psyco, I tried it several days ago and it does indeed improve execution speed by about 25%. But to be fair, the code I was testing took 50ms in C, 65ms in Java, 80ms in VB6 (Windows only), 300ms in Gambas, 3900ms with Python and 2700ms with Psyco optimization. So I've decided that (for me), Python is going to be used for gluing together various libraries that are designed for the job in hand.

www.bespokebots.com

"This time next year Rodney, we'll be millionaires!"


JayBee

rookie botter

Posts: 31 Member Since:October 21, 2013

#53 [url]

October 22, 2013 12:45:51

Why doesn't every scraper put up an example (with code and screen shots) of their application grabbing data from Betfair? Then the rest of us can judge for ourselves which API or language is best.

Personally, I prefer the Betfair API with some scraping of the TimeForm and RaceForm websites for data I cannot get from Betfair.

If someone can put up a compelling argument (with code and screen shots) as to there being a better way then I might stop using the Betfair API.

So far this thread reminds me of an old copy of Computer and Video Games and the letters (remember them?) page war between ZX Spectrum, BBC Micro and Commodore 64 owners as to who had the best computer.

Thanks in advance.


guiness

rookie botter

Posts: 39 Member Since:August 28, 2012

#54 [url]

October 28, 2013 14:27:55

'So far this thread reminds me of an old copy of Computer and Video Games and the letters (remember them?) page war between ZX Spectrum, BBC Micro and Commodore 64 owners as to who had the best computer.'

Haha, I had both a BBC and a C64. I think we all know the C64 was the best: 8 sprites, 16 colours, all of which can be on the screen at once, and the VIC graphics chip, which allowed smooth scrolling.
I learned how to code on the C64 using assembler.



 


cran

bot addict

Posts: 72 Member Since:June 11, 2013

#55 [url]

October 28, 2013 19:06:35

Python, BeautifulSoup and/or Regular Expressions (re), Wing IDE

Last Edited By: cran October 28, 2013 19:11:21. Edited 2 times.


denp

bot addict

Posts: 65 Member Since:June 14, 2008

#56 [url]

October 29, 2013 20:29:55

Python, pandas and IPython Notebook.
Oh, and a Model B BBC Micro. Not together.