Show HN: Easy to use Wikipedia API for Python

(github.com)

218 points · jgoldsmith · 12y ago · 35 comments

nevermore · 12y ago · 4 in thread

Please note that this API does not make any specific attempts to obey the mediawiki etiquette (http://www.mediawiki.org/wiki/API:Etiquette). This sort of API is easy and clean for something like a command line script, but if you're going to do further automation or crawling I strongly recommend using the pywikipediabot library (http://www.mediawiki.org/wiki/Manual:Pywikipediabot) which includes a very full API, has tunable throttling, and makes a more direct attempt to require a user agent string that is in line with the api etiquette.
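For a small script that talks to the API directly, the two points raised above (a descriptive user agent and throttling) can be sketched with the standard library alone. This is only an illustration, not how pywikipediabot or the posted library does it: the `USER_AGENT` string, the `Throttle` class, and the one-second default interval are all assumptions made up for the example.

```python
import time
import urllib.request

# Hypothetical contact details; replace with your own project and address.
USER_AGENT = "ExampleWikiScript/0.1 (https://example.org/wiki-script; me@example.org)"


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds; an arbitrary default
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def build_request(url):
    """Return a Request carrying a descriptive User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url, throttle):
    """Fetch one URL, waiting out the throttle first (requests stay serial)."""
    throttle.wait()
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

Sharing one `Throttle` instance across every call to `fetch` keeps the whole script to one request per interval, whichever endpoint it hits.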

If you just want a bash script to look things up on wikipedia, you can always use something like

    function wp { curl -sL --compressed "http://en.wikipedia.org/wiki/$(echo "$@" | tr ' ' '_')" | html2text; }

which will work for basic queries (needs url encoding and words to be properly capitalized).
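The URL-encoding caveat is easy to handle in Python. A minimal sketch, assuming a hypothetical helper named `article_url`; it percent-encodes the title and capitalises only its first character, so the capitalisation of later words is still up to the caller:

```python
from urllib.parse import quote


def article_url(query):
    """Build an English Wikipedia article URL from a free-form query."""
    # Article URLs use underscores in place of spaces.
    title = query.strip().replace(" ", "_")
    # Only the first character is capitalised; the rest is left as typed.
    title = title[:1].upper() + title[1:]
    # quote() percent-encodes non-ASCII characters as UTF-8 bytes.
    return "https://en.wikipedia.org/wiki/" + quote(title)


print(article_url("positive set theory"))
print(article_url("Erdős number"))
```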

A full api reference is here (http://en.wikipedia.org/w/api.php).

jgoldsmith (OP) · 12y ago

Hi (creator here),

Thanks for bringing this to my attention. I've added a disclaimer to the GitHub page regarding Pywikipediabot and plan to make changes to fully comply with MediaWiki API etiquette. The last thing I'd want to do is inadvertently cause problems for the site or foundation.

tmarthal · 12y ago

I would go one step further and suggest that people who need structured queries use the Google BigQuery API to query its structured Wikipedia data. Granted, the public dataset is from 2010, so it is slightly outdated, but you can write structured SQL against all of the Wikipedia article metadata and then use the MediaWiki API itself to grab only the article text that you're interested in.

The wikipedia data is hosted here: https://bigquery.cloud.google.com/table/publicdata:samples.w...

Here is a sample query, searching for all articles whose titles start with "Positiv" (the trailing `*` applies only to the final `e`, making it optional):

SELECT id,title FROM [publicdata:samples.wikipedia] WHERE (REGEXP_MATCH(title,r'^Positive*')) LIMIT 10

Query complete (2.0s elapsed, 9.13 GB processed)

  Row | id       | title
  1   | 464347   | Positive airway pressure
  2   | 10008223 | Positive behavior support
  3   | 464347   | Positive airway pressure
  4   | 1354851  | Positivism in Poland
  5   | 1023857  | Positive set theory
  6   | 5154273  | Positivism dispute
  7   | 2871407  | Positivism
  8   | 17179765 | Positive psychological capital
  9   | 9033239  | Positive Action Group
  10  | 4163012  | Positive K

Here is the python API documentation: https://developers.google.com/api-client-library/python/
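The sample results include "Positivism" titles because, in the pattern `^Positive*`, the `*` quantifies only the last `e`. A quick check with Python's `re` module shows the same behaviour; this uses Python's regex engine rather than BigQuery's, which should agree for a pattern this simple, and the title list is made up for the demonstration:

```python
import re

titles = ["Positive set theory", "Positivism", "Positron", "Post-positivism"]

# The pattern from the query: "Positiv" followed by zero or more "e".
loose = re.compile(r"^Positive*")
loose_matches = [t for t in titles if loose.search(t)]

# Without the star, the final "e" is required.
strict = re.compile(r"^Positive")
strict_matches = [t for t in titles if strict.search(t)]

print(loose_matches)   # ['Positive set theory', 'Positivism']
print(strict_matches)  # ['Positive set theory']
```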

zalew · 12y ago

> If you just want a bash script to look things up on wikipedia

or for basic description:

    wp() { dig +short txt "$*".wp.dg.cx; }

jonbaer · 12y ago

+1 thank you

level09 · 12y ago · 4 in thread

Excellent!

Are there any APIs for other languages? I tried to query using Unicode strings and it worked, but I only got English content.

jpatokal · 12y ago

There's MediaWiki::Gateway for Ruby: https://github.com/jpatokal/mediawiki-gateway

Disclaimer: I'm the main author, and there are other implementations, but this seems to have become the most popular one.

draegtun · 12y ago

For Perl library see MediaWiki::API - https://metacpan.org/module/MediaWiki::API

LukeShu · 12y ago

GP meant APIs for Wikipedia languages other than English, not other programming languages.

jgoldsmith (OP) · 12y ago

Thanks! I'm working on some additional features including international support, and I hope to push them within a week.
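Until that lands, note that each language edition of Wikipedia has its own API endpoint under its language subdomain, so a query for non-English content goes to a different host. A rough sketch of building such a request URL for the extracts API; `extract_url` is a hypothetical helper for illustration, not part of the posted library:

```python
from urllib.parse import urlencode


def extract_url(title, lang="en"):
    """Build a plain-text extract query URL for one language edition."""
    params = {
        "action": "query",
        "prop": "extracts",   # provided by the TextExtracts extension
        "explaintext": 1,     # ask for plain text instead of HTML
        "titles": title,
        "format": "json",
    }
    # The language code selects the subdomain, e.g. fr.wikipedia.org.
    return "https://{}.wikipedia.org/w/api.php?{}".format(lang, urlencode(params))


print(extract_url("Paris", lang="fr"))
```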

DenisM · 12y ago · 3 in thread

Any advice on stripping wiki markup to obtain plain text from the wikipedia dump? A friend is doing linguistic research and could benefit from large bodies of text in different languages. Ideally this would be a C# library, but a simple command line tool in any other language would do as well - accept .xml.bz2, strip the wiki markup, return something that's easily processable by further tools in a single file. Thanks in advance.

jgoldsmith (OP) · 12y ago

Since I am using the MediaWiki extracts API, I never had to find/write my own Wikitext parser. However, I did run into a couple in my research that seemed relatively popular:

- https://github.com/dcramer/py-wikimarkup (converts wikitext to HTML using Python; you would need to extract the text with BeautifulSoup or something)

- http://wiki.eclipse.org/Mylyn/Incubator/WikiText (also to HTML, but in Java)

- https://github.com/earwig/mwparserfromhell

I'm sure if you did a bit more digging you could find a C# library that does this, or you could roll your own pretty easily using the others as a model.
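For the dump-processing side of the question, here is a standard-library-only sketch in Python that streams pages out of an `.xml.bz2` file and applies a deliberately naive markup stripper. Both `iter_pages` and `naive_strip` are hypothetical names for this example, and the regexes only cover non-nested templates, simple links, and bold/italic quotes; anything serious should go through one of the parsers listed above.

```python
import bz2
import re
import xml.etree.ElementTree as ET


def iter_pages(path):
    """Yield (title, wikitext) pairs from a MediaWiki XML dump (.xml.bz2)."""
    with bz2.open(path, "rb") as stream:
        title = None
        for _, elem in ET.iterparse(stream, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
            if tag == "title":
                title = elem.text
            elif tag == "text":
                yield title, elem.text or ""
            elif tag == "page":
                elem.clear()  # drop the finished page's children to save memory


def naive_strip(wikitext):
    """Very rough markup removal; real wikitext needs a proper parser."""
    # Remove templates that contain no nested braces.
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)
    # Turn [[target|label]] into label and [[target]] into target.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Drop the runs of apostrophes used for bold and italics.
    text = re.sub(r"'{2,}", "", text)
    return text.strip()
```

Looping over `iter_pages(path)` and writing `naive_strip(text)` for each page to a single output file gives roughly the tool described in the question, minus everything the regexes miss.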

krichman · 12y ago

Easy? Haven't they got a Turing-complete language hiding in there?

DenisM · 12y ago

Thanks!

leoplct · 12y ago · 3 in thread

Great! I'm looking forward to a Ruby version!

benastan · 12y ago

Like https://github.com/kenpratt/wikipedia-client ?

LukeShu · 12y ago

There are a couple of implementations for Ruby--but I haven't been able to get them to work.

So, I wrote one that was dead simple, but have only implemented the API features I needed, which is mostly related to deleting things, as I use it for mass spam deletion. If you're interested, I'll upload it somewhere, but like I said, it's very incomplete.

jpatokal · 12y ago

Here's mine: https://github.com/jpatokal/mediawiki-gateway

cjbarber · 12y ago · 2 in thread

Fantastic job Johnathon. To HN readers, I just want to point out that this awesome library was created by a high school student (who is extremely motivated and builds a bunch of stuff)!

This is Hacker News at its best. Highlighting creation.

Interesting to note that this was submitted a few days ago and received only 3 points - I mention this because this is the sort of thing that should have made the front page the first time it was posted!

I'm thinking of writing a weekly post that highlights things people created and posted to Hacker News that fell on deaf ears (thankfully this is not the case with this post!).

In fact I'm going to go and write it now (and create!).

Edit 1: If anyone is on Medium, here's my draft.

https://medium.com/p/e394f6d917d3?kme=collabEmail.clicked&km...

Edit 2: And while I have the chance for something not to fall on deaf ears, here's something I just wrote that would be interesting to anyone who's annoyed with recruiters and would rather work at SpaceX than Snapchat.

https://medium.com/p/de5c73174a4e?kme=collabEmail.clicked&km...

DjangoReinhardt · 12y ago

Wait, HN doesn't mind reposts? Shut the front door! o_O

Also, I wistfully disagree with you about the "highlighting creation" bit. I mean, FSM knows I want to agree with you but my experience so far says otherwise. In all the time I've spent on HN (most of it as a lurker) I've found HN to be quite snobbish about the show-and-tell attempts.

Then again, maybe I am experiencing sour grapes since my own Show HN posts seem to disappear rapidly even before I can say, "Hey HN, loo-"...

I'm thinking of creating an HN spin-off for young, upcoming devs to do a Show-And-Tell about their recent attempts at learning/developing. Heck I've been dying to give discourse ~~(the django-based discussions platform)~~ a try, maybe I'll finally get around to it now. In fact, I'm going to go and write it now... (Sorry, couldn't resist. ;) Not meant as a dig.)

Question is, should I do a Show HN, when it is done? :P

EDIT: Turns out discourse is rails-based, not django-based. Still gonna give it a try, I guess... :(

voltagex_ · 12y ago

Highly recommend Discourse, even if you don't know Rails.

echohack · 12y ago · 2 in thread

Nice. I can see lots of room for improvement. Great start; using Requests in particular will streamline things.

I'll work in some changes tonight. Let's start with PEP8, shall we? :)

sbuccini · 12y ago

As a budding coder, sometimes following PEP8 is harder than actually coding the stuff!

frakkingcylons · 12y ago

For PEP-8 compliance, I would recommend looking into using an IDE in some part of your workflow that has active inspections for things like PEP-8 compliance. I personally use PyCharm.

harlowja · 12y ago · 1 in thread

You might really want to not cache everything coming back into a never-ending Python dictionary. Look up "memory leak" on Wikipedia ;)

jgoldsmith (OP) · 12y ago

Each function has its own cache, and the size of the cache is limited by the number of unique requests you make. How would this be a problem? (totally genuine question)

Also, if you have a better way of doing it, please totally fork and request a pull!
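One way to see the concern: a cache limited only by the number of unique requests keeps growing for as long as a long-running process (a crawler, a web service) keeps seeing new queries. A bounded alternative from the standard library is `functools.lru_cache`, which evicts the least recently used entry once `maxsize` is reached. A minimal sketch, where `summary` is a hypothetical stand-in for a function that hits the network and is not the library's actual code:

```python
from functools import lru_cache

calls = []  # records every cache miss, for demonstration only


@lru_cache(maxsize=128)
def summary(title):
    """Pretend to fetch a summary; the body runs only on a cache miss."""
    calls.append(title)  # stand-in for a real network request
    return "summary of " + title


# A thousand distinct queries never hold more than 128 cached results.
for i in range(1000):
    summary(str(i))

print(summary.cache_info().currsize)
```

The trade-off is that evicted entries have to be fetched again, so `maxsize` is a tuning knob between memory use and repeat requests.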

languagehacker · 12y ago

Nice work on this. It's always good to see people giving more visibility to MediaWiki's capabilities.

To engineers who would like to use this library, I would give a caveat that there are way more API actions than the ones enumerated here. So if you're looking to make some contributions to a project, this one is rife with possible pull requests.

In terms of article access and analysis, I'd recommend looking at Pattern (https://github.com/clips/pattern) before starting with this library. Not only do you get access to the rest of Pattern's IR/text analysis capabilities, but the approach in Pattern is written to support any site built off of MediaWiki, and not just English Wikipedia. This means not only foreign-language Wikipedia instances, but all 350k wikis on Wikia, and Project Gutenberg (which, interestingly enough, runs MW 1.13).

toyg · 12y ago

I've used (and patched) this alternative: https://github.com/richardasaurus/wiki-api

ksrm · 12y ago

Is there an API for extracting data from infoboxes?
