Show HN: Python 3.5 Async Web Crawler Example

(github.com)

28 points · mehmetkose · 10y ago · 13 comments

dham · 10y ago · 3 in thread

What is the advantage of this over, say, using threads? Web scraping is pretty much all I/O, so you get big wins using threads in Python and Ruby.

takeda · 10y ago

Actually, the fact that it is all I/O is what makes this more suitable than threads. You don't have all the overhead that comes with threads, but you still get all the concurrency.

That's why, for example, Nginx was so much faster than Apache when it was released.

poooogles · 10y ago

Each thread in Python is a full POSIX thread, which is very heavyweight compared to an asyncio coroutine. You should get the same throughput for far lower resource usage.
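
A minimal sketch of the overhead point, not taken from the linked crawler: it runs a thousand simulated requests as coroutines and records which OS thread each one ran on. `fake_fetch`, the URL pattern, and the 0.01 s delay are made up for illustration, and `asyncio.run` assumes Python 3.7+ (on 3.5 you would call `loop.run_until_complete` instead).

```python
import asyncio
import threading


async def fake_fetch(url, delay=0.01):
    # Stand-in for a network request: sleeping yields control to the event loop.
    await asyncio.sleep(delay)
    return url, threading.get_ident()


async def crawl(urls):
    # gather() schedules every coroutine on the same loop and waits for all of them.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


urls = ["http://example.com/page/%d" % i for i in range(1000)]
results = asyncio.run(crawl(urls))
thread_ids = {tid for _url, tid in results}
print(len(results), "requests handled on", len(thread_ids), "thread(s)")
```

All the coroutines share one thread, so this neither helps nor hurts with the GIL: only one coroutine runs Python code at a time while the others are parked waiting on I/O.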

platz · 10y ago

Does asyncio in Python amount to green threads, and does it therefore defeat the GIL?

kmike84 · 10y ago · 2 in thread

I was investigating how to add asyncio / async def support to Scrapy (see https://github.com/scrapy/scrapy/issues/1144#issuecomment-14...). Small examples like the one at the link look neat, but it is not all roses as you go further. The problems are not specific to Scrapy; I think any advanced `async def` based crawler will face them.

There are 2 challenges with async def I don't know how to solve elegantly:

1. how to integrate coroutine-based scraping code with on-disk persistent request queues;

2. how to deallocate resources without boilerplate in coroutine-based scraping code.

(1) is easier with callbacks-as-methods because this way state is passed explicitly (it is not in local variables), so Scrapy can choose to save it to disk.
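
A rough sketch of why explicit state makes (1) easier, under the assumption that a pending request can be reduced to a URL, a callback name, and a small dict of state. This is not Scrapy's actual API; `Spider`, `dump_queue`, `resume`, and the `meta` layout are hypothetical names chosen for illustration.

```python
import json


class Spider:
    # Callbacks-as-methods: everything a callback needs arrives as arguments.
    def parse_listing(self, url, meta):
        return "listing %s depth=%d" % (url, meta["depth"])

    def parse_item(self, url, meta):
        return "item %s depth=%d" % (url, meta["depth"])


def dump_queue(requests):
    # Each pending request is plain data, so the whole queue serializes to JSON.
    return json.dumps(requests)


def resume(spider, payload):
    # After a restart, look each callback up by name and hand it the saved state.
    out = []
    for req in json.loads(payload):
        callback = getattr(spider, req["callback"])
        out.append(callback(req["url"], req["meta"]))
    return out


pending = [
    {"url": "http://example.com/", "callback": "parse_listing", "meta": {"depth": 0}},
    {"url": "http://example.com/item/1", "callback": "parse_item", "meta": {"depth": 1}},
]
payload = dump_queue(pending)  # this string could be written to disk
results = resume(Spider(), payload)
print(results)
```

With a coroutine, the equivalent of `meta` lives in local variables of a suspended frame, which is what makes the on-disk queue hard.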

Example of (2) is this code:

    async def parse(self, response):
        resp = await self.fetch(url)
        # ... find another URL to follow

        # Here is the problem: the first response object
        # is kept in memory until the second response is
        # fully received. That hurts when tens or hundreds
        # of requests are processed in parallel and the
        # responses are large.
        # With callbacks, refcounting would keep the
        # response in memory only until the second request
        # starts: callbacks + refcounting provide an
        # elegant way to deallocate resources.
        resp = await self.fetch(url2)
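
The effect described in those comments can be observed with a weak reference. The sketch below is illustrative only and relies on CPython's reference counting; `Response`, `fetch`, `parse_holding`, and `parse_releasing` are hypothetical stand-ins, not Scrapy code. The explicit `del` is exactly the kind of boilerplate the comment is hoping to avoid.

```python
import asyncio
import weakref


class Response:
    def __init__(self, url, body):
        self.url = url
        self.body = body


first_still_alive = []  # one entry per parse call, recorded during the second fetch


async def fetch(url, probe=None):
    if probe is not None:
        # A weakref returns None once its target has been deallocated.
        first_still_alive.append(probe() is not None)
    await asyncio.sleep(0)  # stand-in for network I/O
    return Response(url, b"x" * 1024)


async def parse_holding():
    resp = await fetch("http://example.com/a")
    probe = weakref.ref(resp)
    next_url = resp.url + "/next"
    # `resp` is still a live local here, so the first response stays in memory.
    second = await fetch(next_url, probe)
    return second.url


async def parse_releasing():
    resp = await fetch("http://example.com/a")
    probe = weakref.ref(resp)
    next_url = resp.url + "/next"
    del resp  # the boilerplate: drop the last reference before awaiting again
    second = await fetch(next_url, probe)
    return second.url


asyncio.run(parse_holding())
asyncio.run(parse_releasing())
print(first_still_alive)
```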

If anyone has suggestions please comment on https://github.com/scrapy/scrapy/issues/1144#issuecomment-14....

takeda · 10y ago

I don't know Scrapy well enough to understand your first problem.

As for the second one, wouldn't using streams to process the data solve the memory usage issue?

kmike84 · 10y ago

The first problem is not specific to Scrapy; it is the same if you're using inlineCallbacks from Twisted, tornado.gen, or async def: there is no way to serialize the tasks inside a coroutine and save them to disk, e.g. to be able to stop the process and then restart it from the same point, or to avoid keeping the whole request queue in memory.
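
A small check of the serialization claim, assuming the standard `pickle` module as the persistence mechanism (the comment does not name one). The `parse` coroutine is a placeholder.

```python
import pickle


async def parse(url):
    return url


coro = parse("http://example.com/")

# The pending state exists, but only as locals of the coroutine's frame.
locals_snapshot = dict(coro.cr_frame.f_locals)

try:
    pickle.dumps(coro)
    coroutine_pickled = True
except (TypeError, pickle.PicklingError):
    # CPython refuses to pickle coroutine objects.
    coroutine_pickled = False
finally:
    coro.close()  # avoid a "coroutine was never awaited" warning

print(locals_snapshot, coroutine_pickled)
```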

Stream processing could fix the second problem, but it is not a practical solution: in most cases one needs to build an HTML tree in memory to do further processing (e.g. to extract links). I'm also not aware of any streaming regex libraries.

zedpm · 10y ago · 1 in thread

This example isn't really making use of asyncio. loop.run_until_complete() is a blocking method (note that you don't use await when calling it, as it's not a coroutine). You'd want to use something like asyncio.wait() with multiple futures to achieve some concurrency.
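
To make the difference concrete, here is a hedged sketch that contrasts one blocking `run_until_complete()` call per URL with a single call that waits on several tasks through `asyncio.wait()`. It is not the code from the linked repository; `fetch` simulates a request with `asyncio.sleep`, and the URLs and the 0.05 s delay are invented for the example.

```python
import asyncio
import time


async def fetch(url, delay=0.05):
    await asyncio.sleep(delay)  # stand-in for a real HTTP request
    return url


async def crawl_concurrently(urls):
    # Wrap each coroutine in a task so all of them are in flight at once.
    tasks = [asyncio.ensure_future(fetch(u)) for u in urls]
    done, _pending = await asyncio.wait(tasks)
    return sorted(t.result() for t in done)


urls = ["http://example.com/%d" % i for i in range(5)]
loop = asyncio.new_event_loop()
try:
    start = time.perf_counter()
    # Serial: each call blocks until its single request has finished.
    serial = [loop.run_until_complete(fetch(u)) for u in urls]
    serial_time = time.perf_counter() - start

    start = time.perf_counter()
    # Concurrent: one blocking call, but five requests overlap inside it.
    concurrent = loop.run_until_complete(crawl_concurrently(urls))
    concurrent_time = time.perf_counter() - start
finally:
    loop.close()

print("serial %.2fs, concurrent %.2fs" % (serial_time, concurrent_time))
```

`asyncio.gather()` would work equally well here and preserves input order; `asyncio.wait()` is used only because it is the call named in the comment.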

mehmetkose (OP) · 10y ago

Thanks for the notice. I was just trying out the Python 3.5 goodies. I have now updated the code and added a queue.

pixelmonkey · 10y ago

Guido van Rossum, the creator of Python, wrote a web crawler as a motivating example for asyncio. You can find the code for it here:

https://github.com/aosabook/500lines/tree/master/crawler

And a detailed post about its design, co-written with A. Jesse Jiryu Davis, here:

http://aosabook.org/en/500L/a-web-crawler-with-asyncio-corou...

takeda · 10y ago

While you're using AsyncIO, your requests are still done serially due to using loop.run_until_complete().

aaront · 10y ago

Here's a proper example written for 3.4+: https://gist.github.com/madjar/9312452
