That's why, for example, Nginx was so much faster than Apache when it was released.
There are 2 challenges with async def I don't know how to solve elegantly:
1. how to integrate coroutine-based scraping code with on-disk persistent request queues;
2. how to deallocate resources without boilerplate in coroutine-based scraping code.
(1) is easier with callbacks-as-methods because this way state is passed explicitly (it is not in local variables), so Scrapy can choose to save it to disk.
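To make point (1) concrete, here is a minimal sketch of why explicit state is easy to persist. It is not Scrapy's actual queue code: `Request`, `DiskQueue`, `Spider.parse_item` and the `meta` dict are hypothetical names made up for illustration. The callback is stored by name and its state as plain data, so the pending request survives a restart. A suspended coroutine keeps its state in frame locals, and at least in CPython it cannot be pickled.

```python
import json
import os
import pickle
import tempfile


class Request:
    """Hypothetical request record: URL, callback name, explicit state."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback  # method *name*, so it can be serialized
        self.meta = meta or {}

    def to_dict(self):
        return {"url": self.url, "callback": self.callback, "meta": self.meta}

    @classmethod
    def from_dict(cls, data):
        return cls(data["url"], data["callback"], data["meta"])


class DiskQueue:
    """Toy persistent queue: one JSON document per line."""

    def __init__(self, path):
        self.path = path

    def push(self, request):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(request.to_dict()) + "\n")

    def load_all(self):
        with open(self.path, encoding="utf-8") as f:
            return [Request.from_dict(json.loads(line)) for line in f]


class Spider:
    def parse_item(self, body, meta):
        # Everything the callback needs arrives through `meta`,
        # not through the locals of a suspended frame.
        return {"category": meta["category"], "length": len(body)}


def roundtrip_demo():
    path = os.path.join(tempfile.mkdtemp(), "queue.jsonl")
    DiskQueue(path).push(
        Request("http://example.com/item/1", "parse_item", {"category": "books"})
    )
    # Simulate a restart: a fresh queue object reads the request back.
    restored = DiskQueue(path).load_all()[0]
    callback = getattr(Spider(), restored.callback)
    return callback("<html></html>", restored.meta)


async def parse_coroutine():
    return None


def coroutine_can_be_pickled():
    coro = parse_coroutine()
    try:
        pickle.dumps(coro)
        return True
    except (TypeError, pickle.PicklingError):
        return False
    finally:
        coro.close()  # avoid a "never awaited" warning
```

The design choice being illustrated is only that the state lives in data the framework can see; how a real scheduler serializes requests may differ.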
An example of (2) is this code:
    async def parse(self, response):
        resp = await self.fetch(url)
        # ... find another URL to follow
        # Here we have the problem:
        # response object is kept in memory
        # until second response is fully received.
        # This is a problem if 10s and 100s
        # of requests are processed in parallel
        # and responses are large.
        # Because of refcounting, with callbacks
        # response would have been kept in
        # memory only until second request
        # starts - callbacks+refcounting provide
        # an elegant way for resource deallocation.
        resp = await self.fetch(url2)
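For illustration, the lifetime issue can be reproduced in a small self-contained sketch. The `Response` class, `fetch` and `run` below are stand-ins I made up, not Scrapy APIs, and the behaviour relies on CPython's reference counting. It also shows the obvious workaround, an explicit `del` before the second `await`, which is exactly the kind of boilerplate in question.

```python
import asyncio
import weakref


class Response:
    """Stand-in for a large response object (hypothetical)."""

    def __init__(self, url):
        self.url = url


async def fetch(url):
    # Stand-in for a network download; yields to the event loop once.
    await asyncio.sleep(0)
    return Response(url)


async def parse(response, ref, release_early):
    # Extract everything needed from the first response up front.
    next_url = response.url + "/next"
    if release_early:
        # Boilerplate: drop the coroutine frame's reference by hand.
        del response
    await fetch(next_url)
    # Checked right after the second download completes: is the first
    # response still alive? Without `del`, the frame local keeps it alive.
    return ref() is not None


async def run(release_early):
    first = await fetch("http://example.com")
    ref = weakref.ref(first)
    coro = parse(first, ref, release_early)
    del first  # the caller keeps no strong reference of its own
    return await coro


if __name__ == "__main__":
    print(asyncio.run(run(release_early=False)))
    print(asyncio.run(run(release_early=True)))
```

With callbacks the equivalent of `del response` happens implicitly when the first callback returns; in the coroutine version someone has to remember to write it.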
If anyone has suggestions, please comment on https://github.com/scrapy/scrapy/issues/1144#issuecomment-14...
As for the second one, wouldn't using streams to process the data solve the memory usage issue?
Stream processing could fix the second problem, but it is not a practical solution: in most cases one needs to build an HTML tree in memory to do further processing (e.g. to extract links). I'm also not aware of any streaming regex libraries.
https://github.com/aosabook/500lines/tree/master/crawler
And a detailed post about its design, co-written with A. Jesse Jiryu Davis, here:
http://aosabook.org/en/500L/a-web-crawler-with-asyncio-corou...