Need for Speed
Hey! This is my fifth blog post for GSoC 2019, covering weeks 7 and 8.
Most of week 7 was spent making Protego compatible with Google's parser. I also worked on the documentation. Since the Protego codebase is small, proper comments and a good README were sufficient. I uploaded Protego to PyPI, so pip install Protego is all it takes to install it.
Week 8 was quite interesting. For Protego to become the default in Scrapy, it must not throw any kind of error while parsing robots.txt files. To verify this, I downloaded robots.txt from the top 10,000 websites and added tests to see whether Protego throws any exceptions while parsing the downloaded files. I also benchmarked Protego, and the results were quite disappointing. You can see the result here.
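The "does it ever throw?" check described above can be sketched roughly as follows. This is not the actual test code from the Protego repository, just a minimal illustration: the helper name collect_parse_failures and the shape of the documents mapping (file name to file content) are assumptions made for this example.

```python
def collect_parse_failures(parse, documents):
    """Run `parse` over every robots.txt body and record any exception.

    `parse` is any callable that takes the file content as a string.
    `documents` maps a (hypothetical) file name to its content.
    Returns a list of (name, exception) pairs; empty means nothing raised.
    """
    failures = []
    for name, content in documents.items():
        try:
            parse(content)
        except Exception as exc:  # deliberately broad: any exception is a failure
            failures.append((name, exc))
    return failures
```

With Protego installed, passing Protego.parse as the parse argument and the downloaded files as documents should return an empty list if the parser never raises.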
We decided to spend the next week improving Protego's performance. I am going to try profiling and heuristics to see how much faster it can get.
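As a rough idea of what the profiling step could look like, here is a small sketch using cProfile from the standard library. The helper name profile_parser and its repeat and top parameters are made up for this example, and the plan for Protego may end up using a different tool.

```python
import cProfile
import io
import pstats


def profile_parser(parse, content, repeat=100, top=10):
    """Profile `parse(content)` called `repeat` times.

    Returns a text report of the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(repeat):
        parse(content)
    profiler.disable()

    # Write the report into a string instead of printing to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()
```

Calling print(profile_parser(Protego.parse, content)) on a large robots.txt file would show which functions take up most of the parsing time, which is where optimisation effort should go first.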