GSoC 2019: Weekly Check-in #9
What did you do this week?

- I added tests to make sure Protego doesn’t throw exceptions on the robots.txt files of the top 10,000 most popular websites. Utilised Scrapy to create a tool to download the robots.txt files of these sites (a minimal downloader sketch appears after this list).
- Benchmarked Protego: I ran Protego (written in Python), Robotexclusionrulesparser (written in Python), and Reppy (written in C++ but with a Python interface) on 570 robots.txt files downloaded from the top 1,000 websites, and here are the results (the timing harness is sketched after this list).
  - Time spent parsing the robots.txt files:
    - Protego: 79.00128873897484 seconds
    - Robotexclusionrulesparser: 0.30100024401326664 seconds
    - Reppy: 0.05821833698428236 seconds
  - Time spent answering queries (1,000 identical queries per robots.txt):
    - Protego: 14.736387926022871 seconds
    - Robotexclusionrulesparser: 67.33521732398367 seconds
    - Reppy: 1.0866852040198864 seconds
 
 
- Added logging to Protego.
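
The downloader mentioned in the first item can be put together with Scrapy in a few lines. Below is a minimal sketch of how such a spider could look; the `top-10000.csv` domain list, its "rank,domain" layout, and the `robots/` output directory are placeholders of mine, not the actual tool's details.

```python
import csv

import scrapy


class RobotsTxtSpider(scrapy.Spider):
    """Fetch /robots.txt for every domain in a local list and save it to disk."""

    name = "robotstxt_downloader"
    # Disable Scrapy's own robots.txt middleware so it doesn't interfere.
    custom_settings = {"ROBOTSTXT_OBEY": False}

    def start_requests(self):
        with open("top-10000.csv") as f:
            for row in csv.reader(f):
                domain = row[-1].strip()  # assumes "rank,domain" rows
                yield scrapy.Request(
                    f"http://{domain}/robots.txt",
                    callback=self.save_robotstxt,
                    errback=self.on_error,
                    meta={"domain": domain},
                )

    def save_robotstxt(self, response):
        # Assumes a robots/ directory already exists.
        domain = response.meta["domain"]
        with open(f"robots/{domain}.txt", "wb") as f:
            f.write(response.body)

    def on_error(self, failure):
        # Many of the 10,000 sites time out or refuse connections; log and move on.
        self.logger.info("Failed to fetch %s", failure.request.url)
```

Running it with `scrapy runspider robotstxt_downloader.py` leaves one file per domain, which the tests and the benchmark can then read from disk.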
 
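The benchmark separates the parse phase from the query phase. A harness along these lines can produce the two measurements for Protego; this is a sketch rather than the exact benchmark script, the `robots/` directory, query URL, and user agent string are illustrative placeholders, and the same two loops would be repeated with Robotexclusionrulesparser and Reppy.

```python
import time
from pathlib import Path

from protego import Protego

# Read every downloaded robots.txt into memory first, so file I/O
# isn't counted in either measurement.
contents = [p.read_text(errors="ignore") for p in sorted(Path("robots").glob("*.txt"))]

# Time spent parsing.
start = time.perf_counter()
parsers = [Protego.parse(content) for content in contents]
print(f"parsing: {time.perf_counter() - start} seconds")

# Time spent answering 1000 identical queries per robots.txt.
start = time.perf_counter()
for parser in parsers:
    for _ in range(1000):
        parser.can_fetch("https://example.com/some/path", "mybot")
print(f"queries: {time.perf_counter() - start} seconds")
```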
What is coming up next?
- Will depend on the review from the mentors. If everything looks good to them, I would shift my focus back to Scrapy.
- Make SitemapSpider use the new interface for robots.txt parsers.
- Implement Crawl-delay & Host directive support in Scrapy.
 
Did you get stuck anywhere?
Nothing major.