GSoC 2019: Weekly Check-in #12
What did you do this week?

Benchmarking Protego (again). This time we crawled multiple domains (~1,100 domains) and downloaded links to pages as the crawler encountered them. We downloaded 111,824 links in total. Next, we made each robots.txt parser parse and answer the queries (each query was answered 5 times) in an order similar to the one they would face in a broad crawl. Here are the results (a sketch of the timing harness follows them):

Protego :
- 25th percentile : 0.002419 seconds
- 50th percentile : 0.006798 seconds
- 75th percentile : 0.014307 seconds
- 100th percentile : 2.546978 seconds
- Total Time : 19.545984 seconds

RobotFileParser (default in Scrapy) :
- 25th percentile : 0.002188 seconds
- 50th percentile : 0.005350 seconds
- 75th percentile : 0.010492 seconds
- 100th percentile : 1.805923 seconds
- Total Time : 13.799954 seconds

Rerp Parser :
- 25th percentile : 0.001288 seconds
- 50th percentile : 0.005222 seconds
- 75th percentile : 0.014640 seconds
- 100th percentile : 52.706880 seconds
- Total Time : 76.460496 seconds

Reppy Parser :
- 25th percentile : 0.000210 seconds
- 50th percentile : 0.000492 seconds
- 75th percentile : 0.000997 seconds
- 100th percentile : 0.129440 seconds
- Total Time : 1.405558 seconds
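The benchmark itself was a separate script, so the snippet below is only a minimal sketch of how such a timing harness can be put together. The `load_samples()` helper and its sample data are hypothetical stand-ins for the robots.txt files and URLs recorded during the crawl described above, and the exact bookkeeping of the real benchmark may differ:

```python
import statistics
import time

from protego import Protego


def load_samples():
    """Hypothetical loader: yields (robots_txt_content, urls) for each crawled domain."""
    robots = "User-agent: *\nDisallow: /private\n"
    yield robots, ["https://example.com/", "https://example.com/private/page"]
    yield robots, ["https://example.org/allowed"]


durations = []
for robots_txt, urls in load_samples():
    start = time.perf_counter()
    parser = Protego.parse(robots_txt)      # parse once per domain
    for url in urls:
        for _ in range(5):                  # answer each query 5 times
            parser.can_fetch(url, "mybot")
    durations.append(time.perf_counter() - start)

# 25th/50th/75th percentile cut points, worst case, and total time
print(statistics.quantiles(durations, n=4), max(durations), sum(durations))
```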
- Removing a hack used in Protego that was needed because urllib.parse.unquote lacks an option to ignore selected characters (a hypothetical illustration of such a workaround follows this list).
- Added a few new features to Protego as well.
- Protego has been moved to the Scrapy organisation.
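The hack in question is internal to Protego and is being removed, so the following is not Protego's actual code; it is only a hypothetical illustration of the kind of workaround the missing option forces, assuming the goal is to percent-decode a path while leaving an encoded slash (%2F) untouched:

```python
from urllib.parse import unquote


def unquote_except_slash(path):
    # unquote() has no "ignore these sequences" option, so decode the pieces
    # around "%2F" separately and stitch them back together with "%2F" intact.
    # (A real implementation would also handle the lowercase "%2f" form.)
    return "%2F".join(unquote(piece) for piece in path.split("%2F"))


print(unquote_except_slash("/a%2Fb/%7Ejoe"))  # -> /a%2Fb/~joe
```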
What is coming up next?

- Configuring Travis to push to PyPI automatically (a rough deploy configuration is sketched below).
- Introducing a new ROBOTSTXT_USER_AGENT setting in Scrapy (see the settings sketch below).
- Making SitemapCrawler use the new interface.
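The Travis setup is still upcoming work, so the exact configuration is not decided yet; a typical Travis CI deploy section for pushing tagged releases to PyPI looks roughly like this, with the credentials below being placeholders only:

```yaml
# Illustrative only: deploy to PyPI when a tagged build passes on Travis CI.
deploy:
  provider: pypi
  user: __token__
  password:
    secure: "<encrypted PyPI API token>"
  distributions: "sdist bdist_wheel"
  on:
    tags: true
```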
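The setting is likewise still a proposal at this point; once it lands, usage in a project's settings.py would presumably look something like this (the bot name is illustrative):

```python
# settings.py (hypothetical usage of the proposed setting)
ROBOTSTXT_OBEY = True
# Match robots.txt rules against this token instead of the full USER_AGENT string.
ROBOTSTXT_USER_AGENT = "mybot"
```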
Did you get stuck anywhere?
I got blocked by StackExchange for a few hours. I think they don’t like crawlers on their websites. “It is a temporary automatic ban that is put up by our HAProxy instance when you hit our rate limiter,” they explained in an answer to one of the questions on their website.