GSoC 2019 : Weekly Check-in #7
Hey! Here is an update on what I have achieved so far.
What did you do this week?
- Protego now passes all tests borrowed from reppy, rep-cpp and robotexclusionrulesparser.
- Made a few changes to Protego to make it compatible with Google’s parser.
- Worked on the changes suggested on the interface pull request.
- Wrote code to fetch robots.txt files from the top 1000 websites and generate the statistics we need. (link)
- Looked at the code of Google’s robots.txt parser, with the aim of creating a Python interface on top of it. I might need to modify its code, because it currently re-parses the robots.txt file to answer every query. (Working on anything in C++ that uses pointers or the STL heavily makes me feel uncomfortable.)
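To give an idea of the kind of statistics meant above, here is a minimal sketch that counts how often each directive appears in a robots.txt body. It is not the actual script: the function name `directive_counts` and the sample file are made up for illustration, the fetching step is left out, and I am only assuming that directive frequency is one of the numbers of interest.

```python
from collections import Counter


def directive_counts(robots_txt: str) -> Counter:
    """Count how often each directive name appears in one robots.txt body."""
    counts = Counter()
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue  # blank line, comment-only line, or junk
        name = line.split(":", 1)[0].strip().lower()
        if name:
            counts[name] += 1
    return counts


# Hypothetical sample, standing in for a fetched robots.txt file.
sample = """\
User-agent: *
Disallow: /private/   # keep crawlers out
Allow: /private/public/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""

stats = directive_counts(sample)
print(stats.most_common())
```

Summing the counters from many files (`Counter` objects support `+`) would give totals across all the sites.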
What is coming up next?
- Modify Protego to make it behave similarly to Google’s parser (this will need a few more features, such as record group merging), and add more tests.
- Document Protego.
- Benchmark Protego’s performance.
- Read up on how to call C/C++ code from Python, so that I can create an interface on top of Google’s parser. I am currently thinking of using Cython.
- Work on blog posts (I plan to write 3 blog posts this week).
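For readers unfamiliar with record group merging: when the same user agent appears in more than one group, the rules of those groups are combined instead of only the first group being used. The sketch below is a simplified illustration of that idea and not Protego's or Google's implementation. The function name `merge_groups` is hypothetical, and it only handles `Allow` and `Disallow` lines.

```python
from collections import defaultdict


def merge_groups(robots_txt):
    """Return {user-agent: [(directive, path), ...]} with repeated groups merged."""
    merged = defaultdict(list)
    current_agents = []
    reading_agents = False  # True while consecutive User-agent lines are being read
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        name = name.lower()
        if name == "user-agent":
            if not reading_agents:
                # A User-agent line after rules starts a new group.
                current_agents = []
                reading_agents = True
            current_agents.append(value.lower())
        elif name in ("allow", "disallow"):
            reading_agents = False
            # Append to whatever was already collected for each agent.
            for agent in current_agents:
                merged[agent].append((name, value))
    return dict(merged)


# Hypothetical file in which agent "a" appears in two separate groups.
sample = """\
User-agent: a
Disallow: /x

User-agent: b
Disallow: /y

User-agent: a
Disallow: /z
"""

merged = merge_groups(sample)
print(merged)
```

Here the two groups for agent `a` end up as one rule list, while `b` keeps its single rule.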
Did you get stuck anywhere?
- No. I got to work with some data science tools like Jupyter Notebook and pandas.