Unable To Find An Internet Page Blocked By Robots.txt

- 1 answer

Problem: to find answers and exercises for Mathematics lectures at the University of Helsinki

Practical problems

  1. to make a list of .com sites whose robots.txt contains a Disallow rule
  2. to make a list of sites from (1) that contain *.pdf files
  3. to make a list of sites from (2) whose PDF files contain the word "analyysi"

Suggestions for practical problems

  1. Problem 3: build a scraper that extracts text from PDF files (see the sketch below)
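
If problem 3 only means searching already-downloaded PDF files for a word, a short script is enough. The sketch below is one possible approach, not the asker's own code: it assumes the third-party pypdf package is installed, and the file name lecture.pdf is a placeholder.

    # Minimal sketch: does a local PDF contain a given word?
    # Assumes: pip install pypdf
    from pypdf import PdfReader

    def pdf_contains_word(path: str, word: str) -> bool:
        """Return True if any page of the PDF at `path` contains `word`."""
        reader = PdfReader(path)
        for page in reader.pages:
            text = page.extract_text() or ""   # extract_text() may return None
            if word.lower() in text.lower():
                return True
        return False

    # Example: check a downloaded lecture file for the word "analyysi".
    print(pdf_contains_word("lecture.pdf", "analyysi"))

Note that text extraction only works for PDFs with an embedded text layer; scanned lecture notes would need OCR instead.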

Questions

  1. How can you find which .com sites are registered?
  2. How would you solve practical problems 1 & 2 with Python's defaultdict and BeautifulSoup? (A sketch follows this list.)
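
For question 2, the following is a rough sketch rather than a complete solution. It assumes you already have a list of candidate .com domains (there is no public index of all of them, as the answer below explains), it only inspects each site's front page, and it relies on the third-party requests and beautifulsoup4 packages; the domain list is a placeholder.

    # Sketch for practical problems 1 and 2:
    # 1) keep domains whose robots.txt contains a Disallow rule,
    # 2) collect *.pdf links from those sites' front pages.
    # Assumes: pip install requests beautifulsoup4
    from collections import defaultdict
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    domains = ["example.com"]        # placeholder: your own list of .com sites
    pdf_links = defaultdict(list)    # domain -> list of PDF URLs found

    for domain in domains:
        base = f"http://{domain}"
        try:
            robots = requests.get(urljoin(base, "/robots.txt"), timeout=10).text
        except requests.RequestException:
            continue
        # Problem 1: only keep sites whose robots.txt has a Disallow rule.
        if "Disallow" not in robots:
            continue
        try:
            html = requests.get(base, timeout=10).text
        except requests.RequestException:
            continue
        # Problem 2: collect links ending in .pdf from the front page.
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            if a["href"].lower().endswith(".pdf"):
                pdf_links[domain].append(urljoin(base, a["href"]))

    print(dict(pdf_links))

The defaultdict simply avoids checking whether a domain key already exists before appending; a plain dict with setdefault would work just as well.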

Answer

Your questions are faulty.

With respect to (2), you are making the faulty assumption that you can find all PDF files on a web server. This is not possible, for several reasons. First, not every document is necessarily referenced anywhere. Second, even if a document is referenced, the reference itself may be invisible to you. Finally, some PDF resources are generated on the fly: they do not exist until you ask for them, and since they depend on your input, there is effectively an unlimited number of them.

Practical problem 3 is faulty for much the same reasons. In particular, a generated PDF may contain the word "analyysi" only because you used it in the query, e.g. http://example.com/makePDF.cgi?analyysi

source: stackoverflow.com