Unable To Find An Internet Page Blocked By Robots.txt
Problem: to find answers and exercises for Mathematics lectures at the University of Helsinki
1. make a list of registered .com sites
2. make a list of sites from (1) which contain *.pdf files
3. make a list of sites from (2) whose PDF files contain the word "analyysi"
Suggestions for practical problems
- Problem 3: to make a crawler which scrapes data from PDF files
- How can you search for .com sites which are registered?
- How would you solve practical problems 1 & 2 with Python's defaultdict and BeautifulSoup?
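A minimal sketch of the defaultdict part of problems 1 & 2: collect PDF links from already-fetched pages and group them by host. This uses only the standard library (`html.parser` stands in for BeautifulSoup's `find_all("a", href=True)`); the page URL and HTML below are made-up examples, and a real crawler would of course have to fetch and discover the pages first.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class PdfLinkParser(HTMLParser):
    """Collects hrefs of <a> tags that point to .pdf files."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".pdf"):
                # Resolve relative links against the page they appear on.
                self.pdf_links.append(urljoin(self.base_url, value))

def pdfs_by_host(pages):
    """pages: iterable of (url, html) pairs; returns {host: [pdf urls]}."""
    result = defaultdict(list)
    for url, html in pages:
        parser = PdfLinkParser(url)
        parser.feed(html)
        for link in parser.pdf_links:
            result[urlparse(link).netloc].append(link)
    return result

# Hypothetical page content for illustration:
pages = [("http://example.com/",
          '<a href="/notes/analyysi1.pdf">Analyysi I</a><a href="/">Home</a>')]
print(dict(pdfs_by_host(pages)))
# {'example.com': ['http://example.com/notes/analyysi1.pdf']}
```

Note that this only ever sees PDFs that some crawled page actually links to, which is exactly the limitation raised in the answer below.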
Your questions are faulty.
With respect to (2), you are making the faulty assumption that you can find all PDF files on a web server. This is not possible, for several reasons. First, not every document is necessarily referenced anywhere. Second, even if a document is referenced, the reference itself may be invisible to you. Finally, some PDF resources are generated on the fly: they do not exist until you ask for them, and since they depend on your input, there is an infinite number of them.
Question 3 is faulty for much the same reasons. In particular, a generated PDF may contain the word "analyysi" only if you used it in the query, e.g. http://example.com/makePDF.cgi?analyysi
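The on-the-fly point can be illustrated with a toy stand-in for a hypothetical endpoint like the makePDF.cgi above (the bytes below are not a renderable PDF; a real server would use a proper generator). Every distinct query produces a distinct document, so no crawl of static links can enumerate them:

```python
def make_pdf(query: str) -> bytes:
    """Toy server-side generator: the output depends entirely on the request."""
    # Stand-in for a real PDF library; only the dependence on `query` matters.
    body = f"BT /F1 12 Tf (Report for: {query}) Tj ET".encode()
    return b"%PDF-1.4\n" + body + b"\n%%EOF"

# ...?analyysi and ...?algebra are two different documents, neither of
# which exists anywhere until the moment it is requested:
print(make_pdf("analyysi") != make_pdf("algebra"))
# True
```

Since the query string can be anything, the set of such documents is unbounded, which is why "list all PDFs containing *analyysi*" has no well-defined answer.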