How To Find All Websites Under A Certain URL.

I want to find all pages under a certain URL. For example, given the URL https://a.b/c, I want to find every page under it, such as https://a.b/c/d and https://a.b/c/d/e. Is there a way to do this? Thanks so much!

Answer

If the pages are reachable through hyperlinks starting from the root page, you can spider the site by following internal links: load the root page, parse its hyperlinks, load those pages, and repeat until no new links are found. You will need cycle detection (tracking the URLs you have already visited) to avoid crawling the same page twice. Spiders are not trivial to operate politely. Many sites publish metadata, in robots.txt files or otherwise, to indicate which parts of the site they do not want indexed, and well-behaved spiders deliberately crawl slowly to avoid consuming excessive server resources. You should respect these norms.
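As a rough illustration of that loop, here is a minimal sketch in Python using only the standard library. The names `crawl`, `LinkParser`, `fetch`, `allowed` and `delay` are made up for this example: `fetch(url)` is assumed to return the page's HTML (or `None` if the page cannot be loaded), and `allowed(url)` is a hook where robots.txt rules could be applied. It is not a production crawler.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(root, fetch, allowed=lambda url: True, delay=0.0):
    """Breadth-first crawl of the pages at or below `root`.

    fetch(url)   -> HTML text, or None if the page is unavailable (assumed).
    allowed(url) -> False for URLs the site asks crawlers to skip (assumed).
    delay        -> seconds to pause after each page, to stay polite.
    """
    prefix = root.rstrip("/") + "/"
    seen = {root}            # cycle detection: URLs already queued
    queue = deque([root])
    found = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        found.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            if link in seen or not link.startswith(prefix) or not allowed(link):
                continue
            seen.add(link)
            queue.append(link)
        time.sleep(delay)
    return found
```

In real use, `fetch` could wrap `urllib.request.urlopen`, and `allowed` could delegate to `urllib.robotparser.RobotFileParser.can_fetch` after loading the site's robots.txt. Passing them in as parameters keeps the traversal logic separate from the network code, so it can be tried out against an in-memory dictionary of pages first.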

However, note that there is no general-purpose way to enumerate all pages if they are not explicitly linked from the site. Doing so would require one of the following:

  • that the site enables directory listing (most sites do not), so you can identify all files stored under those paths; or
  • cooperation from the operator of the site or the web server to find all pages listed under those paths; or
  • a brute-force search of all possible URLs under those paths, which is an effectively unbounded set. Such a search would not be polite to the operator of the site, is prohibitive in terms of time and effort, and cannot be exhaustive.
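To get a feel for why the brute-force option is impractical, the following sketch counts candidate paths under simplifying assumptions that are mine, not the site's: a single path segment, a fixed alphabet, and a maximum length. The function name `candidate_count` is made up for this illustration. Real URLs have no such limits, so the true search space is larger still.

```python
def candidate_count(alphabet_size, max_length):
    """Number of distinct strings of length 1..max_length over the alphabet."""
    return sum(alphabet_size ** n for n in range(1, max_length + 1))


# Assumption for illustration: lowercase letters and digits (36 symbols),
# names of at most 8 characters, one path segment only.
print(candidate_count(36, 8))
```

Even with these tight restrictions the count is over two trillion requests, and it still would not cover longer names, nested paths, or query strings.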
source: stackoverflow.com