How To Identify A Change In A Website’s Structure Programmatically

In a Python Scrapy crawler, I would like to add a robust mechanism for detecting potential layout changes in a website.

These changes do not necessarily affect existing spider selectors. For example, a site might add a new HTML element showing the number of visitors an item has received, an element I might now be interested in parsing. That said, detecting selector issues (XPath/CSS) would also be beneficial in cases where the targeted elements are removed or relocated.

Please note this is not about a change in the content a selector matches, or about a website refresh (If-Modified-Since or Last-Modified), but rather about a modification in the structure, nodes, or layout of a site.

Therefore, how would one implement logic to monitor such circumstances?

Answer

This is actually a research topic, as you can see in this paper, but there are of course some implemented tools that you can check out:

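Basically, the basis for comparison in the approaches above is the tree edit distance of the HTML layout, that is, the minimum cost of node insertions, deletions, and relabelings needed to turn one page's element tree into the other's.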
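To make the idea concrete, here is a minimal sketch using only the Python standard library. It assumes a simplified, top-down (Selkow-style) variant of the tree edit distance, in which inserting or deleting a node inserts or deletes its whole subtree. That is not the general algorithm (for example Zhang-Shasha), and it is not necessarily what the tools mentioned above implement. The names `Node`, `TreeBuilder`, `parse_structure`, and `tree_distance` are made up for illustration. Text and attributes are ignored on purpose, so only the element structure is compared.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the page skeleton: a tag name plus child elements."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []

    def size(self):
        # Number of elements in this subtree (recomputed on every call).
        return 1 + sum(child.size() for child in self.children)


class TreeBuilder(HTMLParser):
    """Builds a tag-only tree; text and attributes are discarded."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: add the node, do not open it.
        self.stack[-1].children.append(Node(tag))

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse_structure(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tree_distance(a, b):
    """Top-down edit distance: relabel costs 1, and inserting or deleting
    a child costs the size of its subtree. Unmemoized, so only suitable
    for small trees in this form."""
    relabel = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + a.children[i - 1].size()
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + b.children[j - 1].size()
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + a.children[i - 1].size(),      # delete subtree
                dp[i][j - 1] + b.children[j - 1].size(),      # insert subtree
                dp[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + dp[m][n]


old = "<html><body><div><h1>Title</h1><p>Price</p></div></body></html>"
new = ("<html><body><div><h1>Title</h1><p>Price</p>"
       "<span>123 views</span></div></body></html>")
print(tree_distance(parse_structure(old), parse_structure(new)))  # prints 1
```

In a spider, one possible use is to store the skeleton (or just the distance to a saved baseline page) per page type, run `tree_distance` against `response.text` on each crawl, and raise an alert when the distance exceeds a threshold you pick. A distance of 0 with changed text means only content changed, which matches the distinction drawn in the question.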

source: stackoverflow.com