Web Crawling

A web crawler is a bot (also known as a crawling agent, spider bot, web crawling software, website spider, or search engine bot) that systematically browses websites and collects information. In other words, the spider bot crawls through websites and gathers the information that search engines and other services rely on.

Google's search engine bot works roughly like this: the spider's main purpose is to keep the search engine's index up to date by discovering and indexing web pages across other websites. When the spider crawls a page, it gathers the page's data for later processing and evaluation.

Once the page is evaluated, the search engine can index it appropriately. This is why, when you type a keyword into the search bar, you see the web pages the search engine considers most relevant.

Web crawlers are provided with a list of [[Uniform Resource Locator|URLs]] to crawl. The crawler visits each of the provided URLs and then finds more URLs to crawl within those pages. Left unchecked, this could become an endless process, which is why every crawler needs a set of rules (which pages to crawl, when to crawl them, and so on); a minimal sketch of this loop follows the list below. Web crawlers can:

  • Discover readable and reachable URLs
  • Explore a list of seeds or URLs to identify new links and add them to the list
  • Index all identified links
  • Keep all indexed links up to date
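For illustration, here is a minimal crawl loop using only Python's standard library: it starts from a seed list, visits each URL once, extracts new links, and stops after a page limit (a simple stand-in for the "set of rules" above). The seed URL and the page limit are placeholders for the example, and a production crawler would also honour robots.txt and politeness delays.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=50):
    """Breadth-first crawl: visit each URL once, collect new links,
    and stop after max_pages (a simple rule that bounds the crawl)."""
    queue = deque(seeds)
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable or non-readable URL, skip it
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).scheme in ("http", "https"):
                queue.append(absolute)
    return visited

# Hypothetical seed list; replace with URLs you are allowed to crawl.
print(crawl(["https://example.com/"]))
```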

What's more, companies can use web crawlers to gather data for business purposes. In this case, the crawler is usually paired with a web scraper that downloads, or scrapes, the required information.
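As a rough sketch of that scraping step, the snippet below downloads a single page (a URL the crawler might have discovered) and extracts one field, the page's `<title>` text. The target URL is a placeholder assumption, not a real data source.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class TitleScraper(HTMLParser):
    """Scrapes the <title> text from one downloaded page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Hypothetical target; in practice the URL comes from the crawler's queue.
with urlopen("https://example.com/", timeout=10) as response:
    page = response.read().decode("utf-8", errors="replace")

scraper = TitleScraper()
scraper.feed(page)
print(scraper.title)
```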

For business use cases, web crawlers and scrapers typically route their requests through [[Proxy|proxies]] to avoid IP-based blocking and rate limiting.
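One way to do this with Python's standard library is `urllib.request.ProxyHandler`, as in the sketch below. The proxy address and credentials are placeholders you would replace with your provider's endpoint.

```python
from urllib.request import ProxyHandler, build_opener

# Hypothetical proxy endpoint and credentials.
proxy = ProxyHandler({
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
})

# All requests made through this opener are routed via the proxy.
opener = build_opener(proxy)
with opener.open("https://example.com/", timeout=10) as response:
    print(response.status)
```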

Web crawling in the context of penetration testing (pentesting) refers to the automated process of systematically browsing and mapping out a web application to identify its structure, functionality, and potential vulnerabilities. This is a crucial initial step in pentesting, as it helps the tester understand the scope of the application and plan further targeted attacks.
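A common concern during such a crawl is staying inside the engagement's scope. The sketch below shows a simple host filter applied to discovered URLs, with the unique paths forming a basic map of the application; the target host and example URLs are hypothetical.

```python
from urllib.parse import urlparse

TARGET_HOST = "target.example.com"  # hypothetical application under test

def in_scope(url):
    """Only URLs on the target host belong to the engagement's scope."""
    return urlparse(url).hostname == TARGET_HOST

# URLs as they might come back from a crawler; the unique in-scope paths
# become the application map that later targeted testing is planned against.
found = [
    "https://target.example.com/login",
    "https://target.example.com/search?q=test",
    "https://cdn.elsewhere.example/asset.js",  # out of scope, dropped
]
site_map = sorted({urlparse(u).path for u in found if in_scope(u)})
print(site_map)  # ['/login', '/search']
```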