Path Explorer
Crawler comes with a set of debugging tools, and Path Explorer is one of them.
You can use Path Explorer to detect patterns and anomalies. At a glance, it shows you whether specific sections of your site were properly crawled, how many URLs were crawled, how many errors occurred, how much bandwidth was used, and more.
Path Explorer as a URL directory
With Path Explorer, you can explore your crawled websites as though you were navigating folders on your computer: each website is treated as a root folder, and each sub-path as a subfolder. For instance, support.algolia.com and www.algolia.com could give you a structure like this:
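As an illustration (the sub-paths below are hypothetical, not taken from the actual sites), the directory view might look like:

```
support.algolia.com/
└── hc/
    └── en-us/
        ├── articles/
        └── sections/
www.algolia.com/
├── doc/
├── blog/
└── pricing/
```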
Identifying issues
Path Explorer is effective at identifying specific issues. This section presents these issues alongside their usual solutions.
Identifying ignored websites and paths
Your `pathsToMatch` parameter is too restrictive:
- Modify your `pathsToMatch` patterns
- Add a new pattern

Your website is missing links from the explored pages to the ignored website or path:
- Improve your website by adding links between sections

Your `startUrls` parameter is missing a first page to discover this website or path:
- Add the website's sitemap to your `startUrls`
- Add this website or path's main URL to your `startUrls`
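As a sketch (all URLs and patterns below are hypothetical), seeding `startUrls` with a site's main URL and sitemap, and keeping `pathsToMatch` broad enough, could look like this in a crawler configuration:

```javascript
// Sketch of a crawler configuration fragment — URLs and patterns are hypothetical.
new Crawler({
  // Seed the crawl with each site's main URL and sitemap so every
  // section has a first page it can be discovered from.
  startUrls: [
    'https://www.example.com/',
    'https://www.example.com/sitemap.xml',
    'https://support.example.com/',
  ],
  // Use patterns broad enough that no wanted section is ignored.
  pathsToMatch: [
    'https://www.example.com/**',
    'https://support.example.com/**',
  ],
});
```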
Identifying crawled websites and paths which should be excluded
Your `pathsToMatch` parameter might be too permissive:
- Make your `pathsToMatch` patterns more specific

You're missing a pattern in `exclusionPatterns`:
- Add the website or path to `exclusionPatterns`
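For example (again with hypothetical URLs and patterns), narrowing `pathsToMatch` and adding `exclusionPatterns` might look like:

```javascript
// Sketch of a crawler configuration fragment — URLs and patterns are hypothetical.
new Crawler({
  startUrls: ['https://www.example.com/'],
  // Only match the sections you actually want crawled…
  pathsToMatch: ['https://www.example.com/doc/**'],
  // …and explicitly exclude anything that would still slip through.
  exclusionPatterns: [
    'https://www.example.com/doc/archive/**',
    '**/*.pdf',
  ],
});
```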
Identifying websites and paths with numerous errors
Errors are of three types:
- Website errors: some HTTP status codes, a wrong Content-Type header, network errors, timeouts, and so on.
  - Contact the team responsible for the website and ask them to investigate the recurring errors
- Configuration errors: runtime errors, invalid JSON, extraction timeouts, etc.
  - Fix your crawler's configuration to prevent these errors
- Internal errors: these indicate the failure of an internal Algolia service
  - If the issue persists, contact the Algolia support team
Identifying websites and paths consuming lots of bandwidth
If you’re crawling frequently, bandwidth costs might go up quickly.
- Make sure you're only crawling what's necessary. Ignored pages are still crawled: if you have many of them, consider setting stricter `pathsToMatch` patterns or adding `exclusionPatterns`.
- Decrease your crawl frequency with the `schedule` parameter to proportionally reduce bandwidth costs.
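As a sketch (the URL and interval are hypothetical; pick the lowest frequency your content allows), lowering the crawl frequency could look like:

```javascript
// Sketch of a crawler configuration fragment — URL and interval are hypothetical.
new Crawler({
  startUrls: ['https://www.example.com/'],
  // Crawling weekly instead of daily divides bandwidth usage by roughly seven.
  schedule: 'every 1 week',
});
```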
If there are specific parts of your website you’d like to crawl more frequently than others, contact the Algolia support team.