Website - data source
You can use a single web page or an entire website as a data source. This section describes how to do so.
Note that we support both server-side and client-side rendered web pages!
Click on Website to add a web page or a website as a data source:
You will be taken to the following page:
Add manually selected URLs
Enter the URLs of the pages you want to add and click on Add Page or press Enter. Your URLs will appear in the list below.
Crawl an entire website
Enter the URL of the website you want to add and click on Add Page and Subpages. This extracts all the sub-pages under that URL, giving you a list of every URL under the specified website. Related sub-pages may be grouped together.
You may also see a symbol next to some URLs. By default, the extraction skips links that do not contain the original URL, since these might be pages from another website. You can click the symbol to select additional URLs you want to add as a data source.
Delete sub-pages
Some sub-pages might not interest you; you can remove them by clicking the Delete button (in the Action column).
Move sub-pages
Your sub-pages are initially grouped, but you can move them out of their group by clicking the Move button (in the Action column).
Advanced Settings (optional)
You can customize the website crawling parameters if needed (the default parameters are suitable for general use). These include, for example:
Automatic Link Detection (Only for Extracting Hyperlinks)
- Maximum Depth: This controls how deep the system should go when exploring pages to find links. Imagine a website as a tree with many branches (pages); this sets how far into those branches the tool should go.
- Timeout: This is a time limit for how long the tool will spend searching for links. If it takes too long to find more, it will stop early. This helps to avoid wasting time.
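As a rough sketch, these two parameters might be expressed in a configuration dictionary along these lines (the key names are illustrative assumptions, not the product's exact field names, and the timeout is assumed to be in seconds):
{
  "max_depth": 3,
  "timeout": 60
}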
Authentication (For Both Link and Page Indexing)
- Cookies: Some websites require special "passes" to let the tool in. These passes are called cookies, and you can think of them like tickets that allow the tool to collect information from the site. The cookies need to be added as a dictionary (a special format for organizing data).
Example:
{
  "session": "abc123",
  "user_id": "789xyz"
}
- Token: Some websites need a specific token (a secret code) to allow scraping. The tool uses this token and its name to access the site; these values are set in the site's local storage and the page is then reloaded. Example:
{
  "token_name": "Bearer your_token_here"
}
Extra Parameters (For Both Link Extraction and Page Indexing)
- Search for Iframes: Websites sometimes use “iframes,” which are like windows within a page showing other content. This option decides if the tool should look inside these windows. It’s set to “auto” by default but can be forced on if needed.
- Scrolling Strategy: Some websites load more content as you scroll down, like an endless page. This setting helps control how the tool should scroll the page to make sure everything loads.
- Skip 404: When a page doesn’t exist (a “404 error”), this setting decides whether the tool should ignore that broken link or stop trying to access it. This applies only while the tool is gathering links.
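As an illustration, these extra parameters might look like the following (the field names and the scrolling-strategy value are assumptions made for the sake of the example):
{
  "search_iframes": "auto",
  "scrolling_strategy": "scroll_to_bottom",
  "skip_404": true
}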
Driver Settings (For Both Link Extraction and Page Indexing)
- Implicit Timeout: This is the time the tool will always wait before trying to load a page. It’s like giving the tool a short break before it moves on to the next step.
- Page Load Timeout: This sets how long the tool will wait for a page to fully load, including all images and scripts (like JavaScript). If it takes too long, it will stop waiting.
- Script Timeout: This is specifically for how long the tool will wait for JavaScript code to run. If the code takes too long, the tool will stop and move on.
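A hypothetical driver configuration could look like this (the field names are illustrative, and the timeout values are assumed to be in seconds):
{
  "implicit_timeout": 5,
  "page_load_timeout": 30,
  "script_timeout": 30
}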
Add Filters (For Both Link Extraction and Page Indexing)
- Include: This is a list of tags (parts of a webpage) that the tool should focus on when gathering links or content. For example, you can tell it to look at specific sections, like <div> or <a> tags.
- Exclude: This is the opposite. Here, you list tags or sections that the tool should ignore when gathering information. For example, you can tell it not to look at ads or unnecessary sections, like <script> or <footer>.
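Using the tags mentioned above, such filters might be written as simple lists of tag names (again, the exact field names are assumptions for illustration):
{
  "include": ["div", "a"],
  "exclude": ["script", "footer"]
}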
Create
After clicking on the Finish button, you will be redirected to the data source page and the selected pages will be crawled: