Workers
DefaultWorker
DefaultWorker is responsible for orchestrating the extraction of emails and LinkedIn URLs from a given website.
The class uses both synchronous and asynchronous workers to run the extraction. It also holds the extraction configuration: the website URL, browser, link filter, data extractors, crawl depth, and the maximum number of links to take from each page.
Attributes:
Name | Type | Description |
---|---|---|
website_url | str | The URL of the website to extract data from. |
browser | PageSourceGetter | The browser instance used to fetch page sources. |
link_filter | LinkFilterBase | The filter used to determine which links to follow. |
data_extractors | list[DataExtractor] | The list of data extractors to use for extracting information. |
depth | int | The maximum depth to traverse the website. |
max_links_from_page | int | The maximum number of links to extract from a single page. |
links | list[list[str]] | A list of lists containing URLs to be processed at each depth level. |
current_depth | int | The current depth level of the extraction process. |
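To make the interplay of `depth`, `max_links_from_page`, `links`, and `current_depth` concrete, here is a minimal sketch of a level-by-level crawl. It is not the library's implementation: `CrawlSketch`, `FAKE_SITE`, and `crawl()` are hypothetical names, the "site" is an in-memory dictionary rather than a real browser, and the sketch assumes that `depth` counts the levels visited, starting with the page at `website_url`.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory "site": page URL -> links found on that page.
FAKE_SITE = {
    "https://example.com": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
    "https://example.com/a": ["https://example.com/a1", "https://example.com"],
    "https://example.com/b": ["https://example.com/b1"],
}


@dataclass
class CrawlSketch:
    """Illustrative stand-in only; not the library's DefaultWorker."""

    website_url: str
    depth: int = 2                # assumed: number of levels to visit
    max_links_from_page: int = 2  # cap on links kept per page
    links: list = field(default_factory=list)  # one list of URLs per level
    current_depth: int = 0

    def __post_init__(self) -> None:
        # Level 0 holds only the start URL.
        self.links = [[self.website_url]]

    def crawl(self) -> list:
        """Visit pages level by level and return them in visit order."""
        visited = []
        seen = {self.website_url}
        while self.current_depth < self.depth and self.current_depth < len(self.links):
            next_level = []
            for url in self.links[self.current_depth]:
                visited.append(url)
                picked = 0
                for link in FAKE_SITE.get(url, []):
                    # Skip already-queued URLs and respect the per-page cap.
                    if link in seen or picked >= self.max_links_from_page:
                        continue
                    seen.add(link)
                    next_level.append(link)
                    picked += 1
            if next_level:
                self.links.append(next_level)
            self.current_depth += 1
        return visited


if __name__ == "__main__":
    worker = CrawlSketch("https://example.com")
    print(worker.crawl())
    print(worker.links)
```

With the defaults above, the sketch visits the start page plus `/a` and `/b`; `/c` is dropped by the per-page cap, and the third level (`/a1`, `/b1`) is queued in `links` but never visited because the depth limit is reached first.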
Source code in extract_emails/workers/default_worker.py
aget_data() async
Retrieve extracted data asynchronously.
Returns:
Type | Description |
---|---|
list[PageData] | A list of PageData objects containing the extracted information. |
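Because `aget_data()` is a coroutine, it has to be awaited from an event loop, and several workers can be awaited concurrently. The sketch below shows that calling pattern with `asyncio.gather`. It deliberately avoids the real library: `AsyncWorkerSketch`, `PageRecord`, and `collect()` are hypothetical stand-ins, the record fields are assumptions, and no network request is made.

```python
import asyncio
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class PageRecord:
    """Hypothetical stand-in for PageData; the field names are assumptions."""

    page_url: str
    emails: list


class AsyncWorkerSketch:
    """Illustrative stand-in that exposes an aget_data() coroutine."""

    def __init__(self, website_url: str) -> None:
        self.website_url = website_url

    async def aget_data(self) -> list:
        # Yield to the event loop instead of fetching a real page.
        await asyncio.sleep(0)
        host = urlparse(self.website_url).netloc
        return [PageRecord(self.website_url, [f"info@{host}"])]


async def collect(urls: list) -> list:
    """Await one worker per URL concurrently and flatten the results."""
    workers = [AsyncWorkerSketch(url) for url in urls]
    # gather() returns results in the same order as its arguments.
    per_site = await asyncio.gather(*(w.aget_data() for w in workers))
    return [record for site in per_site for record in site]


if __name__ == "__main__":
    for record in asyncio.run(collect(["https://example.com", "https://example.org"])):
        print(record.page_url, record.emails)
```

From synchronous code, `asyncio.run()` is the usual entry point; inside an already running coroutine, a plain `await worker.aget_data()` is enough.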
Source code in extract_emails/workers/default_worker.py
get_data()
Retrieve extracted data synchronously.
Returns:
Type | Description |
---|---|
list[PageData] | A list of PageData objects containing the extracted information. |
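The returned list holds one entry per processed page, so the same value can appear on several pages. As a rough illustration of post-processing such a result, the sketch below de-duplicates values across pages. `PageDataSketch` and `unique_values()` are hypothetical, and the assumed record shape (a `data` dictionary keyed by extractor name) is a guess that should be checked against the real `PageData` model.

```python
from dataclasses import dataclass, field


@dataclass
class PageDataSketch:
    """Hypothetical stand-in for PageData; the field names are assumptions."""

    website: str
    page_url: str
    data: dict = field(default_factory=dict)  # e.g. {"email": [...]}


def unique_values(pages: list, key: str) -> list:
    """Collect values stored under `key`, de-duplicated in first-seen order."""
    seen = {}
    for page in pages:
        for value in page.data.get(key, []):
            seen.setdefault(value, None)  # dicts preserve insertion order
    return list(seen)


if __name__ == "__main__":
    # Sample records standing in for what a synchronous get_data() call might return.
    pages = [
        PageDataSketch("https://example.com", "https://example.com",
                       {"email": ["info@example.com"]}),
        PageDataSketch("https://example.com", "https://example.com/contact",
                       {"email": ["info@example.com", "sales@example.com"]}),
    ]
    print(unique_values(pages, "email"))
```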
Source code in extract_emails/workers/default_worker.py