
# Workers

## DefaultWorker

DefaultWorker is responsible for orchestrating the extraction of emails and LinkedIn URLs from a given website.

This class uses both synchronous and asynchronous workers to perform the extraction. It manages the configuration of the extraction process, including the website URL, the browser, the link filter, the data extractors, the traversal depth, and the maximum number of links to extract from each page.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `website_url` | `str` | The URL of the website to extract data from. |
| `browser` | `PageSourceGetter` | The browser instance used to fetch page sources. |
| `link_filter` | `LinkFilterBase` | The filter used to determine which links to follow. |
| `data_extractors` | `list[DataExtractor]` | The list of data extractors to use for extracting information. |
| `depth` | `int` | The maximum depth to traverse the website. |
| `max_links_from_page` | `int` | The maximum number of links to extract from a single page. |
| `links` | `list[list[str]]` | A list of lists containing URLs to be processed at each depth level. |
| `current_depth` | `int` | The current depth level of the extraction process. |
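The `links`, `current_depth`, and `max_links_from_page` attributes describe a breadth-first, level-by-level traversal. A minimal self-contained sketch of that scheme (the `fetch_links` callable is a stand-in for the real page-fetching and link-filtering logic, not part of the library's API):

```python
def traverse_by_depth(start_url, fetch_links, depth=2, max_links_from_page=20):
    """Visit pages one depth level at a time, as DefaultWorker's
    `links` / `current_depth` attributes suggest."""
    links = [[start_url]]  # one list of URLs per depth level
    visited = set()
    for current_depth in range(depth):
        next_level = []
        for url in links[current_depth]:
            if url in visited:
                continue
            visited.add(url)
            # cap how many links a single page may contribute
            next_level.extend(fetch_links(url)[:max_links_from_page])
        if not next_level:
            break
        links.append(next_level)
    return visited
```

With `depth=1` only the start page is visited; raising `depth` lets the traversal follow links discovered on previously visited pages.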

Source code in extract_emails/workers/default_worker.py
```python
class DefaultWorker:
    """DefaultWorker is responsible for orchestrating the extraction of emails and LinkedIn URLs from a given website.

    This class utilizes both synchronous and asynchronous workers to perform the extraction process. It manages the
    configuration of the extraction process, including the website URL, browser, link filter, data extractors, depth,
    and maximum links to extract from a page.

    Attributes:
        website_url (str): The URL of the website to extract data from.
        browser (PageSourceGetter): The browser instance used to fetch page sources.
        link_filter (LinkFilterBase): The filter used to determine which links to follow.
        data_extractors (list[DataExtractor]): The list of data extractors to use for extracting information.
        depth (int): The maximum depth to traverse the website.
        max_links_from_page (int): The maximum number of links to extract from a single page.
        links (list[list[str]]): A list of lists containing URLs to be processed at each depth level.
        current_depth (int): The current depth level of the extraction process.
    """

    def __init__(
        self,
        website_url: str,
        browser: PageSourceGetter,
        *,
        link_filter: LinkFilterBase | None = None,
        data_extractors: list[DataExtractor] | None = None,
        depth: int = 20,
        max_links_from_page: int = 20,
    ):
        self.website_url = website_url.rstrip("/")
        self.browser = browser
        self.link_filter = link_filter or ContactInfoLinkFilter(self.website_url)
        self.data_extractors = data_extractors or [
            EmailExtractor(),
            LinkedinExtractor(),
        ]
        self.depth = depth
        self.max_links_from_page = max_links_from_page

        self.links = [[self.website_url]]
        self.current_depth = 0

        self._sync_worker = _SyncDefaultWorker(
            self.website_url,
            self.browser,
            link_filter=self.link_filter,
            data_extractors=self.data_extractors,
            depth=self.depth,
            max_links_from_page=self.max_links_from_page,
        )
        self._async_worker = _AsyncDefaultWorker(
            self.website_url,
            self.browser,
            link_filter=self.link_filter,
            data_extractors=self.data_extractors,
            depth=self.depth,
            max_links_from_page=self.max_links_from_page,
        )

    def get_data(self) -> list[PageData]:
        """Retrieve extracted data synchronously.

        Returns:
            list[PageData]: A list of PageData objects containing the extracted information.
        """
        return self._sync_worker.get_data()

    async def aget_data(self) -> list[PageData]:
        """Retrieve extracted data asynchronously.

        Returns:
            list[PageData]: A list of PageData objects containing the extracted information.
        """
        return await self._async_worker.get_data()
```
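As the source shows, `DefaultWorker` is a thin facade: the constructor builds a synchronous and an asynchronous worker from the same configuration, and `get_data()` / `aget_data()` simply delegate to them. A reduced, self-contained sketch of that pattern, with illustrative stubs standing in for the real `_SyncDefaultWorker` and `_AsyncDefaultWorker`:

```python
import asyncio

class _SyncStub:
    # stand-in for _SyncDefaultWorker
    def get_data(self):
        return ["sync-result"]

class _AsyncStub:
    # stand-in for _AsyncDefaultWorker
    async def get_data(self):
        return ["async-result"]

class WorkerFacade:
    """Mirrors DefaultWorker's shape: one configuration, two delegates."""
    def __init__(self):
        # in DefaultWorker, both workers receive the same url, browser,
        # link filter, extractors, depth, and max_links_from_page
        self._sync_worker = _SyncStub()
        self._async_worker = _AsyncStub()

    def get_data(self):
        return self._sync_worker.get_data()

    async def aget_data(self):
        return await self._async_worker.get_data()
```

Keeping both delegates behind one object means callers choose sync or async at the call site without reconfiguring anything.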

### aget_data() async

Retrieve extracted data asynchronously.

Returns:

| Type | Description |
| --- | --- |
| `list[PageData]` | A list of PageData objects containing the extracted information. |

Source code in extract_emails/workers/default_worker.py
```python
async def aget_data(self) -> list[PageData]:
    """Retrieve extracted data asynchronously.

    Returns:
        list[PageData]: A list of PageData objects containing the extracted information.
    """
    return await self._async_worker.get_data()
```
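Because `aget_data()` is a coroutine, several extractions can run concurrently with `asyncio.gather`. A sketch with stub workers standing in for real `DefaultWorker` instances (the `StubWorker` class and its `delay` parameter are illustrative only):

```python
import asyncio

class StubWorker:
    # stands in for DefaultWorker; a real instance would wrap a browser
    def __init__(self, name, delay):
        self.name, self.delay = name, delay

    async def aget_data(self):
        await asyncio.sleep(self.delay)  # simulate network I/O
        return [f"{self.name}-page"]

async def crawl_all(workers):
    # run every worker's extraction concurrently; gather preserves order
    results = await asyncio.gather(*(w.aget_data() for w in workers))
    return [page for pages in results for page in pages]

pages = asyncio.run(crawl_all([StubWorker("a", 0.01), StubWorker("b", 0.01)]))
```

From synchronous code, a single extraction can likewise be driven with `asyncio.run(worker.aget_data())`.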

### get_data()

Retrieve extracted data synchronously.

Returns:

| Type | Description |
| --- | --- |
| `list[PageData]` | A list of PageData objects containing the extracted information. |

Source code in extract_emails/workers/default_worker.py
```python
def get_data(self) -> list[PageData]:
    """Retrieve extracted data synchronously.

    Returns:
        list[PageData]: A list of PageData objects containing the extracted information.
    """
    return self._sync_worker.get_data()
```