
# Workers

## DefaultWorker

DefaultWorker is responsible for orchestrating the extraction of emails and LinkedIn URLs from a given website.

This class uses both synchronous and asynchronous workers to perform the extraction. It manages the configuration of the extraction process, including the website URL, the browser, the link filter, the data extractors, the traversal depth, and the maximum number of links to extract from each page.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `website_url` | `str` | The URL of the website to extract data from. |
| `browser` | `PageSourceGetter` | The browser instance used to fetch page sources. |
| `link_filter` | `LinkFilterBase` | The filter used to determine which links to follow. |
| `data_extractors` | `list[DataExtractor]` | The list of data extractors to use for extracting information. |
| `depth` | `int` | The maximum depth to traverse the website. |
| `max_links_from_page` | `int` | The maximum number of links to extract from a single page. |
| `links` | `list[list[str]]` | A list of lists containing URLs to be processed at each depth level. |
| `current_depth` | `int` | The current depth level of the extraction process. |
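The `links`, `current_depth`, and `max_links_from_page` attributes describe a breadth-first, level-by-level traversal. A minimal self-contained sketch of that scheme (the `fetch_links` callable is a stand-in for the real page-fetching and link-filtering logic, not part of the library's API):

```python
def traverse_by_depth(start_url, fetch_links, depth=2, max_links_from_page=20):
    """Visit pages one depth level at a time, as DefaultWorker's
    `links` / `current_depth` attributes suggest."""
    links = [[start_url]]  # one list of URLs per depth level
    visited = set()
    for current_depth in range(depth):
        next_level = []
        for url in links[current_depth]:
            if url in visited:
                continue
            visited.add(url)
            # cap how many links a single page may contribute
            next_level.extend(fetch_links(url)[:max_links_from_page])
        if not next_level:
            break
        links.append(next_level)
    return visited
```

With `depth=1` only the start page is visited; raising `depth` lets the traversal follow links discovered on previously visited pages.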

Source code in extract_emails/workers/default_worker.py
```python
class DefaultWorker:
    """DefaultWorker is responsible for orchestrating the extraction of emails and LinkedIn URLs from a given website.

    This class utilizes both synchronous and asynchronous workers to perform the extraction process. It manages the
    configuration of the extraction process, including the website URL, browser, link filter, data extractors, depth,
    and maximum links to extract from a page.

    Attributes:
        website_url (str): The URL of the website to extract data from.
        browser (PageSourceGetter): The browser instance used to fetch page sources.
        link_filter (LinkFilterBase): The filter used to determine which links to follow.
        data_extractors (list[DataExtractor]): The list of data extractors to use for extracting information.
        depth (int): The maximum depth to traverse the website.
        max_links_from_page (int): The maximum number of links to extract from a single page.
        links (list[list[str]]): A list of lists containing URLs to be processed at each depth level.
        current_depth (int): The current depth level of the extraction process.
    """

    def __init__(
        self,
        website_url: str,
        browser: PageSourceGetter,
        *,
        link_filter: LinkFilterBase | None = None,
        data_extractors: list[DataExtractor] | None = None,
        depth: int = 20,
        max_links_from_page: int = 20,
    ):
        self.website_url = website_url.rstrip("/")
        self.browser = browser
        self.link_filter = link_filter or ContactInfoLinkFilter(self.website_url)
        self.data_extractors = data_extractors or [
            EmailExtractor(),
            LinkedinExtractor(),
        ]
        self.depth = depth
        self.max_links_from_page = max_links_from_page

        self.links = [[self.website_url]]
        self.current_depth = 0

        self._sync_worker = _SyncDefaultWorker(
            self.website_url,
            self.browser,
            link_filter=self.link_filter,
            data_extractors=self.data_extractors,
            depth=self.depth,
            max_links_from_page=self.max_links_from_page,
        )
        self._async_worker = _AsyncDefaultWorker(
            self.website_url,
            self.browser,
            link_filter=self.link_filter,
            data_extractors=self.data_extractors,
            depth=self.depth,
            max_links_from_page=self.max_links_from_page,
        )

    def get_data(self) -> list[PageData]:
        """Retrieve extracted data synchronously.

        Returns:
            list[PageData]: A list of PageData objects containing the extracted information.
        """
        return self._sync_worker.get_data()

    async def aget_data(self) -> list[PageData]:
        """Retrieve extracted data asynchronously.

        Returns:
            list[PageData]: A list of PageData objects containing the extracted information.
        """
        return await self._async_worker.get_data()
```
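As the source shows, `DefaultWorker` is a thin facade: the constructor builds a synchronous and an asynchronous worker from the same configuration, and `get_data()` / `aget_data()` simply delegate to them. A reduced, self-contained sketch of that pattern, with illustrative stubs standing in for the real `_SyncDefaultWorker` and `_AsyncDefaultWorker`:

```python
import asyncio

class _SyncStub:
    # stand-in for _SyncDefaultWorker
    def get_data(self):
        return ["sync-result"]

class _AsyncStub:
    # stand-in for _AsyncDefaultWorker
    async def get_data(self):
        return ["async-result"]

class WorkerFacade:
    """Mirrors DefaultWorker's shape: one configuration, two delegates."""
    def __init__(self):
        # in DefaultWorker, both workers receive the same url, browser,
        # link filter, extractors, depth, and max_links_from_page
        self._sync_worker = _SyncStub()
        self._async_worker = _AsyncStub()

    def get_data(self):
        return self._sync_worker.get_data()

    async def aget_data(self):
        return await self._async_worker.get_data()
```

Keeping both delegates behind one object means callers choose sync or async at the call site without reconfiguring anything.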

### aget_data() async

Retrieve extracted data asynchronously.

Returns:

| Type | Description |
| --- | --- |
| `list[PageData]` | A list of PageData objects containing the extracted information. |

Source code in extract_emails/workers/default_worker.py
```python
async def aget_data(self) -> list[PageData]:
    """Retrieve extracted data asynchronously.

    Returns:
        list[PageData]: A list of PageData objects containing the extracted information.
    """
    return await self._async_worker.get_data()
```
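Because `aget_data()` is a coroutine, several extractions can run concurrently with `asyncio.gather`. A sketch with stub workers standing in for real `DefaultWorker` instances (the `StubWorker` class and its `delay` parameter are illustrative only):

```python
import asyncio

class StubWorker:
    # stands in for DefaultWorker; a real instance would wrap a browser
    def __init__(self, name, delay):
        self.name, self.delay = name, delay

    async def aget_data(self):
        await asyncio.sleep(self.delay)  # simulate network I/O
        return [f"{self.name}-page"]

async def crawl_all(workers):
    # run every worker's extraction concurrently; gather preserves order
    results = await asyncio.gather(*(w.aget_data() for w in workers))
    return [page for pages in results for page in pages]

pages = asyncio.run(crawl_all([StubWorker("a", 0.01), StubWorker("b", 0.01)]))
```

From synchronous code, a single extraction can likewise be driven with `asyncio.run(worker.aget_data())`.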

### get_data()

Retrieve extracted data synchronously.

Returns:

| Type | Description |
| --- | --- |
| `list[PageData]` | A list of PageData objects containing the extracted information. |

Source code in extract_emails/workers/default_worker.py
```python
def get_data(self) -> list[PageData]:
    """Retrieve extracted data synchronously.

    Returns:
        list[PageData]: A list of PageData objects containing the extracted information.
    """
    return self._sync_worker.get_data()
```