Plugin architecture

General architecture

Papis uses the stevedore library for general plugin management. Other modules are not expected to interact with stevedore directly; they should use the helper wrappers provided by papis.plugin instead.

The different plugins in Papis (e.g. papis.command, papis.exporter etc.) define a so-called ExtensionManager, which loads various objects that have been declared as entrypoints (plugins) in the package metadata.

For example, the yaml exporter in papis.yaml is defined as

from typing import List

import yaml
import papis.document

def exporter(documents: List[papis.document.Document]) -> str:
    string = yaml.dump_all(
        [papis.document.to_dict(document) for document in documents],
        allow_unicode=True)
    return str(string)

and declared in pyproject.toml as

[project.entry-points."papis.exporter"]
yaml = "papis.yaml:exporter"

Here yaml is the name of the entrypoint, papis.yaml is the module in which it is located, and exporter is the callable used to invoke the plugin. In general, the format is <name> = "<module>:<callable>". The exporter can be retrieved by name using

import papis.plugin

extension_manager = papis.plugin.get_extension_manager("papis.exporter")
yaml_exporter = extension_manager["yaml"].plugin

# mydocs is assumed to be a list of papis.document.Document objects
yaml_string = yaml_exporter(mydocs)

Due to the entrypoint mechanism used by stevedore, any third-party package can add plugins to Papis in this fashion. More information about each type of plugin available in Papis is given below.
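
To make the <name> = "<module>:<callable>" format more concrete, the following is a minimal sketch of how such a declaration resolves to a Python object. It uses only the standard-library importlib.metadata module, not the Papis or stevedore API, and the entrypoint itself (a "json" name in a made-up "demo.exporter" group pointing at json.dumps) is a hypothetical stand-in for an installed plugin.

```python
from importlib.metadata import EntryPoint

# Hypothetical entrypoint, built by hand for illustration. An installed
# package would declare this in its metadata instead.
entry_point = EntryPoint(name="json", value="json:dumps", group="demo.exporter")

# load() imports the module named before the colon and returns the
# attribute named after it.
exporter = entry_point.load()

# The loaded object is an ordinary callable.
print(entry_point.name, "->", exporter({"title": "A"}))
```

In Papis itself this lookup is handled by the extension manager, so plugin authors normally only need to write the declaration.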

Exporter

TO DOCUMENT

Command

TO DOCUMENT

Importer

Papis allows implementing additional plugins for importing external metadata into its database through so-called “importers” and “downloaders”. The difference between a downloader and an importer is largely semantic. Downloaders are mostly meant to scrape websites or download files from a remote location.

As an example, we show here how to implement a custom downloader for the ACL Anthology. An importer is generally simpler, as it does not require scraping remote websites. We recommend taking a look at one of the existing importers (e.g. in papis/crossref.py) or downloaders (e.g. in papis/downloaders/sciencedirect.py) to get an idea of the existing features and implementations.

For a downloader, we create a new file in papis/downloaders and start writing a class that inherits from papis.downloaders.Downloader. This can look something like

import re
from typing import Any, Dict, Optional

import papis.document
import papis.downloaders.base


class Downloader(papis.downloaders.Downloader):
    def __init__(self, url: str) -> None:
        super().__init__(
            url,
            # A name for the downloader that is shown to the user at times
            name="acl",
            # The file extension expected of the downloaded files
            expected_document_extension="pdf",
            # Priorities are sorted in ascending order (0 is the highest) and
            # are used to present the downloaders to the user and in
            # automatic merging
            priority=10,
        )

The match() class method decides whether a downloader can be used with a given URI. It generally checks whether the URI matches a website URL, returning a downloader instance on a match and None otherwise, e.g.

@classmethod
def match(cls, url: str) -> Optional[papis.downloaders.Downloader]:
    return Downloader(url) if re.match(r".*aclanthology\.org.*", url) else None
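
The pattern above accepts any string that contains aclanthology.org, including URLs where it only appears in the query string. As a sketch of the trade-off, the stand-alone functions below (hypothetical helpers, not part of Papis) compare the regular expression with a stricter check on the parsed host name. Whether the extra strictness is worthwhile depends on the site.

```python
import re
from urllib.parse import urlparse

ACL_PATTERN = re.compile(r".*aclanthology\.org.*")


def matches_pattern(url: str) -> bool:
    """Same test as in match(): the domain may appear anywhere in the URL."""
    return ACL_PATTERN.match(url) is not None


def matches_host(url: str) -> bool:
    """Stricter variant: only accept the domain (or a subdomain) as host."""
    host = urlparse(url).hostname or ""
    return host == "aclanthology.org" or host.endswith(".aclanthology.org")


# Hypothetical URLs used only to compare the two checks
for url in ("https://aclanthology.org/2022.acl-long.1/",
            "https://example.com/?ref=aclanthology.org"):
    print(url, matches_pattern(url), matches_host(url))
```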

By default, a downloader implements a get_data() method to retrieve metadata. This already does a good job of fetching basic metadata (title, authors, etc.) through standard elements such as the Dublin Core Metadata. We can, however, extend it for any specific downloader. For instance, some documents in the ACL Anthology provide a “code” field with a link to, e.g., a GitHub repository. We will try to extract the code repository URL using bs4. A BeautifulSoup object with the parsed HTML can be obtained and manipulated as follows

def get_data(self) -> Dict[str, Any]:
    soup = self._get_soup()
    data = papis.downloaders.base.parse_meta_headers(soup)

    # Locate the definition list that holds the paper details
    paper_details = soup.find("div", "row acl-paper-details").find("dl")
    for dt in paper_details.find_all("dt"):
        if "Code" in dt.text:
            data["code"] = dt.find_next_sibling().find("a").attrs["href"]
            break

    return data
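
The chained find() calls above raise an AttributeError when the page has no details block or no link. A more defensive version of the extraction step might look like the sketch below. It runs on a small hand-written HTML fragment, so the markup, the repository URL, and the extract_code_url helper are all assumptions made for illustration, not the actual ACL Anthology page or a Papis function.

```python
from typing import Optional

from bs4 import BeautifulSoup

# Hand-written fragment that imitates the structure assumed in get_data()
SAMPLE_HTML = """
<div class="row acl-paper-details">
  <dl>
    <dt>Anthology ID:</dt><dd>2022.example-1</dd>
    <dt>Code</dt><dd><a href="https://github.com/example/repo">example/repo</a></dd>
  </dl>
</div>
"""


def extract_code_url(soup: BeautifulSoup) -> Optional[str]:
    """Return the link in the "Code" entry, or None if anything is missing."""
    details = soup.find("div", class_="acl-paper-details")
    if details is None:
        return None

    for dt in details.find_all("dt"):
        if "Code" in dt.get_text():
            dd = dt.find_next_sibling("dd")
            link = dd.find("a") if dd is not None else None
            if link is not None and link.has_attr("href"):
                return str(link["href"])

    return None


print(extract_code_url(BeautifulSoup(SAMPLE_HTML, "html.parser")))
```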

Metadata can also be obtained from BibTeX by overriding the get_bibtex_url() method. This can be useful if, for instance, the get_data() method fails to correctly identify the abstract section. In our example, we can fix this by using the metadata found in the BibTeX file. Luckily, for ACL, the BibTeX URL is simply the document URL with a .bib extension. We can implement it as

def get_bibtex_url(self) -> Optional[str]:
    url = self.ctx.data.get("url")
    return f"{url}.bib" if url is not None else None
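
If the stored document URL ends with a slash, appending .bib directly would produce a path ending in /.bib. Whether this happens in practice is an assumption here, but the string handling can be sketched in isolation with a hypothetical bibtex_url helper that strips the trailing slash first.

```python
from typing import Optional


def bibtex_url(url: Optional[str]) -> Optional[str]:
    """Build the BibTeX URL by appending ".bib" to the document URL."""
    if not url:
        return None

    # Drop a trailing slash so that the extension attaches to the identifier
    return f"{url.rstrip('/')}.bib"


# Hypothetical document URL used only for illustration
print(bibtex_url("https://aclanthology.org/2022.acl-long.1/"))
```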

To download files from a remote resource, the downloader relies on data["pdf_url"] by default. However, if this does not exist or does not return the actual document PDF, we can override the get_document_url() method.

def get_document_url(self) -> Optional[str]:
    if "pdf_url" in self.ctx.data:
        return str(self.ctx.data["pdf_url"])

    return None

Finally, to install the plugin and have it recognized by the extension system that Papis uses, it needs to be added to pyproject.toml. This can be done by extending the papis.downloader entrypoint as follows

[project.entry-points."papis.downloader"]
acl = "papis.downloaders.acl:Downloader"

Explore

TO DOCUMENT