.. _plugin-architecture:
Plugin architecture
===================
General architecture
--------------------
Papis uses Python entry points and ``importlib.metadata`` for general plugin
management. However, other modules are not expected to interact with these
mechanisms directly and should instead use the helper wrappers provided by
``papis.plugin``.

The different plugin types in Papis (e.g. ``papis.command``, ``papis.exporter``,
etc.) each define a namespace for themselves and load the objects that have
been declared as entry points (plugins) in the package metadata.
For example, the ``yaml`` exporter in ``papis.yaml`` is defined (loosely) as:

.. code:: python

    from typing import List

    import yaml

    import papis.document

    def exporter(documents: List[papis.document.Document]) -> str:
        string = yaml.dump_all(
            [papis.document.to_dict(document) for document in documents],
            allow_unicode=True)

        return str(string)

and declared in ``pyproject.toml`` as:

.. code:: toml

    [project.entry-points."papis.exporter"]
    yaml = "papis.yaml:exporter"

where ``yaml`` is the name of the entry point, ``papis.yaml`` is the module
in which it is located and ``exporter`` is the callable used to invoke the
plugin, i.e. the format is ``<name> = "<module>:<callable>"``. The exporter can
be retrieved by name using:

.. code:: python

    from papis.plugin import get_plugin_by_name

    yaml_exporter = get_plugin_by_name("papis.exporter", "yaml")
    yaml_string = yaml_exporter(mydocs)

Due to the entry point mechanism, any third-party package can add plugins to
Papis in this fashion. More information about each type of plugin available in
Papis is given below.
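
To make the mechanism more concrete, the following is a minimal sketch of how
such a lookup could be done with the standard library alone. It is not the
implementation found in ``papis.plugin``: the helper names ``list_entry_points``
and ``resolve_entry_point`` are hypothetical and only illustrate how a
``module:callable`` value can be discovered and turned into an object.

.. code:: python

    import importlib
    import sys
    from importlib import metadata
    from typing import Any, Dict


    def list_entry_points(group: str) -> Dict[str, str]:
        """Map entry point names in *group* to their ``module:callable`` values."""
        if sys.version_info >= (3, 10):
            entry_points = metadata.entry_points(group=group)
        else:
            # Older Python versions return a mapping from group to entry points
            entry_points = metadata.entry_points().get(group, [])

        return {ep.name: ep.value for ep in entry_points}


    def resolve_entry_point(value: str) -> Any:
        """Import the module and return the object named by ``module:callable``."""
        module_name, _, attr_path = value.partition(":")
        obj = importlib.import_module(module_name)
        # The attribute part may be dotted, e.g. "Class.method"
        for attr in filter(None, attr_path.split(".")):
            obj = getattr(obj, attr)

        return obj


    # Lists the exporters declared by installed packages (empty if there are none)
    print(list_entry_points("papis.exporter"))
    # A standard library function is used here so that the example runs anywhere
    dumps = resolve_entry_point("json:dumps")
    print(dumps({"title": "Example"}))

In practice, the wrappers in ``papis.plugin`` should be preferred, as described
above. The sketch only shows why declaring an entry point in the package
metadata is sufficient for a third-party plugin to be found.
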
Exporter
--------
TO DOCUMENT
Command
-------
TO DOCUMENT
Importer
--------
Papis allows implementing additional plugins for importing external metadata
into its database through so-called "importers" and "downloaders". The
difference between a downloader and an importer is largely semantic. Downloaders
are mostly meant to scrape websites or download files from a remote location.
As an example, we show here how to implement a custom downloader for the
ACL Anthology. An :class:`~papis.importer.Importer`
is generally simpler, as it does not require scraping remote websites. We
recommend taking a look at one of the existing importers (e.g. in ``papis/crossref.py``)
or downloaders (e.g. in ``papis/downloaders/sciencedirect.py``) to get an idea
about existing features and implementations.
For a downloader, we create a new file in ``papis/downloaders`` and start writing
a class that inherits from :class:`papis.downloaders.Downloader`. This can look
something like:

.. code:: python

    import re
    from typing import Any, Dict, Optional

    import papis.document
    import papis.downloaders.base

    class Downloader(papis.downloaders.Downloader):
        def __init__(self, url: str) -> None:
            super().__init__(
                url,
                # A name for the downloader that is shown to the user at times
                name="acl",
                # The extension that is expected for the downloaded files
                expected_document_extension="pdf",
                # Priorities are sorted in ascending order (0 comes first) and
                # are used to present the downloaders to the user and in
                # automatic merging
                priority=10,
            )

The main way to recognize if a downloader can be used with a given URI is
through the :meth:`~papis.downloaders.Downloader.match` method. This generally
checks if a given URI matches a website URL, e.g.:

.. code:: python

    @classmethod
    def match(cls, url: str) -> Optional[papis.downloaders.Downloader]:
        return Downloader(url) if re.match(r".*aclanthology\.org.*", url) else None

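
Note that the regular expression above accepts any URL that contains
``aclanthology.org`` anywhere, including in a query string. If a stricter check
is desired, one option is to compare the host name instead. The following is a
standalone sketch of that idea; the helper ``is_acl_url`` is hypothetical and
not part of Papis.

.. code:: python

    from urllib.parse import urlparse


    def is_acl_url(url: str) -> bool:
        """Check whether *url* points to ``aclanthology.org`` or a subdomain."""
        hostname = urlparse(url).hostname or ""
        return hostname == "aclanthology.org" or hostname.endswith(".aclanthology.org")


    print(is_acl_url("https://aclanthology.org/some-paper/"))
    print(is_acl_url("https://example.com/?ref=aclanthology.org"))

Inside ``match``, the ``re.match`` condition could then be replaced by
``is_acl_url(url)``. URLs without a scheme are rejected by this check, since
``urlparse`` does not detect a host name for them.
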
By default, a downloader implements a :meth:`~papis.downloaders.Downloader.get_data`
method to retrieve metadata. This already does a good job in fetching basic
metadata (title, authors, etc.) through standard elements such as the
Dublin Core Metadata.
We can however extend it for any specific downloader. For instance, some
documents in the ACL Anthology provide a "code" field, with a link to e.g. a
GitHub repository. We will try to extract the code repository URL using
:mod:`bs4`. A ``BeautifulSoup`` object containing the parsed HTML can be
obtained and manipulated as follows:

.. code:: python

    def get_data(self) -> Dict[str, Any]:
        soup = self._get_soup()
        data = papis.downloaders.base.parse_meta_headers(soup)

        # The paper details are given as a definition list of <dt>/<dd> pairs
        paper_details = soup.find("div", "row acl-paper-details").find("dl")
        for dt in paper_details.find_all("dt"):
            if "Code" in dt.text:
                data["code"] = dt.find_next_sibling().find("a").attrs["href"]
                break

        return data

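
The scraping logic can be tried out without a network connection by running it
on a small HTML fragment. The sketch below assumes that :mod:`bs4` is installed;
the fragment is made up and only mimics the structure implied by the code above,
and the function ``find_code_url`` is a hypothetical standalone helper. Unlike
the method above, it guards against missing elements, since ``find`` returns
``None`` when nothing matches.

.. code:: python

    from typing import Optional

    from bs4 import BeautifulSoup

    # Made-up fragment that imitates a paper details section
    HTML = """
    <div class="row acl-paper-details">
      <dl>
        <dt>Venue:</dt><dd>Example Venue</dd>
        <dt>Code</dt>
        <dd><a href="https://github.com/example/repo">example/repo</a></dd>
      </dl>
    </div>
    """


    def find_code_url(html: str) -> Optional[str]:
        """Return the link next to the "Code" entry, if there is one."""
        soup = BeautifulSoup(html, "html.parser")
        details = soup.find("div", class_="acl-paper-details")
        dl = details.find("dl") if details is not None else None
        if dl is None:
            return None

        for dt in dl.find_all("dt"):
            if "Code" in dt.text:
                dd = dt.find_next_sibling("dd")
                link = dd.find("a") if dd is not None else None
                if link is not None and link.has_attr("href"):
                    return str(link["href"])

        return None


    print(find_code_url(HTML))
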
Metadata can also be obtained from BibTeX by overriding the
:meth:`~papis.downloaders.Downloader.get_bibtex_url` method. This can be useful
if, for instance, the ``get_data`` method fails to correctly identify the abstract
section. In our example we can fix this by scraping the metadata found in the
BibTeX file. Luckily, for ACL, the BibTeX URL is simply the document URL with a
``.bib`` extension. We can implement it as:

.. code:: python

    def get_bibtex_url(self) -> Optional[str]:
        url = self.ctx.data.get("url")
        return f"{url}.bib" if url is not None else None

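
One detail worth considering is a trailing slash in the document URL, in which
case plain string concatenation produces a URL ending in ``/.bib``. A minimal
sketch of a more defensive variant, written as a standalone function, could
look as follows. The name ``make_bibtex_url`` and the example URL are
hypothetical.

.. code:: python

    from typing import Optional


    def make_bibtex_url(url: Optional[str]) -> Optional[str]:
        """Append ``.bib`` to *url* after removing any trailing slashes."""
        if url is None:
            return None

        return f"{url.rstrip('/')}.bib"


    print(make_bibtex_url("https://aclanthology.org/some-paper/"))
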
To download files from a remote resource, the downloader relies on
``data["pdf_url"]`` by default. However, if this does not exist or does not
return the actual document PDF, we can override the
:meth:`~papis.downloaders.Downloader.get_document_url` method:

.. code:: python

    def get_document_url(self) -> Optional[str]:
        if "pdf_url" in self.ctx.data:
            return str(self.ctx.data["pdf_url"])

        return None

Finally, to install the plugin and have it recognized by the extension system
that Papis uses, it needs to be declared in ``pyproject.toml``. This can be done
by extending the ``papis.downloader`` entry point group as follows:

.. code:: toml

    [project.entry-points."papis.downloader"]
    acl = "papis.downloaders.acl:Downloader"

Explore
-------
TO DOCUMENT