Final Thesis: Implementing PAKET, a Production-Ready AI enhanced Keyword Extractor

Abstract: Keywords are fragments of the core content of a text and can be used to cluster documents, visualize information, or enrich metadata. Keyword extraction is a well-researched topic in the information retrieval community, and existing solutions work well, but they are usually designed for scientific experimentation rather than for production. Being suitable for production means developing for the real world, i.e. anticipating how the needs of potential stakeholders can best be met. This thesis presents a solution for keyword extraction that is intended to be production-ready. Quality criteria are applied throughout the software development cycle, e.g. by drafting requirements that demand a sustainable software architecture and by following general principles such as design by contract. In addition, a graphical UI is provided, which demonstrates the main functionality and serves as a proof of concept. The result is a deployable application that extracts not only keywords from text, but also text from files. More than fifteen different MIME types are supported. The keyword extractor ranks keywords by replicating the YAKE! algorithm for 1-grams and filters them in a post-processing step. Filtering is performed using fuzzy matching and language-specific, pre-trained NLP pipelines provided by the spaCy library. Currently, English and German are implemented, but the design allows further languages to be added on request. The application can be integrated into other systems through a web service that offers a RESTful API.
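
The fuzzy-matching part of the post-processing step can be pictured with a minimal sketch. The abstract does not say which fuzzy-matching method or threshold PAKET uses, so everything below is an assumption made for illustration: the function name `dedupe_fuzzy`, the similarity threshold of 0.8, the sample scores, and the use of `difflib.SequenceMatcher` from the Python standard library. The candidates are ordered best first, with lower scores meaning more relevant, as in YAKE!.

```python
from difflib import SequenceMatcher


def dedupe_fuzzy(ranked, threshold=0.8):
    """Drop candidates that closely resemble a better-ranked keyword.

    `ranked` is a list of (keyword, score) pairs, best candidate first.
    Both this helper and the threshold are hypothetical, not taken
    from the thesis.
    """
    kept = []
    for keyword, score in ranked:
        # Compare case-insensitively against every keyword kept so far.
        is_near_duplicate = any(
            SequenceMatcher(None, keyword.lower(), other.lower()).ratio()
            >= threshold
            for other, _ in kept
        )
        if not is_near_duplicate:
            kept.append((keyword, score))
    return kept


# Made-up candidates and scores, ordered best first.
candidates = [
    ("keyword", 0.05),
    ("keywords", 0.07),
    ("extraction", 0.09),
    ("pipeline", 0.20),
]
filtered = dedupe_fuzzy(candidates)
names = [keyword for keyword, _ in filtered]
```

Here "keywords" is discarded because it is nearly identical to the better-ranked "keyword", while "extraction" and "pipeline" are kept. A real pipeline might instead compare lemmas obtained from an NLP pipeline before falling back to string similarity.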

Keywords: Keyword extraction, NLP pipelines

PDF: Master Thesis

Reference: Marlon Weghorn. Implementing PAKET, a Production-Ready AI enhanced Keyword Extractor. Master Thesis. Friedrich-Alexander-Universität Erlangen-Nürnberg: 2022.