NLP paper recommender: text processing, vector embeddings and vector database
In this data science project, I deploy an NLP app that recommends papers based on the similarity between the papers’ abstracts and the user’s interests using text processing, vector embeddings, Pinecone and Streamlit.
This project was motivated by my interest in learning about NLP and LLMs, and its implementation was adapted from Pau Labarta Bajo's blog entry on NLP engineering. The app is relatively simple at its core: it reads a user's prompt about the topic she would like to read about, cleans it with common text processing steps, generates its vector embedding, connects to a Pinecone index containing vector embeddings of abstracts, and finds the abstracts most similar to the user's input.
While the research area is narrow (it is based on my former academic research area, Behavioral Operations), I'd like to emphasize the app's data science components. First, I downloaded data for 282 papers using the Scopus Search API, processed it, and generated vector embeddings from the abstracts. Specifically, I used a TF-IDF vectorizer over unigrams and bigrams with the output dimension set to 128. I saved the fitted vectorizer and pushed the embeddings into a Pinecone index. This first step was done in notebooks. Both the saved vectorizer and the Pinecone index are reused by the deployed app.
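To make the first step more concrete, here is a minimal plain-Python sketch of what a TF-IDF vectorizer over unigrams and bigrams with a capped vocabulary does. It is an illustration only, not the project's code: the function names (tokenize, ngrams, fit_tfidf, transform), the tokenization rule, the smoothed IDF formula and the sample abstracts are all assumptions, and a real project would more likely rely on a library implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens):
    """Return the unigrams followed by the bigrams (joined with a space)."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def fit_tfidf(docs, max_features=128):
    """Learn a capped vocabulary and smoothed IDF weights from the corpus."""
    term_counts = Counter()  # total occurrences of each term in the corpus
    doc_freq = Counter()     # number of documents containing each term
    for doc in docs:
        terms = ngrams(tokenize(doc))
        term_counts.update(terms)
        doc_freq.update(set(terms))
    # Keep the most frequent terms; ties are broken alphabetically.
    top = sorted(term_counts, key=lambda t: (-term_counts[t], t))[:max_features]
    vocab = {term: i for i, term in enumerate(sorted(top))}
    n_docs = len(docs)
    # Smoothed IDF (one common variant, assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf


def transform(doc, vocab, idf):
    """Map one document to an L2-normalized TF-IDF vector."""
    vec = [0.0] * len(vocab)
    for term, count in Counter(ngrams(tokenize(doc))).items():
        if term in vocab:
            vec[vocab[term]] = count * idf[term]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


# Made-up abstracts, for illustration only.
abstracts = [
    "Behavioral operations studies human decisions in supply chains",
    "Supply chains and inventory decisions under uncertainty",
    "Newsvendor experiments in behavioral operations",
]
vocab, idf = fit_tfidf(abstracts, max_features=128)
embeddings = [transform(a, vocab, idf) for a in abstracts]
```

With only three short abstracts the vocabulary stays below the 128-term cap, so the vectors here are shorter than 128; a corpus of a few hundred abstracts would fill the cap. The learned vocab and idf are the pieces that would need to be saved so that user prompts can later be embedded in the same vector space as the abstracts.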
Then, I developed the frontend and the remaining supporting Python scripts. The scripts (i) process the user's prompt using common text processing tasks, (ii) vectorize it using the saved TF-IDF vectorizer from the first step, and (iii) connect to the Pinecone index and query the vector database with the vectorized prompt, using cosine similarity as the matching criterion (the metric chosen when the Pinecone index was created in the first step). The frontend, available as a Streamlit app, lists the top 10 papers with their titles, journals and DOIs.
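The three script steps can be sketched locally as follows. This is a hedged illustration rather than the deployed code: an in-memory list stands in for the Pinecone index, a toy count vectorizer stands in for the saved TF-IDF vectorizer, and the stopword list, vocabulary, function names and paper records are all hypothetical placeholders.

```python
import math
import re

# Hypothetical stopword list and toy vocabulary, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "about",
             "to", "i", "would", "like", "read"}
VOCAB = ["newsvendor", "inventory", "forecasting", "fairness"]


def preprocess(prompt):
    """Step (i): lowercase, tokenize and drop stopwords."""
    tokens = re.findall(r"[a-z]+", prompt.lower())
    return [t for t in tokens if t not in STOPWORDS]


def vectorize(tokens):
    """Step (ii): toy count vectorizer standing in for the saved TF-IDF one."""
    return [float(tokens.count(term)) for term in VOCAB]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, index, k=10):
    """Step (iii): rank the stored records by cosine similarity to the query."""
    scored = [(cosine_similarity(query_vec, rec["values"]), rec["metadata"])
              for rec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# In-memory stand-in for the vector database; all records are placeholders.
index = [
    {"values": [1.0, 1.0, 0.0, 0.0],
     "metadata": {"title": "Paper A", "journal": "Journal X", "doi": "placeholder-doi-a"}},
    {"values": [0.0, 0.0, 1.0, 0.0],
     "metadata": {"title": "Paper B", "journal": "Journal Y", "doi": "placeholder-doi-b"}},
    {"values": [1.0, 0.0, 0.0, 1.0],
     "metadata": {"title": "Paper C", "journal": "Journal Z", "doi": "placeholder-doi-c"}},
]

prompt = "I would like to read about newsvendor inventory decisions"
results = top_k(vectorize(preprocess(prompt)), index, k=2)
for score, meta in results:
    print(f"{score:.2f}  {meta['title']} ({meta['journal']}), DOI: {meta['doi']}")
```

In the deployed app the ranking is done by the vector database rather than in Python, but the idea is the same: the prompt must be embedded with the same vectorizer as the abstracts, and the highest-scoring records supply the titles, journals and DOIs shown to the user.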
Details about the implementation and the results can be found in the project’s repo.