Classifying Topics for Research Outputs Using Non-negative Matrix Factorization

A Machine Learning Project

(5-minute read)



masked-wc


Capturing and describing the textual content of digital materials housed in academic repositories can be challenging. Topic modeling via unsupervised machine learning offers one way to address this challenge. One approach is non-negative matrix factorization (NNMF), which extracts key features from documents via text mining frameworks and groups individual terms or words into clusters that represent specific topics or concepts. In doing so, it uncovers hidden or latent semantic structures in textual documents.
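To make the idea concrete, here is a minimal sketch of how NNMF topic modeling can be done in Python with scikit-learn's TfidfVectorizer and NMF classes. The toy corpus, number of topics, and parameter values below are illustrative placeholders, not the actual settings or data from the project.

```python
# Minimal sketch: NNMF topic modeling with scikit-learn.
# The documents, topic count, and parameters are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

documents = [
    "Neural networks improve image classification accuracy.",
    "Topic models uncover latent themes in large text corpora.",
    "Digital repositories preserve and describe research outputs.",
]

# Convert the raw texts into a TF-IDF document-term matrix.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)

# Factorize the matrix into document-topic and topic-term components.
n_topics = 2
nmf = NMF(n_components=n_topics, random_state=42)
doc_topic = nmf.fit_transform(tfidf)  # document-topic weights
topic_term = nmf.components_          # topic-term weights

# Show the top terms that characterize each latent topic.
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(topic_term):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {', '.join(top_terms)}")
```

Each row of the topic-term matrix is a cluster of weighted words, and the highest-weighted words in a row are what a human reader would label as that topic.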

As the final project for LIBR 599C: Python Programming, one of my favorite courses at SLAIS, I classified over 1,000 research papers using non-negative matrix factorization. The objective was to identify latent topics or themes across the documents. I chose the NNMF technique because it is reported to produce more accurate classification output than other commonly used methods such as Latent Semantic Indexing (LSI) or Latent Dirichlet Allocation (LDA) (Sarkar, 2016; Müller & Guido, 2015). The steps I used to implement the procedure are described in the following sections.