A Machine Learning Project
(5-minute read)

Capturing and describing the textual content of digital materials housed in academic repositories can be challenging. Topic modeling via unsupervised machine learning can offer a solution to such challenges. One way this may be done is through non-negative matrix factorization (NNMF), which extracts key features from documents via text-mining frameworks to generate clusters of individual terms or words that form specific topics or concepts. In so doing, it is able to uncover hidden or latent semantic structures in textual documents.
As the final project for LIBR 599C: Python Programming, one of my favorite courses at SLAIS, I classified over 1,000 research papers using non-negative matrix factorization. The objective was to identify latent topics or themes within the documents. I used the NNMF technique because it is reported to generate more accurate classification output than other commonly used classification methods such as Latent Semantic Indexing (LSI) or Latent Dirichlet Allocation (LDA) (Sarkar, 2016; Muller & Guido, 2015). The steps I used to implement the procedure are described in the sections that follow.

I followed a workflow modified from Raschka (2015) which, as can be seen in the diagram above, involved acquiring the relevant data, then cleaning and preprocessing it to ensure its suitability for analysis. The process also entailed creating a document-term matrix, then applying topic models and finally classifying the generated outputs. The topic modeling was implemented using the following equation:
$$X \approx YZ$$

where $X$ is the document-term matrix and $Y$ and $Z$ are matrix factors that can be multiplied to approximately reconstruct $X$, with all three matrices constrained to contain only non-negative entries. This can further be represented as the minimization problem:

$$\min_{Y \geq 0,\; Z \geq 0} \lVert X - YZ \rVert_F^2$$

and this objective function is incorporated in the NMF module of the scikit-learn package used in the project.
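As a concrete illustration of the factorization, a small toy matrix (not part of the project data) can be decomposed with scikit-learn as follows:

```python
# A minimal sketch of the factorization X ~ YZ with scikit-learn's NMF;
# the matrix names follow the equation above and the values are illustrative only.
import numpy as np
from sklearn.decomposition import NMF

X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [2.0, 1.0, 0.0]])

model = NMF(n_components=2, init='random', random_state=0)
Y = model.fit_transform(X)   # left factor (non-negative)
Z = model.components_        # right factor (non-negative)

print(np.round(Y @ Z, 2))    # approximately reconstructs X
```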
I started the project by downloading the necessary data - over 1,000 information-related research papers - from Web of Science, restricting my search to papers produced by faculty and affiliated members of the University of British Columbia, using the following search query:
TS=(information* AND literacy* OR digital* OR archival*) AND OG=University of British Columbia
The downloaded data - in .csv format - contained over 60 columns featuring author names, paper titles, abstracts, years of publication and similar labels. Given the state it was in, it was necessary to clean it up a bit to make it suitable for the analysis. This I did using the Pandas library, removing all columns save the title and abstract columns.
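In sketch form, the cleaning step looked like this; the file name wos_export.csv and the column labels 'Article Title' and 'Abstract' are assumptions here, since Web of Science export headers vary by format:

```python
# A sketch of the cleaning step. The file name and column labels below are
# assumptions; Web of Science exports name their columns differently
# depending on the export format chosen.
import pandas as pd

df = pd.read_csv('wos_export.csv')
df = df[['Article Title', 'Abstract']]   # keep only the title and abstract columns
df = df.dropna(subset=['Abstract'])      # drop records without an abstract
df.to_csv('papers_clean.csv', index=False)
```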
Having completed the basic data cleaning tasks, I then implemented a series of preprocessing tasks to further prep the dataset for modeling. As an important stage in data mining and machine learning processes, data preprocessing is seen as useful for enhancing the quality of the data and information fed to models that are designed to learn from them (Muller & Guido, 2015). In short, if my model was going to be able to extract relevant topics from the dataset I fed it, then this task needed to be performed. I used the NLTK library to accomplish this, performing tasks such as tokenization, stop-word removal and lemmatization.
I used code along the following lines to perform these tasks, continuing from the cleaned DataFrame above:
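In this sketch, the preprocess function and the tokens column are names assumed for illustration:

```python
# A sketch of the NLTK preprocessing: lowercasing, tokenization,
# stop-word removal and lemmatization. Continues from the df created above.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # English stop-word list
nltk.download('wordnet')    # lemmatizer dictionary

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())                  # lowercase and tokenize
    tokens = [t for t in tokens if t.isalpha()]           # drop punctuation and numbers
    tokens = [t for t in tokens if t not in stop_words]   # remove stop words
    return [lemmatizer.lemmatize(t) for t in tokens]      # reduce words to lemmas

df['tokens'] = df['Abstract'].apply(preprocess)
```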
I also used a word cloud to visualize the tokenized data, generated with code along these lines:
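A sketch assuming the third-party wordcloud package alongside matplotlib, reading the tokens column created above:

```python
# A sketch of the word cloud step, using the wordcloud and matplotlib packages.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join every token from every document into one long string
all_tokens = ' '.join(token for tokens in df['tokens'] for token in tokens)

cloud = WordCloud(width=800, height=400, background_color='white').generate(all_tokens)
plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```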
The generated output can be seen below:

Having completed the preprocessing tasks, I was now set to commence the topic modeling implementation. Topic modeling normally entails representing each word and term within documents as a vector. In Python, this can be accomplished using the scikit-learn library, specifically its TfidfVectorizer module. In this project, I followed normal convention and opted to use approximately 1000 words from the dataset for this task, since exceeding the 1000-word threshold is thought to demand considerably more computing power (Muller & Guido, 2015). Next, I used the sklearn decomposition module, NMF, to create the matrix decomposition. In other words, the document terms generated via the TfidfVectorizer were decomposed into two matrices: a document-topic matrix and a term-topic matrix. By reverse sorting the rows of the term-topic matrix, I obtained the top terms for each topic. I accomplished this using the code below:
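In sketch form, continuing from the preprocessed tokens above; the viewTopic function here is reconstructed from its description, so the original may have differed in detail:

```python
# A sketch of the vectorization and decomposition steps.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = df['tokens'].apply(' '.join)              # rejoin tokens into strings

vectorizer = TfidfVectorizer(max_features=1000)  # cap vocabulary at ~1000 words
dtm = vectorizer.fit_transform(docs)             # document-term matrix (X)

nmf = NMF(n_components=10, random_state=0)
doc_topic = nmf.fit_transform(dtm)               # document-topic matrix (Y)
topic_term = nmf.components_                     # term-topic matrix (Z)

terms = vectorizer.get_feature_names_out()

def viewTopic(topic_term, terms, n_top=10):
    """Print the top terms of each topic by reverse-sorting each row."""
    for i, row in enumerate(topic_term):
        top = row.argsort()[::-1][:n_top]        # indices of the largest weights
        print(f"Topic # {i:02d}: {', '.join(terms[j] for j in top)}")

viewTopic(topic_term, terms)
```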
As can be discerned in the code above, a function, viewTopic, helps visualize the top 10 topics and their corresponding terms. As seen in the table below, Topic # 02 is concerned with model prediction, parameter-based simulations and data prediction.
| Rank | Topic # 00 | Topic # 01 | Topic # 02 | Topic # 03 | Topic # 04 | Topic # 05 | Topic # 06 | Topic # 07 | Topic # 08 | Topic # 09 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | health | conclusion | model | child | behavior | foraging | patient | species | social | memory |
| 1 | intervention | youth | predict | parent | data | rat | disease | nest | group | spatial |
| 2 | participant | experiment | modeling | children | analysis | task | treatment | habitat | negative | access |
| 3 | report | experience | simulation | adhd | result | food | patients | forest | participant | error |
| 4 | woman | expect | parameter | school | provide | animal | clinical | fish | partner | search |
| 5 | physical_activity | expand | base | problem | different | lesion | adherence | marine | preference | recent |
| 6 | group | exist | price | behavior | use | phase | therapy | prey | benefit | performance |
| 7 | program | exhibit | obtain | family | change | injection | years | range | situation | status |
| 8 | include | exercise | use | parental | study | sessions | include | population | people | present |
| 9 | practice | exchange | prediction | factor | method | spatial | trial | movement | behavior | statement |
In order to get a good visualization of the information contained in the generated topic model, I opted to use the Python visualization library pyLDAvis. I used the code below to produce the visualization and to save an .html version for more concise and interactive viewing. The .html version may be found here.
The code I used for the pyLDAvis output can be seen below.
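In sketch form, assuming pyLDAvis's scikit-learn bridge (pyLDAvis.sklearn in releases of that period, renamed in later versions) and the nmf, dtm and vectorizer objects from the modeling step:

```python
# A sketch of the pyLDAvis step. The scikit-learn bridge is designed around
# LDA models, so applying it to an NMF model as here is an assumption about
# how the original visualization was produced.
import pyLDAvis
import pyLDAvis.sklearn

panel = pyLDAvis.sklearn.prepare(nmf, dtm, vectorizer)
pyLDAvis.save_html(panel, 'nmf_topics.html')   # interactive .html version
panel                                          # renders inline in a notebook
```

Here is the visualization: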

I used the generated topics to classify the research output/dataset. In doing this, I used the top 10 topics from the modeling algorithm implemented earlier. Python code along the following lines was used to accomplish this:
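A sketch of the classification step: each paper is assigned the topic carrying the largest weight in its row of the document-topic matrix, and the result is written out as a .csv table (the topic column name is assumed here):

```python
# A sketch of the classification step, continuing from doc_topic and df above.
# Each paper's dominant topic is the column with the largest weight in its
# row of the document-topic matrix.
df['topic'] = doc_topic.argmax(axis=1)

# Save titles, abstracts and assigned topics as a .csv table
df[['Article Title', 'Abstract', 'topic']].to_csv('classified_papers.csv', index=False)
```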
I saved the results of the classification, showing the top 10 generated topics, in a .csv table. The table below shows the results of the classification:

From the table above, we can see that among the generated topics, topic 2 corresponds with the titles and abstracts of the papers it was used to classify. As can be seen, they all deal with the issue of mathematical data modeling. The other papers in the table show a similar pattern. It can thus be concluded that the classification process met its objective.
From this project, I have realized that techniques like topic modeling can be important starting points for uncovering deep insights and patterns buried within datasets. Uncovering such insights and patterns presents endless possibilities for research and practice. Given my deep interest in using machine learning tools and techniques to facilitate my research work, and the fact that my current knowledge of this procedure is still rudimentary, I plan to develop deeper knowledge and understanding of this technique and of similar ones that focus on clustering and grouping text documents and analyzing their similarities.