A Machine Learning Project
(5-minute read)

Capturing and describing the textual content of digital materials housed in academic repositories can be challenging. Topic modeling via unsupervised machine learning can offer a solution to such challenges. One way this may be done is through non-negative matrix factorization (NNMF), which extracts key features from documents via text-mining frameworks to generate clusters of individual terms or words that form specific topics or concepts. In so doing, it is able to uncover hidden or latent semantic structures in textual documents.
As the final project for LIBR 599C: Python Programming, one of my favorite courses at SLAIS, I classified over 1,000 research papers using non-negative matrix factorization. The objective was to identify latent topics or themes within the documents. I used the NNMF technique because it is reported to generate more accurate classification output than other commonly used classification methods such as Latent Semantic Indexing (LSI) or Latent Dirichlet Allocation (LDA) (Sarkar, 2016; Muller & Guido, 2015). The steps I used to implement the procedure are described in the sections that follow.

I followed a workflow modified from Raschka (2015) which, as can be seen in the diagram above, involved acquiring the relevant data, then cleaning and preprocessing it to ensure its suitability for analysis. The process also entailed creating a document-term matrix, then applying topic models and finally classifying the generated outputs. The topic modeling was implemented using the following equation:
$$X \approx YZ$$

where $X$ is the document-term matrix and $Y$ and $Z$ are matrix factors that can be multiplied to approximately reconstruct $X$, with all three matrices constrained to contain only non-negative entries. This can further be represented as the minimization problem:

$$\min_{Y \geq 0,\; Z \geq 0} \lVert X - YZ \rVert_F^2$$

and this objective function is incorporated in the NMF module of the scikit-learn package used in the project.
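As a concrete illustration of the factorization, a small toy matrix (not part of the project data) can be decomposed with scikit-learn as follows:

```python
# A minimal sketch of the factorization X ~ YZ with scikit-learn's NMF;
# the matrix names follow the equation above and the values are illustrative only.
import numpy as np
from sklearn.decomposition import NMF

X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [2.0, 1.0, 0.0]])

model = NMF(n_components=2, init='random', random_state=0)
Y = model.fit_transform(X)   # left factor (non-negative)
Z = model.components_        # right factor (non-negative)

print(np.round(Y @ Z, 2))    # approximately reconstructs X
```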
I started the project by downloading the necessary data - over 1,000 information-related research papers - from Web of Science, restricting my search to papers produced by faculty and affiliated members of the University of British Columbia, using the following search query:
TS=(information* AND literacy* OR digital* OR archival*) AND OG=University of British Columbia
The downloaded data - in .csv format - contained over 60 columns featuring author names, paper titles, abstracts, years of publication and similar labels. Given the state it was in, it was necessary to clean it up a bit to make it suitable for the analysis. This I did using the Pandas library, removing all columns save the title and abstract columns.
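In sketch form, the cleaning step looked like this; the file name wos_export.csv and the column labels 'Article Title' and 'Abstract' are assumptions here, since Web of Science export headers vary by format:

```python
# A sketch of the cleaning step. The file name and column labels below are
# assumptions; Web of Science exports name their columns differently
# depending on the export format chosen.
import pandas as pd

df = pd.read_csv('wos_export.csv')
df = df[['Article Title', 'Abstract']]   # keep only the title and abstract columns
df = df.dropna(subset=['Abstract'])      # drop records without an abstract
df.to_csv('papers_clean.csv', index=False)
```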
Having completed the basic data cleaning tasks, I then implemented a series of preprocessing tasks to further prep the dataset for modeling. As an important stage in data mining and machine learning processes, data preprocessing is seen as useful for enhancing the quality of the data and information fed to models that are designed to learn from them (Muller & Guido, 2015). In short, if my model was going to be able to extract relevant topics from the dataset I fed it, then this task needed to be performed. I used the NLTK library to accomplish this, performing tasks such as tokenization, stop-word removal and lemmatization.
I used code along the following lines to perform these tasks, continuing from the cleaned DataFrame above:
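In this sketch, the preprocess function and the tokens column are names assumed for illustration:

```python
# A sketch of the NLTK preprocessing: lowercasing, tokenization,
# stop-word removal and lemmatization. Continues from the df created above.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # English stop-word list
nltk.download('wordnet')    # lemmatizer dictionary

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())                  # lowercase and tokenize
    tokens = [t for t in tokens if t.isalpha()]           # drop punctuation and numbers
    tokens = [t for t in tokens if t not in stop_words]   # remove stop words
    return [lemmatizer.lemmatize(t) for t in tokens]      # reduce words to lemmas

df['tokens'] = df['Abstract'].apply(preprocess)
```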
I also used a word cloud to visualize the tokenized data, generated with code along these lines:
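A sketch assuming the third-party wordcloud package alongside matplotlib, reading the tokens column created above:

```python
# A sketch of the word cloud step, using the wordcloud and matplotlib packages.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join every token from every document into one long string
all_tokens = ' '.join(token for tokens in df['tokens'] for token in tokens)

cloud = WordCloud(width=800, height=400, background_color='white').generate(all_tokens)
plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```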
The generated output can be seen below:

Having completed the preprocessing tasks, I was now set to commence the topic modeling implementation. Topic modeling normally entails representing each word and term within documents as a vector. In Python, this can be accomplished using the scikit-learn library, specifically its TfidfVectorizer module. In this project, I followed normal convention and opted to use approximately 1000 words from the dataset for this task, since exceeding the 1000-word threshold is thought to demand considerably more computing power (Muller & Guido, 2015). Next, I used the sklearn decomposition module, NMF, to create the matrix decomposition. In other words, the document terms generated via the TfidfVectorizer were decomposed into two matrices: a document-topic matrix and a term-topic matrix. By reverse sorting the rows of the term-topic matrix, I obtained the top terms for each topic. I accomplished this using the code below:
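In sketch form, continuing from the preprocessed tokens above; the viewTopic function here is reconstructed from its description, so the original may have differed in detail:

```python
# A sketch of the vectorization and decomposition steps.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = df['tokens'].apply(' '.join)              # rejoin tokens into strings

vectorizer = TfidfVectorizer(max_features=1000)  # cap vocabulary at ~1000 words
dtm = vectorizer.fit_transform(docs)             # document-term matrix (X)

nmf = NMF(n_components=10, random_state=0)
doc_topic = nmf.fit_transform(dtm)               # document-topic matrix (Y)
topic_term = nmf.components_                     # term-topic matrix (Z)

terms = vectorizer.get_feature_names_out()

def viewTopic(topic_term, terms, n_top=10):
    """Print the top terms of each topic by reverse-sorting each row."""
    for i, row in enumerate(topic_term):
        top = row.argsort()[::-1][:n_top]        # indices of the largest weights
        print(f"Topic # {i:02d}: {', '.join(terms[j] for j in top)}")

viewTopic(topic_term, terms)
```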
As can be discerned in the code above, a function, viewTopic, helps visualize the top 10 topics and their corresponding terms. As seen in the table below, Topic # 02 is concerned with model prediction, parameter-based simulations and data prediction.
| Rank | Topic # 00 | Topic # 01 | Topic # 02 | Topic # 03 | Topic # 04 | Topic # 05 | Topic # 06 | Topic # 07 | Topic # 08 | Topic # 09 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | health | conclusion | model | child | behavior | foraging | patient | species | social | memory |
| 1 | intervention | youth | predict | parent | data | rat | disease | nest | group | spatial |
| 2 | participant | experiment | modeling | children | analysis | task | treatment | habitat | negative | access |
| 3 | report | experience | simulation | adhd | result | food | patients | forest | participant | error |
| 4 | woman | expect | parameter | school | provide | animal | clinical | fish | partner | search |
| 5 | physical_activity | expand | base | problem | different | lesion | adherence | marine | preference | recent |
| 6 | group | exist | price | behavior | use | phase | therapy | prey | benefit | performance |
| 7 | program | exhibit | obtain | family | change | injection | years | range | situation | status |
| 8 | include | exercise | use | parental | study | sessions | include | population | people | present |
| 9 | practice | exchange | prediction | factor | method | spatial | trial | movement | behavior | statement |
In order to get a good visualization of the information contained in the generated topic model, I opted to use the Python visualization library pyLDAvis. I used the code below to produce the visualization and to save an .html version for more concise and interactive viewing. The .html version may be found here.
The code I used for the pyLDAvis output can be seen below.
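In sketch form, assuming pyLDAvis's scikit-learn bridge (pyLDAvis.sklearn in releases of that period, renamed in later versions) and the nmf, dtm and vectorizer objects from the modeling step:

```python
# A sketch of the pyLDAvis step. The scikit-learn bridge is designed around
# LDA models, so applying it to an NMF model as here is an assumption about
# how the original visualization was produced.
import pyLDAvis
import pyLDAvis.sklearn

panel = pyLDAvis.sklearn.prepare(nmf, dtm, vectorizer)
pyLDAvis.save_html(panel, 'nmf_topics.html')   # interactive .html version
panel                                          # renders inline in a notebook
```

Here is the visualization: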

I used the generated topics to classify the research output/dataset. In doing this, I used the top 10 topics from the modeling algorithm implemented earlier. Python code along the following lines was used to accomplish this:
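A sketch of the classification step: each paper is assigned the topic carrying the largest weight in its row of the document-topic matrix, and the result is written out as a .csv table (the topic column name is assumed here):

```python
# A sketch of the classification step, continuing from doc_topic and df above.
# Each paper's dominant topic is the column with the largest weight in its
# row of the document-topic matrix.
df['topic'] = doc_topic.argmax(axis=1)

# Save titles, abstracts and assigned topics as a .csv table
df[['Article Title', 'Abstract', 'topic']].to_csv('classified_papers.csv', index=False)
```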
I saved the results of the classification, showing the top 10 generated topics, in a .csv table. The table below shows the results of the classification:

From the table above, we can see that among the generated topics, topic 2 corresponds with the titles and abstracts of the papers it was used to classify. As can be seen, they all deal with the issue of mathematical data modeling. The other papers in the table show a similar pattern. It can thus be concluded that the classification process met its objective.
From this project, I have realized that techniques like topic modeling can be important starting points for uncovering deep insights and patterns buried within datasets. Uncovering such insights and patterns presents endless possibilities for research and practice. Given my deep interest in using machine learning tools and techniques to facilitate my research work, and the fact that my current knowledge of this procedure is still rudimentary, I plan to develop deeper knowledge and understanding of this technique and of similar ones that focus on clustering and grouping text documents and analyzing their similarities.