Improvement of a Ticketing System using Machine Learning
Master-Praktikum: Machine Learning for Information Systems Students
UCC Munich - Technische Universität München (TUM)
Running
- install poetry via `python -m pip install poetry==1.1.6`
- install dependencies via `poetry install --remove-untracked`
- install the spacy model via `poetry run spacy download de_core_news_md`
- run Jupyter via `poetry run jupyter lab`
- if you want to do text translation as well, you need credentials for the Google Translate API
Preprocessing
- select only the columns we actually need from `tickets.csv`
- match the ticket ID from `tickets.csv` with the corresponding ticket file and read it in
- extract the first message of the conversation from the ticket file
- clean the message of stray control and encoding characters
- filter out rows with NaN values or empty strings from the dataframe
- detect the message language
- translate English messages to German via Google Translate's API
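The cleaning and filtering steps above could look roughly like this with pandas; the column names (`ticket_id`, `first_message`) and the toy data are assumptions for illustration, not the real schema:

```python
import re
import pandas as pd

# Toy stand-in for the merged ticket data; real column names may differ.
df = pd.DataFrame({
    "ticket_id": [2000000001, 2000000002, 2000000003],
    "first_message": ["Hallo,\x00 ich habe mein Passwort vergessen.", "", None],
})

def clean_message(text: str) -> str:
    # Drop control characters, then collapse repeated whitespace.
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Filter out NaN values and empty strings, then clean the remaining messages.
df = df.dropna(subset=["first_message"])
df = df[df["first_message"].str.strip() != ""]
df["first_message"] = df["first_message"].map(clean_message)
```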
Difficulties:
- the data set is polluted with autogenerated tickets without useful text, as well as test tickets, e.g. 2000006141, 2000000276, 2000000277, 2000011001, 2000010033 (maybe show in presentation?) -> we try to filter those out by matching specific substrings and keeping only tickets whose first message is at least 100 characters long
- of the 12184 initial tickets we found about 6000 to be usable, of which 65% were in German and 35% in English
- the tickets being in two languages (German/English) is a problem for the model, because key words used for predictions, e.g. password/Passwort, don't match. We could either treat the two kinds of tickets as distinct datasets and train separate models, or translate the English tickets to German. We went with the latter, mainly because it leaves us more data to train on.
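The filtering of autogenerated and test tickets could be sketched as below; the marker substrings and the helper name are hypothetical, the 100-character minimum is from our pipeline:

```python
# Hypothetical marker substrings; the real ones depend on the ticket system.
AUTOGENERATED_MARKERS = ["automatisch generiert", "test ticket"]
MIN_MESSAGE_LENGTH = 100

def is_usable(message: str) -> bool:
    """Keep only tickets whose first message looks like real customer text."""
    text = message.lower()
    # Drop tickets matching a known autogenerated/test pattern ...
    if any(marker in text for marker in AUTOGENERATED_MARKERS):
        return False
    # ... and tickets whose first message is too short to be informative.
    return len(message) >= MIN_MESSAGE_LENGTH
```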
ML Model
- we want to leverage the text of the first message that the customer sends to support in order to minimize human effort
- we found two applications: predict operator (Bearbeiter) and category (Kategorie ID)
- Omar/Simon said the operator is a good feature to predict, as it's determined after the fact and thus reliably correct. It's a difficult problem though, as there are 24 different operators (originally 34, but we filtered out 10 that had essentially no tickets). Our model achieves an accuracy of 26.4% and a top-3 accuracy of 44.7% on the test set
- according to Omar/Simon, the category is chosen by the user, and is thus less helpful to predict. Our model achieves an accuracy of 43.1% and a top-3 accuracy of 69.1% on the test set, even though there are 21 different categories. This shows, nonetheless, that our model could be used to predict any predefined categories or support teams (where the operators underneath may change but the support team name stays the same), which could help considerably with sorting tickets and assigning the correct support team or operator
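The top-3 accuracy above counts a prediction as correct if the true class is among the three classes with the highest predicted probability. A minimal sketch of this metric, assuming the probability matrix comes from something like `predict_proba` with columns ordered by class index:

```python
import numpy as np

def top_k_accuracy(proba: np.ndarray, y_true: np.ndarray, k: int = 3) -> float:
    """Fraction of samples whose true class index is among the k classes
    with the highest predicted probability."""
    top_k = np.argsort(proba, axis=1)[:, -k:]        # indices of the k largest probabilities
    hits = (top_k == y_true[:, None]).any(axis=1)    # is the true label among them?
    return float(hits.mean())

# Toy example: 3 samples, 4 classes (probabilities chosen without ties).
proba = np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.40, 0.30, 0.20, 0.10],
    [0.10, 0.50, 0.15, 0.25],
])
y_true = np.array([0, 0, 2])
```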
Model Structure:
- first we convert the text into a vector of token counts via CountVectorizer – this removes some "stop words" like und or die and lowercases everything in the process
- then we apply tf-idf (term frequency times inverse document frequency) to these counts via sklearn's TfidfTransformer – with this we identify the important words that set a ticket apart from the others
- finally, we use the Naive Bayes classifier MultinomialNB to classify the input into discrete classes (the operator or the ticket category)
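The three stages chain together naturally as an sklearn Pipeline. A minimal sketch: the training texts, team labels, and the tiny stop-word list are made up for illustration (in the project the stop words come from spacy and the labels are the real operators/categories):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative stop-word subset; the project uses spacy's full German list.
model = Pipeline([
    ("counts", CountVectorizer(lowercase=True, stop_words=["und", "die", "das", "für"])),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Hypothetical first messages and target labels.
texts = [
    "ich habe mein passwort vergessen",
    "passwort zurücksetzen bitte",
    "drucker funktioniert nicht",
    "der drucker druckt nicht mehr",
]
labels = ["accounts-team", "accounts-team", "hardware-team", "hardware-team"]
model.fit(texts, labels)
```

After fitting, `model.predict(["passwort vergessen"])` routes the ticket to the class whose vocabulary it matches best.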
Difficulties:
- operators change often (see plots in analysis) -> see improvements
- how do we know which words in the ticket text are important? Commonly used words such as und, die, für, das, … are useless for predictions -> we filter out stop words via spacy and apply tf-idf
Improvements
- a big problem we encountered is that operators change often (show plots in analysis)
- currently we only use the ticket text for predictions, but ultimately that is unfair to the model: e.g. one year a specific operator is assigned tickets of a certain type, and later the same kind of ticket would be assigned to a replacement operator
- we would thus need to include the time component in the input, which would probably increase model performance significantly
- in production we would retrain the model on current data continuously, a process called online learning
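One way to sketch such continuous retraining: MultinomialNB supports incremental updates via `partial_fit`, and a stateless HashingVectorizer avoids refitting a vocabulary on every batch. The batches and operator labels below are hypothetical:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: new ticket batches need no vocabulary refit.
# alternate_sign=False keeps counts non-negative, as MultinomialNB requires.
vectorizer = HashingVectorizer(n_features=2**16, alternate_sign=False)
clf = MultinomialNB()
classes = ["operator_a", "operator_b"]   # hypothetical operator labels

# Simulated stream of ticket batches arriving over time.
batches = [
    (["passwort vergessen", "drucker defekt"], ["operator_a", "operator_b"]),
    (["passwort zurücksetzen", "drucker druckt nicht"], ["operator_a", "operator_b"]),
]
for texts, labels in batches:
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=classes)  # incremental update per batch
```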