Improvement of a Ticketing System using Machine Learning
Master-Praktikum: Machine Learning for Information Systems Students
UCC Munich - Technische Universität München (TUM)
Running
- install poetry via `python -m pip install poetry==1.1.6`
- install dependencies via `poetry install --remove-untracked`
- install the spacy model via `poetry run spacy download de_core_news_md`
- run Jupyter via `poetry run jupyter lab`
- if you want to do text translation as well, you need credentials for the Google Translate API
Preprocessing
- select only the columns we actually need from `tickets.csv`
- match the ticket ID from `tickets.csv` with the corresponding ticket file and read it in
- extract the first message of the conversation from the ticket file
- clean the message of stray control and encoding characters
- filter out rows with NaN values or empty strings from the dataframe
- detect the message language
- translate English messages to German via Google Translate's API
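The cleaning and filtering steps above could look roughly like this with pandas; the column names (`ticket_id`, `first_message`) and the toy data are assumptions for illustration, not the real schema:

```python
import re
import pandas as pd

# Toy stand-in for the merged ticket data; real column names may differ.
df = pd.DataFrame({
    "ticket_id": [2000000001, 2000000002, 2000000003],
    "first_message": ["Hallo,\x00 ich habe mein Passwort vergessen.", "", None],
})

def clean_message(text: str) -> str:
    # Drop control characters, then collapse repeated whitespace.
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Filter out NaN values and empty strings, then clean the remaining messages.
df = df.dropna(subset=["first_message"])
df = df[df["first_message"].str.strip() != ""]
df["first_message"] = df["first_message"].map(clean_message)
```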
Difficulties:
- the data set is polluted with autogenerated tickets without useful text, as well as test tickets, e.g. 2000006141, 2000000276, 2000000277, 2000011001, 2000010033 (maybe show in presentation?) -> we try to filter those out by matching specific substrings and keeping only tickets whose first message is at least 100 characters long
- of the 12184 initial tickets we found about 6000 to be usable, of which 65% were in German and 35% in English
- the tickets being in two languages (German/English) is a problem for the model, because key words used for predictions, e.g. password/Passwort, don't match. We could either treat the two kinds of tickets as distinct datasets and train separate models, or translate the English tickets to German. We went with the latter, mainly because it leaves us more data to train on.
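The filtering of autogenerated and test tickets could be sketched as below; the marker substrings and the helper name are hypothetical, the 100-character minimum is from our pipeline:

```python
# Hypothetical marker substrings; the real ones depend on the ticket system.
AUTOGENERATED_MARKERS = ["automatisch generiert", "test ticket"]
MIN_MESSAGE_LENGTH = 100

def is_usable(message: str) -> bool:
    """Keep only tickets whose first message looks like real customer text."""
    text = message.lower()
    # Drop tickets matching a known autogenerated/test pattern ...
    if any(marker in text for marker in AUTOGENERATED_MARKERS):
        return False
    # ... and tickets whose first message is too short to be informative.
    return len(message) >= MIN_MESSAGE_LENGTH
```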
ML Model
- we want to leverage the text of the first message that the customer sends to support in order to minimize human effort
- we found two applications: predict operator (Bearbeiter) and category (Kategorie ID)
- Omar/Simon said the operator is a good feature to predict, as it's determined after the fact and thus reliably correct. It's a difficult problem though, as there are 24 different operators (originally 34, but we filtered out 10 that had essentially no tickets). Our model achieves an accuracy of 26.4% and a top-3 accuracy of 44.7% on the test set
- according to Omar/Simon, the category is chosen by the user, and is thus less helpful to predict. Our model achieves an accuracy of 43.1% and a top-3 accuracy of 69.1% on the test set, even though there are 21 different categories. This shows, nonetheless, that our model could be used to predict any predefined categories or support teams (where the operators underneath may change but the support team name stays the same), which could help considerably with sorting tickets and assigning the correct support team or operator
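The top-3 accuracy above counts a prediction as correct if the true class is among the three classes with the highest predicted probability. A minimal sketch of this metric, assuming the probability matrix comes from something like `predict_proba` with columns ordered by class index:

```python
import numpy as np

def top_k_accuracy(proba: np.ndarray, y_true: np.ndarray, k: int = 3) -> float:
    """Fraction of samples whose true class index is among the k classes
    with the highest predicted probability."""
    top_k = np.argsort(proba, axis=1)[:, -k:]        # indices of the k largest probabilities
    hits = (top_k == y_true[:, None]).any(axis=1)    # is the true label among them?
    return float(hits.mean())

# Toy example: 3 samples, 4 classes (probabilities chosen without ties).
proba = np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.40, 0.30, 0.20, 0.10],
    [0.10, 0.50, 0.15, 0.25],
])
y_true = np.array([0, 0, 2])
```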
Model Structure:
- first we convert the text into a vector of token counts via CountVectorizer – this removes some "stop words" like und or die and lowercases everything in the process
- then we apply tf-idf (term frequency times inverse document frequency) to these counts via sklearn's TfidfTransformer – with this we identify the important words that set a ticket apart from the others
- finally, we use the Naive Bayes classifier MultinomialNB to classify the input into discrete classes (the operator or the ticket category)
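The three stages chain together naturally as an sklearn Pipeline. A minimal sketch: the training texts, team labels, and the tiny stop-word list are made up for illustration (in the project the stop words come from spacy and the labels are the real operators/categories):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative stop-word subset; the project uses spacy's full German list.
model = Pipeline([
    ("counts", CountVectorizer(lowercase=True, stop_words=["und", "die", "das", "für"])),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Hypothetical first messages and target labels.
texts = [
    "ich habe mein passwort vergessen",
    "passwort zurücksetzen bitte",
    "drucker funktioniert nicht",
    "der drucker druckt nicht mehr",
]
labels = ["accounts-team", "accounts-team", "hardware-team", "hardware-team"]
model.fit(texts, labels)
```

After fitting, `model.predict(["passwort vergessen"])` routes the ticket to the class whose vocabulary it matches best.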
Difficulties:
- operators change often (see plots in analysis) -> see improvements
- how do we know which words in the ticket text are important? Commonly used words such as und, die, für, das, … are useless for predictions -> we filter out stop words via spacy and apply tf-idf
Improvements
- a big problem we encountered is that operators change often (show plots in analysis)
- currently we only use the ticket text for predictions, but ultimately that is unfair to the model: e.g. one year a specific operator is assigned tickets of a certain type, and later the same kind of ticket would be assigned to a replacement operator
- we would thus need to include the time component in the input, which would probably increase model performance significantly
- in production we would retrain the model on current data continuously, a process called online learning
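One way to sketch such continuous retraining: MultinomialNB supports incremental updates via `partial_fit`, and a stateless HashingVectorizer avoids refitting a vocabulary on every batch. The batches and operator labels below are hypothetical:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: new ticket batches need no vocabulary refit.
# alternate_sign=False keeps counts non-negative, as MultinomialNB requires.
vectorizer = HashingVectorizer(n_features=2**16, alternate_sign=False)
clf = MultinomialNB()
classes = ["operator_a", "operator_b"]   # hypothetical operator labels

# Simulated stream of ticket batches arriving over time.
batches = [
    (["passwort vergessen", "drucker defekt"], ["operator_a", "operator_b"]),
    (["passwort zurücksetzen", "drucker druckt nicht"], ["operator_a", "operator_b"]),
]
for texts, labels in batches:
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=classes)  # incremental update per batch
```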