I recently started working as a data scientist and have been assigned a project. The team wants to build a model that predicts the number of incident tickets for each month. What should I take into consideration before I start building the model? Also, are there any study materials or case studies that anyone would recommend?
Maybe start with the basics: number of calendar days in the month, number of business days in the month, number of tickets in the prior month.
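As a rough illustration of those basic inputs, here is a minimal Python sketch using only the standard library. The function names, the optional holiday set, and the ticket counts are all made up for the example; business days are assumed to be Monday through Friday.

```python
import calendar
from datetime import date


def calendar_days(year, month):
    """Number of calendar days in the given month."""
    return calendar.monthrange(year, month)[1]


def business_days(year, month, holidays=()):
    """Count Monday-Friday days in the month, skipping any listed holidays."""
    total = 0
    for day in range(1, calendar_days(year, month) + 1):
        d = date(year, month, day)
        if d.weekday() < 5 and d not in holidays:
            total += 1
    return total


def build_features(monthly_tickets):
    """monthly_tickets: list of ((year, month), ticket_count) in time order.

    Returns one row per month, starting with the second month, because the
    first month has no prior-month count available.
    """
    rows = []
    for i in range(1, len(monthly_tickets)):
        (year, month), count = monthly_tickets[i]
        rows.append({
            "month": (year, month),
            "calendar_days": calendar_days(year, month),
            "business_days": business_days(year, month),
            "prior_month_tickets": monthly_tickets[i - 1][1],
            "tickets": count,
        })
    return rows


# Made-up counts, for illustration only.
history = [((2024, 1), 410), ((2024, 2), 385), ((2024, 3), 402)]
features = build_features(history)
```

Each row in `features` could then be used as one observation in whatever modeling tool you have, with `tickets` as the value to predict.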
Much likely depends on the nature of these tickets. Check for seasonal effects if applicable to your business or the mix of businesses served (e.g., a retail business may be busier in Q4, a tax firm might be busier around tax season, a school might be busier in the final weeks of the term). Cross-reference with software release schedules if that applies to your business (if you're supporting a product, maybe tickets spike when updates are released or are muted when updates are held back). Cross-reference with the volume of clients, employees, students, or whoever it is that submits IT tickets, since submissions are lower when there are fewer people to submit them and vice versa. Look at historical submissions, and ask subject matter experts if they can explain any spikes or dips.
Sounds interesting--good luck!
Thank you, Justin, for the really good guidance.
I have created a simple model in Excel using the number of calendar days in the month, the number of business days in the month, and the number of tickets in the prior month.
Would I be able to use location as one of the parameters for the model? If so, how should I convert the strings to integers?
I am working on application incident data. A number of patches are applied every month, and as far as I can see there is definitely a spike in incidents during those days.
Interesting--wonder if you could get some kind of patch calendar or patch release schedule to help predict the spikes.
I suppose location could be tried, but I would wonder if location was serving more as a proxy for the real driver (e.g., number of people at one location being more than at another, hours worked by people at one location being more than people at another site, one location being open more days a week than another), and I would want to confirm that whatever it was about location that drove IT tickets in the past is something we would expect to continue to drive IT tickets in the future.
Never hurts to take a look, though. If the locations are just categories like "site 1", "site 2", and so forth, the easiest thing might be to code them as dummy variables, meaning you have an "is site 1" bit, an "is site 2" bit, and so on for each of the N categories (or perhaps N-1, depending on how you implement it, since the last category is implied when all the other bits are 0). Search for "modeling categorical data" for some ideas.
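To make the dummy-variable idea concrete, here is a small Python sketch. The helper `dummy_code` and the site labels are hypothetical, and it is only one of several ways to do this; most statistical packages have a built-in equivalent.

```python
def dummy_code(values, drop_first=True):
    """Turn a list of category labels into 0/1 indicator columns.

    With drop_first=True the first category (in sorted order) becomes the
    reference level and gets no column, giving N-1 columns for N categories.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    columns = ["is_" + level.replace(" ", "_") for level in kept]
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return columns, rows


# Hypothetical site labels, for illustration only.
sites = ["site 1", "site 2", "site 3", "site 2"]
columns, rows = dummy_code(sites)
```

Here "site 1" is the reference level, so a row of all zeros means "site 1". Because the levels are discovered from the data, the same function works unchanged whether there are 3 sites or several hundred.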
Yes, I was able to get the patching calendar and the release calendar as well. I was able to filter how many tickets were opened during those periods, but patching goes on for almost a week, so how can I tell which tickets are coming in because of patching and which are regular tickets? Or should I not worry about separating them?
Regarding location, I thought about using it as one of the parameters as well, but we get tickets from about 700 locations, so I am not sure how to approach the dummy variables. Is there any documentation or blog that you can recommend for this?
I also thought about doing text analysis, which could show me the keywords used most often when tickets are created.
If you have 700 categories, that could indeed be tedious to code the N-1 dummy variables manually. The good news is some software can take care of this for you automatically. For example in R, you might use factor() to handle the categories for you. The thing to search for is "modeling categorical data in 'X'", where 'X' is whatever software you have (R, Stata, etc.).
If you do have a way to identify tickets that are a consequence of planned patching or releases and those that are not, then you might apply a dummy variable "TicketDueToPatch" or "TicketDueToSoftwareRelease" to better isolate the volume due to the patches from the baseline volume. Then in your prediction, say you have a month with no patches (no patches in the last month of each quarter say), the model might predict lower for that month and higher for the other months that have patches, or even higher if a month has many patches.
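One possible way to approximate that split, sketched in Python: flag each ticket by whether it was opened inside a patch window from the patch calendar, then count flagged and unflagged tickets per month. This is an assumption-laden proxy, since a ticket opened during a patch week is not necessarily caused by the patch. The function names, patch window, and ticket dates below are invented for the example.

```python
from collections import defaultdict
from datetime import date


def in_patch_window(opened, patch_windows):
    """True if the ticket date falls inside any (start, end) patch window."""
    return any(start <= opened <= end for start, end in patch_windows)


def monthly_split(ticket_dates, patch_windows):
    """Count tickets per month, split by the opened-during-patch flag."""
    counts = defaultdict(lambda: {"during_patch": 0, "baseline": 0})
    for opened in ticket_dates:
        key = (opened.year, opened.month)
        if in_patch_window(opened, patch_windows):
            counts[key]["during_patch"] += 1
        else:
            counts[key]["baseline"] += 1
    return dict(counts)


# Hypothetical patch calendar and ticket dates, for illustration only.
patch_windows = [(date(2024, 1, 8), date(2024, 1, 14))]
ticket_dates = [
    date(2024, 1, 3), date(2024, 1, 9), date(2024, 1, 10),
    date(2024, 1, 20), date(2024, 2, 5),
]
split = monthly_split(ticket_dates, patch_windows)
```

The per-month `baseline` count gives a rough view of volume outside patch windows, and the number of patch windows in a month could itself be tried as a predictor.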
That could be interesting to do a text analysis on the tickets. Sounds like a fun project.
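A very simple starting point for that kind of text analysis is a keyword frequency count. The sketch below is Python with the standard library only; the stop-word list is a tiny hand-picked assumption and the ticket descriptions are made up.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "an", "to", "is", "on", "in", "of", "and",
              "after", "not", "for"}


def top_keywords(descriptions, n=5):
    """Return the n most common non-stop-word tokens across ticket texts."""
    counts = Counter()
    for text in descriptions:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Made-up ticket descriptions, for illustration only.
tickets = [
    "Login page not loading after patch",
    "Application login fails",
    "Report export is slow after patch",
]
keywords = top_keywords(tickets, n=2)
```

In this toy data "login" and "patch" each appear twice, so they come out on top. Keyword counts per month could then be compared against the patch calendar to see which terms rise during patch weeks.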
Justin, thanks for all of the above.
Would you be able to suggest any resources that include some code for this type of project?
And which libraries would I need to use?
One library or language is often as good as another for many purposes. Just make sure you have some sort of statistical software, for example R, Stata, IDL, or SAS. If you have Excel, make sure to enable the Analysis ToolPak.
It would be good to read up on the business side of this. Search for "workforce planning" or "workforce management" for call centers.
I would suggest building a good understanding of the domain and developing a quality repository of observations about how the domain behaves. This will help you do quality feature selection and identify potential drivers.