I recently started working as a data scientist and have been assigned a project. The team wants a prediction model that can predict the number of incident tickets for each month. What are some of the things I should take into consideration before I start building the model? Also, are there any study materials or case studies that anybody can recommend?
Maybe start with the basics: number of calendar days in the month, number of business days in the month, number of tickets in the prior month.
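As a rough sketch of those three basics in Python (standard library only): the function names, the dictionary keys, and the sample counts are made up for illustration, and business days here simply means Monday to Friday with holidays ignored, which may not match your organization's calendar.

```python
import calendar
from datetime import date


def month_features(year, month):
    """Return (calendar_days, business_days) for one month.

    Assumption: business days are Monday-Friday; holidays are not excluded.
    """
    _, n_days = calendar.monthrange(year, month)
    business_days = sum(
        1 for d in range(1, n_days + 1) if date(year, month, d).weekday() < 5
    )
    return n_days, business_days


def build_rows(monthly_counts):
    """Build one feature row per month from chronological ((year, month), tickets) pairs.

    The first month is skipped because it has no prior month to look back on.
    """
    rows = []
    for i in range(1, len(monthly_counts)):
        (year, month), tickets = monthly_counts[i]
        cal_days, bus_days = month_features(year, month)
        rows.append({
            "year": year,
            "month": month,
            "calendar_days": cal_days,
            "business_days": bus_days,
            "prior_month_tickets": monthly_counts[i - 1][1],
            "tickets": tickets,  # the value the model would try to predict
        })
    return rows


# Hypothetical ticket counts, for illustration only.
history = [((2024, 1), 120), ((2024, 2), 150), ((2024, 3), 140)]
for row in build_rows(history):
    print(row)
```

Each row could then go into whatever model you try first, even a plain linear regression.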
Likely much depends on the nature of these tickets. Check for seasonal effects if applicable to your business or the mix of businesses served (e.g., a retail business may be busier in Q4, a tax firm might be busier around tax season, a school might be busier in the final weeks of the term). Cross reference with software release schedules if that is applicable to your business (if you're supporting a product, maybe tickets spike when updates are released or are muted when updates are held back). Cross reference with volume of clients or employees or students or whoever it is that submits IT tickets as submissions are lower if there are fewer clients to submit them and vice versa. Look at historical submissions, ask subject matter experts if they can explain any spikes or dips.
Sounds interesting--good luck!
Thank you Justin for providing really good guidance.
I have created a simple model in Excel using the number of calendar days in the month, the number of business days in the month, and the number of tickets in the prior month.
Would I be able to use location as one of the parameters for the model? If so, how should I convert the string to an int?
I am working on application incident data. A number of patches are applied every month, and as far as I can see there is definitely a spike in incidents during those days.
Interesting--wonder if you could get some kind of patch calendar or patch release schedule to help predict the spikes.
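If you do get such a calendar, one possible way to use it is as a "patch days in the month" feature. Below is a minimal sketch, assuming the calendar can be reduced to a list of (start date, end date) windows; the function name and the sample window are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def patch_days_per_month(windows):
    """Count distinct patch days falling in each (year, month).

    windows: iterable of (start_date, end_date) pairs, both ends inclusive.
    Days covered by overlapping windows are counted once.
    """
    patch_days = set()
    for start, end in windows:
        day = start
        while day <= end:
            patch_days.add(day)
            day += timedelta(days=1)
    return Counter((d.year, d.month) for d in patch_days)


# Hypothetical one-week patch window that spans a month boundary.
windows = [(date(2024, 1, 29), date(2024, 2, 4))]
print(patch_days_per_month(windows))
```

The per-month count could sit next to calendar days and business days as another input. Whether it helps is something to check against the historical data.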
I suppose location could be tried, but I would wonder if location was serving more as a proxy for the real driver (e.g., number of people at one location being more than at another, hours worked by people at one location being more than people at another site, one location being open more days a week than another), and I would want to confirm that whatever it was about location that drove IT tickets in the past is something we would expect to continue to drive IT tickets in the future.
Never hurts to take a look, though. If the locations are just categories like "site 1", "site 2", and so forth, the easiest thing might be to code as dummy variables, meaning you have a "is site 1" bit, "is site 2" bit, and so on for each of the N categories (or perhaps N-1 depending on how you implement). Search for "modeling categorical data" for some ideas.
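To make the dummy-variable idea concrete, here is a small hand-rolled sketch in plain Python. The one_hot helper and its drop_first flag are written just for this illustration (pandas offers get_dummies for the same job if you have it available), and the site names are placeholders.

```python
def one_hot(values, drop_first=False):
    """Encode category strings as 0/1 indicator columns.

    Returns (categories, rows), where rows[i][j] is 1 if values[i] equals
    categories[j]. With drop_first=True the first category (in sorted order)
    is left out, giving N-1 columns; that category is then represented by a
    row of all zeros.
    """
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Placeholder site names, for illustration only.
sites = ["site 2", "site 1", "site 3", "site 1"]

cats, encoded = one_hot(sites)
print(cats)     # column order
print(encoded)  # one row of 0/1 flags per ticket record

cats_n1, encoded_n1 = one_hot(sites, drop_first=True)
print(cats_n1)
print(encoded_n1)
```

The N-1 version is the usual choice for regression models with an intercept, since the dropped category becomes the baseline the other sites are compared against.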
Yes, I was able to get the patching calendar and the release calendar as well. I was able to filter out how many tickets were opened during those periods, but the patching goes on for almost a week, so how can I tell which tickets are coming in because of patching and which ones are regular tickets? Or should I not worry about separating them?
Regarding location, I thought about using it as one of the parameters as well, but we get tickets from about 700 locations, so I am not sure how to approach dummy variables. Is there any documentation or blog that you can recommend for this?
I also thought about doing text analysis, which could give me the keywords that are used most often when tickets are created.