Kuldeep Jiwani's Videos (Data Science Central)
Feed updated 2021-11-28 | https://www.datasciencecentral.com/video/video/listForContributor?screenName=2b7mv2c69k8h0&rss=yes&xn_auth=no

ODSC APAC 2020: Non-Parametric PDF Estimation for Advanced Anomaly Detection
Posted 2020-12-13 by Kuldeep Jiwani (https://www.datasciencecentral.com/profile/KuldeepJiwani)
Video: https://www.datasciencecentral.com/video/odsc-apac-2020-non-parametric-pdf-estimation-for-advanced-anomaly

Anomaly detection has been one of the most sought-after analytical solutions for businesses operating in domains such as network operations, service operations and manufacturing, and in many other sectors where continuity of operations is essential. Any degradation in operational service, or an outage, implies high losses and possible customer churn. The data in such real-world applications is generally noisy, has complex patterns and is often correlated.

There are techniques such as auto-encoders for modelling complex patterns, but they cannot explain the cause of an anomaly in the original feature space. Traditional univariate anomaly detection techniques use z-score and p-value methods; these rely on unimodality and on choosing the correct parametric form, and when those assumptions are not satisfied they produce a high number of false positives and false negatives.

This is where the need arises to estimate a PDF (Probability Density Function) without assuming a parametric form in advance, i.e. a non-parametric approach. The PDF needs to be modelled as closely to the true distribution as possible; that is, it should have low bias and low variance, avoiding both over-smoothing and under-smoothing. Only then do we have a good chance of identifying true anomalies.

Approaches like KDE (Kernel Density Estimation) enable such non-parametric estimation. Research shows that the choice of kernel matters less than the choice of bandwidth for a good PDF estimate, and the default bandwidth selection technique used in both the Python and R packages over-smooths the PDF, making it unsuitable for anomaly detection.

We will explain another method, which obtains an appropriate bandwidth by optimising a cost function based on modelling the Gaussian kernel via the FFT (Fast Fourier Transform). We will then show how to apply it to anomaly detection even when the data is multi-modal (has multiple peaks) and the distribution can take any shape.

Based on the research paper "Optimal Kernel Density Estimation using FFT based cost function", presented at ICDM 2020, New York.
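As a rough illustration of the idea behind this talk (not the speaker's method), the Python sketch below fits a non-parametric PDF with KDE and flags low-density points as anomalies. The talk's FFT-optimised bandwidth is not reproduced; SciPy's default rule (Scott's rule), which the talk notes tends to over-smooth, stands in for it. The synthetic two-regime data and the 1% threshold quantile are assumptions made purely for illustration.

```python
# Minimal sketch of density-based univariate anomaly detection with KDE.
# Assumed stand-ins: scipy's default bandwidth (Scott's rule), a synthetic
# two-regime metric, and a 1st-percentile density threshold.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Multi-modal training data: two normal operating regimes of a metric.
train = np.concatenate([rng.normal(10, 1, 5000), rng.normal(25, 2, 5000)])

kde = gaussian_kde(train)  # non-parametric PDF estimate

# Flag points whose estimated density falls below the 1st percentile of
# the training densities (low density => unlikely under the learned PDF).
threshold = np.quantile(kde(train), 0.01)

new_points = np.array([10.5, 17.0, 40.0])
is_anomaly = kde(new_points) < threshold
print(dict(zip(new_points, is_anomaly)))  # expect 17.0 and 40.0 flagged
```

Density-based flagging handles multi-modal data naturally: a point lying between two modes (here 17.0) can be anomalous even though it is close to the overall mean, which a single z-score would miss.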
ICDM 2020: Optimal Kernel Density Estimation using FFT based cost function
Posted 2020-08-22 by Kuldeep Jiwani (https://www.datasciencecentral.com/profile/KuldeepJiwani)
Video: https://www.datasciencecentral.com/video/icdm-2020-optimal-kernel-density-estimation-using-fft-based-cost
The full research paper is available in the journal: https://lnkd.in/ghCZFMp

Abstract: Kernel density estimation (KDE) is an important method in nonparametric learning, but it is highly sensitive to the bandwidth parameter. Existing techniques tend to under-smooth or over-smooth the density estimate, especially when the data is noisy, which is a common trait of real-world data sources. This paper proposes a fully data-driven approach that avoids both under-smoothing and over-smoothing: a cost function achieves the optimal bandwidth by evaluating a weighted error metric, where the weight function ensures low bias and low variance during learning. The estimator uses the computationally efficient Fast Fourier Transform (FFT) to evaluate the univariate Gaussian kernel density, bringing the cost of a single density evaluation from O(n^2) down to O(m log m), where m << n and m is the number of FFT grid points. In simulation results the method significantly outperforms the de facto classical methods and more recent proposals on a standard benchmark dataset, and it stands apart from both the recent and the classical approaches especially when the data contains significant noise.
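A minimal sketch of the FFT mechanism the abstract refers to, not the paper's code: bin the n samples onto an m-point grid and smooth the counts with a discretised Gaussian kernel via FFT-based convolution, so a density evaluation costs O(m log m) rather than O(n^2). The paper's weighted-error cost function for choosing the bandwidth h is not reproduced; h, the grid size m and the 4h padding are illustrative assumptions.

```python
# FFT-accelerated univariate Gaussian KDE on an m-point grid (a sketch;
# the bandwidth h is a free parameter here, not the paper's optimum).
import numpy as np
from scipy.signal import fftconvolve

def fft_kde(samples, h, m=1024):
    """Evaluate a Gaussian KDE with bandwidth h on an m-point grid."""
    lo, hi = samples.min() - 4 * h, samples.max() + 4 * h  # pad the range
    counts, edges = np.histogram(samples, bins=m, range=(lo, hi))
    grid = 0.5 * (edges[:-1] + edges[1:])  # bin centres
    dx = grid[1] - grid[0]

    # Discretised Gaussian kernel, centred on the grid and normalised so
    # the convolution preserves total count mass.
    kernel = np.exp(-0.5 * ((grid - grid[(m - 1) // 2]) / h) ** 2)
    kernel /= kernel.sum()

    # FFT convolution smooths the m bin counts in O(m log m).
    density = fftconvolve(counts, kernel, mode="same") / (len(samples) * dx)
    return grid, density

rng = np.random.default_rng(1)
grid, pdf = fft_kde(rng.normal(size=100_000), h=0.1)
print(pdf.sum() * (grid[1] - grid[0]))  # ~1.0: a valid density
```

The saving comes from replacing n-by-n pairwise kernel evaluations with one histogram pass over the data plus a fixed-size convolution on the grid, which is what makes repeated evaluation inside a bandwidth-optimisation loop affordable.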
Sessionisation via stochastic periods for root event identification by Kuldeep Jiwani #ODSC_India
Posted 2020-08-22 by Kuldeep Jiwani (https://www.datasciencecentral.com/profile/KuldeepJiwani)
Video: https://www.datasciencecentral.com/video/sessionisation-via-stochastic-periods-for-root-event

In today's world the majority of information is generated by self-sustaining systems such as bots, crawlers, servers and various online services. This information flows along the axis of time and is generated by these actors under some complex logic: a stream of buy/sell order requests from an order gateway in the financial world, a stream of web requests from a monitoring or crawling service, or perhaps a hacker's bot sitting on the internet and attacking various computers. Although we may not know the motive or intention behind these data sources, unsupervised techniques let us infer patterns and correlate events based on their repeated occurrence over time, and thereby automatically identify the signatures of various actors and take appropriate action.

Sessionisation is one such unsupervised technique: it tries to find the signal in a stream of timestamped events. In an ideal world this would reduce to finding the periods of a mixture of sinusoidal waves. In the real world it is a far more complex task, as even the systematic events generated by machines over the internet behave erratically, so the notion of a signal's period also changes. We can no longer associate a period with a single number; it has to be treated as a random variable, with an expected value and an associated variance. Hence we need to model "stochastic periods" and learn their probability distributions in an unsupervised manner.

In this talk we will walk through real security use cases solved via sessionisation for the SOC (Security Operations Centre) of an international firm with offices in 56 countries, monitored by a central SOC team. The journey begins by extracting relevant data from a sequence of timestamped events; we then apply techniques such as the FFT (Fast Fourier Transform), kernel density estimation, optimal signal selection and Gaussian Mixture Models to discover patterns in the timestamped events.

Key concepts explained in the talk: sessionisation, Bayesian machine learning techniques, Gaussian Mixture Models, kernel density estimation, FFT, stochastic periods, probabilistic modelling.
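One ingredient of that pipeline can be sketched in Python under stated assumptions (a synthetic two-actor event stream and three mixture components; the talk's full FFT, KDE and optimal-signal-selection stages are not reproduced): treat the period as a random variable by fitting a Gaussian Mixture Model to inter-event gaps, so each component yields an expected period and its variance, i.e. a stochastic period.

```python
# Sketch: learn "stochastic periods" from a timestamped event stream by
# fitting a GMM to inter-event gaps. The event stream is synthetic: one
# actor fires roughly every 60 s, another roughly every 300 s.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

t1 = np.cumsum(rng.normal(60, 5, 400))    # actor 1: ~60 s period, jittered
t2 = np.cumsum(rng.normal(300, 30, 80))   # actor 2: ~300 s period, jittered
timestamps = np.sort(np.concatenate([t1, t2]))

gaps = np.diff(timestamps).reshape(-1, 1)  # inter-event intervals

# Each mixture component models one stochastic period: an expected value
# (mean) with an associated variance, learned without supervision.
gmm = GaussianMixture(n_components=3, random_state=0).fit(gaps)
for w, mu, var in zip(gmm.weights_, gmm.means_.ravel(), gmm.covariances_.ravel()):
    print(f"weight={w:.2f}  period ~ N({mu:.1f} s, sd={np.sqrt(var):.1f} s)")
```

Interleaving the two actors shortens some observed gaps, which is exactly why a single fixed period fails and a distribution over periods is needed.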
Topological space creation and Clustering at BigData scale by Kuldeep Jiwani at #ODSC_India
Posted 2020-08-22 by Kuldeep Jiwani (https://www.datasciencecentral.com/profile/KuldeepJiwani)

All data has an inherent natural geometry associated with it. We are generally influenced by how the world appears to us visually, and we apply the same flat Euclidean geometry to data. But the data's geometry may be curved, may have holes, and may not admit a well-defined distance everywhere; if we impose Euclidean geometry on it regardless, we may distort the data space and destroy the information content inside it.

In the BigData world we regularly handle terabytes of data and have to extract meaningful information from it, applying many unsupervised machine learning techniques along the way. Two important steps in this process are building a topological space that captures the natural geometry of the data, and then clustering in that topological space to obtain meaningful clusters.

This talk walks through "data geometry" discovery techniques, first analytically and then via applied machine learning methods, so that listeners can take away hands-on techniques for discovering the real geometry of their data. Attendees will be presented with various BigData techniques, along with Apache Spark code showing how to build data geometry over massive data lakes.
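The talk showcases Apache Spark code over massive data lakes; that code is not reproduced here. As a small-scale, single-machine analogue of the two steps named above, the sketch below first builds a space that respects the data's intrinsic geometry (Isomap, i.e. geodesic distances over a k-nearest-neighbour graph) and then clusters in it (DBSCAN). The swiss-roll dataset, the neighbour count and the DBSCAN parameters are illustrative assumptions, not choices from the talk.

```python
# Step 1: approximate the manifold with a k-NN graph and embed via geodesic
# distances (Isomap). Step 2: cluster in the unrolled space, where Euclidean
# distance is now meaningful. A curved, "swiss roll" dataset stands in for
# data whose natural geometry is non-Euclidean.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap
from sklearn.cluster import DBSCAN

X, _ = make_swiss_roll(n_samples=2000, noise=0.3, random_state=0)

# Geodesic distances along the neighbourhood graph replace straight-line
# Euclidean distances that would cut through the roll.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(embedding)
print("clusters found:", len(set(labels) - {-1}))
```

Clustering directly on the raw coordinates would merge points on adjacent sheets of the roll that are Euclidean-close but geodesically far apart; embedding first preserves the geometry the clusters actually live in.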