


Scaling AI with Dynamic Inference Paths in Neural Networks

Introduction

IBM Research, in collaboration with the University of Texas at Austin and the University of Maryland, has developed a technology called BlockDrop to accelerate neural networks. Behind its design lies the objective of speeding up convolutional neural network operations without any loss of fidelity, which can offer significant cost savings to the ML community. This could "further enhance and expedite the application and use as well as boost the performance of neural nets, particularly on cloud/edge servers with limited computing capability and power limitations".

Increases in accuracy have been accompanied by increasingly complex and deep network architectures. This presents a problem for domains where fast inference is essential, particularly in delay-sensitive and real-time scenarios such as autonomous driving, robotic navigation, or user-interactive applications on mobile devices.

Further research shows that regularization techniques designed for fully connected layers are less effective for convolutional layers, as activation units in these layers are spatially correlated and information can still flow through convolutional networks despite dropout.

The BlockDrop method introduced by IBM Research is a "complementary method to existing model compression techniques, as this form of structured neural-network-based dropout drops spatially correlated information, resulting in compressed representations. The residual blocks of a neural network can be kept for evaluation, and can be further pruned for greater speed".

The figure below illustrates the BlockDrop mechanism for a given image input to the convolutional network. The green regions in the two right-hand figures contain the activation units that carry semantic information in the input image.
Dropping activations at random is not effective at removing semantic information, because nearby activations contain closely related information. The better strategy, employed in spatial compression algorithms, is to drop contiguous regions that represent a similar area and context, whether by color or shape. This removes specific semantic information (e.g., a head or feet), forcing the remaining units to learn more detailed features for classifying the input image.

Policy Network for Dynamic Inference Paths

The BlockDrop mechanism learns to dynamically choose which layers of a deep network to execute during inference so as to best reduce total computation without degrading prediction accuracy. It exploits the robustness of Residual Networks (ResNets) by dropping layers that are not necessary for achieving the desired level of accuracy, dynamically selecting residual blocks for each novel image. This aids in:

- Allocating system resources more efficiently, with the objective of saving cost.
- Facilitating further insights into ResNets, e.g., whether and how different blocks encode information about objects, and understanding the dynamics behind encoding object-level features.
- Achieving minimal block usage through more compressed representations by emphasizing decisions at the image-pixel level. These image-specific decisions, taken at different layers of hidden neurons, help to optimally drop blocks.

For example, given a pre-trained ResNet, a policy network is trained in an "associative reinforcement learning setting for the dual reward of utilizing a minimal number of blocks while preserving recognition accuracy". Experiments on CIFAR and ImageNet reveal that learned policies not only accelerate inference but also encode meaningful visual information.
With this method, a ResNet-101 model achieves a speedup of 20% on average, going as high as 36% for some images, while maintaining the same 76.4% top-1 accuracy on ImageNet.

The BlockDrop strategy learns a model, referred to as the policy network, that, given a novel input image, outputs the posterior probabilities of all the binary decisions for dropping or keeping each block in a pre-trained ResNet. The policy network is trained using curriculum learning to maximize a reward that incentivizes the use of as few blocks as possible while preserving prediction accuracy. In addition, the pre-trained ResNet is jointly fine-tuned with the policy network to produce feature transformations targeted at the block-dropping behavior. The method is an instantiation of associative reinforcement learning in which all decisions are taken in a single step given the context (i.e., the input instance). This makes policy execution lightweight and scalable to very deep networks. A recurrent model such as an LSTM could also serve as the policy network; however, research findings show a CNN to be more efficient with similar performance.

The figure below gives a conceptual overview of BlockDrop, which learns a policy to select the minimal configuration of blocks needed to correctly classify a given input image. The resulting instance-specific paths through the network not only reflect the image's difficulty, with easier samples using fewer blocks; the patterns of blocks also encode meaningful visual information, corresponding to clusters of visual features. Source: IBM

The figure above depicts the policy network architecture of BlockDrop. For any new image, the policy network outputs dropping and keeping decisions for each block in a pre-trained ResNet.
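The reward just described can be sketched in a few lines. This is a simplified reading of the paper's formulation, with `gamma` as an illustrative error penalty and the keep/drop policy represented as a binary vector:

```python
import numpy as np

def blockdrop_reward(keep, correct, gamma=5.0):
    """Sketch of a BlockDrop-style reward: a policy vector `keep`
    (1 = evaluate the residual block, 0 = skip it) is rewarded for
    sparsity, but only when the resulting prediction is correct."""
    usage = np.mean(keep)            # fraction of blocks kept
    if correct:
        return 1.0 - usage ** 2      # fewer blocks -> higher reward
    return -gamma                    # flat penalty for a wrong prediction

# A correct prediction that skips most blocks earns close to the maximum:
policy = np.array([1, 0, 0, 1, 0, 0, 0, 0])
print(blockdrop_reward(policy, correct=True))   # 0.9375
```

The quadratic term rewards dropping blocks only gently at first, so the policy is not pushed to sacrifice accuracy for sparsity.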
The final set of active blocks retained is used to evaluate the prediction. Both block usage and prediction accuracy cumulatively account for the policy reward. The policy network is trained to optimize the expected reward with a curriculum learning strategy, which provides a generic algorithm for the global optimization of non-convex functions. To attain this objective, the policy network is jointly fine-tuned with the ResNet. Source: IBM

The figure above illustrates samples from ImageNet. The top row contains images that are classified with high accuracy using the fewest blocks by removing redundancy, while samples in the bottom row use the most blocks. Samples using fewer blocks are indeed easier to identify, since they contain single frontal-view objects positioned in the center, while samples that require more blocks contain several objects, occlusion, or cluttered backgrounds. This supports the hypothesis "that block usage is a function of instance difficulty where BlockDrop automatically learns 'sorting' images into easy or hard cases".

Usage (Reference: https://github.com/Tushar-N/blockdrop.git)

Library and Usage

git clone https://github.com/Tushar-N/blockdrop.git
pip install -r requirements.txt
wget -O blockdrop-checkpoints.tar.gz https://www.cs.utexas.edu/~tushar/blockdrop/blockdrop-checkpoints.tar.gz
tar -zxvf blockdrop-checkpoints.tar.gz

# Train a model on CIFAR-10 built upon a ResNet-110
python cl_training.py --model R110_C10 --cv_dir cv/R110_C10_cl/ --lr 1e-3 --batch_size 2048 --max_epochs 5000

# Train a model on ImageNet built upon a ResNet-101
python cl_training.py --model R101_ImgNet --cv_dir cv/R101_ImgNet_cl/ --lr 1e-3 --batch_size 2048 --max_epochs 45 --data_dir data/imagenet/

# Finetune a ResNet-110 on CIFAR-10 using the checkpoint from cl_training
python finetune.py --model R110_C10 --lr 1e-4 --penalty -10 --pretrained cv/cl_training/R110_C10/ckpt_E_5300_A_0.754_R_2.22E-01_S_20.10_#_7787.t7 --batch_size 256 --max_epochs 2000 --cv_dir cv/R110_C10_ft_-10/

# Finetune a ResNet-101 on ImageNet using the checkpoint from cl_training
python finetune.py --model R101_ImgNet --lr 1e-4 --penalty -5 --pretrained cv/cl_training/R101_ImgNet/ckpt_E_4_A_0.746_R_-3.70E-01_S_29.79_#_484.t7 --data_dir data/imagenet/ --batch_size 320 --max_epochs 10 --cv_dir cv/R101_ImgNet_ft_-5/

# Test the finetuned models
python test.py --model R110_C10 --load cv/finetuned/R110_C10_gamma_10/ckpt_E_2000_A_0.936_R_1.95E-01_S_16.93_#_469.t7
python test.py --model R101_ImgNet --load cv/finetuned/R101_ImgNet_gamma_5/ckpt_E_10_A_0.764_R_-8.46E-01_S_24.77_#_10.t7

R110_C10 Model Output
Accuracy: 0.936
Block Usage: 16.933 ± 3.717
FLOPs/img: 1.81E+08 ± 3.43E+07
Unique Policies: 469

ImageNet Model Output
Accuracy: 0.764
Block Usage: 24.770 ± 0.980
FLOPs/img: 1.25E+10 ± 4.28E+08
Unique Policies: 10

Conclusion

In this blog, we have discussed the BlockDrop strategy, aimed at speeding up the inference of neural networks. It has the following characteristics:

- Speeds up AI-based computer vision operations and saves server running time.
- Takes approximately 200 times less power per pixel than comparable systems using traditional hardware.
- Facilitates the deployment of top-performing deep neural network models on mobile devices by effectively reducing their storage and computational costs.
- Determines the minimal configuration of layers, or blocks, needed to correctly classify a given input image. The simpler the image, the more layers can be removed and the more time saved.
- Extends ResNets for faster inference by selectively choosing which residual blocks to evaluate, in a learned and optimized manner conditioned on the input.
- Extensive experiments on CIFAR and ImageNet show considerable gains over existing methods in the efficiency/accuracy trade-off.

References

BlockDrop: Dynamic Inference Paths in Residual Networks, https://arxiv.org/pdf/1711.08393.pdf
https://www.ibm.com/blogs/research/2018/12/ai-year-review/


Traditional vs Deep Learning Algorithms in the Telecom Industry — Cloud Architecture and Algorithm Categorization

Google Cloud Architecture for Machine Learning Algorithms in the Telecom Industry

Introduction

The unprecedented growth of mobile devices, applications, and services has placed enormous demand on mobile and wireless networking infrastructure. Rapid research and development of 5G systems has found ways to support mobile traffic volumes, real-time extraction of fine-grained analytics, and agile management of network resources, so as to maximize user experience.

Moreover, inference over heterogeneous mobile data from distributed devices faces challenges due to computational and battery power limitations. As a result, models employed in edge-based scenarios are constrained to be lightweight to achieve a trade-off between model complexity and accuracy, and model compression, pruning, and quantization are widely used.

In this blog, we try to understand the different use cases, problems, and solutions that can be addressed with ML, as follows:

- Different telecom use cases solved by traditional ML models for customer satisfaction and end-user experience, catering to higher ROI.
- Limitations of traditional models, the evolution of deep learning models, and their usage in the telecom industry.
- Categorization of different ML models and how they fit in an end-to-end cloud architecture, from app-level data ingestion to running predictive models in the pipeline.

Use cases of traditional Machine Learning algorithms

In this section, let's look at the different use cases in the telecom industry where ML and AI algorithms have played a significant role in network traffic prediction, customer retention, and fraud analysis.

Smart traffic prediction and path optimization

The network and service control layer contains multi-dimensional convergent management and control functions to manage and control traditional and SDN/NFV cloud networks. Adding AI reasoning
capability would allow intelligent network operations and management: network performance data can help identify sleeping cells and trigger an automatic restart, and supports network optimization (coverage optimization, capacity optimization, massive MIMO optimization), root cause analysis (RCA), intelligent transmission route optimization, network strategy optimization, etc.

Security

Features governing network security include:

- Fast tracing and filtering of records with Naive Bayesian classification, Support Vector Machines, K-Nearest Neighbors, and neural networks.
- Rule extraction with ensemble methods such as aggregated decision trees.
- Identification and interception of malicious behaviors, prevention of attacks, etc., with Naive Bayes, Multilayer Perceptron Neural Networks (MLPNNs), Radial Basis Function Neural Networks (RBFNNs), and SVM algorithms.

Sentiment analysis with social media

As network operators turn to machine learning to analyze brand coverage and customer sentiment, social posts help them monitor language patterns and sentiment to identify trends, such as the factors driving new customers to subscribe or when subscribers seek out a competitor.

System design and architecture for highly accurate customer churn modeling

Customer Service Recommendation and Business Personalization

Service recommenders may also be used to boost existing services or to identify why users do not adopt some services and, in turn, suggest value-added services based on their profile and choices.
In addition, they can predict churn based on the usage patterns of past churners and changes in other usage profiles.

The figure below illustrates an SVM (Support Vector Machine)-based music recommendation system that extracts personal user-level information, timing, location, and activity records, along with musical context, to suggest suitable music services.

Music Recommendation System, a VAS leveraged by telecom operators

With customer-generated network data, it is easier to automate the process of grouping customers into segments, such as profiling customers based on their calling and messaging behavior.

Personalized ads

Operators try to present product/service advertisements that are tailored to an individual, situation, and device. This type of targeted advertising, when directed at the right customer bases, helps operators and advertisers zero in on customers with ads that fit their needs and interests.

Customer segmentation on call records

Clustering and classification techniques such as K-means cluster mobile customers based on their call detail records (CDRs) and analyze their consumer behavior. PCA-based dimensionality reduction techniques can be used to identify relevant and recurrent patterns (e.g., location, to identify common presence patterns) among the CDRs of a given user. Further, matrix factorization is employed to infer location preferences from sparse CDR data and generate location-based recommendations.

Clustering and classification phases for predicting user churn

Customer Churn Prediction

The figure above illustrates the application of SVM, Naive Bayes, decision trees, boosting, bagging, and random forests to customer churn prediction through supervised and unsupervised (clustering) techniques.

Traffic Flow Prediction

k-NN, Linear Discriminant Analysis (LDA), SVM, and decision trees are used to map network traffic into different classes of interest based on QoS requirements.
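The CDR-based K-means segmentation mentioned above follows a simple recipe. Below is a minimal sketch, with entirely synthetic data standing in for CDR-derived features (the feature names and usage profiles are illustrative, not from any real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CDR-derived features per subscriber (synthetic, for
# illustration only): [daily call minutes, daily SMS count].
light_users = rng.normal(loc=[5.0, 2.0], scale=1.0, size=(50, 2))
heavy_users = rng.normal(loc=[60.0, 40.0], scale=1.0, size=(50, 2))
X = np.vstack([light_users, heavy_users])

def kmeans(X, k, iters=20, seed=0):
    """Plain K-means: alternate nearest-centroid assignment and
    centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign every subscriber to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it put
        # if a cluster ever ends up empty).
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids

labels, centroids = kmeans(X, k=2)
# With well-separated usage profiles, the two behavioral segments
# typically emerge as the two clusters.
```

In practice a library implementation with multiple restarts would be used, but the assign/update loop above is the whole algorithm.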
The traffic classification framework uses statistics that are insensitive to the application protocol, including both packet-level and flow-level features.

- Flow clustering using Expectation Maximization: based on flow features (packet length, inter-arrival time, byte count, etc.), the EM algorithm groups traffic into a small number of clusters.
- AutoClass: an unsupervised Bayesian classifier that uses the EM algorithm to select the best clusters from a set of training data. To approach the global maximum, it repeats the EM search multiple times.
- K-means: unsupervised ML using the first few packets of a traffic flow, on the assumption that the first few packets capture the application negotiation phase, which is distinct among applications.
- Density-based spatial clustering (DBSCAN): able to classify noisy data, in contrast to K-means and AutoClass.
- Profiling by association (PBA): takes as input an IP-to-IP connectivity graph and information about a small subset of IP hosts, and produces a prediction about the class of all the flows (edges) in the graph.

Topic Models for Mobile Short Message Service Communication

Latent Dirichlet Allocation (LDA), a generative topic modeling technique, is used to extract latent features from mobile Short Message Service (SMS) communication for automatic discovery of user interests. The mobile SMS documents are partitioned into segments, and the topics discovered in each segment are propagated to influence the discovery of latent features. This technique filters malicious mobile SMS communication: topic models can effectively detect distinctive latent features to support automatic content filtering and remove security threats to mobile subscribers and operators.

Customer Segmentation

Clustering to segment customer profiles requires complex multivariate-time-series models, which have limitations around scalability and the ability to accurately represent the temporal behavior sequences (TBS) of users, as illustrated in the figures below.
TBS may be short, noisy, and non-stationary; the LDA model serves best to represent the temporal behavior of mobile subscribers as compact and interpretable profiles, relaxing the strict temporal ordering of user preferences.

Categorization of Deep Learning algorithms and their use cases in the Telecom Industry, Source: https://pdfs.semanticscholar.org/55c1/9610017a65319b130911651fbb2e3b552e51.pdf

The model generating subscriber behavior documents (left); LDA model to generate interpretable subscriber profiles (right); MTS: Multivariate Time Series

Categorization of Deep Learning algorithms and their use cases in the Telecom Industry

Advantages of Deep Learning in Mobile and Wireless Networking

The telecom industry acknowledges several benefits of employing Deep Learning to address network engineering problems:

- Traditional ML algorithms require feature engineering, which is expensive. Deep learning can automatically extract high-level features from data that has a complex structure and inner correlations. Feature engineering needs to be automated, particularly in the context of mobile networks, as mobile data is generated by heterogeneous sources, is often noisy, and exhibits non-trivial spatial/temporal patterns whose labeling requires outstanding human effort.
- Deep Learning is capable of handling large amounts of data and controlling model over-fitting. Deep models are suited to the high volumes of different types of data generated by mobile networks at a fast pace. Training traditional ML algorithms, e.g., Support Vector Machines (SVM) and Gaussian Processes (GP), sometimes requires storing all the data in memory, which is computationally infeasible in big-data scenarios. In contrast, the Stochastic Gradient Descent (SGD) used to train NNs only requires a subset of the data at each training step.
- Traditional supervised learning is only effective when sufficient labeled data is available.
However, most current mobile systems generate unlabeled or semi-labeled data, for which Deep Learning algorithms such as the restricted Boltzmann Machine (RBM), Generative Adversarial Networks (GAN), and one/zero-shot learning offer wider applicability to telecom-domain problems.

- Compressive representations learned by deep neural networks can be shared across different networks/telecom providers, while this is limited or difficult to achieve in other ML paradigms (e.g., linear regression, random forests, etc.). Therefore, a single model can be trained to fulfill multiple objectives, without requiring complete model retraining for different tasks, thereby saving CPU and memory in mobile networks.
- Deep Learning is effective at handling multivariate geometric mobile data (e.g., user location) represented by coordinates, topology, metrics, and order, through dedicated architectures such as PointNet++ and Graph CNN.

PointNet++ Architecture (left) and Graph CNN Architecture (right)

PointNet++ is a hierarchical neural network similar to conventional CNNs: it applies PointNet recursively on a nested partitioning of the input point set and is better able to capture local structures and finer details.

Despite the challenges posed by Deep Learning models, emerging tools and technologies make them tangible in mobile networks, as illustrated in the figures below: (i) advanced parallel computing, (ii) distributed machine learning systems, (iii) dedicated Deep Learning libraries, (iv) fast optimization algorithms, and (v) fog computing.

CPU/GPU/Processing capability to support Deep Learning Architectures

Deep Learning has a wide range of applications in mobile and wireless networks:

- Mobile big data collected within the network helps in traffic classification and Call Detail Record (CDR) mining.
- Deep Learning-Driven App-Level Mobile Data Analysis shifts the attention towards mobile data analytics on edge devices.
- Deep Learning-Driven User Mobility Analysis identifies movement patterns of mobile users, either at
group or individual levels.
- Deep Learning-Driven User Localization helps localize users in indoor or outdoor environments, based on different signals received from mobile devices or wireless channels.
- Deep Learning-Driven Wireless Sensor Networks find application in centralized vs. decentralized sensing, WSN data analysis, WSN localization, and other applications.
- Deep Learning-Driven Network Control applies deep reinforcement learning and deep imitation learning to network optimization, routing, scheduling, resource allocation, and radio control.
- Deep Learning-Driven Network Security leverages Deep Learning to improve network security, clustered by focus into infrastructure, software, and privacy-related aspects.
- Deep Learning-Driven Signal Processing scrutinizes physical-layer aspects that benefit from Deep Learning.
- Deep Learning-based RCNN and Fast-RCNN algorithms are used in telecom inventory management via object recognition and localization on Google Street View images.
- Media recognition (applied to pictures, sound, video, and traffic bursts) and photo-tagging help subscribers learn and classify known patterns in a collaborative image-classification system, and then use this to identify the category to which previously unseen patterns belong.
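Photo-tagging and similar mobile classifiers typically adapt a generic pretrained model rather than train from scratch. A minimal numpy sketch of that idea follows; a frozen, randomly initialized "backbone" stands in for a real pretrained feature extractor, and all data, dimensions, and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a pretrained backbone: a frozen random projection.
# In a real system this would be the convolutional trunk of a
# pretrained CNN.
W_backbone = rng.normal(size=(8, 16))

def features(x):
    # Frozen feature extractor: never updated during adaptation.
    return np.tanh(x @ W_backbone)

# Tiny synthetic two-class "target domain" dataset.
X = rng.normal(size=(100, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Adapt by training only a logistic-regression head on the frozen features.
F = features(X)
w, b = np.zeros(16), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))   # sigmoid
    grad = (p - y) / len(X)                  # logistic-loss gradient
    w -= 1.0 * (F.T @ grad)
    b -= 1.0 * grad.sum()

acc = (((F @ w + b) > 0) == (y == 1.0)).mean()
```

Training only the small head keeps adaptation cheap enough for on-device or edge scenarios, which is the point of the transfer approach discussed here.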
A transfer Deep Learning approach with ontology priors provides an effective means of discovering intermediate image representations from deep networks and ensures good generalization across two different domains (Web images as the source domain and personal photos as the target).

Now, let's take a quick look at the different Deep Learning platforms available and the mobile hardware supported, along with their speed and mobile compatibility.

Comparison of Mobile Deep Learning Models

Cloud Architecture with mobile data ingestion and Model Training, Prediction

The figure below depicts the different components involved in building the ML platform (Network Monitoring/Optimization, Media Settlement, Advertising, Audience Orientation, Pattern Recognition, Sensor Data Mining, and Mobility Analytics):

- Incoming real-time data from the mobile SDK
- Real-time data collection and computing engine receiving data from the SDK, with a messaging pipeline to cache frequently received records
- Offline Computing and Analysis Engine
- BI and Data Warehousing Engine

Cloud Architecture with GCP for telecom Machine Learning and AI algorithms

Network Monitoring and Optimization

Network State Prediction refers to inferring mobile network traffic or performance indicators, given historical cellular measurements of eNodeB, sector, and carrier data.
MLPs and LSTM-based Deep Learning techniques are used to predict users' QoE and to evaluate the best beam for transmission, based on:

- Average user throughput
- Number of active users in a cell
- Average data volume per user
- Channel quality indicators (uplink and downlink)
- Beam Index (BI)
- Beam Reference Signal Received Power (BRSRP)
- Distance (of the UE to the serving cell site)
- Position (GPS location of the UE)
- Speed (UE mobility)
- Channel Quality Indicator (CQI)
- Historic values based on past events and measurements, including previous serving-beam information, time spent on each serving beam, and distance trends

By leveraging sparse coding and max-pooling, semi-supervised Deep Learning models have been developed to classify received frame/packet patterns and infer the original properties of flows in a WiFi network.

Mobility metrics for Network Capacity Estimation

Further, AI-capable 5G networks aid in:

- Building a panoramic data map of each network slice based on user subscription, network performance, QoS, and event logs
- Forecasting network resources
- Anticipating network outages, equipment failures, and performance degradation
- Predicting UE mobility in 5G networks, allowing the Access and Mobility Management Function (AMF) to update mobility patterns based on user subscription, historical statistics, and instantaneous radio conditions
- Enhancing security in 5G networks, preventing attacks and fraud by recognizing user patterns and tagging certain events to prevent similar attacks in the future

Predicting Mobile traffic at city scale

Spatio-temporal correlations of geographic mobile traffic can be predicted with an AE-based architecture and LSTMs. A global and multiple local stacked AEs are used for spatial feature extraction, dimension reduction, and training parallelism, while the compressed representations extracted are subsequently processed by LSTMs to perform the final forecasting.
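The compress-then-forecast pipeline just described can be sketched with simple linear stand-ins: a tied-weight linear autoencoder replaces the stacked AEs and a one-step linear recurrence replaces the LSTM. Everything here, including the synthetic "city grid" traffic, is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic traffic snapshots: 200 time steps over a 16-cell region,
# driven by two latent daily patterns plus noise.
t = np.arange(200)
latent = np.stack([np.sin(2 * np.pi * t / 24), np.cos(2 * np.pi * t / 24)])
mixing = rng.normal(size=(16, 2))
traffic = (mixing @ latent).T + 0.05 * rng.normal(size=(200, 16))

# Stage 1 -- autoencoder stand-in: a tied-weight linear AE trained by
# gradient descent compresses each 16-cell snapshot to a 2-D code.
W = rng.normal(scale=0.1, size=(16, 2))
for _ in range(300):
    code = traffic @ W              # encode
    recon = code @ W.T              # decode (tied weights)
    err = recon - traffic
    # Gradient of the squared reconstruction error w.r.t. W.
    W -= 0.001 * (traffic.T @ (err @ W) + err.T @ (traffic @ W)) / len(traffic)

codes = traffic @ W

# Stage 2 -- recurrence stand-in: predict the next code from the
# current one (a single linear step where the paper uses an LSTM).
A, _, _, _ = np.linalg.lstsq(codes[:-1], codes[1:], rcond=None)
pred_next = codes[:-1] @ A
```

Forecasting in the compressed code space rather than over all cells is what makes the two-stage design cheaper than predicting the full spatial grid directly.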
The following figure illustrates a typical AE-LSTM architecture, where an AutoEncoder is used to extract features and an LSTM is used to predict the traffic flow:

AE-LSTM for traffic flow prediction

A Hybrid Multimodal Deep Learning method can be used for short-term traffic flow forecasting. The model, as illustrated in the figure below, is composed of one-dimensional Convolutional Neural Networks (1D CNNs) and Gated Recurrent Units (GRUs) with an attention mechanism, and can jointly and adaptively learn the spatial-temporal correlation features and the long temporal interdependence of multi-modality traffic data.

Hybrid Multimodal Deep Learning framework for traffic flow forecasting

Multiple 3D Convolutional Neural Networks use 3D CNNs to learn the spatio-temporal correlation features jointly, from low-level to high-level layers, for traffic data.

Multiple 3D CNN architecture

Other commonly used traditional ML models for modeling spatio-temporal characteristics include SVM and the Autoregressive Integrated Moving Average (ARIMA).

The ST-DenNetFus Deep Learning framework is used to predict network demand (i.e., uplink and downlink throughput) in every region of a city, as illustrated in the figure below. The ST-DenNetFus architecture captures unique properties (e.g., temporal closeness, period, and trend) from spatio-temporal data through various branches of dense neural networks (CNNs). ST-DenNetFus also introduces extra branches for fusing external data sources of various dimensionalities (e.g., crowd mobility patterns, temporal functional regions, and the day of the week) that had not been considered before in the network demand prediction problem.

ST-DenNetFus Architecture

The Mobile Traffic Super-Resolution (MTSR) technique is used to infer network-wide fine-grained mobile traffic consumption given coarse-grained counterparts obtained by probing.
MTSR works on the principle of image super-resolution, using a dedicated CNN with multiple skip connections between layers, named the deep zipper network, together with a Generative Adversarial Network (GAN). This enables precise MTSR, reduces traffic measurement overheads, and improves the fidelity of inferred traffic snapshots.

GAN operating principle in the MTSR problem. The generator is employed in the prediction phase.

MLPs, CNNs, and LSTMs perform encrypted mobile traffic classification, as deep NNs can automatically extract complex features (e.g., identify protocols in a TCP flow dataset). CNNs have also been used to identify malware traffic, where the images and unusual patterns that malware traffic exhibits are classified by representation learning.

CDR Mining involves extracting knowledge from specific instances of telecommunication transactions such as phone number, cell ID, session start/end time, traffic consumption, etc. Using Deep Learning to mine useful information from CDR data can serve a variety of functions, including:

- Estimating metro density from streaming CDR data using RNNs.
The goal is to treat the trajectory of a mobile phone user as a sequence of locations, which can then be fed to RNN-based models to handle the sequential data.
- Studying demographics, where a CNN is used to predict the age and gender of mobile users from CDR data.
- Predicting tourists' next locations from CDR data.
- Generating human activity chains using an input-output HMM-LSTM generative model.

CDR Analysis Pipeline

RNN-based predictors significantly outperform traditional ML methods, including Naive Bayes, SVM, RF, and MLP.

Deep Learning-Driven App-level Mobile Data Analysis

Analysis of mobile data has therefore become an important and popular research direction in the mobile networking domain, as the rapid emergence of IoT sensors and their data collection strategies provides a powerful basis for app-level data mining. App-level mobile data analysis includes: (i) cloud-based computing and (ii) edge-based computing. In the former, mobile devices act as data collectors and messengers that constantly send data to cloud servers via local points of access with limited data pre-processing capabilities. In edge-based computing, pre-trained models are offloaded from the cloud to individual devices. The primary applications include mobile healthcare, mobile pattern recognition, mobile Natural Language Processing (NLP), and Automatic Speech Recognition (ASR).

Mobile Health: Wearable health monitoring devices being introduced in the market incorporate medical sensors that capture the physical condition of their wearers and provide real-time feedback (e.g., heart rate, blood pressure, breath status, etc.) or trigger alarms to remind users to take medical action. The Deep Learning-driven MobiEar, which aids deaf people's awareness of emergencies, operates efficiently on smartphones and only requires infrequent communication with servers for updates.
UbiEar, a lightweight CNN architecture designed as an acoustic event sensing and notification system, operates on the Android platform and is able to assist hearing-impaired users in recognizing acoustic events, without requiring location information.

Deep Learning (DL) models (CNNs and RNNs) are able to classify lifestyle and environmental traits of volunteers and perform different types of Human Activity Recognition with heterogeneous, high-dimensional mobile sensor data, including accelerometer, magnetometer, and gyroscope measurements. ConvLSTMs are known for fusing data gathered from multiple sensors to perform activity recognition. Mobile motion sensors collect data via video capture, accelerometer readings, motion/Passive Infra-Red (PIR) sensing, and the specific actions and activities that a human subject performs. Such models, trained on servers for domain-specific tasks through federated learning, ultimately serve a broad range of devices.

Mobile Pattern Recognition is based on patterns observed in the output of the mobile camera or other sensors. All these DL models demonstrate superior prediction accuracy over RFs and logistic regression.

Object Classification finds huge application in mobile devices, as devices take photos and rely on a photo-tagging process. One such DL-based framework is DeepCham, which generates high-quality domain-aware training instances for adaptation from in-situ mobile photos. It has a distributed algorithm that identifies qualifying images stored on each mobile device for training, and a user labeling process for recognizable objects identified in qualifying images, using suggestions automatically generated by a generic deep model.

Mobile classifiers can also assist Virtual Reality (VR) applications, where Deep Learning object detectors are incorporated into a mobile Augmented Reality (AR) system.
Object detectors use CNN-based frameworks for facial expression recognition when users wear head-mounted displays in the VR environment.

The figure below demonstrates a lightweight Deep Learning-based object detection framework for mobile outdoor Augmented Reality that combines spatial relations for: training and detection with the lightweight Single Shot Detector (SSD); combination of vision-based detection results and spatial relationships; and registration, geo-visualization, and interaction.

The figure below demonstrates app-level data collection and transfer from edge devices to the cloud for algorithm training and prediction.

Deep Learning-Driven Mobility Analysis

Mobility data is usually subject to stochasticity, loss, and noise, which makes precise modeling difficult. Because Deep Learning can perform automatic feature extraction, it is a strong candidate for human mobility modeling. CNNs and RNNs are the most successful architectures in such applications, as they can effectively exploit spatial and temporal correlations.

The "DeepSpace" model, built with a hierarchical CNN structure, predicts individuals' trajectories with much higher accuracy than naive CNNs, stacked RNNs and LSTMs, n-grams, and the k-nearest-neighbor method. In addition to running two parallel prediction models, a coarse prediction model and fine prediction models, to deal with the continuous mobile data stream, the framework supports online training and learning to extract the optimal feature-set size for the online data.

Hierarchical framework with a coarse model and fine models, suited for spatial mobile data in an online learning system

The "DeepMove" model predicts human mobility from lengthy and sparse trajectories using an attentional recurrent network. DeepMove is first designed as a multi-modal embedding recurrent neural network that captures complicated sequential transitions by jointly embedding the multiple factors that govern human mobility.
Further, it is extended with a historical attention model to capture multi-level periodicity. As illustrated in the following figure, the historical attention module is equipped with an auto-selector comprised of two components: an attention candidate generator to generate the candidates, which are exactly the regularities of the mobility, and an attention selector to match the candidate vectors with the query vector, i.e., the current mobility status.

Architecture of DeepMove

GPS records and traffic accident data are combined to understand the correlation between human mobility and traffic accidents. The design uses a stacked de-noising Auto-Encoder to learn a compact representation of human mobility, and subsequently uses it to predict traffic accident risk. DBNs (Deep Belief Networks) are employed to predict and simulate human emergency behavior and mobility in natural disasters, learning from the GPS records of 1.6 million users.

A Deep Learning-based approach called ST-ResNet (illustrated in the figure below) is used to collectively forecast the inflow and outflow of crowds in every region of a city. The architecture of ST-ResNet (a residual neural network framework) is based on unique properties of spatio-temporal data, modeling the temporal closeness, period, and trend properties of crowd traffic. Each property has its own branch of residual convolutional units, which models the spatial properties of crowd traffic. ST-ResNet learns to dynamically aggregate the output of the three residual networks based on data, assigning different weights to different branches and regions, along with external factors such as weather and day of the week.

ST-ResNet architecture. Conv: Convolution; ResUnit: Residual Unit; FC: Fully-connected.

Deep Learning-Driven User Localization

Location-based services and applications (e.g., mobile AR, GPS) demand precise individual positioning technology.
Deep Learning can enable high localization accuracy with both device-free and device-based localization services.

Limitations of Deep Learning in Mobile and Wireless Networking

Although Deep Learning has unique advantages when addressing mobile network problems, it also has several shortcomings that partially restrict its applicability in this domain. Specifically:

Deep Learning (including deep reinforcement learning, and CNNs especially) is vulnerable to adversarial attacks, in which artifact inputs intentionally designed by an attacker fool Machine Learning models into making mistakes. Such inputs can trigger mis-adjustments of a model with high likelihood.

Deep Learning algorithms are largely black boxes with low interpretability. This limits the applicability of Deep Learning, e.g., in network economics. Businesses therefore continue to employ statistical methods with high interpretability, sacrificing the accuracy attainable from Deep Learning models.

Deep Learning is heavily reliant on data, and models benefit further from training-data augmentation. This creates an opportunity for mobile networking, as networks generate tremendous amounts of data. However, data collection may be costly and raises privacy concerns, so it may be difficult to obtain sufficient information for model training.

Deep Learning can be computationally demanding and relies heavily on advanced parallel computing (e.g., GPUs, high-performance chips). Deploying neural networks on embedded and mobile devices adds constraints on energy and capability.

Deep neural networks usually have many hyperparameters (for a CNN, these include the number, shape, stride, and dilation of filters, as well as the residual connections), and finding their optimal configuration can be difficult.
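As a rough illustration of why hyperparameter configuration is hard, and of the naive search that AutoML-style approaches improve on, here is a minimal random-search sketch; the search space and the scoring function are invented purely for illustration:

```python
import random

# Illustrative CNN hyperparameter space (values are made up for the sketch).
space = {
    "n_filters": [16, 32, 64],
    "kernel_size": [3, 5, 7],
    "stride": [1, 2],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

def random_search(evaluate, space, n_trials=20, seed=0):
    """Try random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in space.items()}
        score = evaluate(cfg)   # in practice: validation accuracy after training
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Stand-in objective that favors more filters and a mid-range learning rate.
toy_eval = lambda c: c["n_filters"] - 100.0 * abs(c["learning_rate"] - 1e-3)
best, score = random_search(toy_eval, space)
```

Even this tiny space has 54 configurations; real networks have far more, which is what motivates progressive architecture search.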
The AutoML platform provides the first solution to this problem, employing progressive neural architecture search.

Conclusion

In this blog, we discussed traditional vs Deep Learning algorithms, DL-based architectures, their pros and cons, and applications in the telecom industry. We also explored the data ingestion, categorization, and model deployment architecture in production. We looked at recent advances in ML-driven mobile-app development (in object detection, speaker identification, emotion recognition, stress detection, and ambient scene analysis), in-built technologies that sustain limited mobile battery by building memory- and energy-efficient apps, and model compression techniques.

References

Machine-learning technologies in telecommunications: https://pdfs.semanticscholar.org/a367/f8cad03c1353e9fc36970e4cb4b8edc21fc0.pdf
Deep Learning in Mobile and Wireless Networking: A Survey: https://pdfs.semanticscholar.org/55c1/9610017a65319b130911651fbb2e3b552e51.pdf


Anomaly Detection from Head and Abdominal Fetal ECG — A Case Study of IoT Anomaly Detection using Generative Adversarial Networks

Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal Metrics

Motivation

In this blog, we discuss the role of the Variational Auto-Encoder in detecting anomalies from fetal ECG signals. Variational Auto-Encoders offer ways to accurately determine anomalies in seasonal metrics occurring at regular intervals (daily/weekly/bi-weekly/monthly, or periodic events at finer granularities of minutes/seconds), so as to facilitate timely action by the concerned team. Such timely action helps recovery from serious issues (such as those addressed by predictive maintenance) in web applications and in the retail, IoT, telecom, and healthcare industries. The metrics/KPIs that play an important role in determining anomalies are composed of noises assumed to be independent, zero-mean Gaussian at every point; the seasonal KPIs comprise seasonal patterns with local variations plus the statistics of the Gaussian noises.

Role of IoT/Wearables

Portable low-power fetal ECG collectors, such as wearables, have been designed for research and analysis and can collect maternal abdominal ECG signals in real time. The ECG data can be sent to a smartphone client via Bluetooth to individually analyze signals captured from the fetal brain and the maternal abdomen. The extracted fetal ECG signals can then be used to detect any anomaly in fetal behavior.

Variational Auto-Encoder

Deep Bayesian networks employ black-box learning with neural networks to express the relationships between variables in the training dataset. Variational Auto-Encoders are Deep Bayesian networks, often used in training and prediction, that use neural networks to model the posteriors of the distributions. Variational Auto-Encoders (VAEs) support optimization by setting a lower bound on the likelihood via a reparameterization of the Evidence Lower Bound (ELBO).
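The reparameterization mentioned above can be sketched for a diagonal-Gaussian posterior; this is a generic numpy illustration of the z = μ + σ·ε trick and the closed-form KL term of the ELBO, not Donut's actual implementation:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps, eps ~ N(0, I): sampling stays differentiable
    w.r.t. mu and sigma because the randomness is isolated in eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

rng = np.random.default_rng(0)
mu, log_var = np.array([0.5, -0.2]), np.array([0.1, -0.3])
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
# The negative ELBO adds a reconstruction term (e.g. the Gaussian negative
# log-likelihood of x under p(x | z)) to this KL penalty.
```

Note that the KL term vanishes exactly when the posterior equals the standard normal prior.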
The ELBO method uses a two-step process to maximize the log-likelihood: the likelihood term tries to make the generated sample (image/data) more correlated to the latent variable, which makes the model more deterministic, while the KL divergence between the posterior and the prior is minimized.

Characteristics/Architecture of Donut

Donut recognizes the normal pattern of a partially abnormal x and finds a good posterior in order to estimate how well x follows the normal pattern. The fundamental characteristic of Donut is its ability to find good posteriors by reconstructing normal points within abnormal windows. This property is built into its training through M-ELBO (Modified ELBO), which turns out to be superior to excluding all windows containing anomalies and missing points from the training data. The three techniques employed in the VAE-based anomaly detection algorithm of the Donut architecture are:

Modified ELBO – Ensures that, on average, a certain minimum number of bits of information are encoded per latent variable, or per group of latent variables. This helps to increase information capacity and reconstruction accuracy.

Missing Data Injection for training – A kind of data augmentation procedure that fills missing points as zeros. It amplifies the effect of M-ELBO by injecting missing data before each training epoch starts and recovering the missing points after the epoch finishes.

MCMC Imputation for better anomaly detection – Improves posterior estimation using synthetically generated missing points.

The network structure of Donut: gray nodes are random variables, and white nodes are layers. The data preparation stage deals with standardization, missing-value injection, and grouping data with a sliding window (of length W) over the key metrics, where each point x_t is processed as the window x_{t−W+1}, . . . , x_t. The training process encompasses Modified ELBO and Missing Data Injection.
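The sliding-window grouping described above, in which each point x_t is expanded into the window x_{t−W+1}, …, x_t, can be sketched as follows (a generic illustration; the Donut library performs this internally, and the model configuration in this post uses W = 120):

```python
import numpy as np

def sliding_windows(values, w):
    """Expand a 1-D series into its overlapping windows x_{t-W+1} .. x_t."""
    return np.stack([values[t - w + 1:t + 1] for t in range(w - 1, len(values))])

series = np.arange(10, dtype=float)   # stand-in for a standardized KPI series
X = sliding_windows(series, w=4)      # one row per time step t >= W-1
```

Each row of `X` is what the VAE reconstructs; the anomaly score of x_t is derived from the reconstruction of the last point of its window.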
In the final prediction stage, MCMC Imputation (as shown in the figure below) is applied to yield a better posterior distribution.

MCMC Imputation and Anomaly Detection. Source: Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications

To know more about the ELBO in VAEs, check out https://medium.com/@hfdtsinghua/derivation-of-elbo-in-vae-25ad7991fdf7 or refer to the references below.

File Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import mne
from donut import complete_timestamp, standardize_kpi
from sklearn.metrics import accuracy_score

sns.set(rc={'figure.figsize': (11, 4)})

Loading and Timestamping the Data

Here we add timestamps to the fetal ECG data under the assumption that each data point is recorded at an interval of 1 second (although the dataset source states that the signals are recorded at 1 kHz). We then resample the data at an interval of 1 minute by averaging 60 samples.

data_path = '../abdominal-and-direct-fetal-ecg-database-1.0.0/'
file_name = 'r10.edf'
edf = mne.io.read_raw_edf(data_path + file_name)
header = ','.join(edf.ch_names)
np.savetxt('r10.csv', edf.get_data().T, delimiter=',', header=header)
df = pd.read_csv('r10.csv')
periods = df.shape[0]
dti = pd.date_range('2018-01-01', periods=periods, freq='s')
print(dti.shape, df.shape)
df['DateTs'] = dti
df = df.set_index('DateTs')
df.index = pd.to_datetime(df.index, unit='s')
df1 = df.resample('1T').mean()

Once the data is indexed by timestamps, we plot the individual features and explore seasonality patterns, if any. We also add a label feature, signifying potential anomalies in the input data, by flagging high-level brain-signal fluctuations (>= .00025 or <= -.00025).
We chose the brain signal, as it closely resembles the signal curves and spikes of the 4 abdominal signals.

Data Labelling and Plotting the Features

There are 5 signals in total (one from the fetal brain and 4 from the abdomen).

cols = df1.columns
df1.rename_axis('timestamp', inplace=True)
print(cols, df1.index.name)
df1['label'] = np.where((df1['# Direct_1'] >= .00025) | (df1['# Direct_1'] <= -.00025), 1, 0)
print(df1.head(5))
for i in range(0, len(cols)):
    if cols[i] != 'timestamp':
        plt.figure(figsize=(20, 10))
        plt.plot(df1[cols[i]], marker='^', color='red')
        plt.title(cols[i])
        plt.savefig('figs/f_' + str(i) + '.png')

Training the Data using the Donut Network

df2 = df1.reset_index()
df2 = df2.reset_index(drop=True)  # drop the index and use it as a feature vector before discovering the missing data points

# Read the raw data for the 1st feature, Direct_1.
timestamp, values, labels = df2['timestamp'], df2['# Direct_1'], df2['label']

# If there is no label, simply use all zeros.
labels = np.zeros_like(values, dtype=np.int32)

# Complete the timestamp, and obtain the missing point indicators.
timestamp, missing, (values, labels) = \
    complete_timestamp(timestamp, (values, labels))

# Split the training and testing data.
test_portion = 0.3
test_n = int(len(values) * test_portion)
train_values, test_values = values[:-test_n], values[-test_n:]
train_labels, test_labels = labels[:-test_n], labels[-test_n:]
train_missing, test_missing = missing[:-test_n], missing[-test_n:]

# Standardize the training and testing data.
train_values, mean, std = standardize_kpi(
    train_values, excludes=np.logical_or(train_labels, train_missing))
test_values, _, _ = standardize_kpi(test_values, mean=mean, std=std)

import tensorflow as tf
from donut import Donut
from tensorflow import keras as K
from tfsnippet.modules import Sequential
from donut import DonutTrainer, DonutPredictor

# We build the entire model within the scope of `model_vs`,
# it should hold exactly all the variables of `model`, including
# the variables created by Keras layers.
with tf.variable_scope('model') as model_vs:
    model = Donut(
        h_for_p_x=Sequential([
            K.layers.Dense(50, kernel_regularizer=K.regularizers.l2(0.001),
                           activation=tf.nn.relu),
            K.layers.Dense(50, kernel_regularizer=K.regularizers.l2(0.001),
                           activation=tf.nn.relu),
        ]),
        h_for_q_z=Sequential([
            K.layers.Dense(50, kernel_regularizer=K.regularizers.l2(0.001),
                           activation=tf.nn.relu),
            K.layers.Dense(50, kernel_regularizer=K.regularizers.l2(0.001),
                           activation=tf.nn.relu),
        ]),
        x_dims=120,
        z_dims=5,
    )

trainer = DonutTrainer(model=model, model_vs=model_vs, max_epoch=512)
predictor = DonutPredictor(model)

with tf.Session().as_default():
    trainer.fit(train_values, train_labels, train_missing, mean, std)
    test_score = predictor.get_score(test_values, test_missing)

pred_score = np.array(test_score).reshape(-1, 1)
print(len(test_missing), len(train_missing), len(pred_score), len(test_values))

The model is trained with the default parameters listed below:

use_regularization_loss=True,
max_epoch=512,
batch_size=256,
valid_batch_size=1024,
valid_step_freq=100,
initial_lr=0.001,
optimizer=tf.train.AdamOptimizer,
grad_clip_norm=10.0  # Clip gradient by this norm.
The model summary, with its trainable parameters and hidden layers, can be obtained as:

Trainable Parameters (24,200 in total)
donut/p_x_given_z/x_mean/bias            (120,)     120
donut/p_x_given_z/x_mean/kernel          (50, 120)  6,000
donut/p_x_given_z/x_std/bias             (120,)     120
donut/p_x_given_z/x_std/kernel           (50, 120)  6,000
donut/q_z_given_x/z_mean/bias            (5,)       5
donut/q_z_given_x/z_mean/kernel          (50, 5)    250
donut/q_z_given_x/z_std/bias             (5,)       5
donut/q_z_given_x/z_std/kernel           (50, 5)    250
sequential/forward/_0/dense/bias         (50,)      50
sequential/forward/_0/dense/kernel       (5, 50)    250
sequential/forward/_1/dense_1/bias       (50,)      50
sequential/forward/_1/dense_1/kernel     (50, 50)   2,500
sequential_1/forward/_0/dense_2/bias     (50,)      50
sequential_1/forward/_0/dense_2/kernel   (120, 50)  6,000
sequential_1/forward/_1/dense_3/bias     (50,)      50
sequential_1/forward/_1/dense_3/kernel   (50, 50)   2,500

This model results from the Donut construction shown earlier (the `Donut(...)` call with the `h_for_p_x` and `h_for_q_z` networks, x_dims=120 and z_dims=5). This Donut network uses the variational auto-encoder ("Auto-Encoding Variational Bayes", Kingma, D.P. and Welling), a deep Bayesian network with observed variable x and latent variable z. The VAE is built using TFSnippet (a library for writing and testing TensorFlow models). The generative process starts from the latent variable z with prior distribution p(z) and a hidden network h(z), then draws the observed variable x from the distribution p(x | h(z)).
For posterior inference p(z | x), variational inference techniques are adopted to train a separate distribution q(z | h(x)). Here each Sequential call creates a multi-layer perceptron with 2 hidden layers of 50 units and ReLU activation. The two networks "h_for_p_x" and "h_for_q_z" are created with the same Sequential construction (as evident from the model summary: sequential and sequential_1) and represent the hidden networks for "p_x_given_z" and "q_z_given_x".

Plotting the Anomalies/Non-Anomalies Together or Individually

We plot the anomalies (in red) together with the non-anomalies (in green) and superimpose both in the same graph to analyze the combined impact. In the Donut prediction, the higher the prediction score, the less anomalous the data; we choose (-3) as the threshold for marking anomalous points. We also compute the numbers of inliers and outliers and plot them against the timestamped values along the x-axis.

plt.figure(figsize=(20, 10))
split_test = int(test_portion * df.shape[0])
anomaly = np.where(pred_score.ravel() > -3, 0, 1)
df3 = df2.iloc[-anomaly.shape[0]:]
df3['outlier'] = anomaly
df3.reset_index(drop=True)
print(df3.head(2), df3.shape)
print("Split", split_test, df3.shape)
di = df3[df3['outlier'] == 0]
do = df3[df3['outlier'] == 1]
di = di.set_index(['timestamp'])
do = do.set_index(['timestamp'])
print("Outlier and Inlier Numbers", do.shape, di.shape, di.columns, do.columns)
outliers = pd.Series(do['# Direct_1'], do.index)
inliers = pd.Series(di['# Direct_1'], di.index)
plt.plot(do['# Direct_1'], marker='^', color='red', label="Anomalies")
plt.plot(di['# Direct_1'], marker='^', color='green', label="Non Anomalies")
plt.legend(['Anomalies', 'Non Anomalies'])
plt.title('Anomalies and Non Anomalies from Fetal Head Scan')
plt.show()

di = di.reset_index()
do = do.reset_index()
plt.figure(figsize=(20, 10))
do.plot.scatter(y='# Direct_1', x='timestamp', marker='^', color='red', label="Anomalies")
plt.legend(['Anomalies'])
plt.xlim(df3['timestamp'].min(), df3['timestamp'].max())
plt.ylim(-.0006, .0006)
plt.title('Anomalies from Fetal Head Scan')
plt.show()

plt.figure(figsize=(20, 10))
di.plot.scatter(y='# Direct_1', x='timestamp', marker='^', color='green', label="Non Anomalies")
plt.legend(['Non Anomalies'])
plt.xlim(df3['timestamp'].min(), df3['timestamp'].max())
plt.ylim(-.0006, .0006)
plt.title('Non Anomalies from Fetal Head Scan')
plt.show()

Anomaly Plots for the Direct Electrocardiogram Recorded from the Fetal Head

The three consecutive plots display anomalous and non-anomalous points plotted against each other or separately, as labeled, for the signals obtained from the fetal head scan.

Anomaly Plots for the Direct Electrocardiogram Recorded from the Maternal Abdomen

The three consecutive plots display anomalous and non-anomalous points plotted against each other or separately, as labeled, for the signals obtained from the maternal abdomen.

Conclusion

Some of the key learnings of the Donut architecture are:

Dimensionality-reduction-based anomaly detection techniques need a reconstruction mechanism to identify the variance and consequently the anomalies.

Anomaly detection with generative models needs to train with both normal and abnormal data.

Do not rely on data imputation by any algorithm weaker than the VAE, as this may degrade performance.

To discover anomalies fast, the reconstruction probability for the last point in every window of x is computed.

We should also explore other variants of Auto-Encoders (RNN, LSTM, LSTM with attention networks, stacked convolutional bidirectional LSTM) for discovering anomalies in IoT devices.

The complete source code is available at https://github.com/sharmi1206/featal-ecg-anomaly-detection

References

https://physionet.org/content/adfecgdb/1.0.0/
Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications: https://arxiv.org/abs/1802.03903
Don't Blame the ELBO!
A Linear VAE Perspective on Posterior Collapse: https://papers.nips.cc/paper/9138-dont-blame-the-elbo-a-linear-vae-perspective-on-posterior-collapse.pdf
https://github.com/NetManAIOps/donut — Installation and API Usage
Understanding disentangling in β-VAE: https://arxiv.org/pdf/1804.03599.pdf
A Fetal ECG Monitoring System Based on the Android Smartphone: https://www.mdpi.com/1424-8220/19/3/446


Motivation

There are five types of traditional time series models most commonly used in epidemic time series forecasting: Autoregressive (AR), Moving Average (MA), Autoregressive Moving Average (ARMA), Autoregressive Integrated Moving Average (ARIMA), and Seasonal Autoregressive Integrated Moving Average (SARIMA) models.

AR models express the current value of the time series linearly in terms of its previous values and the current residual, whereas MA models express it linearly in terms of its current and previous residuals. ARMA models combine AR and MA models: the current value of the series is expressed linearly in terms of its previous values and of the current and previous residuals. The time series handled by AR, MA, and ARMA models are stationary processes: the mean of the series and the covariance among its observations do not change with time. For non-stationary time series, the series must first be transformed into a stationary one. The ARIMA model fits a non-stationary time series using the ARMA machinery plus a differencing step that effectively transforms the non-stationary data into stationary data. SARIMA models, which combine seasonal differencing with an ARIMA model, are used for time series data with periodic characteristics.

Comparing the performance of the algorithmic models available for time series, it was found that the machine learning methods were all out-performed by simple classical methods, with ETS and ARIMA models performing best overall. The following figure presents the model comparisons.
Bar chart comparing model performance (sMAPE) for one-step forecasts. Source: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0194889

Beyond traditional time-series forecasting, advances in deep learning for time-series prediction, notably the Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM), have gained much attention in recent years, with applications in many disciplines including computer vision, natural language processing, and finance. Deep learning methods are capable of identifying structure and patterns in data, such as non-linearity and complexity, in time series forecasting. It remains an open research question how newly developed deep learning algorithms for forecasting time series data, such as LSTM, compare to the traditional algorithms.

The blog is structured as follows:

Understanding the deep learning algorithms RNN and LSTM, and the role of ensemble learning with LSTM in improving performance.

Understanding the conventional time series modeling technique ARIMA and how it improves time series forecasting in ensembling methods when used in conjunction with MLP and multiple linear regression.

Understanding problems and scenarios where ARIMA can be used vs LSTM, and the pros and cons of adopting one over the other.

Understanding how time series modeling with SARIMA can be combined with other spatial, decision-based, and event-based models using ensemble learning.

The study does not look at more complex time series problems, such as datasets with complex irregular temporal structures, missing observations, heavy noise, and complex interrelationships between multiple variates.

LSTM

LSTM is a special kind of RNN composed of a set of cells that memorize the sequence of data. Each cell captures and stores the data stream.
Further, the cells connect one module of the past to the present one, conveying information from several past time instants to the present. Thanks to the gates in each cell, data can be disposed of, filtered, or added for the next cells. The gates, based on a sigmoidal neural network layer, enable the cells to optionally let data pass through or be discarded. Each sigmoid layer yields numbers between zero and one, indicating how much of each segment of data ought to be let through in each cell: a value of zero implies "let nothing pass through," whereas a value of one means "let everything pass through." Three types of gates control the state of each cell:

Forget Gate: outputs a number between 0 and 1, where 1 means "completely keep this" and 0 means "completely ignore this."

Memory Gate: chooses which new data to store in the cell through a sigmoid layer followed by a tanh layer. The initial sigmoid layer, called the "input gate layer," chooses which values will be modified; the tanh layer makes a vector of new candidate values that could be added to the state.

Output Gate: decides what each cell will output. The output is based on the cell state along with the filtered and newly added data.

Both in terms of understanding how it works and in implementation, the LSTM model provides considerably more options for fine-tuning than ARIMA.

Ensembles of LSTM for Time-Series Forecasting

Several studies have found that a single LSTM network trained on a particular dataset is very likely to perform poorly on an entirely different time series unless rigorous parameter optimization is performed.
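The three gates described above can be made concrete with a single LSTM cell step in plain numpy (the weight layout and names here are illustrative, not any framework's API):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b hold parameters for gates f, i, o and
    the candidate values g."""
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate, in (0, 1)
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # memory/input gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # candidate values
    c = f * c_prev + i * g        # keep part of the old state, add new content
    h = o * np.tanh(c)            # expose a filtered view of the cell state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.standard_normal((n_hid, n_in)) for k in "fiog"}
U = {k: rng.standard_normal((n_hid, n_hid)) for k in "fiog"}
b = {k: np.zeros(n_hid) for k in "fiog"}
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
```

Because the output gate and tanh both lie in (-1, 1), the hidden output h is always bounded, while the cell state c can grow to carry long-range information.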
As LSTM is very successful in the forecasting domain, researchers use a so-called stacking ensemble approach, where multiple LSTM networks are stacked and combined to provide more accurate predictions, with the aim of a more generalized model for forecasting problems. Research on four different forecasting problems concluded that stacked LSTM networks outperformed regular LSTM networks as well as the ARIMA model in terms of RMSE. The general quality of the ensemble method studied could be increased by tuning the parameters of each individual LSTM. The reasons for the poor performance of a single LSTM network are the heavy parameter tuning LSTM networks require and the use of individual networks on datasets different from the one they were trained on. Hence the concept of ensembling LSTM networks evolved as a better choice for forecasting problems: it reduces the need for heavy parameter optimization and increases the quality of the predictions.

Among other ensembling techniques, hybrid ensemble learning with Long Short-Term Memory (LSTM), as depicted in the above figure, can be used to forecast financial time series. An AdaBoost algorithm is used to combine predictions from several individual LSTM networks. First, the AdaBoost algorithm generates training samples with replacement from the original dataset. Second, an LSTM is used to forecast each training sample separately. Third, the AdaBoost algorithm integrates the forecasting results of all the LSTM predictors to generate the ensemble result.
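A simplified stand-in for that combination step can be sketched by weighting each base LSTM forecaster by its validation performance and averaging; this is an illustration of the idea, not the exact AdaBoost weighting used in the cited work:

```python
import numpy as np

def combine_forecasts(preds, errors):
    """Weight each base forecaster inversely to its validation error and
    return the weighted-average ensemble forecast."""
    weights = 1.0 / np.asarray(errors, dtype=float)
    weights /= weights.sum()
    return weights @ np.asarray(preds), weights

# Three hypothetical LSTM forecasts for the same 3-step horizon.
preds = [np.array([1.0, 2.0, 3.0]),
         np.array([1.2, 2.2, 3.2]),
         np.array([0.5, 1.5, 2.5])]
errors = [0.1, 0.2, 0.4]          # e.g. validation RMSE of each base model
ensemble, w = combine_forecasts(preds, errors)   # better models weigh more
```

The inverse-error weighting gives the most accurate base learner the largest say, which is the same principle AdaBoost applies through its learner weights.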
The empirical results on two major daily exchange-rate datasets and two stock-market index datasets demonstrate that the AdaBoost-LSTM ensemble learning approach outperforms single forecasting models and other ensemble learning approaches. The AdaBoost-LSTM ensemble looks promising for financial time series forecasting, particularly for series with nonlinearity and irregularity, such as exchange rates and stock indexes. Source: https://www.semanticscholar.org/paper/A-real-time-ensemble-classification-algorithm-for-Zhu-Zhao/a5538d776d0fb9e22e77c58545ce1de4d0f2b29f

Another example of ensemble learning with LSTM, as depicted in the above figure, arises when the input layer contains inputs from time t1 to tn, and the input for each time instant is fed to its own LSTM layer. The output h_k of each LSTM layer, representing the information at time k, is fed to the final output layer, which aggregates and computes the mean of all the outputs received. The mean is then fed into a logistic regression layer to predict the label of the sample.

ARIMA

The ARIMA algorithm is a class of models that captures temporal structure in time series data. However, using only the ARIMA model, it is hard to model nonlinear relationships between variables. The Autoregressive Integrated Moving Average (ARIMA) model is a generalization of the Autoregressive Moving Average (ARMA) model that combines Autoregressive (AR) and Moving Average (MA) processes and builds a composite model of the time series.

AR: Autoregression. A regression model that uses the dependencies between an observation and a number of lagged observations.

I: Integrated. Makes the time series stationary by differencing observations at different times.

MA: Moving Average. An approach that accounts for the dependency between observations and the residual error terms when a moving average model is applied to the lagged observations (q).
A simple form of an AR model of order p, i.e., AR(p), can be written as the linear process

x_t = c + Σ_{i=1}^{p} ∅_i x_{t−i} + ξ_t

Here x_t represents the stationary variable, c is a constant, the terms ∅_i are the autocorrelation coefficients at lags 1, 2, …, p, and the residuals ξ_t are Gaussian white noise with mean zero and variance σ².

The general form of an ARIMA model is denoted ARIMA(p, d, q), where:

p is the number of lag observations used in training the model (the lag order).
d is the number of times differencing is applied (the degree of differencing).
q is the size of the moving average window (the order of the moving average).

For example, ARIMA(5,1,0) sets the lag value to 5 for autoregression, uses a difference order of 1 to make the time series stationary, and does not consider any moving average window (i.e., a window of size zero). RMSE can be used as the error metric to evaluate the performance of the model and assess the accuracy of its forecasts.

With seasonal time series data, it is likely that short-run non-seasonal components also contribute to the model. We therefore need to estimate a seasonal ARIMA model, which incorporates both non-seasonal and seasonal factors in a multiplicative model. The general form of a seasonal ARIMA model is denoted (p, d, q) × (P, D, Q)S, where p is the non-seasonal AR order, d the non-seasonal differencing, q the non-seasonal MA order, P the seasonal AR order, D the seasonal differencing, Q the seasonal MA order, and S the time span of the repeating seasonal pattern.
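The AR(p) process above can be simulated directly from its defining recursion; this is a generic numpy sketch, with illustrative coefficients:

```python
import numpy as np

def simulate_ar(phi, c, sigma, n, seed=0):
    """Simulate x_t = c + sum_i phi_i * x_{t-i} + xi_t with Gaussian noise,
    starting from p zero-valued warm-up observations."""
    rng = np.random.default_rng(seed)
    phi = np.asarray(phi, dtype=float)
    p = len(phi)
    x = np.zeros(n + p)
    for t in range(p, n + p):
        lagged = x[t - p:t][::-1]             # x_{t-1}, x_{t-2}, ..., x_{t-p}
        x[t] = c + phi @ lagged + sigma * rng.standard_normal()
    return x[p:]

# A stationary AR(2): coefficients chosen inside the stationarity region.
x = simulate_ar(phi=[0.5, -0.3], c=0.1, sigma=1.0, n=500)
```

With sigma set to zero the recursion is deterministic, which makes it easy to check the formula by hand for small p.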
The most important step in estimating a seasonal ARIMA model is to identify the values of (p, d, q) and (P, D, Q). Based on the time plot of the data, if, for instance, the variance grows with time, we should apply variance-stabilizing transformations and differencing. Then, using the autocorrelation function (ACF) to measure the amount of linear dependence between observations separated by a given lag and so suggest the moving average order q, the partial autocorrelation function (PACF) to determine how many autoregressive terms p are necessary, and the inverse autocorrelation function (IACF) to detect over-differencing, we can identify preliminary values of the autoregressive order p, the order of differencing d, the moving average order q, and their corresponding seasonal parameters P, D, and Q. The parameter d is the order of differencing needed to turn a non-stationary time series into a stationary one.

In the popular univariate "Auto-Regressive Moving Average (ARMA)" method for a single time series, the Auto-Regressive (AR) and Moving Average (MA) models are combined. Univariate "Auto-Regressive Integrated Moving Average (ARIMA)" is a special type of ARMA in which differencing is taken into account in the model. Multivariate ARIMA models and Vector Auto-Regression (VAR) models are the other most popular forecasting models; they generalize the univariate ARIMA model and the univariate autoregressive (AR) model by allowing for more than one evolving variable.

ARIMA is a linear-regression-based forecasting approach, best suited for one-step out-of-sample forecasts. Here, the algorithm developed performs a multi-step out-of-sample forecast with re-estimation, i.e., the model is re-fitted at each step to build the best estimation model. The algorithm takes an input time series dataset, builds a forecast model, and reports the root mean-square error of the prediction.
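The walk-forward re-estimation procedure described above can be sketched as follows. This is a hypothetical simplification: a re-fitted AR(1) regression stands in for the full ARIMA re-estimation, and the series is synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
series = np.cumsum(rng.normal(size=120))  # synthetic time series
train, test = series[:100], series[100:]

history = list(train)  # accumulates observations as they arrive
predictions = []       # stores each one-step-ahead forecast

for obs in test:
    # Re-fit a simple AR(1)-style model on all data seen so far.
    h = np.asarray(history)
    X = np.column_stack([np.ones(len(h) - 1), h[:-1]])
    coef, *_ = np.linalg.lstsq(X, h[1:], rcond=None)
    predictions.append(coef[0] + coef[1] * history[-1])
    history.append(obs)  # the true value joins the training set

rmse = np.sqrt(np.mean((np.array(predictions) - test) ** 2))
print(round(rmse, 3))
```

Each loop iteration mirrors the "history"/"prediction" data structures mentioned in the text: forecast one step, observe the truth, append it to history, and re-fit before the next step.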
It stores two data structures: "history", which holds the accumulatively added training data at each iteration, and "prediction", which holds the continuously predicted values for the test data. Seasonal ARIMA models, Source: http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/tutorials/xegbohtmlnode44.html

Ensemble learning with ARIMA

Three prediction models, namely ARIMA, Multilayer Perceptron (MLP), and Multiple Linear Regression (MLR), are trained, validated, and tested individually to obtain the target pollutant-concentration prediction. To train and fit the ARIMA model, the p, d, and q values are estimated from the autocorrelation function (ACF) and partial autocorrelation function (PACF). The MLP model is built with the following parameters: the solver used for weight optimization is 'lbfgs', as it converges faster and performs better on low-dimensional data, giving better results than the stochastic gradient descent optimizer; the activation function is 'relu' (Rectified Linear Unit), which avoids the vanishing gradient problem. Source: http://nebula.wsimg.com/5b49ad24a16af2a07990487493272154?AccessKeyId=DFB1BA3CED7E7997D5B1&disposition=0&alloworigin=1

The predictions from each model are then combined into a final prediction using a weighted average ensemble: the prediction of each model is multiplied by a weight and the results are averaged. The weight of each base model is adjusted according to its performance, with better-performing models given more weight.
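The weighted average ensemble step might look like this in numpy (the predictions and weights are made up for illustration; in practice the weights come from each model's validation performance and must sum to 1):

```python
import numpy as np

# Hypothetical predictions from the three base models for five test points.
pred_arima = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
pred_mlp   = np.array([10.5, 10.8, 12.2, 13.1, 13.7])
pred_mlr   = np.array([ 9.8, 11.3, 11.9, 12.8, 14.2])

# Weights reflect validation performance; here the (illustratively)
# best-performing MLP gets the largest weight. They must sum to 1.
weights = np.array([0.3, 0.5, 0.2])
assert abs(weights.sum() - 1.0) < 1e-9

stacked = np.vstack([pred_arima, pred_mlp, pred_mlr])
ensemble = weights @ stacked  # weighted average per test point
print(ensemble)
```

Setting all weights to 1/3 recovers a plain average, so the weighted ensemble strictly generalizes simple model averaging.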
The weights are assigned such that they sum to 1.

SARIMA

ARIMA is one of the most widely used forecasting methods for univariate time series data, but it does not support time series with a seasonal component. The ARIMA model is extended to SARIMA (Seasonal Autoregressive Integrated Moving Average) to support the seasonal component of the series. SARIMA is used on univariate data containing trends and seasonality, and is composed of trend and seasonal elements of the series.

The parameters shared with the ARIMA model are:

p: Trend autoregression order.
d: Trend difference order.
q: Trend moving average order.

The four seasonal elements that are not part of ARIMA are:

P: Seasonal autoregressive order.
D: Seasonal difference order.
Q: Seasonal moving average order.
m: The number of time steps in a single seasonal period.

Thus a SARIMA model can be specified as SARIMA(p, d, q)(P, D, Q)m. If m is 12, the data is monthly with a yearly seasonal cycle.

SARIMA time series models can also be combined with spatial and event-based models to yield ensemble models that solve multi-dimensional ML problems. Such a model can be designed to predict cell load in cellular networks at different times of the day throughout the year, as illustrated in the sample figure:

Autocorrelation, trend, and seasonality (weekday and weekend effects) from time series analysis can be used to interpret temporal influence.
Regional and cell-wise load distribution can be used to predict sparse and overloaded cells over varying intervals of time.
Events (holidays, special mass gatherings, and others) can be predicted using decision trees.
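To connect the seasonal parameters to data, here is a small numpy sketch (synthetic monthly series; all values are illustrative) showing how the seasonal differencing step D = 1 with m = 12 and the trend differencing step d = 1 of a SARIMA(p, 1, q)(P, 1, Q)12 model strip out a yearly cycle and a trend:

```python
import numpy as np

# Synthetic monthly series: linear trend + yearly (m = 12) cycle + noise.
rng = np.random.default_rng(3)
t = np.arange(72)
series = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.5, size=72)

m = 12
seasonal_diff = series[m:] - series[:-m]             # D = 1: removes the yearly cycle
trend_diff = seasonal_diff[1:] - seasonal_diff[:-1]  # d = 1: removes the trend

# After both differencing steps the standard deviation drops sharply,
# indicating that trend and seasonality have largely been removed.
print(series.std(), trend_diff.std())
```

The remaining series is close to white noise, which is exactly the state in which the AR and MA orders (p, q, P, Q) are then identified from the ACF and PACF.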
Source (cell-load prediction figure): https://www.cse.cuhk.edu.hk/lyu/_media/conference/slzhao_sigspatial17.pdf?id=publications%3Aall_by_year&cache=cache

Dataset, Problem, and Model Selection

When analyzing whether the problem at hand should be solved with classical machine learning or deep learning mechanisms, certain factors need to be taken into consideration before conclusively choosing the right model:

The amount by which performance metrics differ between classical time-series models (ARIMA/SARIMA) and deep learning models.
The long-term or short-term business impact created by the model selection.
The design, implementation, and maintenance cost of the more complex model.
The loss of interpretability.

First, the data are highly dynamic, and it is often difficult to tease out the structure embedded in time series data. Second, time series data can be nonlinear and contain a highly complex autocorrelation structure: data points across different periods of time can be correlated with each other, and a linear approximation sometimes fails to model all the structure in the data. Traditional methods such as autoregressive models attempt to estimate the parameters of a model that can be viewed as a smooth approximation of the structure that generated the data.

Under the above factors, ARIMA has been found to better model data that follow linear relationships, while RNNs (depending on the activation function) better model data with nonlinear relationships. ARIMA therefore offers data scientists a good first choice to apply to a dataset.
Such datasets can then be further processed with nonlinear models like RNNs when the Lee, White and Granger (LWG) test shows that the residuals still contain nonlinear relationships. In one comparison applying LSTM and ARIMA to a set of financial data, the results indicated that LSTM was superior to ARIMA, with the LSTM-based algorithm reducing the prediction error by 85% on average compared to ARIMA.

Conclusion

The study concludes with case studies of why specific machine learning methods perform so poorly in practice, given their impressive performance in other areas of artificial intelligence. The challenge remains open: to evaluate the reasons for the poor performance of ARIMA/SARIMA and LSTM models, and to devise mechanisms to improve their performance and accuracy. Some of the areas of application of the models and their performance are listed below:

ARIMA yields better results for short-term forecasting, whereas LSTM yields better results for long-term modeling.
Traditional time series forecasting methods (ARIMA) focus on univariate data with linear relationships and fixed, manually diagnosed temporal dependence.
For machine learning problems with substantial datasets, the average reduction in error rates obtained by LSTM is found to be between 84 and 87 percent when compared to ARIMA, indicating the superiority of LSTM.
The number of training passes, known as "epochs" in deep learning, has no effect on the performance of the trained forecast model, and it exhibits truly random behavior.
LSTMs, when compared to simpler NNs like RNNs and MLPs, appear to be more suited to fitting or overfitting the training dataset than to forecasting it.
Neural networks (LSTMs and other deep learning methods) with huge datasets offer ways to divide the data into several smaller batches and train the network in multiple stages, where the batch size refers to the number of training examples in each chunk.
The term iteration refers to the number of batches needed to train the model on the entire dataset once.
LSTM is undoubtedly more complicated and difficult to train, and in most cases does not exceed the performance of a simple ARIMA model.
Classical methods like ETS and ARIMA out-perform machine learning and deep learning methods for one-step forecasting on univariate datasets.
Classical methods like Theta and ARIMA out-perform machine learning and deep learning methods for multi-step forecasting on univariate datasets.
Classical methods like ARIMA focus on fixed temporal dependence (the relationship between observations at different times), which necessitates analysis and specification of the number of lag observations provided as input.
Machine learning and deep learning methods do not yet deliver on their promise for univariate time series forecasting, and there is much research left to be done.
Neural networks add the capability to learn possibly noisy and nonlinear relationships with arbitrarily defined but fixed numbers of inputs. In addition, NNs support multivariate and multi-step forecasting.
Recurrent neural networks (RNNs) add explicit handling of ordered observations and can learn temporal dependencies from context: seeing one observation at a time from a sequence, an RNN can learn which of its previous observations are relevant and use them in forecasting.
As LSTMs are equipped to learn long-term correlations in a sequence, they can model complex multivariate sequences without the need to specify a time window.

References

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0194889
https://machinelearningmastery.com/findings-comparing-classical-and-machine-learning-methods-for-time-series-forecasting/
https://arxiv.org/pdf/1803.06386.pdf
https://pdfs.semanticscholar.org/e58c/7343ea25d05f6d859d66d6bb7fb91ecf9c2f.pdf
Krstanovic and H. Paulheim, "Ensembles of recurrent neural networks for robust time series forecasting", in Artificial Intelligence XXXIV, M. Bramer and M. Petridis, Eds., Cham: Springer International Publishing, 2017, pp. 34–46, ISBN: 978-3-319-71078-5.
https://link.springer.com/chapter/10.1007/978-3-319-93713-7_55
http://nebula.wsimg.com/5b49ad24a16af2a07990487493272154?AccessKeyId=DFB1BA3CED7E7997D5B1&disposition=0&alloworigin=1
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3641111/
Traffic Prediction Based Power Saving in Cellular Networks: A Machine Learning Method, https://www.cse.cuhk.edu.hk/lyu/_media/conference/slzhao_sigspatial17.pdf?id=publications%3Aall_by_year&cache=cache