
Book: Machine Learning: A Probabilistic Perspective

Today’s Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. 

The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package—PMTK (probabilistic modeling toolkit)—that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
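
To give a flavor of the principled, model-based approach the book advocates, here is a minimal illustrative sketch in Python (the book's own PMTK toolkit is MATLAB, and this snippet is not part of it): it performs the conjugate beta-binomial update covered in Chapter 3, with a Beta(2, 2) prior and toy coin-toss counts chosen purely for illustration.

```python
# Illustrative beta-binomial model (cf. Chapter 3): with a Beta(a, b) prior on
# the success probability theta and N1 heads out of N tosses, the posterior is
# Beta(a + N1, b + N0) and the posterior predictive probability of heads is
# (a + N1) / (a + b + N).
from scipy import stats

def beta_binomial_posterior(n_heads, n_tails, a=2.0, b=2.0):
    """Return the posterior over theta and the posterior predictive P(heads)."""
    post = stats.beta(a + n_heads, b + n_tails)            # conjugate update
    p_heads = (a + n_heads) / (a + b + n_heads + n_tails)  # predictive prob. of heads
    return post, p_heads

if __name__ == "__main__":
    post, p_heads = beta_binomial_posterior(n_heads=17, n_tails=3)
    lo, hi = post.interval(0.95)                            # 95% credible interval
    print(f"Posterior mean of theta: {post.mean():.3f}")
    print(f"95% credible interval:   ({lo:.3f}, {hi:.3f})")
    print(f"P(next toss = heads):    {p_heads:.3f}")
```

With these toy counts the posterior mean of theta equals the predictive probability of heads, which is a general property of the conjugate beta-binomial update.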


Contents

1 Introduction 1

1.1 Machine learning: what and why? 1

  • Types of machine learning 2

1.2 Supervised learning 3

  • Classification 3
  • Regression 8

1.3 Unsupervised learning 9

  • Discovering clusters 10
  • Discovering latent factors 11
  • Discovering graph structure 13
  • Matrix completion 14

1.4 Some basic concepts in machine learning 16

  • Parametric vs non-parametric models 16
  • A simple non-parametric classifier: K-nearest neighbors 16
  • The curse of dimensionality 18
  • Parametric models for classification and regression 19
  • Linear regression 19
  • Logistic regression 21
  • Overfitting 22
  • Model selection 22
  • No free lunch theorem 24

2 Probability 27

2.1 Introduction 27

2.2 A brief review of probability theory 28

  • Discrete random variables 28
  • Fundamental rules 28
  • Bayes rule 29
  • Independence and conditional independence 30
  • Continuous random variables 32
  • Quantiles 33
  • Mean and variance 33

2.3 Some common discrete distributions 34

  • The binomial and Bernoulli distributions 34
  • The multinomial and multinoulli distributions 35
  • The Poisson distribution 37
  • The empirical distribution 37

2.4 Some common continuous distributions 38

  • Gaussian (normal) distribution 38
  • Degenerate pdf 39
  • The Laplace distribution 41
  • The gamma distribution 41
  • The beta distribution 42
  • Pareto distribution 43

2.5 Joint probability distributions 44

  • Covariance and correlation 44
  • The multivariate Gaussian 46
  • Multivariate Student t distribution 46
  • Dirichlet distribution 47

2.6 Transformations of random variables 49

  • 2.6.1 Linear transformations 49
  • 2.6.2 General transformations 50
  • 2.6.3 Central limit theorem 51

2.7 Monte Carlo approximation 52

  • 2.7.1 Example: change of variables, the MC way 53
  • 2.7.2 Example: estimating π by Monte Carlo integration 54
  • 2.7.3 Accuracy of Monte Carlo approximation 54

2.8 Information theory 56

  • Entropy 56
  • KL divergence 57
  • Mutual information 59

3 Generative models for discrete data 65

3.1 Introduction 65

3.2 Bayesian concept learning 65

  • 3.2.1 Likelihood 67
  • 3.2.2 Prior 67
  • 3.2.3 Posterior 68
  • 3.2.4 Posterior predictive distribution 71
  • 3.2.5 A more complex prior 72

3.3 The beta-binomial model 72

  • 3.3.1 Likelihood 73
  • 3.3.2 Prior 74
  • 3.3.3 Posterior 75
  • 3.3.4 Posterior predictive distribution 77

3.4 The Dirichlet-multinomial model 78

  • Likelihood 79
  • Prior 79
  • Posterior 79
  • Posterior predictive 81

3.5 Naive Bayes classifiers 82

  • Model fitting 83
  • Using the model for prediction 85
  • The log-sum-exp trick 86
  • Feature selection using mutual information 86
  • Classifying documents using bag of words 87

4 Gaussian models 97

4.1 Introduction 97

  • Notation 97
  • Basics 97
  • MLE for an MVN 99
  • Maximum entropy derivation of the Gaussian * 101

4.2 Gaussian discriminant analysis 101

  • Quadratic discriminant analysis (QDA) 102
  • Linear discriminant analysis (LDA) 103
  • Two-class LDA 104
  • MLE for discriminant analysis 106
  • Strategies for preventing overfitting 106
  • Regularized LDA * 107
  • Diagonal LDA 108
  • Nearest shrunken centroids classifier * 109

4.3 Inference in jointly Gaussian distributions 110

  • Statement of the result 111
  • Examples 111
  • Information form 115
  • Proof of the result * 116

4.4 Linear Gaussian systems 119

  • Statement of the result 119
  • Examples 120
  • Proof of the result * 124

4.5 Digression: The Wishart distribution * 125

  • Inverse Wishart distribution 126
  • Visualizing the Wishart distribution * 127

4.6 Inferring the parameters of an MVN 127

  • Posterior distribution of μ 128
  • Posterior distribution of Σ * 128
  • Posterior distribution of μ and Σ * 132
  • Sensor fusion with unknown precisions * 138

5 Bayesian statistics 149

5.1 Introduction 149

5.2 Summarizing posterior distributions 149

  • MAP estimation 149
  • Credible intervals 152
  • Inference for a difference in proportions 154

5.3 Bayesian model selection 155

  • 5.3.1 Bayesian Occam’s razor 156
  • 5.3.2 Computing the marginal likelihood (evidence) 158
  • 5.3.3 Bayes factors 163
  • 5.3.4 Jeffreys-Lindley paradox * 164

5.4 Priors 165

  • Uninformative priors 165
  • Jeffreys priors * 166
  • Robust priors 168
  • Mixtures of conjugate priors 168

5.5 Hierarchical Bayes 171

  • Example: modeling related cancer rates 171

5.6 Empirical Bayes 172

  • Example: beta-binomial model 173
  • Example: Gaussian-Gaussian model 173

5.7 Bayesian decision theory 176

  • Bayes estimators for common loss functions 177
  • The false positive vs false negative tradeoff 180
  • Other topics * 184

6 Frequentist statistics 191

6.1 Introduction 191

6.2 Sampling distribution of an estimator 191

  • Bootstrap 192
  • Large sample theory for the MLE * 193

6.3 Frequentist decision theory 194

  • Bayes risk 195
  • Minimax risk 196
  • Admissible estimators 197

6.4 Desirable properties of estimators 200

  • Consistent estimators 200
  • Unbiased estimators 200
  • Minimum variance estimators 201
  • The bias-variance tradeoff 202

6.5 Empirical risk minimization 204

  • Regularized risk minimization 205
  • Structural risk minimization 206
  • Estimating the risk using cross validation 206
  • Upper bounding the risk using statistical learning theory * 209
  • Surrogate loss functions 210

6.6 Pathologies of frequentist statistics * 211

  • Counter-intuitive behavior of confidence intervals 212
  • p-values considered harmful 213
  • The likelihood principle 214
  • Why isn’t everyone a Bayesian? 215

7 Linear regression 217

7.1 Introduction 217

7.2 Model specification 217

7.3 Maximum likelihood estimation (least squares) 217

  • Derivation of the MLE 219
  • Geometric interpretation 220
  • Convexity 221

7.4 Robust linear regression * 223

7.5 Ridge regression 225

  • Basic idea 225
  • Numerically stable computation * 227
  • Connection with PCA * 228
  • Regularization effects of big data 230

7.6 Bayesian linear regression 231

  • Computing the posterior 232
  • Computing the posterior predictive 233
  • Bayesian inference when σ² is unknown * 234
  • EB for linear regression (evidence procedure) 238

8 Logistic regression 245

8.1 Introduction 245

8.2 Model specification 245

8.3 Model fitting 245

  • MLE 246
  • Steepest descent 247
  • Newton’s method 249
  • Iteratively reweighted least squares (IRLS) 250
  • Quasi-Newton (variable metric) methods 251
  • Regularization 252
  • Multi-class logistic regression 252

8.4 Bayesian logistic regression 254

  • Laplace approximation 255
  • Derivation of the BIC 255
  • Gaussian approximation for logistic regression 256
  • Approximating the posterior predictive 256
  • Residual analysis (outlier detection) * 260

8.5 Online learning and stochastic optimization 261

  • Online learning and regret minimization 262
  • Stochastic optimization and risk minimization 262
  • The LMS algorithm 264
  • The perceptron algorithm 265
  • A Bayesian view 266

8.6 Generative vs discriminative classifiers 267

  • Pros and cons of each approach 268
  • Dealing with missing data 269
  • Fisher’s linear discriminant analysis (FLDA) * 271

9 Generalized linear models and the exponential family 281

9.1 Introduction 281

9.2 The exponential family 281

  • Definition 282
  • Examples 282
  • Log partition function 284
  • MLE for the exponential family 286
  • Bayes for the exponential family * 287
  • Maximum entropy derivation of the exponential family * 289

9.3 Generalized linear models (GLMs) 290

  • Basics 290
  • ML and MAP estimation 292
  • Bayesian inference 293

9.4 Probit regression 293

  • ML/MAP estimation using gradient-based optimization 294
  • Latent variable interpretation 294
  • Ordinal probit regression * 295
  • Multinomial probit models * 295

9.5 Multi-task learning 296

  • Hierarchical Bayes for multi-task learning 296
  • Application to personalized email spam filtering 296
  • Application to domain adaptation 297
  • Other kinds of prior 297

9.6 Generalized linear mixed models * 298

  • Example: semi-parametric GLMMs for medical data 298
  • Computational issues 300

9.7 Learning to rank * 300

  • The pointwise approach 301
  • The pairwise approach 301
  • The listwise approach 302
  • Loss functions for ranking 303

10 Directed graphical models (Bayes nets) 307

10.1 Introduction 307

  • Chain rule 307
  • Conditional independence 308
  • Graphical models 308
  • Graph terminology 309
  • Directed graphical models 310

10.2 Examples 311

  • Naive Bayes classifiers 311
  • Markov and hidden Markov models 312
  • Medical diagnosis 313
  • Genetic linkage analysis * 315
  • Directed Gaussian graphical models * 318

10.3 Inference 319

10.4 Learning 320

  • Plate notation 320
  • Learning from complete data 322
  • Learning with missing and/or latent variables 323

10.5 Conditional independence properties of DGMs 324

  • d-separation and the Bayes Ball algorithm (global Markov properties) 324
  • Other Markov properties of DGMs 327
  • Markov blanket and full conditionals 327

10.6 Influence (decision) diagrams * 328

11 Mixture models and the EM algorithm 337

11.1 Latent variable models 337

11.2 Mixture models 337

  • 11.2.1 Mixtures of Gaussians 339
  • 11.2.2 Mixture of multinoullis 340
  • 11.2.3 Using mixture models for clustering 340
  • 11.2.4 Mixtures of experts 342

11.3 Parameter estimation for mixture models 345

  • 11.3.1 Unidentifiability 346
  • 11.3.2 Computing a MAP estimate is non-convex 347

11.4 The EM algorithm 348

  • Basic idea 349
  • EM for GMMs 350
  • EM for mixture of experts 357
  • EM for DGMs with hidden variables 358
  • EM for the Student distribution * 359
  • EM for probit regression * 362
  • Theoretical basis for EM * 363
  • Online EM 365
  • Other EM variants * 367

11.5 Model selection for latent variable models 370

  • Model selection for probabilistic models 370
  • Model selection for non-probabilistic methods 370

11.6 Fitting models with missing data 372

  • EM for the MLE of an MVN with missing data 373

12 Latent linear models 381

12.1 Factor analysis 381

  • FA is a low rank parameterization of an MVN 381
  • Inference of the latent factors 382
  • Unidentifiability 383
  • Mixtures of factor analysers 385
  • EM for factor analysis models 386
  • Fitting FA models with missing data 387

12.2 Principal components analysis (PCA) 387

  • Classical PCA: statement of the theorem 387
  • Proof * 389
  • Singular value decomposition (SVD) 392
  • Probabilistic PCA 395
  • EM algorithm for PCA 396

12.3 Choosing the number of latent dimensions 398

  • Model selection for FA/PPCA 398
  • Model selection for PCA 399

12.4 PCA for categorical data 402

12.5 PCA for paired and multi-view data 404

  • Supervised PCA (latent factor regression) 405
  • Partial least squares 406
  • Canonical correlation analysis 407

12.6 Independent Component Analysis (ICA) 407

  • Maximum likelihood estimation 410
  • The FastICA algorithm 411
  • Using EM 414
  • Other estimation principles * 415

13 Sparse linear models 421

13.1 Introduction 421

13.2 Bayesian variable selection 422

  • The spike and slab model 424
  • From the Bernoulli-Gaussian model to L0 regularization 425
  • Algorithms 426

13.3 L1 regularization: basics 429

  • Why does L1 regularization yield sparse solutions? 430
  • Optimality conditions for lasso 431
  • Comparison of least squares, lasso, ridge and subset selection 435
  • Regularization path 436
  • Model selection 439
  • Bayesian inference for linear models with Laplace priors 440

13.4 L1 regularization: algorithms 441

  • Coordinate descent 441
  • LARS and other homotopy methods 441
  • Proximal and gradient projection methods 442
  • EM for lasso 447

13.5 L1 regularization: extensions 449

  • Group Lasso 449
  • Fused lasso 454
  • Elastic net (ridge and lasso combined) 455

13.6 Non-convex regularizers 457

  • Bridge regression 458
  • Hierarchical adaptive lasso 458
  • Other hierarchical priors 462

13.7 Automatic relevance determination (ARD)/sparse Bayesian learning (SBL) 463

  • ARD for linear regression 463
  • Whence sparsity? 465
  • Connection to MAP estimation 465
  • Algorithms for ARD * 466
  • ARD for logistic regression 468

13.8 Sparse coding * 468

  • Learning a sparse coding dictionary 469
  • Results of dictionary learning from image patches 470
  • Compressed sensing 472
  • Image inpainting and denoising 472

14 Kernels 479

14.1 Introduction 479

14.2 Kernel functions 479

  • RBF kernels 480
  • Kernels for comparing documents 480
  • Mercer (positive definite) kernels 481
  • Linear kernels 482
  • Matern kernels 482
  • String kernels 483
  • Pyramid match kernels 484
  • Kernels derived from probabilistic generative models 485

14.3 Using kernels inside GLMs 486

  • Kernel machines 486
  • L1VMs, RVMs, and other sparse vector machines 487

14.4 The kernel trick 488

  • Kernelized nearest neighbor classification 489
  • Kernelized K-medoids clustering 489
  • Kernelized ridge regression 492
  • Kernel PCA 493

14.5 Support vector machines (SVMs) 496

  • SVMs for regression 497
  • SVMs for classification 498
  • Choosing C 504
  • Summary of key points 504
  • A probabilistic interpretation of SVMs 505

14.6 Comparison of discriminative kernel methods 505

14.7 Kernels for building generative models 507

  • Smoothing kernels 507
  • Kernel density estimation (KDE) 508
  • From KDE to KNN 509
  • Kernel regression 510
  • Locally weighted regression 512

15 Gaussian processes 515

15.1 Introduction 515

15.2 GPs for regression 516

  • Predictions using noise-free observations 517
  • Predictions using noisy observations 518
  • Effect of the kernel parameters 519
  • Estimating the kernel parameters 521
  • Computational and numerical issues * 524
  • Semi-parametric GPs * 524

15.3 GPs meet GLMs 525

  • Binary classification 525
  • Multi-class classification 528
  • GPs for Poisson regression 531

15.4 Connection with other methods 532

  • Linear models compared to GPs 532
  • Linear smoothers compared to GPs 533
  • SVMs compared to GPs 534
  • L1VM and RVMs compared to GPs 534
  • Neural networks compared to GPs 535
  • Smoothing splines compared to GPs * 536
  • RKHS methods compared to GPs * 538

15.5 GP latent variable model 540

15.6 Approximation methods for large datasets 542

16 Adaptive basis function models 543

16.1 Introduction 543

16.2 Classification and regression trees (CART) 544

  • Basics 544
  • Growing a tree 545
  • Pruning a tree 549
  • Pros and cons of trees 550
  • Random forests 550
  • CART compared to hierarchical mixture of experts * 551

16.3 Generalized additive models 552

  • Backfitting 552
  • Computational efficiency 553
  • Multivariate adaptive regression splines (MARS) 553

16.4 Boosting 554

  • Forward stagewise additive modeling 555
  • L2boosting 557
  • AdaBoost 558
  • LogitBoost 559
  • Boosting as functional gradient descent 560
  • Sparse boosting 561
  • Multivariate adaptive regression trees (MART) 562
  • Why does boosting work so well? 562
  • A Bayesian view 563

16.5 Feedforward neural networks (multilayer perceptrons) 563

  • Convolutional neural networks 564
  • Other kinds of neural networks 568
  • A brief history of the field 568
  • The backpropagation algorithm 569
  • Identifiability 572
  • Regularization 572
  • Bayesian inference * 576

16.6 Ensemble learning 580

  • Stacking 580
  • Error-correcting output codes 581
  • Ensemble learning is not equivalent to Bayes model averaging 581

16.7 Experimental comparison 582

  • Low-dimensional features 582
  • High-dimensional features 583

16.8 Interpreting black-box models 585

17 Markov and hidden Markov models 589

17.1 Introduction 589

17.2 Markov models 589

  • Transition matrix 589
  • Application: Language modeling 591
  • Stationary distribution of a Markov chain * 596
  • Application: Google’s PageRank algorithm for web page ranking * 600

17.3 Hidden Markov models 603

  • Applications of HMMs 604

17.4 Inference in HMMs 606

  • Types of inference problems for temporal models 606
  • The forwards algorithm 609
  • The forwards-backwards algorithm 610
  • The Viterbi algorithm 612
  • Forwards filtering, backwards sampling 616

17.5 Learning for HMMs 617

  • Training with fully observed data 617
  • EM for HMMs (the Baum-Welch algorithm) 618
  • Bayesian methods for “fitting” HMMs * 620
  • Discriminative training 620
  • Model selection 621

17.6 Generalizations of HMMs 621

  • Variable duration (semi-Markov) HMMs 622
  • Hierarchical HMMs 624
  • Input-output HMMs 625
  • Auto-regressive and buried HMMs 626
  • Factorial HMM 627
  • Coupled HMM and the influence model 628
  • Dynamic Bayesian networks (DBNs) 628

18 State space models 631

18.1 Introduction 631

18.2 Applications of SSMs 632

  • SSMs for object tracking 632
  • Robotic SLAM 633
  • Online parameter learning using recursive least squares 636
  • SSM for time series forecasting * 637

18.3 Inference in LG-SSM 640

  • The Kalman filtering algorithm 640
  • The Kalman smoothing algorithm 643

18.4 Learning for LG-SSM 646

  • Identifiability and numerical stability 646
  • Training with fully observed data 647
  • EM for LG-SSM 647
  • Subspace methods 647
  • Bayesian methods for “fitting” LG-SSMs 647

18.5 Approximate online inference for non-linear, non-Gaussian SSMs 647

  • Extended Kalman filter (EKF) 648
  • Unscented Kalman filter (UKF) 650
  • Assumed density filtering (ADF) 652

18.6 Hybrid discrete/continuous SSMs 655

  • Inference 656
  • Application: data association and multi-target tracking 658
  • Application: fault diagnosis 659
  • Application: econometric forecasting 660

19 Undirected graphical models (Markov random fields) 661

19.1 Introduction 661

19.2 Conditional independence properties of UGMs 661

  • Key properties 661
  • An undirected alternative to d-separation 663
  • Comparing directed and undirected graphical models 664

19.3 Parameterization of MRFs 665

  • The Hammersley-Clifford theorem 665
  • Representing potential functions 667

19.4 Examples of MRFs 668

  • Ising model 668
  • Hopfield networks 669
  • Potts model 671
  • Gaussian MRFs 672
  • Markov logic networks * 674

19.5 Learning 676

  • Training maxent models using gradient methods 676
  • Training partially observed maxent models 677
  • Approximate methods for computing the MLEs of MRFs 678
  • Pseudo likelihood 678
  • Stochastic maximum likelihood 679
  • Feature induction for maxent models * 680
  • Iterative proportional fitting (IPF) * 681

19.6 Conditional random fields (CRFs) 684

  • Chain-structured CRFs, MEMMs and the label-bias problem 684
  • Applications of CRFs 686
  • CRF training 692

19.7 Structural SVMs 693

  • SSVMs: a probabilistic view 693
  • SSVMs: a non-probabilistic view 695
  • Cutting plane methods for fitting SSVMs 698
  • Online algorithms for fitting SSVMs 700
  • Latent structural SVMs 701

20 Exact inference for graphical models 707

20.1 Introduction 707

20.2 Belief propagation for trees 707

  • Serial protocol 707
  • Parallel protocol 709
  • Gaussian BP * 710
  • Other BP variants * 712

20.3 The variable elimination algorithm 714

  • The generalized distributive law * 717
  • Computational complexity of VE 717
  • A weakness of VE 720

20.4 The junction tree algorithm * 720

  • Creating a junction tree 720
  • Message passing on a junction tree 722
  • Computational complexity of JTA 725
  • JTA generalizations * 726

20.5 Computational intractability of exact inference in the worst case 726

  • Approximate inference 727

21 Variational inference 731

21.1 Introduction 731

21.2 Variational inference 732

  • Alternative interpretations of the variational objective 733
  • Forward or reverse KL? * 733

21.3 The mean field method 735

  • Derivation of the mean field update equations 736
  • Example: mean field for the Ising model 737

21.4 Structured mean field * 739

  • Example: factorial HMM 740

21.5 Variational Bayes 742

  • Example: VB for a univariate Gaussian 742
  • Example: VB for linear regression 746

21.6 Variational Bayes EM 749

  • 21.6.1 Example: VBEM for mixtures of Gaussians * 750

21.7 Variational message passing and VIBES 756

21.8 Local variational bounds * 756

  • Motivating applications 756
  • Bohning’s quadratic bound to the log-sum-exp function 758
  • Bounds for the sigmoid function 760
  • Other bounds and approximations to the log-sum-exp function * 762
  • Variational inference based on upper bounds 763

22 More variational inference 767

22.1 Introduction 767

22.2 Loopy belief propagation: algorithmic issues 767

  • A brief history 767
  • LBP on pairwise models 768
  • LBP on a factor graph 769
  • Convergence 771
  • Accuracy of LBP 774
  • Other speedup tricks for LBP * 775

22.3 Loopy belief propagation: theoretical issues * 776

  • UGMs represented in exponential family form 776
  • The marginal polytope 777
  • Exact inference as a variational optimization problem 778
  • Mean field as a variational optimization problem 779
  • LBP as a variational optimization problem 779
  • Loopy BP vs mean field 783

22.4 Extensions of belief propagation * 783

  • Generalized belief propagation 783
  • Convex belief propagation 785

22.5 Expectation propagation 787

  • EP as a variational inference problem 788
  • Optimizing the EP objective using moment matching 789
  • EP for the clutter problem 791
  • LBP is a special case of EP 792
  • Ranking players using TrueSkill 793
  • Other applications of EP 799

22.6 MAP state estimation 799

  • Linear programming relaxation 799
  • Max-product belief propagation 800
  • Graphcuts 801
  • Experimental comparison of graphcuts and BP 804
  • Dual decomposition 806

23 Monte Carlo inference 815

23.1 Introduction 815

23.2 Sampling from standard distributions 815

  • Using the cdf 815
  • Sampling from a Gaussian (Box-Muller method) 817

23.3 Rejection sampling 817

  • Basic idea 817
  • Example 818
  • Application to Bayesian statistics 819
  • Adaptive rejection sampling 819
  • Rejection sampling in high dimensions 820

23.4 Importance sampling 820

  • Basic idea 820
  • Handling unnormalized distributions 821
  • Importance sampling for a DGM: likelihood weighting 822
  • Sampling importance resampling (SIR) 822

23.5 Particle filtering 823

  • Sequential importance sampling 824
  • The degeneracy problem 825
  • The resampling step 825
  • The proposal distribution 827
  • Application: robot localization 828
  • Application: visual object tracking 828
  • Application: time series forecasting 831

23.6 Rao-Blackwellised particle filtering (RBPF) 831

  • RBPF for switching LG-SSMs 831
  • Application: tracking a maneuvering target 832
  • Application: Fast SLAM 834

24 Markov chain Monte Carlo (MCMC) inference 837

24.1 Introduction 837

24.2 Gibbs sampling 838

  • Basic idea 838
  • Example: Gibbs sampling for the Ising model 838
  • Example: Gibbs sampling for inferring the parameters of a GMM 840
  • Collapsed Gibbs sampling * 841
  • Gibbs sampling for hierarchical GLMs 844
  • BUGS and JAGS 846
  • The Imputation Posterior (IP) algorithm 847
  • Blocking Gibbs sampling 847

24.3 Metropolis Hastings algorithm 848

  • Basic idea 848
  • Gibbs sampling is a special case of MH 849
  • Proposal distributions 850
  • Adaptive MCMC 853
  • Initialization and mode hopping 854
  • Why MH works * 854
  • Reversible jump (trans-dimensional) MCMC * 855

24.4 Speed and accuracy of MCMC 856

  • The burn-in phase 856
  • Mixing rates of Markov chains * 857
  • Practical convergence diagnostics 858
  • Accuracy of MCMC 860
  • How many chains? 862

24.5 Auxiliary variable MCMC * 863

  • Auxiliary variable sampling for logistic regression 863
  • Slice sampling 864
  • Swendsen Wang 866
  • Hybrid/Hamiltonian MCMC * 868

24.6 Annealing methods 868

  • Simulated annealing 869
  • Annealed importance sampling 871
  • Parallel tempering 871

24.7 Approximating the marginal likelihood 872

  • The candidate method 872
  • Harmonic mean estimate 872
  • Annealed importance sampling 873

25 Clustering 875

25.1 Introduction 875

  • Measuring (dis)similarity 875
  • Evaluating the output of clustering methods * 876

25.2 Dirichlet process mixture models 879

  • From finite to infinite mixture models 879
  • The Dirichlet process 882
  • Applying Dirichlet processes to mixture modeling 885
  • Fitting a DP mixture model 886

25.3 Affinity propagation 887

25.4 Spectral clustering 890

  • Graph Laplacian 891
  • Normalized graph Laplacian 892
  • Example 893

25.5 Hierarchical clustering 893

  • Agglomerative clustering 895
  • Divisive clustering 898
  • Choosing the number of clusters 899
  • Bayesian hierarchical clustering 899

25.6 Clustering datapoints and features 901

  • Biclustering 903
  • Multi-view clustering 903

26 Graphical model structure learning 907

26.1 Introduction 907

26.2 Structure learning for knowledge discovery 908

  • Relevance networks 908
  • Dependency networks 909

26.3 Learning tree structures 910

  • Directed or undirected tree? 911
  • Chow-Liu algorithm for finding the ML tree structure 912
  • Finding the MAP forest 912
  • Mixtures of trees 914

26.4 Learning DAG structures 914

  • Markov equivalence 914
  • Exact structural inference 916
  • Scaling up to larger graphs 920

26.5 Learning DAG structure with latent variables 922

  • Approximating the marginal likelihood when we have missing data 922
  • Structural EM 925
  • Discovering hidden variables 926
  • Case study: Google’s Rephil 928
  • Structural equation models * 929

26.6 Learning causal DAGs 931

  • Causal interpretation of DAGs 931
  • Using causal DAGs to resolve Simpson’s paradox 933
  • Learning causal DAG structures 935

26.7 Learning undirected Gaussian graphical models 938

  • MLE for a GGM 938
  • Graphical lasso 939
  • Bayesian inference for GGM structure * 941
  • Handling non-Gaussian data using copulas * 942

26.8 Learning undirected discrete graphical models 942

  • Graphical lasso for MRFs/CRFs 942
  • Thin junction trees 944

27 Latent variable models for discrete data 945

27.1 Introduction 945

27.2 Distributed state LVMs for discrete data 946

  • 27.2.1 Mixture models 946
  • 27.2.2 Exponential family PCA 947
  • 27.2.3 LDA and mPCA 948
  • 27.2.4 GaP model and non-negative matrix factorization 949

27.3 Latent Dirichlet allocation (LDA) 950

  • 27.3.1 Basics 950
  • 27.3.2 Unsupervised discovery of topics 953
  • 27.3.3 Quantitatively evaluating LDA as a language model 953
  • 27.3.4 Fitting using (collapsed) Gibbs sampling 955
  • 27.3.5 Example 956
  • 27.3.6 Fitting using batch variational inference 957
  • 27.3.7 Fitting using online variational inference 959
  • 27.3.8 Determining the number of topics 960

27.4 Extensions of LDA 961

  • Correlated topic model 961
  • Dynamic topic model 962
  • LDA-HMM 963
  • Supervised LDA 967

27.5 LVMs for graph-structured data 970

  • Stochastic block model 971
  • Mixed membership stochastic block model 973
  • Relational topic model 974

27.6 LVMs for relational data 975

  • Infinite relational model 976
  • Probabilistic matrix factorization for collaborative filtering 979

27.7 Restricted Boltzmann machines (RBMs) 983

  • Varieties of RBMs 985
  • Learning RBMs 987
  • Applications of RBMs 991

28 Deep learning 995

28.1 Introduction 995

28.2 Deep generative models 995

  • Deep directed networks 996
  • Deep Boltzmann machines 996
  • Deep belief networks 997
  • Greedy layer-wise learning of DBNs 998

28.3 Deep neural networks 999

  • Deep multi-layer perceptrons 999
  • Deep auto-encoders 1000
  • Stacked denoising auto-encoders 1001

28.4 Applications of deep networks 1001

  • Handwritten digit classification using DBNs 1001
  • Data visualization and feature discovery using deep auto-encoders 1002
  • Information retrieval using deep auto-encoders (semantic hashing) 1003
  • Learning audio features using 1d convolutional DBNs 1004
  • Learning image features using 2d convolutional DBNs 1005

28.5 Discussion 1005

Notation 1009

Bibliography 1015

Indexes 1047

  • Index to code 1047
  • Index to keywords 1050


Comment


Comment by Macgilchrist on November 23, 2017 at 10:47am
Superb! I will use this book for one of my MSc courses.
Great work. Thank you.
Prof. R. Macgilchrist.
