# Restricted Boltzmann machine (RBM)

A restricted Boltzmann machine (RBM) is a type of artificial neural network (ANN) for machine learning of probability distributions. An artificial neural network is a system of hardware and/or software patterned after the operation of neurons in the human brain.

Invented by Paul Smolensky in 1986 under the name Harmonium and later popularized by Geoffrey Hinton and his collaborators, RBMs are useful for dimensionality reduction, classification, regression, collaborative filtering, feature learning and topic modeling. Like perceptrons, they are a relatively simple type of neural network.

RBMs fall into the categories of stochastic and generative models of artificial intelligence. Stochastic means that the model's behavior is governed by probabilities, and generative means that it models the distribution of its training data and can therefore produce (generate) new samples. Generative models contrast with discriminative models, which classify existing data.

Like all multi-layer neural networks, RBMs have layers of artificial neurons, in their case two. The first layer is the input (visible) layer. The second is a hidden layer that only accepts what the first layer passes on. The restriction in an RBM is that neurons within the same layer cannot communicate with one another; they can only communicate with neurons in the other layer. (In a standard Boltzmann machine, neurons in the hidden layer intercommunicate.) Each node within a layer performs its own calculation on its input and then makes a stochastic decision about whether to pass that input on to the next layer.
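The stochastic decision made by each hidden node can be sketched in a few lines of plain Python. This is a minimal illustration, assuming binary units and a logistic activation; the names `sample_hidden`, `weights` and `hidden_bias` and all numeric values are made up for the example and do not come from any particular library.

```python
import math
import random

def sigmoid(x):
    """Logistic function mapping a real-valued input to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(visible, weights, hidden_bias, rng):
    """Sample binary hidden states given binary visible states.

    weights[i][j] connects visible unit i to hidden unit j. There are no
    visible-visible or hidden-hidden weights, which is the "restriction".
    """
    probs, states = [], []
    for j in range(len(hidden_bias)):
        total = hidden_bias[j] + sum(
            visible[i] * weights[i][j] for i in range(len(visible))
        )
        p = sigmoid(total)                            # probability that unit j turns on
        probs.append(p)
        states.append(1 if rng.random() < p else 0)  # stochastic on/off decision
    return probs, states

rng = random.Random(0)                                # fixed seed for repeatability
visible = [1, 0, 1]                                   # hypothetical input pattern
weights = [[0.5, -1.0], [0.3, 0.8], [-0.2, 0.4]]      # 3 visible x 2 hidden (illustrative)
hidden_bias = [0.0, 0.1]
probs, states = sample_hidden(visible, weights, hidden_bias, rng)
print(probs, states)
```

Because the hidden units have no connections among themselves, each one can be sampled independently once the visible layer is fixed, which is what makes the restricted variant cheaper to work with.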

Though RBMs are still sometimes used, they have mostly been replaced by generative adversarial networks or variational autoencoders.

### Boltzmann Machine: A Probabilistic Graphical Model

Geoffrey Hinton, often called the "Godfather of Deep Learning", introduced the Boltzmann machine together with David Ackley and Terrence Sejnowski in 1985. A well-known figure in the deep learning community, Hinton is also a professor at the University of Toronto.

A Boltzmann machine is a kind of stochastic recurrent neural network that is usually interpreted as a probabilistic graphical model. In short, it is a fully connected neural network consisting of visible and hidden units. It operates asynchronously, with stochastic updates for each of its units.

These machines can also be viewed as probability distributions on high-dimensional binary vectors. A Boltzmann machine is a generative unsupervised model that learns a probability distribution from an original dataset. It is very demanding in terms of computation power; however, restricting its network topology makes its behaviour easier to control.

It is useful for dimensionality reduction, classification, regression, collaborative filtering, feature learning and topic modelling. Like any other neural network, both BMs and RBMs have an input layer, referred to as the visible layer, and one or several layers of hidden units, referred to as the hidden layer.

## Restricted Boltzmann Machines

Boltzmann machines are probability distributions on high dimensional binary vectors which are analogous to Gaussian Markov Random Fields in that they are fully determined by first and second-order moments.
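As a brief illustration of why first- and second-order terms are enough, the standard textbook energy function can be written out. The symbols below ($s_i$ for binary unit states, $w_{ij}$ for symmetric weights, $b_i$ for biases, $Z$ for the normalizing constant, and $v$, $h$, $a$, $c$, $W$ for the restricted case) are notation introduced here, not taken from the text above.

```latex
% General Boltzmann machine over binary states s_i \in \{0, 1\}
E(s) = -\sum_{i<j} w_{ij}\, s_i s_j \;-\; \sum_i b_i\, s_i,
\qquad
P(s) = \frac{e^{-E(s)}}{Z},
\qquad
Z = \sum_{s'} e^{-E(s')}

% Restricted Boltzmann machine: only visible-hidden couplings remain
E(v, h) = -a^{\top} v \;-\; c^{\top} h \;-\; v^{\top} W h
```

The energy contains only single-unit terms and pairwise products, so the distribution is pinned down by the means and pairwise correlations of the units. In the restricted form, the pairwise terms couple a visible unit only with a hidden unit.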

It is used for pattern storage and retrieval. As Wikipedia puts it, "A Boltzmann machine (also called stochastic Hopfield network with hidden units) is a type of stochastic recurrent neural network and Markov random field." The RBM itself has many applications, some of which are listed below:

- Collaborative filtering
- Multiclass classification
- Information retrieval
- Motion capture modelling
- Segmentation
- Modelling natural images

Deep belief nets use the Boltzmann machine, especially the restricted Boltzmann machine, as a key component, trained with first-order weight updates.

## Limitations of neural networks grow clearer in business

AI often means neural networks, but intensive training requirements are prompting enterprises to look for alternatives to neural networks that are easier to implement.

The rise in prominence of AI today can be credited largely to improvements in one algorithm category: the neural network. But experts say that the limitations of neural networks mean enterprises will need to embrace a fuller lineup of algorithms to advance AI.

"With neural networks, there's this huge complication," said David Barrett, founder and CEO of Expensify Inc. "You end up with trillions of dimensions. If you want to change something, you need to start entirely from scratch. The more we tried [neural networks], we still couldn't get them to work."

Neural network technology is seen as cutting-edge today, but the underlying algorithms are nothing new. They were proposed as theoretically possible decades ago.

What's new is that we now have the massive stores of data needed to train algorithms and the robust compute power to process all this data in a reasonable period of time. As neural networks have moved from theoretical to practical, they've come to power some of the most advanced AI applications, like computer vision, language translation and self-driving cars.

## Training requirements for neural networks are too high

But the problem, as Barrett and others see it, is that neural networks simply require too much brute force. For example, if you show the algorithm a billion examples of images containing a certain object, it will learn to classify that object in new images effectively. But that's a high bar for training, and meeting that requirement is sometimes impossible.

That was the case for Barrett and his team. At the 2018 Artificial Intelligence Conference in New York, he described how Expensify is using natural language processing to automate customer service for its expense reporting software. Neural networks weren't a good fit for Expensify because the San Francisco company didn't have the corpus of historical data necessary.

Expensify's customer inquiries are often esoteric, Barrett said. Even when customers' concerns map to common problems, their phrasing is unique and, therefore, hard to classify using a system that demands many training examples.

So, Barrett and his team developed their own approach. He didn't identify the specific type of algorithms their tool is based on, but he said it compares pieces of conversations to conversations that have proceeded successfully in the past. It doesn't need to classify queries with precision like a neural network would because it's more focused on moving the conversation along a path rather than delivering the right response to a given query. This gives the bot a chance to ask clarifying questions that reduce ambiguity.

*"The challenge of AI is it's built to answer perfectly formed
questions," Barrett said. "The challenge of the real world is
different."*

## A 'broad church' of algorithms is needed in AI

Part of the reason for the enthusiasm around neural network technology is that many people are just finding out about it, said Zoubin Ghahramani, chief scientist at Uber. But for those that have known about and used it for years, the limitations of neural networks are well known.

That doesn't mean it's time for people to ignore neural networks, however. Instead, Ghahramani said it comes down to using the right tool for the right job. He described an approach to incorporating Bayesian inference, in which the estimated probability of something occurring is updated when more evidence becomes available, into machine learning models.
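The idea of updating an estimated probability as evidence arrives can be sketched with Bayes' rule. The function name `bayes_update` and the numbers (a 0.5 prior, and likelihoods of 0.8 and 0.3) are hypothetical values chosen only to show the mechanics, not anything described by Ghahramani.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(H | evidence) from P(H) and the two likelihoods, via Bayes' rule."""
    numerator = likelihood_if_true * prior
    evidence = numerator + likelihood_if_false * (1.0 - prior)
    return numerator / evidence

belief = 0.5  # illustrative starting belief that hypothesis H is true
for _ in range(3):
    # Each observation is assumed to be more likely under H (0.8) than under not-H (0.3).
    belief = bayes_update(belief, 0.8, 0.3)
    print(round(belief, 3))
```

Each pass uses the previous posterior as the new prior, so the belief in the hypothesis grows as consistent evidence accumulates.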

*"To have successful AI applications that solve challenging real-world
problems, you have to have a broad church of methods," he said in a
press conference. "You can't come in with one hammer trying to solve all
problems."*

Another alternative to neural network technology is deep reinforcement learning, which is optimized to achieve a goal over many steps by incentivizing effective steps and penalizing unfavorable steps. The AlphaGo program, which beat human champions at the game Go, used a combination of neural networks and deep reinforcement learning to learn the game.

Deep reinforcement learning algorithms essentially learn through trial and error, whereas neural networks learn through example. This means deep reinforcement learning requires less labeled training data upfront.

Kathryn Hume, vice president of product and strategy at Integrate.ai Inc., a Toronto-based software company that helps enterprises integrate AI into existing business processes, said any type of model that reduces the reliance on labeled training data is important. She mentioned Bayesian parametric models, which assess the probability of an occurrence based on existing data rather than requiring some minimum threshold of prior examples, one of the primary limitations of neural networks.

"We need not rely on just throwing a bunch of information into a pot," she said. "It can move us away from the reliance on labeled training data when we can infer the structure of data," rather than using algorithms like neural networks, which require millions or billions of examples of labeled training before they can make predictions.

## What is a Neural Network and How Does it Work?

Research on artificial neural networks was motivated by the observation that human intelligence emerges from highly parallel networks of relatively simple, non-linear neurons that learn by adjusting the strengths of their connections. This observation leads to a central computational question: How is it possible for networks of this general kind to learn the complicated internal representations that are required for difficult tasks such as recognizing objects or understanding language? Deep learning seeks to answer this question by using many layers of activity vectors as representations and learning the connection strengths that give rise to these vectors by following the stochastic gradient of an objective function that measures how well the network is performing. It is very surprising that such a conceptually simple approach has proved to be so effective when applied to large training sets using huge amounts of computation and it appears that a key ingredient is depth: shallow networks simply do not work as well.

We reviewed the basic concepts and some of the breakthrough achievements of deep learning several years ago.[63] Here we briefly describe the origins of deep learning, describe a few of the more recent advances, and discuss some of the future challenges. These challenges include learning with little or no external supervision, coping with test examples that come from a different distribution than the training examples, and using the deep learning approach for tasks that humans solve by using a deliberate sequence of steps which we attend to consciously, tasks that Kahneman[56] calls system 2 tasks as opposed to system 1 tasks like object recognition or immediate natural language understanding, which generally feel effortless.

## From Hand-Coded Symbolic Expressions to Learned Distributed Representations

There are two quite different paradigms for AI. Put simply, the logic-inspired paradigm views sequential reasoning as the essence of intelligence and aims to implement reasoning in computers using hand-designed rules of inference that operate on hand-designed symbolic expressions that formalize knowledge. The brain-inspired paradigm views learning representations from data as the essence of intelligence and aims to implement learning by hand-designing or evolving rules for modifying the connection strengths in simulated networks of artificial neurons.

In the logic-inspired paradigm, a symbol has no meaningful internal structure: Its meaning resides in its relationships to other symbols which can be represented by a set of symbolic expressions or by a relational graph. By contrast, in the brain-inspired paradigm the external symbols that are used for communication are converted into internal vectors of neural activity and these vectors have a rich similarity structure. Activity vectors can be used to model the structure inherent in a set of symbol strings by learning appropriate activity vectors for each symbol and learning non-linear transformations that allow the activity vectors that correspond to missing elements of a symbol string to be filled in. This was first demonstrated in Rumelhart et al.[74] on toy data and then by Bengio et al.[14] on real sentences. A very impressive recent demonstration is BERT,[22] which also exploits self-attention to dynamically connect groups of units, as described later.

The main advantage of using vectors of neural activity to represent concepts and weight matrices to capture relationships between concepts is that this leads to automatic generalization. If Tuesday and Thursday are represented by very similar vectors, they will have very similar causal effects on other vectors of neural activity. This facilitates analogical reasoning and suggests that immediate, intuitive analogical reasoning is our primary mode of reasoning, with logical sequential reasoning being a much later development,[56] which we will discuss.

## The Rise of Deep Learning

Deep learning re-energized neural network research in the early 2000s by introducing a few elements which made it easy to train deeper networks. The emergence of GPUs and the availability of large datasets were key enablers of deep learning and they were greatly enhanced by the development of open source, flexible software platforms with automatic differentiation such as Theano,[16] Torch,[25] Caffe,[55] TensorFlow,[1] and PyTorch.[71] This made it easy to train complicated deep nets and to reuse the latest models and their building blocks. But the composition of more layers is what allowed more complex non-linearities and achieved surprisingly good results in perception tasks, as summarized here.

Why depth? Although the intuition that deeper neural networks could be more powerful pre-dated modern deep learning techniques, it was a series of advances in both architecture and training procedures,[15,35,48] which ushered in the remarkable advances which are associated with the rise of deep learning. But why might deeper networks generalize better for the kinds of input-output relationships we are interested in modeling? It is important to realize that it is not simply a question of having more parameters, since deep networks often generalize better than shallow networks with the same number of parameters. The practice confirms this. The most popular class of convolutional net architecture for computer vision is the ResNet family of which the most common representative, ResNet-50 has 50 layers. Other ingredients not mentioned in this article but which turned out to be very useful include image deformations, drop-out, and batch normalization.

We believe that deep networks excel because they exploit a particular form of compositionality in which features in one layer are combined in many different ways to create more abstract features in the next layer.

For tasks like perception, this kind of compositionality works very well and there is strong evidence that it is used by biological perceptual systems.

Unsupervised pre-training. When the number of labeled training examples is small compared with the complexity of the neural network required to perform the task, it makes sense to start by using some other source of information to create layers of feature detectors and then to fine-tune these feature detectors using the limited supply of labels. In transfer learning, the source of information is another supervised learning task that has plentiful labels. But it is also possible to create layers of feature detectors without using any labels at all by stacking auto-encoders.

First, we learn a layer of feature detectors whose activities allow us to reconstruct the input. Then we learn a second layer of feature detectors whose activities allow us to reconstruct the activities of the first layer of feature detectors. After learning several hidden layers in this way, we then try to predict the label from the activities in the last hidden layer and we backpropagate the errors through all of the layers in order to fine-tune the feature detectors that were initially discovered without using the precious information in the labels. The pre-training may well extract all sorts of structure that is irrelevant to the final classification but, in the regime where computation is cheap and labeled data is expensive, this is fine so long as the pre-training transforms the input into a representation that makes classification easier.

In addition to improving generalization, unsupervised pre-training initializes the weights in such a way that it is easy to fine-tune a deep neural network with backpropagation. The effect of pre-training on optimization was historically important for overcoming the accepted wisdom that deep nets were hard to train, but it is much less relevant now that people use rectified linear units (see next section) and residual connections.[43] However, the effect of pre-training on generalization has proved to be very important. It makes it possible to train very large models by leveraging large quantities of unlabeled data, for example, in natural language processing, for which huge corpora are available. The general principle of pre-training and fine-tuning has turned out to be an important tool in the deep learning toolbox, for example, when it comes to transfer learning or even as an ingredient of modern meta-learning.

The mysterious success of rectified linear units. The early successes of deep networks involved unsupervised pre-training of layers of units that used the logistic sigmoid nonlinearity or the closely related hyperbolic tangent. Rectified linear units had long been hypothesized in neuroscience[29] and already used in some variants of RBMs[70] and convolutional neural networks.[54] It was an unexpected and pleasant surprise to discover[35] that rectifying non-linearities (now called ReLUs, with many modern variants) made it easy to train deep networks by backprop and stochastic gradient descent, without the need for layerwise pre-training. This was one of the technical advances that enabled deep learning to outperform previous methods for object recognition, as outlined here.
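One commonly cited, partial intuition for this result is that the sigmoid's gradient shrinks toward zero for large positive or negative inputs, while the ReLU's gradient stays at 1 for any positive input. The small sketch below only compares the two derivatives at a few hand-picked inputs; it is an illustration of that intuition, not an explanation offered by the text above.

```python
import math

def sigmoid(x):
    """Logistic sigmoid nonlinearity."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of the ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in (-6.0, 0.5, 6.0):  # illustrative inputs
    print(x, sigmoid_grad(x), relu_grad(x))
```

When many layers are multiplied together during backpropagation, factors that are at most 0.25 shrink the error signal quickly, whereas factors of exactly 1 on the active units do not.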

Breakthroughs in speech and object recognition. An acoustic model converts a representation of the sound wave into a probability distribution over fragments of phonemes. Heroic efforts by Robinson using transputers and by Morgan et al. using DSP chips had already shown that, with sufficient processing power, neural networks were competitive with the state of the art for acoustic modeling. In 2009, two graduate students[68] using Nvidia GPUs showed that pre-trained deep neural nets could slightly outperform the SOTA on the TIMIT dataset. This result reignited the interest of several leading speech groups in neural networks. In 2010, essentially the same deep network was shown to beat the SOTA for large vocabulary speech recognition without requiring speaker-dependent training and by 2012, Google had engineered a production version that significantly improved voice search on Android. This was an early demonstration of the disruptive power of deep learning.

At about the same time, deep learning scored a dramatic victory in the 2012 ImageNet competition, almost halving the error rate for recognizing a thousand different classes of object in natural images.[60] The keys to this victory were the major effort by Fei-Fei Li and her collaborators in collecting more than a million labeled images[31] for the training set and the very efficient use of multiple GPUs by Alex Krizhevsky. Current hardware, including GPUs, encourages the use of large mini-batches in order to amortize the cost of fetching a weight from memory across many uses of that weight. Pure online stochastic gradient descent which uses each weight once converges faster and future hardware may just use weights in place rather than fetching them from memory.

The deep convolutional neural net contained a few novelties such as the use of ReLUs to make learning faster and the use of dropout to prevent over-fitting, but it was basically just a feed-forward convolutional neural net of the kind that Yann LeCun and his collaborators had been developing for many years.[64,65] The response of the computer vision community to this breakthrough was admirable. Given this incontrovertible evidence of the superiority of convolutional neural nets, the community rapidly abandoned previous hand-engineered approaches and switched to deep learning.

## Recent Advances

Here we selectively touch on some of the more recent advances in deep learning, clearly leaving out many important subjects, such as deep reinforcement learning, graph neural networks and meta-learning.

Soft attention and the transformer architecture. A significant development in deep learning, especially when it comes to sequential processing, is the use of multiplicative interactions, particularly in the form of soft attention.[7,32,39,78] This is a transformative addition to the neural net toolbox, in that it changes neural nets from purely vector transformation machines into architectures which can dynamically choose which inputs they operate on, and can store information in differentiable associative memories. A key property of such architectures is that they can effectively operate on different kinds of data structures including sets and graphs.

Soft attention can be used by modules in a layer to dynamically select which vectors from the previous layer they will combine to compute their outputs. This can serve to make the output independent of the order in which the inputs are presented (treating them as a set) or to use relationships between different inputs (treating them as a graph).

The transformer architecture,[85] which has become the dominant architecture in many applications, stacks many layers of "self-attention" modules. Each module in a layer uses a scalar product to compute the match between its query vector and the key vectors of other modules in that layer. The matches are normalized to sum to 1, and the resulting scalar coefficients are then used to form a convex combination of the value vectors produced by the other modules in the previous layer. The resulting vector forms an input for a module of the next stage of computation. Modules can be made multi-headed so that each module computes several different query, key and value vectors, thus making it possible for each module to have several distinct inputs, each selected from the previous stage modules in a different way. The order and number of modules does not matter in this operation, making it possible to operate on sets of vectors rather than single vectors as in traditional neural networks. For instance, a language translation system, when producing a word in the output sentence, can choose to pay attention to the corresponding group of words in the input sentence, independently of their position in the text. While multiplicative gating is an old idea for such things as coordinate transforms and powerful forms of recurrent networks, its recent forms have made it mainstream. Another way to think about attention mechanisms is that they make it possible to dynamically route information through appropriately selected modules and combine these modules in potentially novel ways for improved out-of-distribution generalization.
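The query, key and value computation described above can be sketched for a single attention head in plain Python. This is a simplified illustration: it uses a softmax for the normalization, omits the learned projections and the scaling factor used in practical transformers, and the `attend` function and its toy vectors are assumptions made for the example.

```python
import math

def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    m = max(scores)                                  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head soft attention for one query vector."""
    # Scalar product between the query and each key measures the match.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # Convex combination of the value vectors, weighted by the normalized matches.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # illustrative key vectors
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]      # illustrative value vectors
weights, output = attend([2.0, 0.0], keys, values)
print(weights, output)
```

Reordering the key and value pairs together leaves `output` unchanged, which is the sense in which the operation treats its inputs as a set.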

Transformers have produced dramatic performance improvements that have revolutionized natural language processing,[27,32] and they are now being used routinely in industry. These systems are all pre-trained in a self-supervised manner to predict missing words in a segment of text.

Perhaps more surprisingly, transformers have been used successfully to solve integral and differential equations symbolically.[62] A very promising recent trend uses transformers on top of convolutional nets for object detection and localization in images with state-of-the-art performance.[19] The transformer performs post-processing and object-based reasoning in a differentiable manner, enabling the system to be trained end-to-end.

Unsupervised and self-supervised learning. Supervised learning, while successful in a wide variety of tasks, typically requires a large amount of human-labeled data. Similarly, when reinforcement learning is based only on rewards, it requires a very large number of interactions. These learning methods tend to produce task-specific, specialized systems that are often brittle outside of the narrow domain they have been trained on. Reducing the number of human-labeled samples or interactions with the world that are required to learn a task and increasing the out-of-domain robustness is of crucial importance for applications such as low-resource language translation, medical image analysis, autonomous driving, and content filtering.

Humans and animals seem to be able to learn massive amounts of background knowledge about the world, largely by observation, in a task-independent manner. This knowledge underpins common sense and allows humans to learn complex tasks, such as driving, with just a few hours of practice. A key question for the future of AI is how do humans learn so much from observation alone?

In supervised learning, a label for one of N categories conveys, on average, at most log2(N) bits of information about the world. In model-free reinforcement learning, a reward similarly conveys only a few bits of information. In contrast, audio, images and video are high-bandwidth modalities that implicitly convey large amounts of information about the structure of the world. This motivates a form of prediction or reconstruction called self-supervised learning which is training to "fill in the blanks" by predicting masked or corrupted portions of the data. Self-supervised learning has been very successful for training transformers to extract vectors that capture the context-dependent meaning of a word or word fragment and these vectors work very well for downstream tasks.
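As a quick worked example of the log2(N) bound, the snippet below computes the maximum information carried by one label for a few class counts. The helper name `max_label_bits` is made up for this illustration, and the 1,000-class case simply mirrors the ImageNet setting mentioned earlier.

```python
import math

def max_label_bits(num_classes):
    """Upper bound, in bits, on the information conveyed by one class label."""
    return math.log2(num_classes)

for n in (2, 8, 1000):
    print(n, max_label_bits(n))
```

Even with a thousand categories, a label carries fewer than 10 bits, which is the contrast the paragraph draws with high-bandwidth signals such as images and video.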

For text, the transformer is trained to predict missing words from a discrete set of possibilities. But in high-dimensional continuous domains such as video, the set of plausible continuations of a particular video segment is large and complex and representing the distribution of plausible continuations properly is essentially an unsolved problem.

Contrastive learning. One way to approach this problem is through latent variable models that assign an energy (that is, a badness) to examples of a video and a possible continuation.

Given an input video X and a proposed continuation Y, we want a model to indicate whether Y is compatible with X by using an energy function E(X, Y) which takes low values when X and Y are compatible, and higher values otherwise.

E(X, Y) can be computed by a deep neural net which, for a given X, is trained in a contrastive way to give a low energy to values Y that are compatible with X (such as examples of (X, Y) pairs from a training set), and high energy to other values of Y that are incompatible with X. For a given X, inference consists in finding one value of Y that minimizes E(X, Y) or perhaps sampling from the Ys that have low values of E(X, Y). This energy-based approach to representing the way Y depends on X makes it possible to model a diverse, multi-modal set of plausible continuations.

The key difficulty with contrastive learning is to pick good "negative" samples: suitable points Y whose energy will be pushed up. When the set of possible negative examples is not too large, we can just consider them all. This is what a softmax does, so in this case contrastive learning reduces to standard supervised or self-supervised learning over a finite discrete set of symbols. But in a real-valued high-dimensional space, there are far too many ways a vector Ŷ could be different from Y and to improve the model we need to focus on those Ŷs that should have high energy but currently have low energy. Early methods to pick negative samples were based on Monte-Carlo methods, such as contrastive divergence for restricted Boltzmann machines[48] and noise-contrastive estimation.
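The remark that a softmax considers all negative examples at once can be made concrete with a short derivation. The set $\mathcal{Y}$ of candidate outputs and the loss symbol $\mathcal{L}$ are notation introduced here; this is the standard way of turning energies into probabilities, offered as a sketch rather than as the authors' exact formulation.

```latex
P(Y \mid X) = \frac{e^{-E(X, Y)}}{\sum_{Y' \in \mathcal{Y}} e^{-E(X, Y')}}
\qquad\Longrightarrow\qquad
\mathcal{L}(X, Y) = -\log P(Y \mid X)
                  = E(X, Y) + \log \sum_{Y' \in \mathcal{Y}} e^{-E(X, Y')}
```

Minimizing $\mathcal{L}$ lowers the energy of the observed $Y$ through the first term and raises the energy of every candidate $Y'$ through the second term, with low-energy candidates pushed up the most. The sum is only tractable when $\mathcal{Y}$ is finite and not too large.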

Generative Adversarial Networks (GANs)[36] train a generative neural net to produce contrastive samples by applying a neural network to latent samples from a known distribution (for example, a Gaussian). The generator trains itself to produce outputs Ŷ to which the model gives low energy E(Ŷ). The generator can do so using backpropagation to get the gradient of E(Ŷ) with respect to Ŷ. The generator and the model are trained simultaneously, with the model attempting to give low energy to training samples, and high energy to generated contrastive samples.

GANs are somewhat tricky to optimize, but adversarial training ideas have proved extremely fertile, producing impressive results in image synthesis, and opening up many new applications in content creation and domain adaptation[34] as well as domain or style transfer.[87]

Making representations agree using contrastive learning. Contrastive learning provides a way to discover good feature vectors without having to reconstruct or generate pixels. The idea is to learn a feed-forward neural network that produces very similar output vectors when given two different crops of the same image[10] or two different views of the same object[17] but dissimilar output vectors for crops from different images or views of different objects. The squared distance between the two output vectors can be treated as an energy, which is pushed down for compatible pairs and pushed up for incompatible pairs.
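A minimal sketch of treating squared distance as an energy is shown below. The hinge with a `margin` for incompatible pairs is one common choice and is an assumption of this example, as are the function names; the text above does not specify how the energy of incompatible pairs is pushed up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two output vectors (the 'energy')."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pair_loss(a, b, compatible, margin=1.0):
    """Loss that lowers the energy of compatible pairs and raises it for others.

    The margin-based hinge for incompatible pairs is an illustrative assumption:
    such pairs stop contributing once their energy exceeds the margin.
    """
    energy = squared_distance(a, b)
    if compatible:
        return energy                    # minimizing this pulls the vectors together
    return max(0.0, margin - energy)     # minimizing this pushes the vectors apart

print(pair_loss([0.0, 0.0], [0.5, 0.0], compatible=True))
print(pair_loss([0.0, 0.0], [0.5, 0.0], compatible=False))
```

In a real system the two vectors would be the outputs of the same network applied to two crops or views, and the loss would be minimized by gradient descent on the network weights.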

A series of recent papers that use convolutional nets for extracting representations that agree have produced promising results in visual feature learning. The positive pairs are composed of different versions of the same image that are distorted through cropping, scaling, rotation, color shift, blurring, and so on. The negative pairs are similarly distorted versions of different images which may be cleverly picked from the dataset through a process called hard negative mining or may simply be all of the distorted versions of other images in a minibatch. The hidden activity vector of one of the higher-level layers of the network is subsequently used as input to a linear classifier trained in a supervised manner. This Siamese net approach has yielded excellent results on standard image recognition benchmarks.[6] Very recently, two Siamese net approaches have managed to eschew the need for contrastive samples. The first one, dubbed SwAV, quantizes the output of one network to train the other network;[20] the second one, dubbed BYOL, smoothes the weight trajectory of one of the two networks, which is apparently enough to prevent a collapse.

Variational auto-encoders. A popular recent self-supervised learning method is the Variational Auto-Encoder (VAE).[58] This consists of an encoder network that maps the image into a latent code space and a decoder network that generates an image from a latent code. The VAE limits the information capacity of the latent code by adding Gaussian noise to the output of the encoder before it is passed to the decoder. This is akin to packing small noisy spheres into a larger sphere of minimum radius. The information capacity is limited by how many noisy spheres fit inside the containing sphere. The noisy spheres repel each other because a good reconstruction error requires a small overlap between codes that correspond to different samples. Mathematically, the system minimizes a free energy obtained through marginalization of the latent code over the noise distribution. However, minimizing this free energy with respect to the parameters is intractable, and one has to rely on variational approximation methods from statistical physics that minimize an upper bound of the free energy.
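The upper bound mentioned above is usually written as follows in the VAE literature. The notation ($x$ for an input, $z$ for the latent code, $q_\phi$ for the encoder, $p_\theta$ for the decoder, $p(z)$ for the prior over codes) is introduced here for illustration and is not defined in the text.

```latex
-\log p_\theta(x)
\;\le\;
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[-\log p_\theta(x \mid z)\big]}_{\text{reconstruction error}}
\;+\;
\underbrace{\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{limit on code capacity}}
```

The left-hand side plays the role of the free energy of $x$, and the right-hand side is the tractable bound that is actually minimized. The first term rewards codes that reconstruct the input well, and the second penalizes codes that carry more information than the noisy latent space allows.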

## The Future of Deep Learning

The performance of deep learning systems can often be dramatically improved by simply scaling them up. With a lot more data and a lot more computation, they generally work a lot better. The language model GPT-3 [18] with 175 billion parameters (which is still tiny compared with the number of synapses in the human brain) generates noticeably better text than GPT-2 with only 1.5 billion parameters. The chatbots Meena [2] and BlenderBot [73] also keep improving as they get bigger. Enormous effort is now going into scaling up and it will improve existing systems a lot, but there are fundamental deficiencies of current deep learning that cannot be overcome by scaling alone, as discussed here.

Comparing human learning abilities with current AI suggests several directions for improvement:

- Supervised learning requires too much labeled data and model-free reinforcement learning requires far too many trials. Humans seem to be able to generalize well with far less experience.
- Current systems are not as robust to changes in distribution as humans, who can quickly adapt to such changes with very few examples.
- Current deep learning is most successful at perception tasks and generally what are called system 1 tasks. Using deep learning for system 2 tasks that require a deliberate sequence of steps is an exciting area that is still in its infancy.

What needs to be improved. From the early days, theoreticians of machine learning have focused on the iid assumption, which states that the test cases are expected to come from the same distribution as the training examples. Unfortunately, this is not a realistic assumption in the real world: just consider the non-stationarities due to actions of various agents changing the world, or the gradually expanding mental horizon of a learning agent which always has more to learn and discover. As a practical consequence, the performance of today's best AI systems tends to take a hit when they go from the lab to the field.

Our desire to achieve greater robustness when confronted with changes in distribution (called out-of-distribution generalization) is a special case of the more general objective of reducing sample complexity (the number of examples needed to generalize well) when faced with a new task (as in transfer learning and lifelong learning [81]) or simply with a change in distribution or in the relationship between states of the world and rewards. Current supervised learning systems require many more examples than humans (when having to learn a new task) and the situation is even worse for model-free reinforcement learning [23], since each rewarded trial provides less information about the task than each labeled example. It has already been noted [61, 76] that humans can generalize in a way that is different and more powerful than ordinary iid generalization: we can correctly interpret novel combinations of existing concepts, even if those combinations are extremely unlikely under our training distribution, so long as they respect high-level syntactic and semantic patterns we have already learned. Recent studies help us clarify how different neural net architectures fare in terms of this systematic generalization ability. How can we design future machine learning systems with these abilities to generalize better or adapt faster out-of-distribution?

**Lecture 12.3 — Restricted Boltzmann Machines — [ Deep Learning | Geoffrey Hinton | UofT ]**

From homogeneous layers to groups of neurons that represent entities. Evidence from neuroscience suggests that groups of nearby neurons (forming what is called a hyper-column) are tightly connected and might represent a kind of higher-level vector-valued unit able to send not just a scalar quantity but rather a set of coordinated values. This idea is at the heart of the capsules architectures [47, 59], and it is also inherent in the use of soft-attention mechanisms, where each element in the set is associated with a vector, from which one can read a key vector and a value vector (and sometimes also a query vector). One way to think about these vector-level units is as representing the detection of an object along with its attributes (like pose information, in capsules). Recent papers in computer vision are exploring extensions of convolutional neural networks in which the top level of the hierarchy represents a set of candidate objects detected in the input image, and operations on these candidates are performed with transformer-like architectures. Neural networks that assign intrinsic frames of reference to objects and their parts and recognize objects by using the geometric relationships between parts should be far less vulnerable to directed adversarial attacks [79], which rely on the large difference between the information used by people and that used by neural nets to recognize objects.

Multiple time scales of adaptation. Most neural nets only have two timescales: the weights adapt slowly over many examples and the activities adapt rapidly, changing with each new input. Adding an overlay of rapidly adapting and rapidly decaying "fast weights" [49] introduces interesting new computational abilities. In particular, it creates a high-capacity, short-term memory [4], which allows a neural net to perform true recursion in which the same neurons can be reused in a recursive call because their activity vector in the higher-level call can be reconstructed later using the information in the fast weights. Multiple time scales of adaptation also arise in learning to learn, or meta-learning.

Higher-level cognition. When thinking about a new challenge, such as driving in a city with unusual traffic rules, or even imagining driving a vehicle on the moon, we can take advantage of pieces of knowledge and generic skills we have already mastered and recombine them dynamically in new ways. This form of systematic generalization allows humans to generalize fairly well in contexts that are very unlikely under their training distribution. We can then further improve with practice, fine-tuning and compiling these new skills so they do not need conscious attention anymore. How could we endow neural networks with the ability to adapt quickly to new settings by mostly reusing already known pieces of knowledge, thus avoiding interference with known skills? Initial steps in that direction include Transformers [32] and Recurrent Independent Mechanisms.

It seems that our implicit (system 1) processing abilities allow us to guess potentially good or dangerous futures when planning or reasoning. This raises the question of how system 1 networks could guide search and planning at the higher (system 2) level, maybe in the spirit of the value functions which guide Monte-Carlo tree search for AlphaGo [77].

Machine learning research relies on inductive biases or priors in order to encourage learning in directions which are compatible with some assumptions about the world. The nature of system 2 processing and the cognitive neuroscience theories about it [5, 30] suggest several such inductive biases and architectures [11, 45], which may be exploited to design novel deep learning systems. How do we design deep learning architectures and training frameworks which incorporate such inductive biases?

The ability of young children to perform causal discovery [37] suggests this may be a basic property of the human brain, and recent work suggests that optimizing out-of-distribution generalization under interventional changes can be used to train neural networks to discover causal dependencies or causal variables. How should we structure and train neural nets so they can capture these underlying causal properties of the world?

How are the directions suggested by these open questions related to the symbolic AI research program from the 20th century? Clearly, this symbolic AI program aimed at achieving system 2 abilities, such as reasoning, being able to factorize knowledge into pieces which can easily be recombined in a sequence of computational steps, and being able to manipulate abstract variables, types, and instances. We would like to design neural networks which can do all these things while working with real-valued vectors so as to preserve the strengths of deep learning, which include efficient large-scale learning using differentiable computation and gradient-based adaptation, grounding of high-level concepts in low-level perception and action, handling uncertain data, and using distributed representations.

## Introduction to Restricted Boltzmann Machines

First introduced by Paul Smolensky in 1986 under the name Harmonium and later popularized by Geoffrey Hinton (sometimes referred to as the Godfather of Deep Learning), a Restricted Boltzmann Machine is an algorithm useful for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling.

Before moving forward, let us first understand what Boltzmann Machines are.

### What are Boltzmann Machines?

A Boltzmann machine is a stochastic (non-deterministic), generative deep learning model which has only visible (input) and hidden nodes.

The image below presents ten nodes, all of which are interconnected; they are also often referred to as states. Brown ones represent hidden nodes (h) and blue ones represent visible nodes (v). If you already understand artificial, convolutional, and recurrent neural networks, you’ll notice that their input nodes are never connected to each other, whereas Boltzmann Machines have their inputs connected, and that is what makes them fundamentally unconventional. All these nodes exchange information among themselves and self-generate subsequent data, hence the term generative deep model.

There is no output node in this model, so unlike our other classifiers, we cannot make it learn 1 or 0 from the target variable of the training dataset by applying gradient descent, stochastic gradient descent (SGD), and so on. The same holds for regressor models: the model cannot learn a pattern from target variables. These attributes set the model apart from conventional, deterministic networks. How, then, does this model learn and predict?

Here, visible nodes are what we measure and hidden nodes are what we don’t measure. When we input data, these nodes learn the parameters, their patterns, and the correlations between them on their own and form an efficient system; hence the Boltzmann Machine is termed an unsupervised deep learning model. The model is then ready to monitor and flag abnormal behavior based on what it has learned.

Hinton once referred to the illustration of a Nuclear Power plant as an example for understanding Boltzmann Machines. This is a complex topic so we shall proceed slowly to understand the intuition behind each concept, with a minimum amount of mathematics and physics involved.

So in the simplest introductory terms, Boltzmann Machines are a kind of Energy-based Model (EBM), and Restricted Boltzmann Machines (RBMs) are a variant of them with restricted connectivity. When these RBMs are stacked on top of each other, they are known as Deep Belief Networks (DBNs).

### What are Restricted Boltzmann Machines?

A Restricted Boltzmann Machine (RBM) is a generative, stochastic, and 2-layer artificial neural network that can learn a probability distribution over its set of inputs.
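For reference, the probability distribution of an RBM with binary units is usually written in terms of an energy function. The formulation below is the standard textbook one and is not spelled out in the text above; the symbols $a$ (visible biases), $b$ (hidden biases), $W$ (weights, with rows indexed by visible units and columns by hidden units), and $\sigma$ (the logistic sigmoid) are introduced here for illustration.

$$
E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j
$$

$$
P(v, h) = \frac{e^{-E(v, h)}}{Z}, \qquad Z = \sum_{v', h'} e^{-E(v', h')}
$$

$$
P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i W_{ij}\Big), \qquad
P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j W_{ij} h_j\Big)
$$

Low-energy configurations are more probable, and learning adjusts $a$, $b$, and $W$ so that the training data falls into low-energy regions.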

Stochastic means “randomly determined”: in RBMs, each node switches on or off with a probability computed from its inputs, and the coefficients (weights) that modify inputs are randomly initialized.

The first layer of the RBM is called the visible, or input layer, and the second is the hidden layer. Each circle represents a neuron-like unit called a node. Each node in the input layer is connected to every node of the hidden layer.

The restriction in a Restricted Boltzmann Machine is that there is no intra-layer communication (nodes of the same layer are not connected). This restriction allows for more efficient training algorithms than those available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm. Each node is a locus of computation that processes input and begins by making stochastic decisions about whether to transmit that input or not.
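Because nodes within a layer are not connected, all hidden nodes can be sampled at once given the visible layer, and vice versa. The NumPy sketch below shows one common variant of contrastive divergence (CD-1) for a binary RBM. It is a minimal illustration, not a reference implementation: the function name `cd1_step`, the learning rate, the toy data, and the number of iterations are all assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, rng, lr=0.1):
    """One CD-1 update for a binary RBM on a batch v0 of shape (n, n_visible)."""
    # Positive phase: hidden probabilities and a binary sample given the data.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step down to the visible layer and up again.
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    n = v0.shape[0]
    # Approximate gradient: data correlations minus reconstruction correlations.
    W_new = W + lr * (v0.T @ ph0 - v1.T @ ph1) / n
    a_new = a + lr * (v0 - v1).mean(axis=0)
    b_new = b + lr * (ph0 - ph1).mean(axis=0)
    return W_new, a_new, b_new

rng = np.random.default_rng(0)
# Toy dataset: two repeating binary patterns over 4 visible nodes.
data = np.tile(np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float), (10, 1))
W = 0.01 * rng.standard_normal((4, 3))   # 4 visible nodes, 3 hidden nodes
a = np.zeros(4)                          # visible biases
b = np.zeros(3)                          # hidden biases
for _ in range(200):
    W, a, b = cd1_step(data, W, a, b, rng)
recon = sigmoid(sigmoid(data @ W + b) @ W.T + a)
```

Here `recon` holds the reconstruction probabilities for each visible node after training, which can be compared against `data` to judge how well the patterns were learned.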

RBMs received a lot of attention after being proposed as building blocks of multi-layer learning architectures called Deep Belief Networks (DBNs), which are formed by stacking RBMs on top of each other.

## Difference between Autoencoders & RBMs

An autoencoder is a simple 3-layer neural network where output units are directly connected back to input units. Typically, the number of hidden units is much smaller than the number of visible ones. The task of training is to minimize the reconstruction error, i.e. to find the most efficient compact representation of the input data.
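To make the idea of minimizing reconstruction error concrete, here is a minimal sketch of a linear autoencoder trained with plain gradient descent in NumPy. The layer sizes, learning rate, random data, and the absence of nonlinearities and biases are simplifying assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))            # 100 samples, 6 visible units
W_enc = 0.1 * rng.standard_normal((6, 2))    # 6 inputs -> 2 hidden units
W_dec = 0.1 * rng.standard_normal((2, 6))    # 2 hidden units -> 6 outputs

def reconstruction_error(X, W_enc, W_dec):
    """Mean squared difference between the input and its reconstruction."""
    return np.mean((X @ W_enc @ W_dec - X) ** 2)

initial_error = reconstruction_error(X, W_enc, W_dec)
lr = 0.05
for _ in range(500):
    H = X @ W_enc                  # compact hidden representation
    err = H @ W_dec - X            # reconstruction minus input
    grad_dec = H.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
final_error = reconstruction_error(X, W_enc, W_dec)
```

Unlike an RBM, this network is deterministic: the hidden code is computed directly from the input rather than sampled, and training follows the gradient of the reconstruction error itself.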

## Working of Restricted Boltzmann Machine

One aspect that distinguishes RBM from other neural networks is that it has two biases:

- The hidden bias helps the RBM produce the activations on the forward pass.
- The visible layer’s biases help the RBM learn the reconstructions on the backward pass.

The reconstructed input generally differs from the actual input, as there are no connections among visible nodes and therefore no way of transferring information among themselves.

The above image shows the first step in training an RBM with multiple inputs. The inputs are multiplied by the weights and then added to the bias. The result is then passed through a sigmoid activation function, and the output determines whether the hidden state gets activated or not. The weights form a matrix with the number of input nodes as the number of rows and the number of hidden nodes as the number of columns. The first hidden node receives the dot product of the inputs with the first column of weights, to which the corresponding bias term is then added.
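The forward pass described above, followed by the backward (reconstruction) pass that uses the visible biases, can be sketched in NumPy as follows. The layer sizes, the input vector, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_visible, n_hidden = 4, 3
v = np.array([1.0, 0.0, 1.0, 1.0])        # one input vector
# Rows correspond to input nodes, columns to hidden nodes.
W = 0.1 * rng.standard_normal((n_visible, n_hidden))
b_hidden = np.zeros(n_hidden)             # hidden biases (forward pass)
a_visible = np.zeros(n_visible)           # visible biases (backward pass)

# Forward pass: inputs times weights, plus bias, through a sigmoid.
p_hidden = sigmoid(v @ W + b_hidden)
# Stochastic decision: each hidden node turns on with probability p_hidden.
h = (rng.random(n_hidden) < p_hidden).astype(float)

# The first hidden node uses only the first column of the weight matrix.
first_node = sigmoid(v @ W[:, 0] + b_hidden[0])

# Backward pass: reconstruct the visible layer from the hidden sample.
p_visible = sigmoid(h @ W.T + a_visible)
```

The same weight matrix is used in both directions; only the bias vector changes between the forward and backward pass.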

## More Information:

https://medium.com/edureka/restricted-boltzmann-machine-tutorial-991ae688c154

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6997788/

https://www.frontiersin.org/articles/10.3389/fphar.2019.01631/full

https://cacm.acm.org/magazines/2021/7/253464-deep-learning-for-ai/fulltext

https://www.theaidream.com/post/introduction-to-restricted-boltzmann-machines-rbms

https://onlinelibrary.wiley.com/doi/abs/10.1207/s15516709cog0901_7

https://vinodsblog.com/2020/07/28/deep-learning-introduction-to-boltzmann-machines/

http://www.cs.toronto.edu/~hinton/papers.html

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6997788/pdf/fphar-10-01631.pdf

https://research.google.com/pubs/GeoffreyHinton.html?source=post_page---------------------------
