This is different from the separate models, where no similarities could be exploited to reduce the effective number of parameters. From this perspective, the hierarchical model is simpler than the sum of the separate ones above. Finally, I wondered what the group model actually learned.
It seems like the group model is representing the Z-shape in a fairly noisy way, which makes sense because the decision surface for every sub-model looks quite different. Here, we look at the correlations in the first layer of the group distribution.
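A minimal sketch of that correlation check, assuming we have posterior samples of the first-layer group weights (here simulated with random draws as a placeholder for the fitted model's trace):

```python
import numpy as np

# Hypothetical stand-in for posterior samples of the first-layer group
# weights, shape (n_samples, n_in, n_hidden); in the post these would
# come from the fitted hierarchical model's trace.
rng = np.random.default_rng(0)
n_samples, n_in, n_hidden = 500, 2, 5
w_samples = rng.normal(size=(n_samples, n_in, n_hidden))

# Flatten each sample's weight matrix and compute pairwise correlations
# between individual weights across posterior samples.
flat = w_samples.reshape(n_samples, -1)    # (500, 10)
corr = np.corrcoef(flat, rowvar=False)     # (10, 10) correlation matrix
```

With real posterior samples, off-diagonal structure in `corr` is what reveals the dependencies between first-layer weights.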
Informative priors are a powerful concept in Bayesian modeling. Any expert information you encode in your priors can greatly improve your inference. The same should hold true for BNNs, but it raises the question of how we can define informative priors over weights, which live in an abstract space that is very difficult to reason about (understanding the learned representations of neural networks is an active research topic).
While I don't know how to answer that in general, we can nonetheless explore this question with the techniques we've developed so far. The group distributions from our hierarchical model provide structured regularization for the subnetworks.
But there is no reason we can only use the group distributions in a hierarchical network. We can take the inferred group structure and reapply it in the form of informative priors on individual, flat networks. For this, we must first estimate the group distribution as it looks to the subnetworks, which is essentially sampling from the group posterior predictive distribution. While there is no guarantee that this distribution is normal (technically it is a mixture of normals, so it could look much more like a Student-T), it is a good enough approximation in this case.
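A sketch of that estimation step, assuming `pred_samples` holds draws from the group posterior predictive over a flattened weight vector (simulated here for illustration):

```python
import numpy as np

# `pred_samples` stands in for draws from the hierarchical model's group
# posterior predictive; here it is simulated random data.
rng = np.random.default_rng(1)
pred_samples = rng.normal(size=(1000, 10))    # (n_draws, n_weights)

# Approximate the posterior predictive with a single Gaussian by
# matching its first two moments.
mu_hat = pred_samples.mean(axis=0)             # estimated prior mean
cov_hat = np.cov(pred_samples, rowvar=False)   # estimated prior covariance
```

`mu_hat` and `cov_hat` are then all we need to parameterize an informative multivariate normal prior for the flat networks.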
As the correlation structure of the group distributions seems to play a key role, as we've seen above, we use MvNormal priors. Again, we just loop over the categories in our data set and create a separate BNN for each one. This is identical to our first attempt above; however, now we set the prior to the estimate obtained from the group posterior of our hierarchical model in the second approach.
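The per-category loop can be sketched as follows, with `mu_hat` and `cov_hat` as placeholders for the group-posterior estimates; drawing a sample per category just sanity-checks that the same informative MvNormal prior is reused for every flat network:

```python
import numpy as np

# Placeholders for the prior parameters estimated from the group posterior.
rng = np.random.default_rng(2)
n_weights = 10
mu_hat = np.zeros(n_weights)
cov_hat = np.eye(n_weights)

categories = ["cat_a", "cat_b", "cat_c"]    # placeholder category labels
prior_draws = {
    # The same informative MvNormal prior is attached to each category's
    # flat BNN; here we draw one prior sample per category as a check.
    cat: rng.multivariate_normal(mu_hat, cov_hat)
    for cat in categories
}
```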
As demonstrated, informed priors can help NNs a lot. But what if we don't have a hierarchical structure, or it would be too expensive to estimate? We could attempt to construct priors by deriving them from pre-trained models. For example, if I wanted to train an object recognition model on my own custom categories, I could start with a model like ResNet trained on the CIFAR data set, derive priors from its weights, and then train a new model on my custom data set, which could then get by with fewer images than if we trained from scratch.
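One simple way to do this could look like the following sketch: center a Gaussian prior on the pre-trained weights with a small, fixed standard deviation. Here `pretrained_w` is a random placeholder for a layer's weight matrix, and the 0.1 scale is an arbitrary illustrative choice, not a recommendation:

```python
import numpy as np

# `pretrained_w` stands in for one layer's weights from a pre-trained
# model such as ResNet; simulated here for illustration.
rng = np.random.default_rng(3)
pretrained_w = rng.normal(size=(4, 8))

prior_mu = pretrained_w                     # center the prior on the pre-trained values
prior_sd = np.full_like(pretrained_w, 0.1)  # small sd keeps new weights nearby
```

A tighter `prior_sd` trusts the pre-trained weights more; a looser one lets the new data move them further.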
In this blog post I showed how we can borrow ideas from Bayesian statistics (hierarchical modeling and informative priors) and apply them to deep learning to boost accuracy when our data set is nested and we may not have a huge amount of data. If you want to play around with this notebook yourself, download it here. If you enjoyed this blog post, please consider supporting me on Patreon. Thanks to Adrian Seyboldt and Chris Chatham for useful discussions and feedback. Thanks also to my patrons, particularly Jonathan Ng and Vladislavs Dovgalecs.
Recommendation Using Deep Neural Networks.
Because matrix factorization learns a fixed embedding for each user and item seen during training, the model can only be queried with a user or item present in the training set. Relevance of recommendations.
As you saw in the first Colab, popular items tend to be recommended for everyone, especially when using a dot product as the similarity measure. It is better to capture specific user interests. The output is a probability vector whose size equals the number of items in the corpus, representing the probability of interacting with each item; for example, the probability of clicking on or watching a YouTube video. Input. The input to a DNN can include:

- dense features (for example, watch time and time since last watch)
- sparse features (for example, watch history and country)

Unlike the matrix factorization approach, you can add side features such as age or country.
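The output layer described above can be sketched as a softmax over the item corpus; the user embedding and per-item weight vectors below are random placeholders, not trained values:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical user representation and per-item weights for a corpus of
# 1000 items, each embedded in 16 dimensions.
rng = np.random.default_rng(4)
user = rng.normal(size=16)
item_weights = rng.normal(size=(1000, 16))

# One probability per item in the corpus; the vector sums to 1.
probs = softmax(item_weights @ user)
```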
Initially, all weights are random, and the answers that come out of the net are probably nonsensical. The network learns through training. Examples for which the output is known are repeatedly presented to the network, and the answers it gives are compared to the known outcomes. Information from this comparison is passed back through the network, gradually changing the weights. As training progresses, the network becomes increasingly accurate in replicating the known outcomes. Once trained, the network can be applied to future cases where the outcome is unknown.
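The training loop described above can be sketched in its simplest form: a one-layer network (logistic regression) fit by gradient descent on examples with known outcomes, using made-up synthetic data:

```python
import numpy as np

# Synthetic examples with known outcomes: label is 1 when x0 + x1 > 0.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = rng.normal(size=2)   # weights start out random
b = 0.0
lr = 0.5
for _ in range(200):
    # The network's current answers for the known examples.
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # The comparison to the known outcomes is passed back as a gradient,
    # gradually changing the weights.
    w -= lr * X.T @ (p - y) / len(y)
    b -= lr * (p - y).mean()

# After training, the network replicates the known outcomes much better.
p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
accuracy = ((p > 0.5) == (y == 1)).mean()
```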