Personal Webpage of Max Horn
This is the personal webpage of Max Horn, PhD Student in Machine Learning and Computational Biology at ETH Zürich.
https://ExpectationMax.github.io/
Sat, 23 May 2020 18:37:59 +0000
Summary of Talks and Posters from ICLR 2020
<p>This year's ICLR was special for me (and for many others as well), as it was
the first virtual conference I have ever attended (and, given the current
situation with COVID-19, it will surely not be the last). At the virtual ICLR,
posters were replaced with short prerecorded videos of 5 minutes in which the
authors briefly present their work. These videos can be accessed at any time,
independent of the “poster session” in which it was possible to talk to one or
more authors of the paper. One benefit of the conference being completely
virtual is definitely that it allows one to spread out looking at the “posters”
over a longer time (if willing to sacrifice the possibility to talk with the
authors in a virtual poster session).</p>
<p>Overall, I found this format very appealing, as it gives a quick
overview of the work and makes it possible to look at many works without getting
overwhelmed, since the presentations are usually kept at a high level. For more
detailed information it is always possible to pull up the paper.</p>
<p>A few weeks after the virtual conference ended, I finally managed to summarize
some of the papers and posters I looked at, and thought I might as well share
the summaries with anyone who could be interested. Below you can find a subset of the papers
I selected for reading, each accompanied by a sometimes more, sometimes less
detailed summary. I must say that I am still not completely finished
with all the papers I marked and thus might update this page at a later time.</p>
<ul id="markdown-toc">
<li><a href="#anomaly-detection" id="markdown-toc-anomaly-detection">Anomaly Detection</a> <ul>
<li><a href="#deep-semi-supervised-anomaly-detection" id="markdown-toc-deep-semi-supervised-anomaly-detection">Deep Semi-Supervised Anomaly Detection</a></li>
<li><a href="#input-complexity-and-out-of-distribution-detection-with-likelihood-based-generative-models" id="markdown-toc-input-complexity-and-out-of-distribution-detection-with-likelihood-based-generative-models">Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models</a></li>
</ul>
</li>
<li><a href="#generative-models" id="markdown-toc-generative-models">Generative Models</a> <ul>
<li><a href="#understanding-the-limitations-of-conditional-generative-models" id="markdown-toc-understanding-the-limitations-of-conditional-generative-models">Understanding the Limitations of Conditional Generative Models</a></li>
<li><a href="#your-classifier-is-secretly-an-energy-based-model-and-you-should-treat-it-like-one" id="markdown-toc-your-classifier-is-secretly-an-energy-based-model-and-you-should-treat-it-like-one">Your classifier is secretly an energy based model and you should treat it like one</a></li>
</ul>
</li>
<li><a href="#gradient-estimation" id="markdown-toc-gradient-estimation">Gradient Estimation</a> <ul>
<li><a href="#estimating-gradients-for-discrete-random-variables-by-sampling-without-replacement" id="markdown-toc-estimating-gradients-for-discrete-random-variables-by-sampling-without-replacement">Estimating Gradients for Discrete Random Variables by Sampling without Replacement</a></li>
<li><a href="#sumo-unbiased-estimation-of-log-marginal-probability-for-latent-variable-models" id="markdown-toc-sumo-unbiased-estimation-of-log-marginal-probability-for-latent-variable-models">SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable Models</a></li>
</ul>
</li>
<li><a href="#pooling-and-set-functions" id="markdown-toc-pooling-and-set-functions">Pooling and Set Functions</a> <ul>
<li><a href="#fspool-learning-set-representations-with-featurewise-sort-pooling" id="markdown-toc-fspool-learning-set-representations-with-featurewise-sort-pooling">FSPool: Learning Set Representations with Featurewise Sort Pooling</a></li>
<li><a href="#on-universal-equivariant-set-networks" id="markdown-toc-on-universal-equivariant-set-networks">On Universal Equivariant Set Networks</a></li>
<li><a href="#structpool-structured-graph-pooling-via-conditional-random-fields" id="markdown-toc-structpool-structured-graph-pooling-via-conditional-random-fields">StructPool: Structured Graph Pooling via Conditional Random Fields</a></li>
</ul>
</li>
<li><a href="#representation-learning" id="markdown-toc-representation-learning">Representation Learning</a> <ul>
<li><a href="#disentanglement-by-nonlinear-ica-with-general-incompressible-flow-networks-gin" id="markdown-toc-disentanglement-by-nonlinear-ica-with-general-incompressible-flow-networks-gin">Disentanglement by Nonlinear ICA with General Incompressible-flow Networks (GIN)</a></li>
<li><a href="#infograph-unsupervised-and-semi-supervised-graph-level-representation-learning-via-mutual-information-maximization" id="markdown-toc-infograph-unsupervised-and-semi-supervised-graph-level-representation-learning-via-mutual-information-maximization">InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization</a></li>
<li><a href="#mutual-information-gradient-estimation-for-representation-learning" id="markdown-toc-mutual-information-gradient-estimation-for-representation-learning">Mutual Information Gradient Estimation for Representation Learning</a></li>
<li><a href="#on-mutual-information-maximization-for-representation-learning" id="markdown-toc-on-mutual-information-maximization-for-representation-learning">On Mutual Information Maximization for Representation Learning</a></li>
</ul>
</li>
<li><a href="#seq2seq-models" id="markdown-toc-seq2seq-models">Seq2Seq models</a> <ul>
<li><a href="#are-transformers-universal-approximators-of-sequence-to-sequence-functions" id="markdown-toc-are-transformers-universal-approximators-of-sequence-to-sequence-functions">Are Transformers universal approximators of sequence-to-sequence functions?</a></li>
<li><a href="#mogrifier-lstm" id="markdown-toc-mogrifier-lstm">Mogrifier LSTM</a></li>
<li><a href="#reformer-the-efficient-transformer" id="markdown-toc-reformer-the-efficient-transformer">Reformer: The Efficient Transformer</a></li>
</ul>
</li>
<li><a href="#understanding-deep-learning" id="markdown-toc-understanding-deep-learning">Understanding Deep Learning</a> <ul>
<li><a href="#four-things-everyone-should-know-to-improve-batch-normalization" id="markdown-toc-four-things-everyone-should-know-to-improve-batch-normalization">Four Things Everyone Should Know to Improve Batch Normalization</a></li>
<li><a href="#on-the-variance-of-the-adaptive-learning-rate-and-beyond" id="markdown-toc-on-the-variance-of-the-adaptive-learning-rate-and-beyond">On the Variance of the Adaptive Learning Rate and Beyond</a></li>
<li><a href="#the-implicit-bias-of-depth-how-incremental-learning-drives-generalization" id="markdown-toc-the-implicit-bias-of-depth-how-incremental-learning-drives-generalization">The Implicit Bias of Depth: How Incremental Learning Drives Generalization</a></li>
<li><a href="#towards-neural-networks-that-provably-know-when-they-dont-know" id="markdown-toc-towards-neural-networks-that-provably-know-when-they-dont-know">Towards neural networks that provably know when they don’t know</a></li>
<li><a href="#truth-or-backpropaganda-an-empirical-investigation-of-deep-learning-theory" id="markdown-toc-truth-or-backpropaganda-an-empirical-investigation-of-deep-learning-theory">Truth or backpropaganda? An empirical investigation of deep learning theory</a></li>
<li><a href="#what-graph-neural-networks-cannot-learn-depth-vs-width" id="markdown-toc-what-graph-neural-networks-cannot-learn-depth-vs-width">What graph neural networks cannot learn: depth vs width</a></li>
<li><a href="#why-gradient-clipping-accelerates-training-a-theoretical-justification-for-adaptivity" id="markdown-toc-why-gradient-clipping-accelerates-training-a-theoretical-justification-for-adaptivity">Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity</a></li>
</ul>
</li>
<li><a href="#other---normalizing-flows-robustness-meta-learning-probabilistic-modelling" id="markdown-toc-other---normalizing-flows-robustness-meta-learning-probabilistic-modelling">Other - Normalizing Flows, Robustness, Meta-Learning, Probabilistic Modelling</a> <ul>
<li><a href="#invertible-models-and-normalizing-flows" id="markdown-toc-invertible-models-and-normalizing-flows">Invertible Models and Normalizing Flows</a></li>
<li><a href="#learning-to-balance-bayesian-meta-learning-for-imbalanced-and-out-of-distribution-tasks" id="markdown-toc-learning-to-balance-bayesian-meta-learning-for-imbalanced-and-out-of-distribution-tasks">Learning to Balance: Bayesian Meta-Learning for Imbalanced and Out-of-distribution Tasks</a></li>
<li><a href="#meta-dropout-learning-to-perturb-latent-features-for-generalization" id="markdown-toc-meta-dropout-learning-to-perturb-latent-features-for-generalization">Meta Dropout: Learning to Perturb Latent Features for Generalization</a></li>
<li><a href="#on-robustness-of-neural-ordinary-differential-equations" id="markdown-toc-on-robustness-of-neural-ordinary-differential-equations">On Robustness of Neural Ordinary Differential Equations</a></li>
<li><a href="#why-not-to-use-zero-imputation-correcting-sparsity-bias-in-training-neural-networks" id="markdown-toc-why-not-to-use-zero-imputation-correcting-sparsity-bias-in-training-neural-networks">Why Not to Use Zero Imputation? Correcting Sparsity Bias in Training Neural Networks</a></li>
</ul>
</li>
</ul>
<h2 id="anomaly-detection">Anomaly Detection</h2>
<h3 id="deep-semi-supervised-anomaly-detection"><a href="https://iclr.cc/virtual/poster_HkgH0TEYwH.html">Deep Semi-Supervised Anomaly Detection</a></h3>
<p>Anomaly detection in the semi-supervised setting: Additional labeled samples
are available to leverage expert knowledge.</p>
<p>Goal: Improve decision boundary using the labeled data.</p>
<p>A supervised classifier usually performs badly on unseen samples. Unsupervised
approaches cannot use the labeled examples to improve.</p>
<p>The authors suggest the Deep SAD method, which is an extension of the Deep SVDD method.
The SVDD method is unsupervised and tries to map examples (non-anomalies)
into a hypersphere that is as compact as possible.</p>
<p>(I wonder how this is different from maximum
likelihood training of a simple generative model such as a flow? In the end
we simply penalize the distance from the mode. Of course this approach does
not penalize contraction / stretching of the space such that it can map all
examples to a very small volume).</p>
<p>The SVDD approach tries to minimize the following objective:</p>
<script type="math/tex; mode=display">\frac{1}{n} \sum_{i=1}^n || \phi(X_i; \theta) - c ||^2</script>
<p>and the authors of this paper suggest including an additional term which
penalizes anomalous samples for being close to the center of the sphere.
The objective then becomes:</p>
<script type="math/tex; mode=display">\frac{1}{n+m} \sum_{i=1}^n || \phi(X_i; \theta) - c ||^2 + \frac{\eta}{n+m} \sum_{j=1}^m \left(|| \phi(\tilde{X}_j; \theta) - c ||^2 \right)^{y_j}, \quad \eta > 0</script>
<p>where $X_i$ are the $n$ unlabeled samples, $\tilde{X}_j$ are the $m$ labeled
samples, and $y_j=1$ for normal samples and $y_j=-1$ for anomalies.</p>
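<p>To make the effect of the exponent $y_j$ concrete, here is a minimal NumPy
sketch of the objective above. It operates on precomputed embeddings, so the
network $\phi$ and its training loop are left out; the function name, the
argument names and the small <code>eps</code> stabilizer are my own assumptions
and not taken from the paper.</p>

```python
import numpy as np

def deep_sad_loss(z_unlabeled, z_labeled, y_labeled, center, eta=1.0, eps=1e-6):
    """Deep SAD style objective on embeddings (hypothetical helper, no weight decay).

    z_unlabeled: (n, d) embeddings phi(X_i; theta) of unlabeled samples
    z_labeled:   (m, d) embeddings of labeled samples
    y_labeled:   (m,) labels, +1 for normal and -1 for anomalous samples
    center:      (d,) hypersphere center c
    """
    n, m = len(z_unlabeled), len(z_labeled)
    y = np.asarray(y_labeled, dtype=float)
    dist_u = np.sum((z_unlabeled - center) ** 2, axis=1)
    dist_l = np.sum((z_labeled - center) ** 2, axis=1)
    # For y = -1 the squared distance is inverted, so anomalies close to the
    # center are penalized heavily; eps (an assumption) avoids division by zero.
    labeled_term = (dist_l + eps) ** y
    return dist_u.sum() / (n + m) + eta * labeled_term.sum() / (n + m)
```

<p>Moving a labeled anomaly away from the center lowers this loss, while moving
a normal sample away from the center raises it.</p>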
<p>Makes perfect sense. Additionally, the authors benchmark their method and
present an information-theoretic framework for deep anomaly detection in
their paper.</p>
<h3 id="input-complexity-and-out-of-distribution-detection-with-likelihood-based-generative-models"><a href="https://iclr.cc/virtual/poster_SyxIWpVYvr.html">Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models</a></h3>
<p>The obvious strategy for detecting OOD samples with generative models (train model
on data, label low likelihood samples as OOD) does not work. Often the
out-of-distribution samples actually have significantly higher likelihood than
samples from the training distribution.</p>
<p>Intuition of paper: The input complexity of the dataset images influences
the likelihood.</p>
<p>The authors analyse the normalized size of an image after lossless
compression, which serves as a proxy for its Kolmogorov complexity.
<em>Complexity of image seems to correlate with the likelihood (more complex,
less likely).</em> Most of the variance in the likelihood is explained by this.</p>
<p>They suggest correcting the likelihood by subtracting the complexity estimate of
the image, and show that this can be interpreted as a likelihood ratio test
statistic, giving links to Bayesian model comparison, minimum description
length and Occam's razor.</p>
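<p>A rough sketch of such a corrected score is shown below. It assumes the model's
negative log-likelihood is already available in bits per dimension and uses
<code>zlib</code> as a stand-in lossless compressor; both the compressor choice and
the function names are assumptions made for illustration, and the sign
convention (higher score means more OOD-like) is my reading of the idea rather
than a statement of the authors' exact implementation.</p>

```python
import zlib
import numpy as np

def complexity_bits_per_dim(image):
    """Compressed size of a uint8 image in bits per dimension (complexity proxy)."""
    arr = np.ascontiguousarray(image, dtype=np.uint8)
    compressed = zlib.compress(arr.tobytes(), 9)
    return 8.0 * len(compressed) / arr.size

def corrected_ood_score(nll_bits_per_dim, image):
    """Negative log-likelihood minus the complexity estimate.

    nll_bits_per_dim is assumed to come from a trained generative model.
    Higher values are treated as more out-of-distribution.
    """
    return nll_bits_per_dim - complexity_bits_per_dim(image)
```

<p>For the same model likelihood, a simple (highly compressible) image receives a
higher score than a complex one, which counteracts the tendency of simple
images to obtain high likelihoods.</p>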
<p>Results indicate higher performance compared to many out-of-distribution
detection methods. The only method based on generative modelling which
outperforms the approach is WAIC, which relies on ensembles of generative models
and thus is significantly more costly to optimize.</p>
<p><em>Input complexity is the main culprit for the failure of likelihood-based
out-of-distribution detection, and proxies of input complexity can be used to
design corrected scores.</em></p>
<h2 id="generative-models">Generative Models</h2>
<h3 id="understanding-the-limitations-of-conditional-generative-models"><a href="https://iclr.cc/virtual/poster_r1lPleBFvH.html">Understanding the Limitations of Conditional Generative Models</a></h3>
<p>Main problem of conditional generative models: <em>Not robust</em>, do not
recognise out of distribution samples.</p>
<p>Ideally, in a good generative model undetected (high density) adversarial
attacks should have a low volume.
Problem: Even if adversarial attacks are close to this low volume set, the
surrounding area itself would again be large due to the curse of dimensionality.
The theory of the paper suggests that one can always construct an
adversarial attack that would not be detected (?, have to check on
this in the paper).</p>
<p>Results: The MNIST model is robust, the CIFAR model is not.
Interpolated examples (in data space) on CIFAR even have higher likelihood
than the samples themselves.
Explanation: Classification and generation are very different things.
Classification only cares about very few aspects of the data, while generation
tries to model every single aspect of the data.</p>
<p>The authors suggest that the class-unrelated entropy (background, nuisance
variables) in CIFAR is the reason for these models failing. They demonstrate this
with a new dataset where MNIST digits are placed on CIFAR backgrounds, thus
increasing the class-unrelated entropy. Then MNIST also fails similarly to
CIFAR in the previous example.</p>
<p>Authors argue that much of the problem comes from the standard likelihood
objective which tries to model everything, while for classification one only
cares about selected bits of the data.</p>
<h3 id="your-classifier-is-secretly-an-energy-based-model-and-you-should-treat-it-like-one"><a href="https://iclr.cc/virtual/poster_Hkxzx0NtDB.html">Your classifier is secretly an energy based model and you should treat it like one</a></h3>
<p>Criticism: Progress in generative models is driven by likelihood and sample
quality, not by downstream tasks such as:</p>
<ul>
<li>Out of Distribution detection</li>
<li>Robust classification</li>
<li>Semi-supervised learning</li>
</ul>
<p>In practice, the generative model based solutions on these tasks usually lag
behind engineered solutions. <em>Why?</em></p>
<ul>
<li>Not flexible enough?</li>
<li>Architecture only good for generation, not for classification etc.</li>
<li>Additional modelling constraints are making models worse for
discrimination (such as invertibility for flows)</li>
</ul>
<p>Alternative: Use energy based models!</p>
<ul>
<li>Very flexible! We only need an energy function</li>
<li>But we cannot easily compute the normalizing constant which leads to
some problems with regard to sampling and training</li>
</ul>
<p><em>How to train?</em>
While log likelihood does not have a nice form, the gradient of log p can be
written quite simply:</p>
<script type="math/tex; mode=display">\frac{\partial \log p_{\theta}(x)}{\partial \theta} =
E_{p_{\theta}(x')} [ \partial E_{\theta}(x') / \partial \theta ]
- \partial E_{\theta}(x) / \partial \theta</script>
<p>where the last term is evaluated on the data and the first uses samples from
the model (which is also tricky, usually done using MCMC).</p>
<p>Contribution:<br />
Take classifier models and, instead of using the softmax, define
an energy based model using the inputs to the softmax. The energy
is the negative of the preactivation indexed by the class:
$E_{\theta}(x, y) = -f_{\theta}(x)[y]$.
This EBM can be trained and later used to predict the class using basic
rules of probability, which simply results in a softmax. Further, we can sum
out $y$, which also results in an EBM of the form
$E_{\theta}(x) = - \operatorname{LogSumExp}_y(f_{\theta}(x)[y])$, which is a purely generative model.
<img src="/assets/2020-05-19-ICLR-2020/clf_is_ebm_summary.png" alt="Overview figure" /></p>
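<p>The relation between the two energies and the softmax can be checked
numerically. The following NumPy sketch works directly on a batch of logits
$f_\theta(x)$; the function names are hypothetical and the classifier producing
the logits is assumed to exist elsewhere.</p>

```python
import numpy as np

def logsumexp(a, axis=-1):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def joint_energy(logits, y):
    """E(x, y) = -f(x)[y] for logits of shape (batch, classes) and labels y."""
    return -logits[np.arange(len(logits)), y]

def marginal_energy(logits):
    """E(x) = -LogSumExp_y f(x)[y], obtained by summing y out."""
    return -logsumexp(logits, axis=-1)

def class_posterior(logits):
    """p(y | x) = exp(-E(x, y)) / exp(-E(x)), which is exactly the softmax."""
    return np.exp(logits + marginal_energy(logits)[:, None])
```

<p>Since the unknown normalizing constant cancels in the ratio, the class
posterior is tractable even though $p(x)$ and $p(x, y)$ are not.</p>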
<p><em>What to do with this insight? - Experiments</em><br />
Train the factorized objective $\log p(x, y) = \log p(x) + \log p(y|x)$ to
ensure unbiased training of the classifier, as the $\log p(y|x)$ term does not
require sampling from the joint distribution, which could be biased. (This is
actually reflected in the results, which show poor performance when the
distribution is not factored this way). Nevertheless, this
hybrid model (JEM, the authors' joint energy-based model) is the only one which
yields classification performance comparable to the state of the art. Further,
JEM improves calibration of the models
significantly compared to baseline approaches. The JEM model is also
better at recognizing out of distribution samples and adversarial examples.
For the latter they use a trick where they seed the MCMC chain at the position of
the adversarial sample and execute a few MCMC steps with respect to the learned
data distribution, further improving adversarial robustness significantly.</p>
<p>Nevertheless, training is still very unstable due to MCMC sampling. Learning
is also hard to diagnose as there is no clear loss definition.</p>
<h2 id="gradient-estimation">Gradient Estimation</h2>
<h3 id="estimating-gradients-for-discrete-random-variables-by-sampling-without-replacement"><a href="https://iclr.cc/virtual/poster_rklEj2EFvB.html">Estimating Gradients for Discrete Random Variables by Sampling without Replacement</a></h3>
<p>Discrete variables do not allow the computation of a gradient which is
needed for gradient based optimization. This problem is usually mitigated by
one of two approaches:</p>
<ul>
<li>Relaxation of the discrete distribution to a continuous distribution
(possibly combined with sampling) such as Gumbel-Softmax or Concrete methods. These
unfortunately usually have a high bias.</li>
<li>Sampling and deriving stochastic gradients usually based on REINFORCE.
REINFORCE gradients usually have high variance.</li>
</ul>
<p>Authors say that REINFORCE is basically a “trick” for pulling the gradient
operation inside the expectation:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\nabla_\theta E_{p_\theta(x)}[f(x)] &= E_{p_\theta(x)} [ \nabla_\theta \log p_\theta(x) f(x) ] \\
&\approx \nabla_\theta \log p_\theta (x) f(x), \quad x \sim p_\theta \\
&\approx \frac{1}{k} \sum_{i=1}^k \nabla_\theta \log p_\theta (x_i) f(x_i), \quad x_i \sim p_\theta
\end{aligned} %]]></script>
<p>Further, one can use the average of the other samples as a baseline in order to reduce the
variance of the estimate (REINFORCE with baseline, Mnih & Rezende 2016):</p>
<script type="math/tex; mode=display">\nabla_\theta E_{p_\theta(x)}[f(x)] \approx \frac{1}{k} \sum_{i=1}^k \nabla_\theta \log p_\theta (x_i)
\left( f(x_i) - \frac{\sum_{j \neq i} f(x_j)}{k - 1} \right)</script>
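<p>As an illustration of the with-replacement estimator with a leave-one-out
baseline, here is a small NumPy sketch for a categorical distribution
parameterized by softmax logits. The setup (categorical variable, tabulated
$f$, function names) is an assumption chosen so that the exact gradient is
available for comparison.</p>

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_loo_gradient(logits, f_values, k, rng):
    """Estimate the gradient of E_{x ~ p}[f(x)] w.r.t. the logits of p.

    Uses k samples drawn WITH replacement and, for every sample, the mean
    of f over the other k - 1 samples as a baseline.
    """
    p = softmax(logits)
    x = rng.choice(len(p), size=k, p=p)
    f = f_values[x]
    baseline = (f.sum() - f) / (k - 1)          # leave-one-out mean
    score = np.eye(len(p))[x] - p               # grad of log p(x_i) w.r.t. logits
    return (score * (f - baseline)[:, None]).mean(axis=0)

def exact_gradient(logits, f_values):
    """Closed-form gradient for the categorical case, used as a reference."""
    p = softmax(logits)
    return p * (f_values - p @ f_values)
```

<p>Because the baseline of sample $i$ does not depend on $x_i$, it does not
introduce bias; with many samples the estimate approaches the exact gradient.</p>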
<p>In contrast to previous work the authors <em>suggest to sample without
replacement</em>, as duplicate samples are uninformative in a deterministic
setting. This leads to a sequence of ordered samples $B = (b_1, b_2, b_3)$
(here for three samples) drawn from the distribution such that</p>
<script type="math/tex; mode=display">p(B) = p(b_1) \times \frac{p(b_2)}{1-p(b_1)} \times \frac{p(b_3)}{1 - p(b_1) - p(b_2)}</script>
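<p>The probability of an ordered sample without replacement simply renormalizes
by the probability mass that has not been drawn yet. A short sketch (the
function name is my own) makes this explicit:</p>

```python
import numpy as np

def ordered_sample_prob(p, order):
    """Probability of drawing the indices in `order`, one after another,
    without replacement from a categorical distribution p."""
    prob, remaining = 1.0, 1.0
    for b in order:
        prob *= p[b] / remaining   # renormalize over what is still available
        remaining -= p[b]
    return prob
```

<p>Summing this probability over all ordered sequences of a fixed length gives
one, which is a quick sanity check of the formula.</p>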
<p>In their paper, they derive a generic estimator for $E[f(x)]$ by
Rao-Blackwellizing the crude single sample estimator (which is based on
a single Monte Carlo sample). They call this estimator the unordered set
estimator:</p>
<script type="math/tex; mode=display">E_{p_\theta(x)}[f(x)] = E_{p_\theta(S^k)}[e^{US}(S^k)] = E_{p_\theta(S^k)} \left[\sum_{s \in S^k} p(s) R(S^k, s) f(s)\right]</script>
<p>where $R(S^k, s) = \frac{p^{D \setminus \{ s \} } (S^k \setminus \{ s \} )}{p(S^k)}$
is the <em>leave-one-out ratio</em>, $p^{D \setminus \{ s \}}$ denotes the
distribution restricted to the domain $D$ without $s$, and $S^k$ is an unordered
sample of size $k$ without replacement.
They then apply REINFORCE to the derived estimator for the computation of
gradients, which gives rise to the unordered set policy gradient estimator:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
& E_{p_\theta(x)} [ \nabla_\theta \log p_\theta (x) f(x) ] \\
&= E_{p_\theta(S^k)} [ e^{USPG}(S^k)] \\
&= E_{p_\theta(S^k)} \left[ \sum_{s \in S^k} p_\theta(s) R(S^k, s) \nabla_\theta \log p_\theta(s) f(s) \right] \\
&= E_{p_\theta(S^k)} \left[ \sum_{s \in S^k} R(S^k, s) \nabla_\theta p_\theta(s) f(s) \right]
\end{aligned} %]]></script>
<p>where the last step can be derived using the log derivative trick.</p>
<p>Further, the authors use an approach similar to Mnih & Rezende (2016) in
order to reduce the variance by subtracting a baseline based on the other
samples. This is not entirely trivial though, as the samples are not
independent and thus a correction needs to be applied.</p>
<p>Experiments:</p>
<ul>
<li>Synthetic example: Shows lowest gradient variance compared to all other
methods</li>
<li>Policy for travelling salesman problem: Comparable to biased approaches
and outperforms all unbiased approaches</li>
</ul>
<p>The devised estimator is a low variance unbiased estimator which can be used
as a replacement for Gumbel-Softmax.</p>
<h3 id="sumo-unbiased-estimation-of-log-marginal-probability-for-latent-variable-models"><a href="https://iclr.cc/virtual/poster_SylkYeHtwr.html">SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable Models</a></h3>
<p>Suggest a new unbiased estimator for the log marginal probability and its gradient.</p>
<p>Maximum likelihood estimation for Latent Variable Models (LVM) requires
unbiased estimates of $\nabla_\theta \log p_\theta(x)$ which are not
directly available. Instead research has focused on developing lower bounds
for the log marginal probability $\log p_\theta(x)$ such as the ELBO or IWAE
and to optimize the model parameters with respect to this lower bound.</p>
<p>The approach devised in this paper is composed of two components:</p>
<ul>
<li>
<p>Importance weighted bounds (IWAE): Here the idea is to use multiple
samples, instead of a single sample as with the ELBO. This
results in an increasingly tighter bound on the true marginal
log-likelihood, such that</p>
<script type="math/tex; mode=display">\begin{gather}
ELBO = E[IWAE_1(x)] \leq E[IWAE_2(x)] \leq \dots \leq \log p_\theta(x) \\
\log p_\theta(x) = \lim_{K \rightarrow \infty} E[IWAE_K(x)]
\end{gather}</script>
<p>where $K$ denotes the number of samples used to compute the lower bound.</p>
</li>
<li>
<p>Russian roulette estimators: Estimator used to compute the value of an
infinite series</p>
<script type="math/tex; mode=display">\sum^\infty_{k=1} \Delta_k = E_{K \sim p(K)} \left[ \sum^K_{k=1} \frac{\Delta_k}{P(\mathcal{K} \geq k)} \right]</script>
<p>which basically truncates the series at a random $K$ and reweights each term
by the inverse of the probability that the truncation point is at least $k$.
This holds if the series converges absolutely, such that
$\sum_{k=1}^\infty |\Delta_k| \lt \infty$.</p>
</li>
</ul>
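<p>The Russian roulette idea is easy to try on a series with a known limit. The
sketch below assumes a geometric distribution for the truncation point $K$
(an arbitrary choice made here for illustration, not necessarily the one used
in the paper) and a hypothetical function <code>delta</code> returning the terms
of the series.</p>

```python
import numpy as np

def russian_roulette(delta, q, rng):
    """Single-sample unbiased estimate of sum_{k >= 1} delta(k).

    K is geometric on {1, 2, ...} with success probability q, so that
    P(K >= k) = (1 - q) ** (k - 1).
    """
    K = rng.geometric(q)
    k = np.arange(1, K + 1)
    tail = (1.0 - q) ** (k - 1)      # P(K >= k) for each evaluated term
    return np.sum(delta(k) / tail)
```

<p>For example, with <code>delta = lambda k: 0.5 ** k</code> the series sums to one,
and the average of many estimates should be close to that value even though
each estimate only evaluates finitely many terms.</p>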
<p>The authors suggest SUMO (Stochastically Unbiased Marginalization
Objective), by combining the IWAE lower bound with the Russian Roulette
estimator:</p>
<script type="math/tex; mode=display">\begin{gather}
\Delta_k (x) = IWAE_{k+1}(x) - IWAE_k(x) \\
SUMO(x) = IWAE_1(x) + \sum^K_{k=1} \frac{\Delta_k (x)}{P(\mathcal{K} \geq k)}
\end{gather}</script>
<p>where $K \sim p(K)$. This objective is unbiased, such that
$E[SUMO(x)] = \log p_\theta(x)$, and under some conditions (that the
gradient of SUMO is bounded and differentiable everywhere) also
$E[\nabla_\theta SUMO(x)] = \nabla_\theta E[SUMO(x)] = \nabla_\theta \log p_\theta (x)$.
The choice of $p(K)$ determines the variance of the estimator and the compute
cost.</p>
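<p>Given log importance weights, the SUMO value for one draw of $K$ can be
assembled in a few lines. This is a sketch under assumptions: the weights
$\log w_i = \log p_\theta(x, z_i) - \log q(z_i|x)$ are taken as given, the same
samples are reused for all $IWAE_k$ terms, and the function and argument names
are hypothetical.</p>

```python
import numpy as np

def sumo_estimate(log_w, tail_prob):
    """SUMO(x) computed from K + 1 log importance weights.

    log_w:     array of length K + 1 with log w_i
    tail_prob: function returning P(K >= k) for an array of k values,
               matching the distribution K was sampled from
    """
    log_w = np.asarray(log_w, dtype=float)
    k = np.arange(1, len(log_w) + 1)
    # IWAE_j = log( (1 / j) * sum_{i <= j} w_i ) for j = 1, ..., K + 1
    iwae = np.logaddexp.accumulate(log_w) - np.log(k)
    delta = iwae[1:] - iwae[:-1]                 # Delta_1, ..., Delta_K
    return iwae[0] + np.sum(delta / tail_prob(k[:-1]))
```

<p>Two sanity checks follow directly from the definition: if all weights are
equal the differences vanish and SUMO returns that common log weight, and if
$P(\mathcal{K} \geq k) = 1$ for all evaluated terms the sum telescopes to the
last IWAE bound.</p>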
<p>Applications:</p>
<ul>
<li>Minimizing $\log p_\theta (x)$, which occurs in reverse-KL objectives</li>
<li>As an unbiased score function, for example in HMC and REINFORCE gradient
estimation</li>
</ul>
<p>Results: Better test NLL, more stable in entropy maximization</p>
<h2 id="pooling-and-set-functions">Pooling and Set Functions</h2>
<h3 id="fspool-learning-set-representations-with-featurewise-sort-pooling"><a href="https://iclr.cc/virtual/poster_HJgBA2VYwH.html">FSPool: Learning Set Representations with Featurewise Sort Pooling</a></h3>
<p>Discuss architectures for computing deep set representations.</p>
<p>Issue of <em>Jump discontinuity</em>: When we are rotating the input set elements
and compute the representation of a set and decode it into a set again,
there comes a point where the decoded set element jumps back by one
position, such that it is again in the same position (the visualization in
their talk is quite good and easier to understand than my explanation here).</p>
<p>With very many points the network would simply give up and just predict
a constant output although the input is being rotated.</p>
<p>Suggest FSPool:</p>
<ol>
<li>Sort inputs (which is ok as it is a set)</li>
<li>Multiply with a set of learned weights. As these are not always the same
size, the weights are interpreted as a piecewise linear function between
0 and 1, and the values used for the dot product are evaluated on an
evenly spaced grid between 0 and 1 such that the correct number of
weights for any size of input can be obtained.</li>
<li>This is done for each feature individually. (Which seems to result in
loss of information regarding the joint distribution?)</li>
</ol>
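<p>The three steps above can be sketched in NumPy as follows. The sort direction
(descending), the number of control points of the piecewise linear weight
function and all names are assumptions made for this illustration.</p>

```python
import numpy as np

def fspool(x, w):
    """Featurewise sort pooling for a single set.

    x: (n, d) array, a set of n elements with d features each
    w: (d, m) array, control points of one piecewise linear weight
       function on [0, 1] per feature
    Returns a (d,) pooled representation.
    """
    n, d = x.shape
    m = w.shape[1]
    x_sorted = np.sort(x, axis=0)[::-1]          # sort every feature on its own
    pos = np.linspace(0.0, 1.0, n)               # evenly spaced grid for n elements
    knots = np.linspace(0.0, 1.0, m)             # locations of the control points
    weights = np.stack([np.interp(pos, knots, w[j]) for j in range(d)], axis=1)
    return (x_sorted * weights).sum(axis=0)      # dot product per feature
```

<p>Since the sort removes the input order, the result is invariant to
permutations of the set, and the interpolation lets the same weights handle
sets of any size.</p>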
<p>This helps to mitigate the issue with jump discontinuity for set
autoencoders, as the learnt permutation can simply be inverted in the decoder
such that no jump needs to occur and the output always corresponds to
the matching input. This then also removes the necessity of matching input
and output elements.</p>
<h3 id="on-universal-equivariant-set-networks"><a href="https://iclr.cc/virtual/poster_HkxTwkrKDB.html">On Universal Equivariant Set Networks</a></h3>
<p>Regarding approximation power of deep equivariant networks.</p>
<p>DeepSets:</p>
<script type="math/tex; mode=display">X \mapsto XA + \mathbf{1}\mathbf{1}^\top XB + \mathbf{1} c^\top</script>
<p>Authors call $X \mapsto \mathbf{1}\mathbf{1}^\top XB$ a linear transmitting
layer. Further they note that setting $B=0$ for all layers results in
a model which simply applies an MLP on each row of X, and refer to it as
PointNet.</p>
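<p>A direct NumPy transcription of this layer (with hypothetical names and
without a nonlinearity) shows both the permutation equivariance and the
reduction to a row-wise map when $B=0$:</p>

```python
import numpy as np

def deepsets_layer(X, A, B, c):
    """Linear equivariant layer X -> X A + 1 1^T X B + 1 c^T.

    X: (n, d_in) set of n elements; A, B: (d_in, d_out); c: (d_out,)
    """
    n = X.shape[0]
    ones = np.ones((n, 1))
    transmitted = ones @ (ones.T @ X @ B)    # same pooled vector for every row
    return X @ A + transmitted + ones @ c[None, :]
```

<p>Permuting the rows of <code>X</code> permutes the rows of the output in the same
way, because the transmitted term is identical for all rows.</p>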
<p>Authors derive the requirements for universal approximation of equivariant
functions on the unit cube. PointNet does not fulfil them, as it cannot approximate the simple function $x \mapsto (\mathbf{1}^\top x) \mathbf{1}$.</p>
<p>Main theorem: PointNet is not equivariant universal, but PointNet with
single linear transmitting layer is. In particular the DeepSets model is
equivariant universal.</p>
<p>Proof:</p>
<ol>
<li>Stone-Weierstrass, any continuous equivariant function can be
approximated by equivariant polynomial on the unit cube</li>
<li>Construct model with linear transmitting layer that approximates any
permutation equivariant polynomial</li>
</ol>
<p>The suggested model consists of two PointNets, and a single linear
transmitting layer.</p>
<h3 id="structpool-structured-graph-pooling-via-conditional-random-fields"><a href="https://iclr.cc/virtual/poster_BJxg_hVtwH.html">StructPool: Structured Graph Pooling via Conditional Random Fields</a></h3>
<p>Why is graph pooling challenging? There is no locality information in a graph
as the number of neighboring nodes is not fixed. (I am slightly confused by this
statement, are neighbors not the definition of locality?)</p>
<p>Until now there are two pooling approaches:</p>
<ul>
<li>Selection of important nodes via node sampling, which could lose node
information if a node is not selected.</li>
<li>Graph pooling via clustering: nodes are clustered together and each
cluster then represents a new node in the next iteration</li>
</ul>
<p>Other work DiffPool: Suggests to use a GCN to predict an assignment matrix
which defines which nodes are merged. This only uses the node features and
does not incorporate structural information.</p>
<p>The authors suggest StructPool, where the pooling depends on the node
features and the high-order structural relationships in the graph. They
formulate the assignment problem as a conditional random field where the
goal is to minimize the Gibbs energy. Basically, in addition to the unary
energy of the conditional random field, they add a pairwise
energy term (derived from an attention mechanism) for pairs of
nodes which are within $l$-hop distance of each other.
The two energies are combined and then used to compute the
assignment matrix using the softmax operation.</p>
<p>The proposed method shows improvement over other pooling techniques on D&D,
COLLAB, Proteins, IMDB-B and IMDB-M, whereas it has slightly lower performance
on Enzymes.</p>
<h2 id="representation-learning">Representation Learning</h2>
<h3 id="disentanglement-by-nonlinear-ica-with-general-incompressible-flow-networks-gin"><a href="https://iclr.cc/virtual/poster_rygeHgSFDH.html">Disentanglement by Nonlinear ICA with General Incompressible-flow Networks (GIN)</a></h3>
<p>Non-linear ICA theory: Can recover non-linear projections of conditionally
independent distributions (in latent space) in the data space. This requires
the conditionally independent distributions to belong to the exponential
family. Given some additional requirements, the theory implies that the
sufficient statistics of the true generating latent space are a linear
transformation of the recovered sufficient statistics in the latent space.</p>
<p>For disentanglement exactly one variable of the reconstructed latent space
should be associated with one in the true latent space. This is equivalent
to requiring sparsity in the transformation matrix.</p>
<p>Authors show that this is given for Gaussian latent spaces and claim that
additional latent dimensions in the recovered latent space will solely
encode noise. This gives rise to a simultaneous disentanglement and
dimensionality discovery mechanism.</p>
<p>They suggest a method based on volume preserving flows (which are thus
incompressible). This is implemented based on RealNVP, where the
(pre-exponentiated) scale of the last component is set to the negative of
the sum of all previous components, enforcing the same volume. The authors
argue that this constraint makes the standard deviation in the latent space
remain meaningful, as variability can only be shifted between dimensions but
not increased.</p>
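<p>The volume constraint can be illustrated with a single affine coupling step.
In this sketch the scale and shift "networks" are replaced by simple linear
maps with made-up weight matrices <code>W_s</code> and <code>W_t</code>, which is
purely an assumption to keep the example self-contained.</p>

```python
import numpy as np

def volume_preserving_log_scales(raw):
    """Append a last log-scale equal to minus the sum of the others."""
    last = -raw.sum(axis=-1, keepdims=True)
    return np.concatenate([raw, last], axis=-1)

def coupling_forward(x1, x2, W_s, W_t):
    """Affine coupling step: x1 passes through, x2 is scaled and shifted."""
    s = volume_preserving_log_scales(np.tanh(x1 @ W_s))   # log-scales, sum to 0
    t = x1 @ W_t
    log_det = s.sum(axis=-1)                               # log |det Jacobian|
    return x1, x2 * np.exp(s) + t, log_det

def coupling_inverse(y1, y2, W_s, W_t):
    """Exact inverse of coupling_forward."""
    s = volume_preserving_log_scales(np.tanh(y1 @ W_s))
    t = y1 @ W_t
    return y1, (y2 - t) * np.exp(-s)
```

<p>With <code>x1</code> of shape (batch, d1) and <code>x2</code> of shape (batch, d2),
<code>W_s</code> has shape (d1, d2 - 1) and <code>W_t</code> has shape (d1, d2). The
log-determinant is zero for every input, so the layer can move variance between
dimensions but cannot change the overall volume.</p>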
<p>They show that the spectrum of standard deviations shows multiple regimes
corresponding to global, local and noise on EMNIST to support this claim.</p>
<p>Experiments on artificial data show that the proposed approach recovers the
true latent space if distributions sufficiently overlap. Further, they find
very convincing latent dimensions for EMNIST (more realistic than anything
I have seen to date).</p>
<h3 id="infograph-unsupervised-and-semi-supervised-graph-level-representation-learning-via-mutual-information-maximization"><a href="https://iclr.cc/virtual/poster_r1lfF2NYvH.html">InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization</a></h3>
<p>Prior work: Graph kernels and Graph2vec require manual construction of
important features. Aim of work: Automated discovery of these features.</p>
<p>InfoGraph: Model which tries to maximize the MI between patch
representations (subgraphs) and global representations (whole graph). This
should create a global representation of the graph which preserves aspects
of all patches and all scales. The approach is competitive in classification.</p>
<p>Extension to semi-supervised scenarios:
Use student-teacher like architecture: Student model learns in a supervised
manner, whereas the teacher learns on all unlabeled data using the
previously devised InfoGraph approach. In order for the student model to
learn from the teacher model, they propose to maximize the mutual
information of intermediate layers of the GNN and the final representation.
Thus the student is slightly biased to exploit similar structures as the
unsupervised teacher. This leads to better performance than simply combining
supervised and unsupervised loss for a single model.</p>
<h3 id="mutual-information-gradient-estimation-for-representation-learning"><a href="https://iclr.cc/virtual/poster_ByxaUgrFvH.html">Mutual Information Gradient Estimation for Representation Learning</a></h3>
<p>Problem: High bias or high variance in existing approaches for MI.</p>
<p>Hypothesis of work is that while the approximated loss landscape could
potentially be very noisy, deriving only the gradient without prior
computation of the loss could lead to much lower noise and variance.</p>
<p>Derive gradient of MI and use reparameterization trick in order to make the
computation tractable and obtain $\nabla_z \log q(E_{\phi}(x))$ and
$\nabla_{x,z} \log q(x, E_{\phi}(x))$ via score estimation.</p>
<p>Use Spectral Stein Gradient estimator for score estimation of implicit
distributions. Reduce the complexity of the estimation by applying a <em>random
projection</em> into a lower dimensional space. This reduces the computational
complexity of computing the RBF kernel of the Spectral Stein Gradient
Estimator.</p>
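<p>The cost argument can be sketched as follows: projecting the samples into
a lower dimensional space before building the RBF kernel makes every pairwise
distance cheaper to compute. This is only an illustration of the random
projection step, not the Spectral Stein Gradient Estimator itself; the
function names, the Gaussian projection and the bandwidth are assumptions.</p>

```python
import math
import random


def random_projection(points, out_dim, seed=0):
    """Project points with a random Gaussian matrix (scaled by 1/sqrt(out_dim))."""
    rng = random.Random(seed)
    in_dim = len(points[0])
    proj = [[rng.gauss(0.0, 1.0) / math.sqrt(out_dim) for _ in range(in_dim)]
            for _ in range(out_dim)]
    return [[sum(p * x for p, x in zip(row, point)) for row in proj]
            for point in points]


def rbf_kernel(points, bandwidth=1.0):
    """Dense RBF kernel matrix k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2))."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [[math.exp(-sqdist(a, b) / (2.0 * bandwidth ** 2)) for b in points]
            for a in points]


# Five hypothetical 64-dimensional samples, projected down to 8 dimensions.
rng = random.Random(1)
samples = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(5)]
projected = random_projection(samples, out_dim=8)
kernel = rbf_kernel(projected, bandwidth=2.0)
```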
<p>The devised approach outperforms other mutual information maximization
techniques.</p>
<h3 id="on-mutual-information-maximization-for-representation-learning"><a href="https://iclr.cc/virtual/poster_rkxoh24FPH.html">On Mutual Information Maximization for Representation Learning</a></h3>
<p>Many representation learning approaches are based on the InfoMax principle,
where a good representation should maximize the mutual information between
the data and the learnt representation (Linsker 1988).</p>
<p>Recently, new lower bounds on the MI and modern CNN architectures have
revived the approach. However:</p>
<ol>
<li>MI is hard to estimate</li>
<li>Invariant under bijections</li>
<li>Does not yield good clustering representations</li>
</ol>
<p><em>So why do these approaches work so well?</em></p>
<p>Modern approaches do not maximize MI between data and representation, but
between different views of the same input (higher level aggregation and
lower level aggregation), which is a lower bound on the original InfoMax
objective. Thus if these views encode low level information, such as pixel
noise, they would not yield high mutual information, whereas high level
features such as “catness” would yield high mutual information on different
crops of a cat image.</p>
<p>Experiments: Maximize MI between bottom and top half of image and evaluate
performance using linear classifier.</p>
<ol>
<li>Usage of bijective encoders: <em>These preserve mutual information
completely!</em> Thus <em>the true MI between the segments actually remains
the same during training.</em> Nevertheless, the lower bounds of the
estimators increase slightly during training and the classification
accuracy of the derived representations increases strongly. Thus the
<em>estimator favors “good representations”</em> even though any solution
maximizes the MI!</li>
<li>Encoders which can be invertible or non-invertible: MLPs with skip
connections initialized to the identity mapping. Here the estimators
favor hard to invert mappings even though the initialization (identity)
maximizes the mutual information. The estimators thus bias towards good
representations for classification, but tend towards hard to invert
mappings which reduce the true mutual information!</li>
<li>Impact of encoder architecture: <em>Different architectures with the same MI
estimator values lead to very different performance in terms of
classification.</em> Thus the value of the estimator is insufficient to
explain performance. Inductive bias of architecture responsible for good
performance?</li>
</ol>
<h2 id="seq2seq-models">Seq2Seq models</h2>
<h3 id="are-transformers-universal-approximators-of-sequence-to-sequence-functions"><a href="https://iclr.cc/virtual/poster_ByxRM0Ntvr.html">Are Transformers universal approximators of sequence-to-sequence functions?</a></h3>
<p>TLDR: Yes.</p>
<p>One might expect the answer to be no, as there are quite some structures
which could potentially limit expressive power: All tokens experience the
same transformation. Only pairwise interactions between tokens are possible.</p>
<p>The paper shows that there always exists a transformer network with small width
and unlimited depth that can approximate any permutation equivariant function
arbitrarily accurately.</p>
<p>What about positional encodings? These remove the restriction of permutation
equivariance and allow a transformer to approximate any continuous seq-2-seq
function.</p>
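<p>The role of positional encodings can be illustrated with a toy example:
plain self-attention is permutation equivariant, and adding position
information breaks that. The sketch below assumes a single head with identity
query, key and value projections, and a made-up positional encoding that adds
the position index to the first feature; neither is taken from the paper.</p>

```python
import math


def self_attention(tokens):
    """Single-head dot-product self-attention with identity projections."""
    out = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) for key in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) / total
                    for d in range(len(query))])
    return out


def add_positions(tokens):
    """Toy positional encoding: add the position index to the first feature."""
    return [[t[0] + i] + list(t[1:]) for i, t in enumerate(tokens)]


def max_abs_diff(a, b):
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
permuted = [tokens[i] for i in perm]

# Without positions: permuting the input permutes the output in the same way.
plain = self_attention(tokens)
diff_plain = max_abs_diff(self_attention(permuted), [plain[i] for i in perm])

# With positions: the output is no longer a permutation of the original output.
with_pos = self_attention(add_positions(tokens))
diff_pos = max_abs_diff(self_attention(add_positions(permuted)),
                        [with_pos[i] for i in perm])
```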
<p>Further, the authors show that hybrid architectures, which for example
include convolutional layers in between attention layers, can actually
improve performance.</p>
<h3 id="mogrifier-lstm"><a href="https://iclr.cc/virtual/poster_SJe5P6EYvS.html">Mogrifier LSTM</a></h3>
<p>Core modification to LSTM: Gate the hidden state using input, and gate the
input using the hidden state in an alternating fashion. The input and hidden
state are then fed into the LSTM which then gives rise to a new hidden
state. This leads to better performance than the baseline LSTM model and
pushes the new LSTM closer to transformer networks in terms of performance.</p>
<p>Why does it work though? Potential explanations:</p>
<ul>
<li>Contextualized embeddings: The procedure could lead to an embedding which
accounts for the actual context of the word and not only its mean context
such as in word embeddings. Experiments indicate that this is not
sufficient for explaining the performance on character level tasks and
synthetic datasets.</li>
<li>Multiplicative interactions: Not really clear how this would improve
performance.</li>
<li>Many more</li>
</ul>
<p>None of them really explains the Mogrifier LSTM. The authors performed
several experiments and none of them was really conclusive. There are solely
indications. For example the Mogrifier performs better on a copy task if the
sequences are long. Further, this performance gap becomes larger as the
vocabulary size of the input sequences increases.</p>
<h3 id="reformer-the-efficient-transformer"><a href="https://iclr.cc/virtual/poster_rkgNKkHtvB.html">Reformer: The Efficient Transformer</a></h3>
<p>Combine two techniques to make Transformers more memory efficient and
scalable with respect to training time:</p>
<ul>
<li>RevNets (invertible version of ResNet): As the computation can be
inverted, it is not necessary to store all downstream activations for
a back-propagation pass. They can be dynamically recomputed when needed.
The reversibility of the connections is enabled by only applying the layer
to half of the input and adding its output to the other half. This is
a bit similar to the strategy of RealNVP. Interestingly, this does not
reduce the performance.</li>
<li>Chunking of computations through the FeedForward NNs: Prevents having to
store all intermediate activations at once.</li>
<li>While one could in theory also chunk the computation of the attention
mechanism, the quadratic scaling of attention will lead to severe issues
with respect to speed</li>
</ul>
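<p>The reversible residual connection from the first bullet can be sketched
as below. The input is split into two halves, each half is updated using
a function of the other, and the inputs can be recovered exactly from the
outputs. The functions <code>f</code> and <code>g</code> stand in for the
attention and feed-forward sublayers and are placeholders chosen for
illustration.</p>

```python
import math


def reversible_forward(x1, x2, f, g):
    """y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def reversible_inverse(y1, y2, f, g):
    """Recompute the inputs from the outputs instead of storing them."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


def f(v):
    return [math.tanh(a) for a in v]  # placeholder for the attention sublayer


def g(v):
    return [a * a for a in v]  # placeholder for the feed-forward sublayer


x1, x2 = [0.5, -1.0], [2.0, 0.25]
y1, y2 = reversible_forward(x1, x2, f, g)
r1, r2 = reversible_inverse(y1, y2, f, g)
```

<p>Note that <code>f</code> and <code>g</code> themselves do not need to be
invertible, which is why arbitrary sublayers can be used.</p>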
<p>Tackling the attention computation and making it scalable:
Main issue: while we compute all values in the preattention matrix, the
softmax converts these into a matrix of the same size, where many values are
very close to zero. <em>The attention matrix is sparse.</em> How can we use the
sparsity? <strong>Use variant of locality sensitive hashing:</strong> Allows sorting
vectors with a high dot product into buckets. Thus one can simply compute
the attention within each bucket and already cover most of the attention
weight.
Use shared QK attention (where the query and key are the same, which
apparently seems to be as powerful as regular attention). Then bucket QK via
LSH and sort according to the bucket. In order to exploit parallelism, chunk
the sorted array into fixed sizes and allow attention within the chunk and
to the previous chunk if the bucket ids match, in order to cover the case
where chunking splits a bucket. In order to avoid problems with the
probabilistic nature of LSH, the process is repeated with multiple hash
functions.</p>
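<p>The bucketing step can be sketched with an angular LSH based on random
projections, where the bucket is the index of the largest entry among the
projections and their negations. I believe this is close to the variant used
in the paper, but treat the details, the function names and the example
vectors as assumptions. The chunking and the multiple hash rounds are left
out.</p>

```python
import random


def make_hash(dim, n_buckets, seed=0):
    """Return a function mapping a vector to one of `n_buckets` angular buckets."""
    rng = random.Random(seed)
    rotations = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                 for _ in range(n_buckets // 2)]

    def bucket(vec):
        proj = [sum(r * v for r, v in zip(row, vec)) for row in rotations]
        scores = proj + [-p for p in proj]
        return max(range(len(scores)), key=scores.__getitem__)

    return bucket


def group_by_bucket(vectors, bucket_fn):
    """Map bucket id to the indices of the vectors hashed into it."""
    groups = {}
    for idx, vec in enumerate(vectors):
        groups.setdefault(bucket_fn(vec), []).append(idx)
    return groups


bucket_fn = make_hash(dim=3, n_buckets=8, seed=0)
v0 = [1.0, 0.2, 0.0]
v1 = [2.0 * a for a in v0]   # same direction as v0
v2 = [-a for a in v0]        # opposite direction
groups = group_by_bucket([v0, v1, v2], bucket_fn)
```

<p>Attention would then only be computed between indices that share a bucket.
Vectors pointing in the same direction always collide, while nearby but not
identical directions only collide with high probability, which is why several
hash functions are combined.</p>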
<p>The results indicate that with more hash functions, the model converges to
the performance of a full attention model.</p>
<h2 id="understanding-deep-learning">Understanding Deep Learning</h2>
<h3 id="four-things-everyone-should-know-to-improve-batch-normalization"><a href="https://iclr.cc/virtual/poster_HJx8HANFDH.html">Four Things Everyone Should Know to Improve Batch Normalization</a></h3>
<p>Things that are wrong with BatchNorm:</p>
<ol>
<li>
<p>Inference Example Weighing:
During training, the influence of the current instance on the batch
statistics is still $\frac{1}{B}$, whereas during testing the instance
does not contribute at all as only the moving averages are used for
computing the normalization. The authors suggest to reparametrize the
mean and std-deviation of batch norm during test time to reintroduce the
dependency on the instance:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mu_i &= \alpha E[x_i] + (1-\alpha) m_x\\\\
\sigma_i^2 &= (\alpha E[x_i^2] + (1-\alpha)m_{x^2}) - \mu_i^2
\end{aligned} %]]></script>
<p>where $m_x$ and $m_{x^2}$ are the moving averages of the first and second
moments and $\alpha$ is a hyperparameter which can be tuned after training on
the validation data. This can also be done after a model has already been
trained with regular batch norm and was shown to improve performance at
test time.</p>
</li>
<li>Ghost Batch Norm:
Originally developed for multi GPU and large batch training scenarios,
Ghost Batch Norm normalizes across subsets of each batch instead of the
complete batch. The authors show that this increases performance even for
single GPU medium batch size training (by inducing additional noise
during training?)</li>
<li>Batch Normalization and Weight decay: Applying weight decay on the shift
and scale parameters of batch norm is unstudied. Authors show a slight
improvement of performance. Yet, for this to be the case it is necessary
that the path from BN to output does not pass through additional BN
layers as it would amplify the effect with increasing depth.</li>
<li>Generalizing Batch and Group Norm:
For the small batch regime, when the application of vanilla batch norm is
not possible, the authors suggest normalizing over both channel groups and
examples.</li>
</ol>
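<p>The first point (inference example weighing) can be sketched for a single
channel as follows. The sketch assumes that the variance term uses a moving
average of the squared activations; the function and argument names are made
up for illustration.</p>

```python
def example_weighed_stats(x, moving_mean, moving_sq_mean, alpha):
    """Mix the statistics of one test example with the moving averages.

    x: activations of a single example for one channel.
    moving_mean / moving_sq_mean: moving averages of x and x**2 from training.
    """
    mean_x = sum(x) / len(x)
    mean_x2 = sum(v * v for v in x) / len(x)
    mu = alpha * mean_x + (1.0 - alpha) * moving_mean
    var = alpha * mean_x2 + (1.0 - alpha) * moving_sq_mean - mu ** 2
    return mu, var


activations = [1.0, 2.0, 3.0]
# alpha = 0 recovers standard batch norm inference (moving averages only).
mu_standard, var_standard = example_weighed_stats(activations, 0.5, 1.25, alpha=0.0)
# alpha = 1 uses only the statistics of the example itself.
mu_instance, var_instance = example_weighed_stats(activations, 0.5, 1.25, alpha=1.0)
```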
<p>The authors show that combining all the above proposed techniques can lead
to an improvement of up to 6% in some scenarios. In general, I think the
first approach is the most relevant and interesting.</p>
<h3 id="on-the-variance-of-the-adaptive-learning-rate-and-beyond"><a href="https://iclr.cc/virtual/poster_rkgz2aEKDr.html">On the Variance of the Adaptive Learning Rate and Beyond</a></h3>
<p>Warmup: Linear increase of the learning rate, seems to be critical for some
learning tasks (such as in the case of Transformer models).</p>
<p>Experiments showed that <em>without warmup, the gradient distribution gets
distorted towards small values</em>. This happens in the very beginning of
training within the first 10 updates. With warmup this distortion doesn’t
occur and the gradient distribution remains largely the same compared to
the very beginning of training.</p>
<p>Why does warmup improve the convergence and mitigate this effect? Authors
suggest that the adaptive learning rate is of very high variance in the
beginning of training, due to the lack of samples used for computing the
moving averages.</p>
<p>Set up two control experiments to verify if this is actually the cause of
problems:</p>
<ol>
<li>Adam-2k, which provides Adam with additional 2k samples for estimating
the variance of the gradient (without any updates to the weights!). This
leads to a learning curve which is extremely similar to using warmup.</li>
<li>Adam-eps: Increase the value of epsilon in the Adam implementation. This
term is usually added to the square root of the variance estimate. By
increasing eps the influence of the estimated variance becomes lower.
This leads to better convergence than without warmup, but still shows
some difficulties during training.</li>
</ol>
<p>The authors suggest a rectification term to mitigate the issue of high
variance in the adaptive learning rate, which basically deactivates adaptive
learning rates when the variance estimate would diverge.</p>
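<p>As far as I recall the paper, the rectification term only depends on the
step count and on the second moment decay rate of Adam. A sketch of how it
could be computed is shown below; the function name and the choice of
returning <code>None</code> for the non-adaptive case are my own.</p>

```python
import math


def rectification(step, beta2=0.999):
    """Return the rectification factor, or None if the adaptive step is switched off."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t > 4.0:
        return math.sqrt((rho_t - 4.0) * (rho_t - 2.0) * rho_inf
                         / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))
    # Variance of the adaptive learning rate is not tractable yet:
    # use a plain momentum update for this step instead.
    return None


first = rectification(1)
early = rectification(6)
late = rectification(10 ** 6)
```

<p>In the first few steps the adaptive learning rate is deactivated, afterwards
the factor grows from a small value towards one, which acts similarly to
a warmup schedule.</p>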
<p><em>Experiments:</em> Astonishingly, the results indicate that this corrected Adam
implementation, RAdam, is significantly more robust to the selection of the
learning rate. Further, in contrast to warmup, it is not required to tune
any parameter to reach optimal performance (whereas the length of warmup
needs tuning: depending on the initial learning rate, longer or shorter
warmup could be required). Cool.</p>
<h3 id="the-implicit-bias-of-depth-how-incremental-learning-drives-generalization"><a href="https://iclr.cc/virtual/poster_H1lj0nNFwB.html">The Implicit Bias of Depth: How Incremental Learning Drives Generalization</a></h3>
<p>Set up linear model $f_{\sigma}(x) = \langle \sigma, x \rangle$ and
reparameterize it using auxiliary variables: $\sigma = w_1 \cdot w_2 \cdot w_3 \dots$ and
train via gradient descent.</p>
<p>Analyse the gradient flow and show that the learning dynamics are different
between using only $\sigma$ and the formulation using auxiliary variables.
In the shallow model, all values are learned at the same time, whereas with
increasing depth the values are learnt incrementally. The authors conjecture
that this is the cause of sparsity in these types of models. Parameters that
most decrease the loss are learned first; if there is a solution which
has few non-zero values, this approach will most likely find it.</p>
<p>The authors formalize the notion of incremental learning, and derive
conditions where it would occur. These conditions become much less strict as
the model becomes more deep. Thus, <em>deeper models allow for incremental
learning to occur more easily</em> and deeper models are more biased towards
obtaining sparse solutions.</p>
<p>Empirical results show that incremental learning occurs also under relaxed
assumptions.</p>
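<p>A toy simulation can make the incremental behaviour visible. The sketch
assumes a single coordinate $\sigma = w^N$ where all $N$ factors are
initialised identically (so they stay equal during training), a squared loss
and hand-picked targets, learning rate and initial value; none of these
settings come from the paper.</p>

```python
def steps_to_half(target, depth, w0=0.1, lr=0.01, max_steps=100000):
    """Gradient descent on 0.5 * (w**depth - target)**2 for each of the equal factors.

    Returns the first step at which sigma = w**depth reaches half of the target.
    """
    w = w0
    for step in range(max_steps):
        if w ** depth >= 0.5 * target:
            return step
        # Gradient with respect to one factor when all factors are equal to w.
        grad = w ** (depth - 1) * (w ** depth - target)
        w -= lr * grad
    return max_steps


# Two coordinates with a large and a small target value.
shallow_large = steps_to_half(2.0, depth=1)
shallow_small = steps_to_half(0.5, depth=1)
deep_large = steps_to_half(2.0, depth=3)
deep_small = steps_to_half(0.5, depth=3)
```

<p>In the shallow case both coordinates reach half of their target after
a comparable number of steps, whereas in the deep case the coordinate with the
large target is learnt long before the small one starts to move.</p>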
<h3 id="towards-neural-networks-that-provably-know-when-they-dont-know"><a href="https://iclr.cc/virtual/poster_ByxGkySKwH.html">Towards neural networks that provably know when they don’t know</a></h3>
<p>Probabilistic model: Decompose $p(y|x)$ into an in-distribution and an out-of-distribution part.</p>
<script type="math/tex; mode=display">p(y|x) = \frac{p(y|x,i) p(x|i) p(i) + p(y|x,o) p(x|o) p(o)}
{p(x|i) p(i) + p(x|o) p(o)}</script>
<p>Gives rise to generative models of in distribution data $p(x|i)$, out of
distribution data $p(x|o)$ and the probability of the label given that the
data is out of distribution $p(y|x,o) = 1/M$ (uniform over the $M$ classes).
Requires out-of-distribution data!
By using Gaussian Mixture Models as generative models, the authors can prove
that the model would not be confident in areas which differ significantly
from the training data. Further they show that with their approach they can
guarantee that entire volumes would be assigned low confidence.</p>
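<p>The decomposition can be evaluated directly once the classifier output and
the two densities are known. The sketch below just implements the formula
above with a uniform out-of-distribution label distribution; the function
name, the argument names and the prior $p(i) = 0.5$ are assumptions.</p>

```python
def predictive(p_y_in, px_in, px_out, p_in=0.5):
    """Combine in- and out-of-distribution parts into p(y|x).

    p_y_in: list with p(y|x,i) for every class.
    px_in, px_out: densities p(x|i) and p(x|o) evaluated at x.
    """
    n_classes = len(p_y_in)
    p_out = 1.0 - p_in
    denom = px_in * p_in + px_out * p_out
    return [(p * px_in * p_in + px_out * p_out / n_classes) / denom
            for p in p_y_in]


classifier_output = [0.9, 0.1]
near_training_data = predictive(classifier_output, px_in=1.0, px_out=0.0)
far_from_training_data = predictive(classifier_output, px_in=0.0, px_out=1.0)
in_between = predictive(classifier_output, px_in=0.2, px_out=0.2)
```

<p>Far away from the training data the in-distribution density vanishes and the
prediction becomes uniform, regardless of how confident the classifier is.</p>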
<p>Experimental results indicate state of the art out of distribution detection
without reducing classification performance.</p>
<h3 id="truth-or-backpropaganda-an-empirical-investigation-of-deep-learning-theory"><a href="https://iclr.cc/virtual/poster_HyxyIgHFvr.html">Truth or backpropaganda? An empirical investigation of deep learning theory</a></h3>
<p><em>Suboptimal local minima DO exist in the loss landscape:</em>
Constructed proof based on high bias neurons which force ReLU units to
function as the identity map. These neurons can then kill other ReLU
neurons, constructing a smaller NN embedded inside the NN. Experiments show
that when initializing with high bias or high variance of bias the test
networks converge to suboptimal local minima.
<em>Suboptimal local minima do exist, but are avoided by careful initialization.</em></p>
<p><em>Low l2 norm parameters are not better:</em>
A low l2 norm is motivated from many directions (SVMs, generalization
theory) and induced the development of weight decay.</p>
<p>Empirical test: Use weight decay with norm bias, such that it increases the
norm relative to weight decay. This improved performance across
architectures and datasets and even improved performance without batch
normalization.</p>
<p><em>Neural Tangent Kernel patterns:</em>
While the tangent kernel was confirmed to become constant in convnets, the
tangent kernel does not become constant in more involved architectures such
as resnet and others with skip connections.</p>
<p><em>Low rank layers:</em>
Experiments show that theoretical results do not hold in real world.
Maximizing the rank outperforms rank minimization. Rank minimization also
produces less robust networks.</p>
<h3 id="what-graph-neural-networks-cannot-learn-depth-vs-width"><a href="https://iclr.cc/virtual/poster_B1l2bp4YwS.html">What graph neural networks cannot learn: depth vs width</a></h3>
<p>Previous results indicate that the GNN message passing model is equivalent
to a 1WL-test and thus not universal.</p>
<p>These results have been shown on anonymous graphs (where the nodes have no
features). The first result of paper indicates that GNNs with message
passing are universal if they have “powerful layers”, are sufficiently deep
and wide and <em>if the nodes are given unique features</em>. If these do not exist
in the graph, then they can in theory be randomly assigned, but this would
probably lead to issues with generalization. Experiments confirm this:</p>
<ul>
<li>On the detection of a 4-cycle task a GNN without node labels can only
reach very poor performance (provably).</li>
<li>Degree features improve performance but don’t lead to
optimal performance</li>
<li><em>Assigning unique features based on a canonical ordering leads to perfect
performance on train and test</em></li>
<li>Random features lead to optimal performance on train but not on test and
thus generalize very badly</li>
</ul>
<p>Second result focuses on what cannot be learnt by GNNs with message passing.
Suggests that the GNN size should depend on the number of nodes $n$:</p>
<ul>
<li>
<p>Cannot solve many decision, optimization, verification and estimation
problems unless:</p>
<script type="math/tex; mode=display">depth \times width = \Omega(n^d) \text{ for } d \in [0.5, 2]</script>
</li>
<li><em>Dependence on n even if task appears local $\Omega(n)$!</em></li>
<li>Hard problems (maximum independent set, minimum vertex cover, coloring)
require $\Omega(n^2)$</li>
</ul>
<p>All in all a very good study which can be helpful when picking
hyperparameters for GNNs.</p>
<h3 id="why-gradient-clipping-accelerates-training-a-theoretical-justification-for-adaptivity"><a href="https://iclr.cc/virtual/poster_BJgnXpVYwS.html">Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity</a></h3>
<p>While Vanilla SGD is theoretically optimal it is in practice worse than
extensions such as SGD with momentum and Adam. Where does this gap come
from?</p>
<p>Starting insight: The proof of SGD optimality relies on three assumptions:</p>
<ul>
<li>Differentiability</li>
<li>Bounded second moments</li>
<li>L-smoothness which is the focus of this talk</li>
</ul>
<p>L-smoothness is actually a very strict criterion. Empirical evaluation shows
that smoothness of a NN changes dramatically and that it correlates with the
gradient norm.</p>
<p>Authors suggest a relaxed smoothness criterion (l0, l1 smoothness), which
would account for the empirical observations better than the strict
criterion.</p>
<p>They show that with the relaxed smoothness criterion, there is a dependency of SGD
convergence on the maximal norm of the gradient. This means Vanilla SGD
would not converge if the gradient is not upper bounded.
Further, the paper shows that clipped gradient descent does not have
a dependence on the maximal norm of the gradient.</p>
<p>High level intuition: With clipping, SGD can traverse non-smooth areas,
whereas without clipping this would lead to divergences. This is especially
the case when training with a high learning rate.</p>
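<p>The intuition can be reproduced on a one-dimensional toy problem. The
sketch below uses $f(x) = x^4$, whose curvature grows with the gradient norm,
and compares plain gradient descent with norm-clipped gradient descent. The
objective, learning rate and clipping threshold are arbitrary choices for
illustration.</p>

```python
import math


def clipped_step(params, grad, lr, clip_norm):
    """Gradient step where the gradient is rescaled to have norm at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [p - lr * scale * g for p, g in zip(params, grad)]


def grad_quartic(params):
    """Gradient of f(x) = sum(x_i ** 4)."""
    return [4.0 * p ** 3 for p in params]


# Plain gradient descent overshoots further with every step.
x_plain = [10.0]
for _ in range(3):
    x_plain = [p - 0.1 * g for p, g in zip(x_plain, grad_quartic(x_plain))]

# Clipped gradient descent walks through the steep region with bounded steps.
x_clipped = [10.0]
for _ in range(200):
    x_clipped = clipped_step(x_clipped, grad_quartic(x_clipped), lr=0.1, clip_norm=1.0)
```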
<h2 id="other---normalizing-flows-robustness-meta-learning-probabilistic-modelling">Other - Normalizing Flows, Robustness, Meta-Learning, Probabilistic Modelling</h2>
<h3 id="invertible-models-and-normalizing-flows"><a href="https://iclr.cc/virtual/speaker_4.html">Invertible Models and Normalizing Flows</a></h3>
<p>Reference: https://arxiv.org/pdf/1908.09257</p>
<p>Generative networks paradigm:</p>
<ul>
<li>Variational Autoencoder (previously called Helmholtz machine)</li>
<li>Generative Adversarial Networks using Noise Contrastive Estimation (the discriminator)</li>
</ul>
<p>Two parts determine the likelihood of a normalizing flow:</p>
<ul>
<li>likelihood of transformed variable in latent space given prior (push
everything to zero if we assume N(0,1))</li>
<li>determinant of the jacobian penalizing strong compression (similar to an
entropy regularization term)</li>
</ul>
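<p>The two terms can be made concrete with the simplest possible flow, an
element-wise affine map with a standard normal prior. This is a sketch under
these assumptions, with made-up function names; real flows stack many more
expressive layers.</p>

```python
import math
import random


def standard_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def flow_log_likelihood(x, shift, log_scale):
    """log p(x) for the flow z = (x - shift) * exp(-log_scale)."""
    z = [(xi - m) * math.exp(-s) for xi, m, s in zip(x, shift, log_scale)]
    prior_term = sum(standard_normal_logpdf(zi) for zi in z)  # pulls z towards zero
    log_det = -sum(log_scale)  # log|det dz/dx|, penalizes strong compression
    return prior_term + log_det


def sample(shift, log_scale, rng):
    """Sample from the prior and pass the sample through the inverse flow."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(shift, log_scale)]


# One-dimensional example: the flow describes a normal with mean 1 and std 2.
log_p = flow_log_likelihood([3.0], shift=[1.0], log_scale=[math.log(2.0)])
draw = sample([1.0], [math.log(2.0)], random.Random(0))
```

<p>For this affine flow the result coincides with the log-density of a normal
distribution with mean 1 and standard deviation 2 evaluated at 3.</p>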
<p>Sampling:</p>
<ul>
<li>Sample from prior</li>
<li>pass through flow and you are done</li>
</ul>
<p>History of flows:</p>
<ul>
<li>First:
<ul>
<li>Require square weight matrices</li>
<li>strictly increasing activation functions</li>
<li>Should make whole network invertible
Issue: Computing the determinant of the jacobian is problematic!
Space: $\mathcal{O}(d^2)$, runtime from $\mathcal{O}(d!)$ to $\mathcal{O}(d^3)$</li>
</ul>
</li>
<li>Autoregressive models: Have triangular jacobian!</li>
<li>Using LU decomposition to enforce triangular jacobian: Works really badly.</li>
<li>Coupling layers (NICE): Use function on fraction of the inputs to scale
and shift the other fraction. Is invertible and has a very easy to
compute jacobian! Actually got rejected twice during the process!</li>
<li>Real NVP: Batch normalization, Convolutional NN, small extensions</li>
</ul>
<p>Relevance of log-likelihood:</p>
<ul>
<li>Sample quality vs. log-likelihood:
Log-likelihood and sample quality do not have to be aligned. But (until now
at least) there seems to be a very strong
correlation (possibly look at Theis & van den Oord et al. 2015).</li>
<li>Density as a measure of typicality (for anomaly detection):
Role in typicality is questionable. Until now it does not seem to be
aligned.</li>
</ul>
<p>Future directions:</p>
<ul>
<li>Learning flows on manifolds</li>
<li>Add in prior knowledge into the flow, such as symmetries</li>
<li>Discrete change of variables, requires tricks (such as continuous
relaxation) for backprop</li>
<li>Variational approximations, mapping discrete variables to continuous
distributions, using dequantization</li>
<li>Adaptive sparsity patterns</li>
<li>Non-invertible models (Dinh 2019)</li>
</ul>
<h3 id="learning-to-balance-bayesian-meta-learning-for-imbalanced-and-out-of-distribution-tasks"><a href="https://iclr.cc/virtual/poster_rkeZIJBYvr.html">Learning to Balance: Bayesian Meta-Learning for Imbalanced and Out-of-distribution Tasks</a></h3>
<p>Focuses on extending MAML to unbalanced test datasets and out of
distribution tasks. This is mostly implemented by adapting the inner
gradient update according to the following formula:</p>
<script type="math/tex; mode=display">\begin{aligned}
\theta_0 = \theta * z^\tau \\
\theta_k = \theta_{k-1} - \gamma^\tau \circ \alpha \circ \sum_{c=1}^C \omega^\tau_c \nabla_\theta \mathcal{L}_c
\end{aligned}</script>
<p>where $*$ is abused in notation to “multiply if component in $\theta$ is
a weight and add if component in $\theta$ is a bias” and $\circ$ represents
element wise multiplication.</p>
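<p>Written out for flat parameter lists, the two formulas could look like the
sketch below. It treats $\gamma^\tau$ and $\alpha$ as per-parameter lists and
uses a boolean flag to distinguish weights from biases; these representations
and all names are assumptions made for illustration.</p>

```python
def initial_params(theta, z, is_weight):
    """theta_0: multiply by z for weights, add z for biases."""
    return [t * zi if w else t + zi for t, zi, w in zip(theta, z, is_weight)]


def inner_step(theta, class_grads, omega, gamma, alpha):
    """One inner update with class-wise gradient weights omega.

    class_grads: one gradient list per class, omega: one weight per class,
    gamma and alpha: element-wise step size factors.
    """
    new_theta = []
    for j, t in enumerate(theta):
        weighted = sum(w * g[j] for w, g in zip(omega, class_grads))
        new_theta.append(t - gamma[j] * alpha[j] * weighted)
    return new_theta


theta = [1.0, 0.5]                       # one weight and one bias
theta_0 = initial_params(theta, z=[2.0, 0.1], is_weight=[True, False])
theta_1 = inner_step(theta_0,
                     class_grads=[[1.0, 0.0], [0.0, 1.0]],
                     omega=[0.5, 2.0],
                     gamma=[1.0, 1.0],
                     alpha=[0.1, 0.1])
```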
<p>The components have the following roles:</p>
<ul>
<li>$\omega^\tau_c$ scales the gradient of all instances with class c. Allows
to take larger gradient steps for classes with more examples <em>in order to
tackle class imbalance</em> (Is this equivalent to computing a weighted loss?)</li>
<li>$\gamma^\tau$ scales the size of gradient steps on a per task basis <em>in
order to tackle task imbalance</em>. Reasoning: Larger tasks should have
larger $\gamma$ because they can rely more on task-specific updates,
while smaller tasks should have small $\gamma$ because they should rely
more on meta-knowledge.</li>
<li>$z^\tau$ relocates the initial weights for each task. This allows tasks
which are dissimilar (out of distribution) to the original meta-learnt tasks
to be shifted further away in weight space, while tasks which are close to
the meta-learnt tasks remain closer. <em>Tackles out-of-distribution
tasks</em></li>
</ul>
<p>The parameters are inferred using amortized variational inference, where the
inference network is shared across all tasks and the parameters are computed
conditionally on summary statistics derived from the task instances (see
below).</p>
<p><img src="/assets/2020-05-19-ICLR-2020/learning-to-balance-figure.png" alt="Inference network" /></p>
<p>In general, I find it very interesting how the authors incorporate additional
terms into the inner gradient computation loop and how the inference of the
associated parameters is implemented using variational inference on the
summary statistics.</p>
<h3 id="meta-dropout-learning-to-perturb-latent-features-for-generalization"><a href="https://iclr.cc/virtual/poster_BJgd81SYwr.html">Meta Dropout: Learning to Perturb Latent Features for Generalization</a></h3>
<p>Suggest to additionally learn how to perturb intermediate representation
during meta learning using input dependent Dropout. This is done by
computing parameters of the noise distribution dependent on the output of
the previous layer. This results in the following preactivation:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
a^{(l)} &\sim \mathcal{N}(\mu^{(l)}(h^{(l-1)}), I) \\\\
z^{(l)} &= Softplus(a^{(l)}) \\\\
h^{(l)} &= ReLU(W^{(l)} h^{(l-1)} \cdot z^{(l)})
\end{aligned} %]]></script>
<p>The parameters of the function $ \mu $ and the weights of the layer $W$ are
trained using the meta learning objective of MAML. Yet importantly, the
weights of $\mu$ are not fitted in the inner training loop of MAML as this
would allow the algorithm to “optimize away” the perturbation. Interestingly
they use a lower bound to optimize the inner loop. This is not really
motivated, but the authors draw a connection to variational inference, where
q and p are from the same family and thus the KL would equal 0. The results
look very good actually.</p>
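<p>The forward pass of such a layer can be sketched as follows, assuming that
$\mu$ is a linear function of the previous layer's output. The function names
and the toy weights are made up; the meta learning loops are not shown.</p>

```python
import math
import random


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def softplus(a):
    return math.log1p(math.exp(a))


def noisy_layer(h_prev, weights, mu_weights, rng):
    """ReLU layer whose pre-activation is multiplied by input dependent noise."""
    pre = matvec(weights, h_prev)
    mu = matvec(mu_weights, h_prev)            # noise mean depends on the input
    a = [rng.gauss(m, 1.0) for m in mu]        # a ~ N(mu(h_prev), I)
    z = [softplus(ai) for ai in a]             # strictly positive multiplicative noise
    return [max(0.0, p * zi) for p, zi in zip(pre, z)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
out = noisy_layer([1.0, -2.0], identity, zeros, random.Random(0))
```

<p>Since the noise is strictly positive, it rescales active units but never
changes which units are switched off by the ReLU.</p>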
<h3 id="on-robustness-of-neural-ordinary-differential-equations"><a href="https://iclr.cc/virtual/poster_B1e9Y2NYvS.html">On Robustness of Neural Ordinary Differential Equations</a></h3>
<p>Empirical study of Neural ODEs compared to ResNets. Show that they are more
robust to random noise and to adversarial attacks. Further suggest
Time-invariant steady state ODE (TisODE), to additionally improve
robustness. This adds time invariance in the ODE function and a specific
steady state condition.</p>
<p>They argue that the NeuralODE is more robust to perturbations as the ODE
integral curves do not intersect (See Theorem 1). The argument is that due
to the non intersecting property, each perturbation of less than $\epsilon$
would remain sandwiched within the $\epsilon$ ball and its distance from the
unperturbed final position would remain upper bounded. Quite a cool insight.
The extensions of the authors in TisODE aim at controlling the differences
between neighboring integral curves to enhance NeuralODE robustness.</p>
<h3 id="why-not-to-use-zero-imputation-correcting-sparsity-bias-in-training-neural-networks"><a href="https://iclr.cc/virtual/poster_BylsKkHYvH.html">Why Not to Use Zero Imputation? Correcting Sparsity Bias in Training Neural Networks</a></h3>
<p>Using zero imputation induces a bias in the network! Often the prediction
correlates more with the fraction of imputed values than with the
similarity of the instances! This effect is present across a large variety
of datasets including medical data.</p>
<p>Authors coin term “Variable sparsity problem”, where the expected value of
the output layer of a NN depends on the sparsity of the input data. The
existence of this problem is derived theoretically based on a few
assumptions.</p>
<p>The paper suggests that imputation with non zero values is helpful, by
stabilizing the number of known entries. Imputation can be phrased as injecting
plausible noise. Still, it should be considered as injecting noise into the
network!</p>
<p>Authors suggest to use sparsity normalization:</p>
<ul>
<li>Divide the input data by the L1 norm</li>
<li>Then the average activation of the subsequent layer would be independent
of data sparsity</li>
<li>This is a simple preprocessing scheme of the input data!</li>
</ul>
<p>Authors show theoretically that this preprocessing step can fix the variable
sparsity problem. Empirical results are also in line with the theory.</p>
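<p>A small numeric illustration of the effect, assuming that the L1 norm is
taken over the binary observation mask (i.e. the number of observed entries).
The function names, the weights and the inputs are toy values.</p>

```python
def sparsity_normalize(values, mask, scale=1.0):
    """Zero-impute missing entries and divide by the number of observed entries."""
    observed = sum(mask)
    if observed == 0:
        return [0.0 for _ in values]
    return [scale * v * m / observed for v, m in zip(values, mask)]


def pre_activation(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))


weights = [0.5, 0.5, 0.5, 0.5]
values = [2.0, 2.0, 2.0, 2.0]
dense_mask = [1, 1, 1, 1]
sparse_mask = [1, 0, 0, 0]

# Plain zero imputation: the activation depends on how many entries are observed.
raw_dense = pre_activation(weights, values)
raw_sparse = pre_activation(weights, [v * m for v, m in zip(values, sparse_mask)])

# With sparsity normalization the activation no longer depends on the sparsity.
norm_dense = pre_activation(weights, sparsity_normalize(values, dense_mask))
norm_sparse = pre_activation(weights, sparsity_normalize(values, sparse_mask))
```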
Sat, 23 May 2020 00:00:00 +0000
https://ExpectationMax.github.io/2020/ICLR-2020/
conferences, papers
Project: simple-gpu-scheduler - easy scheduling of jobs on multiple GPUs
<p>Our research group has multiple servers each equipped with multiple GPUs.
Unfortunately, these are not connected together in a cluster infrastructure,
but instead, GPUs are assigned to individuals or on a per-project basis. This
makes the execution of many jobs using multiple GPUs difficult.</p>
<p>While it would be possible to connect the servers to a small cluster with
a scheduling system (we are working on it!), this can take a long time until it
is set up. Especially in academia where the maintenance and setup of servers is
often delegated to the department’s IT team, the path to implementing a small
scale cluster is littered with bureaucracy. Questions like: <em>Who is
responsible for xyz?</em>, <em>How are the software installations managed?</em>,
<em>Which alterations should be done to have the correct network infrastructure?</em>
can take ages before they are answered and appropriately implemented. In our
particular case we had the idea of refurbishing the cluster more than a year
ago and are still nowhere close to having it up and running.</p>
<p><img src="https://imgs.xkcd.com/comics/networking_problems.png" alt="XKCD comic about networking problems" title="LOOK, THE LATENCY FALLS EVERY TIME YOU CLAP YOUR HANDS AND SAY YOU BELIEVE" /></p>
<h2 id="the-alternative---simple-gpu-scheduler">The Alternative - <code class="language-plaintext highlighter-rouge">simple-gpu-scheduler</code></h2>
<p>Driven by the need of having something as a bridge between our current server
setup and the to be beautiful world of our personal cluster I decided to write
a small Python package to do the job. This is how
<a href="https://github.com/ExpectationMax/simple_gpu_scheduler">simple-gpu-scheduler</a>
was born.</p>
<h3 id="how-it-works">How it works</h3>
<p>Software based on the CUDA library (such as most deep learning frameworks and
many others), can be constrained to only seeing certain GPUs using the
<code class="language-plaintext highlighter-rouge">CUDA_VISIBLE_DEVICES</code> environment variable. The <code class="language-plaintext highlighter-rouge">simple-gpu-scheduler</code> accepts
commands and executes them while setting the environment variable to
a currently free GPU. As soon as the job finishes, the GPU is released and the
next job is allocated to it. This allows to always utilize all of the GPUs to
the maximally possible extent <sup id="fnref:gnu-parallel"><a href="#fn:gnu-parallel" class="footnote">1</a></sup>.</p>
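<p>The mechanism can be sketched in a few lines of Python. This is not the
code of the package, just a minimal illustration of the idea: a queue holds
the free GPU ids, and every command takes one id from it, runs with
<code class="language-plaintext highlighter-rouge">CUDA_VISIBLE_DEVICES</code> set accordingly, and puts the id
back afterwards. The function name <code class="language-plaintext highlighter-rouge">run_on_gpus</code> and
the returned tuples are made up for this example.</p>

```python
import os
import queue
import subprocess
import threading


def run_on_gpus(commands, gpus):
    """Run shell commands, each with CUDA_VISIBLE_DEVICES set to one free GPU."""
    free_gpus = queue.Queue()
    for gpu in gpus:
        free_gpus.put(gpu)
    results = []
    lock = threading.Lock()

    def worker(command):
        gpu = free_gpus.get()  # blocks until a GPU is free
        try:
            env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
            proc = subprocess.run(command, shell=True, env=env,
                                  capture_output=True, text=True)
            with lock:
                results.append((command, gpu, proc.returncode, proc.stdout.strip()))
        finally:
            free_gpus.put(gpu)  # release the GPU for the next job

    threads = [threading.Thread(target=worker, args=(c,)) for c in commands]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

<p>The sketch starts one thread per command and lets them wait for a free GPU,
which is fine for a handful of jobs but not for reading an unbounded stream of
commands.</p>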
<h3 id="usage">Usage</h3>
<p>I wanted to make <code class="language-plaintext highlighter-rouge">simple-gpu-scheduler</code> as simple and flexible as possible and
thus tried to adhere to the <a href="https://en.wikipedia.org/wiki/KISS_principle">KISS
principle</a>. Like many UNIX tools,
it takes its input from <code class="language-plaintext highlighter-rouge">stdin</code> so that it can be combined with other
tools. This allows reading commands from a list, or even from a fifo (first in,
first out), such that we can build a fully functioning queuing system. For further
reference please consult the <a href="https://github.com/ExpectationMax/simple_gpu_scheduler">GitHub
page</a> of the project.</p>
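<p>Why reading from <code class="language-plaintext highlighter-rouge">stdin</code> is enough to support both a command list and a queue
can be seen in a small sketch. The generator below is a hypothetical helper
(not taken from the package) that hands out one command per non-empty line as
soon as the line arrives, without waiting for the end of the input:</p>

```python
def iter_commands(stream):
    """Yield one command per non-empty line, as soon as each line arrives.

    `stream` can be any iterable of lines, e.g. sys.stdin or an open fifo.
    """
    for line in stream:
        command = line.strip()
        if command:  # skip blank lines
            yield command
```

<p>With <code class="language-plaintext highlighter-rouge">iter_commands(sys.stdin)</code>, a regular file is consumed line by line until
it ends, whereas a fifo delivers new commands whenever another process writes
them, so the same loop serves as a simple queue.</p>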
<h3 id="simple-example">Simple example</h3>
<p>Suppose you have a file <code class="language-plaintext highlighter-rouge">gpu_commands.txt</code> with commands that you would like to
execute on the GPUs 0, 1 and 2 in parallel:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat </span>gpu_commands.txt
python train_model.py <span class="nt">--lr</span> 0.001 <span class="nt">--output</span> run_1
python train_model.py <span class="nt">--lr</span> 0.0005 <span class="nt">--output</span> run_2
python train_model.py <span class="nt">--lr</span> 0.0001 <span class="nt">--output</span> run_3
</code></pre></div></div>
<p>Then you can do so by simply feeding the commands into the <code class="language-plaintext highlighter-rouge">simple_gpu_scheduler</code>
script:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>simple_gpu_scheduler <span class="nt">--gpus</span> 0 1 2 < gpu_commands.txt
Processing <span class="nb">command</span> <span class="sb">`</span>python train_model.py <span class="nt">--lr</span> 0.001 <span class="nt">--output</span> run_1<span class="sb">`</span> on gpu 2
Processing <span class="nb">command</span> <span class="sb">`</span>python train_model.py <span class="nt">--lr</span> 0.0005 <span class="nt">--output</span> run_2<span class="sb">`</span> on gpu 1
Processing <span class="nb">command</span> <span class="sb">`</span>python train_model.py <span class="nt">--lr</span> 0.0001 <span class="nt">--output</span> run_3<span class="sb">`</span> on gpu 0
</code></pre></div></div>
<h3 id="hyperparameter-search">Hyperparameter search</h3>
<p>One of the most common use cases for running many jobs in parallel is
hyperparameter search. For convenience I added a small script
<code class="language-plaintext highlighter-rouge">simple_hypersearch</code> which generates commands to evaluate a hyperparameter
grid. Here is a small example of how to generate all possible configurations
and execute them in random order:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>simple_hypersearch <span class="s2">"python3 train_dnn.py --lr {lr} --batch_size {bs}"</span> <span class="nt">-p</span> lr 0.001 0.0005 0.0001 <span class="nt">-p</span> bs 32 64 128 | simple_gpu_scheduler <span class="nt">--gpus</span> 0 1 2
</code></pre></div></div>
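<p>To make clear what "all possible configurations in random order" means, here
is a minimal Python sketch of grid generation. It is not the code of
<code class="language-plaintext highlighter-rouge">simple_hypersearch</code>; the function name <code class="language-plaintext highlighter-rouge">grid_commands</code> and the <code class="language-plaintext highlighter-rouge">seed</code>
parameter are assumptions made for this example.</p>

```python
import itertools
import random


def grid_commands(template, grid, seed=None):
    """Fill `template` with every combination of the values in `grid`.

    Hypothetical helper: `grid` maps placeholder names to lists of values.
    """
    names = list(grid)
    commands = [
        template.format(**dict(zip(names, values)))
        for values in itertools.product(*(grid[name] for name in names))
    ]
    random.Random(seed).shuffle(commands)  # random execution order
    return commands


commands = grid_commands(
    "python3 train_dnn.py --lr {lr} --batch_size {bs}",
    {"lr": ["0.001", "0.0005", "0.0001"], "bs": ["32", "64", "128"]},
)
```

<p>Three learning rates and three batch sizes yield nine commands. Passing the
values as strings keeps them formatted exactly as written.</p>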
<h3 id="final-words">Final words</h3>
<p>I hope some of you find the software useful. Feel free to open an issue or
feature request if you are missing anything. See you next time!</p>
<div class="footnotes">
<ol>
<li id="fn:gnu-parallel">
<p><a href="https://www.gnu.org/software/parallel/">GNU parallel</a>
can be used to do something similar (see the
<a href="https://news.ycombinator.com/item?id=21269950">HN discussion</a>). It is
significantly more flexible, which IMHO comes at the cost of ease of use. <a href="#fnref:gnu-parallel" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Fri, 25 Oct 2019 00:00:00 +0000
https://ExpectationMax.github.io/2019/simple-gpu-scheduler/
Organizing projects and notes with vimwiki and VimR
<p>After a long time of absence, I decided to reactivate my blog! So here comes
another post related to optimizing the workflow of a PhD student in Computer
Science.</p>
<h2 id="the-problem">The problem</h2>
<p>If there is one thing you should do a lot during your PhD studies, it is reading:
papers, course materials, books, blog articles. Over time it becomes hard to
keep track of everything you have read and the thoughts you had while reading. While it
is possible to track this kind of information (especially for papers) in
reference management software, it is usually not desirable or even possible to store
information from heterogeneous sources in such a format.</p>
<h2 id="a-potential-solution">A potential solution</h2>
<p>As I have developed a strong resistance against using practically any
editor other than (Neo)vim <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, I searched for a solution within this framework. The
most appealing solution for my taste was <a href="https://github.com/vimwiki/vimwiki">vimwiki</a>.
Its main advantages are:</p>
<ul>
<li>Notes from multiple projects can be organized in a hierarchy</li>
<li>Navigating between notes is seamless</li>
<li>New pages are very easy to create</li>
<li>It can be configured to use a markdown compatible syntax</li>
<li>As everything is text, changes can easily be synchronized with git and even
be hosted in a personal wiki on GitHub</li>
</ul>
<h2 id="configuration">Configuration</h2>
<p>I configured vimwiki such that it saves all my notes in a folder <code class="language-plaintext highlighter-rouge">PhDwiki</code> in
my home directory, uses the markdown syntax and only gets activated when files
inside this notes folder are being edited (I want to have the possibility to
configure my general markdown integration independently of vimwiki). This can
be achieved by adding the following settings to your <code class="language-plaintext highlighter-rouge">.vimrc</code> or <code class="language-plaintext highlighter-rouge">init.vim</code>:</p>
<div class="language-vimscript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="nv">g:vimwiki_list</span> <span class="p">=</span> <span class="p">[{</span><span class="s1">'path'</span><span class="p">:</span> <span class="s1">'~/PhDwiki/'</span><span class="p">,</span> <span class="s1">'syntax'</span><span class="p">:</span> <span class="s1">'markdown'</span><span class="p">,</span> <span class="s1">'ext'</span><span class="p">:</span> <span class="s1">'.md'</span><span class="p">,</span> <span class="s1">'index'</span><span class="p">:</span> <span class="s1">'Home'</span><span class="p">}]</span>
<span class="k">let</span> <span class="nv">g:vimwiki_global_ext</span> <span class="p">=</span> <span class="m">0</span>
<span class="c">" Install the plugin; this uses vim-plug (between plug#begin() and plug#end()), but any other plugin manager should also work</span>
Plug <span class="s1">'vimwiki/vimwiki'</span>
</code></pre></div></div>
<h2 id="working-with-equations-in-the-vimr-preview">Working with equations in the VimR preview</h2>
<p>While it is possible to configure vim such that some symbols in equations are
replaced, this usually does not really improve the readability of the
equations. This is mainly due to the very limited support for most constructs,
such as sub- and superscripts. For revising my writing, I rely on the preview
provided by <a href="https://github.com/qvacua/vimr">VimR</a>. While I deactivated all
additional features of the GUI such as the file browser, buffer view etc.,
the markdown and html previews are the most beneficial
components of a GUI compared to running NeoVim in the terminal.</p>
<p>Unfortunately, out of the box this preview does not support equations, which are
omnipresent in courses and papers. Yet not all hope is lost:
thanks to the nifty JavaScript library <a href="https://www.mathjax.org/">MathJax</a>, it can be
made to do so. VimR uses a tiny browser with full JavaScript support
to render markdown and html pages, so by modifying the markdown template to
include MathJax we can patch in support for equations.</p>
<p>For this, we need to edit the template file
<code class="language-plaintext highlighter-rouge">VimR.app/Contents/Resources/markdown/template.html</code> and add the following
lines just before the closing <code class="language-plaintext highlighter-rouge"></head></code> tag. The relevant region of the edited
file should look as follows:</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nt"><title></title></span>
<span class="nt"><script </span><span class="na">type=</span><span class="s">"text/javascript"</span><span class="nt">></span>
<span class="nb">window</span><span class="p">.</span><span class="nx">MathJax</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">extensions</span><span class="p">:</span> <span class="p">[</span><span class="dl">"</span><span class="s2">tex2jax.js</span><span class="dl">"</span><span class="p">],</span>
<span class="na">jax</span><span class="p">:</span> <span class="p">[</span><span class="dl">"</span><span class="s2">input/TeX</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">output/HTML-CSS</span><span class="dl">"</span><span class="p">],</span>
<span class="na">tex2jax</span><span class="p">:</span> <span class="p">{</span>
<span class="na">inlineMath</span><span class="p">:</span> <span class="p">[</span> <span class="p">[</span><span class="dl">"</span><span class="se">\\</span><span class="s2">(</span><span class="dl">"</span><span class="p">,</span><span class="dl">"</span><span class="se">\\</span><span class="s2">)</span><span class="dl">"</span><span class="p">],</span> <span class="p">[</span><span class="dl">'</span><span class="s1">$</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">$</span><span class="dl">'</span><span class="p">]</span> <span class="p">],</span>
<span class="na">displayMath</span><span class="p">:</span> <span class="p">[</span> <span class="p">[</span><span class="dl">'</span><span class="s1">$$</span><span class="dl">'</span><span class="p">,</span><span class="dl">'</span><span class="s1">$$</span><span class="dl">'</span><span class="p">],</span> <span class="p">[</span><span class="dl">"</span><span class="se">\\</span><span class="s2">[</span><span class="dl">"</span><span class="p">,</span><span class="dl">"</span><span class="se">\\</span><span class="s2">]</span><span class="dl">"</span><span class="p">]</span> <span class="p">],</span>
<span class="na">processEscapes</span><span class="p">:</span> <span class="kc">true</span>
<span class="p">},</span>
<span class="dl">"</span><span class="s2">HTML-CSS</span><span class="dl">"</span><span class="p">:</span> <span class="p">{</span>
<span class="na">availableFonts</span><span class="p">:</span> <span class="p">[</span><span class="dl">"</span><span class="s2">STIX</span><span class="dl">"</span><span class="p">],</span>
<span class="na">preferredFont</span><span class="p">:</span> <span class="dl">'</span><span class="s1">STIX</span><span class="dl">'</span><span class="p">,</span>
<span class="na">webFont</span><span class="p">:</span> <span class="dl">'</span><span class="s1">STIX-Web</span><span class="dl">'</span><span class="p">,</span>
<span class="na">imageFont</span><span class="p">:</span> <span class="kc">null</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="nt"></script></span>
<span class="nt"><script </span><span class="na">type=</span><span class="s">"text/javascript"</span> <span class="na">src=</span><span class="s">"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js"</span> <span class="na">async</span><span class="nt">></script></span>
<span class="nt"></head></span>
</code></pre></div></div>
<p>This adds MathJax in a minimal configuration to the markdown template,
allowing it to render math equations. Below you can see how the result looks:</p>
<p><img src="/assets/2019-04-12_VimR_equations.jpg" alt="Vimr with markdown preview" /></p>
<p>While this works well in practice, it is still a rather hacky solution. For
example, it is sometimes necessary to wrap an equation in a pair of <code class="language-plaintext highlighter-rouge"><p></p></code>
tags to prevent the markdown renderer from destroying it. To make
the approach more integrated and robust to updates of the software, I opened an
issue <a href="https://github.com/qvacua/vimr/issues/718">here</a>. I will edit this post
if there is any follow-up development.</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I am planning to do a separate blog post on the benefits and disadvantages of using an editor like Vim for most of your work. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Fri, 12 Apr 2019 00:00:00 +0000
https://ExpectationMax.github.io/2019/organizing-projects-with-vimwiki-and-VimR/
Setting up a Neovim and pipenv based Python development environment
<p>I think everybody has been there after some time:</p>
<ul>
<li>multiple Python venvs for dozens of projects</li>
<li>huge <code class="language-plaintext highlighter-rouge">requirements.txt</code> files containing all dependencies of dependencies</li>
<li>Jupyter notebooks everywhere, including their dependencies</li>
</ul>
<p>For the start of my PhD, I decided to bring some order into the chaos of environments and dependencies by switching to <code class="language-plaintext highlighter-rouge">pipenv</code>. Furthermore, I show how to implement <strong>Jupyter notebook</strong> style programming in a Neovim/Oni development environment.</p>
<h2 id="pipenv">pipenv</h2>
<p>Pipenv is a tool for managing project-specific virtual environments, which additionally enhances reproducibility by recording checksums of the installed packages (<code class="language-plaintext highlighter-rouge">Pipfile.lock</code>).
It is the package manager recommended by http://www.python.org, is straightforward to install, and also supports loading project-specific environment variables from an <code class="language-plaintext highlighter-rouge">.env</code> file.</p>
<p>Virtual environments in pipenv are not stored in the repository of the project, and there are no additional files besides the Pipfile and the Pipfile.lock (these are actually good to have to ensure reproducibility).
The strategy is to avoid installing packages outside of pipenv (for example using pip), which automatically ensures that all project dependencies are tracked and up to date. Overall, pretty neat in my opinion.</p>
<p>On macOS with <a href="https://brew.sh/">brew</a> you can be up and running with Python 3 and pipenv using the following commands:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brew <span class="nb">install </span>python3
brew <span class="nb">install </span>pipenv
</code></pre></div></div>
<p>Afterwards, we can install Jupyter in the global Python3 environment (or the user's Python3 environment by adding the <code class="language-plaintext highlighter-rouge">--user</code> flag) using:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip3 <span class="nb">install</span> <span class="o">[</span><span class="nt">--user</span><span class="o">]</span> jupyter
</code></pre></div></div>
<h3 id="avoiding-reinstalling-jupyter-for-all-venvs">Avoiding reinstalling Jupyter for all venvs</h3>
<p>Now that we have Jupyter installed in the global environment, we don’t want to have to reinstall all of its dependencies for each virtual environment/project we work on.
The trick is that we only need to install the Jupyter kernel in the individual virtual environments and register these kernels with the global installation.</p>
<p>The kernel package that is required for a Jupyter/IPython interface (notebook, Qt Console, console) to communicate with an environment is <code class="language-plaintext highlighter-rouge">ipykernel</code>, which can be installed as a development dependency in pipenv (<code class="language-plaintext highlighter-rouge">pipenv install --dev ipykernel</code>).
Afterwards, the new kernel needs to be registered with the global Jupyter installation.
To make the whole process easier, I added a small bash function to my <code class="language-plaintext highlighter-rouge">~/.bashrc</code> that creates a Python 3 environment, installs <code class="language-plaintext highlighter-rouge">ipykernel</code> as a development dependency and registers the new kernel for use in the global Jupyter installation.</p>
<p>To get the same functionality, add the following lines to your <code class="language-plaintext highlighter-rouge">~/.bashrc</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>init_python3_pipenv <span class="o">()</span> <span class="o">{</span>
<span class="nb">echo</span> <span class="s2">"Setting up pipenv environment"</span>
pipenv <span class="nb">install</span> <span class="nt">--three</span>
<span class="nb">echo</span> <span class="s2">"Installing ipython kernel"</span>
pipenv <span class="nb">install</span> <span class="nt">--dev</span> ipykernel
<span class="c"># Get the name of the environment and strip only the trailing hash for a pretty name</span>
<span class="nv">venv_name</span><span class="o">=</span><span class="si">$(</span><span class="nb">basename</span> <span class="nt">--</span> <span class="s2">"</span><span class="si">$(</span>pipenv <span class="nt">--venv</span><span class="si">)</span><span class="s2">"</span><span class="si">)</span>
<span class="nv">venv_prettyname</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">venv_name</span><span class="p">%-*</span><span class="k">}</span><span class="s2">"</span>
<span class="nb">echo</span> <span class="s2">"Adding ipython kernel to list of jupyter kernels"</span>
<span class="s2">"</span><span class="si">$(</span>pipenv <span class="nt">--py</span><span class="si">)</span><span class="s2">"</span> <span class="nt">-m</span> ipykernel <span class="nb">install</span> <span class="nt">--user</span> <span class="nt">--name</span> <span class="s2">"</span><span class="nv">$venv_name</span><span class="s2">"</span> <span class="se">\</span>
<span class="nt">--display-name</span> <span class="s2">"Python3 (</span><span class="nv">$venv_prettyname</span><span class="s2">)"</span>
<span class="o">}</span>
</code></pre></div></div>
<p>A new project can now easily be set up using:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> ~/Projects/MyAwesomeProject
<span class="nb">cd</span> ~/Projects/MyAwesomeProject
init_python3_pipenv
</code></pre></div></div>
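<p>The naming step of the function can also be written down in Python, which
makes it easy to check what kernel name and display name a given environment
would get. This is a sketch under the assumption that pipenv names its
environments <code class="language-plaintext highlighter-rouge">&lt;project&gt;-&lt;hash&gt;</code>; the helper name <code class="language-plaintext highlighter-rouge">kernel_names</code> and the
example path are hypothetical.</p>

```python
import os


def kernel_names(venv_path):
    """Return (kernel_name, display_name) for a pipenv virtualenv path.

    Hypothetical helper; assumes the directory is named <project>-<hash>.
    """
    venv_name = os.path.basename(os.path.normpath(venv_path))
    # Drop the last hyphen-separated component (the hash) for the pretty name.
    project, sep, _hash = venv_name.rpartition("-")
    pretty = project if sep else venv_name
    return venv_name, f"Python3 ({pretty})"


name, display = kernel_names("/path/to/virtualenvs/MyAwesomeProject-AbCd1234")
```

<p>Here <code class="language-plaintext highlighter-rouge">name</code> is what would be passed to <code class="language-plaintext highlighter-rouge">--name</code> and <code class="language-plaintext highlighter-rouge">display</code> is what would be
passed to <code class="language-plaintext highlighter-rouge">--display-name</code>.</p>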
<h2 id="jupyter-notebook-style-programming-in-onineovim">Jupyter notebook style programming in Oni/Neovim</h2>
<p>For the vim users out there, I will explain how you can turn vim into an interactive development environment similar to working in a Jupyter notebook or using an IDE like Spyder.
This setup involves launching an IPython kernel in a Qt Console, establishing a remote connection to the kernel using the <code class="language-plaintext highlighter-rouge">nvim-ipy</code> plugin and configuring the Qt Console such that it outputs the results of remote commands.
IMHO the result looks quite acceptable:</p>
<p><img src="/assets/2018-04-10_Oni-and-QtConsole.png" alt="Screenshot of working environment" /></p>
<h3 id="running-qtconsole-from-vim-using-correct-kernel">Running QtConsole from vim using correct kernel</h3>
<p>The benefit of the Qt Console is that the output of a command is directly visible, allowing interactive programming with intermediate plots and variable inspection.</p>
<p>On macOS, the dependencies (mainly Qt) of the Qt Console can be installed via brew.
For other operating systems, please refer to the <a href="https://qtconsole.readthedocs.io/en/stable/index.html">Qt Console documentation</a>.
As I only use Python 3 for development on macOS, I got the Qt Console up and running using the following commands:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brew <span class="nb">install </span>sip <span class="nt">--without-python</span>@2
brew <span class="nb">install </span>pyqt <span class="nt">--with-python3</span> <span class="nt">--without-python</span>@2
</code></pre></div></div>
<p>For integration with vim, I use the <a href="https://github.com/bfredl/nvim-ipy">nvim-ipy</a> vim plugin, which can be installed using your favorite vim plugin manager (I personally use <a href="https://github.com/junegunn/vim-plug">vim-plug</a>).
The following commands rely on the installation of <code class="language-plaintext highlighter-rouge">nvim-ipy</code>.
To allow the Qt Console to be launched easily from within vim using the correct kernel, I defined the following vim functions in my <code class="language-plaintext highlighter-rouge">init.vim</code>:</p>
<div class="language-vimscript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="p">!</span> GetKernelFromPipenv<span class="p">()</span>
<span class="c">" Use function-local (l:) variables; the a: scope is reserved for arguments</span>
<span class="k">let</span> <span class="nv">l:kernel</span> <span class="p">=</span> tolower<span class="p">(</span>system<span class="p">(</span><span class="s1">'basename $(pipenv --venv)'</span><span class="p">))</span>
<span class="c">" Remove control characters (most importantly newline)</span>
<span class="k">return</span> substitute<span class="p">(</span><span class="nv">l:kernel</span><span class="p">,</span> <span class="s1">'[[:cntrl:]]'</span><span class="p">,</span> <span class="s1">''</span><span class="p">,</span> <span class="s1">'g'</span><span class="p">)</span>
<span class="k">endfunction</span>
<span class="k">function</span><span class="p">!</span> ConnectToPipenvKernel<span class="p">()</span>
<span class="k">let</span> <span class="nv">l:kernel</span> <span class="p">=</span> GetKernelFromPipenv<span class="p">()</span>
<span class="k">call</span> IPyConnect<span class="p">(</span><span class="s1">'--kernel'</span><span class="p">,</span> <span class="nv">l:kernel</span><span class="p">,</span> <span class="s1">'--no-window'</span><span class="p">)</span>
<span class="k">endfunction</span>
<span class="k">function</span><span class="p">!</span> AddFilepathToSyspath<span class="p">()</span>
<span class="k">let</span> <span class="nv">l:filepath</span> <span class="p">=</span> expand<span class="p">(</span><span class="s1">'%:p:h'</span><span class="p">)</span>
<span class="k">call</span> IPyRun<span class="p">(</span><span class="s1">'import sys; sys.path.append("'</span> <span class="p">.</span> <span class="nv">l:filepath</span> <span class="p">.</span> <span class="s1">'")'</span><span class="p">)</span>
echo <span class="s1">'Added '</span> <span class="p">.</span> <span class="nv">l:filepath</span> <span class="p">.</span> <span class="s1">' to the Python sys.path'</span>
<span class="k">endfunction</span>
command<span class="p">!</span> <span class="p">-</span>nargs<span class="p">=</span><span class="m">0</span> ConnectToPipenvKernel <span class="k">call</span> ConnectToPipenvKernel<span class="p">()</span>
command<span class="p">!</span> <span class="p">-</span>nargs<span class="p">=</span><span class="m">0</span> RunQtConsole <span class="k">call</span> jobstart<span class="p">(</span><span class="s2">"jupyter qtconsole --existing"</span><span class="p">)</span>
command<span class="p">!</span> <span class="p">-</span>nargs<span class="p">=</span><span class="m">0</span> AddFilepathToSyspath <span class="k">call</span> AddFilepathToSyspath<span class="p">()</span>
</code></pre></div></div>
Tue, 10 Apr 2018 17:00:00 +0000
https://ExpectationMax.github.io/2018/Neovim-pipenv-based-development-environment/