1.4: 04 Random vectors and independence - Mathematics


Problem 704

Let $A=egin 2 & 4 & 6 & 8 \ 1 ſ & 0 & 5 \ 1 & 1 & 6 & 3 end$.
(a) Find a basis for the nullspace of $A$.

(b) Find a basis for the row space of $A$.

(c) Find a basis for the range of $A$ that consists of column vectors of $A$.

(d) For each column vector which is not a basis vector that you obtained in part (c), express it as a linear combination of the basis vectors for the range of $A$.
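For reference, all four parts can be checked with a computer algebra system. Below is a sketch using SymPy, assuming the second row of $A$ is $(1, 3, 0, 5)$:

```python
from sympy import Matrix

A = Matrix([[2, 4, 6, 8],
            [1, 3, 0, 5],
            [1, 1, 6, 3]])

# (a) Basis for the nullspace of A.
null_basis = A.nullspace()

# (b) Basis for the row space: the nonzero rows of the reduced row echelon form.
rref_matrix, pivot_cols = A.rref()
row_basis = [rref_matrix.row(i) for i in range(A.rank())]

# (c) Basis for the range (column space) consisting of column vectors of A:
# take the pivot columns of A itself.
col_basis = [A.col(j) for j in pivot_cols]

print(null_basis)   # two basis vectors, since A has 4 columns and rank 2
print(pivot_cols)   # indices of the columns of A that form a basis for the range
```

For part (d), each non-pivot column of $A$ can be expressed in terms of the pivot columns by reading off the corresponding column of the reduced row echelon form.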

1.4: 04 Random vectors and independence - Mathematics

In real life, we usually need to deal with more than one random variable. For example, if you study physical characteristics of people in a certain area, you might pick a person at random and then look at his/her weight, height, etc. The weight of the randomly chosen person is one random variable, while his/her height is another one. Not only do we need to study each random variable separately, but also we need to consider if there is dependence (i.e., correlation) between them. Is it true that a taller person is more likely to be heavier or not? The issues of dependence between several random variables will be studied in detail later on, but here we would like to talk about a special scenario where two random variables are independent.

The concept of independent random variables is very similar to that of independent events. Remember, two events $A$ and $B$ are independent if we have $P(A,B)=P(A)P(B)$ (remember, the comma means "and", i.e., $P(A,B)=P(A \textrm{ and } B)=P(A \cap B)$). Similarly, we have the following definition for independent discrete random variables.

Intuitively, two random variables $X$ and $Y$ are independent if knowing the value of one of them does not change the probabilities for the other one. In other words, if $X$ and $Y$ are independent, we can write $P(Y=y|X=x)=P(Y=y), \textrm{ for all } x,y.$ Similar to independent events, it is sometimes easy to argue that two random variables are independent simply because they do not have any physical interactions with each other. Here is a simple example: I toss a coin $2N$ times. Let $X$ be the number of heads that I observe in the first $N$ coin tosses and let $Y$ be the number of heads that I observe in the second $N$ coin tosses. Since $X$ and $Y$ are the result of independent coin tosses, the two random variables $X$ and $Y$ are independent. On the other hand, in other scenarios, it might be more complicated to show whether two random variables are independent.

I toss a coin twice and define $X$ to be the number of heads I observe. Then, I toss the coin two more times and define $Y$ to be the number of heads that I observe this time. Find $P\big((X<2) \textrm{ and } (Y>1)\big)$.
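Because the four tosses are fair and independent, the joint probability factors into the product of the marginals. A short enumeration over all 16 equally likely outcomes shows how such probabilities can be computed and confirms the product rule (the specific event, (X < 2) and (Y > 1), is an assumed example):

```python
from itertools import product

# Enumerate all 2^4 equally likely outcomes of four fair coin tosses.
# X = number of heads in the first two tosses, Y = heads in the last two.
outcomes = list(product("HT", repeat=4))

def prob(event):
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

p_joint = prob(lambda o: o[:2].count("H") < 2 and o[2:].count("H") > 1)
p_x = prob(lambda o: o[:2].count("H") < 2)   # P(X < 2) = 3/4
p_y = prob(lambda o: o[2:].count("H") > 1)   # P(Y > 1) = 1/4

print(p_joint, p_x * p_y)  # equal, since X and Y are independent
```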

Names

A great feature of R’s vectors is that each element can be given a name. Labeling the elements can often make your code much more readable. You can specify names when you create a vector in the form name = value . If the name of an element is a valid variable name, it doesn’t need to be enclosed in quotes. You can name some elements of a vector and leave others blank:

You can add element names to a vector after its creation using the names function:

The names function can also be used to retrieve the names of a vector:

If a vector has no element names, then the names function returns NULL :

Discussion

A frequent situation in machine learning is having a huge amount of data; however, most of the elements in the data are zeros. For example, imagine a matrix where the columns are every movie on Netflix, the rows are every Netflix user, and the values are how many times a user has watched that particular movie. This matrix would have tens of thousands of columns and millions of rows! However, since most users do not watch most movies, the vast majority of elements would be zero.

Sparse matrices only store nonzero elements and assume all other values will be zero, leading to significant computational savings. In our solution, we created a NumPy array with two nonzero values, then converted it into a sparse matrix. If we view the sparse matrix we can see that only the nonzero values are stored:

There are a number of types of sparse matrices. In compressed sparse row (CSR) matrices, (1, 1) and (2, 0) represent the (zero-indexed) indices of the nonzero values 1 and 3, respectively. For example, the element 1 is in the second row and second column. We can see the advantage of sparse matrices if we create a much larger matrix with many more zero elements and then compare it with our original sparse matrix:
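Since the solution's code itself is not shown here, the following sketch (with array values assumed to match the stored entries described above) reproduces it using scipy.sparse:

```python
import numpy as np
from scipy import sparse

# Original matrix with two nonzero values (1 and 3).
matrix = np.array([[0, 0],
                   [0, 1],
                   [3, 0]])
matrix_sparse = sparse.csr_matrix(matrix)
print(matrix_sparse)   # only the two nonzero entries are stored

# A much larger matrix that is still mostly zeros.
matrix_large = np.zeros((3, 10000), dtype=int)
matrix_large[1, 1] = 1
matrix_large[2, 0] = 3
matrix_large_sparse = sparse.csr_matrix(matrix_large)
print(matrix_large_sparse)  # same two stored entries as before
```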

As we can see, despite the fact that we added many more zero elements in the larger matrix, its sparse representation is exactly the same as our original sparse matrix. That is, the addition of zero elements did not change the size of the sparse matrix.

As mentioned, there are many different types of sparse matrices, such as compressed sparse column, list of lists, and dictionary of keys. While an explanation of the different types and their implications is outside the scope of this book, it is worth noting that while there is no “best” sparse matrix type, there are meaningful differences between them and we should be conscious about why we are choosing one type over another.

1. Introduction

Measuring and testing dependence between $X$ and $Y$ is a fundamental problem in statistics. The Pearson correlation is perhaps the first and the best-known quantity to measure the degree of linear dependence between two univariate random variables. Extensions including Spearman's (1904) rho, Kendall's (1938) tau, and those due to Hoeffding (1948) and Blum (1961) can be used to measure nonlinear dependence without moment conditions.
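The contrast between linear and nonlinear dependence can be made concrete with a small numerical illustration (a synthetic example of my own, not from the paper): the Pearson correlation detects a linear relationship but can be near zero even under perfect nonlinear dependence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y_linear = 2.0 * x + rng.normal(scale=0.1, size=1000)  # strong linear dependence
y_nonlinear = x ** 2                                   # purely nonlinear dependence

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_nonlinear = np.corrcoef(x, y_nonlinear)[0, 1]

print(r_linear)     # close to 1
print(r_nonlinear)  # close to 0: Pearson misses the nonlinear dependence
```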

Testing independence has important applications. Two examples from genomics research are testing whether two groups of genes are associated and examining whether certain phenotypes are determined by particular genotypes. In social science research, scientists are interested in understanding potential associations between psychological and physiological characteristics. Wilks (1935) introduced a parametric test based on $|\widehat{\Sigma}_{x,y}| / (|\widehat{\Sigma}_{x}|\,|\widehat{\Sigma}_{y}|)$, where $\Sigma_{x,y} = \mathrm{cov}\{(X^{\mathrm{T}}, Y^{\mathrm{T}})^{\mathrm{T}}\} \in \mathbb{R}^{(p+q)\times(p+q)}$, $\Sigma_{x} = \mathrm{cov}(X) \in \mathbb{R}^{p\times p}$ and $\Sigma_{y} = \mathrm{cov}(Y) \in \mathbb{R}^{q\times q}$. Throughout, $\mathrm{cov}(\cdot)$ stands for the covariance matrix and $|\Sigma_{x}|$ stands for the determinant of $\Sigma_{x}$. Hotelling (1936) suggested the canonical correlation coefficient, which seeks $\alpha \in \mathbb{R}^{p}$ and $\beta \in \mathbb{R}^{q}$ such that the Pearson correlation between $\alpha^{\mathrm{T}} X$ and $\beta^{\mathrm{T}} Y$ is maximized. Both Wilks's test and the canonical correlation can be used to test for independence between $X$ and $Y$ when they follow normal distributions. Nonparametric extensions of Wilks's test were proposed by Puri & Sen (1971), Hettmansperger & Oja (1994), Gieser & Randles (1997), Taskinen et al. (2003) and Taskinen et al. (2005). These tests can be used to test for independence between $X$ and $Y$ when they follow elliptically symmetric distributions, but they are inapplicable when the normality or ellipticity assumptions are violated or when the dimensions of $X$ and $Y$ exceed the sample size. In addition, multivariate rank-based tests of independence are ineffective for testing nonmonotone dependence (Székely et al., 2007). The distance correlation (Székely et al., 2007) can be used to measure and test dependence between $X$ and $Y$ in arbitrary dimensions without assuming normality or ellipticity.
Provided that $E(\|X\| + \|Y\|) < \infty$, the distance correlation between $X$ and $Y$, denoted by $\mathrm{dcorr}(X, Y)$, is nonnegative, and it equals zero if and only if $X$ and $Y$ are independent. Throughout, we define $\|a\| = (a^{\mathrm{T}} a)^{1/2}$ for a vector $a$. Székely & Rizzo (2013) observed that the distance correlation may be adversely affected by the dimensions of $X$ and $Y$, and proposed an unbiased estimator of it when $X$ and $Y$ are high-dimensional. In this paper, we shall demonstrate that the distance correlation may be less efficient in detecting nonlinear dependence when the assumption $E(\|X\| + \|Y\|) < \infty$ is violated. To remove this moment condition, Benjamini et al. (2013) suggested using ranks of distances, but this involves the selection of several tuning parameters, the choice of which is an open problem. The asymptotic properties of a test based on ranks of distances also need further investigation.

We propose using the projection correlation to characterize dependence between $X$ and $Y$. The projection correlation first projects the multivariate random vectors into a series of univariate random variables, then detects nonlinear dependence by calculating the Pearson correlation between the dichotomized univariate random variables. The projection correlation between $X$ and $Y$, denoted by $\mathrm{pcorr}(X, Y)$, is nonnegative and equals zero if and only if $X$ and $Y$ are independent, so it is generally applicable as an index for measuring the degree of nonlinear dependence without moment conditions, normality or ellipticity (Tracz et al., 1992). The projection correlation test for independence is consistent against all dependence alternatives. The projection correlation is free of tuning parameters and is invariant to orthogonal transformations. We shall show that the sample estimator of the projection correlation is $n$-consistent if $X$ and $Y$ are independent and root-$n$-consistent otherwise. We conduct Monte Carlo studies to evaluate the finite-sample performance of the projection correlation test. The results indicate that the projection correlation is less sensitive to the dimensions of $X$ and $Y$ than the distance correlation and even its improved version (Székely & Rizzo, 2013), and is more powerful than both the distance correlation and ranks of distances, especially when the dimensions of $X$ and $Y$ are relatively large or the moment conditions required by the distance correlation are violated.

When using vector addition, we have to consider both attributes of a vector, namely its direction and its magnitude.

Keep in mind that two vectors with the same direction can be added like scalars: their magnitudes simply add.

In this topic, we will explore graphical and mathematical methods of vector addition, including:

1. Vector Addition Using the Head-to-Tail Rule
2. Vector Addition Using the Parallelogram Method
3. Vector Addition Using the Components

Vector addition can be performed using the famous head-to-tail method. According to this rule, two vectors can be added together by placing them together so that the first vector’s head joins the tail of the second vector. The resultant sum vector can then be obtained by joining the first vector’s tail to the head of the second vector. This is sometimes also known as the triangle method of vector addition.

Vector addition using the head-to-tail rule is illustrated in the image below. The two vectors P and Q are added using the head-to-tail method, and we can see the triangle formed by the two original vectors and the sum vector.

First, the two vectors P and Q are placed together such that the head of vector P connects the tail of vector Q. Next, to find the sum, a resultant vector R is drawn such that it connects the tail of P to the head of Q.

Mathematically, the sum, or resultant, vector, R, in the image below can be expressed as:

R = P + Q

Vector Addition Using the Parallelogram Method

To understand vector addition using the parallelogram method, we will consider and explain the figure below.

First, draw the given vectors, A and B, to have the same initial point as shown in the image below. Then, draw a parallelogram using the copies of the given vectors.

Second, draw the copy of the vector B called B’, and place it parallel to the vector B to connect to the head of the first vector, A. Similarly, draw a copy of the vector A called A’, and place it parallel to A so that its tail connects with the head of vector B.

Finally, the resultant of the two vectors, which is equal to the sum of vectors A and B, is the parallelogram's diagonal. It can be drawn by joining the common initial point of vectors A and B to the opposite vertex, where the heads of A' and B' meet.

In summary, three steps are required to perform the vector addition using the parallelogram method:

Step 1: Place the two vectors so that they have a common starting point

Step 2: Draw and complete the parallelogram using copies of the two original vectors

Step 3: The diagonal of the parallelogram is then equal to the sum of the two vectors

As we know, vectors given in Cartesian coordinates can be decomposed into their horizontal and vertical components. For example, a vector P at an angle Φ, as shown in the image below, can be decomposed into its components as:

Px, which represents the component of vector P along the horizontal axis (x-axis), and

Py, which represents the component of vector P along the vertical axis (y-axis).

It can be seen that the three vectors form a right triangle and that the vector P can be expressed as:

P = Px + Py

Mathematically, the components of a vector can also be calculated using the magnitude and the angle of the given vector.

Px = P cos Φ

Py = P sin Φ

Moreover, we can also determine the resultant vector if its horizontal and vertical components are given. For example, if the values of Px and Py are given, then we can calculate the magnitude and the angle of the vector P as follows:

|P| = √((Px)^2 + (Py)^2)

And the angle can be found as:

Φ = tan⁻¹(Py / Px)

Thus, in summary, we can determine a resultant vector if its components are given. Alternatively, if the vector itself is given, we can determine the components using the above equations.

Similarly, if the vectors are expressed as ordered pairs (column vectors), we can perform the addition operation on the vectors using their components. For example, consider the two vectors M = (m1, m2) and N = (n1, n2).

Performing vector addition on the two vectors is equivalent to adding the two vectors’ respective x and y components. This yields the resultant vector S:

S = M + N

S = (m1 + n1, m2 + n2).

It can be written explicitly as S = (Sx, Sy), where Sx = m1 + n1 and Sy = m2 + n2.

The magnitude of the resultant vector S can be computed as:

|S| = √((Sx)^2 + (Sy)^2)

And the angle can be computed as:

Φ = tan⁻¹(Sy / Sx)
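The component method can be sketched in a few lines of Python (the function names are my own, purely for illustration):

```python
import math

def add_vectors(m, n):
    """Add two 2-D vectors component-wise: S = M + N."""
    return (m[0] + n[0], m[1] + n[1])

def magnitude(v):
    """|S| = sqrt(Sx^2 + Sy^2)."""
    return math.hypot(v[0], v[1])

def angle(v):
    """Direction of the vector in degrees, measured from the positive x-axis."""
    return math.degrees(math.atan2(v[1], v[0]))

M = (3, 4)
N = (1, -2)
S = add_vectors(M, N)          # (4, 2)
print(S, magnitude(S), angle(S))
```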

Combining a Variational Autoencoder into the Model

To deal with the problem of generating a diverse set of examples, I combined a Variational Autoencoder (VAE) with our network. I won't go through the details of explaining VAEs here, as there have been some great posts about them, and a very nice TensorFlow implementation.

VAEs help us do two things. Firstly, they allow us to encode an existing image into a much smaller latent Z vector, kind of like compression. This is done by passing an image through the encoder network, which we will call the Q network, with its own set of weights. From this encoded latent vector Z, the generator network will produce an image that is as close as possible to the original image passed in, hence it is an autoencoder system. This solves the problem we had in the GAN model, because if the generator only produces certain digits but not others, it will get penalised for failing to reproduce many examples in the training set.

So far, we have assumed the Z vector consists of simple independent unit gaussian variables. There's no guarantee, though, that the encoder network Q, when fed a random training image X, will produce values of Z that belong to a probability distribution we can reproduce and draw from at random, like a gaussian. Imagine we just stop here and train this autoencoder as it is. We would lack the ability to generate random images, because we would lack the ability to draw Z from a random distribution. If we draw Z from the gaussian distribution, it will only be by chance that Z corresponds to some value from the training set; otherwise it will produce images that do not look like the image set.

The ability to control the exact distribution of Z is the second thing the VAE helps us do, and the main reason why the VAE paper is such an important and influential paper. In addition to performing the autoencoding function, the latent Z variables generated from the Q network will also have the characteristic of being simple independent unit gaussian random variables. In other words, if X is a random image from our training set, belonging to whatever weird and complicated probability distribution, the Q network will make sure Z is constructed in a way such that P(Z|X) is a simple set of independent unit gaussian random variables. The amazing thing is that the difference between the distribution of P(Z|X) and a unit gaussian distribution (called the KL divergence) can be quantified and minimised with gradient descent using some elegant mathematical machinery, by injecting gaussian noise into the output layer of the Q network. This VAE model can be trained by minimising the sum of both the reconstruction error and the KL divergence error using gradient descent, as in Equation 10 of the VAE paper.
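For concreteness, here is a small NumPy sketch (my own, not the post's TensorFlow code) of the combined loss: a squared-pixel reconstruction term standing in for the reconstruction error, plus the closed-form KL divergence between the encoder's diagonal gaussian N(mu, sigma²) and a unit gaussian:

```python
import numpy as np

def kl_to_unit_gaussian(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) summed over the latent dimensions,
    with log_var = log(sigma^2) as output by the encoder."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def vae_loss(x, x_reconstructed, mu, log_var):
    """Reconstruction error plus KL term, in the spirit of Equation 10."""
    reconstruction = np.sum((x - x_reconstructed) ** 2)
    return reconstruction + kl_to_unit_gaussian(mu, log_var)

# The KL term is zero exactly when the encoder already outputs a unit gaussian.
print(kl_to_unit_gaussian(np.zeros(32), np.zeros(32)))  # 0.0
```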

Our final CPPN model combined with GAN + VAE:

Rather than sticking with the pure VAE model, I wanted to combine VAE with GAN, because I found that if I stuck with only VAE, the images it generated in the end looked very blurry and uninteresting when we blow up the image. I think this is due to the error term being calculated off pixel errors, and this is a known problem for the VAE model. Nonetheless, it is still useful for our cause, and if we are able to combine it with GAN, we may be able to train a model that will be able to reproduce every digit, and look more realistic with the discriminator network acting as a final filter.

Training this combined model will require some tweaks to our existing algorithm, because we will also need to train to optimise the VAE's error. Note that we will adjust the weights of both the generator and the Q networks when optimising for both G_loss and VAE_loss.

CPPN+GAN+VAE algorithm for 1 epoch of training:

The trick here is to structure and balance all the subnetworks so that G_loss and D_loss hover around 0.69 (about −ln 0.5, the loss value when the discriminator can do no better than guessing); that way they are mutually trying to improve by fighting each other over time, and improve at the same rate. In addition, we should see VAE_loss decrease over time, epoch by epoch, while the other two networks battle it out. It is kind of a black art to train these things and maintain the balance. The VAE is trying to walk across a plank connecting two speed boats (G and D) that are trying to outrace each other.
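The balancing act can be sketched as a training-loop heuristic. The code below is an illustrative reconstruction under my own assumptions (the 0.69 target and the skip-update rule are not necessarily the post's exact algorithm):

```python
# Illustrative sketch of one epoch of balanced CPPN+GAN+VAE training.
# train_d and train_g_and_q stand in for real optimiser steps.

TARGET = 0.69  # about -ln(0.5): D's loss when it cannot beat guessing

def run_epoch(batches, losses, train_d, train_g_and_q):
    for batch in batches:
        d_loss, g_loss, vae_loss = losses(batch)
        # Skip updating D when it is already winning, so G can catch up
        # and both losses keep hovering near the 0.69 equilibrium.
        if d_loss > 0.9 * TARGET:
            train_d(batch)
        # G and Q are always updated: G_loss keeps the fight going, and
        # VAE_loss (reconstruction + KL) should fall epoch by epoch.
        train_g_and_q(batch)
```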

After training the model, we can see the results of feeding random vectors of Z, drawn from unit gaussian distribution, into our G network, and we can generate some random large images. Let’s see what we end up with!

Random latent vectors

We can generate some random large samples from our trained model in IPython:

We can see how our generator network takes in any random vector Z, consisting of 32 real numbers, and generates a random image that sort of looks like a number digit based on the values of Z.

The next thing we want to try is to compare actual MNIST examples to the autoencoded ones. That is, take a random MNIST image, encode the image to a latent vector Z, and then generate back the image. We will first generate the image with the same dimensions as the example ( 26x26 ), and then an image 50 times larger ( 1300x1300 ) to see the network imagine what MNIST should look like were it much larger.

First, we draw a random picture from MNIST and display it.

Then, we encode that picture into Z.

From Z, we generate a 26x26 reconstruction image.

We can also generate much larger reconstruction image using the same Z.

The current VAE+GAN structure seems to produce cloudy versions of MNIST images when we scale them up, like trying to draw something from smoke.

Below are more comparisons of autoencoded examples versus the originals. Sometimes the network makes mistakes, so it is not perfect. There is an example of a zero being misinterpreted as a six, and a three getting totally messed up. You can try to generate your own writing samples and feed an image into IPython to see what autoencoded examples get generated. Maybe in the future I can make a javascript demo to do this.

Autoencoded Samples

As discussed earlier, the latent Z vector can be interpreted as a compressed, coded version of actual images, like a non-linear version of PCA. Embedded in these 32 numbers is not only the digit the image represents, but also other information, such as the size, style, and orientation of the image. Not everyone writes the same way: some people write digits with a loop, some without, and some people write digits larger than others, with a more aggressive pen stroke. We see that the autoencoder can capture most of this information successfully and reproduce a version of the original image. An analogy would be one person looking at an image and taking notes describing it in great detail, and then another person reproducing the original image from the notes.

Solution: |a| = √(2² + 4²) = √(4 + 16) = √20 = 2√5.

Solution: |a| = √(3² + (−4)²) = √(9 + 16) = √25 = 5.

Solution: |a| = √(2² + 4² + 4²) = √(4 + 16 + 16) = √36 = 6.

Solution: |a| = √((−1)² + 0² + (−3)²) = √(1 + 0 + 9) = √10.

Examples of problems in n-dimensional space

Solution: |a| = √(1² + (−3)² + 3² + (−1)²) = √(1 + 9 + 9 + 1) = √20 = 2√5.

Solution: |a| = √(2² + 4² + 4² + 6² + 2²) = √(4 + 16 + 16 + 36 + 4) = √76 = 2√19.
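The same pattern extends to any number of dimensions; as a quick check of the computations above in Python:

```python
import math

def magnitude(a):
    """Length of a vector in any dimension: the square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in a))

print(magnitude((2, 4)))            # 2*sqrt(5), about 4.472
print(magnitude((3, -4)))           # 5.0
print(magnitude((2, 4, 4, 6, 2)))   # 2*sqrt(19), about 8.718
```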
