# 1.4: 04 Random vectors and independence

## Functions of Independent Random Variables

Is it true that functions of independent random variables are themselves independent?

I have often seen this result used implicitly in proofs, for example in the proof that the sample mean and the sample variance of a normal sample are independent, but I have not been able to find a justification for it. Some authors seem to take it as given, but I am not certain that it always holds.

## Discussion

A frequent situation in machine learning is having a huge amount of data in which most of the elements are zeros. For example, imagine a matrix where the columns are every movie on Netflix, the rows are every Netflix user, and the values are how many times a user has watched that particular movie. This matrix would have tens of thousands of columns and millions of rows! However, since most users do not watch most movies, the vast majority of elements would be zero.

Sparse matrices only store nonzero elements and assume all other values will be zero, leading to significant computational savings. In our solution, we created a NumPy array with two nonzero values, then converted it into a sparse matrix. If we view the sparse matrix we can see that only the nonzero values are stored:
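A minimal sketch of this step, assuming SciPy's `scipy.sparse.csr_matrix` and a 3×2 array chosen so that the nonzero values 1 and 3 sit at the zero-indexed positions (1, 1) and (2, 0). The variable names are illustrative, and the exact printed layout depends on the SciPy version:

```python
import numpy as np
from scipy import sparse

# Small dense matrix with exactly two nonzero entries (illustrative values).
matrix = np.array([[0, 0],
                   [0, 1],
                   [3, 0]])

# Convert to compressed sparse row (CSR) format.
matrix_sparse = sparse.csr_matrix(matrix)

# Printing lists each stored entry by its (row, column) index and value.
print(matrix_sparse)

# The stored values and their count can also be inspected directly.
stored_values = matrix_sparse.data.tolist()
stored_count = matrix_sparse.nnz
```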

There are a number of types of sparse matrices. In the compressed sparse row (CSR) output, (1, 1) and (2, 0) are the (zero-indexed) indices of the nonzero values 1 and 3, respectively. For example, the element 1 is in the second row and second column. We can see the advantage of sparse matrices if we create a much larger matrix with many more zero elements and then compare this larger matrix with our original sparse matrix:
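The comparison might be sketched as follows, again assuming `scipy.sparse.csr_matrix`. The larger matrix is a made-up example that keeps the same two nonzero values and pads each row with extra zeros:

```python
import numpy as np
from scipy import sparse

# Original small matrix: nonzero values 1 and 3 at (1, 1) and (2, 0).
matrix = np.array([[0, 0],
                   [0, 1],
                   [3, 0]])

# Same nonzero entries, padded with many additional zero columns.
matrix_large = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                         [3, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

matrix_sparse = sparse.csr_matrix(matrix)
matrix_large_sparse = sparse.csr_matrix(matrix_large)

# Both representations store the same two values at the same indices.
print(matrix_sparse)
print(matrix_large_sparse)

same_values = matrix_sparse.data.tolist() == matrix_large_sparse.data.tolist()
same_columns = matrix_sparse.indices.tolist() == matrix_large_sparse.indices.tolist()
same_row_pointers = matrix_sparse.indptr.tolist() == matrix_large_sparse.indptr.tolist()
```

Only the declared shape differs between the two objects; the arrays that hold the stored entries are identical.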

As we can see, despite the fact that we added many more zero elements in the larger matrix, its sparse representation is exactly the same as our original sparse matrix. That is, the addition of zero elements did not change the size of the sparse matrix.

As mentioned, there are many different types of sparse matrices, such as compressed sparse column, list of lists, and dictionary of keys. While an explanation of the different types and their implications is outside the scope of this book, it is worth noting that although there is no “best” sparse matrix type, there are meaningful differences between them, and we should be conscious of why we are choosing one type over another.

## 1. Introduction

Measuring and testing dependence between $\mathbf{x}$ and $\mathbf{y}$ is a fundamental problem in statistics. The Pearson correlation is perhaps the first and the best-known quantity to measure the degree of linear dependence between two univariate random variables. Extensions including Spearman’s (1904) rho, Kendall’s (1938) tau, and those due to Hoeffding (1948) and Blum (1961) can be used to measure nonlinear dependence without moment conditions.
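As a small illustration of why measures beyond the Pearson correlation are needed, the sketch below uses a constructed toy example that is not taken from the paper: $X$ is uniform on $\{-1, 0, 1\}$ and $Y = X^2$. Here $Y$ is a deterministic function of $X$, yet the covariance, and hence the Pearson correlation, is exactly zero.

```python
from fractions import Fraction

# Toy distribution (an assumption for illustration): X is uniform on {-1, 0, 1}
# and Y = X**2, so Y is completely determined by X.
support = [-1, 0, 1]
prob = Fraction(1, 3)

def expectation(func):
    """Exact expectation of func(X) under the toy distribution."""
    return sum(prob * func(x) for x in support)

mean_x = expectation(lambda x: x)
mean_y = expectation(lambda x: x * x)
covariance = expectation(lambda x: x * (x * x)) - mean_x * mean_y

# Independence would require P(X = 0, Y = 0) == P(X = 0) * P(Y = 0).
p_joint = prob            # Y = 0 happens exactly when X = 0
p_product = prob * prob   # P(X = 0) * P(Y = 0), both equal to 1/3

print("covariance:", covariance)
print("P(X=0, Y=0):", p_joint, "vs product:", p_product)
```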

Testing independence has important applications. Two examples from genomics research are testing whether two groups of genes are associated and examining whether certain phenotypes are determined by particular genotypes. In social science research, scientists are interested in understanding potential associations between psychological and physiological characteristics. Wilks (1935) introduced a parametric test based on $|\Sigma_{\mathbf{x},\mathbf{y}}| / (|\Sigma_{\mathbf{x}}|\,|\Sigma_{\mathbf{y}}|)$, where $\Sigma_{\mathbf{x},\mathbf{y}} = \mathrm{cov}\{(\mathbf{x}^{\mathrm{T}}, \mathbf{y}^{\mathrm{T}})^{\mathrm{T}}\} \in \mathbb{R}^{(p+q)\times(p+q)}$, $\Sigma_{\mathbf{x}} = \mathrm{cov}(\mathbf{x}) \in \mathbb{R}^{p\times p}$ and $\Sigma_{\mathbf{y}} = \mathrm{cov}(\mathbf{y}) \in \mathbb{R}^{q\times q}$. Throughout, $\Sigma_{\mathbf{x}} = \mathrm{cov}(\mathbf{x})$ stands for the covariance matrix of $\mathbf{x}$ and $|\Sigma_{\mathbf{x}}|$ stands for the determinant of $\Sigma_{\mathbf{x}}$. Hotelling (1936) suggested the canonical correlation coefficient, which seeks $\alpha \in \mathbb{R}^p$ and $\beta \in \mathbb{R}^q$ such that the Pearson correlation between $\alpha^{\mathrm{T}}\mathbf{x}$ and $\beta^{\mathrm{T}}\mathbf{y}$ is maximized. Both Wilks’s test and the canonical correlation can be used to test for independence between $\mathbf{x}$ and $\mathbf{y}$ when they follow normal distributions. Nonparametric extensions of Wilks’s test were proposed by Puri & Sen (1971), Hettmansperger & Oja (1994), Gieser & Randles (1997), Taskinen et al. (2003) and Taskinen et al. (2005). These tests can be used to test for independence between $\mathbf{x}$ and $\mathbf{y}$ when they follow elliptically symmetric distributions, but they are inapplicable when the normality or ellipticity assumptions are violated or when the dimensions of $\mathbf{x}$ and $\mathbf{y}$ exceed the sample size. In addition, multivariate rank-based tests of independence are ineffective for testing nonmonotone dependence (Székely et al., 2007). The distance correlation (Székely et al., 2007) can be used to measure and test dependence between $\mathbf{x}$ and $\mathbf{y}$ in arbitrary dimensions without assuming normality or ellipticity. Provided that $E(\|\mathbf{x}\| + \|\mathbf{y}\|) < \infty$, the distance correlation between $\mathbf{x}$ and $\mathbf{y}$, denoted by $\mathrm{DC}(\mathbf{x},\mathbf{y})$, is nonnegative, and it equals zero if and only if $\mathbf{x}$ and $\mathbf{y}$ are independent. Throughout, we define $\|\mathbf{x}\| = (\mathbf{x}^{\mathrm{T}}\mathbf{x})^{1/2}$ for a vector $\mathbf{x}$. Székely & Rizzo (2013) observed that the distance correlation may be adversely affected by the dimensions of $\mathbf{x}$ and $\mathbf{y}$, and proposed an unbiased estimator of it when $\mathbf{x}$ and $\mathbf{y}$ are high-dimensional. In this paper, we shall demonstrate that the distance correlation may be less efficient in detecting nonlinear dependence when the assumption $E(\|\mathbf{x}\| + \|\mathbf{y}\|) < \infty$ is violated. To remove this moment condition, Benjamini et al. (2013) suggested using ranks of distances, but this involves the selection of several tuning parameters, the choice of which is an open problem. The asymptotic properties of a test based on ranks of distances also need further investigation.

We propose using projection correlation to characterize dependence between $\mathbf{x}$ and $\mathbf{y}$. Projection correlation first projects the multivariate random vectors into a series of univariate random variables, then detects nonlinear dependence by calculating the Pearson correlation between the dichotomized univariate random variables. The projection correlation between $\mathbf{x}$ and $\mathbf{y}$, denoted by $\mathrm{PC}(\mathbf{x},\mathbf{y})$, is nonnegative and equals zero if and only if $\mathbf{x}$ and $\mathbf{y}$ are independent, so it is generally applicable as an index for measuring the degree of nonlinear dependence without moment conditions, normality or ellipticity (Tracz et al., 1992). The projection correlation test for independence is consistent against all dependence alternatives. The projection correlation is free of tuning parameters and is invariant to orthogonal transformation. We shall show that the sample estimator of projection correlation is $n$-consistent if $\mathbf{x}$ and $\mathbf{y}$ are independent and root-$n$-consistent otherwise. We conduct Monte Carlo studies to evaluate the finite-sample performance of the projection correlation test. The results indicate that the projection correlation is less sensitive to the dimensions of $\mathbf{x}$ and $\mathbf{y}$ than the distance correlation and even its improved version (Székely & Rizzo, 2013), and is more powerful than both the distance correlation and ranks of distances, especially when the dimensions of $\mathbf{x}$ and $\mathbf{y}$ are relatively large or the moment conditions required by the distance correlation are violated.

## Covariance of the sum of two random vectors

This is the situation. I have an estimate of the position $(x_t, y_t)$ of an object with covariance $\Sigma_p$, and an estimate of its speed $(v_x, v_y)$ with covariance $\Sigma_v$. Strictly speaking, the two estimates are correlated, since the speed is computed from the position estimate, but for simplicity we can assume they are independent.

Now, I want to estimate the covariance of the future position of the object, $(x_{t+dt}, y_{t+dt}) = (x_t, y_t) + (v_x, v_y) \cdot dt$, where $(v_x, v_y) \cdot dt$ can be represented with a RVE with covariance $\Sigma_1 = dt^2 \cdot \Sigma_v$. So I have to compute the covariance of the sum of two RVE (namely $X$ and $Y$), the first with covariance $\Sigma_p$, the second with covariance $\Sigma_1$.

$$
\begin{aligned}
\operatorname{cov}(X+Y) &= E\big((X+Y)(X+Y)^T\big) - E(X+Y)\,E(X+Y)^T \\
&= E\big((X+Y)(X^T+Y^T)\big) - E(X+Y)\,E(X^T+Y^T) \\
&= E(XX^T + YY^T + XY^T + YX^T) - \big(E(X)+E(Y)\big)\big(E(X^T)+E(Y^T)\big) \\
&= E(XX^T)-E(X)E(X^T) + E(YY^T)-E(Y)E(Y^T) + E(XY^T)-E(X)E(Y^T) + E(YX^T)-E(Y)E(X^T) \\
&= \operatorname{cov}(X) + \operatorname{cov}(Y) + E(XY^T)-E(X)E(Y^T) + E(YX^T)-E(Y)E(X^T)
\end{aligned}
$$

Is this correct? And can I rewrite the expression so that $\operatorname{cov}(X+Y)$ depends only on $\operatorname{cov}(X)$ and $\operatorname{cov}(Y)$, since I do not have the PDFs of $X$ and $Y$?
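Under the independence assumption stated in the question, the cross terms $E(XY^T)-E(X)E(Y^T)$ and $E(YX^T)-E(Y)E(X^T)$ are the cross-covariances of $X$ and $Y$ and vanish, which leaves $\Sigma_p + dt^2 \Sigma_v$ with no need for the PDFs. A minimal sketch of that computation follows; the covariance values are made up and the helper name `propagate_position_covariance` is hypothetical:

```python
def propagate_position_covariance(sigma_p, sigma_v, dt):
    """Return sigma_p + dt**2 * sigma_v for 2x2 covariances given as nested lists.

    Assumes the position and velocity estimates are independent, so the
    cross-covariance terms are zero.
    """
    return [
        [sigma_p[i][j] + dt ** 2 * sigma_v[i][j] for j in range(2)]
        for i in range(2)
    ]

# Illustrative (made-up) covariances for position and velocity.
sigma_p = [[0.5, 0.1],
           [0.1, 0.25]]
sigma_v = [[0.04, 0.0],
           [0.0, 0.09]]
dt = 2.0

sigma_future = propagate_position_covariance(sigma_p, sigma_v, dt)
print(sigma_future)
```

If the two estimates are in fact correlated, the cross-covariance terms would have to be estimated and added back in.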

## Probability with binomial distribution and random vectors

In a city, $20$% of men have blue eyes, $5$% have green eyes, $10$% have black eyes and the remaining $65$% have brown eyes. Susan decides to commute from the center of the city to a small town. To get to the town, she has to take two buses, which only carry people from the city. She decides to run a little experiment, which consists of drinking a beer with every man with blue or green eyes on each of the buses. Suppose every man agrees to drink a beer with her and that there are $10$ men on the first bus and $8$ men on the second.

1) Calculate the probability that she drinks fewer than $4$ beers in the first part of the trip.

2) Calculate the probability that she drinks more than $3$ beers in total (on the two buses).

My attempt at a solution using user137481's hint:

2) If I define $Y$ as the number of beers she drinks on the second bus, then $Y \sim Bi(8,\dfrac{1}{4})$, so $Z=X+Y \sim Bi(18,\dfrac{1}{4})$, where $X$ is the corresponding count for the first bus. The probability of Susan drinking more than 3 beers in total is $P(Z>3)=1-P(Z \leq 3)=1-\sum_{i=0}^{3} P(Z=i)=1-\sum_{i=0}^{3}\binom{18}{i}\left(\dfrac{1}{4}\right)^{i}\left(\dfrac{3}{4}\right)^{18-i}$

I would appreciate if someone could take a look at my solution and correct, if necessary, any mistakes. Thanks in advance.
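One way to check the numbers is to evaluate both binomial sums directly. The sketch below assumes, as the attempt does, that eye colors are independent across men, so that the counts are $Bi(10, \tfrac{1}{4})$ on the first bus and $Bi(18, \tfrac{1}{4})$ over both buses; the helper names are illustrative:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_cdf(k, n, p):
    """P(K <= k) for K ~ Binomial(n, p)."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

# Probability that a randomly chosen man has blue or green eyes.
p_beer = 0.20 + 0.05

# 1) Fewer than 4 beers on the first bus (10 men): P(X <= 3).
prob_first_bus = binomial_cdf(3, 10, p_beer)

# 2) More than 3 beers over both buses (18 men): 1 - P(Z <= 3).
prob_total = 1 - binomial_cdf(3, 18, p_beer)

print(f"P(X < 4) = {prob_first_bus:.4f}")
print(f"P(Z > 3) = {prob_total:.4f}")
```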

## Example 5-7

One ball is drawn randomly from a bowl containing four balls numbered 1, 2, 3, and 4. Define the following three events:

• Let $A$ be the event that a 1 or 2 is drawn. That is, $A = \{1, 2\}$.
• Let $B$ be the event that a 1 or 3 is drawn. That is, $B = \{1, 3\}$.
• Let $C$ be the event that a 1 or 4 is drawn. That is, $C = \{1, 4\}$.

Are events $A$, $B$, and $C$ pairwise independent? Are they mutually independent?

This example illustrates, as does the previous example, that pairwise independence among three events does not necessarily imply that the three events are mutually independent.
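The two questions can be checked by direct enumeration. The sketch below assumes each of the four balls is equally likely and uses exact fractions; the helper `prob` is illustrative:

```python
from fractions import Fraction
from itertools import combinations

# Sample space: four equally likely balls.
sample_space = {1, 2, 3, 4}
events = {"A": {1, 2}, "B": {1, 3}, "C": {1, 4}}

def prob(event):
    """Exact probability of an event under the uniform distribution."""
    return Fraction(len(event & sample_space), len(sample_space))

# Pairwise check: P(E ∩ F) == P(E) * P(F) for every pair of events.
pairwise_independent = all(
    prob(events[e] & events[f]) == prob(events[e]) * prob(events[f])
    for e, f in combinations(events, 2)
)

# Mutual independence additionally needs the triple product rule.
triple = events["A"] & events["B"] & events["C"]
triple_rule_holds = (
    prob(triple) == prob(events["A"]) * prob(events["B"]) * prob(events["C"])
)
mutually_independent = pairwise_independent and triple_rule_holds

print("pairwise independent:", pairwise_independent)
print("mutually independent:", mutually_independent)
```

Each pair intersects in the single outcome 1, with probability $1/4 = (1/2)(1/2)$, while the triple intersection also has probability $1/4$, which differs from $(1/2)^3 = 1/8$.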

## Examples

It might seem odd to train a random forest model on a dataset and then use that random forest as a kernel, instead of using the random forest directly. However, this can be useful in a number of circumstances.

For example, consider the MNIST digit recognition dataset. A large random forest can obtain acceptable accuracy (greater than 90%) if it is trained on all ten digit classes. In other words, if the random forest algorithm is allowed to see many examples of each digit class, it can learn to classify each of the ten digits. But what if the random forest algorithm is only allowed to see examples of 7s and 9s? Clearly, such a random forest would not be very useful on a dataset of 3s and 4s. However, using the random forest kernel, we can take advantage of a random forest trained on only 7s and 9s to help us understand the differences between 3s and 4s (this is often called transfer learning).

Consider two different PCA projections of the 3s and 4s from the MNIST dataset. The left-hand image shows the results of using Kernel PCA to find the two most significant (potentially non-linear) components of the 3s and 4s data using the random forest kernel. The right-hand image shows the two most significant (linear) components as determined by PCA.

For the most part, the components found using the random forest kernel provide a much better separation of the data. Remember, the random forest here only got to see 7s and 9s – it has never seen a 3 or a 4, yet the partitions learned by the random forest still carry semantic meaning about digits that can be transferred to 3s and 4s.

In fact, if one trains an SVM model to find a linear boundary on the first 5 principal components of the data, the accuracy for the random forest kernel assisted PCA + SVM is 94%, compared to the 87% of the linear PCA + SVM scheme.

The random forest kernel can be a quick and easy way to transfer knowledge from a random forest model to a related domain when techniques like deep belief nets are too expensive or simply overkill. The kernel can be evaluated quickly, and does not require special hardware or a significant amount of memory to compute. It isn’t frequently useful, but it is a nice trick to keep in your back pocket.

## Independent Events

Although typically we expect the conditional probability P(A | B) to be different from the probability P(A) of A, it does not have to be different from P(A). When P(A | B) = P(A), the occurrence of B has no effect on the likelihood of A. Whether or not the event A has occurred is independent of the event B.

Using algebra it can be shown that the equality P(A | B) = P(A) holds if and only if the equality P(A ∩ B) = P(A) · P(B) holds, which in turn is true if and only if P(B | A) = P(B). This is the basis for the following definition.
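The algebra behind this chain of equivalences is short. A sketch, assuming $P(A) > 0$ and $P(B) > 0$ so that both conditional probabilities are defined:

$$
P(A \mid B) = P(A)
\iff \frac{P(A \cap B)}{P(B)} = P(A)
\iff P(A \cap B) = P(A)\,P(B)
\iff \frac{P(A \cap B)}{P(A)} = P(B)
\iff P(B \mid A) = P(B).
$$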

### Definition

Events A and B are independent if P(A ∩ B) = P(A) · P(B).

If A and B are not independent then they are dependent.

The formula in the definition has two practical but exactly opposite uses:

1. In a situation in which we can compute all three probabilities P(A), P(B), and P(A ∩ B), it is used to check whether or not the events A and B are independent:

   • If P(A ∩ B) = P(A) · P(B), then A and B are independent.
   • If P(A ∩ B) ≠ P(A) · P(B), then A and B are not independent.

2. In a situation in which P(A) and P(B) can each be computed and A and B are known to be independent, it is used to compute P(A ∩ B) = P(A) · P(B).

### Example 23

A single fair die is rolled. Let A = {3} and B = {1, 3, 5}. Are A and B independent?

In this example we can compute all three probabilities P(A) = 1 ∕ 6, P(B) = 1 ∕ 2, and P(A ∩ B) = P({3}) = 1 ∕ 6. Since the product P(A) · P(B) = (1 ∕ 6)(1 ∕ 2) = 1 ∕ 12 is not the same number as P(A ∩ B) = 1 ∕ 6, the events A and B are not independent.
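The same check can be written as a small function. In the sketch below the helper names are illustrative, and the second pair of events is a made-up contrast case, not part of the example, for which the product rule does hold:

```python
from fractions import Fraction

# Sample space of a single fair die.
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Exact probability of an event for a fair die."""
    return Fraction(len(event & sample_space), len(sample_space))

def are_independent(event_a, event_b):
    """True when P(A ∩ B) equals P(A) * P(B)."""
    return prob(event_a & event_b) == prob(event_a) * prob(event_b)

# Events from the example: P(A ∩ B) = 1/6 but P(A) * P(B) = 1/12.
A = {3}
B = {1, 3, 5}
print(prob(A & B), prob(A) * prob(B), are_independent(A, B))

# Hypothetical contrast case: P({2}) = 1/6 equals (1/3) * (1/2).
G = {1, 2}
H = {2, 4, 6}
print(prob(G & H), prob(G) * prob(H), are_independent(G, H))
```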

### Example 24

The two-way classification of married or previously married adults under 40 according to gender and age at first marriage in Note 3.48 "Example 21" produced the table

Determine whether or not the events F: “female” and E: “was a teenager at first marriage” are independent.

The table shows that in the sample of 902 such adults, 452 were female, 125 were teenagers at their first marriage, and 82 were females who were teenagers at their first marriage, so that

P(F) = 452 ∕ 902, P(E) = 125 ∕ 902, P(F ∩ E) = 82 ∕ 902

Since P(F) · P(E) = (452 ∕ 902) · (125 ∕ 902) = 0.069 is not the same as P(F ∩ E) = 82 ∕ 902 = 0.091, we conclude that the two events are not independent.

### Example 25

Many diagnostic tests for detecting diseases do not test for the disease directly but for a chemical or biological product of the disease, hence are not perfectly reliable. The sensitivity of a test is the probability that the test will be positive when administered to a person who has the disease. The higher the sensitivity, the greater the detection rate and the lower the false negative rate.

Suppose the sensitivity of a diagnostic procedure to test whether a person has a particular disease is 92%. A person who actually has the disease is tested for it using this procedure by two independent laboratories.

1. What is the probability that both test results will be positive?
2. What is the probability that at least one of the two test results will be positive?

Let A1 denote the event “the test by the first laboratory is positive” and let A2 denote the event “the test by the second laboratory is positive.” Since A1 and A2 are independent,

P(A1 ∩ A2) = P(A1) · P(A2) = 0.92 × 0.92 = 0.8464

Using the Additive Rule for Probability and the probability just computed,

P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2) = 0.92 + 0.92 − 0.8464 = 0.9936

### Example 26

The specificity of a diagnostic test for a disease is the probability that the test will be negative when administered to a person who does not have the disease. The higher the specificity, the lower the false positive rate.

Suppose the specificity of a diagnostic procedure to test whether a person has a particular disease is 89%.

1. A person who does not have the disease is tested for it using this procedure. What is the probability that the test result will be positive?
2. A person who does not have the disease is tested for it by two independent laboratories using this procedure. What is the probability that both test results will be positive?

Let B denote the event “the test result is positive.” The complement of B is that the test result is negative, and its probability is the specificity of the test, 0.89. Thus

P(B) = 1 − P(Bᶜ) = 1 − 0.89 = 0.11

Let B1 denote the event “the test by the first laboratory is positive” and let B2 denote the event “the test by the second laboratory is positive.” Since B1 and B2 are independent, by part 1 of the example

P(B1 ∩ B2) = P(B1) · P(B2) = 0.11 × 0.11 = 0.0121

The concept of independence applies to any number of events. For example, three events A, B, and C are independent if P(A ∩ B ∩ C) = P(A) · P(B) · P(C). Note carefully that, as is the case with just two events, this is not a formula that is always valid, but holds precisely when the events in question are independent.

### Example 27

The reliability of a system can be enhanced by redundancy, which means building two or more independent devices to do the same job, such as two independent braking systems in an automobile.

Suppose a particular species of trained dogs has a 90% chance of detecting contraband in airline luggage. If the luggage is checked three times by three different dogs independently of one another, what is the probability that contraband will be detected?

Let D1 denote the event that the contraband is detected by the first dog, D2 the event that it is detected by the second dog, and D3 the event that it is detected by the third. Since each dog has a 90% chance of detecting the contraband, by the Probability Rule for Complements it has a 10% chance of failing. In symbols, P(D1ᶜ) = 0.10, P(D2ᶜ) = 0.10, and P(D3ᶜ) = 0.10.

Let D denote the event that the contraband is detected. We seek P(D). It is easier to find P(Dᶜ), because although there are several ways for the contraband to be detected, there is only one way for it to go undetected: all three dogs must fail. Thus Dᶜ = D1ᶜ ∩ D2ᶜ ∩ D3ᶜ, and

P(D) = 1 − P(Dᶜ) = 1 − P(D1ᶜ ∩ D2ᶜ ∩ D3ᶜ)

But the events D1, D2, and D3 are independent, which implies that their complements are independent, so

P(D1ᶜ ∩ D2ᶜ ∩ D3ᶜ) = P(D1ᶜ) · P(D2ᶜ) · P(D3ᶜ) = 0.10 × 0.10 × 0.10 = 0.001

Using this number in the previous display we obtain

P(D) = 1 − 0.001 = 0.999

That is, although any one dog has only a 90% chance of detecting the contraband, three dogs working independently have a 99.9% chance of detecting it.
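The same complement argument works for any number of independent checks. A minimal sketch, with a hypothetical helper `detection_probability` that assumes every check succeeds with the same probability and that the checks are independent:

```python
def detection_probability(p_single, n_checks):
    """Probability that at least one of n independent checks succeeds."""
    p_all_fail = (1 - p_single) ** n_checks
    return 1 - p_all_fail

# Three dogs, each with a 90% chance of detecting the contraband.
p_three_dogs = detection_probability(0.90, 3)
print(f"{p_three_dogs:.3f}")

# Detection probability as more independent checks are added.
for n in range(1, 5):
    print(n, round(detection_probability(0.90, n), 6))
```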

## Subspace Spanned By Cosine and Sine Functions

Let $\mathcal{F}[0, 2\pi]$ be the vector space of all real valued functions defined on the interval $[0, 2\pi]$.
Define the map $f:\mathbb{R}^2 \to \mathcal{F}[0, 2\pi]$ by
$$\left(\, f\left(\, \begin{bmatrix} \alpha \\ \beta \end{bmatrix} \,\right) \,\right)(x) := \alpha \cos x + \beta \sin x.$$
We put
$$V := \operatorname{im} f = \{\alpha \cos x + \beta \sin x \in \mathcal{F}[0, 2\pi] \mid \alpha, \beta \in \mathbb{R}\}.$$

(a) Prove that the map $f$ is a linear transformation.

(b) Prove that the set $\{\cos x, \sin x\}$ is a basis of the vector space $V$.

(c) Prove that the kernel is trivial, that is, $\ker f = \{\mathbf{0}\}$. (This yields an isomorphism of $\mathbb{R}^2$ and $V$.)

(d) Define a map $g:V \to V$ by
$$g(\alpha \cos x + \beta \sin x) := \frac{d}{dx}(\alpha \cos x + \beta \sin x) = \beta \cos x - \alpha \sin x.$$
Prove that the map $g$ is a linear transformation.

(e) Find the matrix representation of the linear transformation $g$ with respect to the basis $\{\cos x, \sin x\}$.