A New Coefficient of Correlation. What if you were told there exists a… | by Tim Sumner | Mar, 2024

Before introducing the formula, some preparation is needed. As we said earlier, correlation can be thought of as a way of measuring the relationship between two variables. Say we are measuring the correlation between X and Y. A linear relationship, if one exists, is mutually shared: the correlation between X and Y always equals the correlation between Y and X. With this new approach, however, we no longer measure the linear relationship between X and Y; instead, we aim to measure how much Y is a function of X. Keeping this subtle but important difference from traditional correlation techniques in mind will make the formulas much easier to understand, because in general ξ(X,Y) need not equal ξ(Y,X).

Sticking with the same train of thought, suppose we want to measure how much Y is a function of X. Each data point is an ordered pair (X, Y). First, we sort the data as (X₍₁₎,Y₍₁₎),…,(X₍ₙ₎,Y₍ₙ₎) so that X₍₁₎ ≤ X₍₂₎ ≤ ⋯ ≤ X₍ₙ₎. In other words, we sort the pairs by X. We can then define the variables r₁, r₂, …, rₙ, where rᵢ is the rank of Y₍ᵢ₎, that is, the number of j such that Y₍ⱼ₎ ≤ Y₍ᵢ₎. With these ranks identified, we are ready to calculate.
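The sorting and ranking step can be sketched in a few lines of Python. This is only an illustration under the assumption that there are no ties in X or Y; the function name `ranks_after_sorting_by_x` and the sample numbers are made up for this example.

```python
def ranks_after_sorting_by_x(x, y):
    """Return r_1..r_n: the ranks of Y once the pairs are ordered by X."""
    # Order the (X, Y) pairs by their X value.
    pairs = sorted(zip(x, y), key=lambda p: p[0])
    y_sorted = [p[1] for p in pairs]

    # Rank 1 goes to the smallest Y, rank n to the largest.
    # Assumption: no ties in Y, so every rank is used exactly once.
    by_value = sorted(range(len(y_sorted)), key=lambda i: y_sorted[i])
    r = [0] * len(y_sorted)
    for rank, i in enumerate(by_value, start=1):
        r[i] = rank
    return r


# Hypothetical data: after sorting by X, the Y values read 1, 7, 5, 3.
example_ranks = ranks_after_sorting_by_x([0.3, 0.1, 0.4, 0.2], [5.0, 1.0, 3.0, 7.0])
print(example_ranks)  # [1, 4, 3, 2]
```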

There are two formulas, and which one applies depends on the type of data you are working with. If ties in your data are impossible (or extremely unlikely), we have

$$\xi_n(X,Y) = 1 - \frac{3\sum_{i=1}^{n-1} |r_{i+1} - r_i|}{n^2 - 1}$$

and if ties are allowed, we have

$$\xi_n(X,Y) = 1 - \frac{n\sum_{i=1}^{n-1} |r_{i+1} - r_i|}{2\sum_{i=1}^{n} l_i (n - l_i)}$$

where lᵢ is defined as the number of j such that Y₍ⱼ₎ ≥ Y₍ᵢ₎. One last important note for when ties are allowed: in addition to using the second formula, to obtain the best estimate possible it is important to break the observed ties at random, so that one tied value is ranked above the other and (rᵢ₊₁ − rᵢ) is never equal to zero, just as before. The variable lᵢ is then simply the number of observations that are greater than or equal to Y₍ᵢ₎.
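As a rough sketch, the no-ties formula, 1 − 3·Σ|rᵢ₊₁ − rᵢ| / (n² − 1), can be computed as follows. The function name `xi_no_ties` and the sample data are assumptions made for illustration, and the tied case is deliberately left out. The second example uses y = x², chosen so that no two values coincide, to show that swapping the arguments changes the result.

```python
def xi_no_ties(x, y):
    """Sketch of the no-ties coefficient; assumes no ties in x or y."""
    n = len(x)
    # Sort the pairs by X and keep the Y values in that order.
    order = sorted(range(n), key=lambda i: x[i])
    y_sorted = [y[i] for i in order]

    # r[i] is the rank (1 = smallest) of the i-th Y in X-sorted order.
    by_value = sorted(range(n), key=lambda i: y_sorted[i])
    r = [0] * n
    for rank, i in enumerate(by_value, start=1):
        r[i] = rank

    # Sum of absolute differences between consecutive ranks.
    total = sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    return 1 - 3 * total / (n * n - 1)


# A perfectly increasing relationship with n = 5 points.
print(xi_no_ties([1, 2, 3, 4, 5], [3, 5, 7, 9, 11]))  # 0.5

# y = x**2 on a grid offset by 0.25 so that no two y values are equal.
xs = [i + 0.25 for i in range(-10, 11)]
ys = [v * v for v in xs]
xi_xy = xi_no_ties(xs, ys)  # how much Y is a function of X
xi_yx = xi_no_ties(ys, xs)  # how much X is a function of Y
print(xi_xy > xi_yx)  # True
```

Note that even for a perfect monotone relationship the no-ties formula gives (n − 2)/(n + 1) rather than exactly 1, which is why the first example prints 0.5 for n = 5; the value approaches 1 as n grows.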

Without diving too much deeper into theory, it is worth briefly pointing out that this new correlation comes with some nice asymptotic theory that makes hypothesis testing easy without any assumptions about the underlying distributions. This is because the method depends only on the ranks of the data, not on the values themselves, which makes it a nonparametric statistic. If X and Y are independent and Y is continuous, then

$$\sqrt{n}\,\xi_n(X,Y) \xrightarrow{d} N(0, 2/5) \quad \text{as } n \to \infty$$

What this means is that if your sample size is large enough, √n·ξₙ approximately follows a normal distribution with mean 0 and variance 2/5. This is useful if you would like to test whether the two variables are independent.
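To make the testing idea concrete, here is a minimal sketch that turns an observed coefficient into an approximate p-value. It relies on the large-sample result that, under independence with continuous Y, √n·ξₙ is approximately N(0, 2/5). The function name `xi_independence_pvalue` is hypothetical, and the choice of a one-sided test (only large positive values count as evidence of dependence) is an assumption of this example.

```python
from math import sqrt
from statistics import NormalDist


def xi_independence_pvalue(xi, n):
    """Approximate one-sided p-value for H0: X and Y are independent.

    Uses the large-sample approximation sqrt(n) * xi ~ N(0, 2/5),
    so it should only be trusted when n is reasonably large.
    """
    # Standardize: divide by the asymptotic standard deviation sqrt(2/5).
    z = sqrt(n) * xi / sqrt(2 / 5)
    # Probability of seeing a value at least this large under H0.
    return 1.0 - NormalDist().cdf(z)


# Hypothetical inputs: a coefficient of 0.3 observed on 200 points.
p_value = xi_independence_pvalue(0.3, 200)
print(p_value < 0.05)  # True
```

A small p-value suggests rejecting independence; a coefficient near zero gives a p-value near 0.5 under this one-sided convention.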
