PhD Regression & Correlation Basics

Unit-1 Correlation Analysis


  • Unit 1: Correlation analysis
  • Simple and multiple linear regression
  • Correlation → Correlation analysis is a statistical method used in research to measure the strength of the linear relationship between two variables and to quantify their association
  • Descriptive statistics → measures of central tendency (mean, median, mode) and measures of dispersion (range, standard deviation) are commonly used in the analysis of univariate data
  • Methods of studying correlation
    • Scatter Diagram
    • Karl Pearson's coefficient of correlation (covariance method)
    • Two-way frequency table
    • Spearman's rank correlation coefficient (Rank Method)
    • Concurrent deviation method
  • Scatter diagram → the graphical representation of the pairs of data (X, Y) in an orthogonal coordinate system (a graph that displays the relationship between two variables)
  • If, for increasing values x of the variable X, there is a definite displacement of the values y of the variable Y, we say that there is a correlation
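A scatter diagram takes only a few lines to draw; here is a minimal Python sketch, with made-up (X, Y) values used purely for illustration:

```python
# Minimal sketch of a scatter diagram; the (X, Y) values are illustrative only.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.2]

plt.scatter(x, y)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Scatter diagram of (X, Y) pairs")
plt.show()
```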
  • Simple correlation coefficient (r)
    • It is also called Pearson's correlation or the product-moment correlation coefficient.
    • It measures the nature and strength of the relationship between two quantitative variables (a measure of the linear association between two variables).
  • Assumptions for Karl Pearson correlation analysis
    • Variables are related to each other, i.e. not independent
    • Data are measured on an interval or ratio scale
    • A linear relationship exists between the variables
    • Variables are normally distributed; absence of outliers
    • Both variables must have the same number of observations
  • In statistics, particularly in regression analysis, homoscedasticity refers to the assumption that the variance of the error terms (residuals) is constant across all levels of the independent variable(s). This means the "noise" or random disturbance in the relationship between the independent and dependent variables is uniform.
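One common informal check for homoscedasticity is to plot residuals against fitted values and look for a roughly constant vertical spread. A minimal sketch, using simulated data with homoscedastic errors:

```python
# Minimal sketch of an informal homoscedasticity check: plot residuals against
# fitted values; roughly constant vertical spread suggests constant error
# variance. Data are simulated with homoscedastic errors for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, 100)

slope, intercept = np.polyfit(x, y, 1)     # simple OLS line
fitted = intercept + slope * x
residuals = y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted values")
plt.show()
```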
  • r indicates:
    • strength of the relationship (strong, weak, or none)
    • nature/direction of the relationship
      • positive (direct): variables move in the same direction
      • negative (inverse): variables move in opposite directions
    • r ranges in value from -1.0 to +1.0
  • r = +1 is a perfect positive linear correlation
  • r = 0 is no linear correlation
  • r = -1 is a perfect negative linear correlation
  • Example
  • If r = 0.694, we may conclude that there is a moderate positive relationship between X and Y
  • Population correlation coefficient: $$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y}$$
  • Sample correlation coefficient: $$r = \frac{\operatorname{cov}(x, y)}{s_x s_y}$$
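A minimal sketch (with illustrative data) computing the sample r directly by the covariance formula above, cross-checked against SciPy's built-in `scipy.stats.pearsonr`:

```python
# Minimal sketch: sample r = cov(x, y) / (s_x * s_y), cross-checked with SciPy.
import numpy as np
from scipy import stats

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.2])

cov_xy = np.cov(x, y, ddof=1)[0, 1]                    # sample covariance
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))   # r = cov(x, y) / (s_x * s_y)

r_scipy, p_value = stats.pearsonr(x, y)
print(f"r (formula) = {r:.4f}, r (scipy) = {r_scipy:.4f}, p = {p_value:.3g}")
```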
  • Properties of Karl Pearson correlation coefficient
    • The correlation coefficient is a pure number, i.e. it has no unit
    • The range of the correlation coefficient is $-1 \le r \le +1$
    • Correlation between two variables is known as simple correlation or zero-order correlation
    • The correlation coefficient is independent of change of origin and scale
    • If $r_{xy} = 0$, then X and Y are not linearly related (there may be a curvilinear relationship)
    • If X and Y are independent variables, then $r_{xy} = 0$, but the converse may or may not be true
    • The correlation coefficient between X and Y is the geometric mean of the two regression coefficients $b_{yx}$ and $b_{xy}$, i.e. $r = \pm\sqrt{b_{yx} \cdot b_{xy}}$ (verified numerically in the sketch after this list)
    • The signs of the regression coefficients and the correlation coefficient are always the same (both regression coefficients have the same sign)
    • The arithmetic mean of the two regression coefficients $b_{yx}$ and $b_{xy}$ is greater than or equal to the correlation coefficient between X and Y, i.e. $\frac{b_{yx} + b_{xy}}{2} \ge r$
    • The correlation coefficient is symmetric, i.e. $r_{xy} = r_{yx}$
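Two of these properties are easy to verify numerically; a minimal sketch with simulated data (the coefficients 0.8 and 0.5 are arbitrary):

```python
# Minimal sketch (simulated data) verifying two listed properties:
# r_xy = r_yx (symmetry), and r is the sign-matched geometric mean of
# the two regression coefficients b_yx and b_xy.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 200)
y = 0.8 * x + rng.normal(0, 0.5, 200)

cov_xy = np.cov(x, y, ddof=1)[0, 1]
b_yx = cov_xy / np.var(x, ddof=1)     # regression coefficient of Y on X
b_xy = cov_xy / np.var(y, ddof=1)     # regression coefficient of X on Y

r_xy = np.corrcoef(x, y)[0, 1]
print(np.isclose(r_xy, np.sign(b_yx) * np.sqrt(b_yx * b_xy)))  # True
print(np.isclose(r_xy, np.corrcoef(y, x)[0, 1]))               # True (symmetry)
```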
  • Limitations of the Karl Pearson correlation coefficient
    • Linearity
      • cannot describe non-linear relationships
      • e.g., the relation between anxiety and performance
    • Truncation of range
      • underestimates the strength of a relationship if the full range of X values is not observed
    • No proof of causation
      • Third-variable problem: a third variable could be causing the change in both variables
      • Directionality: we cannot be sure which way causality "flows"
  • Types of Correlation (on the basis of direction of change)
    • Positive correlation
    • Negative correlation
    • Perfectly positive correlation
    • Perfectly negative correlation
    • Zero correlation
  • Types of correlation (on the basis of number of variables)
    • Simple correlation (only 2 variables)
    • Partial correlation (the relationship between two variables is studied while the others are held constant)
    • Multiple correlation (More than 2 variables)
    • Correlation can be simple (two variables) or multiple (three or more variables).
  • Types of correlation (on the basis of proportion)
    • Linear correlation (the amount of change is in a constant ratio)
    • Non- linear correlation
  • Simple correlation coefficient (r): correlation between two variables
  • Multiple correlation coefficient (R): correlation between more than two variables
  • Least squares fit: properties and examples
  • Polynomial regression: use of orthogonal polynomials (see the sketch below)
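A minimal sketch of least-squares polynomial fitting with simulated data. NumPy's `Chebyshev.fit` works in an orthogonal polynomial basis, which illustrates the numerical motivation for orthogonal polynomials: it is better conditioned than fitting raw powers of x, yet recovers the same polynomial. The true coefficients (1, 2, -3) and noise level are arbitrary choices:

```python
# Minimal sketch: least-squares polynomial fit in a power basis and in an
# orthogonal (Chebyshev) basis; both recover the same underlying polynomial.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 50)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(0, 0.1, 50)

raw_fit = np.polynomial.Polynomial.fit(x, y, deg=2)   # power basis
cheb_fit = np.polynomial.Chebyshev.fit(x, y, deg=2)   # orthogonal Chebyshev basis

print(raw_fit.convert().coef)                                  # approx [1, 2, -3]
print(cheb_fit.convert(kind=np.polynomial.Polynomial).coef)    # same coefficients
```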
  • Spearman's correlation coefficient
    • is a statistical measure of the strength of the relationship between paired data; it is denoted by $r_s$, with $-1 \le r_s \le +1$
    • its interpretation is similar to that of Pearson's: the closer $r_s$ is to $\pm 1$, the stronger the monotonic relationship
    • a non-parametric measure of the monotonic relationship between two variables (where, as one variable increases, the other either consistently increases or consistently decreases, but not necessarily at a constant rate)
  • without any ties
    • $\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$
    • $\rho$ = Spearman's rank correlation coefficient
    • $d_i$ = difference between the two ranks of each observation
    • $n$ = number of observations
  • with ties, where $m$ is the number of tied observations in a group (summed over all tie groups):
    • $$r_s = 1 - \frac{6 \left[ \sum d_i^2 + \sum \frac{m(m^2 - 1)}{12} \right]}{n(n^2 - 1)}$$
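A minimal sketch (with illustrative, tie-free data) computing $r_s$ by the no-ties formula above and cross-checking it with SciPy's `scipy.stats.spearmanr`, which also handles ties:

```python
# Minimal sketch: Spearman's r_s by the no-ties formula, checked with SciPy.
import numpy as np
from scipy import stats

x = np.array([35, 23, 47, 17, 10, 43, 9, 6, 28])
y = np.array([30, 33, 45, 23, 8, 49, 12, 4, 31])

d = stats.rankdata(x) - stats.rankdata(y)   # differences between paired ranks
n = len(x)
rs_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

rs_scipy, p_value = stats.spearmanr(x, y)
print(f"r_s (formula) = {rs_formula:.4f}, r_s (scipy) = {rs_scipy:.4f}")
```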
  • An $r_s$ of +1 indicates a perfect positive association of ranks, an $r_s$ of 0 indicates no association between ranks, and an $r_s$ of -1 indicates a perfect negative association of ranks
  • the closer $r_s$ is to zero, the weaker the association between the ranks
  • The calculation of Spearman's correlation coefficient, and subsequent significance testing of it, requires the following data assumptions:
    • Data measured at the ordinal, interval, or ratio level
    • Variables are monotonically related
    • Absence of outliers
    • Both variables must have the same number of observations
  • Unlike Pearson's correlation, there is no requirement of normality; hence Spearman's coefficient is a non-parametric statistic

Unit-2 Regression Diagnostics


  • Autocorrelation
    • Correlation is the degree of similarity between two different variables
    • Correlation of the same variable at two different times, $Y_t$ and $Y_{t-k}$, is called autocorrelation; autocorrelation is also called serial correlation
    • Autocorrelation is the correlation of a time series with a lagged version of itself
  • $$r_k = \frac{\sum_{t=k+1}^{n} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{n} (y_t - \bar{y})^2}$$
  • where k is the time gap being considered and is called the lag. Autocorrelation of lag 1 (i.e. k=1) is the correlation between values that are one time period apart
  • In regression analysis, one of the assumptions is that the error terms $e_t$ have mean 0 and constant variance $\sigma^2_e$
  • Since the error terms have mean zero (i.e. $\bar{e} = 0$), the autocorrelation formula for residuals reduces to
  • $$r_k = \frac{\sum_{t=k+1}^{n} e_t e_{t-k}}{\sum_{t=1}^{n} e_t^2}$$
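A minimal sketch of the residual autocorrelation formula above, applied to a simulated mean-zero AR(1) series (rho = 0.6, an arbitrary choice) standing in for residuals:

```python
# Minimal sketch: lag-k autocorrelation of a mean-zero series by the formula above.
import numpy as np

def autocorr(e, k):
    """r_k = sum_{t=k+1}^{n} e_t * e_{t-k} / sum_{t=1}^{n} e_t^2 (mean-zero e)."""
    e = np.asarray(e, dtype=float)
    return np.sum(e[k:] * e[:-k]) / np.sum(e ** 2)

rng = np.random.default_rng(3)
n = 500
u = rng.normal(0, 1, n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + u[t]   # AR(1): theoretical lag-k autocorrelation is 0.6**k

print(autocorr(e, 1))              # roughly 0.6
print(autocorr(e, 2))              # roughly 0.36
```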
  • Reasons for autocorrelation
    • Specification bias
      • some independent variables are not included
      • incorrect functional form (the true relationship may be quadratic or otherwise non-linear instead of linear)
    • An independent variable is a lagged form of Y
    • Data may have been manipulated, i.e. interpolated or extrapolated
  • Consequences of Autocorrelation
    • The usual interpretation of R-squared may not hold
    • Confidence intervals for the regression coefficients are too narrow
    • The usual t-ratio and F-ratio tests give misleading results
    • Predicted y values may have large variances
  • Correlogram → A correlogram is a plot of the autocorrelation values (y-axis) against the lag (x-axis); a plotting sketch follows the list of tests below
  • Tests for autocorrelation
    • A plot of residuals $e_t$ against $t$
    • Durbin-Watson test
    • A Lagrange Multiplier test
    • Ljung-Box test
    • A correlogram
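A minimal sketch of a correlogram using statsmodels' `plot_acf` (lag on the x-axis, autocorrelation on the y-axis, with confidence bands). The AR(1) series is simulated for illustration:

```python
# Minimal sketch of a correlogram via statsmodels' plot_acf.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

rng = np.random.default_rng(4)
n = 300
u = rng.normal(0, 1, n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + u[t]   # AR(1) series: autocorrelation decays geometrically

plot_acf(y, lags=20)
plt.show()
```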
  • What is the Durbin-Watson test?
    • The Durbin-Watson test is a measure of autocorrelation (also called serial correlation) of the residuals from a regression analysis. It looks for a specific type of serial correlation, the AR(1) process.
    • The Durbin-Watson test detects autocorrelation in the residuals of a regression model.
  • If $e_t$ is the residual at time period $t$, regressed on the residual at time $t-1$:
  • $$e_t = \rho e_{t-1} + u_t$$
  • where $u_t$ is the error term of the above model
  • First-order autocorrelation for residuals:
  • $$r = \frac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=1}^{n} e_t^2}$$
  • The hypotheses for the Durbin-Watson test are:
  • $H_0$: no first-order autocorrelation, i.e. $\rho = 0$
  • $H_1$: first-order autocorrelation exists, i.e. $\rho \neq 0$
  • The test statistic is calculated as:
  • $$DW= \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$
  • where $e_t$ are the residuals from an ordinary least squares regression (a worked sketch follows the interpretation list below)
  • The Durbin-Watson test reports a test statistic with a value from 0 to 4, where:
    • a value near 2 indicates no autocorrelation
    • a value from 0 toward 2 suggests positive autocorrelation (common in time-series data)
    • a value from 2 toward 4 indicates negative autocorrelation (less common in time-series data)
    • the test is commonly used in time-series analysis
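A minimal sketch computing the Durbin-Watson statistic from OLS residuals, both by the formula above and with statsmodels' built-in `durbin_watson`. The data are simulated with AR(1) errors (rho = 0.6, an arbitrary choice), so positive autocorrelation and a DW value well below 2 are expected:

```python
# Minimal sketch: Durbin-Watson statistic from OLS residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
n = 200
x = np.arange(n, dtype=float)
u = rng.normal(0, 1, n)
err = np.zeros(n)
for t in range(1, n):
    err[t] = 0.6 * err[t - 1] + u[t]            # AR(1) error process

y = 1.0 + 0.5 * x + err
e = sm.OLS(y, sm.add_constant(x)).fit().resid   # OLS residuals

dw_by_formula = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(dw_by_formula, durbin_watson(e))          # both well below 2
```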
