Discussion:
problem with logistic regression
xz
2007-09-26 20:34:25 UTC
Permalink
I have three vectors X1, X2, Y and want to find out the possible
dependence of Y on X1 and X2.
I am using logistic regression. However, I am confused by the
following results:

If I fit the logistic regression model for X1 and Y alone, or for X2
and Y alone, the results show that X1 and Y are clearly correlated,
and so are X2 and Y.

fitting model: logit(Y) = a + b1*X1
-------------------------------------------------
Overall Model Fit...
Chi Square= 15.4002; df=1; p= 0.0001

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 3.3964 1.1342 0.0027
Intercept -0.9985
-------------------------------------------------

fitting model: logit(Y) = a + b2*X2
-------------------------------------------------
Overall Model Fit...
Chi Square= 7.7710; df=1; p= 0.0053

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 2.1972 0.8819 0.0127
Intercept -0.6931
-------------------------------------------------

However, if I fit the model with all three variables together, with Y
as the dependent variable and X1 and X2 as independent variables, the
p-values are much higher, especially for X2 (0.3288). So can I still
say Y is correlated with X2? How come Y is obviously correlated with
X2 when fitted separately, but not when all the variables are fitted
together?

fitting model: logit(Y) = a + b1*X1 + b2*X2
-------------------------------------------------
Overall Model Fit...
Chi Square= 16.3307; df=2; p= 0.0003

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 2.9453 1.1946 0.0137
2 1.0542 1.0794 0.3288
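
A minimal sketch of the same three fits in Python with statsmodels (an
assumed toolkit; the output above came from a different, unnamed
program). The data below is simulated, so the numbers will not match,
but it will typically show the same qualitative pattern:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 60
x1 = rng.integers(0, 2, n)                         # binary predictor
x2 = (x1 + rng.integers(0, 2, n) > 0).astype(int)  # overlaps with x1
p = 1 / (1 + np.exp(-(-1.0 + 3.0 * x1)))           # true model: x1 only
y = rng.binomial(1, p)

for cols in ([x1], [x2], [x1, x2]):
    X = sm.add_constant(np.column_stack(cols))
    fit = sm.Logit(y, X).fit(disp=0)
    print(fit.params.round(3), fit.pvalues.round(4))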
Stratocaster
2007-09-27 00:06:21 UTC
Permalink
My response is probably not as technical as you would like, but I have
encountered similar occurrences when dealing with multiple regression
models.

These explanatory variables (X1, X2) seem to exhibit
"multicollinearity". I am certain the same (or a very similar) term
applies to logistic regression. Long story short (because I don't know
the full story), X1 and X2 provide the model with similar information,
and one usually becomes dominant when both are included. You could
check this by evaluating the correlation between X1 and X2; it is
probably substantial. A sketch of that check follows.
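
A minimal sketch of that check in Python (the 0/1 vectors below are
hypothetical stand-ins for the real X1 and X2):

import numpy as np

x1 = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
x2 = np.array([0, 1, 1, 1, 1, 0, 1, 0, 0, 1])

r = np.corrcoef(x1, x2)[0, 1]   # Pearson (phi) correlation for 0/1 data
print(f"corr(X1, X2) = {r:.3f}")

# a 2x2 cross-tabulation shows the overlap directly
for a in (0, 1):
    print([int(np.sum((x1 == a) & (x2 == b))) for b in (0, 1)])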
Post by xz
I have three vectors X1, X2, Y and want to find out the possible
dependence of Y on X1 and X2.
[rest of the original post and model output snipped]
Richard Ulrich
2007-09-27 04:14:52 UTC
Permalink
[rearranging a top-posted reply]

- in response to -
Post by Stratocaster
Post by xz
I have three vectors X1, X2, Y and want to find out the possible
dependence of Y on X1 and X2.
Logistic Regression is adopted.
If I fit the Logitstic Regression model for X1 and Y, or for X2 and Y,
the results show that X1 and Y are clearly correlated, so are X2 and
Y.
- There was the reply -
Post by Stratocaster
My response is probably not as technical as you would like, but I have
encountered similar occurrences when dealing with multiple regression
models.
These explanatory variables (X1, X2) seem to exhibit
"multicollinearity". I am certain the same (or a very similar) term
applies to logistic regression. Long story short (because I don't know
the full story), X1 and X2 provide the model with similar information,
and one usually becomes dominant when both are included. You could
check this by evaluating the correlation between X1 and X2; it is
probably substantial.
Right. Though I would call this more a case of ordinary overlap than
of any notable degree of 'multicollinearity'. Some people prefer to
reserve the collinearity terms for 100% redundancy, or something
verging on it. But there is certainly a noticeable correlation between
X1 and X2.

In multiple regression, the coefficients are sometimes
called "partial regression coefficients", in recognition
of the fact that each variable contributes what it can
while "partialling out" the other variables.
--
Rich Ulrich, ***@pitt.edu
http://www.pitt.edu/~wpilib/index.html
s***@yahoo.com
2007-09-27 14:10:39 UTC
Permalink
Just adding my comment.

Multicollinearity is a UNIVERSAL issue. ALL regression models,
including logistic regression, Poisson regression, the general linear
model, the generalized linear model, ANOVA, DOE, and so on, suffer
from multicollinearity if the predictor variables are highly
correlated.

The least-squares estimator for the model

y = Xb

is

b = inv(X'X) X'y

Notice that when the columns of X are standardized, X'X/(n-1) is
exactly the correlation matrix (the factor n-1 is just a constant), so
X'X carries the same information as the correlations among the
predictors.

The computation of the inverse matrix becomes unstable if the
variables are highly correlated (the matrix is then near-singular,
ill-conditioned, or rank-deficient). Multicollinearity is a matter of
degree: complete, strong (near), or weak.

Remedies include combining the two input variables, dropping one of
them, or collecting many more samples. Alternatively, apply a
decomposition method such as singular value decomposition or QR
decomposition. By the way, never use the "inv" command in Matlab; use
the "backslash" operator instead, which solves the system via a
suitable factorization rather than forming an explicit inverse.

In summary, whenever the computation of an inverse matrix is involved,
near-singularity should be checked, for example as follows.
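
A minimal numpy sketch of this point, on hypothetical near-collinear
data (numpy here stands in for the MATLAB commands named above):

import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)       # nearly collinear with x1
X = np.column_stack([np.ones(100), x1, x2])
y = rng.normal(size=100)

print("cond(X'X) =", np.linalg.cond(X.T @ X))  # huge => ill-conditioned

# preferred: a factorization-based solve (SVD inside lstsq), analogous
# in spirit to MATLAB's backslash, rather than an explicit inverse
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b =", b.round(3))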

Hope this helps.

Sangdon Lee, Ph.D.
GM Tech Center
David Winsemius
2007-09-27 13:59:42 UTC
Permalink
Post by Stratocaster
My response is probably not as technical as you would like, but I have
encountered similar occurrences when dealing with multiple regression
models.
These explanatory variables (X1, X2) seem to exhibit
"multicollinearity". I am certain the same (or a very similar) term
applies to logistic regression. Long story short (because I don't know
the full story), X1 and X2 provide the model with similar information,
and one usually becomes dominant when both are included. You could
check this by evaluating the correlation between X1 and X2; it is
probably substantial.
The model fit statistic simply is not significantly improved by adding
X2 to a model containing X1. (The difference in model fit produced by
adding X2 has chi-square = 16.3307 - 15.4002 = 0.93 with 1 df, p about
0.33.) The reverse ordering, adding X1 to a base model with X2, would
give a significant improvement in fit (chi-square = 16.3307 - 7.7710 =
8.56 with 1 df).
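
That likelihood-ratio arithmetic, using the chi-square values from the
posted output (scipy assumed for the tail probabilities):

from scipy.stats import chi2

full, x1_only, x2_only = 16.3307, 15.4002, 7.7710

print(chi2.sf(full - x1_only, df=1))  # adding X2 to X1: p ~ 0.33
print(chi2.sf(full - x2_only, df=1))  # adding X1 to X2: p ~ 0.003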

You show us no evidence of multicollinearity. True, there is likely an
association between X1 and X2, but there is no material instability in
the point estimate for the X1 parameter (3.40 alone versus 2.95
jointly) that would suggest multicollinearity. Logistic regression is
still a linear model, so the multicollinearity diagnostics are the
same. Look at the (X'X)^-1 matrix. Search terms: multicollinearity,
"condition number", VIF.
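
A minimal sketch of those diagnostics with statsmodels (x1 and x2 are
hypothetical stand-ins for the real data):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

x1 = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1], dtype=float)
x2 = np.array([0, 1, 1, 1, 1, 0, 1, 0, 0, 1], dtype=float)
X = sm.add_constant(np.column_stack([x1, x2]))

for i in (1, 2):                  # skip the constant column
    print(f"VIF(x{i}) =", variance_inflation_factor(X, i))
print("condition number =", np.linalg.cond(X))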

The OP should have been asked what X1 and X2 represented. If this is
merely a homework problem, then the OP should have been prompted to
think of situations where a causal connection might exist between X1
and Y and between X1 and X2, but not between X2 and Y. If Y were death
in a car accident and X1 were seat-belt use, then X2 might be some
habit associated with seat-belt use but not as strongly with
auto-accident death (perhaps the habit of regularly changing one's
car's oil). (Please note: I am not saying that LR results imply a
causal connection. That is established in a wider consideration of the
scientific domain.)
--
David Winsemius
xz
2007-09-27 14:41:52 UTC
Permalink
Post by xz
I have three vectors X1, X2, Y and want to find out the possible
dependence of Y on X1 and X2.
[rest of the original post and model output snipped]
Hi, thank you guys for the help.

Let me give some more details about the background of the problem.
This is actually a biostatistics problem (not my homework, though ^_^).
The problem concerns the affinity profile of an inhibitory compound
for a set of proteins, i.e., which proteins does the compound bind to?
Y indicates the state of binding: Yi = 1 if the compound binds to
protein i, otherwise Yi = 0.
X1i indicates whether the amino acid at one specific position in
protein i belongs to a certain set or not.
X2i is the same thing for another position.
The types of amino acids at these two positions are both believed to
affect the affinity between the protein and the compound.
Thus, I built this model to illustrate this fact and, consequently, to
use the model to predict whether the compound binds to an arbitrary
protein or not.
Given such a background, there is no physical correlation between X1
and X2, yet they are statistically correlated, as indicated by the
data.
In such a situation, if I still believe that X1 and X2 are factors
affecting the binding (based on some biological rationale) and thus
want to incorporate both X1 and X2 in my model, what should I do?
r***@comcast.net
2007-09-27 14:58:59 UTC
Permalink
Post by xz
[earlier discussion snipped]
In such a situation, if I still believe that X1 and X2 are factors
affecting the binding (based on some biological rationale) and thus
want to incorporate both X1 and X2 in my model, what should I do?
Your results don't say that X2 doesn't matter; they say you can't tell
from your data whether X2 matters. Think of a confidence interval on
the estimated coefficient: the coefficient on X2 could be about the
same size as the one estimated for X1, or it could be zero.
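
For concreteness, the Wald interval implied by the posted output
(estimate +/- 1.96 standard errors), as a quick Python check:

coeff, se = 1.0542, 1.0794            # X2 row of the joint model
lo, hi = coeff - 1.96 * se, coeff + 1.96 * se
print(f"95% CI for b2: ({lo:.2f}, {hi:.2f})")  # about (-1.06, 3.17)
# The interval covers 0 as well as values near the X1 estimate (2.95),
# so the data cannot distinguish "no effect" from "a large effect".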

If at all possible, you should increase the sample size.
-Dick Startz
s***@gmail.com
2007-09-27 15:54:40 UTC
Permalink
Post by xz
[earlier discussion snipped]
In such a situation, if I still believe that X1 and X2 are factors
affecting the binding (based on some biological rationale) and thus
want to incorporate both X1 and X2 in my model, what should I do?
Much more data would definitely help. But if you cannot collect much
more data, you may create a new variable X3 as follows (and use only
X3 as an input); a sketch appears below.

1) X3 = (X1 + X2)/2, a simple average of the two variables.

2) X3 = w1*X1 + w2*X2, a weighted average of the two, which is very
similar to performing PCA on X1 and X2.

Correlation is not causation, though. Also, make sure that X3 makes
substantive sense.
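
A minimal sketch of both combinations in Python (x1 and x2 are
hypothetical stand-ins for the real predictors; the PCA weights are
the loadings of the first principal component):

import numpy as np

x1 = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1], dtype=float)
x2 = np.array([0, 1, 1, 1, 1, 0, 1, 0, 0, 1], dtype=float)

x3_avg = (x1 + x2) / 2                     # idea 1: simple average

# idea 2: weights from the first principal component of [x1, x2]
Z = np.column_stack([x1 - x1.mean(), x2 - x2.mean()])
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
w = Vt[0]                                  # first-PC loadings (w1, w2)
x3_pca = Z @ w                             # the combined predictor
print("PC1 weights:", w.round(3))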

Sangdon Lee
