
# Regression - Science topic

Questions related to Regression
Question
I have a sample of 138 participants. Only 6 of them reported living alone (4.3%), while the remaining 132 share a household with others (family, partner, housemate, etc.).
I am trying to decide whether I can add "living alone" as a dichotomous variable in my hierarchical regression. What worries me is the very low percentage of individuals living alone in the sample. Do you think this would be problematic?
In general I agree with Prof Wright. I would like to add a few things. First, if the DV is dichotomous, this is a logistic regression, and hierarchical regression makes little sense because you cannot follow change in statistics like R-squared. Second, if OLS regression is appropriate in this situation, changes in F statistics or R-squared will tell you nothing as far as I can see. Third, if this is a logistic regression, there are no F statistics or R-squared to follow, so why do hierarchical regression? Fourth, in this case I can't see how hierarchical regression will tell you anything even if OLS is appropriate. Thus, fit a full model of the appropriate type. Fifth, if the DV is dichotomous and you do logistic regression with a low percentage in one group, then Firth's method of logistic regression is your best choice; it requires a special program, and that is why Prof Wright asked for a research question. Finally, if you have more than one DV, life gets very complicated, as it will if you have more than 2 categories in your DV. So next time please specify things like these whenever possible, to save time in getting an answer. Best wishes, David Booth
Question
In which software is this operation easiest?
I think Minitab is a good software for you in this field.
Question
For instance, when using OLS, the objective of the analysis could be
"to determine the effect of A on B".
Could this kind of objective hold when using threshold regression?
Question
This is the final model.
Log(odds of discontinuing exclusive breastfeeding) = -4.259 + 0.850 × (superior support) + 0.802 × (sufficient duration to express breastmilk).
Thank you.
You may be interested in taking this free online course; it includes logit modelling
Modules
1. Using quantitative data in research (watch video introduction)
2. Introduction to quantitative data analysis (watch video introduction)
3. Multiple regression
4. Multilevel structures and classifications (watch video introduction)
5. Introduction to multilevel modelling
6. Regression models for binary responses
7. Multilevel models for binary responses
8. Multilevel modelling in practice: Research questions, data preparation and analysis
9. Single-level and multilevel models for ordinal responses
10. Single-level and multilevel models for nominal responses
11. Three-level multilevel models
12. Cross-classified multilevel models
13. Multiple membership multilevel models
14. Missing Data
15. Multilevel Modelling of Repeated Measures Data
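To turn the final model quoted in the question into a predicted probability, a minimal Python sketch (the two predictor values are hypothetical placeholders for a respondent's coded answers):

```python
import math

# Hypothetical predictor values (1 = present, 0 = absent) -- assumptions for
# illustration; substitute the coded values observed for a given respondent.
superior_support = 1
sufficient_duration = 1

# Linear predictor from the final model quoted above
log_odds = -4.259 + 0.850 * superior_support + 0.802 * sufficient_duration

odds = math.exp(log_odds)   # odds of discontinuing exclusive breastfeeding
p = odds / (1 + odds)       # corresponding probability

print(round(log_odds, 3), round(p, 3))
```

With both predictors coded 1, the log-odds are -2.607, corresponding to a probability of discontinuing of roughly 7%.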
Question
I have 27 features and I'm trying to predict continuous values. When I calculated the VIF (Variance Inflation Factor) for each feature, only 8 features have values below 10, while the remaining features range from 10 to 250. Therefore, I am facing a multicollinearity issue.
My work is guided by two aims:
1- To predict the values using regression-based ML models.
2- To determine the importance of features (i.e., interpreting the ML models).
A variety of machine learning algorithms have been applied, including Ridge, Lasso, Elastic Net, Random Forest Regressor, Gradient Boosting Regressor, and Multiple Linear Regression.
Random Forest Regressor and Gradient Boosting Regressor showed the best performance (lowest RMSE) while using only 10 of the 27 features, selected based on the feature importance results.
As I understand it, multicollinearity issues can be addressed using regularized regression models like Lasso. When I applied Lasso to my data, the evaluation results were not as good as those of the Random Forest and Gradient Boosting Regressors. Moreover, none of my coefficients became zero when I inspected the feature importances.
Moreover, I want to analyse which feature is affecting my target value and I do not want to omit my features.
I was wondering if anyone could help me determine which of these algorithms would be good to use and why?
Hi,
For nonlinear data, Random Forests are recommended.
For linear case, LASSO is a good choice.
Good luck,
Anis.
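As a quick illustration of how a VIF above 10 arises from correlated features, a Python sketch on synthetic data (not the asker's; with more than two predictors you would regress each feature on all the others to get its R-squared):

```python
import math, random

# Synthetic illustration: two strongly collinear features.
random.seed(0)
n = 500
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [a + random.gauss(0, 0.2) for a in x1]  # x2 is mostly x1 plus noise

# Pearson correlation between the two features
mean1 = sum(x1) / n
mean2 = sum(x2) / n
cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2))
var1 = sum((a - mean1) ** 2 for a in x1)
var2 = sum((b - mean2) ** 2 for b in x2)
r = cov / math.sqrt(var1 * var2)

# With exactly two predictors, each one's VIF is 1 / (1 - r^2);
# the general definition regresses feature j on all the others.
vif = 1.0 / (1.0 - r * r)
print(f"r = {r:.3f}, VIF = {vif:.1f}")
```

Here the correlation near 0.98 pushes the VIF well past the conventional threshold of 10, which is exactly the situation described in the question.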
Question
Hey all,
For my master's thesis I am conducting research to determine the influence of certain factors (ghost games, lack of fan interaction, esports) on fan loyalty. For my statistical analyses, I will first conduct confirmatory factor analysis to validate which items (e.g., purchased merchandise) belong to which latent factor (e.g., behavioral loyalty).
However, I am unsure about my next step. Can I use multiple linear regression with my latent variables to identify the relationship between the factors and loyalty? The data is collected through a survey of mainly 7-point Likert-scale questions. Can I use linear regression, or is ordinal regression a must with Likert-scale data?
I would suggest using a PCA (principal component analysis) and calculating factor scores. These factor scores, in my opinion, could then be used as predictors in a multiple regression.
Question
The correlation between team purpose/trust and work-family synergy is .10 (ns) in my sample (N = 319). In my AMOS structural model, the standardized regression coefficient for these variables is -.23 (p < .001). How do I explain this apparent anomaly?
Search for 'Wright Tracing Rules' and see how correlation coefficients are used to compute path estimates. You may be able to troubleshoot the anomaly by manually tracing each pathway (keep the model simple).
Some key rules are: 1) no loops (no-go twice); 2) no going forward then backward; 3) a maximum of one curved arrow per path.
Question
I want to measure the development of two countries during conflict and ultimately compare the results; however, I am struggling to determine an appropriate model. My data is annual, and I have the data for both my dependent variable and independent variables for both countries.
How do I know if it is panel data or time-series data? I personally think it may be simple time-series data, would I then be able to use ARIMA or ARMA models for the regression?
Time series data, also referred to as time-stamped data, is a sequence of data points indexed in time order. Time-stamped data is collected at different points in time. These data points typically consist of successive measurements made from the same source over a time interval and are used to track change over time. https://www.influxdata.com/what-is-time-series-data/
Question
I am running regressions on one country's interest rate spreads vs another, and I am thinking of adding inflation rates as an additional independent variable.
Do I have to log-transform these? And which is more appropriate: the common logarithm or the natural logarithm?
Thank you for your answer Foday Joof , however, many of the published work I am looking at for reference do transform the spreads.... I am trying to make sure I understand why, the intuition behind it, so that I can explain it adequately in my paper.
Question
Hi - I would be grateful for some advice. I have conducted a regression analysis using HP and OP as predictor variables of overall affect.
A non-significant model was noted F(2, 109) = 2.13, p = .124. The model explains 2.0% of variance in post-playing/singing overall affect (adjusted R2 = .020).
However, in looking at the regression co-efficients I have the results illustrated in the attached.
I'm confused as to how I can have a non-significant model when one of the predictor variables is significant?
Many thanks
Karan
Since your F statistic is not significant, it basically tells you that your model does not work. In other words, it indicates that your set of independent variables (IVs) does not significantly predict your dependent variable (DV). Therefore, you should delete or add some IVs and then examine the F statistic if your goal is finding the best model. As regards the coefficients, they allow you to compare the amount each IV contributes to predicting your DV when considering all IVs. Even though one IV is significant and the other is not, this tells you that when they interact, their combination yields an insignificant prediction of your DV. For more insights, you could go through the piece by Frost (2021) and the RG thread initiated by Sternberg (2015). Both are referenced below.
Frost, J. (2021, June 8). How to interpret P-values and coefficients in regression analysis. Statistics By Jim. https://statisticsbyjim.com/regression/interpret-coefficients-p-values-regression/
Sternberg, N. (2015, September 6). Can I interpret a regression when the regression model is non-significant, but it still results in a significant interaction effect? ResearchGate. https://www.researchgate.net/post/Can_I_interpret_a_regression_when_the_regression_model_is_non-significant_but_it_still_results_in_a_significant_interaction_effect
Good luck,
Question
My data has some outliers, so I preferred to use quantile regression.
Do I have to remove the outliers from the data before running the quantile regression?
Hi Srikanth,
Since quantile regression estimates the conditional median or other quantiles, it is generally robust to the effect of outliers, so there is no need for outlier removal.
Hope this helps!
Ebrahim
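The robustness described above can be seen in miniature by comparing the location estimators themselves; a toy Python sketch (made-up numbers, not the asker's data):

```python
import statistics

# The same data with and without one extreme outlier.
clean = [10, 11, 12, 13, 14]
with_outlier = clean + [1000]

# The mean (what OLS targets) is dragged far away by the outlier...
mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
# ...while the median (what median/quantile regression targets) barely moves.
median_shift = statistics.median(with_outlier) - statistics.median(clean)

print(mean_shift, median_shift)
```

The mean shifts by more than 160 units while the median shifts by only 0.5, which is why conditional-quantile estimates tolerate outliers that would badly distort a conditional-mean fit.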
Question
If the dependent variable is the number of participations in legislative elections (1-8), which regression is better: a count model such as Poisson, or negative binomial regression? And why?
Pardon me again, but if you don't have any idea about the dependent variable and its characteristics, this question doesn't have an answer. See the Rosner text for an explanation of why that is. Best wishes and apologies for any misunderstanding, David Booth
Question
As part of my research, I have to analyse 10 years of time-series financial data for four public limited companies using multiple linear regression. First I analysed each company separately. The adjusted R-squared value is above 95% and the VIFs are well within limits, but the Durbin-Watson statistic is around 2.4, 2.6 or 1.1, which signals either positive or negative autocorrelation. Then I tried the model on the combined data of all four companies. This results in a much lower adjusted R-squared (35%) and again positive autocorrelation (Durbin-Watson of 0.94). As I am doing a DuPont analysis, where the dependent variable is Return on Equity and the independent variables are Net Profit Margin, Total Asset Turnover and Equity Multiplier, which are fixed, I cannot change the independent variables to reduce the autocorrelation. Please suggest what I should do.
Time series analysis gives you various models, such as autoregressive models, which would explain the behavior of your data and let you project the d.v. into the future.
Question
Generally when it comes to assessing the performance of an ANN the most reliable approach is using a test set. However, as you further progress into the future there will be no more test data to use, as your model will provide the estimated values. In this regard, if you want to retrain your ANN using new incoming data you cannot test your adjusted model anymore. On what terms should you choose between two trained models without a test set? Validation error, loss etc. ? I look forward to any suggestions.
Alessandro Ferrarini how often do you suggest retraining an ANN? For example, retraining each week on a dataset containing hourly data and using your percentages would mean having 168 vectors of data, of which approximately 100 for training, 33 for validation and 35 for testing. Is there any method you suggest for determining when retraining should be done, so that it is relevant (meaning that the trained model has something new to learn)? Thank you very much for your previous answer; it assured me that I was doing the comparison using the right metrics.
Question
Hi my peers,
I have a study with panel data from 2009 to 2018 for 104 companies (1,040 observations). My study examines a causal relationship. I need to highlight the temporal effect in the regression results. How can I do that?
I tried two methods but each produced different result to some extent.
The first was creating year dummies through the command tab year, gen(yr_), which produces ten year dummies.
The second method was using i.year, which produced only 9 dummies, leaving out the base year, i.e. 2009. The regression results are to a great extent unchanged, but fewer years are significant with the i.year specification.
I don't know which of these is the correct one to follow, or whether there is another way to show the temporal effects in the study findings.
Hi Samih. I have exactly the same problem as you. How did you manage to do it ?
Question
Given an independent variable (X) and two moderating variables (M1 and M2), if we want to plot a three-way interaction diagram, which interaction terms need to be calculated? After obtaining the estimated parameters of these interaction terms, which software or plug-in can be used to draw the three-way interaction diagram?
Thanks & Regards
The ggplot() function, part of the R ggplot2 package, might be the easiest way to draw a 3-way interaction.
Look at the Appendix (understanding ggplot2) of the online R Graphics Cookbook:
the basics are: aes(x = X-variable, y = Y-variable, group = moderating variable)
Question
Hello everyone,
I am working on a journal revision. The reviewers asked me to do a mixed-procedure analysis because my experiment was a multiple-period task in which a participant repeated a task over several periods, and observations from all periods were used in the analysis.
The reviewers also provided a reference for rerunning the analysis. When I was reading the reference paper, the results table is reported as in the picture attached.
My question is how do I conduct an ANOVA or mixed procedure and report results similar to the table attached. Specifically, do I need to conduct two ANOVA analyses to report one for between subjects and one for within-subjects? Or one analysis is enough. If so, how do I find the two Errors (one for between subjects and one for within-subjects)?
BTW, I use SPSS.
Thank you very much for your help!
There is a terminology issue here (and the terminology is confusing). The reviewers might have meant:
1) a mixed ANOVA with a combination of within (repeated measures) and between (independent measures factors). This is relatively easy to run via the repeated measures ANOVA commands in SPSS (e.g., see https://www.discoveringstatistics.com/repository/mixed_2020.pdf ). This is a single model that handles all the error terms etc.
2) They could have meant a linear mixed model (or multilevel model). If you have a completely balanced design with no missing cells and no time-varying covariates then this is essentially equivalent to the mixed ANOVA (assuming a single random factor for participants and a nested design). Generally this approach is useful when you have imbalance or complex random effects that you want to model (and the mixed ANOVA is a special case of this kind of model). The "mixed" term here refers to a mixture of fixed and random effects being modelled. It's a highly flexible approach. You can run this in SPSS but it isn't straightforward if you are new to these models.
The table you show appears to be from a mixed ANOVA. Your description isn't sufficient to tell what the reviewers meant.
Question
In my dependent variable, there are four categories. I have already obtained the regression results from mprobit. However, when I ran the syntax
mfx, predict(p outcome(4)) varlist(_all), I got results with values like 0.00000083. Is this result fine? Or is there another command in Stata?
Shiba Shankar Pattayat I will have to go through the Stata documentation, because until now I have also only used the mfx command.
Question
Hi
I need to use regression models for my research. I used SPSS for linear regression but I want to use univariate and multivariate power regression such as:
Y = a·X^b
Y = a·X^b·Z^c
Y = a·(X·Z)^b
and...
where:
a,b,c: model parameters
Y: dependent variable
X,Z: independent variables
Is there any user friendly statistical software to do it?
(I know about SAS or R software, but I think they perform regression by programming)
Thanks
R provides flexible solutions to those problems. E.g., you can simply use the nls() function, which will fit a nonlinear least-squares model.
In case of the nls() function, you provide a formula in the form y ~ a*x + b (for a simple linear regression), the data frame containing the variables, and some starting values as a list.
So you'll have for example
nls(y ~ a*x + b, data = dat, start = list(a = 1, b = 1))
for a simple linear model, but you can use nonlinear formulas, too:
nls(y ~ a*x^b, data = dat, start = list(a = 1, b = 1))
nls(y ~ a*x^b*z^c, data = dat, start = list(a = 1, b = 1, c = 1))
nls(y ~ a*(x*z)^b, data = dat, start = list(a = 1, b = 1))
and so on. For very complex models and "extreme" values for the parameters a, b, c, you may have to adjust the starting values.
When people have little experience with programming, it may sound intimidating to "perform regression by programming", but it is way more straightforward than searching the toolbars of programs such as Excel. Doing things "by programming" also comes with the advantage that you can "program" the complete workflow and apply it to new sets of data.
Another programmatic solution would be to use functions from the scikit-learn module in Python (which offers seemingly unlimited possibilities), but R is easier to just jump into, in my opinion, and widely used in statistical analyses.
I'd encourage you to give R a try, if you haven't done so yet...
Question
I built the SVR model in MATLAB. I want to use this model to predict the optimal experimental parameters (not SVR model parameters), such as using pso or GA, but I don't know the SVR model regression function (objective function), please help me.
To my knowledge, SVM cannot be used for regression.
Question
Does anyone work with image datasets to implement deep learning models for regression tasks?
We have an image dataset along with a CSV file giving the popularity of each image over 5 days. We have to predict the popularity on the 6th day based on the images and the CSV file (containing the popularity of the previous days).
The idea is to use transfer learning (any pre-trained DL model) for the regression task (to predict the popularity on the 6th day).
Thanks
Question
I am testing the Fama-French three- and five-factor models for Japan. I have done all the regression part; however, I am struggling with the GRS test.
Can anybody help me with that, please? How can I run the GRS test in Stata?
I know that the command is grstest2, but I do not know how to use it or how to read the results.
Highly appreciate the help
Question
How does exploratory regression work with categorical data? I am looking for good examples.
Stutee Gupta Hi, you may also want to look at the following documents for more information,
Question
When using the regression equation Y = mX + C, should C be set to 0 in this calculation or not?
Hello Anshuman Nath
I think whether you set C = 0 or not, your final regression equation would look like Y = mX. So you can proceed with this equation to calculate the total carbohydrate content of your sample.
Hope this helps.
Best
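If C is forced to 0, the least-squares slope has a simple closed form, m = Σ(x·y) / Σ(x²); a short Python sketch with hypothetical standard-curve numbers:

```python
# Toy calibration data (hypothetical concentration and absorbance values).
x = [0.0, 1.0, 2.0, 3.0, 4.0]        # concentration of standards
y = [0.02, 0.21, 0.39, 0.61, 0.80]   # measured absorbance

# Least-squares fit *through the origin* (C forced to 0):
# m = sum(x_i * y_i) / sum(x_i^2)
m = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Invert the fitted line Y = m*X to estimate an unknown concentration
y_unknown = 0.50
x_unknown = y_unknown / m

print(round(m, 4), round(x_unknown, 3))
```

With the intercept left free, you would instead fit Y = mX + C and check whether the estimated C is practically negligible before dropping it.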
Question
I am using probit regression to check the mortality rate of larvae after exposure to various bacterial isolates; my total sample number is 16, including one control. I transformed the mortality rates to probits using Finney's table.
Sorry! The table is attached here
Question
In my research, I've set up the following equation (with a demographic dummy variable: Czech = 1, Dutch = 0): DV * Czech = b0 + b1*Czech + b2*Czech + ... + b10*Czech.
I can simply interpret b1...b10 for the Czech demographic; however, how can I interpret the variables for the Dutch population (Dutch = 0)? Would I need to use only the intercept? Or is there a mistake in the model equation (should the DV not be an interaction term)?
In DV = b0 + b1*Czech, b0 is the mean of DV for Dutch, and b1 is the mean difference in DV between Czech and Dutch.
Czech*DV on the left-hand side of your equation makes no sense. The other terms (b2*Czech, b3*Czech, ...) make no sense either. This whole part of your right-hand side could be written as b0 + (b1+b2+b3+...)*Czech, and the sum of the coefficients can be (and must be) treated as a single model coefficient (i.e. as the mean difference in the DV between Dutch and Czech).
An interaction would be modelled between two different predictors by their product. In your example I see only Czech being a predictor. Another one might be, say, Age. Then the model
DV = b0 + b1*Czech + b2*Age + b3*(Czech*Age)
would be a model where:
b0 is the mean DV of Dutch at age 0,
b1 is the mean difference between Czech and Dutch at age 0,
b2 is the mean change in the DV per year of age for Dutch, and
b3 is the interaction of Czech and Age, i.e. the difference in the mean change per year between Czech and Dutch.
In this example, the coefficients b0-b2 are not particularly useful practically, but they are required by the model to estimate the interaction. If the variable Age is centered at age a, then b0 and b1 would refer to subjects at age a.
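The coefficient interpretations above can be verified numerically; a small Python sketch with made-up coefficient values (no data or fitting involved, just the model equation):

```python
import math

# Hypothetical coefficients (made up, not estimated from any data) for the
# model discussed above: DV = b0 + b1*Czech + b2*Age + b3*(Czech*Age)
b0, b1, b2, b3 = 5.0, 1.5, 0.2, -0.1

def predict(czech, age):
    return b0 + b1 * czech + b2 * age + b3 * czech * age

# b0: mean DV for Dutch (Czech = 0) at age 0
assert predict(0, 0) == b0
# b1: Czech-Dutch difference at age 0
assert math.isclose(predict(1, 0) - predict(0, 0), b1)
# b2: per-year change in the DV for Dutch
dutch_slope = predict(0, 21) - predict(0, 20)
assert math.isclose(dutch_slope, b2)
# b3: Czech-Dutch difference in that per-year change (the interaction)
czech_slope = predict(1, 21) - predict(1, 20)
print(czech_slope - dutch_slope)  # equals b3 up to floating-point rounding
```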
Question
What is the best method of checking normality through SPSS before conducting a regression?
Normality can be checked with a goodness of fit test, e.g., the Kolmogorov-Smirnov test. When the data is not normally distributed a non-linear transformation (e.g., log-transformation) might fix this issue.
Question
Hi there. I'm a medical student and now I'm working on survival prediction models, but I encountered some difficult problems with feature selection. The original sequencing data was huge, so I relied on univariate Cox regression to obtain a subset (subset A), and I'd like to perform Lasso regression to further select features for the final survival prediction model (built with multivariate Cox regression). However, subset A is still larger than I expected. Can I further obtain subsets by limiting the range of the Hazard Ratio (HR) before the Lasso regression? Or could I perform Random Survival Forest to obtain subsets from subset A for the final survival prediction model? Is there anything I need to pay special attention to during these processes?
Apologies if I've misunderstood the question. I'm not sure what you mean by using a univariate Cox model to create a subset. Did you simply look for features that were significant on their own? If so, I'm not sure that's appropriate.
In terms of Lasso regression, why do you need to subset the features beforehand? Why not just fit a lasso-regularised Cox model with all features?
Question
No previous study has used large-scale data from many countries on a topic - can this be a strong rationale for a study? For instance, if there is no previous cross-national study on a topic, it cannot be said whether it is a worldwide phenomenon or a region-specific one. Also, a cross-national study will give the results stronger generalization power, which might be of interest to international practitioners like the WHO.
- Can these statements be a strong rationale and justification for a study? Please let me know, Thank you.
Agree with Sachin Suknunan
Question
I have non-stationary time-series data for variables such as Energy Consumption, Trade, Oil Prices, etc and I want to study the impact of these variables on the growth in electricity generation from renewable sources (I have taken the natural logarithms for all the variables).
I performed a linear regression, which gave me spurious results (R-squared > 0.9).
After testing these time series for unit roots using the Augmented Dickey-Fuller test, all of them were found to be non-stationary, hence the spurious regression. However, the first differences of some of them, and the second differences of the others, were found to be stationary.
Now, when I run the new linear regressions with the proper order of integration for each variable (in order to have a stationary model), the statistical results are not good (high p-values for some variables and a low R-squared of 0.25).
My question is how should I proceed now? Should i change my variables?
Please note that transforming variable(s) does NOT make the series stationary, but rather makes the distribution(s) symmetrical. Application of logarithmic transformation needs to be exercised with extreme caution regarding properties of the series, underlying theory and the implied logical/correct interpretation of the relationships between the dep variable and associated selected regressors.
Reverting to your question, the proposed solution would be to use the Autoregressive Distributed Lag (ARDL) model approach, which is suitable for datasets containing a mixture of variables with different orders of integration. Kindly read the manuscripts attached for your info.
All the best!
Question
I am working on nonlinear monetary policy rules using time series analysis and I need to estimate a threshold variable and through that utilize a threshold regression.
Dear Kweku
I also recommend you go through the guideline from STATA.
Regards
Question
Dear Community,
I want to center (subtract the mean from) my data in order to lower my intercept (as close as possible to zero). However, as I also have categorical data, which I don't know how to center, I can't seem to lower my intercept.
And the thing is that my dummy variables drive up the intercept, so I don't know how to fix the issue.
When I enter only continuous variables that are centered, the intercept is close to zero.
Not sure why you want the intercept to be zero, but one way you could easily achieve this is by first finding the (non-zero) intercept and then subtracting this value from the response variable.
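A quick numerical illustration of why centering the predictors alone does not force the intercept to zero: after centering, the intercept equals the mean of the response, which is why subtracting that mean from the response is what brings it to zero (toy numbers, simple regression in closed form):

```python
import statistics

# Toy data (hypothetical): simple regression of y on a centered predictor.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 2.0, 2.0, 3.0]

xbar = statistics.mean(x)
x_c = [xi - xbar for xi in x]   # centered predictor, mean(x_c) = 0

# Closed-form OLS slope and intercept for y ~ x_c
sxy = sum(xc * yi for xc, yi in zip(x_c, y))
sxx = sum(xc * xc for xc in x_c)
slope = sxy / sxx
intercept = statistics.mean(y) - slope * statistics.mean(x_c)

# Because mean(x_c) = 0, the intercept equals mean(y): it is only
# "close to zero" when the response itself averages near zero.
print(intercept, statistics.mean(y))
```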
Question
I know that neural networks, including deeper ones, perform better when we increase the dataset size.
I wonder which of the other machine learning algorithms used for regression are sensitive to the size of the data, and which are not.
Hi Nana Cne!
All machine learning algorithms are sensitive to the volume of the data set. But among all the algorithms, SVM can work with a smaller data set.
Question
I am looking for suggestions on statistical methods for the following study design.
- We have one control group and four treatment groups of mice.
- Sample size is 6 mice per group. The number of variables (i.e. metabolites) is about 200.
- Each treatment aims to increase the lifespan of mice. However, we do not know how long each mouse lived. **We only know on average how much longer each treatment lives than control.** For example, for treatment group 1, we know on average the mice lived 30% longer than the control, while treatment group 2 80%, etc.
What I am puzzled by is that this is neither a group comparison (i.e. using ANOVA) nor a regression, because we do not know exactly how much longer EACH mouse lives. Instead, what we know is the average lifespan of each group. Let's say we would like to perform a univariate analysis, namely analyzing variables one by one.
Here are my thoughts so far
- Conduct simple linear regression anyway, i.e. we make the lifespan of each mouse the average lifespan of its corresponding treatment group and then perform linear regression. Or
- Use multinomial regression, e.g. proportional odds model or baseline odds model, because there is an order by lifespan for these groups.
I feel neither is optimal. For the multinomial regression, we are not able to distinguish 1% vs 10% from 1% vs 99%, i.e. we are using the continuous information inefficiently.
Are these valid methods? Or what statistical method would you recommend for this analysis? Thank you!
Hello Sili,
Wow; it seems as if monitoring the lifespan of each individual animal in the study would have been a basic expectation for making longevity comparisons. As well, do I understand correctly that you wish to evaluate the impact of (or on) 200 variables as they might or might not relate to average lifespan for a group? If so, then the study seems woefully undersized in order to be able to do that with any precision. Just the thought of running 200 tests means that the expected number of Type I errors would likely be high enough to worry about, unless you set the per-comparison significance level very low.
What this leaves you with as options is not really an attractive pool.
1. You could run correlations of mean levels of a variable with the mean lifespan for a group. Unfortunately, that's 5 pairs of data points, which is not going to lead you to a very precise estimate.
2. If you have measurements for each animal on a given variable, you could run a one-way anova (using the 5 groups as the IV). Again, these won't be very powerful tests unless the effects are pretty large. The sheer number of such tests also needs to be considered, due to the aggregate Type I error risk. I would not suggest manova, as the number of cases is simply too low to afford statistical power.
3. I would not recommend the OLS regression method you outlined, as you would be artificially masking variation in lifespan.
4. Your ability to compute and report ORs (odds ratios) for survival is severely hampered by not knowing whether a given group 3 animal did or did not outlive a given control group animal. Presuming they all did may well distort the reality of the outcomes, given the description of what you know. If you have, by hour / day /week / month the number of surviving cases in each group, you could try a proportional hazards analysis, but the sample sizes are likely far too low to offer much precision here (also, it sounded like you didn't really have this type of information available).
5. Just report the average lifespan of each group, the mean of each group on each variable (if individual animal results are available, then you should include CIs as well), call the study and presentation "exploratory," and leave it at that (with the obligatory recommendation that, "further research is needed").
Question
Many references state that kriging and GP regression are the same. However, their formulations differ from each other. Can you explain the difference between these two methods from the statistics domain?
Actually I need to compare different kriging methods. For example, simple kriging assumes that the mean of the random function is known and constant, as in GP regression, whereas ordinary kriging assumes the mean of the function is unknown and varies locally (E(x)).
I hope the following article may be helpful to understand the difference(s) between Kriging and Gaussian Process regression.
Thanks!
Question
I want to do a comprehensive study of errors-in-variables methods from both a numerical-analysis and a statistical viewpoint, and compare the results with regression for selected parameter estimation problems in my domain, where errors-in-variables is expected to perform better in terms of accuracy. These problems are of the linear and nonlinear regression type. I want to check whether the method under study is an improvement over generalized least squares. I am including multiple factors (accuracy, computational efficiency, robustness, sensitivity) in my study, under different combinations of stochastic models. What kind of statistical analysis, experimental design, metric, or hypothesis test is required for a study of this nature to establish the superiority of one method over another (i.e., to recommend one method over another for a particular class of problems)?
Maybe you wan to consider the recursive least squares algorithm (RLS). RLS is the recursive application of the well-known least squares (LS) regression algorithm, so that each new data point is taken in account to modify (correct) a previous estimate of the parameters from some linear (or linearized) correlation thought to model the observed system. The method allows for the dynamical application of LS to time series acquired in real-time. As with LS, there may be several correlation equations with the corresponding set of dependent (observed) variables. For the recursive least squares algorithm with forgetting factor (RLS-FF), adquired data is weighted according to its age, with increased weight given to the most recent data.
Years ago, while investigating adaptive control and energetic optimization of aerobic fermenters, I applied the RLS-FF algorithm to estimate the parameters of the KLa correlation, used to predict O2 gas-liquid mass transfer, thereby giving increased weight to the most recent data. Estimates were improved by imposing a sinusoidal disturbance on air flow and agitation speed (the manipulated variables). The proposed (adaptive) control algorithm compared favourably with PID. Simulations assessed the effect of numerically generated white Gaussian noise (2-sigma truncated) and of first-order delay. This investigation was reported in (MSc Thesis):
Question
Hello,
I’m analyzing how multiple emojis change the perception of messages. In the questionnaire, the participants had to rate neutral sentences and the same neutral sentences enriched with one positive, negative or neutral emoji or the same emoji twice or three times on a 7-point Likert-scale from very negative to very positive (DV). The IV in this case is the number of emojis used in the sentences (0, 1, 2 and 3). I also want to control, how adding emojis to the sentences influence the perception of the sentences for gender and digital generations.
To analyze the effect, I conducted an ordinal logistic regression, because the DV is ordinal. I also split the IV, to calculate different models with the IV’s 0 or 1 emoji, 1 or 2 emojis and 2 or 3 emojis. Now I want to control this effect for gender and digital generations.
- Is there another way than splitting the data set e.g. in different genders and calculating the regression again for different data sets, one containing only men, one only women and one only other genders?
- Is there another method than ordinal logistic regression, to check whether the effects derived from the collected data are significant?
I'm not sure I understand the question. Why not just add gender and the other variables as predictors? Doing subgroup analyses is generally a bad idea, because an effect in one subgroup and no significant effect in another doesn't imply that the effect differs between the two subgroups. For example, it could be near identical in both subgroups, yet p < .05 in one and p > .05 in the other because of different sample sizes or different degrees of collinearity with other predictors. To test whether the effect is bigger for males than females, you'd want to include interaction effects, though these can be difficult to interpret, especially in an ordinal logistic regression. Always get a plot of the predicted probabilities for the ranges of predictors of interest to aid interpretation. (The predicted probabilities are not a linear function of the coefficients, so you can get potentially surprising patterns depending on the ranges of predicted probabilities.)
Question
Dear all,
I trust this email finds you well.
I am currently testing a theoretical model using PLS-SEM and SmartPLS. The aim is to assess the influence of certain soft skills (empathy) on client relationship outcomes (e.g., customer loyalty).
I would like to assess the effect of certain control variables on some of my dependent variables.
The issue that I am encountering is the following.
I have 5 control variables:
· Gender (w/ 2 groups: male female)
· Salary (w/ 5 groups: below 20k, 21-30k, 31-50k, 51-70k, +70k)
· Education (w/ 5 groups: no diploma, high school diploma, bachelor, master, PhD)
· Visit frequency (w/ 5 groups: once a year, once every 6 months, etc.)
If I understand correctly, to yield significant results and interpret the findings correctly, I will need to use dummy variables. Based on the above, that means a total of 17 control variables.
Based on your experience, is there a way to simplify the above? I was thinking to reduce the number of groups for each variable. For instance, instead of having 5 salary categories, reduce the number of categories down to 2 such as:
- Salary: above and below the national average salary.
- Education: those who have at least a master, those who do not.
Etc…
What do you think of the above?
Of course, if you have any information / resources / material / that may allow me to address the above issue, that will be appreciated.
I thank you once more for your assistance and wish you a nice day.
Best regards,
BITT
Question
Good day Scholars
I am currently working on a time-series regression with an endogeneity problem. I just want to know whether the Cochrane-Orcutt estimator performs better than the two-stage estimator in a single-equation regression with an endogeneity problem.
Thanks for always being there
Saheed Busayo Akanni , no, it does not. The Cochrane-Orcutt procedure corrects the regression estimates to account for serial correlation in the error term, but it does not solve a potential endogeneity problem.
Question
I have 3 Likert-scale IVs and 1 DV, which is also Likert-scale data. I want to do a moderation analysis. Can I use regression for this data? Please suggest ways to do it.
Ok Belal. Thank You very much!!!
Question
Hello all,
I have 60 samples of Whole Genome Sequence (BAM, VCF ) File. Now I want to do SNP analysis ( I am trying to find some positions which differ among case and Control ).
I did this with Python and found some positions based on a 70% difference between case and control.
While googling, I found that many researchers use PLINK and other tools for linear (logistic) regression and for finding p-values.
I also watched some YouTube videos, but they were insufficient.
I have attached some of my data screenshots.
1. Sample of Merged 60 VCF File
2. sample of the single VCF file.
Thank You all
Hi Abhishesh Bajracharya,
I did the same work with 7 plant sample genome sequences: 3 tolerant (T) and 4 susceptible (S). I merged the 3 BAM files into one BAM file for T and the 4 BAMs into one BAM file for S, and I did VCF annotation using SnpEff. My question now is how to interpret the SnpEff output. I have locus information; I am working with HSP proteins, so I am looking for SNPs within the HSP positions. Also, any additional suggestions on what I can get from the SnpEff results? Please help me in this regard.
I did
SNP calling Using GATK_v4
- Filtering (Filtering out the unnecessary SNPs): 'SnpSift Filter'
- Comparing (between the control and the samples; finding out matches): 'bcftools isec'
- Annotating (figuring out the gene locus of the SNPs): 'SnpEff Eff'
Question
I have a sample of 138 observations (cross-sectional data) and am running an OLS regression with 6 independent variables.
My adjusted R2 is always negative, even if I include only 1 independent variable in the model. All the beta coefficients and the regression models are insignificant, and the value of R2 is close to zero.
My queries are:
(a) Is negative adjusted R2 possible? If yes how should I justify it in my study and any references that can be quoted to support my results.
(b) Please suggest what should I do to improve my results? It is not possible to increase the sample size and have already checked my data for any inconsistencies.
@David If you can suggest any book or published research paper I can refer to because I couldn't find any authentic source on google that can be cited for supporting my results.
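On query (a): yes, a negative adjusted R-squared is possible, since adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1) drops below zero whenever R2 is smaller than the degrees-of-freedom penalty. A quick sketch with the question's n = 138 and 6 predictors (the R2 value itself is invented):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared; n = sample size, k = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With R2 close to zero, the adjustment pushes the value below zero
print(adjusted_r2(0.01, 138, 6))  # about -0.035
```

So a near-zero R2 combined with the penalty term is a complete arithmetic explanation for the negative values you see.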
Question
Hello,
Do ggplot regression plots actually show all participants? I had 27 participants but see only 13 dots on the graph. Is that OK, or am I missing something here?
Thanks,
Vita
Question
Hi!
I am preparing to run some analyses on two variables that seemingly have a nonlinear relationship. I say this because when I fit a line to the scatterplot, the R-squared was higher for the cubic equation.
Any tips on this? I'm trying to use X at one time point to predict Y at a later time point. What analysis would be appropriate here, given what the scatterplot suggests?
I hope someone can help!
If you are not studying a completely new subject I would search for information available (from more expert lab members or scientific specialistic literature) on your process in order to choose an appropriate or more reliable extrapolation method.
Question
I can find many for Chironomidae larvae, but struggling to find one for pupae, but I'm sure they must exist!
Question
Dear Community,
I am doing a fractional outcomes regression (logit) for my thesis and can't find information on what assumptions the model makes. I suppose I would need to test that those assumptions are met in my sample in order to conduct the analysis. Furthermore, I wanted to know whether there is any possibility of doing a robustness test on such a model.
Additionally, in a fractional outcomes regression, do my independent variables have to be between 0 and 1 as well?
Thank you very much,
Best,
Jan
Fractional responses concern outcomes between zero and one.
The most natural way fractional responses arise is from averaged 0/1 outcomes. In such cases, if you know the denominator, you want to estimate such models using standard probit or logistic regression. For instance, the fractional response might be 0.25, but if the data also include that 4 out of 36 had a positive outcome, you can use the standard estimation commands.
Fractional response models are for use when the denominator is unknown. That can include averaged 0/1 outcomes such as participation rates, but can also include variables that are naturally on a 0 to 1 scale such as pollution levels, patient oxygen saturation, and Gini coefficients (inequality measures).
Fractional response estimators fit models on continuous zero to one data using probit, logit, heteroskedastic probit, and beta regression. Beta regression can be used only when the endpoints zero and one are excluded.
Copied and cited from https://www.stata.com/features/overview/fractional-outcome-models/. Hope this helps. Thanks.
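As a rough illustration of the model class (a sketch, not Stata's implementation): a fractional logit can be fit by maximizing the Bernoulli quasi-log-likelihood, which is well defined for any outcome strictly between 0 and 1. Synthetic data and invented coefficients throughout:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([0.5, 1.0])

# Fractional outcome in (0, 1): a noisy logistic mean
p = 1 / (1 + np.exp(-X @ true_beta))
y = np.clip(p + rng.normal(scale=0.05, size=n), 0.01, 0.99)

def neg_quasi_loglik(beta):
    mu = 1 / (1 + np.exp(-X @ beta))
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

res = minimize(neg_quasi_loglik, x0=np.zeros(2), method="BFGS")
print(res.x)  # roughly recovers [0.5, 1.0]
```

Note, regarding the original question, that only the dependent variable must lie between 0 and 1; the covariate x here is unbounded.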
Question
Dear Researchers,
I am doing my research on the determinants of bank capital structure, using information on banks' consolidated balance sheets and income statements from the CapitalIQ database. My original dataset has 1712 observations; after accounting for missing values, it reduces to 1074 observations. My problem is the following: when I run OLS regressions on the complete-cases dataset with heteroskedasticity- and cluster-robust standard errors (coeftest in R), most of my variable coefficients become insignificant. This does not happen with the original dataset containing missing values, or with the complete-cases dataset without robust standard errors. Do missing values or attrition affect the heteroskedasticity in the data, and therefore the significance of coefficients, when accounting for heteroskedasticity and cluster robustness?
I will be glad to get any input or clarification.
Thank you,
Outmane
Question
I estimated an autoregressive model in EViews. I got a parameter estimate for one additional variable that I had not included in the model; the variable is labelled 'SIGMASQ'.
What is that variable and how do I interpret it?
i am attaching the results of the autoregressive model.
SIGMASQ is the sigma squared (variance) of the distribution of the residuals, which serves as a proxy for the variance of the distribution of the dependent variable. This distribution is needed for the maximum likelihood method.
SIGMASQ is estimated in a second step, after the parameters attached to the regressors have been estimated.
By contrast, SE is the standard error of the regression, which is the average of the differences between the actual and fitted values of the dependent variable.
Question
1. What is the difference between Least Square Regression and Robust Regression?
2. How can we interpret the results of the regression model in both cases?
3. If the variables in the data set have not shown proper correlation, can we use these techniques?
4. Any R script references?
1. In the normal lingo, very little. In robust regression, the standard errors are calculated differently, to make them "robust" against heteroscedasticity (or clustering).
2. You interpret the regression coefficient in the same manner: the average change in y given a one-unit increase in x.
3. I have written three textbooks on regression, but I have never heard the term "proper correlation" before. You do regression to find out if there is a (linear) association between x and y and, if so, to find out how large this association is.
4. I seldom use R, but I know for a fact that both plain-vanilla and robust regression are straightforward to estimate in R.
Good luck :-)
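To make point 1 concrete (robust here meaning heteroscedasticity-robust standard errors rather than M-estimation): the coefficients are identical under both procedures; only the standard errors change. A NumPy sketch with synthetic heteroscedastic data and the HC3 sandwich formula:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 2, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 + x, size=n)  # error variance grows with x

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y            # same point estimates either way
resid = y - X @ beta
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # leverage values

# Classical (homoscedastic) standard errors
s2 = resid @ resid / (n - 2)
se_classical = np.sqrt(np.diag(s2 * XtX_inv))

# HC3 robust standard errors: sandwich with e_i^2 / (1 - h_i)^2 weights
meat = X.T @ (X * ((resid / (1 - h)) ** 2)[:, None])
se_hc3 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(beta, se_classical, se_hc3)
```

In R the same comparison can be done with lmtest::coeftest together with sandwich::vcovHC.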
Question
When should one use Cragg hurdle regression?
And what is a bounded dependent variable?
Cragg hurdle regression fits a linear or exponential hurdle model for a bounded dependent variable. The hurdle model combines a selection model, which determines the boundary points of the dependent variable, with an outcome model, which determines its nonbounded values. Separate independent covariates are permitted for each model. For more information, see the link:
Question
Dear all,
I am conducting my dissertation which has a quantitative data analysis process and, as I have never done it before, I need your help to understand the needed steps.
The basic model will require three separate regressions to confirm the relationships between constructs. The data were collected with a Likert-style questionnaire in which multiple questions measure each variable.
Before I start running the regressions, which steps should I implement?
I believe I would need to test reliability first: do I compute Cronbach's alpha for each question or for the set of questions measuring each variable?
Similarly, for the regression, do I take the Likert average for each variable, or do I test each question separately?
Thank you for your help! If anyone would also be available for a video call it would be of amazing support.
Elena
Hi Kenneth W. Cromer great thank you very much for confirming this! Now back to the drawing board and reversing the steps, good thing I saved the initial data set!
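For later readers of this thread, a minimal sketch of the two steps discussed above (Cronbach's alpha computed per construct, then item means used as the regression variables), with simulated 5-point Likert data:

```python
import numpy as np

def cronbach_alpha(items):
    """items: (n_respondents, n_items) matrix for ONE construct's questions."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(42)
latent = rng.normal(size=100)  # the underlying construct
items = np.clip(np.round(3 + latent[:, None] + rng.normal(scale=0.7, size=(100, 3))), 1, 5)

alpha = cronbach_alpha(items)        # reliability of this one scale
scale_score = items.mean(axis=1)     # one value per respondent for the regression
print(alpha)
```

So alpha is reported once per variable (per set of questions), not per individual question, and the averaged scale scores are what enter the regressions.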
Question
I have a lot of data on two independent variables. I am asking how to obtain a fitted curve to predict the behaviour of the two variables. I wonder whether multiple regression modelling would be beneficial; I checked linear regression, but it was not valid.
The recursive least squares algorithm (RLS) allows for (real-time) dynamical application of least squares (LS) regression to a time series of time-stamped, continuously acquired data points. As with LS, there may be several correlation equations with the corresponding set of dependent (observed) variables. RLS is the recursive application of the well-known LS regression algorithm, so that each new data point is taken into account to modify (correct) a previous estimate of the parameters from some linear (or linearized) correlation thought to model the observed system. For RLS with forgetting factor (RLS-FF), acquired data are weighted according to their age, with increased weight given to the most recent data. This is often convenient for adaptive control and/or real-time optimization purposes.
Application example ― With a single correlation equation ― While investigating adaptive control and energetic optimization of aerobic fermenters, I have applied the RLS algorithm with forgetting factor (RLS-FF) to estimate the parameters from the KLa correlation, used to predict the O2 gas-liquid mass-transfer, while giving increased weight to most recent data. Estimates were improved by imposing sinusoidal disturbance to air flow and agitation speed (manipulated variables). The proposed (adaptive) control algorithm compared favourably with PID. Simulations assessed the effect of numerically generated white Gaussian noise (2-sigma truncated) and of first order delay. This investigation was reported at (MSc Thesis):
Question
I am looking into substitution between accrual-based and real earnings management for my master's thesis. Since there is a joint relation between them, I understand I need to run 2SLS for the simultaneous equations. I am clear from the previous literature about the exogenous variables used, but I am not quite clear on the instrumental variables used in the first stage of the regressions.
From my understanding, I believe they use accrual-based earnings management (AM) as the instrumental variable in the real earnings management (RM) equation, and RM as the instrumental variable in the AM equation.
I would really appreciate some insights on the soundness of my understanding on this. Also would be great if I could get some ideas on what IV variables to use.
Good question.
Question
Hi there. I have a question about comparing path estimates. Basically, I have 8 models with the same outcome variable across the 8 models, but different predictors in each model. There are 5 time points of data for both variables in each model. I have constrained the cross-lagged path from the predictor to the outcome variable to be the same at each time point, and so I essentially have one cross-lagged path estimate of interest from each model.
In summary, I have 8 cross-lagged path estimates from 8 different models for the same outcome variable….and I want to compare them. Could you explain how best to go about doing so? I’ve seen two approaches in the literature which don’t actually seem to be widely used: 1) The Cumming approach of testing for 50% overlap in the confidence intervals of the standardized regression coefficients, and 2) The Clogg et al. (1995) approach and calculate z scores from the standardized regression coefficients and their standard errors.
Would you recommend either of these? Thanks!
Also, for a more qualitative comparison between path estimates, surely the standardized regression coefficient is superior to the unstandardized??
Loehlin, J. C., & Beaujean, A. A., Latent Variable Models: An Introduction to Factor, Path, and Structural Equation Analysis (5th ed.).
This is a good reference and addresses this topic as well.
Question
I have a balanced panel of firm data, in total 39 firms throughout 21 quarters. These are a subset of S&P100 firms from the period 2013-2020 and were selected based on earnings call transcript availability and respective newspaper coverage. The regression model has cumulative abnormal returns following quarterly earnings calls as the dependent variable and as independent variables: a variable representing economic sentiment, book-to-market, leverage, firm size, EPS surprise and volatility prior to the earnings call.
I know that clustered standard errors rely on asymptotic arguments, therefore it might not be reliable to draw inference on those since the number of clusters along both dimensions is less than the generally recommended number (40-50 approximately). Nevertheless, I could argue that there well might be unobserved components of the error term that clustering would account for.
The case is similar with fixed effects.
So my question is: what is the recommended procedure? Should I just report regression results with and without clustered standard errors and describe the above concerns, or are there any other arguments I could use as to why/why not to cluster and/or use fixed effects in this case? I know the definitions and what these are used for, I'm rather looking for arguments for or against in this specific case.
As usual explain what you did and say what you found. Best wishes, David Booth
Question
How do we calculate predictive R-squared using SPSS to determine whether a regression is overfitting?
Bruce Weaver Sorry I didn't connect R**2 With the PRESS statistic. Mea Culpa. Thanks for setting it right. Best, David Booth
Question
I am running a binomial regression in R and I'm not getting coefficients for all my variable levels back. How best can I solve this?
From the picture, age group 1 and one diagnosis type are missing.
As an example of what I mean by a simple data set showing why one level of a predictor is set to zero, imagine we have the following data set:
Gender Time_watching_television
F 5
M 7
Can't get more simple than that, right?
And your model is Time = Intercept + Gender + e
Where e is the residual error term.
The model summary will be:
Intercept = 5
GenderMale = 2
There's no Female level, because that's already incorporated into the intercept.
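The toy example can be verified numerically: with Female as the reference level absorbed into the intercept, the fit returns the Female mean as the intercept and the Male-Female difference as the dummy coefficient:

```python
import numpy as np

# Design matrix: intercept column plus a Male dummy (Female = reference level)
X = np.array([[1, 0],   # Female
              [1, 1]])  # Male
y = np.array([5.0, 7.0])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # [5. 2.]: Intercept = 5 (Female), GenderMale = +2
```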
Question
Command for including trend in seemingly unrelated regression in stata.
Dear Adamu Braimah Abille, in Stata you can generate trends by applying filters to the original series (e.g., Hodrick-Prescott). I recommend the command <<tsfilter>>.
I hope it helps,
Question
I ran a regression, and I have trouble understanding the interpretation of a coefficient when some of the independent variables are proportions. It is an FE regression, so I know the interpretation would be something like "for a given fund, as X varies across time by one unit, Y changes by β2 units".
The dependent variable in my regression is excess returns, e.g. "2.5" (or -1.7, etc.). As an independent variable I have, e.g., the "members-to-beneficiaries ratio", expressed as e.g. "0.77", "3.5", or even "125.8". So, for a given fund, as the members-to-beneficiaries ratio varies across time by one unit, the excess return changes by, e.g., -0.345 units. Does this mean that if, in my first example, the ratio changes from 0.77 to 1.77, the excess return changes from 2.5 to 2.155?
Any help on this would be greatly appreciated! Thanks a lot in advance.
Thanks for your input David Eugene Booth & Babak Jamshidi . I just tried to make sense out of what you were saying but I think I still struggle to understand exactly what you mean. I have been reading about units and dimensional analysis too. Sorry I have to ask for further elaboration.
So if I have, e.g., 2.5 (y) and 0.77 (x), then my unit rate is 2.5 / 0.77, where 0.77 is basically a ratio. But the way I understand it then is: 2.5 + (-0.345) / 0.77 + (2.5/0.77). This seems rather difficult to put into words when interpreting the results.
Not sure if I am just not making the link here and am confused because I use a ratio instead of just a regular number (as i.e. just number of members/beneficiaries).
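For the record, the arithmetic in the original question checks out; X being a ratio changes nothing about the mechanics, since the predicted shift is just the coefficient times the change in X:

```python
beta = -0.345            # coefficient on the members-to-beneficiaries ratio
x_old, x_new = 0.77, 1.77
y_old = 2.5              # excess return before the change

predicted_change = beta * (x_new - x_old)  # -0.345 * 1.0 = -0.345
y_new = y_old + predicted_change
print(y_new)  # ~2.155
```

The only subtlety is wording: "one unit" here means the ratio itself rising by 1 (e.g. from 0.77 to 1.77), not a 1% change.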
Question
I would like to know if it is possible to extract a correlation value (r) from a regression value (r²). Is there any calculation that allows this?
Salvatore S. Mangiafico and David Morse thank you very much for the explanations! You helped me a lot!
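Summarizing the thread for future readers: in simple (one-predictor) linear regression, |r| = sqrt(r²) and the sign of r is the sign of the slope; this shortcut does not carry over to multiple regression, where R² pools several predictors. A quick numerical check:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 3.9, 3.2, 2.1, 0.8])  # clearly decreasing

slope, intercept = np.polyfit(x, y, 1)
r = np.corrcoef(x, y)[0, 1]

r_recovered = np.sign(slope) * np.sqrt(r ** 2)
print(r, r_recovered)  # identical; negative because the slope is negative
```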
Question
In adsorption studies, a reviewer suggested that I not use linearized models in the age of computers, when it is possible to perform more precise non-linear regression.
You are most welcome Selvaraju Sivamani.
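For others landing here: the reviewer's point amounts to fitting the isotherm directly by non-linear least squares instead of transforming it to a straight line. A sketch with a Langmuir isotherm and synthetic data (the parameter values are invented):

```python
import numpy as np
from scipy.optimize import curve_fit

def langmuir(C, qmax, K):
    """Langmuir isotherm: adsorbed amount vs. equilibrium concentration."""
    return qmax * K * C / (1 + K * C)

rng = np.random.default_rng(3)
C = np.linspace(0.1, 10, 25)
q_obs = langmuir(C, qmax=4.0, K=0.8) + rng.normal(scale=0.05, size=C.size)

params, cov = curve_fit(langmuir, C, q_obs, p0=[1.0, 1.0])
print(params)  # close to the true [4.0, 0.8]
```

Unlike the linearized forms, this keeps all observations on the original scale and returns parameter covariances directly.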
Question
Hi all,
I have found an association between social distancing attitudes and the willingness to do social distancing, as studied on a Likert type scale (Never to Always). Now there is also an association between what type of work they have and the same response.
My question is: how can I isolate the influence of city from the influence of line of work? I am wondering whether there is a technique for nominal/ordinal variables similar to adjustment/control variables in regression. I would be interested in anything from AI/statistics, preferably usable in R or Excel.
I am also open to collaborations if anyone wants to co-author or similar.
It sounds like you could be analyzing 2-way and 3-way tables with large numbers of cells. What size is your sample?
Question
Hi all,
I am doing a moderated mediation regression in process (model 7). I have covariates with a lot of levels, (e.g., 5) and created dummies.
Now, when I run the PROCESS model, I get the error: NOTE: Due to estimation problems, some bootstrap samples had to be replaced. The number of times this happened was: 2740.
When I run the model with the categorical variables (no dummies), I do not get the error, but then I do not have dummies. What would be the most simple option for now?
Kind regards,
Maayke
How large are the groups you are comparing with each dummy variable?
Question
Hi All,
I am doing research on regression test case prioritization with a neuro-fuzzy system.
Are there any suggestions on how I can implement this in MATLAB or on any other platform?
Thanks
Dear Neoaz Mahfuz,
I suggest you to see links and attached files on topic.
Question
seemingly unrelated regression with STATA when we have just one dependent variable
Imen Slimi, if you have got too few observations, you would then consider grouping the countries based on some cultural indices (e.g., Hofstede). See here: https://www.hofstede-insights.com/product/compare-countries/
Question
EDIT: Based on the literature suggested in the answers, IT IS NOT POSSIBLE, because at least some calibration data are required, which in my case are not available.
I am looking for a technique/function to estimate soil temperature from meteorological data only, for soils covered with crops.
In particular, I need to estimate soil temperature for a field with herbaceous crops at mid-latitudes (north Italy), but the models I found in literature are fitted for snow-covered and/or high-latitude soils.
I have daily values of air temperature (minimum, mean and maximum), precipitation, relative humidity (minimum, mean and maximum), solar radiation and wind speed.
Thank you very much
Question
Hello, I would like to regress the dependent variable GDPpc (GDP per capita growth) on the independent variable GINI (Gini Index). For GDPpc to reach its stationary form, its first difference had to be taken; GINI, however, was already stationary.
I tried two different regressions. One with GDPpc and GINI being both in their first difference and one with only GDPpc being in its first-difference.
The first regression (d1_GDPpc, d1_GINI) yields a really high p-value, while the second regression (d1_GDPpc, GINI) yields a low enough p-value for the effects to be statistically significant.
My questions are the following:
Is it possible to regress variables in their first difference with variables in their level? Which regression model should I choose?
I do agree with Babak Jamshidi.
Question
Hello everyone
This question may seem absurd or look like kind of a p-hacking method but I'm gonna explain.
I am currently working on mangrove ecosystems, and I have a range of 14C calibrated dates that I use to estimate the elevation of sea level over a certain period (380 to 6850 BP). I performed my regression analysis and obtained my trends, and now I would like to compare my results with those of other studies. I found a good article with lots of data that I could use in my discussion. However, although the authors drew a dot plot with trends and estimated elevation rates of mean sea level (MSL), they did not report confidence intervals for the estimated rates. Since I have the dataset they used, I would like to obtain those intervals to see whether my elevation rates fall within them. The problem is that when I use their data and perform the regressions on the mentioned intervals, I cannot reproduce the trends they found. So I wonder: is there a way to specify the regression coefficient and let R find the subset of data that fits it?
I also wonder whether there is another way to do this, other than regression analysis and confidence intervals.
Mathis
Don't try to manipulate data in order to get desired (or published) results.
Contact the authors. Tell them what you tried and what your problem is. Provide the source and the data you used and the analysis you did. Ask whether the data and analysis are correct or whether you missed something. Ask for everything required to reproduce their result. Maybe there is a mistake in the dataset you have or in the way you analyze the data, or the authors did not correctly describe the complete analysis, or they made a mistake somewhere.
Question
Hello,
I am writing a thesis on the relationship between income inequality and economic growth. I want to use educational attainment as a control variable, but it does not have values for some years. Specifically, my paper's time span is 1961 to 2018, and the variable has missing values for only 2 years: 1961 and 1963. Will there be any issues using this variable in the regression, given that only 2 values are missing?
If they are missed at random (and you can assume that in this context), there is no problem. Another solution in this context would be linear interpolation, but I would stick to the former suggestion.
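If you do fill the gaps, note the difference between the two missing years: 1963 lies between observed years and interpolates cleanly, while 1961 is at the start of the series, so it can only be back-filled with the first observed value (or dropped). A sketch with invented values:

```python
import numpy as np

years = np.arange(1961, 1971)
# Educational attainment with 1961 and 1963 missing (values are made up)
values = np.array([np.nan, 52.0, np.nan, 54.1, 55.0, 55.8, 56.5, 57.1, 57.9, 58.6])

mask = ~np.isnan(values)
filled = np.interp(years, years[mask], values[mask])
print(filled[:4])  # 1961 takes the 1962 value; 1963 lands midway between neighbours
```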
Question
I've established that the relationship between customer satisfaction and e-WOM intention is U-shaped (quadratic regression). In my model, triggers moderate this relationship, but I don't know how to prove it, because PROCESS in SPSS only handles linear regression.
Question
I am running a linear regression with three-way interaction in R
lm(A~X*Y*Z), where A is a numerical variable and X, Y, Z are all categorical factors.
X = 5 levels, Y = 2 levels, Z = 4 levels. Every time I run the regression, the three-way interaction for the last level is missing; for example, if I relevel the Z factor, its last level gets dropped in the three-way interaction.
Coefficients: (8 not defined because of singularities). (This is mentioned in the R output)
I have tried using a zero intercept, but it did not make any difference:
lm(A~0+X*Y*Z) or lm(A~X*Y*Z-1) and all other possible combinations.
I need three-way interaction results to make a conclusion about my data.
So you believe all those coefficients are worth estimating? Were people (or cases) allocated to all of these cells as an experimental design, or are these naturally occurring groups? What is your minimum cell size (for the 40 cells)? What is your sample size? With that many cells, as David Eugene Booth says, to estimate the full factorial you need at least a few observations in each cell (so one option, if the three-way interaction is not critical to the theory you are examining, is not to estimate it).
Most important, you haven't said what research questions you are trying to address. The short answer is that your data are not complete enough for the model you are testing, but the better answer you should give is why you chose this specific model.
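A quick pre-fit diagnostic for this situation: count observations in each of the 5 x 2 x 4 = 40 cells, since every empty cell yields one inestimable interaction coefficient (R's "not defined because of singularities"). A pure-Python sketch with hypothetical data:

```python
from itertools import product
from collections import Counter
import random

random.seed(0)
levels_x = ["x1", "x2", "x3", "x4", "x5"]
levels_y = ["y1", "y2"]
levels_z = ["z1", "z2", "z3", "z4"]

# Hypothetical sample: 120 observations spread over the 40 cells
data = [(random.choice(levels_x), random.choice(levels_y), random.choice(levels_z))
        for _ in range(120)]

counts = Counter(data)
empty = [cell for cell in product(levels_x, levels_y, levels_z) if counts[cell] == 0]
print(len(empty), "of 40 cells are empty")
```

In R the equivalent check is table(X, Y, Z) == 0; any TRUE entries mark interaction coefficients that lm() cannot estimate.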
Question
I am creating 5 different regression models to measure a single phenomenon, tuition discounts. Tuition discounts are my independent variable in each regression. The Dependent variables are revenue, SAT scores, and percentage of underrepresented races. Each regression also has control variables such as net tuition price, endowment, total enrollment. My question is: can I use percentage of race as a control variable for my regression measuring the relationship between tuition discounts and SAT scores? Percentage of race is a good predictor for SAT scores so it makes sense to use as a control, but it is a dependent variable in one of my other regressions. It seems a little cannibalistic to use the variable in two places.
Yes, you can; see Design of Experiments.
Question
Excuse me, I know this may sound silly, but I'm having a hard time finding out how to calculate the CC50 of a compound using Prism. I know I have to transform the tested concentrations via X = log(X), normalize the measured absorbance into percentage values, and then fit a non-linear regression curve, but which equation do I choose when setting the parameters for the non-linear regression?
It's easy to calculate the IC50, since I'm addressing inhibition and there are clear equation options for inhibition, but when I'm trying to assess cytotoxicity, I guess the idea is a bit different, and I can't find anything elucidating this issue. Could anyone help me with this, please?
This video explains how: https://youtu.be/7NgRqXSByFo
Though the video is about finding IC50, the same method can be used to find the CC50
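Spelling out what the linked video does, in SciPy rather than Prism: fit a four-parameter logistic to normalized viability against log10 concentration; the CC50 is the concentration at the curve's midpoint. All numbers below are invented:

```python
import numpy as np
from scipy.optimize import curve_fit

def four_pl(logc, bottom, top, log_cc50, hill):
    """Four-parameter logistic: viability (%) falling with log10 concentration."""
    return bottom + (top - bottom) / (1 + 10 ** ((logc - log_cc50) * hill))

rng = np.random.default_rng(7)
logc = np.linspace(-1, 3, 12)  # log10 of the tested concentrations
viability = four_pl(logc, 5, 100, 1.3, 1.0) + rng.normal(scale=2, size=logc.size)

params, _ = curve_fit(four_pl, logc, viability, p0=[0, 100, 1, 1])
cc50 = 10 ** params[2]  # back-transform the midpoint to a concentration
print(cc50)
```

In Prism this corresponds to the "log(inhibitor) vs. response, variable slope" family; the same sigmoid works for cytotoxicity because only the direction of the response differs.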
Question
I use random-effects Tobit regression for my data with 131,511 observations. The coefficient is 0.002, significant at the 5% level. The reviewer thinks the results are not very strong and suggested that I use t-statistics adjusted for a large sample. Can anyone give me some comments on this? Thank you very much!
Hello Thao,
Statistical significance and practical significance are different creatures altogether. It sounds as if the reviewer is expressing concern that the magnitude of the statistically significant coefficient is so small as not to represent an important / meaningful difference.
I would suggest that the way to approach this is not with a different statistical test (though, as Professor Booth points out, we're pretty much having to presume that your use of a Tobit model was suitable in the first place!). Instead, look to the related literature for what other researchers would classify as differences that are noteworthy (and not just statistically significant); that's the kind of benchmark that would help.
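To make that point concrete, here is a small simulation (pure numpy, fabricated data, an OLS rather than Tobit setup, purely for illustration): with a sample in the hundred-thousands, a slope that explains almost none of the variance is still highly statistically significant.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 131_511                          # same order as the sample size above
x = rng.normal(size=n)
y = 0.02 * x + rng.normal(size=n)    # tiny true slope, lots of noise

# OLS slope, standard error, t statistic, and R^2 computed by hand
beta = np.cov(x, y, bias=True)[0, 1] / np.var(x)
alpha = y.mean() - beta * x.mean()
resid = y - alpha - beta * x
se = np.sqrt(resid.var(ddof=2) / (n * np.var(x)))
t_stat = beta / se
r2 = 1.0 - resid.var() / y.var()
# t_stat is large (p well below 0.01) even though r2 is essentially zero:
# statistical significance, but almost no practical significance
```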
Question
I am using the GMM technique for panel data. I have eight variables in my model. For some variables, I am coming across different signs for the correlation and regression coefficients: for example, one of my variables has a positive coefficient in the regression model, but when I plot the correlation matrix, it shows a negative correlation with the dependent variable. I need to know if this is a point of concern for the model, and if it is, what the solution to this problem might be.
Hello Zaira,
David Eugene Booth 's answer is absolutely correct. If it turned out to be the case that all of your independent variables were uncorrelated with one another, then you'd never see a result in which regression coefficients took on a different sign than the zero-order (ordinary Pearson) correlations with the DV.
However, when there is redundancy among the IVs, it is quite possible that both the magnitude and sign of regression coefficients change relative to what you might have anticipated based on the zero-order correlations.
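A tiny simulated example (numpy, fabricated data) of exactly this phenomenon: x1 is positively correlated with y at the zero-order level, yet its multiple-regression coefficient is negative, purely because x1 and x2 are highly correlated with each other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(size=n)
x2 = x1 + 0.5 * rng.normal(size=n)             # x2 strongly correlated with x1
y = 2.0 * x2 - 1.0 * x1 + 0.1 * rng.normal(size=n)

r_x1y = np.corrcoef(x1, y)[0, 1]               # zero-order correlation: positive
X = np.column_stack([np.ones(n), x1, x2])
coefs = np.linalg.lstsq(X, y, rcond=None)[0]   # multiple regression: x1's slope is negative
```

So a sign disagreement between the correlation matrix and the regression output is not by itself a model error; it is the expected behaviour when the regressors are redundant, and the usual follow-up is to inspect the collinearity among the IVs.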
Question
Hello,
Apart from CNN models, is there any deep learning-based network that works on image-to-image regression problems?
I will be highly appreciative if one can provide me with new ideas.
Thanks a lot, dear Niloy. In fact, I am supposed to map a 100*9 2D tensor (input) onto a 900*1 1D tensor (output). BTW, it is a regression task with supervised learning, not a classification one. I've already developed accurate models using CNNs and am going to test other structures.
Question
I'm a beginner in the field of econometrics. I'm currently working on a Difference-in-Differences regression, and I have a (probably very basic) question in this regard:
- In what way are the coefficients normalized in a static Difference-in-Differences regression?
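In case a worked sketch helps: a static two-group, two-period DiD is just an OLS regression with a treatment dummy, a post dummy, and their interaction, and every coefficient is measured relative to the omitted baseline (control group, pre-period). A minimal simulation with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000
treat = rng.integers(0, 2, size=n).astype(float)
post = rng.integers(0, 2, size=n).astype(float)
# True treatment effect in the post period: 3.0
y = 1.0 + 0.5 * treat + 2.0 * post + 3.0 * treat * post + rng.normal(size=n)

X = np.column_stack([np.ones(n), treat, post, treat * post])
b = np.linalg.lstsq(X, y, rcond=None)[0]
# b[0]: control-group pre-period mean -- the baseline the others are "normalized" to
# b[1], b[2]: group and period shifts relative to that baseline
# b[3]: the DiD estimate, close to the true effect of 3.0
```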
Question
My results are statistically significant, but the parameter estimates are negative. I have attached my results. Both my independent and dependent variables are Likert-scale items from 1 (strongly disagree) to 5 (strongly agree), and that is why they are scored 1-5.
Thank you
@Seyed Mahdi Amir Jahanshahi
I am facing the same problem. Do you know how I could solve this issue?
Question
For a current research project, I am dealing with a large acquisition dataset which in many cases includes multiple events/acquisitions per firm. The target is to calculate the Buy-and-Hold Abnormal Returns (BHAR) of the companies' stock for a 2-year period following the event and to check for statistical significance of the BHAR in connection with financial and non-financial variables.
When running a regression with all events as stand-alone data points (i.e. counting the same firm multiple times with multiple events), I am not getting any meaningful results. When, however, I use a pivot function to summarise the results on a per-company basis (i.e. take the average BHAR per company and count each firm only once), I obtain a statistically significant regression.
I have been reading the relevant literature but have not found any views on whether this approach is scientifically permitted. Does anyone have a view on whether averaging event-study results on a per-company basis is scientifically accepted?
Hi M. Susen , thanks for your question. It's valid to pick a 24-month holding period; the key item to note is that if the same firm has multiple acquisitions within the 24 months, then the exit price may not distinguish between the multiple acquisition events. However, it is often rare for a firm to have multiple acquisitions within a 24-month period. As such, if you eliminate firms that have more than one acquisition event within a 24-month period, you will still have a sizeable sample and you will also have solved the confounding-of-events problem. I am not sure a robustness test would solve the problem as such, since your choice of holding period is not an accompanying assumption but rather a key characteristic of your dependent variable. The other alternative is to split the sample into two: firms with multiple acquisition events and firms without multiple acquisition events.
Question
My question is this: if I have data on employee satisfaction at time 1 and time 2, and I want to predict say leaving intention at time 2, could response surface analysis with polynomial regression be used?
Thanks.
Gerry
Sorry, I am a bit late to the party, but you had better see the section on Response Surface Analysis in Montgomery's Design and Analysis of Experiments (available at z-library; Google for instructions). It doesn't look like this is what you need. Best wishes, David Booth
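For what it's worth, the regression side of the polynomial/response-surface approach is easy to sketch: regress the outcome on both satisfaction scores, their squares, and their interaction, which is the standard second-order polynomial model used in response-surface analysis. A minimal numpy illustration with simulated (made-up) data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
s1 = rng.normal(size=n)   # e.g. satisfaction at time 1
s2 = rng.normal(size=n)   # e.g. satisfaction at time 2
# Simulated outcome depending on levels, curvature, and the s1*s2 interaction
y = (1 + 0.5 * s1 - 0.8 * s2 + 0.3 * s1**2 - 0.6 * s1 * s2 + 0.3 * s2**2
     + rng.normal(0.0, 0.5, n))

# Second-order (response-surface) design matrix: 1, s1, s2, s1^2, s1*s2, s2^2
X = np.column_stack([np.ones(n), s1, s2, s1**2, s1 * s2, s2**2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
# b recovers the linear, quadratic, and interaction terms of the surface
```

Whether this answers the original substantive question (congruence/discrepancy of the two satisfaction measures predicting leaving intention) is a separate issue, as the reply above notes.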
Question
If in a multivariate model we have several continuous variables and some categorical ones, we have to change the categoricals to dummy variables containing either 0 or 1.
Now to put all the variables together to calibrate a regression or classification model, we need to scale the variables.
Scaling a continuous variable is a meaningful process, but doing the same with columns containing 0 or 1 does not seem ideal: the dummies will not get their "fair share" of influence on the calibrated model.
Is there a solution to this?
Monika Mrozek I think that, based on what Johannes Elfner shared, it makes sense NOT to scale the discrete variables.
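A minimal numpy sketch of that recommendation (toy numbers): z-score only the continuous columns and leave the 0/1 dummies untouched, so they keep their interpretable coding.

```python
import numpy as np

# Toy design matrix: columns 0-1 are continuous, column 2 is a 0/1 dummy
X = np.array([[170.0, 65.0, 1.0],
              [180.0, 90.0, 0.0],
              [160.0, 55.0, 1.0],
              [175.0, 80.0, 0.0]])
cont = [0, 1]                           # indices of the continuous columns

Xs = X.copy()
mu = Xs[:, cont].mean(axis=0)
sd = Xs[:, cont].std(axis=0)
Xs[:, cont] = (Xs[:, cont] - mu) / sd   # standardize continuous columns only
# Column 2 stays exactly 0/1, so its coefficient remains the group contrast
```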
Question
I want to make an empirical forecast model on the basis of hypotheses or other statistical parameters, like Ordinary Linear Regression and OG Regression. If someone has this, please send it to me. I want to learn the methodology, because I am not a statistics student.
Thanks
Interesting
Question
Hi,
I am currently working on a research project with the title "How does internet use affect trust? An intergenerational comparison." I'm essentially running an ordered logit model, with the dependent variable trust being made up of 11 categories. The main independent variable is also split into 4 categories of daily internet use and there are also 4 age generation categories and a range of other controls.
I've run the regression model and done a preliminary analysis using odds ratios. However, I essentially want to extract the marginal effects of higher internet use compared to lower internet use on the probability of a high-trust outcome, P(yi > 5), and whether this varies between generations.
Thank you, and please let me know if any further details are required or if this is unclear!
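One common way to get at this is to compare predicted probabilities rather than odds ratios: in an ordered logit, P(y > k) = 1 − Λ(τ_k − x'β), where Λ is the logistic CDF, so the effect of higher vs. lower internet use on P(y > 5) is just the difference in that probability between the two profiles. A minimal Python sketch with purely hypothetical coefficient values (the cutpoint τ_5 and the linear indices below are made up, not from any real model):

```python
import math

def p_greater(tau_k, xb):
    """Ordered logit: P(y > k) = 1 - logistic CDF at (tau_k - x'beta)."""
    return 1.0 - 1.0 / (1.0 + math.exp(-(tau_k - xb)))

tau_5 = 0.8                          # hypothetical estimated cutpoint for category 5
p_high = p_greater(tau_5, xb=0.6)    # hypothetical linear index, heavy internet use
p_low = p_greater(tau_5, xb=0.2)     # hypothetical linear index, light internet use
effect = p_high - p_low              # discrete "marginal effect" on P(y > 5)
```

In Stata this contrast is what `margins` computes after `ologit`; interacting internet use with the generation dummies and computing the same contrast within each generation gives the between-generation comparison.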
Question
Dear Collegaues,
I hope all is well.
Thank you very much for your help
I am used to STATA and very new to using R package.
STATA has lasso inference for linear and logistic regression. However, it doesn't have LASSO features for Cox regression.
I wonder if I can use R to do LASSO inference for a Cox regression model?
I am literally very new to R and would appreciate if you can help me do syntax in R for my model.
I am sorry that I am very naive in R.
If I am using STATA, I would do the following to produce the cox model:
1)stset PTIME, failure(PSTATUS)
2)stcox i.sex BMI_TCR COLD_ISCH_KI SERUM_CREAT END_CPRA i.ETHCAT AGE AMIS BMIS i.STEROIDS_MAINT AGE_DON i.DIAB i.dgf
3)estat phtest (to test proportional hazard assumptions)
I wonder what is the syntax to do the same in R ?
Also, what are the syntaxes to use this model to perform LASSO inference for Cox hazard regression?
Finally, how do I do the post-estimation tests after fitting the LASSO inference for Cox hazard regression?
Thank you very much for your help.
Looking forward to hearing back from you
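Not a definitive answer, but a minimal sketch of how that Stata workflow is usually translated to R, assuming the `survival` and `glmnet` packages are installed; `df` and `covariate_columns` are placeholders for your data frame and covariate list, with variable names carried over from your Stata code:

```r
library(survival)
library(glmnet)

# 1) The analogue of `stset` is a Surv() response object
y <- Surv(df$PTIME, df$PSTATUS)

# 2) Unpenalized Cox model (analogue of `stcox`); factor() marks the i. variables
fit <- coxph(y ~ factor(sex) + BMI_TCR + COLD_ISCH_KI + SERUM_CREAT +
               END_CPRA + factor(ETHCAT) + AGE + AMIS + BMIS +
               factor(STEROIDS_MAINT) + AGE_DON + factor(DIAB) + factor(dgf),
             data = df)

# 3) Proportional-hazards test (analogue of `estat phtest`)
cox.zph(fit)

# LASSO for the Cox model: glmnet with family = "cox";
# cv.glmnet chooses the penalty by cross-validation
x <- model.matrix(~ ., data = df[, covariate_columns])[, -1]
cvfit <- cv.glmnet(x, y, family = "cox")
coef(cvfit, s = "lambda.min")   # coefficients retained by the LASSO
```

Note that plain glmnet gives penalized estimation and variable selection rather than the post-selection inference Stata's `lasso` commands report; for formal LASSO inference after a Cox fit you would need an add-on approach, so treat the above only as the fitting skeleton.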