Applied Data Science and Machine Learning

"Long and continuing experience indicates cle...

2018-04-24T15:08:04.991-07:00

"Long and continuing experience indicates clearly that the most productive use of time in such work is that dedicated to data preparation." -

I recently participated in the Kaggle Toxic Comments competition, and I was really surprised by two things: 1. everyone in the top 1000 entries or so had above 95% accuracy, and 2. the teams who won basically used the same models as

Sorry, I am here for technical purposes.

2017-06-15T03:37:49.933-07:00

Sorry, I am here for technical purposes.

Why do so many people focus, as you say, “almost t...

2017-04-25T20:31:14.126-07:00

Why do so many people focus, as you say, “almost to the point of exclusivity,” on the learning algorithms, especially the latest and the greatest? As you point out, this is a problem, but I don’t think it is limited to the narrow scope of data mining algorithms. Consider this statement by Karl Popper, “we are not students of some subject matter, but students of problems. And problems may cut

I think it makes perfect sense. Model training sho...

2017-04-25T18:55:26.180-07:00

I think it makes perfect sense. Model training should be an ongoing process in a rapidly changing world. It's just like how humans learn: we learn new things every day, and that's how we progress. What would it be like if we decided to learn all the things we need for maybe 22 years and then, once we graduated from college, refused to learn anything new and always made decisions based on the

If the most expensive resource is the time of the ...

2017-04-25T17:49:06.277-07:00

If the most expensive resource is the time of the human analyst, how is letting the model do the significant extra work of filling in the gaps in the missing data itself a bad thing? I do not condone this as good practice (it's terrible, I agree), but the analyst's time is saved when they don't have to deal with the missing values themselves.

Of course, I am assuming

Cool post. Recently I learned from a paper (http:/...

2017-04-24T21:18:07.356-07:00

Cool post. Recently I learned from a paper (http://infolab.stanford.edu/~west1/pubs/West-Precup-Pineau_CIKM-10.pdf) that PCA can be used in other ways besides just dimensionality reduction. In the paper they use PCA to identify topics that are missing in an input document. They use a background corpus of documents to fill in values of a matrix and then perform PCA on the matrix. They say the

thanks for your comments. You are correct that the...

2016-04-18T07:03:54.536-07:00

Thanks for your comments. You are correct that there is a bit of hyperbole going on with the title. The "dangerous" label would only apply if the model is actually used, of course.

What I'm most uncomfortable with in this post is how to detect the problems. Yes, there are obvious visual cues and yes we can examine training/testing accuracy metrics (for consistency...but

Your title drew me in. I certainly agree that over...

2016-04-18T06:37:25.969-07:00

Your title drew me in. I certainly agree that overfitting is more dangerous than poor accuracy. I would also suggest that poor accuracy isn't very dangerous, making your assertion not terribly surprising (not to say it isn't a valid point, of course). If you create a model (or your learning algorithm does it for you) and the model performs poorly, you know it performs poorly up-front.

What it really boils down to is that a lot of data...

2016-04-16T21:13:34.998-07:00

What it really boils down to is that a lot of data science is all about asking questions about what you see -- and a formal degree in a data science field may or may not give you the skills to ask the right questions, look at the data the right way, or try the right experiments. As explained in the case with Target, domain knowledge and a "forensic mindset" are some of the most critical

Thank you for sharing your thought. Overfitting re...

2016-04-16T19:35:29.803-07:00

Thank you for sharing your thoughts. Overfitting really is a big problem when only a few samples are used to build the model. Take L4G, where only 10 instances were used and only one fit the target. It could be very dangerous to train an overfit model on these data, because the model will fit the 9 wrong instances and stay away from the true target. This problem will be less severe if we have enough

I really liked the idea of classifying images from...

2016-04-14T18:47:05.862-07:00

I really liked the idea of classifying images from Google Street View by the city they came from. It reminded me of a different article that I read here: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42241.pdf in which the authors use a deep convolutional neural network to read numbers from street signs.

A convolutional neural network takes

Completely agree with you that each field builds u...

2016-04-14T12:40:30.865-07:00

Completely agree with you that each field builds up its own vocabulary and shorthand descriptions for the concepts important to the field. And, generally, there is very little incentive to build up a broader vocabulary because, most of the time, practitioners in one discipline don't communicate with other fields or even need to.

The biggest advantage of this kind of

This article I think is highlighting a challenge w...

2016-04-14T11:14:24.904-07:00

This article I think is highlighting a challenge with cross-disciplinary collaboration in general -- the fact is, so much of what everybody does in the world in everyone's various fields of study overlaps quite a bit, but we all like to talk about it in our own way, with our own vocabulary. Learning to overcome this divide is very powerful, because I believe that some of the most creative

Hi I am a Mater (Master in computer and informati...

2016-03-21T02:30:59.851-07:00

Hi, I am a Master's student (Master in Computer and Information Systems, MCIS). I want to complete my thesis on student performance prediction and analysis using data mining, so I need a large student dataset.

Please help me by forwarding a dataset related to student performance to my email address: mukeshjswl7@gmail.com

Just posted this reply to a related question about...

2015-12-05T20:20:43.316-08:00

Just posted this reply to a related question about k-means clustering on Data Science Central, and thought it would be a good addition here (http://www.analyticbridge.com/forum/topics/k-means-clustering), where the question was: "How is 'k' determined in k-means clustering (using FASTCLUS)?"

If there is not an operational definition for the number of clusters, yes,

I mean the algorithms themselves, not the people. ...

2015-11-16T15:51:34.645-08:00

I mean the algorithms themselves, not the people. An algorithm like a decision tree doesn't generate a 1/0 answer for binary classification (nor does any other algorithm). They generate probabilities (or some other number between 0 and 1). So I'm not addressing people in the loop at all here, though that is a good question to ask. I'm only addressing how to utilize what the algorithms produce.
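To illustrate the point about scores versus decisions, here is a minimal sketch in plain Python. The scores, the `decide` helper, and both thresholds are made up for illustration; they are not taken from the post or from any particular library. The only thing it shows is that the 1/0 answer comes from a cutoff applied to the score, not from the algorithm itself.

```python
def decide(scores, threshold=0.5):
    """Turn model scores in [0, 1] into 1/0 decisions using a cutoff.

    The threshold is a business choice, not something the model produces.
    """
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical scores, e.g. the positive-class proportions in four tree leaves.
scores = [0.92, 0.55, 0.48, 0.10]

default = decide(scores)        # cutoff at 0.5
strict = decide(scores, 0.9)    # act only on the most confident cases

print(default)
print(strict)
```

The same scores give different decisions under the two cutoffs, which is why the choice of threshold (or of a top-N ranking) belongs to whoever uses the model.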

Great post Dean, thanks! On your point about algor...

2015-11-16T11:19:52.896-08:00

Great post Dean, thanks! On your point about algorithms not producing decisions/solutions - do you mean that there is always a person who has to interpret the information the algorithm produces? Is it feasible in the near future for a complex algorithm to actually produce a "decision" or action? I think there are companies trying to produce

Mining of data in general terms can be elaborated ...

2015-09-10T00:18:00.378-07:00

Data mining, in general terms, can be described as retrieving useful information or knowledge for further analysis from various perspectives, and summarizing it into valuable information that can be used to increase revenue, cut costs, or gather competitive information on a business or product. predictive analytics solutions, and does the

Hello, I have gone through your posts and came...

2015-05-09T04:59:05.364-07:00

Hello,

I have gone through your posts and found them informative. I am already a reader of your RSS feed, and I will be following you throughout my research. Thanks for providing the information.

RSS feed

This is a great quote of the day. Thanks for shari...

2014-12-11T08:54:56.697-08:00

This is a great quote of the day. Thanks for sharing! I also wanted to mention that if anyone's interested in learning more about predictive analytics I would strongly recommend visiting the Modern Analytics website. They have an incredible predictive analytics software solution that is revolutionizing the way that companies predict

a few comments... first, the correlation filterin...

2014-11-04T01:55:32.143-08:00

a few comments...

first, the correlation filtering as I envisioned it here is intended to remove redundant variables, that is, variables with extremely high correlation magnitudes that indicate the pairs of variables contain the same information. Correlations lower than 0.9, or maybe 0.8 at the low end, I wouldn't touch using this method.
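A minimal sketch of that kind of filter, in plain Python, might look like the following. The helper names (`pearson`, `drop_redundant`), the toy columns, and the rule of keeping the first-seen variable of each redundant pair are all assumptions made for illustration; only the 0.9 cutoff on correlation magnitude comes from the comment above.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / (sx * sy)


def drop_redundant(columns, threshold=0.9):
    """Return the names of the columns to keep.

    columns maps a variable name to its list of values. A variable is
    dropped when its correlation magnitude with an already-kept variable
    is at or above the threshold; lower correlations are left alone.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data: height_in carries the same information as height_cm.
data = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34, 21, 45, 29, 52],
}

kept = drop_redundant(data)
print(kept)
```

Here `height_in` is removed because it is almost perfectly correlated with `height_cm`, while `age` stays because its correlation with height is well below the cutoff. Which member of a redundant pair to keep is a separate judgment call that this sketch does not address.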

significance is largely a result

Hello I have a question regarding removing variabl...

2014-11-03T23:13:03.405-08:00

Hello I have a question regarding removing variables before modelling.

When removing variables, should the significance of correlation play any role in whether or not I remove the variables? Or does removing variables solely depend on the correlation coefficient?

Should I choose a range that I do not want my coefficients to exceed? Can I say for example that I want to
