Applied Data Science and Machine Learning: comment feed
Blog author: Dean Abbott (http://www.blogger.com/profile/16818000233889520746)
Feed updated: 2024-03-02T01:02:21.655-08:00
Generator: Blogger. 537 comments in total; this page shows 25, starting at comment 1.

2018-04-24T15:08:04.991-07:00
"Long and continuing experience indicates clearly that the most productive use of time in such work is that dedicated to data preparation."

I recently participated in the Kaggle Toxic Comments competition, and I was really surprised by two things: (1) everyone in the top 1000 entries or so had above 95% accuracy, and (2) the teams who won basically used the same models as [...]
Posted by Unknown

2017-06-15T03:37:49.933-07:00
Sorry, I am here for technical purposes.
Posted by Anonymous

2017-04-25T20:31:14.126-07:00
Why do so many people focus, as you say, “almost to the point of exclusivity,” on the learning algorithms, especially the latest and the greatest? As you point out, this is a problem, but I don’t think it is limited to the narrow scope of data mining algorithms. Consider this statement by Karl Popper, “we are not students of some subject matter, but students of problems. And problems may cut [...]
Posted by Brandon

2017-04-25T18:55:26.180-07:00
I think it makes perfect sense. Model training should be an ongoing process in a rapidly changing world.
Just like how humans learn, we learn new things every day, and that's how we progress. What's it gonna be like if we decide to learn all the things we need for maybe 22 years, and once we graduate from college, we refuse to learn anything new and always make decisions based on the [...]
Posted by Anonymous

2017-04-25T17:49:06.277-07:00
If the most expensive resource is the time of the human analyst, how is letting the model do the significant extra work of filling in the gaps in the missing data itself a bad thing? I do not condone this as good practice (it’s terrible, I agree), but the time of the human analyst is saved when they don’t have to deal with the missing values themselves.

Of course, I am assuming [...]
Posted by Jane

2017-04-24T21:18:07.356-07:00
Cool post. Recently I learned from a paper (http://infolab.stanford.edu/~west1/pubs/West-Precup-Pineau_CIKM-10.pdf) that PCA can be used in other ways besides just dimensionality reduction. In the paper they use PCA to identify topics that are missing in an input document. They use a background corpus of documents to fill in values of a matrix and then perform PCA on the matrix. They say the [...]
Posted by Grant

2016-04-18T07:03:54.536-07:00
Thanks for your comments. You are correct that there is a bit of hyperbole going on with the title. The "dangerous" label would only apply if the model is used, of course.

What I'm most uncomfortable with in this post is how to detect the problems.
Yes, there are obvious visual cues, and yes, we can examine training/testing accuracy metrics (for consistency...but [...]
Posted by Dean Abbott

2016-04-18T06:37:25.969-07:00
Your title drew me in. I certainly agree that overfitting is more dangerous than poor accuracy. I would also suggest that poor accuracy isn't very dangerous, making your assertion not terribly surprising (not to say it isn't a valid point, of course). If you create a model (or your learning algorithm does it for you) and the model performs poorly, you know it performs poorly up-front.
Posted by pickettbd

2016-04-16T21:13:34.998-07:00
What it really boils down to is that a lot of data science is all about asking questions about what you see -- and a formal degree in a data science field may or may not give you the skills to ask the right questions, look at the data the right way, or try the right experiments. As explained in the case with Target, domain knowledge and a "forensic mindset" are some of the most critical [...]
Posted by Brooke

2016-04-16T19:35:29.803-07:00
Thank you for sharing your thoughts. Overfitting really is a big problem when only a few samples are used to build the model. Like L4G, only 10 instances are used and only one fits the target. It could be very dangerous to train an overfit model on these data, because the model will fit those 9 wrong instances and stay away from the true target.
This problem will be less severe if we have enough [...]
Posted by Anonymous

2016-04-14T18:47:05.862-07:00
I really liked the idea of classifying images from Google Street View by the city from which they came. It reminded me of a different article that I read here: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42241.pdf in which the authors use a deep convolutional neural network to read numbers from street signs.

A convolutional neural network takes [...]
Posted by Miller

2016-04-14T12:40:30.865-07:00
Completely agree with you that each field builds up its own vocabulary and shorthand descriptions for the concepts important to the field. And, generally, there is very little incentive to build up a broader vocabulary because, most of the time, practitioners in their own discipline don't communicate with other fields or even have the need to.

The biggest advantage of this kind of [...]
Posted by Dean Abbott

2016-04-14T11:14:24.904-07:00
This article, I think, highlights a challenge with cross-disciplinary collaboration in general -- the fact is, so much of what everybody does in their various fields of study overlaps quite a bit, but we all like to talk about it in our own way, with our own vocabulary.
Learning to overcome this divide is very powerful, because I believe that some of the most creative [...]
Posted by skschmi

2016-03-21T02:30:59.851-07:00
Hi, I am a Master's student (Master in Computer and Information Systems, MCIS). I want to complete my thesis on student performance prediction and analysis using data mining, so I need a large student dataset.

Please help me by forwarding a dataset related to student performance to my mail address: mukeshjswl7@gmail.com
Posted by Mukesh

2015-12-05T20:20:43.316-08:00
Just posted this reply to a related question about k-means clustering on Data Science Central, and thought it would be a good addition here (http://www.analyticbridge.com/forum/topics/k-means-clustering), where the question was: "How is 'k' determined in k-means clustering (using FASTCLUS)?"

If there is not an operational definition for the number of clusters, yes, [...]
Posted by Anonymous

2015-11-16T15:51:34.645-08:00
I mean the algorithms themselves, not the people. An algorithm like a decision tree doesn't generate a 1/0 answer for binary classification (nor does any other algorithm). They generate probabilities (or some other number between 0 and 1). So I'm not addressing people in the loop at all here, though that is a good question to ask.
I'm only addressing how to utilize what the algorithms produce.
Posted by Anonymous

2015-11-16T11:19:52.896-08:00
Great post Dean, thanks! On your point about algorithms not producing decisions/solutions: do you mean that there is always a person who has to interpret the information the algorithm produces? Is it feasible in the near future for a complex algorithm to actually produce a "decision" or action? I think there are companies trying to produce (http://www.modernanalytics.com/) [...]
Posted by Anonymous

2015-09-10T00:18:00.378-07:00
Data mining can, in general terms, be described as retrieving useful information or knowledge for further analysis from various perspectives, and summarizing it into valuable information used to increase revenue, cut costs, or gather competitive information on a business or product. (http://www.dataminingservices.org/our-data-mining-services/weka-data-mining-services) [...]
Posted by Anonymous

2015-08-13T21:19:10.245-07:00
I'm thrilled that you saw the quote and were encouraged by it. I only met your father once--I flew to Baltimore to hear him teach a tutorial just so I could finally meet him in person. I was so glad I did!

Reading stories about him like this puts a smile on my face because it provides a different look at him as a person and not just the technical genius he was.
Thanks for posting!
Posted by Anonymous

2015-08-13T18:11:45.385-07:00
Thank you for this. It is only through posts like this that I continue to get to know my father in his academic/professional career. I remember one day when UC Berkeley hired another theorist: Dad came home and described this exact divide over a bowl of strawberries that my stepmother sneaked his way as part of their diet, which they hated. Best, Rebecca Breiman
Posted by rebecca breiman

2015-05-28T13:09:47.454-07:00
This question is particularly relevant these days. With increased automation in predictive analytics (which insurance companies use to predict/mitigate risk) they will be able to find correlations a lot quicker, though I am not sure with more accuracy. That is my concern: how accurate are predictive analytics solutions (http://www.modernanalytics.com/), and does the [...]
Posted by James

2015-05-09T04:59:05.364-07:00
Hello,

I have gone through your posts and found them informative. I am already a reader of your RSS feed, and I will be following you throughout my research. Thanks for providing information.

RSS feed (http://www.virtualnuggets.com/ibm-websphere-cast-iron.html)
Posted by Anonymous

2014-12-11T08:54:56.697-08:00
This is a great quote of the day. Thanks for sharing! I also wanted to mention that if anyone's interested in learning more about predictive analytics I would strongly recommend visiting the Modern Analytics website. They have an incredible predictive analytics software (http://www.modernanalytics.com) solution that is revolutionizing the way that companies predict [...]
Posted by Anonymous

2014-11-04T01:55:32.143-08:00
A few comments...

First, the correlation filtering as I envisioned it here is intended to remove redundant variables, that is, variables with extremely high correlation magnitudes that indicate the pairs of variables contain the same information. I wouldn't touch correlations lower than 0.9, or maybe 0.8 at the low end, using this method.

Significance is largely a result [...]
Posted by Dean Abbott

2014-11-03T23:13:03.405-08:00
Hello, I have a question regarding removing variables before modelling.

When removing variables, should the significance of the correlation play any role in whether or not I remove the variables? Or does removing variables depend solely on the correlation coefficient?

Should I choose a range that I do not want my coefficients to exceed? Can I say for example that I want to [...]
Posted by Anonymous
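
Editorial illustration: the correlation filtering that Dean Abbott describes in his reply above (removing one variable from each pair whose correlation magnitude is extremely high) might look roughly like the sketch below. This is not code from the blog. The function names `pearson` and `drop_redundant`, the dict-of-lists data layout, the rule of keeping the first column of a redundant pair, and the toy data are all assumptions made for illustration; only the 0.9 cutoff comes from the comment.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / (sx * sy)

def drop_redundant(columns, threshold=0.9):
    """Return the names of columns kept after removing near-duplicates.

    columns: dict mapping name -> list of values (hypothetical layout).
    A column is dropped when its correlation magnitude with an
    already-kept column exceeds the threshold.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy data (invented): income_k is a near-copy of income.
data = {
    "income": [30, 45, 60, 80, 100],
    "income_k": [30.1, 44.8, 60.3, 79.9, 100.2],
    "age": [52, 23, 41, 35, 29],
}
kept = drop_redundant(data)
print(kept)
```

Only the magnitude of the correlation is used here, not its statistical significance, which matches the intent stated in the reply: the filter targets redundancy between inputs, and pairs below the cutoff are left alone.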