Anyone at KDD want to get together for drinks? My colleagues and I are going to head out to AWCC on Tuesday too.
Hey folks. I've been meaning to post in this thread for a while, but I'm running into something that has finally prompted me to chime in.

I've spent the summer working on the Coursera/JHU Data Science Specialization, and I'm almost finished. The capstone project is creating a predictive text model based on a large corpus of tweets, blog posts, and news stories. The instructors haven't provided much guidance for this one, and since the specialization mostly focused on structured data and classification models, I feel a bit lost at sea in terms of methods for implementing what they want.

I've been able to do the necessary preprocessing, and I can tokenize the text and generate n-grams using the quanteda package in R, but I'm stuck as to next steps. I haven't been able to find any resources that cover this particular type of application, and the main JHU recommendations are Wikipedia pages (which are not really helpful at this point) or taking a whole separate MOOC on natural language processing, which I don't have time to do while keeping pace with the course (I have a day job, etc.).

Does anybody have any recommendations on what I should be looking at to understand this problem?

That seems like it's just asking for a Markov model.
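The Markov-model idea maps directly onto the n-grams you already have: for each context, count which word follows it in the corpus, then predict by taking the most frequent (or sampling from) that observed distribution. A minimal bigram sketch in Python (the course uses R/quanteda; this toy corpus and these function names are just for illustration):

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count, for each word, the distribution of words observed after it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word, k=3):
    """Return the k most frequent words observed after `word`."""
    return [w for w, _ in counts[word.lower()].most_common(k)]

corpus = [
    "I am going to the store",
    "I am going to work",
    "we are going to the park",
]
model = train_bigram_model(corpus)
print(predict_next(model, "going"))   # "to" is the only observed successor
print(predict_next(model, "the"))     # observed successors: "store", "park"
```

A real model would use higher-order n-grams and back off to shorter contexts when a longer one is unseen (e.g. Katz or "stupid" backoff), but this is the core of the approach.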
In either case, one simple approach: first turn the sentence into some fixed-size embedding that you can run a classifier on. For instance, something really simple would be to run an LSTM/recurrent network on your input sentence; the final output embedding could then be fed into a fully connected layer with a softmax over your word dictionary. For a class project I think that should be sufficient.

I literally built software for this, and even I think calling this workflow "really simple" is the reason why there's so much toxicity in data science. Many of these concepts didn't even exist in practice until a few years ago.
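The pipeline that comment describes (token embeddings, a recurrent layer, then a fully connected layer with a softmax over the vocabulary) can be sketched end to end. The following is a toy, untrained forward pass in plain Python with made-up dimensions, just to show the dataflow; training the weights is the actual work:

```python
import math, random

random.seed(0)
V, E, H = 8, 5, 4   # vocab size, embedding dim, hidden dim (toy numbers)

def mat(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Untrained parameters: embedding table, one LSTM cell, output layer.
emb = mat(V, E)
Wx = {g: mat(H, E) for g in "ifog"}   # input-to-gate weights
Wh = {g: mat(H, H) for g in "ifog"}   # hidden-to-gate weights
Wout = mat(V, H)                      # fully connected layer to vocab logits

def step(x, h, c):
    """One LSTM step: x is an embedding vector, (h, c) the recurrent state."""
    gates = {g: [a + b for a, b in zip(matvec(Wx[g], x), matvec(Wh[g], h))]
             for g in "ifog"}
    i = [sigmoid(v) for v in gates["i"]]
    f = [sigmoid(v) for v in gates["f"]]
    o = [sigmoid(v) for v in gates["o"]]
    g = [math.tanh(v) for v in gates["g"]]
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def next_word_distribution(token_ids):
    h, c = [0.0] * H, [0.0] * H
    for t in token_ids:                # run the recurrent net over the sentence
        h, c = step(emb[t], h, c)
    logits = matvec(Wout, h)           # final embedding -> fully connected layer
    z = max(logits)
    exps = [math.exp(l - z) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]       # softmax over the word dictionary

probs = next_word_distribution([1, 4, 2])
print(len(probs), round(sum(probs), 6))
```

In practice you would do this in a framework with autograd (and trained weights) rather than by hand; the point is only that the output is a probability distribution over the vocabulary, from which you take the argmax as the predicted next word.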
Nice thread. I'm looking into an academia --> data science (ish) transition so this is quite useful.
One question I have is with regard to building out a portfolio to showcase my data analysis skills. So far it's a bit of a mishmash (some projects on my website, some GitHub stuff, some private, etc.). Does anyone have any good suggestions on how to present/organize a 'data science portfolio'? Cheers.
Edit 2: The Markov chain stuff makes a lot of sense for a class project too; you just kind of histogram the data so you can draw from the observed distribution.
The task actually is predicting the next word in a sentence, and this is the approach I spent the night working on. I have something extremely basic running, and I don't think the expectation for our accuracy is very high. I can probably pass the class with what I have now, but there's plenty of time to make it better.

Sorry I haven't had time to keep up with replies; I was plugging away at the project all evening. I haven't picked up a new interest that absorbs my time like this in a long while.
Has anyone had any success with remote DS jobs? The only ones I see are "senior" this and "principal" that.

You might be able to work remote after completing an internship or a six-month contract.
Cool, I figured that was the application after reading more carefully. Best of luck! Did you end up doing a Markov chain?
Any managers or supervisors on here with experience with subscriptions to DataCamp for Business? I recently got promoted and am trying to find professional development opportunities for my team. I'm also in state government, so our data science capabilities are severely lacking, and I would like to beef those up.

We have free access to LinkedIn Learning here, which is just Lynda. It's not bad. I'm building a BI team and doing that along with AWS certs.
Starting my second semester of my Data Science Master's on Monday. My first semester was my toughest academic experience so far, but that's what I get for wanting to go to a top school. The biggest source of stress was accidentally taking a theoretical computer science course with zero theoretical computer science experience. The professor was incredibly supportive and encouraging, so I stayed in the class, but it was conceptually the most difficult work I've ever done. My two core requirement classes were fairly easy conceptually, but a lot of work. My last course was an introduction to machine learning, which was also somewhat hard. Despite all the stress, I learned a lot and really enjoyed the material.
Now I have to choose my courses for this semester. Two are requirements, so I don't have a choice there. That leaves two that I can pick, but I'm having trouble deciding. Luckily I can go to all the classes for the first week or so and see how they feel. Here's what I'm deciding on:

- Parallel Computing for Big Data: Will almost definitely take this one since I have almost no experience in this. Helps that I get along very well with the professor.
- General Linear Models: I need at least one statistics course, and this is probably the most well-regarded course in the Statistics Department. Not super excited about the material, but that will probably change once I get a feel for it.
- Applied Linear Algebra: I'm not sure how different this course is from my general data science courses, as the concepts appear to have a lot of overlap. I'm mostly interested because the professor does work with climate data.
- Global Warming Science: Same professor as the Applied Linear Algebra course, obviously with a focus on climate data. It's a low-level course mostly in Python, but I'd like a guided introduction to the topic.

I'm also applying to internships for the summer. My problem is that I'm too all over the place: I'm interested in practically anything except defense/military stuff, biology, and healthcare. (Thinking about human bodies makes me really anxious.) I'd also rather not work in purely financial stuff or places like banks. Otherwise, I'm happy as long as I'm working with data. But I do want an internship at a place with a full data science or data engineering team. Really dreading technical interviews. :(
As someone currently conducting data scientist intern interviews (for reference, I'm a PM in BI with a data science/accounting background), be prepared to show you can build a case statement from the ground up. It's been a struggle to find folks with enough of both the theoretical and the actual building experience; it's usually one side or the other. (By experience I mean projects, not actual work experience or years of experience; give me an "I built the model," not a "we"! :P) I would also recommend taking a look at business analyst internship roles, as those are the folks producing the story around the data while the data scientists work the regression models, A/B testing, etc. Feel free to ask me any questions!

This is very helpful, thank you! This is probably a dumb question, but what do you mean by "case statement"? And all of my projects in the program so far have been group projects, so I default to "we" when I talk about them... should I just say it was a group project and then default to "I"?
Of those I would definitely say General Linear Models. Finding data scientists who actually know statistics has been hard. Unless you're positive you want to go into climate science, the others might not be as relevant.

I'm surprised it's been hard to find data scientists who know statistics! My undergraduate degree was in statistics, so I feel pretty comfortable with it, but I can definitely learn more. I'm set on taking General Linear Models this spring, and hopefully Bayesian Statistics in the fall! Not sure at all about going into climate science... but the class seems very doable. I might be able to get away with taking five courses instead of four.
Going through the interview process now, and it's a little funny how much you're grilled on LeetCode and big O, which I've almost never used in three years as a data scientist.
My courses have barely touched on big O. Will be something I'll have to study on my own, I think.
I'm looking to get into some heavy math to pursue data science, and I was wondering if anyone here had any resources they found useful for calculus, linear algebra, and statistics. I don't mind if I have to pay for them; I just want something that'll help me.

Watch this for linear algebra:

And this for calculus:

They are not complete courses on the subject, but a fantastic introduction that will help you make more sense of a complete course later.
By case statement I mean: take any of your projects and be able to apply it to the industry you're applying to, showing how you would tackle the issue end to end. How would you go about tackling the problem statement, then getting the data, loading the data, cleansing the data, manipulating and extrapolating the data? Then how would you best present the story or distribute the findings?
Would you be okay with looking over my resume to give some feedback?
Ah, that makes sense! Very helpful. Thank you!
It's not just defaulting to "I," but being able to take your interviewer through what you contributed to the group project. As an interviewer, I want to know what you can contribute, so don't be afraid to be less than humble. This is mostly due to time limits; we don't have time to dig into all the details, so the more you present of yourself up front, the better!
And sure! I'll dm.
It was a couple years ago, but it's amazing how just a quick framework introducing what you're even talking about does wonders for entire branches of math. I wondered why math books don't really get into actually explaining what's going on, and why it took me until physics to actually understand what a lot of the math I had memorized *meant*. Then I looked at every math book I ever had, and they spend so much effort on it: entire chapters or giant sections either relating it to real-world stuff or trying to illustrate basic concepts in an easy-to-grasp way. I don't know if I just didn't read them right at the time, or wasn't engaged, or just skipped over them, but then you watch one five-minute video and it all just clicks.
I think the problem is more that a lot of these books are written by mathematicians who care more about mathematical rigor, generality, correctness, and conciseness than about teaching. This results in dense, boring, hard-to-understand theory. A noticeable lack of understandable examples doesn't help either.
My theory now is that everybody has a "clicking" point, but it's different from person to person, so the books try to hit every one of them for everybody and just overwhelm or exhaust most people, making them think they're bad at math.
Looking for some advice/guidance -
Some background:
-Undergraduate in Business Administration
-Currently a Financial Analyst for the past ~3 years, 2.5 years before that were as a Transportation Analyst, both at a very large transportation company in the US
-Knew SQL (Teradata) before getting my Master's
-March 2017, started a Master's program in Data Analytics and finished with a 3.9 in August 2018. Worked full-time while going to school
-Program was relatively new at the school, no Python or R. Classes included Java, GIS, typical SQL, Pentaho, Pivot4J, XLMiner, VBA, PowerBI, among other database/project management/network security classes
-Proficient using Teradata, SAS, Excel, Access, Spotfire
In my current role, it's basically still Excel every day, all day. There's a lot of bureaucracy when it comes to accessing data at the company and/or trying to adopt new software. You have to have approval to access specific tables and data stores, even read-only. We don't even have a data dictionary for 90% of the data we have, and it's a universal complaint across the company that people spend more time trying to find relevant data than analyzing or exploring it. I have dreams of one day being a Data Scientist or Architect; I want to be a creator or curator of data, not just a consumer relying on others to gather the data I want or need.
I feel like I haven't used anything that I learned in my Master's program (i.e., Java, GIS, etc.), even though I loved learning and using it during my classes. I feel like it's just been a waste. I've tried to communicate to my manager and director that there are other tools and ways to utilize our data more efficiently, but the culture is not open to change. I don't have the opportunity to learn anything new, or to apply what I learned, in my role because, and I quote, 'you're in Finance, not IT; the tools are already in place.'
I've been actively applying to other companies like mad for anything that seems more data-oriented (Data Analyst/Scientist/Business Analyst/BI Analyst), and I just haven't gotten anywhere. Many of them want specific skills (Alteryx, Hadoop, Redshift, etc.), but I don't have that experience. I keep getting recruiters reaching out to me for Financial Planning & Analysis roles, and I tell them I'm more interested in the technical side of FP&A than traditional FP&A, even though my experience is a lot of budgeting/forecasting/ad hoc analysis/etc.
It's hard for me to teach myself something on my own if I don't use it - I bought one of the Python courses on Udemy a few months ago and got halfway through it, but I find that if I don't use something regularly, I don't retain it. I'm going to start over again soon to try again, but wanted to see if anyone had any advice on how to retain stuff like that, when you don't/won't use it daily?
And any tips for what jobs I should look for? I'm to the point that I'm thinking I need to go back to get my Ph.D. to have any shot at a Data Scientist position, and I feel that I'm not prepared for that program.
Quite a lot to unpack here but first...
[...]
I'm UK-based, so it might be different if you're in the US, but I hope this is helpful.
Thank you for all of that insight. It's very, very helpful.
The feedback on the different positions sounds right on point as well.
Do you have any advice on how to expose myself to something like Alteryx/Hadoop/etc. in a meaningful way that I can put experience on my resume and be confident when I talk about it, and also retain it? I don't just want to go through self-taught online classes without some sort of end goal. It would be simple if I used something like that in my current career, but I don't have the option to get anything new where I work as I'm limited to SAS, Teradata, and Spotfire.
Anyone interested in doing a project on the intersection of ML and climate change? I can't provide compute (potentially in the future), but I'm happy to provide data, technical direction, and an opportunity to present your work at MIT if things go well.

Essentially, we have a grant whose mission is to promote research in 15 unique research directions. MIT will do their own research, but outreach is important too, so I figured I'd ask and offer a little of my time mentoring external folks informally.

I think we should be able to make a project if you have a little Python experience; Google Colab notebooks might also be a sufficient way of getting free compute if you need some.

Are you interested in working with students? If so, I'm super interested!
Yes, I figured most people interested would be students.

Edit: DM sent.
Has anyone taken all the data floating around for COVID-19 and done any modeling themselves?

My girlfriend has; she's building her yearly summer camp to teach girls to code (emphasis in bio) using COVID-19 data. She says it's a lot of work to wrangle, lol.
Thanks so much for this! I'm still working through it, but if I can work through the regular problems easily on my own and quickly solve the "harder" questions after a quick Google search, am I fine to say I know SQL on my resume? Or are there basic things beyond the functions tested on this site that would be considered basic knowledge? Thanks again!

IMO, absolutely. It just depends on the position. I've had analyst interviews where my interviewer asked me how to select the min/max, how to join a table, and what the difference is between a left join and an inner join. That job was for 80k.
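Those warm-up questions are easy to check for yourself. A quick sketch with Python's built-in sqlite3 (table names and rows made up for illustration): an inner join keeps only rows with a match in both tables, while a left join keeps every row from the left table and fills NULL where there's no match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (id INTEGER, name TEXT, dept_id INTEGER)")
cur.execute("CREATE TABLE depts (id INTEGER, dept TEXT)")
cur.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [(1, "Ana", 10), (2, "Bo", 20), (3, "Cy", None)])
cur.executemany("INSERT INTO depts VALUES (?, ?)", [(10, "Finance"), (20, "BI")])

# INNER JOIN: only employees whose dept_id matches a dept (Cy is dropped).
inner = cur.execute("""SELECT e.name, d.dept FROM employees e
                       JOIN depts d ON e.dept_id = d.id""").fetchall()

# LEFT JOIN: every employee; dept is NULL where there is no match.
left = cur.execute("""SELECT e.name, d.dept FROM employees e
                      LEFT JOIN depts d ON e.dept_id = d.id""").fetchall()

# MIN/MAX: the other classic screening question.
lo, hi = cur.execute("SELECT MIN(id), MAX(id) FROM employees").fetchone()

print(inner)    # Ana/Finance and Bo/BI; Cy is dropped
print(left)     # all three employees; Cy's dept is None
print(lo, hi)
```

If you can write these from scratch and explain the difference in the result sets, you're in good shape for that level of SQL question.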