Dec 13, 2018
1,521
Anyone at KDD want to get together for drinks? My colleagues and I are going to head out to AWCC on Tuesday too.
 

Cymbal Head

Member
Oct 25, 2017
2,373
Hey folks. I've been meaning to post in this thread for a while, but I'm running into something that has finally prompted me to chime in.

I've spent the summer working on the Coursera/JHU Data Science Specialization, and I'm almost finished. The capstone project is creating a predictive text model based on a large corpus of tweets, blog posts, and news stories. The instructors haven't provided much guidance for this one, and since the specialization mostly focused on structured data and classification models, I feel a bit lost at sea in terms of methods for implementing what they want.

I've been able to do the necessary preprocessing, and I can tokenize the text and generate ngrams using the quanteda package in R, but I'm stuck as to next steps. I haven't been able to find any resources that cover this particular type of application, and the main JHU recommendations are Wikipedia pages (which are not really helpful at this point) or to take a whole separate MOOC on natural language processing, which I don't have time to do and still keep pace with the course (I have a day job, etc.).

Does anybody have any recommendations on what I should be looking at to understand this problem?
 
May 9, 2018
3,600
That seems like it's just asking for a Markov model.
 
Dec 13, 2018
1,521

Edit 2: The markov chain stuff makes a lot of sense too for a class project^^, you just kind of histogram the data so you can draw from the observed distribution.
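
To make the "histogram" idea concrete, here is a minimal sketch in Python (the project itself is in R, so treat this purely as an illustration). It assumes a first-order (bigram) chain and a toy corpus; the function names `build_bigram_counts`, `predict_next`, and `sample_next` are made up for this example and are not from quanteda or the course.

```python
from collections import Counter, defaultdict
import random


def build_bigram_counts(sentences):
    """Histogram step: count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenizer (assumption)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, word, k=3):
    """Return up to k of the most frequently observed followers of `word`."""
    followers = counts.get(word.lower(), Counter())
    return [w for w, _ in followers.most_common(k)]


def sample_next(counts, word, rng=random):
    """Draw one follower at random, weighted by the observed counts."""
    followers = counts.get(word.lower())
    if not followers:
        return None  # word never seen in the corpus
    words = list(followers)
    weights = list(followers.values())
    return rng.choices(words, weights=weights, k=1)[0]


# Toy corpus, hypothetical data for illustration only.
corpus = [
    "i love data science",
    "i love machine learning",
    "i like data",
]
counts = build_bigram_counts(corpus)
print(predict_next(counts, "i"))
print(sample_next(counts, "data"))
```

A real version would probably use longer n-grams and fall back to shorter ones when a context was never observed, but the counting step stays the same.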

What is the explicit task? Funny enough, I was just talking to someone about word embedding models. Anyway, I assume you're not necessarily predicting the next word in the sentence (in which case you'd be building a generative model), but predicting something like sentiment from a document... is this what you're doing?

In either case, one simple approach would be to first turn it into a fixed-size embedding that you can run a classifier on. So for instance, something really simple would be to run an LSTM/recurrent network on your input sentence; the final output embedding could then be fed into a fully connected layer with a softmax over your word dictionary. For a class project I think that should be sufficient.

Edit : I'm avoiding doing real work, so I'll write up more detail, sorry if you already know all this or it should be tackled a different way for your course.

Another way of tackling classification is to use something like word2vec, sentence2vec, or doc2vec, which take in tokenized documents and turn them into fixed-size embeddings. Then you can use whatever classifier you want, differentiable or non-differentiable, which is to say you can use linear regression and not necessarily anything with deep learning like the technique I described above. Looking at the course you linked, and based on the fact it's in R, they probably want you to use more traditional techniques, so stuff like x2vec and bag-of-words models might be more along the lines of what they're thinking.
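
As a rough illustration of the "fixed-size embedding, then any classifier" pattern, here is a small pure-Python sketch. To keep it dependency-free it swaps in plain bag-of-words counts for the embedding (not word2vec or doc2vec) and a nearest-centroid rule with cosine similarity for the classifier (not a regression model). The documents, labels, and function names are all invented for the example.

```python
import math


def build_vocab(docs):
    """Map every word seen in training to a fixed column index."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(vocab)}


def embed(doc, vocab):
    """Bag-of-words vector: one count per vocabulary word, unknown words ignored."""
    vec = [0.0] * len(vocab)
    for tok in doc.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(docs, labels):
    """Average the document vectors of each class into one centroid per label."""
    vocab = build_vocab(docs)
    by_label = {}
    for doc, label in zip(docs, labels):
        by_label.setdefault(label, []).append(embed(doc, vocab))
    centroids = {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in by_label.items()
    }
    return vocab, centroids


def classify(doc, vocab, centroids):
    """Pick the label whose centroid is most similar to the document vector."""
    vec = embed(doc, vocab)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy data for illustration only.
docs = ["great fun game", "loved this great movie",
        "terrible boring game", "awful boring movie"]
labels = ["pos", "pos", "neg", "neg"]
vocab, centroids = train(docs, labels)
print(classify("great fun", vocab, centroids))
print(classify("boring and awful", vocab, centroids))
```

The point is only that once every document is a vector of the same length, the classifier on top is interchangeable.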

Anyway, I'm working on a research project that is trying to learn "better" representations of text data using adversarial learning methods, so I'm happy to help out with what limited knowledge I have in this space.
 

Cyborg009

Member
Oct 28, 2017
1,238
I used to think data science was more of a buzzword than anything else. I wanted to look into it, so this is a nice thread. I heard good things about Splunk, so it might be worth adding to the OP.
 
May 9, 2018
3,600
In either case, one simple approach would be to first turn it into a fixed-size embedding that you can run a classifier on. So for instance, something really simple would be to run an LSTM/recurrent network on your input sentence; the final output embedding could then be fed into a fully connected layer with a softmax over your word dictionary. For a class project I think that should be sufficient.
I literally built software for this and even I think calling this workflow "really simple" is the reason why there's so much toxicity in data science. Many of these concepts didn't even exist in practice until a few years ago.
 

opticalmace

Member
Oct 27, 2017
4,029
Nice thread. I'm looking into an academia --> data science (ish) transition so this is quite useful.

One question I have is in regards to building out a portfolio... to showcase my data analysis skills. So far it's a bit of a mish mash (some projects on my website, some github stuff, some private etc). Does anyone have any good suggestions on how to present/organize sort of a 'data science portfolio'? Cheers.
 
Dec 13, 2018
1,521
Nice thread. I'm looking into an academia --> data science (ish) transition so this is quite useful.

One question I have is in regards to building out a portfolio... to showcase my data analysis skills. So far it's a bit of a mish mash (some projects on my website, some github stuff, some private etc). Does anyone have any good suggestions on how to present/organize sort of a 'data science portfolio'? Cheers.

Personally, I think a GitHub with a reasonable number of stars is a great help. I usually review applicants' GitHub profiles, and contributions to open-source and well-followed projects are a huge plus. Some people might look at a personal page if you want to write a nice little summary of each project, but not everyone will necessarily take the time to look through it all.

If you have a place you want to target, highlight the project you want them to see first and have a nice graphic next to the text on your personal page.

Oh, and try to submit your work to a conference; if you get accepted to a major one, a slew of recruiters will find you and set up interviews there. In fact, for academia, publishing will be everything, along with recommendations from well-known folks in the field. Sometimes submitting to a conference can be formidable, so you can always start out by publishing at workshops that happen at a conference. The bar for entry is lower and it's a nice way to get started in the community. Workshops that have challenges can be a good starting place since, as long as you do okay, you can submit a write-up of your approach and they'll usually accept it.
 
Dec 13, 2018
1,521
I literally built software for this and even I think calling this workflow "really simple" is the reason why there's so much toxicity in data science. Many of these concepts didn't even exist in practice until a few years ago.

So, didn't mean to offend you. I was originally a physicist but really enjoyed all the excitement around deep learning and wanted to apply it to my problems, so over the last 4 years I ended up transitioning to an ML research scientist and taking some graduate ML courses at the school my lab is affiliated with. Anyway, I only mention all that because I always felt data science and ML were the least toxic communities I'd ever been a part of; there's this tangible excitement in the air with everyone I know working on this stuff. It feels like a bunch of exciting breakthroughs are coming around the corner and there's so much interest from the general public too. Early on, everyone I met and worked with seemed to be very welcoming of people with non-traditional backgrounds and willing to teach. In other fields, there's a very high level of snobbery, well, physics anyway. So I only wanted to help and was completely willing to break down anything I laid out; it was more of a shotgun of ideas he could pick from. I only meant simple because he could set that up with very little code based on what he said he'd already done.

I mean, just look at the front page, there's so much content for beginners; it really doesn't seem toxic at all to me from the perspective of a new practitioner. I mean, you can follow the latest academic research and there's no paywall! If it weren't for the ridiculous number of applicants and competition, it would easily have the lowest barrier to entry of any science I can think of.
 

Cymbal Head

Member
Oct 25, 2017
2,373
Edit 2: The markov chain stuff makes a lot of sense too for a class project^^, you just kind of histogram the data so you can draw from the observed distribution

The task actually is predicting the next word in a sentence, and this is the approach I spent the night working on. I have something extremely basic running, and I don't think the expectation for our accuracy is very high. I can probably pass the class with what I have now, but there's plenty of time to make it better.

Sorry I haven't had time to keep up with replies, I was plugging away on the project all evening. I haven't picked up a new interest that absorbs my time like this in a long while.
 

JeTmAn

Banned
Oct 25, 2017
3,825
I literally built software for this and even I think calling this workflow "really simple" is the reason why there's so much toxicity in data science. Many of these concepts didn't even exist in practice until a few years ago.

Kekeke I still don't really understand how LSTMs work. It's all too easy to learn the surface of a tool without understanding the key intuition.
 

Cymbal Head

Member
Oct 25, 2017
2,373
backpropaganda, I am still a total novice, so everything you're saying has the ring of something I recognize but don't totally grok. After reading through your reply to my post, I'm looking at text2vec, and the application for classification seems straightforward, even if I don't follow every particular. I'm not sure I'll use it for my current assignment, but it's great to know about this.

One thing that has been challenging to me is that data science, broadly construed, encompasses such a massive range of topics and techniques, it can be hard to figure out what I should be looking for when I'm approaching something new, especially as someone without a mathematical or compsci background. The Coursera sequence worked well in providing me a track to follow, but it's coming to an end and I feel like I still need some guide rails to keep me going. I think I'm going to work through Hadley Wickham's book next to help drill myself on the basics, but I'm curious how people keep up with the field, especially if it's not your primary gig.
 
Dec 13, 2018
1,521
The task actually is predicting the next word in a sentence, and this is the approach I spent the night working on. I have something extremely basic running, and I don't think the expectation for our accuracy is very high. I can probably pass the class with what I have now, but there's plenty of time to make it better.

Sorry I haven't had time to keep up with replies, I was plugging away on the project all evening. I haven't picked up a new interest that absorbs my time like this in a long while.
Cool, I figured that was the application after reading more carefully. Best of luck! Did you end up doing a Markov chain?
 

Cymbal Head

Member
Oct 25, 2017
2,373
Cool, I figured that was the application after reading more carefully. Best of luck! Did you end up doing a Markov chain?

I did, and I just submitted it! After doing some peer grading, it seems to be by far the most common approach.

I legitimately had a lot of fun, but it also feels good to have it done going into the weekend. I spent way too much time this week plugging away on it at the expense of actual work.

Now I need to figure out what I'm going to actually do with this certification.
 
Oct 25, 2017
1,465
Any managers and supervisors on here that have experience with subscriptions to Data Camp for Business? I recently got promoted, and am trying to find professional development opportunities for my team. I'm also in state government so our data science capabilities are severely lacking and would like to beef those up.
 

maxxpower

Attempted to circumvent ban with alt account
Banned
Oct 25, 2017
8,950
California
I just logged back in to Datacamp after a few months and saw that their price to have access to everything is now $50? I used to pay $30.
 

Irnbru

Avenger
Oct 25, 2017
2,128
Seattle
Any managers and supervisors on here that have experience with subscriptions to Data Camp for Business? I recently got promoted, and am trying to find professional development opportunities for my team. I'm also in state government so our data science capabilities are severely lacking and would like to beef those up.
We have free access to LinkedIn Learning here, which is just Lynda. It's not bad. I'm building a BI team and doing that along with AWS certs.
 
OP
Raticus79

Community Resettler
Member
Oct 25, 2017
1,037
Sorry I haven't posted recently, things have been pretty crazy for me this year.

Is anyone using Dremio here? I've been taking a good look at it for taking advantage of the big local temp NVMe drives that are common on cloud VMs now.
 

Pau

Self-Appointed Godmother of Bruce Wayne's Children
Member
Oct 25, 2017
5,838
Starting my second semester of my Data Science Master's on Monday. My first semester was my toughest academic experience so far, but that's what I get for wanting to go to a top school. The biggest source of stress was accidentally taking a theoretical computer science course with zero theoretical computer science experience. The professor was incredibly supportive and encouraging, so I stayed in the class, but it was conceptually the most difficult work I've ever done. My two core requirement classes were fairly easy conceptually, but a lot of work. My last course was an introduction to machine learning, which was also somewhat hard. Despite all the stress, I learned a lot and really enjoyed the material.

Now I have to choose my courses for this semester. Two are requirements, so I don't have a choice there. That leaves two that I can pick but I'm having trouble deciding. Luckily I can go to all the classes for the first week or so and see how they feel. Here's what I'm deciding on:
  • Parallel Computing for Big Data: Will almost definitely take this one since I have almost no experience in this. Helps that I get along very well with the professor.
  • General Linear Models: I need at least one statistics course, and this is probably the most well-regarded course in the Statistics Department. Not super excited about the material but that will probably change once I get a feel for it.
  • Applied Linear Algebra: I'm not sure how different this course is from my general data science courses, as the concepts appear to have a lot of overlap. I'm mostly interested because the professor does work with climate data.
  • Global Warming Science: Same professor as the Applied Linear Algebra course, obviously with a focus on climate data. It's a low-level course mostly in Python, but I'd like a guided introduction to the topic.
I'm also applying to internships for the summer. My problem is that I'm too all over the place: I'm interested in practically anything except defense/military stuff, biology and healthcare. (Thinking about human bodies makes me really anxious.) I'd also rather not work in purely financial stuff or places like banks. Otherwise, I'm happy as long as I'm working with data. But I do want to have an internship at a place where there is a full data science or data engineering team. Really dreading technical interviews. :(
 

Irnbru

Avenger
Oct 25, 2017
2,128
Seattle

As someone currently conducting data scientist intern interviews (for reference, I'm a PM in BI with a data science/accounting background), be prepared to show you can build a case statement from the ground up. It's been a struggle to find folks with enough of both the theoretical and the actual building experience; it's usually one side or the other. (By experience I mean projects! Not actual work experience or years of experience. Give me an "I built the model", not a "we"! :P) I would also recommend taking a look at business analyst internship roles, as those are the folks producing the story around the data, while the data scientists work on the regression models, A/B testing, etc. Feel free to ask me any questions!
 

Gazele

Member
Oct 25, 2017
972
Of those I would definitely say the general linear models. Finding data scientists who actually know statistics has been hard.

Unless you're positive you want to go into climate science, but it might not be as relevant.

Going through the interview process now and it's a little funny how much you're grilled on leetcode and big O, which I've almost never used in 3 years as a data scientist
 

Pau

Self-Appointed Godmother of Bruce Wayne's Children
Member
Oct 25, 2017
5,838
As someone currently conducting data scientist intern interviews (for reference, I'm a PM in BI with a data science/accounting background), be prepared to show you can build a case statement from the ground up. It's been a struggle to find folks with enough of both the theoretical and the actual building experience; it's usually one side or the other. (By experience I mean projects! Not actual work experience or years of experience. Give me an "I built the model", not a "we"! :P) I would also recommend taking a look at business analyst internship roles, as those are the folks producing the story around the data, while the data scientists work on the regression models, A/B testing, etc. Feel free to ask me any questions!
This is very helpful, thank you! This is probably a dumb question, but what do you mean by "case statement"? And all of my projects in the program so far have been group projects so I default to we when I talk about them... should I just be saying it was a group project and then defaulting to "I"?

Would you be okay with looking over my resume to give some feedback?

Of those I would definitely say the general linear models. Finding data scientists who actually know statistics has been hard.

Unless you're positive you want to go into climate science, but it might not be as relevant.

Going through the interview process now and it's a little funny how much you're grilled on leetcode and big O, which I've almost never used in 3 years as a data scientist
I'm surprised it's been hard to find data scientists who know statistics! My undergraduate degree was in statistics so I feel pretty comfortable with it, but I can definitely learn more. I'm set on taking General Linear Models this spring, and hopefully Bayesian Statistics in the fall! Not sure at all about going into climate science... but the class seems very doable. I might be able to get away with taking five courses instead of four.

My courses have barely touched on big O. Will be something I'll have to study on my own, I think.
 

Tenck

Member
Oct 27, 2017
612
I'm looking to get in to some heavy math to pursue data science, and I was wondering if anyone here had any resources they found useful for calculus, linear algebra, and statistics. I don't mind if I have to pay for them, I just want something that'll help me.
 

Gazele

Member
Oct 25, 2017
972
I'm surprised it's been hard to find data scientists who know statistics! My undergraduate degree was in statistics so I feel pretty comfortable with it, but I can definitely learn more. I'm set on taking General Linear Models this spring, and hopefully Bayesian Statistics in the fall! Not sure at all about going into climate science... but the class seems very doable. I might be able to get away with taking five courses instead of four.

My courses have barely touched on big O. Will be something I'll have to study on my own, I think.

We get a lot of computer science majors applying. Probably has something to do with our company paying way under market rate, plus it's super competitive in the Bay Area, and Google and Facebook get a lot of the really good data scientists because it's pretty hard to turn down a FAANG company.
 

Lafazar

Member
Oct 25, 2017
1,579
Bern, Switzerland
I'm looking to get in to some heavy math to pursue data science, and I was wondering if anyone here had any resources they found useful for calculus, linear algebra, and statistics. I don't mind if I have to pay for them, I just want something that'll help me.
Watch this for linear algebra:
And this for calculus:

They are not complete courses on the subject, but a fantastic introduction, that will help you make more sense of a complete course later.
 

Irnbru

Avenger
Oct 25, 2017
2,128
Seattle
This is very helpful, thank you! This is probably a dumb question, but what do you mean by "case statement"? And all of my projects in the program so far have been group projects so I default to we when I talk about them... should I just be saying it was a group project and then defaulting to "I"?

Would you be okay with looking over my resume to give some feedback?
By case statement I mean: take any of your projects, and be able to apply it to the industry you are applying to and explain how you would tackle the issue end to end. How would you go about tackling the problem statement and then getting the data, loading the data, cleansing the data, and manipulating and extrapolating the data? Then how would you best present the story or distribute the findings?

It's not just defaulting to "I", but being able to take your interviewer through what you contributed to the group project. As an interviewer I want to know what you can contribute, so don't be afraid to not be humble. This is mostly due to time limits and not having all the time to dig into details; the more you present of yourself up front, the better!

And sure! I'll dm.
 

Pau

Self-Appointed Godmother of Bruce Wayne's Children
Member
Oct 25, 2017
5,838
By case statement I mean: take any of your projects, and be able to apply it to the industry you are applying to and explain how you would tackle the issue end to end. How would you go about tackling the problem statement and then getting the data, loading the data, cleansing the data, and manipulating and extrapolating the data? Then how would you best present the story or distribute the findings?

It's not just defaulting to "I", but being able to take your interviewer through what you contributed to the group project. As an interviewer I want to know what you can contribute, so don't be afraid to not be humble. This is mostly due to time limits and not having all the time to dig into details; the more you present of yourself up front, the better!

And sure! I'll dm.
Ah that makes sense! Very helpful. Thank you!
 

SteveWinwood

Member
Oct 25, 2017
18,676
USA USA USA
Watch this for linear algebra:
And this for calculus:

They are not complete courses on the subject, but a fantastic introduction, that will help you make more sense of a complete course later.
It was a couple of years ago, but it's amazing how just a quick framework introducing what you're even talking about does wonders for entire branches of math. I wondered why math books don't really get into actually explaining what's going on, and why it took me until physics to actually understand what a lot of the math I had memorized *meant*. Then I looked back at every math book I ever had, and they spend so much effort on it. Entire chapters or giant sections either relating it to real-world stuff or trying to illustrate basic concepts in an easy-to-grasp way. And I don't know if I just didn't read them right at the time or I wasn't engaged or I just skipped over them, but then you watch one 5-minute video and it all just clicks.

My theory now is that everybody has a "clicking" point, but it's different from person to person, so the books try to cover every one of them for everybody and just overwhelm or exhaust most people and make them think they're bad at math.
 

Lafazar

Member
Oct 25, 2017
1,579
Bern, Switzerland
It was a couple of years ago, but it's amazing how just a quick framework introducing what you're even talking about does wonders for entire branches of math. I wondered why math books don't really get into actually explaining what's going on, and why it took me until physics to actually understand what a lot of the math I had memorized *meant*. Then I looked back at every math book I ever had, and they spend so much effort on it. Entire chapters or giant sections either relating it to real-world stuff or trying to illustrate basic concepts in an easy-to-grasp way. And I don't know if I just didn't read them right at the time or I wasn't engaged or I just skipped over them, but then you watch one 5-minute video and it all just clicks.

My theory now is that everybody has a "clicking" point, but it's different from person to person, so the books try to cover every one of them for everybody and just overwhelm or exhaust most people and make them think they're bad at math.
I think the problem is more that a lot of these books are written by mathematicians who care more about mathematical rigor, generality, correctness and conciseness than teaching. This results in dense, boring and hard to understand theory. Also a noticeable lack of understandable examples does not help.
 
Dec 1, 2017
280
Looking for some advice/guidance -

Some background:
-Undergraduate in Business Administration
-Currently a Financial Analyst for the past ~3 years, 2.5 years before that were as a Transportation Analyst, both at a very large transportation company in the US
-Knew SQL (Teradata) before getting my Master's
-March 2017, started a Master's program in Data Analytics and finished with a 3.9 in August 2018. Worked full-time while going to school
-Program was relatively new at the school, no Python or R. Classes included Java, GIS, typical SQL, Pentaho, Pivot4J, XLMiner, VBA, PowerBI, among other database/project management/network security classes
-Proficient using Teradata, SAS, Excel, Access, Spotfire

In my current role, it's basically still Excel every day, all day. There's a lot of bureaucracy when it comes to accessing data at the company and/or trying to adopt new software. You have to have approval to access specific tables and data stores, even read-only; we don't even have a data dictionary for 90% of the data that we have, and it's a universal complaint across the company that people spend more time trying to find relevant data than analyzing or exploring. I have dreams of one day being a Data Scientist or Architect; I want to be a creator or curator of data, not just a consumer relying on others to gather the data I want or need.

I feel like I haven't used anything that I've learned in my Master's program (i.e. Java, GIS, etc.), even though I did love learning and using it during my classes. I feel like it's just been a waste. I've tried to communicate to my manager and director that there are other tools and ways to utilize our data more efficiently, but the culture is not open to change. I don't have the opportunity to learn anything new, or utilize what I learned in my role because, and I quote, 'you're in Finance, not IT, the tools are already in place.'

I've been actively applying to other companies like mad for anything that seems to be more data oriented (Data Analyst/Scientist/Business Analyst/BI Analyst), and I just haven't gotten anywhere. Many of them want specific skills (Alteryx, Hadoop, Redshift, etc.), but I don't have that experience. I keep getting recruiters reaching out to me for Financial Planning & Analysis roles, and I tell them that I'm more interested in the technical side of FP&A than traditional FP&A, even though my experience is a lot of budgeting/forecasting/ad hoc analysis/etc.

It's hard for me to teach myself something on my own if I don't use it - I bought one of the Python courses on Udemy a few months ago and got halfway through it, but I find that if I don't use something regularly, I don't retain it. I'm going to start over soon and try again, but wanted to see if anyone had any advice on how to retain stuff like that when you don't/won't use it daily?

And any tips for what jobs I should look for? I'm to the point that I'm thinking I need to go back to get my Ph.D. to have any shot at a Data Scientist position, and I feel that I'm not prepared for that program.
 

impingu1984

Member
Oct 31, 2017
3,415
UK
Quite a lot to unpack here but first...

The jobs it sounds like you want are:

Data Analyst, BI (Business Intelligence) Analyst, Data scientist...

You definitely don't want Business Analyst; that's a process / integration / project management role that may involve data analysis, but it's not an analytics role.

A lot of job roles get the word "Analyst" tagged on the end, but the three I've highlighted are really the pure analytics roles and are the most common names. Of course, you will occasionally get some companies giving a pure analytics role a funky name.

It's also worth mentioning that Adobe Analytics / Google Analytics / online (or digital) analytics focused roles are usually listed as Digital Analyst.

You'll find something like an e-commerce company usually has far less red tape... I work as a data scientist at an e-commerce company and the analytics department works as a business function, not siloed in a certain department. That means we get access to everything.

It's also worth noting the following regarding those job roles (these rules aren't universal, but they might help you navigate which jobs to apply for):

Data analyst - Maybe more junior (not always tho), will likely require SQL knowledge... If SQL isn't mentioned, the job will likely be Excel all day every day (even more so if there's no mention of a BI tool such as PowerBI or Tableau). You'll maybe be doing some insights projects but more likely be a report writer / creator.

BI analyst - As above, but BI tools will likely be expected (PowerBI, Tableau), and you will probably be expected to provide more actionable insights / recommendations on the data.

Data scientist - This is almost a buzzword these days, but a proper data scientist will be dealing with predictive analytics, creating models, etc. The role will almost certainly need R or Python and is effectively an elite BI analyst / data analyst. PowerBI and Tableau knowledge won't be mentioned but will be useful; SQL should be a given. Specific knowledge such as Hadoop, Redshift, etc. will depend on the company's tech stack and as such may be optional / nice to have if you can demonstrate aptitude and experience. If a data scientist role doesn't mention stuff like this, it's a company buzzwording a data / BI analyst role. You'll still have to do grunt work like self-service dashboards, but this usually keeps stakeholders happy, and then you can build on it with the really cool stuff, blow people's minds, and be seen as a god amongst mortals in the workplace.

Based on what you have said, you'll struggle to get in on a proper data scientist role currently.... That's not to say you can't work up to it. Look for data analyst / BI analyst roles, get in there, and start thinking about how to start using ML models in your new role.. at that point you suddenly start building experience for a data science role.

A special mention about Alteryx... I use it and it's the best piece of software I've ever used. It's not widely used but is becoming more popular. It's extremely easy to learn so don't worry too much about not having experience in it because if you've got your SQL game down, and understand joins, union, group by etc data concepts then you'll pick it up easily and anyone who is recruiting and also uses Alteryx will know this.

I'm UK based so it might be different if you're in the US, but hope this is helpful.
 
Dec 1, 2017
280
Thank you for all of that insight. It's very, very helpful.

The feedback on the different positions sounds right on point as well.

Do you have any advice on how to expose myself to something like Alteryx/Hadoop/etc. in a meaningful way that I can put experience on my resume and be confident when I talk about it, and also retain it? I don't just want to go through self-taught online classes without some sort of end goal. It would be simple if I used something like that in my current career, but I don't have the option to get anything new where I work as I'm limited to SAS, Teradata, and Spotfire.
 

impingu1984

Member
Oct 31, 2017
3,415
UK
Alteryx has a free 14-day trial... Also you can apply for student licences if you're a student. Get a license and start learning it with some datasets. You will see quickly why I say I wouldn't worry about being experienced in Alteryx.. it's an extremely powerful tool but is easy to learn.

Hadoop is open source so it should be easy to learn... Personally I've had exposure to it but I'm far from an expert, and even then in my role I'm more focused on the insight, not the data engineering of creating a Hadoop cluster. I'd be querying it, not making it.

Ultimately Redshift and Hadoop are just places data lives, much like SQL dbs... That's why I say they depend more on the company's tech stack and can be optional / nice to have....

Better to invest in learning Python (specifically pandas, numpy, etc.) and R than Hadoop as it stands for you. That is getting into proper data analysis and modelling, which is universally a required skill, and those are the tools you'll be using. If you can use Python and R well, you have the experience and aptitude to get to grips with Hadoop. Unless an employer wants someone who can drop in and hit the ground running with their chosen tech stack, being able to demonstrate Python / R for analysis is always going to be a must-have, and stuff like Hadoop a desirable.
 

Goda

Member
Oct 26, 2017
2,430
Toronto
Anyone use Dataquest? I've been doing the data science course for about 4 months and it seems really great, but it delves into so many topics, including things like mapping geographical coordinates, which might not be useful for many positions.

Should I stick with this course or are there better ones on Pluralsight, Udemy, and Coursera?

I am currently in an application/database specialist position so I have a background in programming, databases, and log collection with ELK.
 

Pau

Self-Appointed Godmother of Bruce Wayne's Children
Member
Oct 25, 2017
5,838
Anyone interested in doing a project on the intersection of ML and climate change? I can't provide compute (in the future potentially) but I'm happy to provide data, technical direction, and an opportunity to present your work at MIT if things go well.

Essentially we have a grant whose mission is to promote research in 15 unique research directions. MIT will do their own research, but outreach is important too, so I figured I'd ask and provide a little of my time mentoring external folks informally.

I think we should be able to make a project if you have a little Python experience. Also, Google Colab notebooks might be a sufficient way of utilizing free compute if you need some.
Are you interested in working with students? If so I'm super interested!
 
Dec 13, 2018
1,521
Thanks for the dm and responses, I think it might be hard to give proper attention to more than a few people, so for now we're going to give it a shot with the folks that have messaged and hopefully come up with some cool projects to share in the future. If things go well, I'm open to trying again in the summer.
 

Clay

Member
Oct 29, 2017
8,109
Does anyone have a good source of "practice problems" using SQL?

I did a Udemy bootcamp recently and SQL seems pretty straightforward. I have a decent job at the moment but my team only uses Excel and I feel like I'm not learning anything, I'd like to start looking into positions that involve more in-depth analysis. I'm learning some data analysis/ viz stuff in Python right now, and I'm worried I'm going to lose the SQL I learned if I'm not regularly using it. It was really easy to pick up, but I want to be able to confidently demonstrate I know how to use it, and avoid choking in an interview and having to pull a "I'm not really familiar with the syntax I'd use to do this but I sure could Google it!"

Does anyone know of any resources to get some practice in? I know I could just fiddle around with a database but it'd be a lot easier to pick up new commands if I was trying to solve a problem. If there's a site that has problems with solutions that would be awesome, but I'm willing to buy a textbook.

Much appreciated!
 

WedgeX

Member
Oct 27, 2017
13,172
Has anyone taken all the data floating around for COVID-19 and done any modeling themselves?
 

Clay

Member
Oct 29, 2017
8,109

Thanks so much for this! I'm still working through it, but if I can work through the regular problems easily on my own and quickly solve the "harder" questions after a quick Google search am I fine to say I know SQL on my resume? Or are there basic things beyond the functions tested on this site that would be considered basic knowledge?

Thanks again!
 

Kelsdesu

Member
Oct 25, 2017
4,465
Imo. Absolutely. It just depends on the position. I've had analyst interviews where my interviewer asked me how to select the min/max, how to join a table and what's the difference between a left join and inner join. That job was for 80k.
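
For anyone wanting to rehearse those three questions (min/max, joining a table, left join vs inner join), here is a minimal sketch using Python's built-in sqlite3 module so it runs without a database server. The `customers` and `orders` tables and their rows are made up for illustration and don't come from any real interview.

```python
import sqlite3

# Hypothetical practice tables; the names and data are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Select the min and max of a column.
min_max = conn.execute("SELECT MIN(amount), MAX(amount) FROM orders").fetchone()

# INNER JOIN keeps only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
""").fetchall()

# LEFT JOIN keeps every customer; those without orders get NULL (None in Python).
left_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
""").fetchall()

print(min_max)
print(inner_rows)
print(left_rows)
```

The only difference between the two join results is the extra `('Cy', None)` row from the left join, since Cy has no orders.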

But there are other positions that will ask you questions that don't require table data knowledge. These are a bit tougher. For example:

What is one method to identify duplicate records in a table?
Are you able to produce a second method?
Write a query to extract the month value of the current date
Given that the current date is in UTC, can you write a query to convert the timezone to the same timezone as Los Angeles?
Write a query to extract the first day of the month of the current date
Write a query to extract the last day of the month of the current date.
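
A rough sketch of how the duplicate and date questions above could be answered, again in SQLite via Python's sqlite3 module. The `contacts` table is hypothetical, and date functions differ a lot between dialects (Teradata, SQL Server, Postgres, etc. each have their own), so treat this as one possible answer rather than the expected one. The time zone question is left out because SQLite has no named time zones; that one really depends on the database you're interviewing for. A fixed date is used so the results are repeatable; passing 'now' instead gives the current UTC date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with one duplicated email, for illustration only.
conn.executescript("""
    CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO contacts VALUES
        (1, 'a@example.com'), (2, 'b@example.com'),
        (3, 'a@example.com'), (4, 'c@example.com');
""")

# Method 1: group on the column(s) that define a duplicate and keep groups of 2+.
dupes_grouped = conn.execute("""
    SELECT email, COUNT(*) AS n
    FROM contacts
    GROUP BY email
    HAVING COUNT(*) > 1
""").fetchall()

# Method 2: self-join to find rows sharing an email with a different row.
dupes_joined = conn.execute("""
    SELECT DISTINCT c1.id, c1.email
    FROM contacts AS c1
    JOIN contacts AS c2
      ON c2.email = c1.email AND c2.id <> c1.id
    ORDER BY c1.id
""").fetchall()

# Month value, first day of month, and last day of month for a given date.
today = "2020-02-15"  # swap in 'now' for the current date (UTC in SQLite)
month, first_day, last_day = conn.execute("""
    SELECT strftime('%m', :d),
           date(:d, 'start of month'),
           date(:d, 'start of month', '+1 month', '-1 day')
""", {"d": today}).fetchone()

print(dupes_grouped)
print(dupes_joined)
print(month, first_day, last_day)
```

The last-day trick (go to the start of the month, add a month, subtract a day) avoids hard-coding month lengths, so leap years come out right.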

This job was for less. Just make sure you read the job description. If the HR person did their job you will know if you are a good fit for the position.

Also, to add: I've interviewed people, and honestly, if you understand the basic select stuff, joins and procedures, everything else can be learned. I think it is more important that I can trust that you can solve the problem on your own however you can. I don't give af how you do it. I look shit up all the time.