People want some guidance in what can be an overwhelming and fast-moving field. Managers want to know what to buy and where to invest in training. Junior data scientists, students, DevOps engineers and system administrators want to know what to learn. I will focus on the important capabilities for a Data Science team and the tools I have found useful for enabling those capabilities.
Capability: Typically you will go through many iterations of your code and the work products your code produces. Without a version control tool, it quickly becomes impossible to track changes and reproduce earlier work. The problem only gets worse when your team has more than one person.
Tool: Git is a great version control system and the effort to learn its command line interface is a very worthwhile investment.
Git is incredibly flexible. However, this can lead to confusion and inconsistency in how it is applied. Git-flow is a set of scripts that automate much of what you will need to do in Git, following a particular branching convention that happens to be very helpful for Data Science.
Capability: Even if your data is small enough to fit in memory, reproducing work means rerunning all your scripts to load everything back into memory before you can pick up where you left off. Other team members have to do the same. This is painful and inefficient. You therefore need to persist your work (raw data, intermediate datasets and work products).
Tool: A database gives you a way to persist your workings and intermediate datasets as well as share them with team members. Pick a database that is performant and flexible. I use PostgreSQL. It has an amazing set of features and this flexibility is what you want when doing Data Science.
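As a minimal sketch of what persisting an intermediate dataset looks like, the Python below loads some raw rows and stores an aggregate as its own table. It rests on loud assumptions: an in-memory SQLite database from the standard library stands in for a shared PostgreSQL server so the example runs anywhere, and the table names, column names and data are invented for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for a shared PostgreSQL server.
conn = sqlite3.connect(":memory:")

# Hypothetical raw data, as it might arrive from a customer.
raw_rows = [
    ("2015-01-03", "north", 120.0),
    ("2015-01-03", "south", 80.5),
    ("2015-01-04", "north", 99.5),
]

# Keep the raw data exactly as received.
conn.execute("CREATE TABLE raw_sales (sale_date TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_rows)

# Persist an intermediate dataset so nobody has to recompute it.
conn.execute(
    """
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount) AS total
    FROM raw_sales
    GROUP BY region
    """
)
conn.commit()

# Any team member can now pick up from the intermediate table.
totals = dict(conn.execute("SELECT region, total FROM sales_by_region"))
```

The point is the pattern rather than the engine: raw data goes in untouched, and each derived dataset gets a name that the rest of the team can query instead of rerunning scripts.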
Capability: Getting your head around your data and preparing it for a variety of algorithms is probably the most time-consuming and important part of the Data Science life cycle. Some preparation, such as certain natural language processing, is easier to do outside a database. Visualizing the data is really important here too.
Tool: Pick a programming language that has great data reshaping and visualization capabilities. If you work in Python, Pandas is a powerful set of data structures and algorithms for data wrangling. Seaborn and Matplotlib are good places to start for visualization. And don’t waste time trying to get all these things to work together. Just use Continuum’s excellent distribution, Anaconda.
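To give a flavour of the reshaping Pandas makes easy, here is a small sketch that turns a wide table (one column per month) into a long one and then summarises it. The DataFrame contents and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical wide-format data: one row per customer, one column per month.
wide = pd.DataFrame(
    {
        "customer": ["a", "b"],
        "jan": [10, 20],
        "feb": [15, 7],
    }
)

# Reshape to long format: one row per (customer, month) pair.
long = wide.melt(id_vars="customer", var_name="month", value_name="spend")

# Summarise the long table: total spend per customer.
totals = long.groupby("customer")["spend"].sum()
```

Long format is usually what plotting libraries such as Seaborn expect, so a reshape like this is often the step just before a first chart.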
Capability: Data Science is useless without communication (to your customer and within your team). You could just write a report as a Word document. There’s nothing wrong with that and it’s a format your business customers will expect. However, it would be great to have documentation that is easy to version control and can be kept close to your project code.
Tool: Markdown is a nice platform-neutral way to document your project. Because it’s plain text it’s easy to version control (see above). And if your report isn’t too complicated you can convert it to Word from Markdown. Win.
Capability: You get hundreds of data files. You get huge files in strange formats with broken delimiters. You want to chop these up, patch them together, change their encodings, unravel XML etc. No, trying to open the file in a text editor or spreadsheet is not the answer.
Tool: This is best done at a powerful command line. Linux is worth learning.
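To make the kind of task concrete, here is a rough Python sketch of two of the operations mentioned: changing a file's encoding and chopping a file into smaller pieces. It is a stand-in for illustration, not a replacement for the command line tools recommended above. The function names, the file names and the pipe-delimited sample data are all hypothetical, and the splitter assumes one record per line (no line breaks inside quoted fields).

```python
import tempfile
from pathlib import Path


def reencode(src, dst, src_encoding="latin-1", dst_encoding="utf-8"):
    """Stream src to dst line by line, changing only the encoding."""
    with open(src, "r", encoding=src_encoding, newline="") as fin, \
            open(dst, "w", encoding=dst_encoding, newline="") as fout:
        for line in fin:
            fout.write(line)


def split_file(src, out_dir, lines_per_chunk, encoding="utf-8"):
    """Split src into numbered chunk files, repeating the header row in each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    out = None
    with open(src, "r", encoding=encoding, newline="") as fin:
        header = fin.readline()
        for i, line in enumerate(fin):
            if i % lines_per_chunk == 0:
                # Start a new chunk file and write the header into it.
                if out is not None:
                    out.close()
                path = out_dir / f"chunk_{len(chunks):04d}.csv"
                out = open(path, "w", encoding=encoding, newline="")
                out.write(header)
                chunks.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return chunks


# Hypothetical example: a small Latin-1 file with a pipe delimiter.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw.csv"
raw.write_bytes("name|city\nZoë|Köln\nAna|Lyon\nBo|Oslo\n".encode("latin-1"))

clean = workdir / "clean.csv"
reencode(raw, clean)
chunks = split_file(clean, workdir / "chunks", lines_per_chunk=2)
```

Both functions stream the file rather than reading it whole, which is what matters when the file is too big for a text editor or spreadsheet.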
Capability: Data Science is difficult to communicate. It’s often a slightly meandering journey with dead ends, back-tracking, unexpected insights leading to new research avenues etc. When updating your customer, you need to walk them through some of this journey using narratives interleaved with graphics and tabular data. Code files won’t do. Duplicating into PowerPoint is a lot of extra work for a quick interim update.
Tool: Jupyter allows all of the above in presentation quality. The close interleaving of analysis and documentation helps other team members join a project. And it reduces duplication when you decide it’s time to stop coding and start updating your customer.
Capability: Eventually, your understanding and your code start to consolidate. There are some core datasets. They go through some agreed preparatory steps. There are some reports and algorithm datasets that you want to lock down and reproduce several times during the project. Manually running all those code files is a pain.
Tool: Build automation tools allow you to automate tasks such as executing code files, creating documentation, importing and exporting data etc. I’ve used command line scripts (see above) and software build tools like Ant for this automation. More sophisticated tools like Luigi are now reaching a level of maturity where you could consider them for your team too.
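The core idea behind these tools can be sketched in a few lines: declare each task and what it depends on, then let the tool work out the order. The Python below is a toy illustration of that idea under my own assumptions, not how Ant or Luigi are implemented. The function run_pipeline and the task names are hypothetical, and the tasks only record that they ran.

```python
def run_pipeline(tasks, deps, target):
    """Run target after everything it depends on, each task at most once."""
    done = []

    def visit(name, stack=()):
        if name in done:
            return
        if name in stack:
            raise ValueError(f"dependency cycle at {name}")
        # Run all prerequisites first, then the task itself.
        for dep in deps.get(name, []):
            visit(dep, stack + (name,))
        tasks[name]()
        done.append(name)

    visit(target)
    return done


# Hypothetical tasks: each one just records that it ran.
log = []
tasks = {
    "load_raw": lambda: log.append("load_raw"),
    "clean": lambda: log.append("clean"),
    "features": lambda: log.append("features"),
    "report": lambda: log.append("report"),
}

# Each task lists the tasks that must finish before it starts.
deps = {
    "clean": ["load_raw"],
    "features": ["clean"],
    "report": ["clean", "features"],
}

order = run_pipeline(tasks, deps, "report")
```

Asking for "report" runs the whole chain in a valid order, and "clean" runs only once even though two tasks depend on it. Real build tools add the parts that matter in practice, such as skipping work that is already up to date.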
Capability: What the hell is everybody doing? Where did that data come from? Where is the conversation with the system SME that led to that business rule? Where is the deliverable from 2 weeks ago and who sent it to which customer?
Tool: Workflow tracking tools like JIRA help answer all the above questions. Look for a tool that is customizable, as Data Science doesn’t need all the detail of a large-scale software development project. Do make sure you track where your data is coming from and what deliverables are going out the door (see Guerrilla Analytics).
Capability: The diverse nature of Data Science activities leads to a correspondingly diverse set of tools, as you’ve seen above. When you get things working, you would rather not break them, and you would rather not force every team member to go through the same painful installations and configurations and risk inconsistency.
Tool: Vagrant and other ‘dev ops’ tools allow you to define your tech setups and their configuration in program code. What does that mean? It means that you can build your entire technology stack and configure it by running some code. It also means that the installation of all your tools and their configuration can be version controlled. As your technology stack evolves, update your code and issue a new release to your team. If you trash your technology or need to move to other servers, everything you need to reproduce your environment has been captured and you should be back up and running in minutes.
I’ve covered a lot here. How do you put this all together without choking a team in conventions, rules, tools etc? How do you reduce Data Science chaos and continue to deliver iteratively and at pace? That’s where Guerrilla Analytics can help.