Over the past year, working on numerous data science projects in the organization I work for, I’ve spent hundreds of hours alongside fellow data scientists.
Since the data science field is quite young, it is not surprising that my co-workers, as well as data scientists in general, come from various fields.
Mathematics, statistics, physics and even psychology: this diversity is quite positive, as each person brings the strengths and views of their respective field.
Myself, I come from software engineering, and this naturally makes me inspect processes from an engineer’s perspective.
A data science project requires many skills. One should have mathematical understanding and intuition about the nature of the algorithms, statistical knowledge to create and test hypotheses about the problem, familiarity with algorithms and state-of-the-art solutions, and much more.
Engineering and programming skills don’t have to be at an expert level in order to be a good data scientist. It is absolutely possible to be a top researcher in the field without ever knowing what HTTP verbs are, how to manage task queues with workers and brokers, or how to build an in-browser monitoring dashboard. Production systems, on the other hand, require quite advanced engineering skills.
From my engineering point of view, I’ve noticed that this gap between science and engineering manifests itself once a data science project has to go from the research stage into production. I’ve seen data scientists struggle with frameworks and spend unnecessary time learning tools that fall outside the skills their work actually requires.
On the other hand, I’ve seen production teams receive projects from the data science team that they neither understand nor have any idea how to incorporate into their organization’s production cycle.
Since this problem was real and recurring, I decided to take a step toward closing this gap with an open source Python package.
Before opening my Dracula-themed editor to start coding a framework, I sat down to define the requirements:
1. Data scientist first. The framework should impose as little learning overhead as possible on the data scientist and abstract away the heavy lifting.
2. Lean and fast. Development within the framework should be quick, and the result should be as lean as possible.
3. Production ready. The framework should use production-grade tools and practices, and its output should be something that can easily be handed to a production engineer to operate.
Three requirements, dozens of hours of coding, and months of tests and real production use later, the denzel package was born, and it is now open sourced for public use.