By Iman Saleh, PhD, Developer Evangelist and TAP Product Manager
This year, Intel hosted its Analytics Summit as a Day 0 event of its annual Intel Developer Forum. The summit brought data scientists, application developers, and business owners together to discuss the latest technologies and stories related to data analytics. I had the chance to facilitate a roundtable on collaboration workflows, and I am sharing here some of the ideas that came out of the discussion.
In the past few years, more enterprises have come to realize the value of their data and are looking at machine learning and predictive analytics as a way to monetize it. They do that either by using advanced analytics to extract value directly from the data, or by enhancing their business processes and adding efficiency, which translates into cost reductions or increased earnings. To leverage analytics, these enterprises are building teams composed mainly of data scientists and application developers. These two roles typically work closely together to build solutions that mine patterns and trends from data.
These mixed teams not only have high potential for building interdisciplinary tools, but also introduce a set of new challenges. The data scientist and the application developer come from different worlds, with different disciplines and workflows. As demand for analytics rises, we have seen both data scientists and application developers wearing different hats to get the job done. That has led to data scientists doing more programming and application developers learning how to build statistical models. Yet application developers remain the most qualified to write efficient programs, and data scientists remain the best at building and validating statistical models.
Here, I review some of the observations arising from this mix of the two disciplines. I will also share some proposed solutions based on the group discussion and my interactions with individuals from both disciplines.
Data Scientists Working within the Software Development Life Cycle (SDLC)
Advanced analytics solutions are, mostly, software-based and follow the SDLC. These days, that usually means applying an agile development methodology. The agile philosophy promotes a process based on incremental milestones and continuous delivery of value to the end customer. It also favors close team collaboration and frequent status updates, as applied in Scrum, over processes and extensive documentation. Now, these practices may not necessarily work well for data scientists, who need long work cycles to build and train a single model. There usually are no interesting sub-products or milestones to deliver until a model is ready for production, which can take a relatively long time compared to building the software around it. Consequently, frequent status updates become less interesting when nothing is happening beyond training a model and tuning for the best parameters.
The solution? Perhaps a slightly modified agile process for analytics projects, with longer cycles. Alternatively, software engineers can build their applications against “template models” that may not be ready for production but represent the format of the end product. This implies, however, that milestones are prototypes used only to solicit customer feedback.
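One way to read the “template model” idea is as a stub that exposes the interface the production model will eventually have, so application developers can build against it while training is still in progress. The sketch below is hypothetical; the class, feature names, and placeholder logic are illustrative assumptions, not part of any real system.

```python
import random

class TemplateChurnModel:
    """A stand-in for a model still in training (hypothetical example).

    It exposes the same interface the production model will have, so
    application developers can build and demo against it today.
    """
    feature_names = ["tenure_months", "monthly_spend", "support_tickets"]

    def predict(self, features):
        # Placeholder logic: the trained model will replace this method
        # without changing the interface the application depends on.
        return random.random()  # a fake churn probability in [0, 1)

# Application code written against the template keeps working when the
# trained model is swapped in, as long as the interface holds.
model = TemplateChurnModel()
score = model.predict(
    {"tenure_months": 12, "monthly_spend": 40.0, "support_tickets": 2}
)
print(f"churn risk: {score:.2f}")
```

The design choice here is simply an interface contract: milestones can then demo the application end to end, with the caveat that the predictions are placeholders.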
Customer Feedback and New Deliverables
Software engineers use methodologies like agile or the spiral model to solicit feedback from customers and continuously enhance their products. This is proven to increase customers’ satisfaction. Now, the data science side of analytics applications makes this process more interesting. Customers can give feedback on the accuracy of the predictions. Poor predictions can be a symptom of models that need updating or that were trained on bad data.
The response in this case is to revise the model and/or collect more data for training. In some cases, however, models can be accurate, yet customers choose to change them in order to meet a business need – even at the expense of lower accuracy. For example, doctors are known to require the ability to explain how a model was trained. They may factor in patient convenience and trade it for accuracy. The question here is: does this mean the models are inaccurate, or do they just lack additional data that is too subjective to collect? Is it even feasible to factor such data into the learning process? What about “explaining” the model? Customers need to understand how the model was built and how the different parameters were tuned. They also need to be able to measure the effect of changes in these parameters on the model outcome. This is a new set of challenges that software developers didn’t have to deal with before.
Solution? We may come to realize that we always need a human in the loop to assess the suitability of the model, especially for critical tasks. Additionally, model explanation or justification can be considered one of the deliverables of an analytics project.
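For a simple linear model, an “explanation deliverable” can be as direct as breaking a prediction into per-feature contributions that a domain expert can inspect. The sketch below assumes a hypothetical risk model; the feature names, weights, and intercept are invented for illustration.

```python
# Hypothetical linear risk model: weights and intercept would come
# from training; here they are invented for illustration only.
weights = {"age": 0.8, "blood_pressure": 1.5, "cholesterol": 0.4}
intercept = -2.0

def explain_prediction(patient):
    """Break a linear score into per-feature contributions that a
    domain expert (e.g. a doctor) can inspect and question."""
    contributions = {name: weights[name] * patient[name] for name in weights}
    score = intercept + sum(contributions.values())
    return score, contributions

# Feature values are assumed to be pre-normalized (hypothetical input).
patient = {"age": 0.5, "blood_pressure": 1.2, "cholesterol": 0.9}
score, contributions = explain_prediction(patient)

# Report contributions sorted by magnitude, largest driver first.
for name, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>15}: {value:+.2f}")
print(f"{'total score':>15}: {score:+.2f}")
```

An output like this lets the customer see which feature drives the score and ask for a trade-off (say, down-weighting a feature that is inconvenient to collect) with eyes open about the accuracy cost. Richer models need richer techniques, but the deliverable is the same in spirit: a human-readable account of why the model said what it said.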
Big Data Tools and Platforms
Data scientists nowadays have to work within the Big Data ecosystem. That implies learning tools and platforms such as Spark, Hadoop, Hive, Cassandra and others. These platforms are, mostly, built for performance and introduce new programming constructs and tools for the purpose of parallelizing workloads and efficiently handling big sets of data. These tools typically have a steep learning curve and can limit the data scientist to the scripting languages they support.
Solution? I’d argue that data scientists should not need to program for efficiency or plan for parallelizing their algorithms. Their whole goal should be to produce machine learning solutions with accurate predictions. It’s up to another role within the team, sometimes referred to as a data engineer, to build these efficiency guarantees into the infrastructure and re-design the machine learning implementation for parallel execution.
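To make the division of labor concrete, here is a toy sketch of the hand-off. The data scientist writes the simple, readable version of a computation; the data engineer restates it in a map/reduce shape that a platform like Spark could distribute across workers. The example computes a mean, and the chunking is hypothetical, standing in for data partitioned across machines.

```python
from functools import reduce

def mean_simple(values):
    # The data scientist's version: clear, correct, single-machine.
    return sum(values) / len(values)

def mean_mapreduce(chunks):
    # The data engineer's restatement in a parallel-friendly shape.
    # Map: each chunk (which could live on a separate worker) is
    # reduced to a partial (sum, count) pair.
    partials = [(sum(chunk), len(chunk)) for chunk in chunks]
    # Reduce: combine the partial pairs into a global mean.
    total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partials)
    return total / count

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chunks = [data[0:2], data[2:4], data[4:6]]  # stands in for partitioned data
assert mean_simple(data) == mean_mapreduce(chunks)  # same answer either way
print(mean_mapreduce(chunks))
```

The point is that the re-design changes the shape of the computation, not its meaning: the data scientist owns the first function, and the data engineer owns getting from the first to the second.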
Standards, Standards, Standards
If you chat with data scientists, they’ll tell you that the process of building and publishing models depends largely on where you work and who you work with. There’s still a lack of standards for representing models and for the process of generating and consuming them. That hinders reusability to a large extent.
Solution? There are ongoing efforts to standardize data science deliverables; the Predictive Model Markup Language (PMML) is one example. Yet these efforts are still not widely adopted by tools. Consequently, data scientists are reluctant to use them due to the lack of tool support.
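To give a flavor of what PMML offers, the sketch below serializes a toy linear model as XML. The element names follow the spirit of the PMML specification’s regression markup, but this is a simplified, hand-rolled fragment for illustration, not a validated PMML document, and the model’s features and coefficients are invented.

```python
import xml.etree.ElementTree as ET

def model_to_pmml(intercept, coefficients):
    """Serialize a toy linear model as a simplified PMML-style fragment.

    Illustrative only: element names echo PMML's regression markup,
    but no attempt is made to produce a spec-complete document.
    """
    pmml = ET.Element("PMML", version="4.4")
    model = ET.SubElement(pmml, "RegressionModel", functionName="regression")
    table = ET.SubElement(model, "RegressionTable", intercept=str(intercept))
    for name, coef in coefficients.items():
        ET.SubElement(table, "NumericPredictor",
                      name=name, coefficient=str(coef))
    return ET.tostring(pmml, encoding="unicode")

# Hypothetical house-price model: price ~ 1.5 + 0.3*sqft + 12.0*bedrooms
xml_doc = model_to_pmml(1.5, {"sqft": 0.3, "bedrooms": 12.0})
print(xml_doc)
```

The appeal of a standard like this is that any consumer that speaks the format can load the model, regardless of which tool or language trained it, which is exactly the reusability the current tool-by-tool landscape lacks.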
Where we stand…
We have witnessed many success stories, as well as struggles, from companies joining the new wave of advanced analytics and the opportunities it brings. To address this, Intel has been incubating the open source TAP Analytics Toolkit (TAP), which tackles some of these challenges by supporting the collaborative workflow between data scientists and application developers. TAP also masks the complexity of the supporting infrastructure, making it easier for data scientists to focus on building analytics.
It will be interesting to watch how data science and software engineering disciplines will fuse over time through advanced analytics projects. I expect the two disciplines to evolve with new methods and processes for efficient delivery and reuse of solutions.