We're Hiring Now
close icon
Toggle Menu

The human in the machine

A RESTful Development

author profile photo

By Andy Mahdavi

clock icon

8m

Machine Learning Deployment as Collaboration of Full Stack Data Scientists and Engineers

Not all types of specialization are created equal

As disciplines mature, practitioners tend to specialize. During the Renaissance, physics was a branch of philosophy, and Isaac Newton, inventor of calculus, the prism, and aspiring alchemist, considered himself a philosopher. He called his masterpiece the Mathematical Principles of Natural Philosophy. In the 19th century, philosophy split off from physics, but a single physicist could still be reasonably expected to know the entire subject. Now, we have particle theorists, astrophysicists, solid state physicists, plasma physicists, and many others, and most of them would be hard pressed to understand physics papers outside their expertise, due to the immense level of specialized study required to comprehend, much less contribute, new research in each area.

The same is beginning to happen in data science. As computer vision, natural language processing, anomaly detection, risk decision sciences, and other fields develop more and more complex methods, there will be a natural, and probably necessary, tendency to focus on horizontal specialization. But not all types of specialization are created equal. As well as a horizontal differentiation, some companies, like Facebook, are choosing to put data scientists into vertical niches.

Vertical specialization for data science means dividing teams into explorers, who clean and validate data; into hypothesis builders, rapidly experimenting to develop the thrust of the modeling initiative; into feature engineering specialists, who massage the data into usable, insightful forms; and into machine learning engineers, who fine tune and deploy the models.

Vertical specialization may be attractive as a way to take advantage of the comfort zones of data scientists, putting statisticians into the role of experiment design, and coders into the position of optimization. But by separating the statistician half of the data scientist from the developer half, companies miss out on the synergies that deliver outsize results, as I argue in the table below.

Vertically specialized team Vertically integrated individual
Lacks single point of accountability for model results Drives clarity of communication to business leaders through single owner who is also the chief technical contributor
Potential for nonoptimal communication between different phases of model build Single integrated model build strategy which can still draw on the advice of other team members
Divergence of optimization and feature engineering aspects Harmonized learning and optimization methods that leverage the peculiarities of feature builds and control for data quality anomalies
Limited personal growth due to diluted contribution to single business result Professional growth that comes from end-to-end ownership

Fostering Full Stack Data Scientists

At States Title, we’re reinventing the way America closes real estate transactions, using machine learning. To achieve this goal, we have found it most effective to make a commitment to fostering full stack data scientists. This means that each of our individual contributors owns a business use case from ideation, to exploration, to model architecture and optimization, to deployment.

To be sure, sufficiently sophisticated topics inevitably require collaboration among multiple contributors. But the effect of ownership, and the inevitable growth that comes from pushing code that one had a direct role in conceiving into production, is unmistakable. Both personal and professional growth come out of complete vertical ownership of a deployment.

For example, early in the history of our company, it became clear that we needed a dedicated algorithm for treating mortgages. Every real estate transaction needs to include a list of pre-existing mortgages associated with the property, so that they can be paid off. It is our job as a company to determine that list.

We could have handled this problem by segmenting it into separate data gathering, data analysis, feature engineering, model building, and model optimization tasks, and assigning them to different team members. This could have driven effective results if every single team member assigned to each task had been a high performer and effective communicator. But would it have driven the same amount of personal growth compared to a situation where we gave the entire blank slate to a high performing individual?

As it turns out, a single team member was assigned to the entire white space, and iteratively developed a solution. While the solution relied on the expertise of all the other team members, it had, importantly, a single point of accountability for all model build aspects. This resulted in a nonlinear, tortuous path towards a final solution on model architecture, where all the knowledge about raw data sources and data quality issues could be integrated into the final optimization and model build. The end result was a successful mortgage segmentation strategy, resulting in rapid professional growth (and promotion) for the associate.

We are even going to make a decorative sculpture to be featured in our head office, derived from the model optimization hypersurface!

RESTful APIs that Enable the Full Stack Individuals

To make this fostering of full stack data scientists feasible as a technology organization, special steps are required. We need to free the persons involved from as much of the technical development of the customer-facing product as possible. At the same time, we need to enable an infrastructure which allows rapid updating and delivery of insights to the end customers. Putting both product development and model build on a single individual is both unfeasible and unaligned with the necessary separation between data science and engineering.

So how did we do it? How did we allow full stack data scientists to deploy models straight to customers, without tying them up with product engineering and development?

Our solution was to come up with an architecture where States Title data science models are separated from the core product by a suite of RESTful APIs. Whenever our core title insurance product needs an answer (such as the list of mortgages that are on the property), it launches a data payload over such an API connection, and the data science model need only run its scoring code and return the results. Every time a model is updated, it is seamlessly passed to production without needing to modify the core engineering code, or affect product development more than necessary.

Of course, this approach implies an active and rich communication between data scientists and backend engineers each time a release comes along. But it provides a clean separation of labor, preventing the wasteful re-engineering of model code by backend software engineers that is so common in financial services organizations.

To understand our solution, imagine building the entire suite of company products using such an approach. Imagine that there are 3 core products, utilizing 3 data science models, but in such a manner that two of the products require multiple models. For example, in the schematic below, Product 1 needs models A and B, Product 2 needs only model B, while Product 3 needs models B and C. Each arrow indicates a RESTful connection.

When it comes time update, say, Model C, the data scientist updates the scoring code on the data science side, and flags the Product 3 team that a model update is ready.

The advantage of this collaborative architecture is that interaction between product and data science is reduced to negotiations over the structure of the RESTful payload. As a result, both backend engineers and data scientists are freed to focus on their respective area of expertise.

Summary

We’ve found that data scientists are happiest and grow most quickly when they’re given full ownership of the what we call the “blank slate.” This doesn’t mean their work is in a silo, because team members need to review, understand, test, and critically, approve each others’ work. It does mean that confusion, miscommunication, and unharmonious data quality, feature engineering, and model optimization work is lessened. It does result in more rapid and more fulfilling professional growth. The developmental philosophy that makes the alchemy happen is a RESTful collaboration framework between the backend engineering and data science teams.

Want to learn more about our approach to applying machine intelligence to the real estate process? Sign up for our emails to receive the latest news and advice direct from our team.