The Rise of the Citizen Data Scientist
The ability of businesses of all sizes to collect data has exploded in recent years. A terabyte of data is no longer an unheard-of occurrence that only lives in a defense mainframe; it exists on average users’ laptops. But what can businesses do with all of that data? And how can they utilize it without spinning up a costly data engineering project for every single task? By embracing the citizen data scientist.
What is a citizen data scientist?
In the not-too-distant past, functioning as a data scientist required a highly specialized, code-intensive set of skills. You needed to know how to build your own models, possess a firm understanding of statistics, and be fluent in a language commonly used for statistical modeling, like Python or R. While those skills are all still necessary in some contexts, modern data tools and platforms are giving rise to a new generation of data scientists: the citizen data scientist.
Gartner defines a citizen data scientist as “a person who creates or generates models that leverage predictive or prescriptive analytics, but whose primary job function is outside of the field of statistics and analytics.”
Functioning as a citizen data scientist doesn’t require a Ph.D. or even a background in coding. As long as someone has access to a clean data source, an interesting set of questions, and some basic data tools like Power BI, Tableau, or Excel, they can turn data into valuable business insights. A citizen data scientist could be a product manager determining which features to add next, a warehouse shipping manager picking the best freight rates, or a digital marketer optimizing website traffic.
The evolution of the citizen data scientist
In addition to the increased access to data, there’s also been a shift in how businesses view data in recent years. While many organizations used to view data as merely a byproduct of doing business, that tide has turned. More and more companies now treat data as the valuable asset that it is instead of something tossed aside or left forgotten in storage. Many organizations even have a Chief Data Officer role now as part of their C-suite.
Along with that influx of data and shift in mindset towards it came a storage need. With so much data being gathered, there’s a need to keep it accessible for whenever someone wants to use it.
One such storage method is to curate structured data into a data warehouse. A data warehouse is an excellent presentation tool for clean data that can be used for reporting, visualizations, and business intelligence. However, it isn’t so great for storing the raw data that feeds machine learning models or whose use hasn’t yet been identified.
The other challenge with a data warehouse is that data often has to meet a very high standard of quality and structure before it can be imported. Because of this, much of the data produced by an organization doesn’t make it into the data warehouse. Additionally, data warehousing requires a highly specialized (and very costly) data team to maintain it and use the data within it in a meaningful way.
While data warehouses are still valuable, they’re often built as a secondary layer on top of data stored in a data lake.
A data lake lowers the barrier to entry and allows for the storage of unstructured data. It keeps all of that raw data in one place and accessible for use with more predictive analytics methods like machine learning. Data lakes are a relatively inexpensive place to store data for a future use if and when a purpose is identified.
The ease of collecting data brings downsides, though, including difficulty managing the raw data and the risk of not securing it appropriately. Once it’s deemed useful, the data in a data lake can be transformed into a usable format through various tools and processes. The process isn’t as easy as starting with the clean data of a warehouse, though. Without proper maintenance, governance, and stewardship, many a data lake has turned into a data swamp.
The method we continue to lean into here at MentorMate is a hybrid of the first two, generally called a data lakehouse. In a data lakehouse model, data is collected in a flexible, unstructured data lake. An extra layer of metadata within the lakehouse provides the benefits of structured data, like access controls and integrity, without actually changing the underlying data itself.
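To make the idea of a metadata layer more concrete, here is a minimal Python sketch. It is not how any particular lakehouse product works; the names TableMetadata, CATALOG, STORAGE, and read_table are hypothetical and exist only for illustration. The point it tries to show is that the expected columns and the access rules live next to the raw data and are applied when the data is read, while the stored rows are never modified.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Hypothetical metadata entry describing one dataset in the lake."""
    columns: dict                              # column name -> expected Python type
    allowed_roles: set = field(default_factory=set)


# Raw rows exactly as they were collected (illustrative data only).
STORAGE = {
    "inventory": [
        {"sku": "A-1", "qty": 3},
        {"sku": "B-2", "qty": "three"},        # malformed value stays in storage
    ]
}

# The metadata layer sits beside the raw data instead of rewriting it.
CATALOG = {
    "inventory": TableMetadata(
        columns={"sku": str, "qty": int},
        allowed_roles={"analyst"},
    )
}


def read_table(name, role, storage):
    """Return only the rows that match the declared schema, for permitted roles."""
    meta = CATALOG[name]
    if role not in meta.allowed_roles:
        raise PermissionError(f"role {role!r} may not read {name!r}")
    return [
        row
        for row in storage[name]
        if all(isinstance(row.get(col), typ) for col, typ in meta.columns.items())
    ]
```

Calling read_table("inventory", "analyst", STORAGE) returns only the row that matches the declared schema, and the malformed row remains in STORAGE for anyone who later wants to inspect or repair it.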
The lakehouse model does away with the need to transform the data to get it into a warehouse. Instead, the tools that use the data transform it as they report on and analyze it, switching the ETL (Extract, Transform, Load) paradigm to ELT (Extract, Load, Transform).
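The ELT ordering described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a production pipeline: the sample events and the function names load, transform_on_read, and revenue_by_region are all made up for the example. The raw records are loaded untouched, and parsing and cleaning happen only when a specific question is asked of the data.

```python
import json
from collections import defaultdict

# Hypothetical raw events, stored as the untouched JSON strings they arrived as.
RAW_EVENTS = [
    '{"order_id": 1, "region": "EU", "amount": "19.99"}',
    '{"order_id": 2, "region": "US", "amount": "5.00"}',
    '{"order_id": 3, "region": "EU", "amount": "not available"}',
    '{"order_id": 4, "region": "US", "amount": "12.50"}',
]


def load(raw_lines):
    """Load step: keep every record exactly as received, with no cleaning."""
    return list(raw_lines)


def transform_on_read(lake):
    """Transform step: parse and clean records only when they are queried."""
    for line in lake:
        record = json.loads(line)
        try:
            record["amount"] = float(record["amount"])
        except ValueError:
            continue  # skip rows that cannot answer this particular question
        yield record


def revenue_by_region(lake):
    """One example question: total order revenue per region."""
    totals = defaultdict(float)
    for record in transform_on_read(lake):
        totals[record["region"]] += record["amount"]
    return {region: round(total, 2) for region, total in totals.items()}


lake = load(RAW_EVENTS)
print(revenue_by_region(lake))
```

Because the record with the unusable amount is skipped at read time rather than rejected at load time, it is still available in the lake if a later question (for example, counting orders per region) does not depend on that field.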
We believe that of these three methods — and many other experimental models from the past 25 or so years — the data lakehouse is the best and most practical method of data storage today. The driving factor in that belief is that the lakehouse model enables the citizen data scientist to use readily available tools to complete data analysis and data science tasks.
Many of the tools used to transform data in a lakehouse exist within AWS, Azure, and our data partner Databricks. If your organization is already using one of those platforms, you’re already on your way to implementing citizen data science within your company. That said, it isn’t entirely effortless. At the outset, there is some groundwork that needs to be laid to enable citizen data science in your organization.
How to enable citizen data science in your organization
The first step in enabling citizen data science is to think about the types of questions you hope to answer using data. What is it that you want to accomplish? How do these questions link back to your overall business strategy? If you can’t formulate questions that you’d like to answer, the citizen data scientists that you enable won’t know what they’re looking for in the data.
With those key data objectives in hand, identify the types of data that you think will be useful in answering those questions. The next step is to start a data lake and begin capturing your unstructured data.
All of that might sound easier said than done, and that’s fair. But rest assured, once your organization is set up to empower people to become data scientists in their non-data-centric roles, all the work setting it up will pay dividends. And it’s important to note that you don’t need to conduct this setup work on your own. At MentorMate, we help clients build data pipelines and analytics platforms that enable citizen data scientists to answer those data questions and solve business problems.