Findability: Kubernetes does data discovery, too

Finding Big Data should not be as hard as storing it is expensive

For more than a decade and a half, people have been talking about big data and its attributes in the modern enterprise.

Chant along with us, because you know the three Vs of big data: Volume, Velocity, and Variety, to which some have subsequently added Veracity, and one or two have added Value. That’s stretching the point a bit, perhaps. But one thing that is missing in all of this V-speak about embiggened bits cascading into the datacenter is the simple fact that if you can’t find the data, it does you no good at all.

We need a V word for “findability,” but after scratching our heads for a while we could not think of one, and we could not find a good one in the online thesauruses of the world either, which is funny if you think about it. Visibility seems weaker than what we are looking for, so data discovery, the theme of this last article in our four-part series on Kubernetes in the enterprise datacenter, will have to do.

And as far as Red Hat is concerned, data discovery is not just about figuring out what data you have and finding the right data to solve a particular problem or create a specific algorithm, but also establishing that datasets are not full of garbage that could radically skew results.
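
To make that concrete, screening a dataset can start with nothing fancier than counting duplicate rows and missing values before anyone trains a model on it. Here is a minimal sketch in Python with pandas; the file name, columns, and the 20 percent threshold are our own assumptions, not anything Red Hat prescribes.

    # Minimal data-quality screen: count duplicates and missing values.
    # The file name and the 20 percent threshold are hypothetical.
    import pandas as pd

    df = pd.read_csv("customer_events.csv")          # hypothetical dataset

    report = {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "null_fraction": df.isna().mean().to_dict(),  # per-column share of missing values
    }

    # Flag columns where more than 20 percent of the values are missing
    suspect_columns = [col for col, frac in report["null_fraction"].items() if frac > 0.2]
    print(report)
    print("Columns needing a closer look:", suspect_columns)

A screen like this will not catch every way a dataset can skew a result, but it is cheap enough to run on every candidate dataset before it goes anywhere near a model.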

Deep clean

“Data science is a disruptive force that is taking on increasing importance,” explains Pete Brey, Director of Data Services Marketing at Red Hat, whose OpenShift platform is becoming the touchstone for commercial-grade Kubernetes.

“Big companies are figuring this out, and no business will be untouched, and in fact their competitive stance will depend upon data science to a certain extent. And let’s face it, a lot of data science in the enterprise is used for marketing purposes. It is how companies engage with their customers. But depending on who you ask, 40 percent to 50 percent – some say 80 percent – of the time that data scientists spend in their work is just to find the right data, and then once they do find it, data cleanliness is a big challenge. Data scientists need to spend more time on algorithms and machine learning and less time worrying about finding the right data and making sure it is clean.”

This has been a constant problem in data processing, and it really was not that different three decades ago with the rise of Teradata and its competitors in the emerging data warehousing business, with the extract/transform/load nightmare done in massive batches out of historical relational database records. This did indeed create value for lots of companies, but only with massive effort. And the data discovery and data cleansing situation did not really get better for enterprises with the rise of the MapReduce distributed processing framework and its underlying Hadoop Distributed File System more than a decade ago, which allowed bigger data but resulted in much slower queries than people were used to. We went back to batch mode (and had to, given the size of the data).
“The pace of change with data is accelerating,” says Brey. “While ETL and batch loading still happens, we are seeing more and more real-time analysis of data, and that is also changing the nature of data discovery. We don’t just think in terms of records and files and objects anymore. We need to worry about extracting data from big fire hoses and lots of little straws.”

To this end, data tagging and cataloging techniques have been developed, often using machine learning, that can automatically create metadata about the data so it can be found more easily down the road.
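
What automatic tagging looks like mechanically is simple enough to sketch. In the example below, a trivial rule-based function stands in for whatever machine learning classifier an organization actually uses; the record fields and tag names are hypothetical.

    # Auto-tagging sketch: a rule-based stand-in for an ML classifier that
    # attaches metadata tags to each record. Fields and tags are hypothetical.
    def tag_record(record: dict) -> dict:
        text = str(record.get("payload", "")).lower()
        tags = []
        if "invoice" in text or "payment" in text:
            tags.append("finance")
        if record.get("source") == "clickstream":
            tags.append("marketing")
        record["tags"] = tags or ["untagged"]
        return record

    incoming_records = [
        {"source": "erp", "payload": "Invoice 1042 paid"},
        {"source": "clickstream", "payload": "product page view"},
    ]
    catalog = [tag_record(r) for r in incoming_records]

In production the rules would be replaced by a trained model, but the output is the same: every record carries tags that downstream systems can key off.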

“Most smart organizations try to do this in an automated fashion,” Brey explains. “This is where Kafka and other streaming applications analyze the streams coming in and that triggers certain processes to happen downstream based on the tags that are associated with the streams.

"This is a very important first step. Tagging the data upfront is a very valuable approach because it helps solve the downstream problem of finding the data later for use. This way, the data is tagged and it can move right into a database, data warehouse, or data lake. In machine learning, the concept is taken up a notch to a special kind of database, called a feature store. This prepares data, does some preprocessing on it, and stores it in a database for future consumption by machine learning training models as they go through the many iterations and algorithm changes.”

Data tagging and cataloging is the first step in data discovery, but once this has been done and there is a dizzying array of datasets, you still have to find the data you need. It is not as simple as pouring all of your data into an S3 object store and then layering Elasticsearch or IBM Spectrum Discover on top of it – although companies do this.
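
For what that pattern looks like in code, the sketch below lists the objects in an S3 bucket with boto3 and indexes their metadata into Elasticsearch so they can be searched by name, size, or date. The bucket, index, and endpoint are hypothetical, and the document= argument assumes an 8.x elasticsearch-py client.

    # Object-store-plus-search-index sketch: catalog S3 object metadata in
    # Elasticsearch. Bucket, index, and endpoint are hypothetical.
    import boto3
    from elasticsearch import Elasticsearch

    s3 = boto3.client("s3")
    es = Elasticsearch("http://localhost:9200")

    resp = s3.list_objects_v2(Bucket="data-lake")        # hypothetical bucket
    for obj in resp.get("Contents", []):
        es.index(
            index="object-catalog",                       # hypothetical index
            document={
                "key": obj["Key"],
                "size_bytes": obj["Size"],
                "last_modified": obj["LastModified"].isoformat(),
            },
        )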

This is necessary but not sufficient, because not all data that is useful to data scientists is in an object store. The pie-in-the-sky goal is to have a federated data service – think PrestoDB from Facebook and its Ahana and Starburst commercial variants, or the Tachyon data caching software from the AMPLab at the University of California at Berkeley and its Alluxio commercial variant – that can leave data where it is, in the format that it is already in, and run a query against it whether it is in a relational database, an object store, HDFS, or whatever.
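
To illustrate what such a federated query can look like, the sketch below uses the presto-python-client to join a table held in a relational database with one sitting in an object-store-backed Hive catalog, without moving either. The host, catalogs, schemas, and table names are invented for the example.

    # Federated query sketch in the PrestoDB style: one SQL statement spanning
    # a relational catalog and a data lake catalog. All names are hypothetical.
    import prestodb

    conn = prestodb.dbapi.connect(
        host="presto.example.internal", port=8080, user="analyst",
        catalog="hive", schema="default",
    )
    cur = conn.cursor()
    cur.execute("""
        SELECT c.customer_id, c.segment, sum(e.amount) AS total_spend
        FROM postgresql.crm.customers AS c   -- lives in a relational database
        JOIN hive.events.purchases AS e      -- lives in the data lake
          ON c.customer_id = e.customer_id
        GROUP BY c.customer_id, c.segment
    """)
    rows = cur.fetchall()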

If your data is a mess, and you really need to start thinking about how you do data discovery, a Kubernetes platform rollout is the perfect time to get this work done. And if you already have your data discovery act together, then Kubernetes will be able to leverage a lot of the work that you have already done.

“If you are smart about it, you will think about how to leverage Kubernetes not just to deliver the agility you need for your applications, but also for your data,” advises Brey.

“You are going to be dealing with a diverse, heterogeneous, distributed computing environment, and you have to build a data access and storage platform that has the same kind of agility and diversity. If you don’t get the data right, and you can’t find what you need to make a good decision, then all of this other work that you do essentially comes to nothing.”

Sponsored by Red Hat.