When someone says “we have a big data project”, I’m torn between enthusiasm and fear. In this post, I’ll write about my experience doing big data work.
As a researcher in distributed systems, the whole big data/NoSQL thing is like living in a candy store. Distributed computing is hot again. The design of some big data systems is tantalizingly beautiful. Percolator, Dapper, Spark,… These systems are so cool to work with.
As an architect, I would advise extreme caution when thinking big data. Many people see big data as an opportunity; I see it as a last resort. You don’t do big data for fun. It has hidden costs everywhere: big infrastructure, complex systems, specialized training, and so on.
Is it big data?
Conventional wisdom is that a data processing problem is big data if the data set doesn’t fit in the memory of a commodity server. Today this means more than 1 TB of data.
However, it is not always as straightforward as it seems. We were approached to solve the following big data problem: 1 TB of data was delivered every day and had to be processed within 24 hours. The actual processing took closer to 30 hours, even with the load spread over three beefy servers. All three servers were already running at 100% CPU usage. The coordination between the servers was done by custom scripts, which were becoming a problem all by themselves. The backlog was growing rapidly. If we didn’t find a solution fast, valuable data would have to be dropped.
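To put rough numbers on that backlog, here is a back-of-the-envelope sketch. It assumes that 1 TB means 10^12 bytes and that every daily batch takes the same 30 hours; the function names are mine and purely illustrative.

```python
# Back-of-the-envelope numbers for the case described above.
# Assumptions (not from the original project): 1 TB = 10**12 bytes,
# and every daily batch takes the same 30 hours to process.

TB = 10**12


def required_throughput_mb_s(bytes_per_day: int, window_hours: float = 24.0) -> float:
    """Sustained throughput (MB/s) needed to finish one day's data within the window."""
    return bytes_per_day / (window_hours * 3600) / 10**6


def backlog_hours(days: int, processing_hours: float = 30.0, window_hours: float = 24.0) -> float:
    """Hours of unfinished processing accumulated after `days` daily deliveries."""
    return max(0.0, processing_hours - window_hours) * days


if __name__ == "__main__":
    print(f"throughput needed: {required_throughput_mb_s(TB):.1f} MB/s")
    print(f"backlog after 30 days: {backlog_hours(30):.0f} hours of work")
```

Under these assumptions the backlog grows by six hours of work for every day of deliveries, which is why dropping data was becoming a real prospect.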
At first glance, this was a clear big data problem. We started on the preliminary work: we obtained sample data, a description of the process, and sample code, and started analyzing. We ran some benchmarks on the data and felt confident we could improve the process. We designed our architecture and started porting the code.
While porting, we noticed the following code in the input parser:
for line in input:
    result = re.match(COMPLEX_PARSING_REGEX, line)
    # do something with result
We replaced this with:
prog = re.compile(COMPLEX_PARSING_REGEX)
for line in input:
    result = prog.match(line)
    # do something with result
This resulted in a 20x performance increase. We spent an afternoon optimizing the input parser and achieved a whopping 60x increase in performance. The data processing, which had taken over a day on multiple machines, could now be run in a few hours on a single machine.
The problem was no longer big data, and no expensive big data system was required. Just a little more attention to the basics.
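If you want to check this kind of change on your own parser, a minimal timing sketch could look like the one below. The regex and the sample lines are hypothetical stand-ins, since the real COMPLEX_PARSING_REGEX and input data are not shown here.

```python
import re
import timeit

# Hypothetical stand-in for COMPLEX_PARSING_REGEX; the real pattern is not shown in this post.
COMPLEX_PARSING_REGEX = r"(\d{4})-(\d{2})-(\d{2}) (\w+) (\d+)"


def parse_uncompiled(lines):
    """Pass the pattern string to re.match for every line."""
    return [re.match(COMPLEX_PARSING_REGEX, line) for line in lines]


def parse_compiled(lines):
    """Compile once outside the loop, then reuse the pattern object."""
    prog = re.compile(COMPLEX_PARSING_REGEX)
    return [prog.match(line) for line in lines]


if __name__ == "__main__":
    # Made-up sample input, only there to have something to time.
    sample = [f"2015-01-{d % 28 + 1:02d} sensor {d}" for d in range(100_000)]
    for fn in (parse_uncompiled, parse_compiled):
        seconds = timeit.timeit(lambda: fn(sample), number=5)
        print(f"{fn.__name__}: {seconds:.2f}s")
```

How much you gain depends on the Python version, the pattern and the data, since the re module keeps its own cache of recently used patterns. Measure on your own input before and after.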
This taught me two lessons:
- There is no substitute for know-how (vakkennis). Anyone can make this kind of mistake until experience teaches them otherwise. Processes can help (peer review, performance testing, and so on), but in the end it is people who do the work.
- Exploit every other avenue before seeking big data. Big data comes with a price tag, and often there are cheaper ways out, even if you don’t see them.
Dream Yourself Rich
The project started with the words: “We have all this data sitting on devices in the field, if you give us the means to collect and process it, we can derive a lot of value from it.”
We built the system and started collecting data. Then came the words: “We have all this data now, what do we do with it?”
Just to be clear: there is no magic. Having the data does not mean money will start pouring out of it.
Dream Yourself Poor
A big data system was set up, data was collected and a data scientist was hired. We already had a few months’ worth of data. The data scientist started working on one day’s worth of data, on his laptop. He got some interesting insights. And then some, and some more. For the remaining months I worked on the project, he kept going through this tiny bit of data and kept finding things.
If your data is valuable, it might even be valuable in small quantities.
Even if you think big data, it pays off to start small. If you want to do machine learning, first do just that: get yourself a data scientist, feed him some data and see what comes out. If you want to sell specific data, get that data and sell it. If you have a business case, do business.
In big data projects, overhead is a big thing.
- Training cost. Very few people know big data. It requires a new way of thinking, and it is hard to train people for it. Getting a first conceptual understanding takes some work. Being able to actually use Spark or Storm is a lot more work. Getting good at it takes a very long time. And the training is organisation-wide: programmers, operators, data consumers, project managers, and so on. Everyone needs to know something about it to run it.
- Coordination overhead. Distributed computations require coordination. This makes them slower than the single-machine equivalent, often a lot slower. If you scale out from one machine to two, you may not get any benefit at all.
- Minimal footprint. Systems which work well at large scale often don’t work at small scale at all. They have a minimal footprint of N servers, so the cost curve is a step rather than a smooth line.
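The last two points can be made concrete with a toy model. This is only a sketch under my own assumptions: Amdahl’s law extended with a coordination cost that grows linearly with the number of machines, and a cluster that cannot run below a minimum number of nodes. The function names and all parameter values are made up for illustration, not measured.

```python
def speedup(n_machines: int, parallel_fraction: float, overhead_per_machine: float) -> float:
    """Speedup over one machine: Amdahl's law plus a linear coordination cost.

    Run time on one machine is normalized to 1. With n machines the parallel
    part shrinks to parallel_fraction / n, and every additional machine adds
    overhead_per_machine of coordination time (a toy assumption).
    """
    serial = 1.0 - parallel_fraction
    coordination = overhead_per_machine * (n_machines - 1)
    return 1.0 / (serial + parallel_fraction / n_machines + coordination)


def cluster_cost(nodes_needed: int, min_nodes: int, cost_per_node: float) -> float:
    """Cost when the system cannot run below min_nodes: you always pay for the floor."""
    return max(nodes_needed, min_nodes) * cost_per_node


if __name__ == "__main__":
    # Illustrative numbers only: 90% of the work parallelizes,
    # and each extra machine costs 5% of the single-machine run time.
    for n in (1, 2, 4, 8):
        print(f"{n} machines: speedup {speedup(n, 0.9, 0.05):.2f}")
    # A workload that needs one node still pays for a five-node minimum.
    print(cluster_cost(nodes_needed=1, min_nodes=5, cost_per_node=100.0))
```

In this model the speedup flattens and then drops as machines are added, and with a high enough coordination cost a second machine buys nothing at all.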
It is not all doom and gloom.
If you do have a big data problem, where you need to process more data or process it faster, there are solutions. In the past, there was a hard upper limit to what you could store and process. Now, the sky is the limit. The hard problems of distributed computing have been solved, and scalable systems are available, often as open source.
If you like this topic, this post is quite inspiring