Saturday was Data Day Texas, a single-day conference covering a variety of big data topics up at the University of Texas’s conference center. I went in my HP Helion big data guy role, and my wife Irma went as a Python developer and PyLadies ATX organizer. I’ve written up some notes on the conference for those interested and unable to attend. As far as I know, there weren’t any recordings made, so these notes may be more useful than they would be for a better-archived conference.
The conference was held at the University of Texas’s Conference Center. It’s a nice facility, and probably appropriate for the number of people, but I think the place they hold Lone Star Ruby is a little friendlier. Conference organizers estimated the turnout at about 600 folks. From what I saw when presenters asked questions like ‘how many of you are x’, the audience breakdown was something like:
- 70% app developers (not clear how many were big data app vendors vs devs wanting to use big data)
- 10% data scientists
- 10% business types
- 10% ops people
The big takeaway was that landscape immaturity is a big deal. It’s forcing people to weigh trade-offs between the approaches they think are right and the ones with the most traction (the specific example was Samza vs Spark Streaming at Scaling Data), because nobody wants to commit to building out all the features themselves or get stuck with the also-ran. This is a problem for serious developers who want to architect or build systems with multi-year lifespans. Kafka got mentioned a lot as a glue piece between parts of data pipelines, both at the front and at the back. Everybody was talking about Avro and Parquet as best-practice formats, and there were lots of calls not to just throw CSVs into HDFS. There was a Python Data Science talk that ended on a somewhat gloomy note: the chance to build a core Python big data tool may have passed, and a lot of work will need to be done to stay competitive (slides at http://www.slideshare.net/wesm/pydata-the-next-generation).
The specific sessions I went to:
A talk that wandered through the ecosystem. Paco’s big into containers right now. Things he specifically called out as good:
He emphasized focusing on features, not algorithms as you develop your big data solutions. Don’t get tied to a model, as our practices are all around proving or disproving models. Build something that helps you build models.
Machine Learning: A Historical and Methodical Analysis (Historic, AI Magazine 1983)
He recommended the Partially Derivative Podcast, too.
Application Architectures with Hadoop – Mark Grover
Related to the O’Reilly book: http://shop.oreilly.com/product/0636920033196.do
Mark talked about likely tradeoffs weighed in building a Google Analytics style clickstream processing pipeline. Talked about Avro and Parquet, optimizing partition size (>1 gig data per day = daily partitions, <1 gig = monthly/weekly), Flume vs Kafka and Flume + Kafka, Kafka Channel as a buffer to ensure non-duplication, Spark Streaming as a micro-batch framework, and the tradeoffs of resiliency vs latency. I think the clickstream analytics example is one of the ones in the book, so if this is interesting and you want more details, just buy an early access copy.
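The partition sizing rule of thumb is easy to write down. Here’s a minimal sketch, assuming the 1 gig per day threshold from my notes; the function names and the Hive-style `year=/month=/day=` directory layout are my own illustration, not something from the talk.

```python
ONE_GB = 1024 ** 3  # threshold from the rule of thumb, in bytes


def choose_partition_granularity(bytes_per_day):
    """Pick daily partitions above ~1 GB/day, otherwise coarser (monthly) ones."""
    return "daily" if bytes_per_day > ONE_GB else "monthly"


def partition_path(base, year, month, day, granularity):
    """Build a Hive-style partition directory (layout is an assumption)."""
    path = f"{base}/year={year}/month={month:02d}"
    if granularity == "daily":
        path += f"/day={day:02d}"
    return path


busy = choose_partition_granularity(5 * ONE_GB)
quiet = choose_partition_granularity(200 * 1024 ** 2)
busy_path = partition_path("/data/clicks", 2015, 1, 10, busy)
quiet_path = partition_path("/data/clicks", 2015, 1, 10, quiet)
```

The point of the coarser partitions is to avoid piling up lots of tiny files when daily volume is low.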
A general talk about sensors, Arduino, and Hadoop. The demo was a tweeting IoT device, and Irma won it in the giveaway!
Hari talked about Spark Streaming’s general use cases. Likely flow was:
Ingest (Kafka/Flume) -> Processing (Spark Streaming) -> R/T Serving (HBase/Impala)
He talked about how Spark follows the DAG to re-create results as its fault-tolerance model. This was pretty cool, and an interesting way of thinking about the system. Because you know all the steps taken to create the data, you can re-generate it at any time if you lose part of it by tracing it back and running those steps on that data subset again. Spark uses Resilient Distributed Datasets to do this, and Spark Streaming essentially creates timestamped RDDs based on your batch interval (Default 2 seconds).
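To make the lineage idea concrete, here’s a toy sketch in plain Python. It is not Spark’s implementation or API; `ToyDataset` is a hypothetical class of my own that only remembers its parent and the function applied to it, which is enough to rebuild a lost partition.

```python
class ToyDataset:
    """A partitioned dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source_partitions=None, parent=None, fn=None):
        self.source_partitions = source_partitions  # only set on the root
        self.parent = parent                        # upstream dataset, if any
        self.fn = fn                                # per-element transform
        self.cache = {}                             # partition index -> values

    def map(self, fn):
        """Record a new step in the lineage; nothing is computed yet."""
        return ToyDataset(parent=self, fn=fn)

    def compute(self, index):
        """Return one partition, recomputing from the parent if it is missing."""
        if index not in self.cache:
            if self.parent is None:
                self.cache[index] = list(self.source_partitions[index])
            else:
                upstream = self.parent.compute(index)
                self.cache[index] = [self.fn(x) for x in upstream]
        return self.cache[index]

    def num_partitions(self):
        node = self
        while node.parent is not None:
            node = node.parent
        return len(node.source_partitions)

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


root = ToyDataset(source_partitions=[[1, 2], [3, 4]])
squared = root.map(lambda x: x * x)
plus_one = squared.map(lambda x: x + 1)
before_loss = plus_one.collect()

# Simulate losing partition 1 of both derived datasets.
del plus_one.cache[1]
del squared.cache[1]
recovered = plus_one.compute(1)  # re-runs only the steps for that partition
```

Only the lost partition gets recomputed; partition 0 stays cached and untouched.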
There’s good code reuse between Spark Streaming and regular Spark, since you’re running on RDDs in the same code execution environment. No need to throw your code away and start over if you want to do batch vs micro-batch.
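The reuse argument can be shown without Spark at all. In this plain-Python sketch (the record shape, function names, and 2-second window are my own assumptions), the business logic is written once and applied either to the whole dataset or to each micro-batch.

```python
from collections import Counter


def count_status_codes(records):
    """Business logic written once: tally HTTP status codes in a collection."""
    return Counter(record["status"] for record in records)


def micro_batches(records, interval_seconds):
    """Group timestamped records into consecutive fixed-width time windows."""
    batches = {}
    for record in records:
        window_start = record["ts"] // interval_seconds * interval_seconds
        batches.setdefault(window_start, []).append(record)
    return [batches[start] for start in sorted(batches)]


events = [
    {"ts": 0, "status": 200},
    {"ts": 1, "status": 500},
    {"ts": 2, "status": 200},
    {"ts": 5, "status": 404},
]

# Batch mode: one pass over everything.
batch_counts = count_status_codes(events)

# Micro-batch mode: the same function, applied per window, then merged.
per_window = [count_status_codes(batch) for batch in micro_batches(events, 2)]
streamed_counts = sum(per_window, Counter())
```

Both paths end up with the same totals, which is the whole appeal.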
On the container and microservices front, Paco recommended watching Adrian Cockcroft’s DockerCon EU keynote, State of the Art in Microservices. He then walked through an example using TextRank and PageRank as a way to create keyword phrases out of a connected text corpus (specifically Apache mailing lists).
He mentioned Databricks Spark training resources, which look extensive: http://databricks.com/spark-training-resources
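For a feel of how the keyword ranking works, here’s a much-simplified sketch in plain Python: build a word co-occurrence graph, then run PageRank over it. This is my own illustration of the general TextRank idea, not Paco’s code; the window size, damping factor, and sample sentence are assumptions, and real TextRank also filters by part of speech and merges adjacent keywords into phrases.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link each token to the tokens that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph of adjacency sets."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbor shares its rank equally among its own neighbors.
            incoming = sum(rank[nbr] / len(graph[nbr]) for nbr in graph[node])
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


tokens = "spark streaming uses spark rdds and spark streaming batches".split()
ranks = pagerank(cooccurrence_graph(tokens))
top_keyword = max(ranks, key=ranks.get)
```

Words that co-occur with many other words float to the top, which is what makes them keyword candidates.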
Kite is an abstraction layer between the engine and your data that enforces best practices (always use compression, for instance). It uses a db->table->row model that it calls namespace->dataset->entity. He mentioned that they’d seen little performance difference between using raw HDFS vs Hive for ETL tasks, all things considered. Use Avro for row based data (when you need context) and Parquet for column oriented data (when you need to sum/scan or only deal with a few columns).
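The row vs column distinction is easier to see with data in hand. This plain-Python sketch only mimics the two layouts in memory; it doesn’t use the Avro or Parquet libraries, and the sample records are made up.

```python
# Row-oriented: each record keeps all of its fields together (Avro-like).
rows = [
    {"user": "a", "url": "/home", "ms": 120},
    {"user": "b", "url": "/cart", "ms": 340},
    {"user": "a", "url": "/pay", "ms": 95},
]


def to_columns(records):
    """Pivot row dicts into one list per field (Parquet-like layout).

    Assumes every record has the same keys.
    """
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# Summing one field touches a single column and skips the others.
total_ms = sum(columns["ms"])

# Needing the full context of one event is a single lookup in the row layout.
second_event = rows[1]
```

On disk the same trade-off shows up as how much data you have to read to answer each kind of question.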
Building a System for Event-Oriented Data by Eric Sammer, CTO of Scaling Data
A great talk on practical problems building large scale systems. Scaling Data has built a product that essentially creates a Kafka firehose for the enterprise datacenter, re-creating a lot of tooling I’ve seen at Facebook and other places, and making a straightforward-to-install enterprise product out of it. They pipe stuff into Solr for full text search (à la Splunk), feed dashboards for alerts, archive everything for later forensics, etc.
He recommended this blog post by Jay Kreps at LinkedIn on real-time data delivery mechanics:
Said their biggest nut to crack was the two-phase delivery problem, guaranteeing that events would only land once. They write to a tmp file in HDFS, close the HDFS file handle and ensure sync, then mark as read in Kafka, then go process the tmp file.
Talked a lot about Summingbird. Said it was probably the right way to add stuff up, but that it was too big and gangly, so they’d written something themselves. He recommended this paper by Oscar Boykin on Summingbird that covers a lot of the problems building this kind of system.
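The ordering of those steps is the important part, so here’s a rough sketch of it. Everything in it is a stand-in: `FakeLog` is a hypothetical in-memory substitute for a Kafka partition, the local filesystem plays the role of HDFS, and naming the staging file after the starting offset is my own assumption, not something Eric described.

```python
import os
import tempfile


class FakeLog:
    """Hypothetical in-memory stand-in for a Kafka partition."""

    def __init__(self, events):
        self.events = list(events)
        self.committed = 0  # offset of the next unread event

    def poll(self, max_events):
        return self.events[self.committed:self.committed + max_events]

    def commit(self, count):
        self.committed += count


def deliver_batch(log, staging_dir, max_events=100):
    """Stage a batch durably, and only then mark it as read."""
    batch = log.poll(max_events)
    if not batch:
        return None
    # Offset-based name: a retry after a crash rewrites the same file.
    tmp_path = os.path.join(staging_dir, f"batch-{log.committed:012d}.tmp")
    with open(tmp_path, "w", encoding="utf-8") as handle:
        for event in batch:
            handle.write(event + "\n")
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes are on disk
    log.commit(len(batch))         # mark as read only after the sync
    return tmp_path


def process(tmp_path):
    """Downstream step: read the staged events back."""
    with open(tmp_path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


log = FakeLog(["e1", "e2", "e3"])
staging_dir = tempfile.mkdtemp()
first_events = process(deliver_batch(log, staging_dir, max_events=2))
second_events = process(deliver_batch(log, staging_dir, max_events=2))
```

If the process dies before the commit, the same events get polled again and staged to the same file, so nothing is lost and nothing lands twice.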
Also talked about Samza (best approach for the transform part of the pipeline, in their opinion, but low level and lacking community support), Storm (rigid, slow in their experience), and Spark (they hate it, but the community likes it, so they use it).
It was a harried conference, with no lunch break and no afternoon break; if you were feeling burned out, you had to skip a session. That might just be the nature of a one-day brain-binge. The organizers were happy to reserve a table for PyLadies in the Data Lounge, and they had a mini-meetup and got a little outreach done.