For big data achievements, IT and analysts need to work together

Session Name: Turning Your Data Into An Executive On Your Team.


Speakers: S1 Announcer, S2 Barb Darrow, S3 Phil Francisco, S4 Emile Werr


Up next we have Turning Your Data Into An Executive On Your Team. It's going to be moderated by Barb Darrow, Senior Writer with GigaOM, and she's going to be speaking with Phil Francisco, VP of Big Data Product Management for IBM, and Emile Werr, Head of Product Development and VP at the New York Stock Exchange. Please welcome our next panel.


Hi everybody, I'm Barb Darrow. I'm going to have my panelists introduce themselves again. I want to make this interactive, though, so we'll take a few questions up front, and then jump in if you have questions. You've got some heavy hitters here.


Hi, I'm Phil Francisco. I'm the Vice President of Product Management and Product Marketing in the Big Data Products area at IBM, and specifically a grizzled veteran of the data warehouse appliance business. I've been with the Netezza product and company for about 10 years now, so I've been on the structured side of the big data world for a while.


Hi, my name is Emile Werr. I've been with the New York Stock Exchange for nine years. Currently I head up Enterprise Data Architecture, and now a new area within the exchange called NYSE Big Data, where I head up product development.


And our topic for today is Making Big Data Part Of Your Executive Team. One of the themes that’s come out this morning is the need to blend big data expertise with human expertise. First I’d like Emile to talk about the implementation and then maybe get into where the human and the automated meet.


I'll start off by saying that we have a big data problem, and part of the problem is that the change is constant. For IT to keep up with the changes is too complex. So we really needed to decouple, get the business more online with the data, and drive a lot of the architecture from a data-movement and analytics perspective. When we started out, we built and formulated a team composed mostly of business analysts, trained them to be data architects, and then built a whole pipeline to get data in and out of analytical systems and platforms. Today we find the solution to be very powerful, robust and agile. We feel that financial services firms are coming to us with the same need. They have the same problem: they want to empower users, and they want to evolve very quickly. That's the real core reason why we're creating a commercial solution around it.


We're coming up with a whole data engineering practice around big data architecture, because this is really a differentiating factor versus the old business intelligence and data warehousing paradigm.


One of the other themes this morning was the use of big data for what might be called trivial pursuits. There was some reference to using big data to ascertain that people who like curly fries are smarter than people who don't. In this case you're talking about people's money and making markets; this is not a trivial pursuit. Maybe you can talk a little bit about how you worked with IBM to set this up.


Obviously, regulating the market is a big part of our business. We deal with billions of transactions per day, and doing that type of analytics requires technologies capable of that level of use case. We actually leverage a lot of the IBM product stack to integrate in real time, do the analytics we need to, and be agile in terms of the data process and the architecture, in terms of what it needs to support: things like market surveillance, real-time capacity monitoring, research, all the lines of business that actually utilize it.


We built a whole community. Today you're probably hearing a lot about data refineries and distilleries and landing zones. Well, we built the whole architecture around that paradigm, where data is moving in and out of fast engines like the Netezza platform. The power of what you have internally within the infrastructure, your network bandwidth, the systems, the MPP systems like the Netezza product and the PureSystems product, and the ability to do inline transformation and filtering on data in motion, is really powerful. So we're unlocking that and providing tools and capabilities where the end user can actually do the provisioning, if you will, the sandboxing and the true data science functionality.
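The inline transform-and-filter pattern Emile describes, working on records "in motion" rather than after they land, can be sketched in a few lines of Python. This is purely illustrative; the function names are invented and do not correspond to any IBM or NYSE tooling.

```python
def inline_pipeline(records, transform, predicate):
    """Apply a transformation and a filter lazily, as records stream
    through, instead of landing the full data set first."""
    for rec in records:
        rec = transform(rec)
        if predicate(rec):
            yield rec

# Example: normalize trade symbols and keep only large trades.
trades = [{"sym": "ibm", "qty": 900}, {"sym": "ge", "qty": 50}]
big = list(inline_pipeline(
    trades,
    transform=lambda r: {**r, "sym": r["sym"].upper()},
    predicate=lambda r: r["qty"] >= 100,
))
```

Because the pipeline is a generator, only records that survive the filter are ever materialized downstream, which is the point of filtering in motion.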


So like a mere mortal end user as opposed to a data scientist. Phil, maybe you can talk a little bit about when you go into a new account, what are the top three questions you get, both on the positive and negative side?


So I think there are a number of questions that we get, especially as clients begin to progress from the structured data analytics world they've gotten accustomed to over a period of time, maybe the last 10 or 15 years of building infrastructure to do some of that, and now face a growing set of data coming at them in higher volumes and faster than ever before, in unstructured form in addition to the more structured data. One of the biggest questions is, "How am I going to get a profitable return on the investment here, and what steps do I take? Can you, IBM, provide us with some guidance or best practices on how to get started?" Because when you first look at it, it might look like Mt. Everest that you're trying to climb. So we really preach having a flexible platform that allows for entry points from various places. You can come in from a data warehouse; you can come in using something like a Hadoop platform to do some of the processing you need to do. Emile referred to the notion of a landing zone, and certainly we're starting to see more and more cases of a Hadoop platform as a landing zone, where you bring in other types of data, explore that information and see what is pertinent to you.


Another piece that I think is critical to many clients when we talk to them is, "How do I make all this data consumable?" A colleague of Paul's from our team was talking in the breakout session earlier today about the consumable platform, so we really focus on this notion of a big data platform of products aimed at making things consumable to the client. One of those things is visualizing your data set: being able to find the non-obvious relationships in data that might sit in a whole series of repositories in your business. You might have 10 or 20 or even 100 different repositories. How do I find the non-obvious relationships there? By providing an enterprise search to go and find that functionality. But then when I pull it back to visualize it, because the datasets get so large, it's really difficult to just use standard reporting techniques. So those sorts of capabilities are another piece of the puzzle.


The last area that we really focus on, and that clients really focus on as they move from dabbling in Hadoop and that sort of structure to really engaging it in the enterprise, is, "How am I going to govern this data? How am I going to manage the data and the process in a way where I actually know the lineage of this information, where I can trace it back and know that I'm making strategic decisions in the business on the basis of trust in the data set I'm working with?"


One of the points that Paul Maritz made earlier today is that increasingly he sees companies worrying about how their competitors are using IP in new ways to wage war on them, and I think you guys are both in a position to talk about that, because you both have your customers. I'm curious how much you see that come up as a topic.


It's huge. We did a survey last year together with the Sloan School of Management at MIT, of CIOs around the world, and found that companies that were really engaged in using analytics were about 220% more effective than their peer groups. They saw top-line revenue growth that was 60% faster than their competitors, and earnings returns that were more than 200% better, because they were using analytics as a competitive weapon, a competitive lever in their industries. They were using it to drive predictive analytics on how their clients or customers were going to be using their products. They were using it to detect fraudulent use patterns. They were looking at other sorts of solutions where they were really driving hard analytics. And when we talk about big data, if we're not also talking about analytics on that big data, we're really doing a disservice to the client. I think that's really the important piece there.


I'd like to add that big data just from an archive perspective doesn't really make sense; being able to utilize it, and to put that prototyping capability in the hands of an end user to do what-if analysis and all sorts of discovery, is very important. The whole point is really how you connect all the various pieces of information in a timely manner with as minimal intervention as possible. And in the big data world it's a heterogeneous environment: you've got different structures, different velocities, different regions. That's the challenging problem. So really, to get it right, you establish some kind of metadata mapping that allows you to bring it all in and collocate it very quickly, analyze it, and put some kind of lease agreement in place with the user, so that the data can be purged out in an automated process. So really, automation and integration are a big piece of it.
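The lease-and-purge workflow Emile mentions, where a provisioned sandbox carries an expiry and an automated job reclaims it, might look something like the following minimal sketch. All names here are hypothetical; NYSE's actual tooling is not public.

```python
from datetime import datetime, timedelta

class SandboxCatalog:
    """Toy catalog of provisioned datasets; each carries a lease that,
    once lapsed, lets an automated job purge the sandbox."""

    def __init__(self):
        self.leases = {}  # dataset name -> lease expiry timestamp

    def provision(self, name, days=30):
        """Collocate a dataset for a user and stamp it with a lease."""
        self.leases[name] = datetime.now() + timedelta(days=days)

    def purge_expired(self, now=None):
        """Drop every dataset whose lease has lapsed; return their names."""
        now = now or datetime.now()
        expired = [n for n, exp in self.leases.items() if exp <= now]
        for n in expired:
            del self.leases[n]  # in practice: also drop the sandbox tables
        return expired
```

A nightly job would simply call `purge_expired()`, which is the "automated process" the panel refers to: users self-serve provisioning, and reclamation needs no IT intervention.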


And just getting back to the person angle again: we talked backstage about how you hired some of these people, and it wasn't programmers you were looking for?


No, because I believe that a lot of the data analytics that needs to be done... IT is very good at building out infrastructure. They're good at plumbing and making sure that stuff works. What IT will lack is the data domain knowledge, unless you're in data architecture, and those people are usually few and far between in an organization. Usually it's the business that understands; that's why the business is always involved in the early stages. So what we try to do is build the necessary capabilities where the developers are doing nothing more than providing integration, and platforms like Netezza, where simplifying is the whole goal, and users are actually doing the analytics and provisioning. To get to that model you have to retrain, re-school that type of infrastructure and organization. You have to support the business. So we have a data request team and a strategic analytics team, and their roles are really a cross between IT and business: they have more domain knowledge, and they have the expertise, through the technology stack we provide, to do very fast data provisioning and to build the right models and the right algorithms on top of those models.


And to know what they’re looking for?


Absolutely because big data is a problem. You can’t look at everything. You need to know how to refine and filter it to get it to a point where it’s manageable and you can discover and then you need to go through iterations of that.


I think Emile hit on an important part of that, and the important piece that underlies all of it is making the systems simple to deploy and manage, and getting them out of the way. What we really try to do, in terms of the expertise we build into the systems, is make them really easy to consume, so that his team can focus more on the data analytics and less on the tuning and management of the mundane day-to-day things they would normally have to deal with. So it's a fundamental shift: putting this expertise into the platforms to make it easy for companies to consume.


When I was looking at the summary of the session today, it talked about machine learning, machine data and automating these processes, and then I think stock trading and fast velocities. What's the circuit breaker mechanism? I mean, can you program that into…?


Absolutely. I mean, if you look at what happened after the flash crash, the reason it happened is that only one side had the right processes in place to put the circuit breaker on, and with Reg NMS and the trade-through policy, execution happens in another market. So being able to automate it is definitely evident in our environment; we do that all the time. Level-one surveillance is an automated process. Level two is where you actually start to analyze whether the case is really a false positive or negative, but it's all about patterns. It's all about being able to structure the data in such a way that you can perform the kind of pattern analysis and trending analysis that you need to do. And for a lot of these types of things, it's not just looking at what happened yesterday; it's looking at behavior over time, and in order to combine the two you really have no choice but to look at it as a big data problem.
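The circuit-breaker idea, an automated level-one rule that halts before a human level-two review, reduces to a simple threshold check over the tick stream. The sketch below is illustrative only; real market-wide circuit breakers use tiered thresholds and reference prices defined by regulation, not this toy 7% band.

```python
def circuit_breaker(prices, reference, threshold=0.07):
    """Return the index of the first tick whose price deviates from the
    reference by more than `threshold` (here an assumed 7% band), i.e.
    where an automated halt would fire, or None if nothing trips it."""
    for i, price in enumerate(prices):
        if abs(price - reference) / reference > threshold:
            return i  # level one fires; level two (a human) reviews it
    return None
```

For example, `circuit_breaker([100, 101, 92, 90], reference=100)` trips on the 8% drop at index 2, while a stream that stays inside the band returns `None`.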


And surveillance was your first use case here, right?


Surveillance was our first use case, but now we're doing a lot of stuff. One cool initiative we're working on is LMS, the latency monitoring system. As you all know, you barely hear about the exchange going down. We're a five-nines operation, and that doesn't come easy. To be five nines you have to be able to dynamically distribute your workload during the day, and to do that you have to plan for capacity. Just to give you a sense of the capacity problem we have: on NYSE Classic we trade about 3.5 billion transactions a day, and we deal with about five terabytes of data a day across all our matching engines. Two days after the flash crash we had almost an 8x spike in volume. So how do you actually build systems that can scale and not break? That's why we have real-time monitoring in place, and that's all about analytics and pattern recognition and looking at it over time.
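A simple version of that capacity monitoring is comparing today's volume against a trailing baseline and alerting when it spikes. A hedged sketch, with an assumed alert factor; NYSE's actual LMS thresholds are not public:

```python
from statistics import mean

def volume_spike(history, today, factor=3.0):
    """Compare today's message volume against the trailing average.
    Returns (alert, ratio): alert is True when today exceeds the
    baseline by `factor` (the flash-crash aftermath saw roughly 8x)."""
    baseline = mean(history)           # trailing daily volumes
    ratio = today / baseline
    return ratio >= factor, round(ratio, 1)
```

With a baseline of 3.5 billion transactions a day, a 28-billion day yields a ratio of 8.0 and trips the alert, which is what lets capacity be redistributed before anything breaks.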


Drill down a little more into surveillance: what are you looking for? You're trying to find fraud as a pattern?


In surveillance there are all sorts of rules that we have to make sure both broker-dealers and our market makers comply with. You have things like interpositioning, where a market maker sees a big order coming in, understands the interest on one side, and puts their own order in ahead of it; that's illegal. The whole point of trading is that it has to be conducted in a fair and orderly way.


And there could be fees?


Yeah, and obviously fees. The fees thing is interesting, because when the New York Stock Exchange was self-regulated, it was actually able to generate revenue out of the technology we built to go back and find all the various instances and cases, and we sort of found out that a lot of financial services firms have a fund for that. Now FINRA is regulating the markets; they're a nonprofit organization, and they run their IT operation based on having these types of fees and fines.


Does anyone have any questions for these guys? If you do, just shout, or go to the microphone; otherwise I'll just continue on. I have to ask this question because I'm a cloud reporter, and when I hear finance applications, I hear "private cloud, private cloud, private cloud," although there are a few exceptions. Can you guys talk about when public cloud will be ready for that, or when finance will be ready for public cloud?


Yeah, I think there are going to be specialized public clouds. I'll give you an example: market data. Everybody looks at the NBBO, which is basically liquidity in the markets, and everyone's trying to grapple with that, so it makes a lot of sense to build infrastructure and allow applications to integrate with it. The problem with the cloud, where a lot of data sits in data centers, is that we still don't have the right bandwidth across data centers; otherwise it's a great model. Another really good use case for cloud is where you're quickly evaluating some capability and want to reduce your costs, so you go with something like AWS, where you can try out some new pattern, some new architecture, or take some feed that you don't even want to ingest into your own environment. So you almost run your environment in another data center; that's where I'm seeing it. But another big thing that's happening is with the regulators that require data to be fed to them on a daily basis. The old way of delivering an FTP type of service doesn't work; now it's more about streaming. So think about two data centers that are tightly integrated and coupled, so that when you make a change in your environment, you can not only propagate the stream of data to them, you can also propagate any upstream changes to your underlying data structures. We found that to be really important, because FINRA obviously developed market surveillances, and we can't keep making changes to our business; the old way actually restricted us from evolving quickly. Providing a real solution for how data gets integrated, along with the metadata, allowed us to advance very quickly from a business perspective. So there's a lot of that type of stuff happening between cloud, multi-data-center integration, and specific solutions around specific data domains.
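The key point in that answer, propagating schema changes on the same stream as the data so a downstream consumer never falls out of sync, can be sketched as a toy protocol. The event format and function names here are invented for illustration; they are not FINRA's or NYSE's actual feed.

```python
import json

def publish(events, wire):
    """Serialize data records and schema-change notices onto one ordered
    stream, so the downstream site sees structural changes before the
    records that depend on them."""
    for ev in events:
        wire.append(json.dumps(ev))

def consume(wire):
    """Replay the stream: apply schema events first-come, then decode
    each data record against the current schema."""
    schema, rows = {}, []
    for raw in wire:
        ev = json.loads(raw)
        if ev["type"] == "schema":      # upstream DDL change
            schema[ev["table"]] = ev["columns"]
        else:                           # ordinary data record
            rows.append(dict(zip(schema[ev["table"]], ev["values"])))
    return schema, rows

# Usage: a schema notice followed by a record that relies on it.
wire = []
publish([
    {"type": "schema", "table": "trades", "columns": ["sym", "qty"]},
    {"type": "data", "table": "trades", "values": ["IBM", 500]},
], wire)
schema, rows = consume(wire)
```

Because ordering is preserved on the wire, an upstream column change reaches the consumer before any record that uses it, which is the property that let the exchange evolve without breaking the regulator's surveillance.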


Phil, you have any thoughts on that?


Well, I think if I look more broadly than the exchange and the trading platforms we've talked about, and think about financial services in general, there are areas where the governance of that data is going to become very important. What we'll start to see is that the application domains will probably lead the way into cloud deployments, and the analytics platforms, the ones actually driving the decision making, will follow over time. But I do think the public cloud is a place where people are experimenting with new techniques and looking at ways they could evaluate that data set as we go forward. And I do think we'll see pretty pervasive use of internal clouds and specialized clouds, as Emile pointed out, at least initially, where the performance and the kinds of capabilities you need to drive your business really predominate.


So we're in flux here. You guys lost your chance. One question? Okay. Thanks very much, I appreciate your time.