With enterprise data growing rapidly and with business and regulatory demands requiring continuous data access, organizations must have a well-thought-out approach for keeping years of history online — with the ability to scale easily as they grow. Enterprise data is growing at a rate of 40 percent to 60 percent per year and projected to grow 50-fold — from under one zettabyte in 2010 to 40 zettabytes by 2020. A big data archive and analytics solution must be able to scale with the needs of the organization if it is to archive and analyze large volumes of new and historical data.
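As a rough consistency check on these figures, a 50-fold increase over the ten years from 2010 to 2020 implies a compound annual growth rate that can be computed directly. The sketch below is illustrative only: the function name is hypothetical, and the 0.8 zettabyte starting point is an assumption chosen so that 40 zettabytes is exactly 50 times larger.

```python
def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Return the compound annual growth rate that turns `start` into `end`."""
    return (end / start) ** (1.0 / years) - 1.0


# Assumed starting volume of 0.8 ZB in 2010 (hypothetical; the text says "under one").
rate = implied_annual_growth(start=0.8, end=40.0, years=10)
print(f"Implied annual growth: {rate:.1%}")
```

Under that assumption the implied rate is roughly 48 percent per year, which falls inside the 40 percent to 60 percent range quoted above.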
In the past, historical data was archived mainly for regulatory compliance purposes. Today businesses often analyze their historical data against current data sets so they can derive a competitive advantage and better understanding of their customers while also generating incremental revenue.
If businesses are to extract value from years of history and corporate memory, they must store data in a fully accessible database or data store with access methods that are standards-based so they don’t need to maintain a different set of skills and tools. For some organizations, combining current and historical data sets is optimal for providing organizational stakeholders with query access to production data warehouses and data archives.
However, data warehousing on tier-1 storage can be a costly proposition for an enterprise, not only because of the cost of storage hardware but also because of the software, i.e., database management systems. Additionally, there are the human costs required to define internal business processes, analytic models, types of analysis, and, of course, costs for integrating multiple source systems.
The financial scrutiny CIOs exert on organizational IT expenditures magnifies any cost inefficiencies. Today’s flat or decreasing IT budgets point to a more cost-effective data-archive approach that intelligently moves data from expensive tier-1 storage to inexpensive tier-3 or even tier-4 storage. This shift can achieve lower costs and long-term data retention while providing fast and granular access to the data repositories.
Finally, every C-level executive must be mindful of data governance and the security of critical organizational data assets. Big data analytics platforms like Hadoop assume a flat security posture, but a big data archiving and analytics solution must abide by stringent data security and regulatory compliance requirements. Not doing so can result in fines, penalties, and bad PR.
This research report explores today’s big data archive, in which analytical and compliance solutions are implemented by large organizations for the purpose of:
- Moving large, historical data from tier 1 or tier 2 to cheaper storage for improved efficiency and future scale
- Making data available, usable, and queryable to organizational stakeholders for easy lookups, analysis, and revenue-generating endeavors
- Providing robust data security and data retention capabilities that facilitate regulatory compliance and data governance audits.
Note: In this paper, anything more than 100 terabytes (TBs) and growing above 50 percent annually represents big data. An organization that manages 50 TBs of data in a proprietary enterprise data warehouse (EDW), and that has invested up to approximately $100,000 per terabyte to install, build, manage, and maintain it for a total of approximately $5 million, has an expensive environment but doesn’t necessarily have “big data.”
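The working definition and the cost arithmetic in this note can be expressed as a small check. This is a minimal sketch, not part of any product: the class and field names are hypothetical, and the 30 percent growth figure for the example EDW is an assumption, since the note does not give one.

```python
from dataclasses import dataclass

BIG_DATA_MIN_TB = 100        # size threshold from the note above
BIG_DATA_MIN_GROWTH = 0.50   # annual growth threshold from the note above


@dataclass
class Environment:
    size_tb: float
    annual_growth: float     # 0.50 means 50 percent per year
    cost_per_tb: float       # fully loaded cost in dollars

    def total_cost(self) -> float:
        """Total investment: size multiplied by cost per terabyte."""
        return self.size_tb * self.cost_per_tb

    def is_big_data(self) -> bool:
        """Both the size and the growth threshold must be exceeded."""
        return (self.size_tb > BIG_DATA_MIN_TB
                and self.annual_growth > BIG_DATA_MIN_GROWTH)


# 50 TB EDW at $100,000 per TB; the growth rate is an assumed value.
edw = Environment(size_tb=50, annual_growth=0.30, cost_per_tb=100_000)
print(edw.total_cost(), edw.is_big_data())
```

The example EDW costs $5 million in total yet fails the size test, which is the distinction the note draws between an expensive environment and a big data environment.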
2 Situational analysis: traditional big data analytics and archiving
In 2013, TED speaker, bestselling author, and Duke University professor Dan Ariely referred to big data as the “crude oil” of the new millennium — hugely valuable but useless if unrefined. This need for data refinement has prompted a rush to put data to better use than was possible or practical with legacy technology.
The big data analytics and archiving stack includes software, database, and storage hardware. Today organizations have at least four deployment choices:
- Build your own Hadoop cluster by leveraging open source Apache Hadoop
- Use an out-of-the-box Hadoop distribution from Cloudera, Pivotal, Hortonworks, etc.
- Store data in an EDW, e.g., Oracle Exadata, Sybase IQ, Teradata, etc.
- Use a data warehouse or analytics appliance from Netezza or Teradata
These big data analytics platforms can be deployed on a variety of storage and hardware options. Hadoop’s attractiveness is largely driven by its ability to run on low-cost commodity servers that are generally deployed in a direct attached storage (DAS) framework. Other options for low-cost scale depend on the sensitivity of the data set being managed and include scale-out network attached storage (NAS) or using a public or private cloud.
Appliances generally lack built-in data retention and disposition capabilities. Scaling a proprietary appliance is extremely difficult. Furthermore, adding storage nodes is fiscally irresponsible, especially when they still don’t solve the performance equation. The new operational expenditures associated with managing additional storage nodes and constantly migrating data between nodes dramatically increase the total cost of ownership (TCO) when managing and analyzing big data volumes.
The reality for most organizations today is that enterprise analytics data is managed with a combination of technology solutions increasingly made up of a Hadoop-based platform alongside a traditional EDW or appliance and potentially an archive that runs on a relational database management system (RDBMS) or another Hadoop framework. Today’s warehouse is often called a “warehouse of data.” A one-stop repository for all analytic workloads no longer exists.
Organizations often are caught up in the hype and try to implement Hadoop to solve their big data analytics equation, but Hadoop open source projects are not enterprise archive solutions because of their limitations on security, high availability, data triplication, etc.
How did big data happen?
Numerous reasons account for the 40 percent to 70 percent annual data growth in big data:
- People don’t delete data because they fear they might be deleting critical data.
- Unstructured data has experienced huge growth. According to Computerworld, it accounts for 70 percent to 80 percent of all data today and includes books, journals, documents, metadata, health records, medical image files, audio, video, analog data, streaming instrument data, webpages, PDF files, PowerPoint presentations, emails, blog entries, wikis, and word-processor documents, to name only a few.
- Structured and semi-structured data that defines which fields of data will be stored and how that data will be stored has experienced unprecedented growth. This category includes data type (numeric, currency, alphabetic, name, date, address) and any restrictions on the data input (number of characters, restricted to certain terms such as Mr., Ms., or Dr., M or F).
o Word-processing software now can include metadata showing the author’s name and the date created
o Emails have the sender/recipient/date/time
o Photos or other graphics can be tagged with keywords such as the creator/date/location/keywords, making it possible to organize and locate graphics
o XML and other markup languages from documents
o Social data from Facebook, Twitter, and other online forums
- Data must be retained longer for regulatory compliance, e.g., SEC 17a-4, Dodd-Frank, HIPAA, HL7, etc.
- Machine-generated/sensor data is also growing. This category includes communications network data, financial market trade data, and automated data generated from jet engines, automobiles, appliances, home alarm systems, satellites, utilities grids, etc.
Big data archiving
Traditional data warehouse systems are beginning to sag under the weight of the data that’s flooding into organizations, especially when business users continue to demand access to raw, detailed data records. This phenomenon has resulted in ballooning operational expenditures. Companies are now focusing on reducing their data footprint, storage consumption, and spend by:
- Creating online archives for long-term data retention, either on-premise or in the cloud
- Moving data from expensive Flash/SSD-based tier-1 storage to cheaper HDD-based SAS and SATA-based tier-3 and -4 storage, known as data-tiering
Active data archiving
An active data archive is quite different from a data archive that has been moved to tape drives. Offline tapes are probably the most common big data deep archive approach because they are deemed low-cost. However, tapes are not query-accessible; they are error-prone, they lack security, and the tapes can be lost, damaged, or stolen. Data on offline tape is like data in deep freeze storage that requires a time-consuming “thaw” (i.e., time to be brought online to a server or storage device that can run that particular version of the tape). For some organizations, this laborious process can take weeks. Often the business opportunity is lost before the data is “thawed.”
Active data archiving, which makes historic data available for real-time analytics, is different. It arose from the realization that analyzing historical data and possibly combining it with current or real-time data could generate revenue. An active archive is the repository for data that, once transacted or written to disk, changes infrequently. Because such data is read often but not modified, it is often referred to as WORM (write once, read many), which essentially means that, if stored on a WORM storage device, records can no longer be updated or edited. In fact, the business often demands that this data be archived with the ability to log or track any further activity, such as a query or delete. Having this historical data easily available for all types of analytics enables meaningful business insights.
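The WORM behavior described here, combined with tracking of every later query or delete, can be illustrated with a toy in-memory store. This is a sketch under stated assumptions: the class and method names are hypothetical, the store lives only in memory, and deletes are always refused, whereas a real archive would typically allow disposition once a retention period has expired.

```python
import time


class WormArchive:
    """Write-once, read-many store with an audit trail (illustrative only)."""

    def __init__(self):
        self._records = {}
        self.audit_log = []  # (timestamp, action, record_id) tuples

    def _log(self, action, record_id):
        self.audit_log.append((time.time(), action, record_id))

    def write(self, record_id, payload: bytes):
        # A record may be written exactly once; later writes are refused.
        if record_id in self._records:
            self._log("rejected_update", record_id)
            raise PermissionError(f"record {record_id!r} is immutable")
        self._records[record_id] = payload
        self._log("write", record_id)

    def read(self, record_id) -> bytes:
        # Every query is logged before the record is returned.
        self._log("query", record_id)
        return self._records[record_id]

    def delete(self, record_id):
        # The attempt is tracked, then refused (assumption of this sketch).
        self._log("rejected_delete", record_id)
        raise PermissionError(f"record {record_id!r} cannot be deleted")
```

Keeping the audit entries separate from the records means that reads never modify the archived data itself, while every access remains traceable.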
Hierarchical storage management (HSM) systems, which migrate data from faster storage to slower storage (based on metadata information such as the date last accessed and the date last modified), don’t create an archive. Instead, they help identify candidate data for inclusion in a deep or active archive. They are also adept at identifying temporary data that can be deleted once its usefulness has been depleted.
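The candidate-identification step can be sketched with file system metadata alone. The function name and the one-year idle threshold below are hypothetical, and the sketch assumes the file system records access times reliably, which is not the case on volumes mounted with access-time updates disabled.

```python
import os
import time


def archive_candidates(root, min_idle_days=365.0, now=None):
    """Yield files under `root` neither accessed nor modified for `min_idle_days`."""
    now = time.time() if now is None else now
    cutoff = now - min_idle_days * 86400  # seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            # Qualifies only if both timestamps are older than the cutoff.
            if max(info.st_atime, info.st_mtime) < cutoff:
                yield path
```

The function only reports candidates. Deciding whether a file goes to a deep archive, an active archive, or deletion remains a policy decision outside this sketch.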
To Hadoop or not to Hadoop
Hadoop-based big data platforms were once thought to be ideal for performing analytics against voluminous data across a wide variety of data types. Until fairly recently, they have been perceived as potential data warehouse “killers.” That sentiment has largely given way to acceptance of their peaceful coexistence alongside other enterprise data infrastructures, including RDBMSs. For example, 78 percent of 263 IT professionals, business users, and consultants surveyed by The Data Warehousing Institute (TDWI) in Renton, Washington, in November 2012 responded that they thought Hadoop systems could be a useful complement to their data warehouses for supporting advanced analytics applications. In addition, 41 percent saw Hadoop as an effective staging area for information on its way to an EDW. Asked if Hadoop clusters could fully replace an EDW, more than half of the respondents replied no. Only 4 percent responded yes.
Hadoop’s attractive low-cost scale combined with its growing open source community and contributors is fortifying its place as the enterprise analytic data hub. Hadoop systems and NoSQL databases can serve as “a sort of loading dock” for raw data. Often data scientists work with that data to detect patterns or anomalies so they can understand what is actually going on instead of asking a set of predetermined questions that are typically applied to a data warehouse. At some time after it has undergone initial analysis, data scrubbing, or cleansing, a portion of the data set (or all of it) may be moved to a traditional data warehouse. That data may then be integrated with data in the existing warehouse for further insights. Of course, the analytics and BI platforms for each company will vary and the set of processes and policies will differ with business requirements.
The most common knock on Hadoop 1.x, which coupled the Hadoop distributed file system (HDFS) with the MapReduce parallel programming model, was that its batch-oriented format limited its use in interactive and iterative analytics. It nearly eliminated the possibility of using the technology in real-time operations. This architecture effectively disqualified Hadoop as a data discovery tool capable of accommodating iterative queries and made it unusable for most business users.
Hadoop 2 changes that, principally by inserting Yet Another Resource Negotiator (YARN). YARN is a rebuilt cluster resource manager that ends Hadoop’s total reliance on MapReduce and its batch processing format by separating the resource management and job scheduling capabilities handled by MapReduce from Hadoop’s data processing layer. As a result, MapReduce becomes just one of many processing engines residing on top of YARN in Hadoop clusters that can grow in use and prominence. YARN opens the door for other programming frameworks and new types of applications. Some organizations are adopting stream processing engines, such as S4, Storm, and Spark Streaming, with the goal of giving Hadoop more real-time chops.
Hadoop 2.0 also supports the federation of HDFS operations and configuration of redundant HDFS NameNodes, thus increasing scalability and eliminating the single point of failure that plagued the original release.
Hadoop and its ecosystem of open source projects have matured greatly over the last three-plus years. Adoption is definitely on the rise. Cloudera and Hortonworks now have a wide range of technology and solution partners that certify and run on their respective distributions (currently Cloudera 5 and Hortonworks 2.1). As the Hadoop platform gains more adoption and a wider range of use cases, more providers will build unique and value-added capabilities on top of that stack. It is the Apple “apps” model but for the enterprise.
Hadoop is not plug-and-play. It requires programming expertise and investment, which many businesses lack. Several Hadoop vendors are trying to eliminate the real-time analytics restrictions. Cloudera’s Impala query engine facilitates interactive SQL queries against Hadoop data with near real-time performance. Pivotal, a data management and analytics spinoff from EMC Corp., now also offers a similar query engine named HAWQ. Splunk captures streams of machine-generated data and offers a Hadoop data analysis tool called Hunk.
Hadoop was developed as a batch-oriented system. Its real-time capabilities are still emerging. Since Hadoop’s inception, many have tried remaking it into a real-time system through the HBase database, SQL interfaces like Impala and Hive, streaming data engines like Storm, and in-memory frameworks like Spark.
However, these technologies do not appear ready to match the real-time analytic capabilities that in-memory NoSQL and NewSQL databases deliver. Hadoop can’t compete when informing decisions against big data sets with latencies measured in milliseconds.
Hadoop security is very flat and lacks sophisticated security features like identity and access management (IAM), role-based authorization, legal hold, Kerberos authentication, Linux-PAM support, snapshots, data masking, etc. Therefore, the security posture of the Hadoop distributions is typically augmented with a security solution from Protegrity, Voltage Security, etc., or open source security tools such as Sentry. Furthermore, Hadoop containers are completely open, and anyone can access and use their content. This compromises security.
As Hadoop adoption widens, organizations will want to store high-value, sensitive data, such as credit cards, social security numbers, or personally identifiable information (PII). This data must adhere to the same rigor and policies that relational database management systems follow. Because Hadoop is not standards-based, it does not adhere to the widely adopted security and trust practices that a bank or financial institution must follow.
With Hadoop, massive numbers of multiple systems are scaled together, each with its own internal storage. Having the internal storage on each of those systems requires a metadata server that manages all the metadata between those internal systems and provides the cache coherency so all the systems can see the data on all the other environments. Having a metadata server on traditional storage solves the metadata problem of that environment.
Once users become reliant on the Hadoop-based platform for fast analytics or data transformation, that platform must be highly available with built-in disaster tolerance features. Today, Hadoop requires 3X replication, which creates more data to manage and therefore more storage. As a result, it increases management-related operational expenditures.
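The storage effect of 3X replication is simple arithmetic, shown here as a minimal sketch. The function names are hypothetical, and the 100 TB figure is an illustrative input rather than a number from this report.

```python
def raw_storage_tb(logical_tb: float, replication_factor: int = 3) -> float:
    """Physical capacity consumed when each block is stored `replication_factor` times."""
    return logical_tb * replication_factor


def replication_overhead_tb(logical_tb: float, replication_factor: int = 3) -> float:
    """Extra capacity needed beyond a single copy of the data."""
    return raw_storage_tb(logical_tb, replication_factor) - logical_tb


# Illustrative input: 100 TB of logical data.
print(raw_storage_tb(100), replication_overhead_tb(100))
```

With these inputs, 100 TB of logical data occupies 300 TB of raw capacity, of which 200 TB exists only to provide redundancy.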
Hadoop projects generally start small. In the early stages, they could be classified as “sandboxes.” Once the developer-based community becomes hooked — demonstrating what is possible and what provides value — businesses must take a more serious look at the key drivers and the critical technology requirements to enable them.
3 Business challenges of analytics big data archiving
Today’s businesses are challenged to tackle increasing proportions of unstructured, structured, and semi-structured data that must be managed, identified, properly archived, accessed, and analyzed in a timely, efficient, and cost-effective manner.
Time-to-information (TTI) is the most critical criterion that businesses use to benchmark their big data analytics platform. Businesses will question a Hadoop-based initiative that takes multiple years to get started. Determining which Hadoop distribution to deploy often takes a few quarters. Getting a cluster up and running is another big hurdle.
About five years ago, big data was defined by the three Vs — volume, velocity, and variety — otherwise known as growing disparate data sets, created at very high rates of speed. Today, enterprises talk more about the business challenges and drivers for harnessing information or value from enterprise data. These are the three Cs.
The three Cs of big data management
- Cost. Most big data analytics solutions require a handsome capital expenditure upfront, followed by high TCO. Traditional data warehousing solution costs can vary widely depending on provider and the additional professional services required. The range can be as wide as $20 to $100,000 per terabyte of data. The wrong big data analytics solution can easily balloon operational expenditures and quickly turn into a management nightmare that drastically inflates the TCO.
- Complexity. Most big data analytics solutions are complex (time- and resource-intensive) and take months to years for full production deployment.
- Compliance. Regulatory compliance adherence is a costly and resource-intensive “must-do” for many organizations. External mandates, which vary by industry sector, include Dodd-Frank and SEC 17a-4 for financial services or HIPAA in healthcare. Internal data governance policies include such requirements as “the legal team needs data access for 50 years in the event of any lawsuit.” A big data archiving and analytics platform must adhere to both regulatory compliance and data governance guidelines because most dictate continuous data access as well as privacy, integrity, and trust to protect against breaches or attacks. Inability to meet these requirements can result in fines, loss of revenue or customers, inability to provide service, and a negative impact on a company’s reputation or stock price.
Big data analytics and archiving solutions that satisfy the three Cs pass the corporate litmus test and help overcome the inertia of switching to a different storage tier or selecting the right solution for the job. Archiving enterprise data is not a straightforward task. An organization can’t simply load and store data in a traditional relational database and expect to conquer any of the three Cs challenges. That requires a uniquely architected and designed solution.
Business stakeholders have an absolute desire to retain more data sets for longer timeframes. Without question, data governance and compliance regulations enforce increased stringency, specifically focused on how data is stored and accessed.
4 Technical challenges of analytics big data archiving
The transformative wave of big data and its related technologies has ushered in an era of scale in which the amount of data to be processed and stored is breaking down every architectural construct in the database management and storage industry. Flexible data center architecture is critical to the success of big data analytics deployments.
A successful and cost-effective big data archiving implementation has very specific requirements:
- Data reduction. As data history grows, efficient storage alone does not solve the problem. Data compression and de-duplication upon load are important not only to accelerate analytics but also to minimize the storage footprint and maximize usable capacity.
- Fault tolerance. This is the need for data replication to protect against any outage and provide continuous uptime or the shortest time to recovery point.
- Interactive query response. This is the ability to perform interactive analysis — rather than work in batch mode — while using the standard query tools that the business has come to understand and expect.
- Security and compliance. This includes built-in data privacy and trust aspects in addition to data masking and encryption as well as data-related retention and disposition capabilities.
- Elastic storage capacity. This is scale-out storage architecture that is flexible and elastic to accommodate expanding data requirements organically.
These features are not “nice to have.” They are “must-haves” because the absence of even one can dramatically impact storage costs, introduce latency on analysis response time, or increase opex. The need for additional tools and software can also introduce complexity and longer time to value.
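The data reduction requirement listed above, de-duplication and compression upon load, can be sketched in a few lines. This is an illustrative toy, not a description of any vendor's method: the class name is hypothetical, de-duplication is done at whole-record granularity with SHA-256 digests, and compression uses the standard zlib module.

```python
import hashlib
import zlib


class ReducingStore:
    """Stores each distinct record once, compressed (illustrative only)."""

    def __init__(self):
        self._blobs = {}    # digest -> compressed bytes
        self._index = []    # one digest per loaded record, in load order
        self.raw_bytes = 0  # total size of everything loaded

    def load(self, record: bytes) -> None:
        self.raw_bytes += len(record)
        digest = hashlib.sha256(record).hexdigest()
        # De-duplicate: compress and keep only the first copy of a record.
        if digest not in self._blobs:
            self._blobs[digest] = zlib.compress(record)
        self._index.append(digest)

    def get(self, position: int) -> bytes:
        """Return the record loaded at `position`, byte-for-byte as loaded."""
        return zlib.decompress(self._blobs[self._index[position]])

    @property
    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())
```

Because every loaded record can be reconstructed exactly, the reduction does not affect the integrity of the original data. The ratio of `raw_bytes` to `stored_bytes` gives the achieved reduction for a given data set.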
5 Regulatory compliance and audit challenges
Any CIO, chief risk officer, or chief data officer worries about meeting compliance mandates and facilitating audits. Taking a disciplined approach to data archiving by building an effective governance framework will result in a strong organizational compliance posture.
Regulated organizations may need to keep data for a very long time — even forever. They must also be able to find the data quickly when they need it for reporting or during an audit. For many companies, failing to stay compliant can be a death knell because it could prevent them from providing services. In the most extreme cases, this could result in damage to the brand, reputation, and stock price.
Every vertical has regulatory, geographic, and service compliance mandates. This section highlights some of the challenges for the banking, financial services, and insurance sectors as well as the communications verticals.
1. Financial services (FS), banking, and insurance. Particularly since 2009, this has been the most regulated industry sector. A plethora of regulations apply.
- Payment Card Industry Data Security Standard (PCI-DSS). Protects credit card and PII data
- Gramm-Leach-Bliley Act (GLBA). Requires that financial institutions explain their information-sharing practices to their customers
- Sarbanes-Oxley Act (SOX) Section 404. Requires top management to individually certify the accuracy of financial information. This law applies to many other publicly traded entities and not just financial services.
- Fair Debt Collection Practices Act (FDCPA). Prohibits abusive, deceptive, and unfair debt collection practices
- FTC Red Flags. Requires detection of warning signs for identity theft
- Basel I, II, and III. Strengthen bank capital requirements by increasing bank liquidity
- FISMA. Strengthens information system security
- SEC 17a-4(f)
- 2009 Dodd-Frank. Updated existing SEC regulations in the 17a-4 rule, which dictates that:
o Records must be kept in a secure format
o Data disposition must be rules-based
o Recording process must be verifiable
o Records must be recognizable and identifiable
o Records must be downloadable to any acceptable storage medium
o Records must be fully accessible to authorities
o Records must be backed up
o Records must be stored in a WORM format that is not deletable
o A complete audit trail must be maintained
The vast majority of retained financial data is structured and semi-structured. It can be more efficiently stored as a result of high data de-duplication, which can be an overall reduction of more than 90 percent. As long as the data is retained in a form that is fully queryable and maintains its integrity from the original transaction, it can be compressed or reduced for more efficient storage. Many regulatory mandates require that financial transactional data be stored forever on non-erasable media.
A major priority for financial services organizations is deploying and future-proofing systems that will manage extreme data volumes and scale efficiently. Standard & Poor’s reported that spending by the banking/financial services sector on regulatory compliance will reach $50 billion in 2014, seriously impacting profit margins. This number is expected to rise over the next few years. Approximately 20 percent of this spend is on technology-related aspects; the rest is spread across human resources, training, internal processes, legal teams, and audit costs.
Most financial services organizations are trying to get ahead of the compliance “marathon” and have realized that being proactive is the best course of action. Capturing illegal activity or an anomaly before the regulators see it is the goal. The first step to this proactive behavior is the retention of critical data in the most efficient and cost-effective form and the implementation of end-to-end technology solutions that have been designed for that purpose.
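The rules-based data disposition requirement listed above can be sketched as a simple policy check. Everything here is hypothetical: the record types, the retention periods, and the function name are illustrative assumptions and are not taken from any regulation. The sketch treats a retention value of `None` as "retain forever" and lets a legal hold override any rule.

```python
from datetime import date, timedelta

# Hypothetical retention rules in days; None means the data is kept forever.
RETENTION_DAYS = {
    "financial_transaction": None,
    "web_session_log": 365,
}


def may_dispose(record_type: str, created: date, today: date,
                legal_hold: bool = False) -> bool:
    """Return True only if a rule explicitly allows disposing of the record."""
    if legal_hold:
        return False  # a legal hold always blocks disposition
    if record_type not in RETENTION_DAYS:
        return False  # no rule defined: retain by default
    days = RETENTION_DAYS[record_type]
    if days is None:
        return False  # retain forever
    return today >= created + timedelta(days=days)
```

Defaulting to retention whenever no rule applies is a deliberately conservative choice for this sketch, since premature deletion is usually the costlier error in a regulated environment.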
2. Communication service providers (CSPs)/telcos. Structured data is growing dramatically in communications environments because of geographic-specific mandates that demand ongoing retention of usage data, including call detail records (CDRs). Communications data becomes historical almost immediately after it is generated — think smartphone detailed records. For that reason, it could be stored in an instant archive. If the data doesn’t need to undergo any further updates, a repository — essentially the archive — is often what is needed.
Various countries and regions have different legislation with their mandated data retention timeframes (e.g., European Data Retention Directive, CALEA in the U.S., etc.). These mandated timeframes range from 90 days to an average of one year and sometimes up to a maximum of 10 years (e.g., Middle East equals 10 years, APAC is seven years, etc.).
Legislation usually dictates that CSPs cannot change or mask communications data before they retain it. Furthermore, typical record sizes for xDRs, including fixed/mobile CDRs, internet protocol detail record (IPDR), SMS/MMS/email metadata, etc., range from 200 bytes to 1,000 bytes, depending on the communication protocol. SMS/MMS/email retention can also include requirements for storing the message body as well as attachments (video, images, .wav files). This can quickly have an effect on the size of datasets and result in inflated opex.
Regulations require CSPs to retain a broader set of data amid enormous growth. The data retention systems they must deploy (by mandate) are already a sunk cost adding no business value because the data and systems are generally ring-fenced and cannot be used for other purposes. Therefore, the priority is to minimize the TCO of such systems.
- Lawful interception. Government agencies obtain communications network data, i.e., fixed/mobile calls, SMS/MMS messages, and internet activity (IM, webmail, VoIP, etc.), for the purpose of investigating criminal or terrorist activity. Communications metadata (not content) must be retained for an extended period to assist law enforcement and intelligence agencies with their investigations. Lawful interception also requires that CSPs allow law enforcement agencies (LEAs) to have access to their networks so they can monitor an individual’s communications in real-time (a.k.a., wire-tapping) when they present a warrant. The format of the request and hand-over of the data are strictly controlled and determined by legislation.
- Mass interception. CSPs provide intelligence agencies a feed of all communications to their monitoring centers. Depending on the jurisdiction and legislation, intelligence agencies retain varying amounts of communications data. This could include the content of calls, messages, and emails. Generally intelligence agencies use mass interception systems to prevent terrorist acts or organized crime or to react quickly to prevent the escalation of a specific incident.
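The record sizes and retention timeframes quoted above for CSPs translate into storage volumes through straightforward multiplication. The sketch below is a back-of-the-envelope estimate only: the function name is hypothetical, the figure of one billion records per day is an assumed workload that does not come from this report, terabytes are decimal (10^12 bytes), and ten years is approximated as 3,650 days.

```python
def retained_terabytes(records_per_day: int, record_bytes: int,
                       retention_days: int) -> float:
    """Uncompressed volume of metadata records held for the retention period."""
    return records_per_day * record_bytes * retention_days / 1e12


ASSUMED_RECORDS_PER_DAY = 1_000_000_000  # hypothetical workload

low = retained_terabytes(ASSUMED_RECORDS_PER_DAY, 200, 90)       # small records, 90 days
high = retained_terabytes(ASSUMED_RECORDS_PER_DAY, 1000, 3650)   # large records, ~10 years
print(low, high)
```

Under these assumptions the retained volume ranges from 18 TB to 3,650 TB before any message bodies or attachments are included, which illustrates how strongly the mandated retention period drives storage cost.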
CSP big data archiving requirements
The top priority for most CSPs is to deploy and future-proof systems that can manage today’s extreme data volumes and scale efficiently for the future. Additionally, keeping mass intercept systems operational, available, and accessible at all times is critical. Other priorities include reducing the storage footprint, reducing opex associated with managing these datasets (volume management and data migrations come into play here), and fast response times. Fast response times include the ability to ingest and analyze immense data sets and deliver a timely response to LEAs or other regulatory bodies.
All large public organizations with stringent data governance mandates, regardless of the vertical, require a prudent big data archiving solution that is specifically for their high-value analytics data assets and that has an immutable data model that conforms to compliance mandates. Additionally, data ingest/import, data retention rules, secure access, fast query, built-in audit trails, and data protection (encryption, backup, and disaster recovery) all come into play when considering a long-term store or archive of the corporate memory.
6 Real-world benefits of big data archiving and analytics
The big data active archive is emerging as the go-to technology for the efficient management of growing data assets. Today’s low-cost and scalable solutions deliver storage infrastructure efficiency, satisfy compliance demands, and enable the business to utilize data as the strategic corporate asset that it is.
Data insights, courtesy of big data archives and analytics, can:
- Foster new business revenue
- Provide competitive advantage
- Pinpoint and link buying patterns or marketing trends that improve customer experience and quality of service
- Detect fraud
- Identify likely hotspots for failure before it happens
Depending on the vertical, the value can range from the ability to transform healthcare data into actionable intelligence that providers and patients can act upon to financial services drilling deeper into a report to get details about a customer’s accounts, transactions, records, and overall value to the business.
A combined software and hardware solution that directly addresses these challenges involves several core technology differentiators:
- Seamless scalability. This becomes possible when the storage file system possesses a large storage container consisting of a single volume to avoid the high operational expenditures associated with volume management (when data is parsed across multiple storage volumes) and data migration. In addition to the storage hardware, the database must also scale out and allow businesses to focus on the application and not the “plumbing” of the data itself. Otherwise, human capital is required to manage the database-scaling task.
- Performance. Time-to-information (TTI) can be the differentiator between staying in business and going out of business. Most big data archiving and analytics solutions are judged on the faster response times that come from parallel processing of queries.
- Cost. Failing to scrutinize IT spend closely is fiscally irresponsible. This cost must be measured in the TCO equation. TCO is not just the capital outlay for an initial purchase or the cost per terabyte to archive and analyze the data; it is also total cost over the entire lifecycle of a solution along with data center space and human resource factors.
- Efficiency. Deep archiving has historically utilized tier-2 and tier-3 storage or tapes because the data had to be kept for compliance. Companies that employ active archiving load historic data from existing tapes onto tier-2 and tier-3 storage built on SAS or SATA disks running a SQL-compatible database for easy look-ups and analysis. This allows the big data archiving solution to perform analysis on the historic data and potentially integrate with current data to drive greater business value.
In this context, data reduction techniques (de-duplication and compression) and the use of inexpensive tier-2 and tier-3 storage instead of expensive tier-1 storage can boost hardware efficiency.
- Simplicity. Big data archiving reduces opex. For example, it eliminates database sharding and saves on running costs, because the manual effort involved in sharding creates permanent complexity and increases risk. A competent end-to-end big data archiving solution should be operational in no more than three months, assuming the business has outlined which data to archive, at what intervals, and how long to retain it. Simplicity also comes from a solution that adheres to standards such as SQL-compatible query tools and existing BI tools (e.g., Cognos and Business Objects) in addition to the open-source access tools that run on HDFS, such as Hive, Pig, and MapReduce. Unlike an operational transactional database, an archive requires less tuning and fewer parameters, should be easy to maintain over time, and arguably demands less than 25 percent of the resource commitment that RDBMS administration requires.
- Non-disruptive. There should be no rip and replace of existing hardware or software commitments. Most big data initiatives are not green field. Organizations must integrate with what they have today. A true big data archive must integrate with legacy architectures in addition to open source technologies to provide true intrinsic value to the enterprise.
- Facilitate regulatory compliance and auditing. Companies spend huge amounts on meeting compliance mandates because not doing so has serious negative consequences. Organizations with stringent data governance mandates, regardless of the vertical they are in, must find and make data available to facilitate audits. Data disposition and data retention must be granular, providing usable and extractable visibility and access down to the record level.
- Secure. A competent big data archiving solution must provide robust security features to protect against unauthorized access. Hallmarks of a secure big data archive include encryption at the data store layer in addition to encryption at the physical disk layer, security at the access control level, role-based access to the data, and data masking.
- Fault-tolerant. This includes built-in replication and failover. Ideally the secondary instance should be located in a different geographic location. The ability to copy files over the network is important. If the data is highly compressed, copying it is more efficient, because compression effectively acts as a bandwidth multiplier.
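The bandwidth-multiplier point in the fault-tolerance item can be illustrated with a little arithmetic. The sketch below is only a rough model: the function names, the 100 TB data set, the 1 Gbps link, and the 30:1 ratio are illustrative assumptions, and the calculation ignores protocol overhead, latency, and the time spent compressing and decompressing.

```python
def transfer_hours(raw_terabytes, link_gbps, compression_ratio=1.0):
    """Hours needed to copy a data set over a network link.

    raw_terabytes     -- uncompressed size of the data set (decimal TB)
    link_gbps         -- usable link speed in gigabits per second
    compression_ratio -- e.g. 30.0 means the data shrinks to 1/30 of its raw size
    """
    if link_gbps <= 0 or compression_ratio <= 0:
        raise ValueError("link speed and compression ratio must be positive")
    bytes_on_wire = raw_terabytes * 1e12 / compression_ratio
    seconds = bytes_on_wire * 8 / (link_gbps * 1e9)
    return seconds / 3600


def effective_gbps(link_gbps, compression_ratio):
    """Raw data moved per second, expressed as an equivalent link speed."""
    return link_gbps * compression_ratio


if __name__ == "__main__":
    # Hypothetical replication job: 100 TB of raw history over a 1 Gbps link.
    print(round(transfer_hours(100, 1), 1))        # uncompressed copy
    print(round(transfer_hours(100, 1, 30.0), 1))  # same data at 30:1
    print(effective_gbps(1, 30.0))                 # equivalent link speed
```

Under these assumptions the copy drops from roughly 222 hours to about 7.4 hours, which is the same effect as running the uncompressed data over a link 30 times faster.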
7 Solution spotlight: real-world benefits of big data analytics and archiving
A leading global bank and financial services organization was experiencing exponential data growth. It was storing the history of its regulatory-compliance Sybase and Oracle data on tape. The data archiving was outsourced to a third-party provider that stored the bank's backup tapes offsite. It took weeks or sometimes months to make the data available to organizational stakeholders.
The bank had to strictly abide by SEC Rule 17a-4 for data archiving and data retention. They were looking to:
- Move archival data back in-house to reduce reliance on the third-party vendor and have a more secure solution inside their firewall
- Move data online for faster access and queries, and avoid tape moving forward
- Achieve higher compression ratios of their data
- Implement a scalable solution to grow with their needs
- Deploy a SEC 17a-4 and WORM-compliant solution
Most of the bank’s data is structured so it can be easily queried. They were not looking for a vendor. Rather, they wanted a partner who understood their business and the regulatory mandates by which they had to abide, and who could provide enterprise-grade professional services with proven experience to configure their data and make it available within their internal IT environment.
The bank found everything they were looking for and then some in the RainStor/Isilon solution. Compression ratios of up to 30X resulted in superior storage and hardware efficiencies. They chose the RainStor/Isilon solution for its significant capex reduction. This active archiving solution eliminated the need for a third-party vendor for archiving. The faster queries empowered their users to do more analysis with the data and to use a broader range of history. The new solution also facilitated much faster audits.
In the future, the bank plans to move other regulatory compliance data from other business units into the RainStor/Isilon solution.
8 Future trends in big data analytics and archiving
The key trend on the horizon is broad adoption of open source software (OSS). CIOs are increasingly concerned with vendor lock-in and therefore demand that vendors either demonstrate an OSS vision and strategy in their product roadmaps or show the ability to integrate with open source platforms. Big data analytics and active archiving solutions will have to support open source databases and architectures in the near future.
Other than that, the big data analytics and active archiving future is all about getting the fundamentals right. Those include:
- Seamless scalability: The ability to run queries against larger data sets across a wider range of dates is the true essence of big data analytics and active archiving. The results achieved from these queries can turn information into an incremental revenue generator for the organization. Data scientists demand the ability to run queries across multiple clusters across disparate systems. Therefore, seamless scalability of big data analytics and active archiving is essential.
- Centralized dashboards: Enterprise-wide visibility is critical to business analytics and to facilitate regulatory compliance. A single “pane of glass” or view that enables IT operations and data scientists to analyze data across multiple clusters will be an essential demand from enterprises.
- Faster TTI: Timeliness of the data analysis is always an essential ingredient for an organization’s differentiation and competitive advantage. Big data analytics and active archiving solutions that deliver the fastest TTI will rise above others.
- Volume management: Data movement is operationally intensive and costly. IT experts across many organizations spend the vast majority of their time wrestling with volume management. With data growing exponentially each year, the problem is only exacerbated. Therefore, an active archive storage file system with a large storage container consisting of a single volume can drastically reduce the opex associated with volume management and any necessary data migrations.
9 Conclusion and key takeaways
For today’s enterprise, data’s timeliness, reliability, and accuracy can convert it from an artifact to an asset: currency that can drive business innovation and competitive advantage. A big data active archive gives a business the ability to continue extracting value from historic data sets and to integrate them with current or real-time data to enable incremental revenue opportunities.
As data volumes continue to grow, asserting control over the data once it’s in an organization’s systems could mean the difference between success and failure in making effective use of the information. All too often, detailed data is discarded and only aggregates are stored, which doesn’t help the business when it needs to replay exactly what happened. Organizations now have an ecosystem of coexisting database and analytics technologies and must focus on choosing the right technology for the purpose. Companies are eager to monetize their digital data assets and create value from their history. As an example, analyzing what happened six years ago during the financial crash and comparing that with what is happening now in the current economy could be extremely valuable. Critical factors when selecting a big data archive solution include:
- TCO. Fiscal prudence dictates looking at the TCO of the entire big data archive solution stack. Simplicity is also key to lowering TCO: a cost-effective system is easy to deploy and future-proof, i.e., scalable and elastic enough to meet ever-changing and evolving business demands.
- Time-to-information (TTI). This is the most critical criterion that businesses use to benchmark their big data analytics and archiving platforms. Taking weeks or months to run queries against older data sets (whether online or on offline tape) isn’t fast enough to detect a pattern that may highlight a current business issue.
- Compliance. Companies are demanding a big data archive solution that helps with adherence to regulatory compliance mandates and facilitates audits. This includes data disposition, data retention (as mandated), and granular-level visibility and access all the way down to the record level.
- Secure. A competent solution must provide robust security features to protect against unauthorized access. Features such as encrypted physical disks, security at the access control level, role-based access to the data, data encryption, data masking, etc., are all hallmarks of an enterprise-grade solution.
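The TCO factor above lends itself to a simple worked example. The sketch below adds up the cost components this report names (initial capital outlay, cost per terabyte, data center space, and staffing) over a solution's lifecycle. The class, the function, and every figure are hypothetical placeholders, and the model assumes a constant data volume rather than the annual growth a real estimate would include.

```python
from dataclasses import dataclass


@dataclass
class ArchiveCosts:
    capex: float                # initial hardware and software purchase
    storage_per_tb_year: float  # recurring cost per physical terabyte per year
    datacenter_per_year: float  # space, power, and cooling
    staff_per_year: float       # administration and operations


def lifecycle_tco(costs, logical_tb, years, compression_ratio=1.0):
    """Total cost of ownership over the whole lifecycle, not just the purchase."""
    if years < 0 or compression_ratio <= 0:
        raise ValueError("years must be >= 0 and compression ratio positive")
    physical_tb = logical_tb / compression_ratio  # data reduction shrinks the footprint
    recurring = (physical_tb * costs.storage_per_tb_year
                 + costs.datacenter_per_year
                 + costs.staff_per_year)
    return costs.capex + recurring * years


if __name__ == "__main__":
    # Placeholder figures for illustration only.
    costs = ArchiveCosts(capex=500_000, storage_per_tb_year=200,
                         datacenter_per_year=50_000, staff_per_year=100_000)
    print(lifecycle_tco(costs, logical_tb=1_000, years=5))
    print(lifecycle_tco(costs, logical_tb=1_000, years=5, compression_ratio=20))
```

With these placeholder inputs, the recurring costs over five years exceed the initial purchase, which is why the capital outlay alone is a poor basis for comparing solutions.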
10 About Ashar Baig
Ashar Baig (pronounced Usher Bég) is the president, principal analyst, and consultant at Analyst Connection, an analyst firm focused on cloud computing, storage and server virtualization, storage (hardware, software, file systems, I/O, etc.), data protection (data backup, disaster recovery, and business continuity), big data, DaaS/HVDs/VDI, security, regulatory compliance, archiving, public and private cloud storage, high-performance computing (HPC), grid computing, and managed service providers (MSPs).
Baig is an analyst in the Gigaom analyst network. Prior to his work at Analyst Connection, Baig was the senior analyst and consultant at Taneja Group focused on data protection, cloud storage, and public/private/hybrid cloud space. He also led Taneja Group’s MSPs consulting for vendors.
Baig is the founder and manager of the LinkedIn Cloud Backup group. He writes for Talkin Cloud, Tech Target, Gigaom, etc., and actively blogs on the LinkedIn Cloud Backup group page.
11 About Gigaom Research
Gigaom Research gives you insider access to expert industry insights on emerging markets. Focused on delivering highly relevant and timely research to the people who need it most, our analysis, reports, and original research come from the most respected voices in the industry. Whether you’re beginning to learn about a new market or are an industry insider, Gigaom Research addresses the need for relevant, illuminating insights into the industry’s most dynamic markets.
Visit us at: research.gigaom.com.
© Knowingly, Inc. 2014. "Rethinking the enterprise data archive for big data analytics and regulatory compliance" is a trademark of Knowingly, Inc. For permission to reproduce this report, please contact email@example.com.