The Financial Industry Regulatory Authority (FINRA), the largest independent regulator for all securities firms doing business in the US, is moving its technology platform to the Amazon Web Services (AWS) cloud and open source platforms.
We conceived the move three years ago during a review of our systems that resulted in the decision to fundamentally rebuild our market regulation platform on the cloud, and to do so using open source platforms. The program has been underway for close to two years, and 70% of systems are currently operating in the cloud.
In describing our experiences, I will begin by outlining our objectives for moving to the cloud and how these resulted in choosing a virtual private cloud using a large-scale cloud provider (AWS) rather than building our own private cloud. I will also discuss how we addressed several concerns that companies considering a migration to the cloud often face, including security, the balance of business and architectural concerns, DevOps requirements, disaster recovery, and implications for our culture.
WHAT DOES FINRA DO?
FINRA is dedicated to investor protection and market integrity through effective and efficient regulation and complementary compliance and technology-based services. FINRA touches virtually every aspect of the securities business, from registering and educating all industry participants to examining securities firms, writing rules, enforcing those rules and federal securities laws, and informing and educating the investing public. In addition, FINRA provides surveillance and other regulatory services for equities and options markets, as well as trade reporting and other industry utilities. FINRA also administers the largest dispute resolution forum for investors and firms.
Most relevant to our cloud initiative, FINRA is responsible for regulating 99% of equities and 70% of options trading in US securities markets. The market regulation function within FINRA receives market-data feeds that can exceed 75 billion records per day and processes this data, creating multi-petabyte data sets and searching for wrongdoing by market participants.
For example, in the case of equities, the data is received from the various US stock exchanges, broker-dealers, alternative trading systems known as "dark pools," and industry organizations. It is then normalized and integrated to create a multi-node graph for each order on the US markets. These graphs can vary in size from several nodes to millions of nodes, as buy and sell orders are routed around the country in search of the best transaction price, also called the execution price.
After creating a complex picture of the state of the markets at every moment in time, surveillance algorithms scan the data for fraud and market manipulation. Alerts are generated, and analysts examine behavior patterns in the marketplace by querying the multi-petabyte data sets to home in on suspicious behavior in the markets.
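As a rough illustration of the order-graph assembly described above (not FINRA's actual implementation -- the record shape and field names here are hypothetical), normalized events can be linked into a per-order graph by attaching each route or execution to its parent node:

```python
from collections import defaultdict

def build_order_graph(events):
    """Link normalized order events (hypothetical record shape) into a
    parent -> children adjacency map representing one order's lifecycle."""
    graph = defaultdict(list)
    for ev in events:
        if ev["parent"] is not None:   # routes/executions hang off a parent node
            graph[ev["parent"]].append(ev["id"])
    return dict(graph)

# A toy order routed from a broker to two venues in search of the best price.
events = [
    {"id": "ORD1", "parent": None,   "type": "new_order"},
    {"id": "RT1",  "parent": "ORD1", "type": "route", "venue": "NYSE"},
    {"id": "RT2",  "parent": "ORD1", "type": "route", "venue": "NASDAQ"},
    {"id": "EX1",  "parent": "RT2",  "type": "execution"},
]
graph = build_order_graph(events)
# graph == {"ORD1": ["RT1", "RT2"], "RT2": ["EX1"]}
```

At market scale, the same linking step runs over billions of events per day, and a single graph may span millions of nodes as orders fan out across venues.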
The people and technologies needed to accomplish this represent the majority of the organization's IT footprint. These very high volumes of data come with challenges. Market volumes are steadily increasing and can be volatile. For example, it is not unusual to experience peak market volumes three times larger than the average. Exchanges are dynamic and evolving, regulations are continually being enhanced, and new rules are being created. Simultaneously, new products are being introduced that create new potential targets for wrongdoers. And during all of this, market manipulators themselves are continuously innovating.
BEFORE THE CLOUD
Until recently, FINRA's data center environment was similar to that of many other companies. To meet our big data processing needs, we made extensive use of EMC Greenplum and IBM Netezza data-processing appliances, along with various NAS and SAN storage systems for holding final data and as jump points for data movement. These were combined with proprietary large-scale ETL tools, significant adoption of Linux and Oracle, and some .NET and SQL Server environments. Most operations were housed in a primary data center, and a backup data center was maintained at a sufficiently distant location.
OBJECTIVES FOR MOVING TO THE CLOUD AND OPEN SOURCE
Two principles guided our effort to move the market regulation systems to a new platform. The first was a decision to move to the AWS virtual private cloud platform, and the second was to use open source technologies to totally update our systems. The migration plan itself was designed to accomplish three broad objectives:
Decrease our infrastructure spending in order to redirect expenditure to data analytics
Improve productivity, reliability, and efficiency by increasing automation of production support tasks
Increase the business value through improved accessibility to data and data analytics with burst access to unbounded commodity storage and computing power
Our choice of open source platforms was driven by a desire to harness large clusters of commodity hardware on the cloud rather than maintain exotic data-processing appliances, and to increase execution flexibility in the face of a rapidly evolving and fragmented big data tools market.
We believed -- and subsequent experiences have confirmed -- that by going the open source route and using platforms such as Hadoop, HBase, and Hive, we would avoid being overcommitted to a single vendor, benefit from the large community that is contributing to the advancement of these tools, and, perhaps most importantly, develop inhouse expertise in the cutting-edge tools best suited for our data needs. That last objective has allowed us to evaluate and contribute to other emerging technologies, adopting new tools with relatively little disruption.
VIRTUAL PRIVATE CLOUD VS. PRIVATE CLOUD
Our decision to move to a virtual private cloud using AWS instead of our own private cloud was made early in the process and was a natural outcome of our objectives, none of which would have been possible with a private cloud solution. This solution allowed us to decrease our infrastructure spending (Objective #1) by:
Provisioning for an average load and dynamically expanding our computing resources to handle peak loads, rather than maintaining a fixed infrastructure cost base dictated by peak loads
Purchasing resources at the time of need instead of incurring capital outlay six to nine months in advance
Taking advantage of Moore's Law cost efficiencies as new hardware emerged rather than waiting for the typical three-year depreciation cycle on purchased hardware to expire before exploiting new and more cost-effective generations of hardware
Objective #2, the automation of production support, requires a highly scripted, API-driven platform layer over the hardware infrastructure. We explored the option of developing this layer internally and decided against it for several key reasons. First, there would be a high level of investment required in the middleware, which would divert funds away from our development of technology to support core business objectives. Second, we would not be able to bring to bear the same level of resources as a company with a broad customer base. Third, the gap between our custom middleware and cloud providers' PaaS offerings would surely increase over time, making a homemade middleware solution progressively less viable. There are a host of third-party middleware solutions oriented toward bringing automated cloud platform–like functionality to a private data center. These options were rejected because fundamentally they required a private data center and would offset the commodity infrastructure savings goals of Objective #1.
Objective #3, which was to provide business analytics through innovative uses of commodity hardware resources, was naturally suited to a virtual private cloud solution hosted by a large-scale infrastructure provider like AWS. The economies of scale provided by this solution, coupled with the ease and cost effectiveness of rapid and temporary provisioning, eliminated the private cloud option.
By the completion of the migration program, all significant market regulation systems will have been migrated to AWS. A hybrid environment would entail increased complexity and cost without any tangible benefits. Of course, during the migration program, by definition we have been operating in a hybrid environment with some applications having transitioned while others are still in the process of changing.
ADDRESSING THE SECURITY CONCERNS
We performed an exhaustive analysis of cloud security as part of our planning. Rather than evaluate AWS cloud security against a theoretical ideal case, we took the practical approach of comparing AWS security against what FINRA can actually achieve in our private data centers. The analysis concluded that cloud security exceeds our private data center capabilities.
Any Internet-connected data center, whether privately built, colocated, or cloud built, requires best-practices security safeguards such as intrusion detection and malware scanning. These are our responsibilities regardless of whether we are in a traditional data center or in the cloud. With this understanding, we turned our attention to the commonly raised security concerns surrounding cloud-based infrastructures. These are rooted in two issues: multi-tenancy and the insider threat at the cloud provider.
At the core of the multi-tenancy risk is the concern that hardware resources are virtualized and one is unaware of other parties running on the same virtualized hardware. The issue is whether a party could bypass the various security safeguards of the virtualization software and gain access to your data. This concern is mitigated by two factors. First, the sheer scale of a large cloud provider that dynamically allocates workload across hundreds of thousands of machines provides a high degree of anonymity. If you don't know who your neighbors are, they don't know who you are either, and it is extremely improbable that they can find you. Second, we chose to encrypt all data in the cloud, whether at rest or in motion. This combination of factors effectively mitigates any practical multi-tenancy risk.
The risk of an insider threat at the cloud provider is analogous to the same threat in a private data center. We found this risk to be significantly lower with a cloud provider than in a private data center due to the former's scale of operations. To begin with, the data is striped over tens of thousands of disks in tens of data centers, so it is simply not possible for a cloud provider employee to remove a hard drive belonging to a particular company. Furthermore, higher-level access to data by insiders in an infrastructure team is much more complex due to the separation of duties that can be achieved when operating at the scale of a cloud provider. Thus, the barrier to coordinated collusion is much greater than in an enterprise-level data center. When combined with the data encryption mentioned earlier, we concluded that our risk of an insider threat is lower with a cloud provider than in a private data center.
Other mechanisms offered by the AWS cloud in particular provide us with greater security than we could achieve in a private data center. For example, software-defined networks let us effectively use and manage micro-segmentation, with firewall groups that allow an application server to access only one database server.
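Micro-segmentation of this kind can be expressed as an ingress rule that names a security group rather than an address range. A minimal sketch, assuming boto3 and placeholder group IDs (the rule-building function is our own illustration, not an AWS API):

```python
def db_ingress_rule(app_sg_id, port=5432):
    """Build an ingress rule (boto3 IpPermissions shape) that admits traffic
    on the database port only from the application tier's security group --
    no CIDR ranges, so nothing outside that group can reach the database."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": app_sg_id}],
    }

rule = db_ingress_rule("sg-0a1b2c3d")
# Applying it requires AWS credentials; the group IDs here are placeholders:
# import boto3
# boto3.client("ec2").authorize_security_group_ingress(
#     GroupId="sg-db-placeholder", IpPermissions=[rule])
```

Because the rule references a group instead of an IP range, application servers can come and go dynamically without any firewall change.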
The general theme in these findings is that the scale of operations in AWS allows for approaches to security that would be impractical in an enterprise private data center. These approaches range from increased separation of duties to a level of investment required in security R&D and infrastructure that is not practical unless amortized over a large number of enterprises.
INCREASES IN BUSINESS VALUE
Increasing business value as a result of reimplementing our systems on the cloud was a goal from the outset. We believed this was necessary in order to provide a concrete basis for architectural decisions.
One of the initial business-oriented goals was to provide rapid end-user analytics by utilizing open source big data platforms, particularly Hadoop and HBase. As an example, a commonly used system at FINRA responds to user queries and assembles complex graphs of securities trades across multiple execution venues by querying petabyte-scale data. The incumbent system in the private data center provided response times of between 20 minutes and 4 hours for commonly executed queries, with times varying according to the complexity of the trading graph and query parameters. This system was reimplemented early in the program using HBase, harnessing massive compute clusters, and the new system reduced response times for similar queries to between sub-second and 90 seconds.
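One reason an HBase reimplementation can collapse a multi-hour relational query into seconds is row-key design: if every event of a trade graph shares a row-key prefix, retrieving the graph becomes a bounded prefix scan over sorted keys rather than a large join. A minimal in-memory sketch of the access pattern (the key scheme is illustrative, not FINRA's actual schema):

```python
import bisect

def make_row_key(date, symbol, order_id, seq):
    # Illustrative composite key: all events of one order cluster together.
    return f"{date}#{symbol}#{order_id}#{seq:06d}"

def prefix_scan(sorted_keys, prefix):
    """Return all keys sharing a prefix via binary search -- the same access
    pattern an HBase prefix scan uses over its sorted row keys."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\xff")
    return sorted_keys[lo:hi]

keys = sorted(
    [make_row_key("20160104", "IBM", "ORD1", i) for i in range(3)]
    + [make_row_key("20160104", "MSFT", "ORD9", i) for i in range(2)]
)
hits = prefix_scan(keys, "20160104#IBM#ORD1")
# hits holds exactly the three IBM/ORD1 event keys
```

In HBase the equivalent scan touches only the region servers holding that key range, which is why response times stay bounded even over petabyte-scale tables.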
With successes such as this, our goals for business benefits have broadened to make the program into a joint technology and business effort.
THE SELECTION OF OPEN SOURCE TECHNOLOGIES
FINRA's decision to move from its proprietary platforms to an open source one and from an on-premises environment to the cloud was driven by the following factors:
Simply lifting legacy systems, with their legacy cost basis, onto the cloud makes little sense.
Functionality for application servers, relational databases, and ETL has become commoditized and is ripe for use of open source.
In a market for big data platforms that is highly fragmented and rapidly evolving with new technologies and no clear winners, open source provides the greatest agility and ability to both move to new technologies and take advantage of platform innovations as they emerge.
Our private data center environment had the typical enterprise mix of Oracle and SQL Server databases. We made the architectural decision to use the Postgres and MySQL open source databases in conjunction with Amazon's Relational Database Service (RDS) and immediately benefited from the scalability and multiple Availability Zone resiliency provided, along with the elimination of our system database administrator burden. Within RDS, we chose Postgres for large-scale systems and MySQL as an option for storing small application states. We allowed for deviations from these choices with justification and permission at the senior VP level, but interestingly, to date no teams have made a request for deviation.
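The Multi-AZ resiliency mentioned above is a provisioning-time choice. As a sketch (the identifier, sizing, and storage values are illustrative, not FINRA's actual configuration), the parameter set for boto3's `rds.create_db_instance` might look like:

```python
def rds_postgres_spec(identifier, instance_class="db.m4.large"):
    """Illustrative parameter set for boto3 rds.create_db_instance.
    MultiAZ provides a synchronous standby in a second Availability Zone."""
    return {
        "DBInstanceIdentifier": identifier,
        "Engine": "postgres",
        "DBInstanceClass": instance_class,
        "AllocatedStorage": 100,     # GiB; illustrative sizing
        "MultiAZ": True,             # cross-AZ standby with automatic failover
        "StorageEncrypted": True,    # encrypt data at rest
    }

spec = rds_postgres_spec("market-reg-db")
# With AWS credentials in place (master credentials omitted here):
# import boto3
# boto3.client("rds").create_db_instance(
#     MasterUsername="...", MasterUserPassword="...", **spec)
```

Failover, patching, and backups then become RDS's responsibility, which is where the elimination of the database administrator burden comes from.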
The choice of open source in the big data arena was also accompanied by a choice to build inhouse core engineering competence in big data platforms. This skills development was coupled with strategic partnerships with key big data platform support vendors, including Cloudera, Pentaho, and AWS.
In all of these cases, we found that the open source decision was met with enthusiasm by inhouse development staff and was generally viewed as a way of enhancing and updating their technical skills.
THE ROLE OF DEVOPS
Prior to embarking on this program, FINRA had a fairly mature DevOps capability through the automation of builds and software deployment, along with an exhaustive regression test suite for key systems.
These capabilities have been a necessary cornerstone of our cloud program. Our approach to operating system patching illustrates the importance of DevOps in this context. OS patching occurs as part of the build cycle instead of through the traditional private data center approach of applying patches to groups of machines. In the new model, the following steps are taken during the build and deploy pipeline: application code is built, an OS image is built with the latest patches and security updates, the two are combined into a single package, and the regression tests run on this image. Upon successful completion of the regression suite, the package is deployed to machines that are themselves deployed dynamically as part of clusters, autoscale environments, and static configurations.
In this setting, DevOps automation is used to eliminate the costs and unreliability of manual and semi-manual deployment processes. Perhaps more importantly, critical operating security updates can be applied on a continuous basis as part of an automated pipeline.
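The bake-and-test sequence above can be sketched as a simple pipeline runner that stops at the first failed step, so an image that fails regression never reaches deployment. The step names are placeholders standing in for whatever build tooling is actually in use:

```python
def run_pipeline(steps):
    """Run named build steps in order; stop at the first failure so a bad
    image is never deployed."""
    for name, step in steps:
        if not step():
            return f"failed at: {name}"
    return "deployed"

# Placeholder steps standing in for real build tooling.
steps = [
    ("build application code",        lambda: True),
    ("bake OS image with patches",    lambda: True),
    ("combine into single package",   lambda: True),
    ("run regression suite on image", lambda: True),
    ("deploy package to clusters",    lambda: True),
]
result = run_pipeline(steps)  # -> "deployed"
```

The key property is that the patched OS image travels through the same regression gate as the application code, which is what makes continuous security patching safe.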
Extensive automation and scripting of our AWS application stack have further paved the way for eliminating much of the manual compliance checking that occurs in a traditional data center. Capacity reports, configuration checks, policy enforcement, monitoring for exceptional access, and other commonly performed administrative tasks are automated.
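A configuration check of this sort -- flagging any security group that admits traffic from anywhere on the Internet -- can be written as a pure function over the record shape returned by boto3's `describe_security_groups`. The group IDs below are made up:

```python
def open_to_world(security_groups):
    """Flag groups whose ingress rules admit 0.0.0.0/0; the record shape
    matches boto3's describe_security_groups response."""
    flagged = []
    for sg in security_groups:
        for perm in sg.get("IpPermissions", []):
            if any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", [])):
                flagged.append(sg["GroupId"])
                break
    return flagged

groups = [
    {"GroupId": "sg-app", "IpPermissions": [
        {"IpRanges": [{"CidrIp": "10.0.0.0/16"}]}]},
    {"GroupId": "sg-bad", "IpPermissions": [
        {"IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
]
flagged = open_to_world(groups)  # -> ["sg-bad"]
```

Run on a schedule against the live API, a check like this replaces a manual compliance review with a continuously enforced policy.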
DISASTER RECOVERY
The traditional model of maintaining separate data centers for production and disaster recovery is superseded by the multiple Availability Zone facilities of a cloud provider like AWS. The multitude of data centers available in a local geographic area, together with redundant power grids, different flood plains, and redundant emergency fuel supplies, introduces a new and more reliable model for disaster recovery than the more limited minimum-distance, two–data center approach.
As part of rearchitecting our applications, we specified that systems would be brought up in arbitrary, rotating Availability Zones and data centers during normal operation, thus ensuring fault tolerance for disaster recovery purposes. We also chose to utilize the US West region of AWS as a backup in the event of nation-crippling disasters, however remote the possibility.
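Rotating placements across Availability Zones during normal operation can be as simple as cycling through the zone list at deployment time; the zone names below are examples:

```python
import itertools

def az_rotator(zones):
    """Cycle deployments through Availability Zones so every zone is
    exercised during normal operation, not just at failover time."""
    cycle = itertools.cycle(zones)
    return lambda: next(cycle)

next_az = az_rotator(["us-east-1a", "us-east-1b", "us-east-1c"])
placements = [next_az() for _ in range(4)]
# -> ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1a"]
```

Because each zone regularly hosts live workloads, a zone-level failover never lands on an untested configuration.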
CULTURE CHANGES
As part of our shift to cloud and open source platforms, we chose to introduce a number of culture changes. Early in the process, we decided to make the cloud migration a rallying cry for the technology organization. Specifically, we challenged senior technology staff regarding the fundamentals of what our systems did and how well they served the business. This resulted in key changes in the way we addressed the fundamentals of our multi-petabyte, big data problem. In this process, new high-potential technology leaders were identified and elevated in the organization. The hiring and staffing effort that accompanied this effort also provided an opportunity to further reshape the technology profile of the company.
Another culture shift relates to infrastructure support and operational staff. The need for traditional operations and infrastructure staff has clearly diminished with the use of infrastructure and platform as a service. We have capitalized on the availability of API-controllable infrastructure and platforms by further automating production support, operational, monitoring, and reporting tasks, with the goal of eliminating all manual work in this area and repurposing operations staff into script-writing DevOps roles.
Within software development teams, there has been a drive toward further emphasis on regression test suites. This has led to additional blurring of the tester and developer roles, with test suites being written by team members who have the same skills as those writing software features. I expect that the tester role will continue to be blurred with the developer role and lose its distinction in the near future.
Perhaps the most important culture shift has been in the profile of developers we seek to attract. With the combination of cloud, open source, and big data platforms, we require and hire the same profile of developer that product companies are seeking. Similarly, there is greater focus on software development acumen in managers along with the traditional managerial skills and business domain knowledge demanded in most IT environments.
Within this context, activities such as hack-a-thons -- usually associated with technology product companies -- have taken a prominent role in staff retention and career development. Projecting forward, we foresee placing more emphasis on hiring college graduates to complement our more experienced middle and senior managers.
CONCLUSION
As of this writing, we have completed 22 months of a 30-month program to rearchitect FINRA's market regulation portfolio in the AWS cloud. Approximately 70% of the systems are in production on the cloud, and the program has been a success by any measure. This success has now turned our attention to migrating the remainder of our portfolio to the cloud and utilizing the processing power, storage, and flexibility of the cloud to further our analytic capabilities in areas of pattern recognition, machine learning, and other data analysis capabilities.
For other enterprises contemplating such an effort, I would summarize three key considerations from our journey. First, it is important to gain hands-on knowledge about the cloud early on by assigning a group of motivated and highly competent programmers to develop prototypes and proofs of concept. This allows myths to be debunked and subsequent analysis to be grounded in practical experience and based on fact. Second, to meet our objectives, we found it necessary to rearchitect our systems and fully utilize cloud and open source functionality. Third, and perhaps most importantly, we chose to focus on consistent delivery of significantly increased business value. This focus gave teams a concrete basis on which to make architectural tradeoffs and framed the project as an enterprise project with business support rather than as an infrastructure upgrade.
While these are clearly not essential elements for all companies contemplating cloud migration, failure to incorporate these factors could well result in a lost opportunity to move an organization to the next level of capability.