
Welcome to the Curalate Engineering Blog!


At Curalate, we’re helping the world’s greatest brands unlock the power of pictures. We’re doing so by building novel, interactive customer experiences and extracting insights and intelligence from billions of consumer engagements every month. If you’re a digital marketer, we’d love to show you a demo of our tools at work.

Of course, rewiring the brand-to-consumer relationship brings with it a significant set of engineering challenges. To gather the data our clients need, we process over 225 million images per day and capture thousands of behavioral signals every second. To drive commerce, we build polished experiences that are seen by tens of millions of people daily and that are served in a matter of milliseconds.

Like any scrappy startup, we’re constantly learning as we go and we hope this blog can serve as a vehicle to share some of the lessons we’ve picked up so far. An intro to our team and our stack is below. Thanks for reading and we hope to see you back here regularly.

Our Team

We currently have 23 developers spread across our Seattle, Philadelphia, and New York offices (expect a blog post on working with remote teams in the future). We expect this number to grow to 50+ in 2016 and are excited to welcome more kick-ass engineers to the team. We run the gamut in experience and education and are working hard to build a diverse team. Describing a team is inherently challenging, but I’m going to try by giving you an unordered and incomplete list of some things that we believe:

  • We believe in just enough process to keep the train on the rails while moving fast. Marketing tech is a fast and innovative space, and we want to continue to win.
  • We believe in bottom-up innovation and decision-making. Engineers often know the product better than anyone. We encourage all developers to have a hand in what gets built, and not just how it gets built.
  • We believe in transparency, accountability, and respect. This might be better stated in the negative: No politics; No hoarding of information; No finger-pointing; No hiding of mistakes; No beratement for mistakes; No destructive criticism. Many of us have worked at big companies where bad behavior flourished. We do not let that happen here.
  • We believe in throwing people in the deep end. As an example, one of our recent major product releases was written primarily by an engineer less than a year out of undergrad. We feel that challenging people brings out the best in them and keeps them engaged.
  • We believe in constant improvement. The way we build things has changed dramatically in the last 3 years and we want this evolution to continue. We strive to find ways to deliver higher and higher quality to our customers while still moving fast.

Our Stack

We generally value how something is built and the end result over the tools used to build it, but technology choices are obviously an important factor. We’ll be covering our experiences with many of these technologies in upcoming blog posts. Have a question about something? Just ask.

Languages: Scala, JavaScript, some Python, a tiny bit of C++

Cloud: AWS

Datastores: MySQL, Aurora, DynamoDB, Cassandra, Redis, S3, Memcached

Data Processing: Storm, Redshift, EMR+Spark

Image Processing: OpenCV, Caffe

Queues: Kafka, SQS, Kinesis

Search: CloudSearch

Server Frameworks: Scalatra, Finagle, Lift

Client Frameworks: AngularJS, jQuery, Bootstrap, Less, Backbone

Deployment: Chef, Asgard, Packer, Jenkins, Elastic Beanstalk

Observability and Alerting: Graphite+StatsD, CloudWatch, Loggly, Pingdom, PagerDuty

Test Frameworks: Specs2, Mockito, Jasmine

Load Balancer: ELB

CDN: CloudFront


Angular Performance Considerations or: How We Ended Up Detaching $$watchers


When I came to Curalate, our dashboard was a bunch of page-specific jQuery mixed with some helpful libraries. Naturally, this setup became unwieldy as we added more developers to the team and more features to our product. As we planned more cohesive and complex features, it became clear this model was unmaintainable, so we decided to adopt Angular. The change was welcomed throughout the team and has changed our development process for the better, but migrating to Angular did not come without issues. In this post, I’ll describe the performance bottlenecks we encountered when building one particularly complex feature with Angular and the methods we employed to solve them.

The problem

After converting our first page to Angular and building our knowledge and confidence a bit, we began working on a new feature. The new page included a large grid of images, each of which had associated actions. We noticed sluggish performance when interacting with images in the grid and some galleries even caused the browser to hang completely.

The source of the slowness was slightly ironic: the mechanisms that make Angular such a powerful framework, namely two-way binding and dirty-checking, were also the root of the issue. Our new feature required us to support many hundreds of images in a gallery, each outfitted with buttons which, when clicked, affected the state of that model as well as the collection as a whole.

  • We had an infinite-scrolling image gallery with potentially thousands of images.
  • Users could alter the state of the gallery by interacting with buttons on each image. They could also filter the gallery.
  • Watcher count increased as more images were loaded into the gallery. If enough images were loaded, the page would crash.

The low-hanging fruit

The first flaw we found is one that is well-documented on other blogs: we weren’t using track by along with ng-repeat to avoid churning of DOM elements. Under the covers, Angular assigns a value to each element in a collection called $$hashKey[1]. Within ng-repeat, Angular uses this key to determine whether it needs to update the DOM. This can cause element churn when altering the collection. We used track by to mitigate this issue by using an ID from the model. Some other blogs cover this in more detail [2].

After adding track by to the ng-repeat we saw a minor performance increase, but nowhere close to the easy win we were hoping for. As newcomers to Angular, we were just beginning to understand what was going on under the hood. After profiling the performance of the page using Chrome’s Timeline tool, we observed that an inordinate amount of page load time was spent executing JavaScript. The alarming aspect was the sheer number of functions executing, not the execution duration of each one. After a bit more digging, we attributed the function calls to watchers called from Angular’s $digest loop and began profiling the number of watchers on the page.

One way we reduced the number of watchers was with Angular’s one-time binding expression syntax [3]. One-time binding prevents recalculation of bound values after the given value has stabilized (and is not equal to undefined). We converted a handful of bind expressions to use this since many expressions will not change over the lifetime of the page, such as each image’s source url. However, converting all of the expressions we could to one-time binding didn’t quell all of our performance issues. This is where the real fun began.

Where we’re going, we don’t need roads

We wanted the page to be fast. This meant having results from user interaction take no more than 100ms to appear [4]. Many watchers couldn’t be moved over to use one-time binding syntax because their values actually needed to be updated after the user takes action on an image.

It was clear that the basic solutions, while helpful, were not enough to support this page. The key observation here is that although we theoretically load an infinite number of images into the gallery, only a constant number are in the viewport at a given time. Thus, we should be able to limit the number of watchers on the page to be constant as well, since updates to models outside of the viewport have no immediate effect on the user.

Our idea was to create a directive which:

  1. Operated mainly outside of the Angular-context (because we couldn’t afford adding more watchers)
  2. Detected when scopes entered and exited the viewport
  3. Added and removed watchers from scopes based on this information, effectively “suspending” a scope

To support requirements 1 and 2, we used plain old jQuery. Using functions like .offset(), .height(), and .scrollTop(), we were able to discern whether an element’s top-left was within the viewport. Supporting requirement 3 was a bit trickier and required digging through some Angular internals. Angular watchers are stored on a given scope in a property called $$watchers. The directive simply keeps a map of generated IDs to $$watchers. When suspending a scope, the $$watchers array reference is moved from the scope to the directive’s internal map and $$watchers is set to an empty array. The opposite occurs when a scope’s element comes back into view.

The logic described above is thrown into a loop which traverses the scopes within a specified element. This loop is executed on a periodic timer and any time the window is scrolled. As the user scrolls the page, the stale scopes will be updated.

Integrating this directive on the new page changed everything. We were able to keep the watcher count constant on the page, allowing us to load in an incredible amount of content without overloading the browser. The page flew, going from a few thousand watchers maximum to a few hundred at any given time.

An example

An explanation is only worth so much. To better demonstrate the power of suspendable, we created an Angular performance playground. The playground has a contrived situation which hits on the watcher count pain point using a simple UI. The example loads with suspendable disabled. The header reports the number of watchers on the page currently as well as the duration of the last $digest loop. Try adding a few thousand rows and altering the contents of the inputs.

Once you’ve grown tired of the awful performance, try out the version which uses suspendable. Add a few thousand rows and scroll. With debug enabled, a border is added to illustrate the state of a scope. A green border means the scope is “active” and its watchers are attached, while a red border means the scope is “suspended”. Scroll and watch as suspendable does its job: resuming scopes and keeping the watcher count constant as you scroll.

The improvement is even more apparent when looking in the Timeline debug panel. The screenshots below show the effects of typing the word “test” in a text input with a row count of 2000.

without suspendable

with suspendable

The gain here is clear. suspendable is limiting the amount of work being done to update state at a given time and deferring updates to when relevant elements are actually in the viewport.

suspendable is currently tuned for our specific use case, and some trade-offs had to be made. For smoother scrolling, we set the heartbeat time to 2.5 seconds. While this worked for our situation, other interfaces could experience a moment after scrolling where elements present a stale version of the model in between heartbeats. For us, 2.5 seconds was the right balance between smooth performance and keeping the DOM representation of the model fresh. Additionally, the “in view” algorithm for suspendable is simplistic. It doesn’t take z-ordering or overflow into account, so watchers may still exist on elements that are not actually visible.

Conclusion

Angular provides the developer with an incredible set of tools to make applications that are powerful and dynamic. Unchecked, it’s quite easy to shoot yourself in the foot. Understanding the inner workings of Angular is essential before using it for demanding, complex interfaces.

References

[1] https://code.angularjs.org/1.4.7/docs/api/ng/directive/ngRepeat#tracking-and-duplicates
[2] http://www.codelord.net/2014/04/15/improving-ng-repeat-performance-with-track-by/
[3] https://code.angularjs.org/1.4.7/docs/guide/expression#one-time-binding
[4] Response Times: The 3 Important Limits, Nielsen Norman Group, https://www.nngroup.com/articles/response-times-3-important-limits/

Brewing EmojiNet


Emojis, the tiny pictographs in our phone’s keyboard, have become an important form of communication. Even Oxford Dictionaries named 😂 the word of the year for 2015. Though maybe not worth a thousand words, each emoji evokes a much richer response for the reader than boring old text. This makes sense: people communicate visually.

But what do emojis mean? Inspired by Instagram’s textual analysis of emoji hashtags, we set out to help answer this question in a more visual way. Our core question is simple: if people use emojis to describe images, is there a connection between the emoji and the visual content of the image? For some emojis like 🍕 or 🐕, we expect that the emoji describes the content of the image. But what about a less obvious emoji like 💯 or ✊ ?

We investigate these questions by turning to the latest craze in computer vision: deep learning. Deep learning lets you train an entire model with low-, mid-, and high-level representations all at once. This comes at the expense of requiring a large amount of training data. For this investigation, we wish to train a model that suggests emojis for a given image. The deep learning technique typically used for computer vision is the convolutional neural network, and thus we call our model EmojiNet.

Our resulting model powers the Curalate Emojini, a fun app that suggests emojis for your images:

In this post I’ll discuss a few of EmojiNet’s technical details and present some of the more interesting visual meanings people associate with emojis. As with most of our services, we require a high degree of buzzword compliance: the Emojini uses an asynchronous web service deployed in the cloud that executes our deep learning model on GPUs.

Emojis Assemble

Our first task was to gather a large corpus of training images associated with emojis. Fortunately, Instagram provides a very extensive API and has a large number of users hashtagging with emojis.

The most common emoji set contains 845 characters in Unicode 6.1. More recent sets contain many more emojis, but they may not be available on all platforms. iOS 9.1, for example, contains 1,620 emojis. For this investigation, we chose to focus on the most common emojis on Instagram. Thus, we hit Instagram’s tags endpoint with the 845 base emojis. This returns the number of Instagram media with a hashtag of the requested emoji. When we gathered our data in the fall of 2015, the ❤ emoji was the most popular, with over a million posts containing it. The distribution was rather skewed, with many emojis such as 📭 and 🈺 having fewer than 100 posts, and others none at all.

To ensure we had enough training data, we chose to only train our system on the 500 most popular emojis used on Instagram. As a result, EmojiNet doesn’t know about many newer emojis such as crab, hugging-face, or skin-tone modifiers.

Next, we used the tag’s recent media endpoint to download images that contained specific emoji hashtags. This endpoint returns public Instagram posts that contain the given hashtag in the post’s caption or comments section. To ensure quality, we only kept posts that had five or fewer hashtags and contained the emoji we searched for in the caption.

Finally, we only mined a few thousand images from each tag. This ensured that the system wasn’t over-trained on high-frequency emojis and reduced our computational costs.

The resulting data set has 1,075,376 images, each associated with one or more of the 500 top emojis.

Learn Baby, Learn

🆒. We have our data. Let’s train EmojiNet. Like any good computer vision scientist, we used the Caffe framework to train a deep convolutional neural network to suggest emojis.

We framed the problem as a simple classification: the net should return a 500-dimensional vector containing the likelihood of the input image being associated with each emoji.
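To make that concrete, reading suggestions off that output is just a matter of sorting the 500 scores. Here is a minimal Scala sketch; the names are illustrative and not part of Caffe or our production code:

def topEmojis(scores: Array[Float], emojiVocab: IndexedSeq[String], k: Int = 5): Seq[String] =
  scores.zipWithIndex
    .sortBy { case (score, _) => -score }         // highest likelihood first
    .take(k)                                      // keep the k most likely classes
    .map { case (_, index) => emojiVocab(index) } // map class index back to its emoji
    .toSeq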

We trained the net on an Amazon g2.2xlarge instance running Ubuntu 14.04. These boxes have an Nvidia K520 which, while not the fastest card on the market, has reasonable performance. We installed Cuda 6.5 and cuDNN v2, and compiled Caffe to link against them (note: Caffe now requires Cuda 7 and cuDNN version 3).

As with most GPU-bound tasks, true speed-up is dependent on how fast you can get data to the GPU. Caffe supports reading images from disk or an lmdb database. The database images can be compressed or uncompressed, while on-disk images can be pre-scaled to the input layer size or not.

We did a bit of off-the-cuff benchmarking to see which pre-processing/data store combination was best for Caffe on EC2. To measure this, we used the Caffe benchmarking tool, which measures the average forward/backward pass over the neural net in milliseconds. Below are the timings for a batch size of 256 images:

Database   Compression   Scaled   Time (ms)
lmdb       jpg           yes      1681.84
lmdb       none          yes      1683.32
on disk    jpg           yes      1682.05
on disk    jpg           no       1918.49

Interestingly, using lmdb provided little speedup over on-disk jpeg images. Whether or not the images are compressed to jpeg in the lmdb database also seems to have little impact. The big speed-up comes from prescaling images (which must be done if you use lmdb).

For EmojiNet, we used an on-disk data set of pre-scaled jpeg images. This was slightly easier to manage than writing everything to lmdb, and still provided a decent footprint on disk and fast processing time. We used mogrify and xargs to pre-scale the images in parallel.

Rather than training the net from scratch, we used a common technique called fine-tuning. The idea is to start with a net trained on other images (in this case, the 14 million images in ImageNet), and execute the training with reduced weights on the earlier layers. This worked well for the EmojiNet, although it steered the net towards semantic biases in many cases, as discussed below. We trained EmojiNet for about a quarter-million iterations (about a week wall clock time) using a batch size of 256 images.

The EmojiNet Web Service

After training EmojiNet, we wanted to build a scalable web service to handle all of your awesome Instagram photos. This presents an interesting engineering challenge: how can we use modern web service models with a GPU-bound computation? Two key principles we adhere to when building web services are:

  • Asynchronous processing to make the most out of the server’s resources.
  • Auto-scaling based on load.

The Caffe software is great for loading up a bunch of data and trying to train a classifier, but not as great at managing concurrent classification requests.

To build a scalable web service, we used an Akka actor to lock access to the GPU. That way, only one process may access the GPU at a time, but we get the illusion of asynchronous operations via the ask pattern. In addition, we can directly measure how long a process waits for the GPU to be unlocked. We publish this measure to Amazon’s CloudWatch and use it as a trigger on the auto-scaling group.

Below is a Scala example of our actor:

object EmojiNetActor {
  val instance = actorSystem.actorOf(Props(new EmojiNetActor))
}

private class EmojiNetActor() extends Actor {
  def receive: Receive = {
    case ClassificationRequest(images, queueTime) => {
      // Compute how long the message was queued for
      val waitTime = System.currentTimeMillis - queueTime

      // Publish the wait time to CloudWatch
      cloudwatch.putMetricData(...)

      // Hit the GPU and compute results
      val emojis = emojiNet.classifyImages(images)

      // Send the results to the requesting process
      sender ! ClassificationResults(emojis, waitTime)
    }
  }
}

case class ClassificationRequest(images: List[BufferedImage], queueTime: Long)

The singleton pattern provides one actor per JVM process and thus locks the GPU.

Finally, we use the ask pattern to provide the illusion of asynchronous operations:

implicit val timeout = Duration(1, "second")

def classifyImages(images: List[BufferedImage]): Future[EmojiResult] = {
  val request = ClassificationRequest(images, System.currentTimeMillis)
  EmojiNetActor.instance.ask(request).map {
    case x: EmojiResult => x
    case x: AnyRef => throw new RuntimeException(s"Unknown Response $x")
  }
}

Thus, we get an asynchronous call to a locked resource (the GPU) and the back-pressure to the actor is published to CloudWatch and read by our auto-scaling group. That way, we scale out as the GPUs get backed up.
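For reference, publishing that wait-time measure with the AWS Java SDK looks roughly like the sketch below; the namespace and metric name are placeholders rather than our exact production values.

import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient
import com.amazonaws.services.cloudwatch.model.{MetricDatum, PutMetricDataRequest, StandardUnit}

val cloudwatch = new AmazonCloudWatchClient()

def publishGpuWaitTime(waitTimeMillis: Long): Unit = {
  // One datum per classification request: how long it sat waiting for the GPU lock.
  val datum = new MetricDatum()
    .withMetricName("GpuQueueWaitTime")
    .withUnit(StandardUnit.Milliseconds)
    .withValue(waitTimeMillis.toDouble)

  cloudwatch.putMetricData(
    new PutMetricDataRequest()
      .withNamespace("EmojiNet")
      .withMetricData(datum))
}

An auto-scaling policy on this custom metric then adds GPU instances whenever the average wait time climbs.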

The EmojiNet web service has found its way into a few places other than just Emojini. Tim Hahn, one of our software engineers, wrote a Hubot script we’ve made available on GitHub that gives us EmojiNet integration with Slack: Slack Example

Insights

So, what do emojis mean? One of the fascinating things about the results is the visual characteristics that become associated with each emoji. To see this, we looked at the images with the highest confidence values for specific emojis.

As expected, semantic associations are well captured. This is primarily due to users associating tags with their visual counterparts, but is amplified by the fact that we trained with a base semantic net. Pictures of pizza, for instance, are strongly associated with the pizza emoji 🍕:

Pizza Results

The non-semantic associations, however, are far more interesting. Specifically, the neural net captures what visual characteristics of the image are associated with each emoji. Many emojis may not have a clear meaning in terms of a traditional language, but are still associated with specific visual content.

Face With Tears of Joy 😂, for example, is associated with images of memes:

Face With Tears of Joy Results

What’s fascinating here is that the EmojiNet can apply an emotional response even though it has no knowledge of the context or subject matter. This is the AI equivalent of a baby laughing at The Daily Show: it doesn’t understand why it’s funny.

Also of interest is when an emoji is co-opted by a social trend. Raised Fist ✊, for example, is described by Emojipedia as commonly “used as a celebratory gesture.” The Instagram community, however, has associated it with dirt bikes:

Raised Fist Results

Similarly, the syringe 💉 is often used for tattoos:

Syringe Results

We even see examples of brands snagging emojis. The open hands emoji 👐 has been claimed by Red Bull. We can only assume it’s because they look like wings.

Open Hands Results

Thankfully, other emojis have become visually associated with themselves:

Poo Results

What’s really interesting about these results is that society has assigned specific meanings to emojis, even when they weren’t the intention of the emoji creator. This makes the Curalate Emojini all the more fun: even unexpected results tell a story about how the Instagram community views a specific image. The EmojiNet demonstrates that such socially attributed meanings have clear visual cues, indicating that emojis themselves are evolving into their own visual language.

Build and Deploy at Curalate


Since Curalate began three years ago, our build and deploy pipeline has changed immensely. From a manual process run locally on our laptops to an automated system consisting of Jenkins, Packer, Chef, and Asgard, the progression has given us confidence in the system and allowed us to develop and deploy ever faster. In this post I’ll talk about how we build and deploy our code at Curalate. We’ll cover where we started three years ago, where we are now, and what the future holds.

The Past

As at most young startups, the build and deploy process at Curalate did not receive much attention at first. Whatever could get the code out the door quickly and easily was used, and things worked moderately well initially. This makes sense: as a young company you have to focus on your product and don’t have the luxury of devoting weeks or months to fine-tuning your build process. The small team and codebase size also allowed the process to be fairly ad-hoc. To that end, our initial process for deploying applications consisted of:

  1. Compiling code locally
  2. Uploading the build artifacts to a private S3 bucket
  3. Re-launching the relevant instances

Since our entire infrastructure runs on Amazon Web Services, we were able to leverage the EC2 userdata feature. The instances were running on vanilla Amazon Linux Amazon Machine Images (AMIs) and our custom userdata would download the build artifacts from S3 and launch the daemons at boot time. With the help of a few simple scripts this was quick and kept us moving along for a while. As both the number of applications and the size of the team increased, this approach started to break down. The two biggest problems were relying on a manual (and therefore error-prone) process of building locally and the inability to quickly roll back a bad deploy. The latter problem would exacerbate the former. Missing a build flag, using the wrong version, etc. are all mistakes we made while building and deploying locally, and all things that a standardized, repeatable build process would solve. While it’s true that these mistakes could have been addressed with the addition of another script or a change to the EC2 userdata, we decided it was high time to invest in a proper build and deployment pipeline.

The Present

When designing our next generation build and deployment pipeline we had several goals:

  • Centralized deployment dashboard
  • Fast deployment and rollback
  • Scale to many applications
  • Immutable build artifacts

With those goals in mind and the desire to avoid reinventing the wheel, we started to look at existing open source tools. Having used both Capistrano and Fabric in the past, I realized that, while they are great tools, we were looking for something more full-featured. Deployinator from Etsy solves a few of our challenges but ultimately relies on updating code from source on the target machines. As we’re a Scala-based shop this didn’t make as much sense, since the deployment artifacts (JAR/WAR) are already built. Deploymacy from Yammer sounds promising, but it doesn’t seem like it will be open-sourced any time soon. That left us with Asgard from Netflix, and further investigation revealed that it would solve all of the goals mentioned above.

At a high level our current pipeline can be summarized in the below graphic. Concretely, this means our source is checked into GitHub, Jenkins builds JAR/WAR artifacts, and those artifacts are packaged into AMIs to be deployed.

(Pipeline diagram: GitHub → Jenkins → Packer/Chef → AMI → Asgard)

Jenkins

Jenkins is the backbone of the build portion of the pipeline and its importance and utilization has only grown since our initial rollout. We use the Amazon EC2 plugin to automatically provision and terminate the build nodes on-demand, which is very beneficial in keeping costs low. Pull requests from GitHub are automatically retrieved, built, and checked to ensure that tests pass and they conform to our coding standards using the GitHub pull request builder plugin. For releases, Jenkins creates a tag and pushes it to GitHub, deploys the resulting artifacts to our internal Maven repository, and then kicks off a Packer run for each of the applications to be built. The Build Flow plugin offers a very nice DSL for designing complex build pipelines in code.

Packer

Packer is invoked by Jenkins, runs our Chef recipes, and bakes an AMI with the desired software and configuration. We followed Netflix’s approach to building AMIs as detailed in their blog post on Aminator. We start with a Foundation AMI: strictly a pristine OS installation (e.g., Ubuntu 14.04 LTS) with no extra software and no customization. It mostly exists so that we do not have to rely upon a public cloud image that may disappear at any time. Next is the Base AMI, which is built from the Foundation AMI and installs common software and tools that are needed across all our instances: think curl, screen, a specific JVM, etc. Finally, we bake custom AMIs on top of the Base AMI for each version of our applications. Packer and Chef make this whole complex process easy and repeatable.

Asgard

Finally, the “deploy” part of the build and deploy pipeline is handled by Netflix’s Asgard. As the AMI is the unit of deployment, a new version of an application is deployed by creating a Launch Configuration with the new AMI, assigning this Launch Configuration to a new Autoscaling Group (ASG), and sizing the new ASG appropriately. If the application is a web app or service, the new ASG is put into service simultaneously with the old version. Once the new ASG is scaled up properly and all instances are healthy the remaining traffic is shifted to the new ASG. At this point the old ASG is scaled down and can be safely deleted. Asgard’s Automated Deployment feature handles this workflow with ease.

Summary

Source code is checked into GitHub, Jenkins builds artifacts from it, and the Packer/Chef combination turns those artifacts into deployable AMIs. Asgard is then used to create an ASG with the new AMI and the code is rolled out into production. The key point here is that at every step the output is immutable: The git tag, the JAR/WAR, and the AMI never change once they are built.

The Future

Our build and deploy process has served us well over the last year and the properties it enforces (immutable artifacts, machine images as the unit of deployment, ASGs) are patterns we will continue to heavily utilize. It has allowed us to develop and deploy code faster, more reliably, and with more confidence while giving us the flexibility to start breaking our monolithic repository into discrete applications. That said, the build and deployment landscape is changing rapidly with the advent of containers and unikernels and we’re eagerly evaluating what our next version will look like. On the deployment side, new tools like Terraform promise to make the “infrastructure as code” phrase a reality. Netflix has recently released the successor to Asgard, Spinnaker, which builds upon the above properties but takes a more comprehensive approach. The ability to build AMIs as well as Docker images provides the flexibility to migrate to containers in the future, something we’ve been looking at recently. The introduction of Blue-Green deployments as a first-class citizen is also a welcome addition to the project. Needless to say, the build and deployment ecosystem is flourishing and we’re excited about what the future holds.

If working on any of the above challenges sounds interesting, we’re hiring in Philadelphia, Seattle, and New York!

Avoiding Pitfalls with DNS and AWS Elastic Load Balancer


Using ELB for your backend microservices? Seeing intermittent connectivity issues, partial outages across your instances, or other unexplainable failures?

TL;DR Respect DNS!

At Curalate, we are nearly two years into building out our backend microservices. We are using Finagle for a majority of these new services. When we first started operating these in production, one recurring problem was achieving our availability goals. The solution came down to understanding the interaction between the AWS Elastic Load Balancer (ELB) and DNS and the impact it can have on your services. Almost all of our outages boiled down to not handling DNS changes properly on our end and the three issues discussed in this post should help you avoid the same mistakes.

You must learn from the mistakes of others. You can’t possibly live long enough to make them all yourself. – Samuel Levenson

Issue 1: The default Java configuration uses an infinite DNS cache TTL

For security reasons the Java folks set the default DNS cache TTL to be FOREVER. Think about this for a second and you’ll realize that this configuration won’t work well for a dynamic/cloud-based services environment where the IP addresses that DNS resolves actually change (often quite frequently). If you’re thinking this might be your problem, compare the set of IPs the service clients were using during the problematic timeframe against the set of IPs after the issue passes. To record the live traffic you can use tcpdump and then analyze with Wireshark (covered in a future blog post!). To get the latest set of ELB IPs check your ELB DNS name or the Route53 entry pointing to it with:

host [dns_name] 

To set the TTL, modify the following value in the java.security file in your JRE home directory (e.g. ./Contents/Home/jre/lib/security/java.security)

Default:

#networkaddress.cache.ttl=-1

Change to:

networkaddress.cache.ttl=10

The AWS recommendation is to set the TTL to 60s. However, it’s not clear to me from the documentation that that will ensure zero issues pointing to stale load balancer IPs. Does the ELB guarantee that it keeps old load balancer instances available for at least 60s after they update DNS to point to a new set? Why risk it? We now use a 10 second TTL and we haven’t detected any performance degradation for our scenarios in doing so.

If you don’t rely on the Java Runtime, then double check how your runtime handles DNS TTLs by default; there could be similar default behavior.
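If editing java.security across your whole fleet is inconvenient, the same property can also be set programmatically. A minimal Scala sketch (it must run before the JVM performs its first lookup, and the ELB hostname below is made up):

import java.net.InetAddress
import java.security.Security

// Equivalent to networkaddress.cache.ttl=10 in java.security; set this early in your main().
Security.setProperty("networkaddress.cache.ttl", "10")

// Spot-check which IPs the ELB currently resolves to, e.g. to compare against what clients are hitting.
val ips = InetAddress.getAllByName("my-service-elb-1234567890.us-east-1.elb.amazonaws.com")
ips.foreach(ip => println(ip.getHostAddress))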

Issue 2: Your service client or framework is not respecting the DNS TTL

As mentioned, we are using Finagle as our backend services framework and, like many service or database client frameworks, Finagle manages a connection pool layer to decrease latencies when connecting to the same destination server machines (in this case the ELB instances). The recommended pattern is to create the client object with the connection pool once per process and then have each client request use a connection from the pool and return it when it’s done. In the ELB scenario where all requests are going to the same load balancer instance, this works especially well, and you can enable network keep-alive to further reduce the latency of each request.

So what’s the problem? The issue with these client frameworks and connection pools is they don’t necessarily handle DNS changes, so you can get stuck with a stale IP address.

To get around this, we evaluated a few solutions:

  1. Detect connection failure and shut down the server. Let the auto-scaling mechanism of wherever the client lived kick in with some fresh EC2 instances and up-to-date DNS. This was too aggressive for our current system. We are still young in our microservice journey and the clients of our backend services are monolithic web applications serving many different request loads. Taking down instances for availability blips from a single ELB doesn’t make sense for us. Also, the health checks on the web applications consuming the backend services are themselves simple ELB health checks and we don’t have the ability to implement rules like “kill the instance until less than 50% of instances are alive.”
  2. Finagle supports an extensibility point where we can plug in a DNS cache, as explained in this gist from a Stack Overflow post. This option wasn’t available to us at the time because we were on Finagle 6.22, but we’ll revisit this now that we are on 6.33. It looks more elegant than where we ended up.
  3. Detect connection failure, create a new client, and retry. Note that these retries are handled outside of the Finagle client framework since we found the retries built into the Finagle framework were not picking up the new DNS value either.

This sample Scala code for a Finagle client wrapper should be encapsulated in a helper of some sort. If you already have some simple retry helper code, it can fit in there. Some extra care is taken here so that we don’t create a bunch of new client objects once DNS changes occur.

object SampleServiceClientWrapper {
  private val CLIENT_MIN_LIFETIME_MILLIS = 2000
  private val NUM_RETRIES = 3
  private val SLEEP_BETWEEN_RETRIES_MILLIS = 200

  // This is the reference to the client that is overwritten in case of connectivity failure.
  private var client = createClient()
  private var lastClientResetTime = System.currentTimeMillis

  // The wrapped request method..."BYO" retry pattern.
  def makeSomeRequest(x: Long, y: Long): Long = {
    var retryCount = NUM_RETRIES
    while (true) {
      try {
        return client.makeSomeRequest(x, y)
      } catch {
        case e if retryableException(e) && retryCount > 0 => {
          retryCount -= 1
          Thread.sleep(SLEEP_BETWEEN_RETRIES_MILLIS)
          resetClient()
        }
      }
    }
    throw new IllegalStateException("unreachable") // keeps the compiler happy about the return type
  }

  // Finagle client builder code that returns a typed client for making requests.
  private def createClient(): SampleServiceClient = {
    clientFactory.buildClient(...)
  }

  // Recreate the client if we haven't done so in the immediate past.
  private def resetClient(): Unit = synchronized {
    if (System.currentTimeMillis - lastClientResetTime > CLIENT_MIN_LIFETIME_MILLIS) {
      client = createClient()
      lastClientResetTime = System.currentTimeMillis
    } else {
      // Client was reset recently. Do nothing. Trace. Log.
    }
  }

  // Handle all the exceptions that could be thrown if
  // it can't connect to the load balancer.
  private def retryableException(throwable: Throwable): Boolean = {
    throwable match {
      case e: TimeoutException => true
      case e: ChannelWriteException => true
      case e: UnresolvedAddressException => true
      case _ => false
    }
  }
}

It’s not perfect, but it has worked for us. A variation of this would be to have a background thread check the DNS of the ELB hostname and proactively signal a client reset upon detecting a change (see the sketch below). But since stale DNS entries are not the only source of intermittent network problems, we were already retrying on most of these exceptions anyway.
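Here is a rough sketch of that background-thread variation; the 10-second interval and the wiring back to the client wrapper are assumptions for illustration:

import java.net.InetAddress

object ElbDnsWatcher {
  @volatile private var lastIps: Set[String] = Set.empty

  // Re-resolve the ELB hostname on a daemon thread and invoke the callback when the IPs change.
  def start(elbHost: String, onChange: () => Unit): Unit = {
    val thread = new Thread(new Runnable {
      def run(): Unit = {
        while (true) {
          val currentIps = InetAddress.getAllByName(elbHost).map(_.getHostAddress).toSet
          if (lastIps.nonEmpty && currentIps != lastIps) {
            onChange() // e.g. trigger the client wrapper's resetClient()
          }
          lastIps = currentIps
          Thread.sleep(10000)
        }
      }
    })
    thread.setDaemon(true)
    thread.start()
  }
}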

Issue 3: Wildly inconsistent request rates

If your request load varies wildly throughout the day it can amplify any DNS issues because the more scaling operations that ELB has to do, the more it will be switching out load balancer instances and changing IPs. Incidentally, this actually makes for a good end-to-end test if you are rolling out new backend services: vary the load dramatically over hours of the day and see how the success rate of your client requests holds up.

One issue with request spikes is that you could overload the capacity of the ELB before it has time to adjust its scale. This AWS article describes that the scaling can take between 1 and 7 minutes. If this isn’t sufficient to handle your load spikes you can contact AWS Support and file a request to have them “pre-warm” specific ELBs with a certain configuration if you know the expected load characteristics. We haven’t needed this yet, but our backend services still have relatively low throughput and our latency requirements aren’t that strict yet. I expect this to be an issue in the future.

Conclusion

If you’re just starting to scale up your fleet of microservices, learn from our mistakes and get your DNS caching right. It’ll save a lot of time chasing down issues.


Further details from AWS documentation

“Before a client sends a request to your load balancer, it resolves the load balancer’s domain name using a Domain Name System (DNS) server. The DNS entry is controlled by Amazon, because your instances are in the amazonaws.com domain. The Amazon DNS servers return one or more IP addresses to the client. These are the IP addresses of the load balancer nodes for your load balancer. As traffic to your application changes over time, Elastic Load Balancing scales your load balancer and updates the DNS entry. Note that the DNS entry also specifies the time-to-live (TTL) as 60 seconds, which ensures that the IP addresses can be remapped quickly in response to changing traffic.”

How Elastic Load Balancing Works

“If clients do not re-resolve the DNS at least once per minute, then the new resources Elastic Load Balancing adds to DNS will not be used by clients. This can mean that clients continue to overwhelm a small portion of the allocated Elastic Load Balancing resources, while overall Elastic Load Balancing is not being heavily utilized. This is not a problem that can occur in real-world scenarios, but it is a likely problem for load testing tools that do not offer the control needed to ensure that clients are re-resolving DNS frequently.”

Best Practices in Evaluating Elastic Load Balancing

Bridging C++ to Scala with BridJ


At Curalate we’ve moved towards a microservice architecture with each service living in its own git repository. For the most part, we’ve standardized the way we build our Scala projects using Apache Maven to manage dependencies and compilation. This is convenient since any Curalady / Curalad can clone one of our repos and type mvn install at the root with the expectation that everything will compile successfully on the first try. We wanted this same ease of use for our Scala projects that needed access to native libraries and this post explains how we obtained it.

Seamlessly Interfacing Scala with C++

The JVM is an impressive piece of technology and enables awesome high-level languages like Scala. However, there are times that we need to use native languages like C++, especially when applying computer vision and machine learning. Like any good startup, we racked up technical debt to move quickly. Initially, our native projects were compiled manually and interfaced with Java via JNA. This required an error-prone multi-step process when making changes, including manually placing a dynamic library in a JAR for deployment. As native development became more important and the size of our team increased this manual process became cumbersome.

It was clear to us that we needed to overhaul our native development infrastructure. When we approached the task of redesigning our native build system we had several goals in mind:

  1. Standardizing native builds and providing push button operation (i.e. mvn install is all we need)
  2. Adding native functionality to a Scala project should be as simple as putting native source files in the right directories
  3. Minimizing boilerplate and saving developer time
  4. Including shared libraries in the final JAR should be automatic

Choosing the Interface

There are several options for using Java and C++ together which in turn, allows us to interface with Scala. The classic option is the Java Native Interface (JNI) which is part of the Java language specification. If you’ve ever used the JNI you may recall that there is quite a bit of boilerplate. In addition, almost all communication between the native code and Java must be done through special native JVM calls requiring a significant amount of glue code to do seemingly simple things.

A higher level alternative to JNI is Java Native Access (JNA) which when paired with JNAerator can minimize the boilerplate we need to write. JNAerator takes in a C/C++ header file and generates a Java source file with wrappers for each native function. This makes JNA appealing since we only need the header file, which we had to write anyway! The price for these high-level features is that JNA is significantly slower than the JNI. Often the sole reason for crossing the native boundary is speed so this is problematic.

Fortunately, JNAerator recently added support for yet another way to interface native code with Java, BridJ. BridJ is a relatively young project, but it claims to have speeds comparable to the JNI and it allows direct interfacing with C++. In contrast, the JNI and JNA are designed to interface with C which requires redundant extern declarations to use C++. BridJ also allows building shared libraries for multiple target operating systems and architectures. As long as the libraries are placed in a specific directory they will be included in the final JAR and at runtime BridJ extracts the library from the JAR and instructs the class loader to load the library.

Automating the Build

To integrate all of this into our existing development infrastructure we wrote a specialized Makefile along with a suite of scripts. We wrote hooks for specific Maven lifecycle phases to make everything seamless. Simply placing C++ header and source files in the right sub-directories is enough to get a working hybrid Scala / C++ project. Our build system takes care of calling JNAerator to generate Java wrappers, building shared libraries, and putting everything in the correct place in the final JAR for deployment.

Using the Interface

Now we’ll work through the obligatory hello world example here to show what BridJ looks like in practice.

First, we’ll write our C++ header to define the interface with Java. It’s best to stick to primitive types here like char*, int, etc. since JNAerator’s support for parsing headers is limited. To transfer arbitrary data or objects we found it was easier to serialize everything to a byte array and unpack that on the native side (Java’s ByteBuffer is handy here). This method of passing serialized data was better captured with two headers instead of just one. For the first header, we would restrict ourselves to primitive types to define the Java interface that will perform the serialization and call the appropriate native function. The second header file would be written to accept the unpacked data as more complex native types like objects and to supply the actual native implementation. This separation of concerns made things a little cleaner to implement.
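Before getting to the hello-world example, here is roughly what that byte-packing pattern looks like on the JVM side; the field layout and the native entry point are invented for illustration:

import java.nio.ByteBuffer
import org.bridj.Pointer.allocateBytes

// Pack the request into a plain byte array...
val buffer = ByteBuffer.allocate(12)
buffer.putInt(42)       // e.g. an ID field
buffer.putDouble(3.14)  // e.g. a score field
val bytes = buffer.array()

// ...then copy it into natively allocated memory and hand the pointer across the boundary.
val nativeBytes = allocateBytes(bytes.length)
nativeBytes.setBytes(bytes)
// SomeNativeLibrary.processRequest(bytes.length, nativeBytes)  // hypothetical generated wrapper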

Here’s our C++ header that specifies the Java interface:

#ifndef HELLO_WORLD_HPP
#define HELLO_WORLD_HPP

/**
 * Prints the given string to stdout.
 * @param str  the characters making up the string
 * @param len  the length of the string
 */
void helloWorld(const int len, const char* str);

#endif

and here’s the C++ implementation file:

#include "hello-world.hpp"
#include <iostream>
voidhelloWorld(constintlen,constchar*str){std::cout<<std::string(str,len)<<std::endl;}

Here’s the file automatically generated from our header by JNAerator:

package com.curalate.helloworld;

import org.bridj.BridJ;
import org.bridj.CRuntime;
import org.bridj.Pointer;
import org.bridj.ann.Library;
import org.bridj.ann.Ptr;
import org.bridj.ann.Runtime;

/**
 * Wrapper for library <b>hello-world</b><br>
 * This file was autogenerated by <a href="http://jnaerator.googlecode.com/">JNAerator</a>,<br>
 * a tool written by <a href="http://ochafik.com/">Olivier Chafik</a> that <a href="http://code.google.com/p/jnaerator/wiki/CreditsAndLicense">uses a few opensource projects.</a>.<br>
 * For help, please visit <a href="http://nativelibs4java.googlecode.com/">NativeLibs4Java</a> or <a href="http://bridj.googlecode.com/">BridJ</a> .
 */
@Library("hello-world")
@Runtime(CRuntime.class)
public class HelloWorldNativeLibrary {
    static {
        BridJ.register();
    }

    /**
     * Prints the given string to stdout.<br>
     * @param str  the characters making up the string<br>
     * @param len  the length of the string<br>
     * Original signature : <code>void helloWorld(const int, const char*)</code><br>
     * <i>native declaration : hello-world-native/src/main/jnaerator/include/hello-world.hpp:9</i>
     */
    public static void helloWorld(int len, Pointer<Byte> str) {
        helloWorld(len, Pointer.getPeer(str));
    }

    protected native static void helloWorld(int len, @Ptr long str);
}

We’re ready to call this from Scala now! Let’s fire up the REPL and try it out:

scala> import com.curalate.helloworld.HelloWorldNativeLibrary
import com.curalate.helloworld.HelloWorldNativeLibrary

scala> import org.bridj.Pointer.allocateBytes
import org.bridj.Pointer.allocateBytes

scala> import java.nio.charset.Charset
import java.nio.charset.Charset

scala> val message = "Hello World!" // The message we'd like to print.
message: String = "Hello World!"

scala> // It's important that we hold onto a reference to the allocated Pointer until
scala> // the native side returns or it may be freed by the JVM too early leading to a SEGFAULT.
scala> val nativeBytes = allocateBytes(message.size)
nativeBytes: org.bridj.Pointer[Byte] = Pointer(peer=0x7fab49c7e9b0, targetType=java.lang.Byte, order=LITTLE_ENDIAN)

scala> // Now let's copy the bytes into the natively allocated memory.
scala> nativeBytes.setBytes(message.getBytes(Charset.forName("US-ASCII")))
res3: org.bridj.Pointer[Byte] = Pointer(peer=0x7fab49c7e9b0, targetType=java.lang.Byte, order=LITTLE_ENDIAN)

scala> HelloWorldNativeLibrary.helloWorld(message.size, nativeBytes)
Hello World!

Let’s take a look at what the final JAR looks like when compilation is complete:

$ jar -tfv target/hello-world-native-0.1.0-SNAPSHOT.jar
     0 Wed Apr 06 14:49:22 EDT 2016 META-INF/
   132 Wed Apr 06 14:49:20 EDT 2016 META-INF/MANIFEST.MF
     0 Wed Apr 06 14:49:20 EDT 2016 com/
     0 Wed Apr 06 14:49:20 EDT 2016 com/curalate/
     0 Wed Apr 06 14:49:20 EDT 2016 com/curalate/helloworld
     0 Wed Apr 06 14:49:18 EDT 2016 lib/
     0 Wed Apr 06 14:49:18 EDT 2016 lib/darwin_universal/
  1130 Wed Apr 06 14:49:20 EDT 2016 com/curalate/helloworld/HelloWorldNativeLibrary.class
  9568 Wed Apr 06 14:49:18 EDT 2016 lib/darwin_universal/libhello-world-native.dylib
     0 Wed Apr 06 14:49:22 EDT 2016 META-INF/maven/
     0 Wed Apr 06 14:49:22 EDT 2016 META-INF/maven/com.curalate/
     0 Wed Apr 06 14:49:22 EDT 2016 META-INF/maven/com.curalate/hello-world-native/
  3505 Wed Apr 06 14:43:48 EDT 2016 META-INF/maven/com.curalate/hello-world-native/pom.xml
   131 Wed Apr 06 14:49:20 EDT 2016 META-INF/maven/com.curalate/hello-world-native/pom.properties

For this example, we compiled this for Mac OS X and the final library is stored in the JAR as lib/darwin_universal/libhello-world-native.dylib. If we also built the Linux library binary we could add it to this JAR as lib/linux_x64/libhello-world-native.so. At runtime BridJ would extract the appropriate library for the class loader allowing the JAR to be used with both Linux and Mac OS X.

Great, now when someone would like to use our code they can simply clone a git repo and type mvn install to get things compiled! At Curalate, we often need to interface with native code when working with machine learning or computer vision. In this post, we’ve given a brief tour of our custom build system that gives us a consistent, fast, and easy-to-use framework for interfacing Scala with C++. Have you had to tackle a problem like this before? If you have suggestions or another approach let us know. We’re always listening!

Hotline Ping: URL availability monitoring built with AWS Lambda and the Serverless Framework


Say you want to track the health of your API. Pingdom is probably your first move. But what if you want to track thousands of endpoints? Suddenly you’re looking at a monthly bill of over $500! So, what if you’re willing to build an API health tracker yourself? This blog post is for you. Not only can we beat $500 per month, we can build our API health tracker for damn near free!

First, let’s clear something up. We love Pingdom at Curalate. We’ve been using Pingdom for almost three years to track the uptime and latency of our major products, and we will continue to use it for the foreseeable future. But we’ve reached a point in our product where it is possible for us to break an endpoint for just one client, and that won’t be reflected in the overall API health. We’d like to check that data is flowing for each of our clients.

So let’s complement Pingdom with a lightweight solution that pings endpoints for all of our clients. And let’s call it Hotline Ping.

Choosing AWS Lambda

One major goal here is to run this health check regularly and frequently. We have a cluster of machines that run scheduled jobs, but there is no explicit execution start time due to the variable queue length. We could consider something like cron to launch the health tracker at an exact time, but that introduces a single point of failure. Neither of these are good options for a production monitoring system.

We could also spin up a standalone service, but making it resilient would require multiple servers and extra engineering and maintenance. That adds up to some real costs.

Revenge of the Nerds - Lambda House

AWS Lambda is built to handle exactly this type of work: regularly scheduled, low-density jobs. Why pay for a ton of unused CPU cycles? Also, NodeJS is well-suited to the task of pinging a long list of URLs, and Lambda supports Node natively. Lambda essentially lets you spin up a mini-instance to run your function once, then spin it down. Only need to run it once a day? Schedule it with Amazon’s built-in scheduled events and only pay for the CPU cycles it takes to run it once a day. Need to run the function a million times at once? Spin up a million instances of your function to run them all concurrently!

Lastly, we were looking for a nice test-case for AWS Lambda. We wanted to get a feel for how to code for it, how to deploy for it, and how to monitor it.

Serverless Framework

Serverless, previously known as JAWS, is a framework created in response to Lambda. It’s built by a company of the same name working full-time on making this open source framework great. It helps you automate deployments and versioning of your Lambda functions. It also helps you write clean code by separating the Lambda event handler code from the rest of your code. Using Serverless correctly allows you to deploy the same code you wrote for Lambda to an EC2 instance with relatively low overhead.

The framework is still pretty young (v0.5.5 as of writing this), but the team and contributors were incredibly responsive and helpful when we were building the first version of Hotline Ping. Their Gitter chatroom is very busy with the team, contributors, and new Serverless users.

Overall, Serverless helps smooth the few rough edges in Lambda.

Building Hotline Ping

Now that we’ve picked our stack, actually building Hotline Ping is pretty straightforward.

  1. Set up a scheduled event to run every 5 minutes (most frequent currently supported schedule)
  2. Write a function that reads a list of URLs from S3, pings each, and sends metrics to DataDog
  3. Hook that function up to a Lambda event handler using Serverless
  4. Configure Serverless for your AWS environment
  5. Bob’s your uncle

Hotline Bling pizza

We built Hotline Ping to avoid upping our monthly bill with Pingdom, so how cheap is Lambda for our project? We can use Matthew Fuller’s Lambda Cost Calculator to find out. Running our function every 5 minutes for 30 days is 8,640 executions per month. Our function does not need anything more than the minimum 128MB memory instance. And empirically, our function runs in a maximum of 90 seconds (so this will actually be an overestimate). That comes to a whopping $1.62 per month. Plus that’s ignoring the free tier AWS provides!
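A back-of-the-envelope version of that math, assuming Lambda’s published rate of $0.00001667 per GB-second and ignoring both the per-request charge and the free tier:

val executionsPerMonth = 12 * 24 * 30            // every 5 minutes for 30 days = 8,640 runs
val gbSeconds = executionsPerMonth * 90 * 0.125  // 90s worst case at 128MB (0.125 GB) each
val monthlyCost = gbSeconds * 0.00001667         // ≈ $1.62 per month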

There are still a few things we would like to tidy up and add in future work:

  1. Get this code into our Jenkins workflow, including for deployments. Right now we just deploy from a local machine.
  2. Track latency to give us a more complete picture of our API health
  3. Upgrade to the latest and greatest version of Serverless. We started Hotline Ping with v0.3.0, and they’ve already added a bunch of great changes by v0.5.5:
    • Much simpler configuration
    • Better directory structure
    • Configuration for scheduled events
    • Support for multiple AWS accounts for different deployment stages
    • Plus many more; they’re seriously cranking out some great new features

Final thoughts

So, how do you feel after your first foray into Lambda? Not so bad, right? With this simple example, our hope is that you better understand the potential to create services that use minimal resources but that can also scale massively and seamlessly.

Announcing Curalate's New Muse Page!


The good people at The Muse have made a lovely profile page detailing what it’s like to work at Curalate. We’re extremely proud of the culture we foster, the vision we pursue, and the products we build. You can check out more about the profile at the main Curalate blog.

And of course, we’re hiring! We’re looking for great engineers, designers, and product managers to join our growing team.


Tips for Starting With Redshift


When beginning to use Amazon Redshift there are always some growing pains. In this blog post we’ll go through 3 tips and tricks we learned from starting up our own Redshift pipeline!

Why We Use Redshift

At Curalate we serve a lot of images. In July of 2016 we served billions of images throughout our Fanreel, Like2Buy and Reveal products, and that number is steadily increasing. We don’t just serve plain images, though. All of these images are “productized”, containing information about the products within them or any associated information, and are used to drive interactions and sales for our customers. It’s easier to show what I mean than to describe it.

Say you stumbled across this picture of my dog and thought, “Wow I bet Fluffy would look great in that harness, and I wonder how much that leash costs”. Lucky for you, this image is productized and served through Reveal! You could easily hover over the image and answer those questions…


The value we add is not just how many images are served but how effective they are for our clients. We need to be able to answer questions with the analytics we provide, such as:

  • How well were they received by the public?
  • How many people clicked through to the products and came back for more?
  • Do people convert more often after interacting with these productized images?
  • Which images drove the most traffic to the product page?

To help answer these questions we’ve built a custom data pipeline which captures and stores usage metrics from our client-facing products. All of these metrics end up in an Amazon Redshift cluster, a columnar data warehouse from Amazon, which we use for daily rollups, client reports, and our own internal investigations to help make our products better.

Here’s a quick overview of our pipeline:

Our data pipeline has been running with near 100% uptime for well over a year. It is structured like this:

  1. Curalate products send usage metrics like impressions, clicks, and hovers, and we convert them to a standardized JSON format.
  2. Using an Apache Kafka queue and Pinterest’s Secor, we send batches of metrics to S3 for storage. If we were building this system today we would likely use Amazon Kinesis with Kinesis Firehose, since Firehose can now load directly into Redshift and Kinesis is more fully featured than it was during our initial development.
  3. We run a nightly job, which we’ll walk through below, to safely load our data from S3 into Redshift.

Learning to work with Redshift had a lot of interesting challenges and plenty of lessons learned. Here are a few of them!

Tip #1 - Design For Deduplication

TL;DR - Add a GUID and use a staging table to guarantee you only load new data

Redshift does not enforce a primary key, which means we need to deduplicate our own data. Duplication will confuse our analytic reports, much like this cute baby was confused by his duplicate dad.

Since we’re loading automatically from S3 we need to make sure the process is repeatable without introducing duplicates so that any engineer can re-run the load at any time without worrying about the state of the cluster. We designed this into our metrics system from the beginning and you should too. This will be especially useful if you’re working with time-series data and can rely on a strict ordering.

Step 1: Add a GUID

Make sure that every piece of information recorded into your pipeline has a GUID attached to it. It’s a relatively cheap operation to do per metric and is required for easy deduplication. You could also use Redshift’s string concatenation (e.g. col1 || col2) if you have a tuple primary key, but that will be a bit slower because of the extra computation.
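
For example, here is a minimal sketch of tagging each metric as it enters the pipeline; the field names are just illustrative, and the server-side timestamp is covered in Step 2 below.

import json
import time
import uuid

def to_pipeline_record(event):
    # Attach a GUID (and a server-side timestamp, per Step 2) to every metric before it is queued.
    event["guid"] = str(uuid.uuid4())
    event["server_timestamp"] = int(time.time())
    return json.dumps(event)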

Step 2: Use a server side timestamp

It may seem obvious, but users do weird things. Sometimes they like to pretend they’re living in the future so they set their clock days, months, even years into the future (or past). Never rely on a timestamp that came from something you don’t control. A server side timestamp should also be the sortkey for your cluster (we’ll get back to choosing that in Tip #2 below).

Step 3: Load into a staging table

Redshift’s COPY operation is a fantastic tool, but it’s fairly inflexible and can’t handle the logic we need. Instead of loading directly into your primary table(s), create a temporary table with the same layout as your production table:

CREATE TEMP TABLE stagingTableForLoad (
  server_timestamp BIGINT NOT NULL,
  guid VARCHAR(128) NOT NULL,
  data VARCHAR(128) NOT NULL,
  ...
)
DISTKEY(guid)
SORTKEY(server_timestamp);

It’s important to give a sortkey and distkey for this table, especially if your daily loads can be large. Once you’ve created your table, run the COPY command.

COPY stagingTableForLoad
FROM 's3://your-s3-bucket/todaysDate'
CREDENTIALS '...'
JSON 's3://your-s3-bucket/jsonpaths'
GZIP

Step 4: Insert any new rows (efficiently)

Using our GUID and our timestamp, we can efficiently check for any rows which may have already been loaded. This takes two calls. First, get the minimum and maximum timestamps from the data to be loaded. Since the server_timestamp column is the sortkey and our staging table is relatively small, this is a fast operation:

SELECT MIN(server_timestamp) AS min_timestamp,
       MAX(server_timestamp) AS max_timestamp
FROM stagingTableForLoad;

Using your new values, insert by comparing against a set of the GUIDs already present in the data’s time range.

INSERT INTO productionTable
SELECT *
FROM stagingTableForLoad
WHERE guid NOT IN (
  SELECT guid
  FROM productionTable
  WHERE server_timestamp BETWEEN min_timestamp AND max_timestamp
)

We use this method for our daily load and it runs very quickly, adding only a few seconds on top of the transmission time from S3.
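
To make the whole flow concrete, here is a minimal sketch of how the staged load could be scripted, assuming a psycopg2 connection to the cluster and the hypothetical table and bucket names used in the snippets above. Our actual nightly job differs in the details.

import psycopg2

# Assumed connection details; substitute your own cluster endpoint and credentials.
conn = psycopg2.connect(host="your-cluster.redshift.amazonaws.com", port=5439,
                        dbname="analytics", user="loader", password="...")
cur = conn.cursor()

# 1. Stage today's batch in a temp table with the same layout (and keys) as production.
cur.execute("CREATE TEMP TABLE stagingTableForLoad (LIKE productionTable);")
cur.execute("""
    COPY stagingTableForLoad
    FROM 's3://your-s3-bucket/todaysDate'
    CREDENTIALS '...'
    JSON 's3://your-s3-bucket/jsonpaths' GZIP;
""")

# 2. Find the time range covered by the staged data (fast: server_timestamp is the sortkey).
cur.execute("SELECT MIN(server_timestamp), MAX(server_timestamp) FROM stagingTableForLoad;")
min_ts, max_ts = cur.fetchone()

# 3. Insert only the rows whose GUIDs aren't already present in that range.
cur.execute("""
    INSERT INTO productionTable
    SELECT * FROM stagingTableForLoad
    WHERE guid NOT IN (
        SELECT guid FROM productionTable
        WHERE server_timestamp BETWEEN %s AND %s
    );
""", (min_ts, max_ts))

conn.commit()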

Tip #2 - Pick Your Sort Key and Dist Key Carefully

TL;DR - Choose your keys carefully or you’re in for slow queries and a very annoying weekend trying to rebalance a skewed cluster.

The sortkey and distkey (or partition key) of your cluster are important choices. They are set when your table is created and cannot be changed without rebuilding the full table. When you start dealing with huge tables, a poorly chosen distribution key or an under-utilized sortkey can make your experience using Redshift very painful.

Sort Key Selection

If you haven’t yet, read Amazon’s documents about choosing a sort key. If you are using time-series data (such as usage metrics or sales reports), set your SORTKEY column to be a timestamp generated by a machine that you control. By inserting data daily and running regular VACUUM operations, the sort key will help to keep your queries running fast.

If you aren’t working with time-series data you can use another column as your sort key to help make scans more efficient. Think of a few queries you’ll likely be writing, and if there’s a column where you frequently include range boundaries then use that as your sort key. Make sure to run frequent VACUUM operations if you aren’t adding new data in sortkey order.

In general you should include your sort key in every query you write unless you have a good reason not to. If a query you run frequently is unbounded, with no range component on the sort key, think really hard about whether Redshift is the right choice for you.

Distribution Key Selection

The distribution key is a bit trickier. Read up on the documents here. The distribution key (or distkey) sets up which column of your table will be hashed to choose the cluster partitioning for your data.

The enemy you’re constantly fighting with the dist key is data skew. Skew is the term for uneven resource distribution: when your cluster’s data isn’t uniformly distributed, a few nodes end up responsible for more than their share of the cluster’s load. No matter what you pick for your distkey, make sure that your cluster’s data skew stays as close to 0 as possible so that you use everything you’re paying for, and monitor skew as your cluster grows in case your assumptions about your data turn out to be incorrect. If you already have a cluster set up, Amazon has a very useful script you can use to measure your table’s skew and more.

This arm wrestler has a very high arm-muscle skew

Given that skew should always be a concern, there are two sides to balance when choosing your key: data co-location and parallelization.

  1. You can co-locate similar data onto the same machine by choosing a distkey that is shared by multiple events. This helps speed up processing by limiting communication between nodes in your cluster, but depending on your query it can also hurt performance by concentrating more processing onto fewer nodes. You have to be careful here, as this approach is likely to introduce some amount of skew into your disk usage.
  2. You can choose a purely random distkey. The query load and data will be evenly split across all nodes and skew will be 0, but you may incur some extra latency from moving data between nodes.

Currently our distribution key is purely random, but it wasn’t always that way. Migrating a distribution key was definitely a big growing pain for us.

A Story of Skew

When setting up our first cluster we chose a bad distribution key. We leaned too far towards co-locating data for query speed and we didn’t check our per-node CloudWatch metrics often enough to detect skew. After collecting a few months of data our cluster became inoperable without firing any alerts first. A quick investigation found that some of our largest clients were sharing a single node. That node’s disk had filled up and our cluster became unusable, despite the reported overall disk usage being only around 20%. We couldn’t run any queries until we freed up disk space on that single full node, so we had to find space before we could even try to rebalance the table.

We doubled the number of nodes in the cluster hoping to make room on that single node. This was ineffective because of the poor distkey choice: we didn’t have any control over the hash function used, so after doubling our cluster the full node only gained 2-3% more free storage. We ended up having to grow the cluster by 8x to free up around 15% of that node so that we could run a staggered copy onto a new, well-balanced table.

Tip #3 - Constant Functions Aren’t Cheap

The Redshift query planner does a great job with the hard stuff, but it has some blind spots as far as simple optimizations go. As an example, at Curalate we use epoch timestamps throughout our platform, which Redshift does not handle well: it has no native datatype for them and only limited functions for working with them. If you want to query against our cluster you need to use epoch timestamps, and we frequently want an easy way to use real dates within our SQL workbench instead of an external converter.

For example, the below will calculate the epoch timestamp for midnight on April 1st, 2016:

EXTRACT(epoch FROM timestamp '2016-04-01 00:00:00')

This does exactly what you’d expect, returning the epoch timestamp in seconds for April 1st. It’s a constant (there is no column or variable here) and on its own it’s a fast query. Redshift doesn’t currently support variables, so if you want to use this value you may be tempted to include the function directly in your query.

Say we want to find the total number of metrics recorded way back in June of last year (remembering that server_timestamp is our sortkey and it contains an epoch). If we use our easily-readable function, it looks like this:

SELECT COUNT(*)
FROM productionTable
WHERE server_timestamp >= EXTRACT(epoch FROM timestamp '2015-06-01 00:00:00')
  AND server_timestamp < EXTRACT(epoch FROM timestamp '2015-07-01 00:00:00')

Looks good, right? Easy to modify, simple to run against a variety of dates. Sadly, this query takes a very, very long time to run against a cluster of any meaningful size. The EXTRACT function is not treated as a constant, and is evaluated against every row in your table. This query performs a scan of the entire productionTable, loading and decompressing every timestamp and reevaluating the EXTRACT function for every row. In practice, the above query on our 19 billion row cluster took over 10 minutes before I got tired of waiting and killed it. It likely would’ve taken hours.

However, if you precompute the value, you’ll end up with this:

SELECT COUNT(*)
FROM productionTable
WHERE server_timestamp >= 1456804800
  AND server_timestamp < 1459483200

This query finished in 6 seconds! When in doubt, simplify all of your constants.
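
One easy way to keep the readability of real dates without paying that penalty is to precompute the constants outside of Redshift and paste the resulting integers into your query. A tiny sketch, assuming you want UTC month boundaries:

import calendar
from datetime import datetime

# Compute epoch-second constants for the month boundaries outside of Redshift,
# then paste the integers into the WHERE clause.
start = calendar.timegm(datetime(2015, 6, 1).timetuple())
end = calendar.timegm(datetime(2015, 7, 1).timetuple())
print("WHERE server_timestamp >= %d AND server_timestamp < %d" % (start, end))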

And for what it’s worth, the answer for us is hundreds of millions of events in June 2015. For comparison, in July 2016 we recorded nearly 8x that many. We’ve grown so much!

I hope some of these tips have been useful! If you have any questions or feedback please leave a comment.

Zipkin At Curalate


Curalate uses a micro-services infrastructure (SOA) to power its products. As the number of services began to grow, tracking down performance issues became more difficult due to the increasing number of distributed dependencies. To help identify and fix these issues more quickly, we wanted to utilize a distributed tracing infrastructure. The main goals were:

  • Standardize performance tracing across services
  • Generate trace identifiers that flow throughout the platform
  • Identify and examine outliers in near real-time
  • Collect performance data that can also be post-processed
  • Make it easy for developers to add trace points

After evaluating several distributed tracing solutions, we chose Zipkin. Zipkin is an open source distributed tracing project released by Twitter that is based on the ideas behind Google’s Dapper. Since a majority of our services were already based on Finagle, Twitter’s open source RPC system, Zipkin was a natural fit.

What are the main benefits of Zipkin?

Performance and Dependency Visualization

Zipkin allows you to quickly visualize complex requests, gaining insight into the timing, performance and dependencies. Here is an example trace from our QA environment (with names altered):

Traces can be queried by endpoint, by service, or even by custom trace annotations. For example: show requests from the last hour that targeted object “foo” and took longer than 250 ms.

The Zipkin Project

The Zipkin project has done a majority of the heavy lifting! They separated trace generation, collection, storage, and display to allow for trace collection at large scale. The project provides the following components:

  • Collector: Reads traces from Transport layer (Kafka, HTTP, Scribe) and writes them to a storage layer (MySQL, Cassandra, ElasticSearch, etc.)
  • Query Server: Queries the storage layer for relevant traces
  • Web Server: UI that accesses the Query component for traces

Depending on your trace throughput requirements, you can choose the trace transport, storage and number of collectors to fit your needs.

The amount of actual tracing-related code needed to generate a trace similar to the example above is quite small. We only needed to add a few tracing wrappers, a filter on the incoming requests, and trace points where non-finagle services are accessed (since any finagle-based access gets traced automatically).

Finally, Zipkin also has an active open source community and is under active development. For instance, they recently combined the Query and Web server components and added ElasticSearch as an additional storage option. And it isn’t hard to find others using Zipkin on the web too.

Pitfalls

Why didn’t my trace show up?

Curalate uses Scala as its main language for backend infrastructure. The Zipkin trace identifiers are stored in 
Twitter Locals (i.e. thread local storage). Our codebase contains a mix of threading models and thread pools. 
Making sure the current trace identifier can be accessed from any thread without explicitly passing it around required overriding the 
standard Scala thread-pools. There were quite a few times during development when a trace point was not logged as expected.

MySQL as the storage layer

MySQL was great for getting up a quick proof of concept and for the initial roll out. However, as we increase the number of traces being sampled, MySQL is becoming a bottleneck on both the trace ingest and serving sides. The good news is that we knew this was a possibility going in, and swapping out storage options (for Cassandra or ElasticSearch) should be relatively painless.

Final Thoughts

We have had Zipkin deployed for a few months now. The tracing pipeline is stable, and with the help of Zipkin we have identified and fixed several performance issues across multiple services. It is great to have another tool in our distributed systems tool belt!

Programmatic Jenkins jobs using the Job DSL plugin


Jenkins is an incredibly powerful and versatile tool, but it can quickly become a maintenance nightmare: abandoned jobs, lack of standardization, misconfiguration, and so on. But it doesn’t have to be this way! By using the Jenkins Job DSL plugin you can take back control of your Jenkins installation.

It always starts off so simple: you just shipped a couple of nasty bugs that should have been caught and decide it’s high time to jump on board the Continuous Integration train. You have a single application and repository so this should be pretty easy. You fire up a Jenkins instance, manually create the first set of jobs using the build commands you were running locally, and everything is humming along just fine. You lean back, put your feet up, and watch as Jenkins runs your impossibly-comprehensive test suite and prevents you from ever shipping another bug again.

Fast-forward two years and now you have a service-oriented architecture spread across a couple dozen repositories. Each of these repositories has its own set of Jenkins jobs, and, despite being based off an initial template, they’ve all deviated in subtle and not-so-subtle ways. Some jobs (for instance, the pull request builder) should be nearly identical for each repository; the only differences are the repository being built and the email address that will receive the scathing build-break email. The pain now becomes apparent when you need to make a change across all these jobs. For example, you realize that keeping the results of those pull requests from two years ago does nothing but take up disk space. Rather than edit all these jobs by hand you think “there must be a better way”, and you’re correct. Enter the Job DSL plugin.

Background

The Job DSL plugin really consists of two parts: A domain-specific language (DSL) that allows us to define job parameters programmatically and a Jenkins plugin to actually turn that DSL into Jenkins jobs. The plugin has the ability to run DSL code directly or to execute a Groovy script that contains the DSL directives.

Here is a trivial example of what the DSL looks like:

job('DSL-Test') {
    steps {
        shell('echo "Hello, world!"')
    }
}

This creates a new Freestyle job named “DSL-Test” containing a single “Execute shell” build step. The job that runs this DSL is known as the “seed job”, since it is used to create other jobs and is the only job you’ll need to manage manually. Before proceeding, let’s put this into action:

  1. Create a new Freestyle job
  2. Add a new build step and use the “Process Job DSLs” option
  3. Select “Use the provided DSL script” and paste the above snippet in the textarea

The configuration should look something like the following:

After running the seed job it will report the jobs it created both in the console output as well as the summary page. Further, any jobs that were created by the seed job will indicate they are managed by this job on their summary page.

Congratulations, you’re well on your way to DSL sorcery!

Now that we’ve got a handle on the basic usage let’s take it a step further. I’ll now show how we tamed our jobs by using a simple configuration file to drive the jobs being created.

Configuration

As hinted at above, the Curalate codebase is now spread across many repositories and we needed a fast and straightforward way to create and modify the Jenkins jobs that acted on these repositories. To that end we designed a YAML configuration file that would be used when creating the jobs:

project: Banana Stand
repo: curalate/banana-stand
email: gmb@curalate.com

Any parameter that can differ between jobs can be defined here. As all our services and projects are written in Scala and built with Maven there is little differentiation currently. However, this is designed to be flexible enough to support future projects that don’t conform to these parameters. We aimed to strike a balance between ultimate flexibility and a sprawling configuration file.

In order to make working with the YAML file easier we’ve also created a small value object that will be populated with the values of the YAML above:

package models

/**
 * Simple value object to store configuration information for a project.
 *
 * Member variables without a value defined are required and those with a value
 * defined are optional.
 */
class ProjectConfig {
    /*
     * Required
     */
    String project
    String repo
    String email

    /*
     * Optional
     */
    String command_test = "mvn clean test"
}

As we’ll see below, we use SnakeYAML to read in the configuration file and create the ProjectConfig instance.

Template

The template below is where we actually define the parameters of the job. In this case it’s a GitHub pull request builder. It will listen for pings from GitHub’s webhooks, run the test command (as defined in the configuration), and then set the build result on the pull request. We also define a few other parameters such as allowing concurrent builds, discarding old build information, and setting a friendly description. This is just a very small sample of what the DSL can do but it is very close to what we actually use for our pull request template internally. The Job DSL wiki is an excellent resource for more advanced topics and the API Viewer is indispensable for figuring out that obscure directive.

package templates

class PullRequestTemplate {
    static void create(job, config) {
        job.with {
            description("Builds all pull requests opened against <code>${config.repo}</code>.<br><br><b>Note</b>: This job is managed <a href='https://github.com/curalate/jenkins-job-dsl-demo'>programmatically</a>; any changes will be lost.")
            logRotator {
                daysToKeep(7)
                numToKeep(50)
            }
            concurrentBuild(true)
            scm {
                git {
                    remote {
                        github(config.repo)
                        refspec('+refs/pull/*:refs/remotes/origin/pr/*')
                    }
                    branch('${sha1}')
                }
            }
            triggers {
                githubPullRequest {
                    cron('H/5 * * * *')
                    triggerPhrase('@curalatebot rebuild')
                    onlyTriggerPhrase(false)
                    useGitHubHooks(true)
                    permitAll(true)
                    autoCloseFailedPullRequests(false)
                }
            }
            publishers {
                githubCommitNotifier()
            }
            steps {
                shell(config.command_test)
            }
        }
    }
}

Seed job

Lastly, we need some glue to tie it all together and that’s where the seed job comes in. Rather than define the seed job inline as we did above we’re using a Groovy script that is stored in the repository. This moves more of the code outside of the Jenkins job and into version control. We’ve configured this seed job to run periodically on a schedule as well as after every commit. This ensures that the jobs it manages never deviate for too long from their configuration.

Before we can jump into the actual Groovy code we first need to make sure we can parse the YAML configuration files. For that, we’ll be using SnakeYAML. Since this library isn’t available by default it needs to be manually included. Thankfully, the plugin authors already thought of this and have the situation covered. To include the SnakeYAML library we add an initial “Execute shell” build step and download the library directly from Maven central into a ./libs directory:

#!/bin/bash
mkdir -p libs && cd libs

if [ ! -f "snakeyaml-1.17.jar" ]; then
    wget https://repo1.maven.org/maven2/org/yaml/snakeyaml/1.17/snakeyaml-1.17.jar
fi

The next step in the seed job is to execute the DSL script. Notice that we’re using main.groovy as the entry point and we instruct the plugin to look for additional libraries in the ./libs directory that was created in the previous build step.

Now that we’ve got the foundation in place we can finally take a look at the contents of the main.groovy script. This is a single entry point where all our programmatic jobs are created.

import models.*
import templates.*
import hudson.FilePath
import org.yaml.snakeyaml.Yaml

createJobs()

void createJobs() {
    def yaml = new Yaml()

    // Build a list of all config files ending in .yml
    def cwd = hudson.model.Executor.currentExecutor().getCurrentWorkspace().absolutize()
    def configFiles = new FilePath(cwd, 'configs').list('*.yml')

    // Create/update a pull request job for each config file
    configFiles.each { file ->
        def projectConfig = yaml.loadAs(file.readToString(), ProjectConfig.class)
        def project = projectConfig.project.replaceAll(' ', '-')
        PullRequestTemplate.create(job("${project}-Pull-Request"), projectConfig)
    }
}

In this example we run the following steps:

  • Look in the ./configs directory for all files ending in .yml
  • Parse each YAML configuration file and create an instance of ProjectConfig
  • Create an instance of PullRequestTemplate, passing in the configuration instance

While I’ve left out some additional steps we take (such as creating dashboards for each project), the overall structure and code is very similar.

Conclusion

Using the above structure, a few days of work, and a hundred lines of code we were able to transform dozens of manual jobs into a set of coherent, standardized, and source-controlled jobs. The use of source control has also allowed us to set the seed job to automatically build when a new change is merged. So not only do we get the benefit of peer review before a change is merged but that change is deployed and available within mere seconds after it is merged. This speed combined with the DSL and Groovy scripting ability make this an incredibly powerful paradigm. All the code presented here is available in the jenkins-job-dsl-demo repository on GitHub so feel free to use it as a jumping off point.

Are we doing it wrong? Did we miss something? Come join us and help solve challenging problems. We’re hiring in Philadelphia, Seattle, and New York.

Building a Tracking Pixel in 3 Steps (featuring AWS Kinesis Firehose!)


At Curalate, we need to be able to use data to demonstrate that our products hold value for our clients. One of our products, Fanreel, uses user-generated content to enhance online shopping experiences and product discovery. We record and store usage metrics from Fanreel but we also need to take those usage metrics and connect them to product purchases. If Fanreel analytics were a puzzle, purchase information would be the last piece and historically, Google Analytics or Adobe Omniture served as this last piece. However, every ecommerce site is different so sometimes the intricacies of Google Analytics and Adobe Omniture got in the way.

We wanted to have a simple “one size fits all” solution so we have turned to a simple tracking pixel. Specifically, our first tracking pixel is a checkout pixel which lives on our clients’ checkout confirmation pages to collect transaction data and serve as the last piece of our analytics puzzle.

What Is a Tracking Pixel?

A tracking pixel is a 1x1 transparent image that sends data from the webpage the pixel lives on. When the page loads, a GET request is made for the image along with query parameters that contain user data.

<img src="https://yourwebsite.com/trackingpixel?pixelid=12345&username=shippy&company=curalate&position=engineer">

This tracking pixel is sending information from a pixel with id 12345 about a user named Shippy who works at Curalate and whose position is engineer.

When the server receives the request for the image, it also receives all of the query parameters, simplifying data transfer between different websites. Since you can place a tracking pixel anywhere that you can use Javascript, it’s a simple and flexible way of transferring data. To get a tracking pixel up and running, you’ll need a few things:

  1. a Javascript snippet to collect the data you’re tracking and request the tracking pixel image
  2. a servlet to receive the tracked data and return the tracking pixel image
  3. a way to stream/store the data

We solve step 3 with AWS Kinesis Firehose. We like Kinesis Firehose because it’s easily configurable and because it fits nicely into our existing data pipeline which uses AWS Redshift extensively as a data store.

Let’s go through how you can set up a tracking pixel and AWS Kinesis Firehose to collect some information on our friend Shippy, the engineer who works at Curalate:

Step 1: Create a Javascript Library to Collect and Send Data

First, you need a small snippet that initializes a global queue of pixel functions to execute and also asynchronously loads your Javascript library of pixel functions (more on this soon!).

(function() {
  // initialize the global queue, 'q', which is attached to a 'crl8' object
  if (!('crl8' in window)) {
    window.crl8 = function() {
      window.crl8.q.push(arguments);
    };
    window.crl8.q = [];
  }

  // load your js library of pixel functions...
  var script = document.createElement('script');
  script.src = 'https://yourwebsite.com/js-min/pixel-library.min.js';

  // ...do it asynchronously...
  script.async = true;

  // ...and insert it before the first script on the page!
  var firstScript = document.getElementsByTagName('script')[0];
  firstScript.parentNode.insertBefore(script, firstScript);
})();

You can minify this file and put it in the head tag of whatever pages you want your tracking pixel to live on.

As for Your Pixel JS Library File…

This file defines all of the functions needed to gather the data you’re interested in and to generate a request for the pixel that will send the data to your server. It will also pull events off of the global queue and execute them. This library should provide functions so that whoever is placing your pixel on their website can use these functions to include exactly the data you want the pixel to collect.

(function() {
  var api = {};       // use this object to store all of your library functions
  var pixelId = null;
  var data = {};      // use this object to store the data you're collecting and sending

  // if your pixel will be used in multiple places, unique pixel ids will be crucial to
  // identify which piece of data came from which place
  api.init = function(pId) {
    pixelId = pId;
  };

  // include a function for each type of data you want to collect and add it to your data object.
  // if we're trying to collect Shippy's name, company, and position, we'll have the following
  // functions, which should take in an object with key and value as argument (this will form your
  // query parameters):
  var addData = function(obj) {
    Object.keys(obj).forEach(function(key) {
      data[key] = obj[key];
    });
  };
  api.addName = function(n) { addData(n); };
  api.addCompany = function(c) { addData(c); };
  api.addPosition = function(p) { addData(p); };

  // include a function to turn all the data you've collected in the data object into query
  // parameters to append to the url for the pixel on your server
  api.toQueryString = function() {
    var s = [];
    Object.keys(data).forEach(function(key) {
      s.push(key + "=" + encodeURIComponent(data[key]));
    });
    return s.join("&");
  };

  // include a function to add the query parameters to your pixel url and to finally append
  // the resulting pixel URL to your document
  api.send = function() {
    var pixel = document.createElement("img");
    var queryParams = api.toQueryString();
    pixel.src = "https://yourwebsite.com/trackingpixel/" + pixelId + "/pixel.png?" + queryParams;
    document.body.appendChild(pixel);
  };

  // pull functions off of the global queue and execute them
  var execute = function() {
    // while the global queue is not empty, remove the first element and execute the
    // function with the parameters it provides
    // (assuming that the queued element is a 2 element list of the form
    // [function, parameters])
    while (window.crl8.q.length > 0) {
      var command = window.crl8.q.shift();
      var func = command[0];
      var parameters = command[1];
      if (typeof api[func] === 'function') {
        api[func].call(window, parameters);
      } else {
        console.error("Invalid function specified: " + func);
      }
    }
  };

  execute();
})();

Step 2: Set Up a Servlet

The servlet ties everything together.

  1. receive the data sent by the tracking pixel
  2. return a 1x1 transparent image to the page that requested the pixel
  3. send the data to your Firehose delivery stream
import java.nio.ByteBuffer

import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClient
import com.amazonaws.services.kinesisfirehose.model.{PutRecordBatchRequest, Record}

class TrackingPixelServlet extends ScalatraServletEx {

  private val firehoseClient = new AmazonKinesisFirehoseClient(credentials)

  // this should match the name that you set for your Kinesis Firehose delivery stream
  private val DELIVERY_STREAM = "tracking-pixel-delivery-stream"

  getEx("/:pixelId/pixel.png") {
    // extract the tracking pixel data from query parameters
    val pixelId = paramGetter.getRequiredLongParameter("pixelId")
    val username = paramGetter.getRequiredStringParameter("username")
    val company = paramGetter.getRequiredStringParameter("company")
    val position = paramGetter.getRequiredStringParameter("position")
    val userData = UserData(pixelId, username, company, position)

    // create a record
    val jsonData = JsonUtils.toJson(userData) + ENTRY_SEPARATOR
    val record = new Record
    record.setData(ByteBuffer.wrap(jsonData.getBytes()))

    // send the record to your firehose delivery stream
    val request = new PutRecordBatchRequest()
    request.setDeliveryStreamName(DELIVERY_STREAM)
    request.setRecords(java.util.Collections.singletonList(record))
    firehoseClient.putRecordBatch(request)

    // return a 1x1 transparent image to the page with the tracking pixel
    contentType = "image/png"
    PIXEL_IMG // your pixel
  }
}

case class UserData(pixelId: Long, username: String, company: String, position: String)

Step 3: Set Up AWS Kinesis Firehose

Now that you have all of this data collected by your tracking pixel, how do you store it? At Curalate, we’ve turned to AWS Kinesis Firehose. Firehose is specifically designed to provide an easy and seamless way to capture and load data into AWS, and setup consists of little more than a few clicks in the AWS console. Firehose is also great since data streamed to a Firehose delivery stream can ultimately land in ElasticSearch, S3, or Redshift.

Since we love Redshift and already use it very heavily, that’s our chosen destination for our checkout pixel. The AWS setup documentation is quite thorough but here are a few tips that we picked up from our setup for our checkout pixel:

  • Adjust the buffer size and buffer interval of your Firehose delivery stream to control your throughput. If you have a lot of data, consider reducing the buffer size and interval to get quicker updates into your final AWS destination.
  • After you’ve set up your tracking pixel Javascript and your Firehose delivery stream, take advantage of the error logs and Cloudwatch monitoring that Firehose provides to verify that your tracking pixel is correctly sending data to your delivery stream and to your delivery destination.

And That’s It!

Since you can place a tracking pixel anywhere that you can use Javascript, it’s a simple and flexible way to collect data. Combined with AWS Kinesis Firehose, the pipeline from data collection to storage is very adaptable to your specific needs and very easily configurable.

Mechanical Turk Lessons Learned


At Curalate, we constantly dream up big ideas for new products and services. Big ideas that require lots of work. Lots of boring, repetitive, simple work that we honestly do not want to do ourselves. In situations like this we turn to the industry standard for getting other people to do work for you, Amazon Mechanical Turk. Amazon’s “Artificial Artificial Intelligence” service connects requesters to people from across the globe (known as “Turkers”) to complete a set of Human Intelligence Tasks or HITs. We’ve used Mechanical Turk in the past to create labeled datasets for use in our machine learning models to tackle various deep learning problems here at Curalate (such as the Emojini 3000 and Intelligent Product Tagging). In the process we have learned a few lessons that would have saved us a lot of time if known beforehand, so here they are to help if you want to get the most out of your valuable Turk time.

Turkers are surprisingly cost effective

So why use Mechanical Turk in the first place? Turkers will work for a single penny in many cases. Even with Amazon’s additional fees such as a 20% service fee and additional fees for premium qualifications, you can get a lot of work done for very little investment. Because of this low upfront cost, it’s beneficial to test your HITs on smaller data sets first to work out any issues. Maybe the instructions for your task are not clear enough for the Turkers or maybe your job has a poor conversion rate (workers finding your HIT, but not wanting to do it). More on avoiding some of these issues later.

Now, you will have to tune the reward for your HITs a bit. Too much and your HIT will get done very quickly, but you will be wasting money. Too little and no one will want to do your HIT. Once again, this is where you can set up multiple versions of your HIT at different price points to find the sweet spot.

Master turkers are worth it

To help avoid some of the quality issues that can come with using Mechanical Turk, we use Master Turkers on all of our HITs. These are Turkers who have a history of giving good results. There is an added cost to using them, however, and they are harder to attract to a job, but their answers are more consistent than those of non-Masters.

Setting up a job is easy

If you are reading this engineering dev blog you likely know everything you need to set up a Mechanical Turk job. The HIT layout and questions are in standard HTML. HITs are generally unstyled (we’re talking 1995 era web styling here), so you can largely forgo any fancy CSS. You really just need to know how to make lists, tables, and standard input fields. The online editor is clunky and will not automatically save work if you accidentally go back a page or something, so save your work often.

Your jobs are uploaded as a CSV file, one HIT per row, with each column containing a string value that replaces the corresponding placeholder in the HTML source of your HIT. You can use this for things like setting per-task image addresses, links to external pages, or custom text for the user, but be aware that Mechanical Turk does not support full UTF-8 and will complain if you try to upload a CSV file containing your favorite emojis 😞
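
As an illustration, generating that CSV is straightforward. The column names and rows below are hypothetical; each column must match a ${...} placeholder in your HIT template, and the file name is just an example.

# -*- coding: utf-8 -*-
import csv

# Hypothetical columns; each one corresponds to a ${...} placeholder in the HIT's HTML template.
rows = [
    {"image_url": "https://example.com/images/1.jpg", "question": "Does this image contain a dog?"},
    {"image_url": "https://example.com/images/2.jpg", "question": "Does this image contain a dog?"},
]

with open("hits.csv", "wb") as f:  # Python 2; on Python 3 use mode "w" with newline=""
    writer = csv.DictWriter(f, fieldnames=["image_url", "question"])
    writer.writeheader()
    for row in rows:
        writer.writerow(row)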

Less is more

A problem that we ran into early on was a poor conversion rate with the Turkers. They were viewing our job and then leaving, and our best guess was that they didn’t feel it was worth their time. The problem was that we were either asking too many questions, even if they were very simple questions, or presenting the Turkers with a massive wall of text that they did not want to read.

Our advice

  • Have them answer as few questions as possible.
  • Try to keep them yes/no style questions if possible.
  • Hide your instructions for the task in a drop-down drawer (the example HITs provided by Amazon do this as well), since a Turker really only needs to see them once.
  • Make sure the instructions are not annoyingly long to read.
  • Make sure your HIT is short enough that the Turker does not have to scroll to view it all.

To help track our conversion rate of Turkers we embedded an HTML-only tracking pixel in our HIT template. There are many free tracking pixels available, but this is the one that we use. Just use the basic anchor tag tracking pixel, stick it at the bottom of your HIT template HTML, and everything should™ work fine.

Standing out from the crowd

When your HIT goes live, it is going to be placed in a pool with all of the other current HITs available. You have to stand out from the crowd and make your job easy to find for the Turkers who would want to complete it.
In our past experience, short and catchy titles and keywords increase the amount of new Turkers finding your HIT, which in turn increases the rate at which your overall job gets completed. Likewise, Turkers do judge a book by its cover, so try to not include words in the title that would make the Turker think that the job would take too long or contain content that is boring or uninteresting to work on. At the same time, if your HIT contains NSFW content you do have to properly mark it as such when creating the HIT in the dashboard.

Don’t assume background knowledge

Overall Turkers are largely from the US and India. Amazon is slowly expanding into other markets around the world, but you can expect to get citizens of these two countries on your HIT. Therefore it’s important that your HIT makes sense to non-native English speakers or those who may be unfamiliar with certain cultural knowledge. For example, Instagram’s active user base is largely in American and European markets. This would mean that a question like “Which Instagram filter best describes you?” would make zero sense to a lot of Turkers.

Turkers do not want to waste time trying to figure out your crazy HIT. Keep an elementary difficulty level in the task and provide simple examples over explanations when possible. The faster a Turker can complete your HIT / the more straightforward it is to understand, the more likely you’ll see completed HITs and repeat workers.

Repeat visitors are a good sign

If your HITs are good in terms of a price to difficulty ratio, you will see many repeat workers. This is generally a good thing as having the same workers on your tasks will result in faster results and more consistent answers. The tracking pixel also comes into use here for tracking the unique visitors.

Manual checking of your results

Even after all of this you will still get bad answers. We have had plenty of Turkers who would just pick the same option for every question. To help avoid this we have multiple Turkers, preferably an odd number, take passes on each HIT. The results from your HITs will be returned in a CSV which links a Turker’s ID to their answers, so we also wrote scripts to do some basic analysis of the answers given by each worker to see if their answers met some red-flag criteria. This could include a worker always answering in largely the same pattern or with the same answer, or always disagreeing with the other Turkers assigned to the same HIT.
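
Here is a rough sketch of what such a red-flag script might look like. The file name, column names, and thresholds are assumptions; adjust them to match your own HIT layout and answer fields.

import csv
from collections import defaultdict, Counter

# Group each worker's answers, assuming the results CSV has "WorkerId" and "Answer.label" columns.
answers_by_worker = defaultdict(list)
with open("results.csv") as f:
    for row in csv.DictReader(f):
        answers_by_worker[row["WorkerId"]].append(row["Answer.label"])

for worker, answers in answers_by_worker.items():
    counts = Counter(answers)
    most_common_answer, count = counts.most_common(1)[0]
    # Flag workers who gave (nearly) the same answer on everything.
    if len(answers) >= 20 and count / float(len(answers)) > 0.95:
        print("Red flag: %s answered '%s' on %d of %d HITs" %
              (worker, most_common_answer, count, len(answers)))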

Even then, we still felt the need to manually check the HITs that returned results on the border of our problem space (i.e. an almost even amount of Turkers answering yes or no on a HIT). We would go through and manually check these answers to solidify the result.

And that’s it! At least it’s all of the general, not-too-specific things we have figured out. By no means is this everything that you need to know to successfully use Amazon Mechanical Turk, but it should make your experience as a requester a bit smoother.

Creating an iOS Share Extension for an Ionic App


At Curalate we build a mobile app that allows our clients to use our services on-the-go. It provides an ever-evolving subset of features from our web experience and we ship updates every month or so, but we don’t have any engineers that specialize in mobile development on our team! Ionic enables us to leverage the expertise of our full-stack engineers to ship a product on mobile using client technology that we’re already familiar with - AngularJS. We write well-factored JavaScript in Angular services once for the web, layer on unit tests, then leverage that same code to build an iOS app.

Fast forward to this summer - the Curalate dev team moved fast to launch Tilt to enable brands to create shoppable channels of vertical video. As a prerequisite, we wanted to make it easier for users to share images and videos with Curalate from any app by releasing an iOS share extension. We started looking for documentation about how to do this with Ionic but found that it was not supported as a first-class scenario. This is one area where Ionic doesn’t make things easier, but by sharing a guide to shipping an iOS share extension for an Ionic app we hope it’ll encourage more developers to try it.

Code Native

Since Ionic doesn’t have first class support for share extensions (iOS or otherwise), we have to roll up our sleeves and delve into native platform development to get the job done. Here’s how:

1. Generate a Share Extension with XCode

The easiest way to get started is to open the .xcodeproj generated by the “ionic platform add ios” command. From XCode, follow the instructions provided by Apple here to generate the boilerplate for a simple Share extension.

2. Choose Supported Content Types

Next, edit your extension project’s .plist file to declare the types of files you want to handle. For our app we enabled sharing images and videos using the following config:

<key>NSExtension</key>
<dict>
  <key>NSExtensionAttributes</key>
  <dict>
    <key>NSExtensionActivationRule</key>
    <dict>
      <key>NSExtensionActivationSupportsImageWithMaxCount</key>
      <integer>10</integer>
      <key>NSExtensionActivationSupportsMovieWithMaxCount</key>
      <integer>10</integer>
    </dict>
  </dict>
</dict>

3. Share Authentication with your App

We require users to be authenticated with their Curalate account before they upload. We used iOS keychain sharing (described here) by wrapping what we learned from this StackOverflow post in a Cordova plugin. Whenever a user logs in or out of the Curalate mobile app we update the information securely stored in iOS keychain, then retrieve that information when our iOS share extension is launched.

4. Create an App Group

Images and videos can be quite large, so supporting background upload was a priority for us. Per documentation here, iOS has a restriction that background uploads must create a shared container - here’s how.

5. Upload All-The-Things

Now that you’ve got an App Group and you’ve retrieved auth information from Keychain Sharing, let’s upload the files that your user wants to share - here’s the relevant excerpt from the code we used to configure the background upload:

#define kAppGroup @"group.com.myapp"

- (void)uploadFileAtPath:(NSURL *)filePathURL fileName:(NSString *)fileName {
    // Retrieve the authentication headers from iOS Keychain
    NSDictionary *httpHeaders = [SimpleKeychain load:@"com.myapp.keychain"];

    // Form the HTTP request
    NSMutableURLRequest *request = [[NSMutableURLRequest alloc] initWithURL:[NSURL URLWithString:@"https://app.myapp.com/upload"]];
    [request setCachePolicy:NSURLRequestReloadIgnoringLocalCacheData];
    [request setHTTPShouldHandleCookies:NO];
    [request setTimeoutInterval:120];
    [request setHTTPMethod:@"POST"];
    [request setValue:fileName forHTTPHeaderField:@"fileName"];
    [request setValue:@"1" forHTTPHeaderField:@"basefile"];
    [request addValue:@"application/octet-stream" forHTTPHeaderField:@"Content-Type"];

    // Configure and execute a background upload
    NSString *appGroupId = kAppGroup;
    NSString *uuid = [[NSUUID UUID] UUIDString];
    NSURLSessionConfiguration *backgroundConfig = [NSURLSessionConfiguration backgroundSessionConfigurationWithIdentifier:uuid];
    [backgroundConfig setSharedContainerIdentifier:appGroupId];
    [backgroundConfig setHTTPAdditionalHeaders:httpHeaders];
    self.backgroundSession = [NSURLSession sessionWithConfiguration:backgroundConfig delegate:self delegateQueue:[NSOperationQueue mainQueue]];
    NSURLSessionUploadTask *uploadTask = [self.backgroundSession uploadTaskWithRequest:request fromFile:filePathURL];
    [uploadTask resume];
}

With that your upload is underway!

In our initial testing we noticed some flakiness in uploads completing, and eventually traced the issue back to the fact that we needed to implement handleEventsForBackgroundSession within our app. You don’t have to do anything special, just add this to your AppDelegate.m.

- (void)application:(UIApplication *)application handleEventsForBackgroundURLSession:(NSString *)identifier completionHandler:(void (^)())completionHandler {
    completionHandler();
}

Optimize Your Development Workflow

All of this is great, until you re-run “ionic platform add ios” and it blows away all of the things you’ve just configured in XCode :( We haven’t found a good way to auto-generate this stuff yet, so for the time being we’re checking in our iOS extension code and manually enabling the app group and keychain sharing capabilities whenever we need to re-add the iOS platform.

That said, we looked into a few solutions that didn’t ultimately pan out; these included:

  • Declaring the iOS app capabilities in Cordova’s config.xml or similar, namely:
    • Push Notifications
    • Keychain Sharing
    • App Groups
  • Configuring Ionic/Cordova to reference existing code for our iOS extension. Unfortunately the .xcodeproj structure contains a lot of generated interlinked keys that prevented us from going this route.
  • Using the Xcodeproj CocoaPod to generate our iOS extension from checked-in source files. This looked promising, but doesn’t support Share extensions.

We’re continually looking for improvements to our development process, and will post updates here for any that we find!

Content Based Intelligent Cropping


Square pegs don’t fit in round holes, but what if you have power tools?

Digital images often don’t fit where we want them: advertisements, social networks, and printers all require that images be a specific aspect ratio (i.e., the ratio of the image’s width to height). Take Facebook ads for example: different aspect ratios are required depending on what kind of ad you wish to run. This is a large pain point for marketers: each piece of content must be manually cropped to fit the aspect ratio of the channel. Typically, images are either padded with white pixels (thus wasting valuable screen real estate) or arbitrarily cropped (possibly degrading the content).

But it doesn’t have to be this way! In this post, we present a technique that we use for intelligent cropping: a fully automatic method that preserves the image’s content. We’ve included some example code so you can explore on your own, and some real-world examples from Curalate’s products.

The following illustrates our approach:

  • The input to the algorithm is an image and a desired aspect ratio.
  • First, we use a variety of techniques to detect different types of content in the image. Each technique results in a number of content rectangles that are assigned a value score.
  • Second, we select the optimal region of the image as that which contains the content rectangles with the highest cumulative score.
  • Finally, we crop the input image to the optimal region.

The result is a cropped image of the desired aspect ratio fully containing the content in the image.

Prerequisites

To run these examples for yourself, you’ll need Python 2 with OpenCV, NumPy, and matplotlib installed. The images used for examples in this post may be downloaded here. This entire post is also available as a python notebook if you want to take it for a spin.

To start off, let’s load an image we’d like to use:

import cv2
import urllib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def showImage(img):
    plt.axis('off')
    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))

img = cv2.imread("input.jpg")
showImage(img)


Let’s assume we’re creating a Facebook ad to drive traffic to our website. The recommended resolution is 1200x628, for a target aspect ratio of 1.91.

The naive approach would just crop the center of the image:

desiredAspectRatio = 1200 / float(628)
newHeight = img.shape[1] / desiredAspectRatio
start = img.shape[0] / 2 - newHeight / 2
naiveCrop = img[start:start + newHeight, :]
showImage(naiveCrop)


Ugh. I wouldn’t click on that. Let’s do something intelligent!

Identifying Content in Images

Our first task is to detect different content in the image. Object detection is still an active area of research, though recent advances have started to make it feasible in many applications. Here we explore a few simple techniques that are built into OpenCV but you can use any detector you like.

Face Detection

If an image contains a face, it’s likely that the person is a key element in the image. Fortunately, face detection is a common task in computer vision:

gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
faceRegions = cascade.detectMultiScale(gray, minNeighbors=7, scaleFactor=1.1)

The result is a numpy array of rectangles containing the faces:

def drawRegions(source, res, regions, color=(0, 0, 255), size=4):
    for (x, y, w, h) in regions:
        res[y:y+h, x:x+w] = source[y:y+h, x:x+w]
        cv2.rectangle(res, (x, y), (x+w, y+h), color, size)
    return res

faded = (img * 0.65).astype(np.uint8)
showImage(drawRegions(img, faded.copy(), faceRegions))


Interest Points

Sometimes, we don’t know what we’re looking for in an image. Low-level image characteristics, however, often correspond to the interesting areas of images. There are many common techniques for identifying interesting areas of an image, even ones that estimate visual saliency. Shi-Tomasi’s Good Features To Track is one technique commonly used to indicate interest points in an image. Detecting these interest points is also relatively simple using OpenCV:

interestPoints = cv2.goodFeaturesToTrack(gray, maxCorners=200, qualityLevel=0.01, minDistance=20).reshape(-1, 2)
interestPointRegions = np.concatenate((interestPoints, np.ones(interestPoints.shape)), axis=1).astype(np.int32)
showImage(drawRegions(img, faded.copy(), interestPointRegions, (255, 255, 255), size=10))


Product Detection

Other times, we know a specific product is in an image and we want to make sure we don’t crop it out. We can achieve this by localizing an image of the product in our image of interest.

In our example, the product is:

productImage = cv2.imread("product.jpg")
showImage(productImage)


We can locate the product in the image using instance retrieval techniques. First, we’ll estimate the transformation between the product and the target image:

flann = cv2.FlannBasedMatcher({'algorithm': 0, 'trees': 8}, {'checks': 100})
detector = cv2.SIFT()

kpts1, descs1 = detector.detectAndCompute(productImage, None)
kpts2, descs2 = detector.detectAndCompute(img, None)

matches = [m for (m, n) in flann.knnMatch(descs1, descs2, k=2) if m.distance < 0.8 * n.distance]

sourcePoints = np.float32([kpts1[m.queryIdx].pt for m in matches]).reshape(-1, 2)
destPoints = np.float32([kpts2[m.trainIdx].pt for m in matches]).reshape(-1, 2)

M, mask = cv2.findHomography(sourcePoints, destPoints, cv2.RANSAC, 11.0)

The result is a set of correspondence points between the images:

def drawMatches(img1, kpts1, img2, kpts2, matches):
    # combine both images
    out = np.zeros((max([img1.shape[0], img2.shape[0]]), img1.shape[1] + img2.shape[1], 3), dtype='uint8')
    out[:img1.shape[0], :img1.shape[1]] = img1
    out[:img2.shape[0], img1.shape[1]:] = img2

    # draw the lines
    for match in matches:
        (x1, y1) = kpts1[match.queryIdx].pt
        (x2, y2) = kpts2[match.trainIdx].pt
        cv2.line(out, (int(x1), int(y1)), (int(x2) + img1.shape[1], int(y2)), (0, 0, 255), 4)

    return out

showImage(drawMatches(productImage, kpts1, img, kpts2, np.array(matches)[[np.where(mask.ravel() == 1)[0]]]))


We simply take the bounding box around the product’s location:

pp = destPoints[mask.ravel() == 1]
xmin = pp[:, 0].min()
ymin = pp[:, 1].min()
productRegions = np.array([xmin, ymin, pp[:, 0].max() - xmin, pp[:, 1].max() - ymin]).astype(np.int32).reshape(1, 4)
showImage(drawRegions(img, faded.copy(), productRegions, (0, 255, 0)))


Content Regions

In summary, we have detected faces, interest points, and products in the image. Together, these form the full set of content regions:

contentRectangles = np.concatenate((faceRegions, productRegions, interestPointRegions), axis=0)

vis = faded.copy()
drawRegions(img, vis, interestPointRegions, (255, 255, 255), size=10)
drawRegions(img, vis, faceRegions)
drawRegions(img, vis, productRegions, (0, 255, 0))
showImage(vis)


Optimal Cropping

Now that we have detected the content regions in the image, we’d like to identify the best way to crop the image to a desired aspect ratio of 1.91. The strategy is simple: find the area of the image with the desired aspect ratio containing the highest sum of the content rectangle scores.

First, let’s assign a score to each content rectangle. For this example, we’ll just use the area of each rectangle.

contentScores = np.multiply(contentRectangles[:, 2], contentRectangles[:, 3])

Reducing to One Dimension

Now for the fun part: Depending on the input image and desired aspect ratio, the resulting crop will either have the same height as the input image and a reduced width, or the same width as the input image and a reduced height. The principal axis is the dimension of the input image that needs to be cropped. Let:

alpha = img.shape[1] / float(img.shape[0])

be the aspect ratio of the input image. If alpha > desiredAspectRatio, then the horizontal axis is the principal axis and the system crops the width of the image. Similarly, if alpha < desiredAspectRatio, then the vertical axis is the principal axis and the system crops the height of the image.

Projecting the content rectangles onto the principal axis simplifies our goal: the optimal crop is simply the window along the principal axis containing the highest sum of content region scores. The length of this window is the size of the final crop along the principal axis.

if (alpha > desiredAspectRatio):
    # the horizontal axis is the principal axis.
    finalWindowLength = int(desiredAspectRatio * img.shape[0])
    projection = np.array([[1, 0, 0, 0],
                           [0, 0, 1, 0]])
else:
    # the vertical axis is the principal axis.
    finalWindowLength = int(img.shape[1] / desiredAspectRatio)
    projection = np.array([[0, 1, 0, 0],
                           [0, 0, 0, 1]])

contentRegions = np.dot(projection, contentRectangles.T).T

Thus, the content rectangles are reduced from two dimensional rectangles to one dimensional regions.

Selecting the Optimal Crop

The optimal crop is the window of length finalWindowLength whose contentRegions’ scores sum to the largest possible value. We can use a sliding window approach to quickly and efficiently find such a crop.

First, we’ll define the inflection points for the sliding window approach. Each inflection point is a location on the number line where the value of the current window can change. There are two inflection points for each content region: one that removes the content region’s score when the window passes the region’s starting location, and one that adds a content region’s score when the window encapsulates it.

inflectionPoints = np.concatenate((contentRegions[:, 0],
                                   contentRegions[:, 0] + contentRegions[:, 1] - finalWindowLength))
inflectionDeltas = np.concatenate((-contentScores, contentScores))
inflections = np.concatenate((inflectionPoints.reshape(-1, 1), inflectionDeltas.reshape(-1, 1)), axis=1)

Next, we’ll sort the inflection points by their locations on the number line, and ignore any outside the valid range:

inflections = inflections[inflections[:, 0].argsort()]  # Sort by location
inflections = inflections[inflections[:, 0] >= 0]       # drop any outside our range

To implement our sliding window algorithm, we need only accumulate the sum of the inflection points’ values at each location, and then take the maximum:

inflections[:, 1] = np.cumsum(inflections[:, 1])
optimalInflectionPoint = max(enumerate(inflections), key=lambda pair: pair[1][1])[0]

The optimalInflectionPoint contains the starting location with the most value. In fact, the range of pixels between that inflection point and the next one all have the same value, so we'll take the middle of that range for our starting point:

optimalStartingLocation = (inflections[optimalInflectionPoint, 0] + inflections[optimalInflectionPoint + 1, 0]) / 2

Now that we know where the optimal crop begins on the principal axis, we can un-project it to get the final crop:

if alpha > desiredAspectRatio:
    optimalCrop = [optimalStartingLocation, 0, finalWindowLength, img.shape[0]]
else:
    optimalCrop = [0, optimalStartingLocation, img.shape[1], finalWindowLength]

Awesome! Now we know where to crop the image! You can see below that the optimal crop indeed includes the product, the face, and a large number of the interest points:

result = img[optimalCrop[1]:optimalCrop[3] + optimalCrop[1], optimalCrop[0]:optimalCrop[2] + optimalCrop[0]]
showImage(result)

(Figure: the final cropped image)

Now that’s a good pic!

Disclaimer: The code above is meant as a demonstration. Optimization, handling of edge cases, and parameter tuning are left as an exercise for the reader 😉.

Result Gallery

Below are some example results. The desired aspect ratio is listed below the input image.

Uses in Curalate Products

One great place we use intelligent cropping is when displaying our clients’ images. Below is a screenshot showing some product images before intelligent cropping, and then after. Notice how the models’ faces, the shoe, and the bag were all cropped using the naive method. After intelligent cropping, our thumbnails are much more useful representations of the original images.

Before Intelligent Cropping | After Intelligent Cropping

From Thrift To Finatra


There are a million and one ways to do (micro-)services, each with a million and one pitfalls. At Curalate, we've been on a long journey of splitting our monolith into composable and simple services. It's never easy, as there are a lot of advantages to having a monolith. Things like refactoring, code reuse, deployment, versioning, and rollbacks are all atomic in a monolith. But there are a lot of disadvantages as well: monoliths encourage poor factoring, bugs in one part of the codebase force rollbacks or changes of the entire application, reasoning about the application as a whole becomes difficult, build times are slow, transient build errors increase, etc.

To that end, our first foray into services was built on top of the Twitter Finagle stack. If you go to the Finagle page and can't figure out what exactly it does, I don't blame you: the documentation is lackluster, and the library itself is quite low-level. Finagle defines a service as a function that transforms a request into a response, and composes services with filters that manipulate the requests and responses themselves. It's a clean abstraction, given that this is basically what all web service frameworks do.

Thrift

Finagle by itself isn’t super opinionated. It gives you building blocks to build services (service discovery, circuit breaking, monitoring/metrics, varying protocols, etc) but doesn’t give you much else. Our first set of services built on finagle used Thrift over HTTP. Thrift, similiar to protobuf, is an intermediate declarative language that creates RPC style services. For example:

namespace java tutorial
namespace py tutorial

typedef i32 int // We can use typedef to get pretty names for the types we are using
service MultiplicationService
{
        int multiply(1:int n1, 2:int n2),
}

This will create an RPC service called MultiplicationService that takes two parameters. Our implementation at Curalate hosted Thrift over HTTP (serializing Thrift as JSON), since all of our services are web-based behind ELBs in AWS.
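To give a feel for what consuming such a definition looks like in Scala, here is a rough sketch of the kind of interface a Thrift code generator (Scrooge, in the Twitter ecosystem) produces; the exact generated names and traits vary by generator and version, so treat this as illustrative rather than exact:

import com.twitter.util.Future

// A rough approximation of the generated Scala service interface.
trait MultiplicationService {
  def multiply(n1: Int, n2: Int): Future[Int]
}

// Callers work purely in terms of that interface, wherever the client came from.
def square(svc: MultiplicationService, n: Int): Future[Int] =
  svc.multiply(n, n)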

We have a lot of services at Curalate that use Thrift, but we’ve found a few shortcomings:

Model Reuse

Thrift forces you to use primitives when defining service contracts, which makes it difficult to share lightweight models (with potentially useful utilities) with consumers. We've ended up doing a lot of mapping between generated Thrift types and shared model types. Curalate's backend services are all written in Scala, so we don't have the same issues that a company like Facebook (which created Thrift) may have with varying languages needing easy access to RPC.

Requiring a client

Many times you want to be able to interact with a service without needing access to a client. Needing a client has meant developers get used to cloning service repositories, building the entire service, and then entering a Scala REPL in order to interact with it. As our service surface area expands, it's not always feasible to expect one developer to build another developer's service (conflicting Java versions, missing SBT/Maven dependencies or settings, etc.). The client requirement has led to services taking heavyweight dependencies on other services and leaking dependencies. While Thrift doesn't force you to do this, it has been a side effect of it taking extra love and care to distribute a Thrift client properly, either by shipping Thrift files in a jar or otherwise.

Over the wire inspection

With Thrift-over-HTTP, inspecting requests is difficult, because the payloads use Thrift serialization, which, unlike plain JSON, isn't human-readable.

Because Thrift over HTTP is all POSTs to /, tracing access and investigating ELB logs becomes a jumbled mess of trying to correlate times and IPs to other parts of our logging infrastructure. The POST issue is also frustrating because it makes any semantic caching impossible, such as inserting caches at the serving layer for retrieval calls. In a pure HTTP world, we could insert a cache for heavily used GETs, given that a GET is idempotent.

RPC API design

Regardless of Thrift, RPC encourages poorly unified APIs with lots of specific endpoints that don't always fit together. We have many services with method topologies that compose poorly. A well-designed API, and cluster of APIs, should gently guide you to the data you need. In an ideal world, if you get an ID in a payload response for a data object, there should be an endpoint to get more information about that ID. In the RPC world, however, we end up with a batch call here and a specific RPC call there, sometimes requiring several stitched-together calls to get data that should have been a simple domain-level call.

Internal vs External service writing

We have a lot of public REST APIs, and they are written using the Lift framework (some of our oldest code). Developers moving between internal and external services have to shift paradigms, switching between writing REST with JSON and RPC with Thrift.

Overall, Thrift is a great piece of technology, but after using it for a year we found that it's not necessarily for us. All of these things prompted a shift to writing REST-style services.

Finatra

Finatra is an HTTP API framework built on top of Finagle. Because it’s still Finagle, we haven’t lost any of our operational knowledge of the underlying framework, but instead we can now write lightweight HTTP API’s with JSON.

With Finatra, all our new services have Swagger enabled automatically, so API exploration is simple. And since it's just plain JSON, we can use Postman to debug and inspect APIs (as well as view requests in Charles or other proxies).
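For a flavor of what a Finatra service looks like, here is a minimal sketch of a controller and server; the route and payload are made up for illustration:

import com.twitter.finagle.http.Request
import com.twitter.finatra.http.{Controller, HttpServer}
import com.twitter.finatra.http.routing.HttpRouter

// A hypothetical controller exposing a plain JSON GET endpoint.
class PingController extends Controller {
  get("/ping") { request: Request =>
    Map("status" -> "ok") // serialized to JSON by Finatra's Jackson integration
  }
}

// The server registers controllers (and any filters) with the router.
class PingServer extends HttpServer {
  override def configureHttp(router: HttpRouter): Unit = {
    router.add[PingController]
  }
}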

With REST we can still distribute lightweight clients, or, more importantly, if there are dependency conflicts a service consumer can very quickly roll their own HTTP client for a service. Our ELB logs now make sense, our new APIs are unified in their verbs (GET vs POST vs PUT vs DELETE), and if we want to write RPC for a particular service we still can.

There are a few other things we like about Finatra. For developers coming from a background of writing HTTP services, Finatra feels familiar: controllers, filters, a unified test bed for spinning up build verification tests (local in-memory servers), dependency injection (via Guice) baked in, sane serialization using Jackson, and so on. It's hard to do the wrong thing, given that it builds strong production-level opinions on top of Finagle. And thankfully those opinions are ones we share at Curalate!

We’re not in bad company – Twitter, Duolingo, and others are using Finatra in production.

Tracing High Volume Services


We like to think that building a service ecosystem is like stacking building blocks. You start with a function in your code. That function is hosted in a class. That class in a service. That service is hosted in a cluster. That cluster in a region. That region in a data center, etc. At each level there’s a myriad of challenges.

From the start, developers tend to use things like logging and metrics to debug their systems, but a certain class of problems crops up when you need to debug across services. From a debugging perspective, you'd like a higher-level projection of the system: a linearized view of what requests are doing. That is, you want to be able to see that service A called service B and service C called service D, at the granularity of single requests.

Cross Service Logging

The simplest solution to this is to require that every call from service to service comes with some sort of trace identifier. Every request entering the system, whether from public APIs, client-side requests, or even async daemon-invoked timers and schedules, generates a trace. This trace then gets propagated through the entire system. If you use this trace in all your log statements, you can correlate cross-service calls.

How is this accomplished at Curalate? For the most part we use Finagle-based services, and the Twitter ecosystem has done a good job of providing the concept of a thread-local TraceId and automatically propagating it to all other twitter-* components (yet another reason we like Finatra!).

All of our service clients automatically pull this thread-local trace id out and populate a known HTTP header field, which receiving services then pick up and re-assume. For Finagle-based clients this is auto-magick'd for you. For other clients that we use, like OkHttp, we had to add custom interceptors that pull the trace from the thread local and set it on the request.
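A minimal sketch of such an interceptor, assuming Finagle's Trace.idOption as the source of the current trace (the header name follows the Zipkin B3 convention shown below):

import com.twitter.finagle.tracing.Trace
import okhttp3.{Interceptor, Response}

// Copies the current Finagle trace id (if any) onto the outgoing OkHttp request.
class TracePropagatingInterceptor extends Interceptor {
  override def intercept(chain: Interceptor.Chain): Response = {
    val builder = chain.request().newBuilder()
    Trace.idOption.foreach { traceId =>
      builder.header("X-B3-TraceId", traceId.traceId.toString)
    }
    chain.proceed(builder.build())
  }
}

The interceptor is then registered on the OkHttpClient builder so every outgoing request carries the header.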

Here is an example of the header being sent automatically as part of the Zipkin-based headers (which we re-use as our internal trace identifiers):

Notice the X-B3-TraceId header. When a service receives this request, it re-assumes the trace id and sets its SLF4J MDC field traceId to that value. We can then reference the trace id in our logback.xml configuration, like in the STDOUT appender below:

<appendername="STDOUT-COLOR"class="ch.qos.logback.core.ConsoleAppender"><filterclass="ch.qos.logback.classic.filter.ThresholdFilter"><level>TRACE</level></filter><encoder><pattern>%yellow(%d) [%magenta(%X{traceId})] [%thread] %highlight(%-5level) %cyan(%logger{36}) %marker - %msg%n</pattern></encoder></appender>

And we can also send the trace id as a structured JSON field to Loggly.

Let’s look at an example from our own logs:

What we’re seeing here is a system called media-api made a query to a system called networkinformationsvc. The underlying request carried a correlating trace id across the service boundaries and both systems logged to Loggly with the json.tid (transaction id) field populated. Now we can query our logs and get a linear time based view of what’s happening.

Thread local tracing

The trick here is to make sure that this implicit trace id, pinned to the thread local of the initiating request, properly moves from thread to thread as you make async calls. We don't want anyone to ever have to remember to set the trace. It should just gracefully flow from thread to thread implicitly.

To make sure that traces hop properly between threads, we had to enforce that everybody uses an ExecutionContext that safely captures the caller's thread locals before executing. This is critical; otherwise you can make an async call and the trace id gets dropped. In that case, bye bye go the logs! It's hyper important to always take an execution context, and to never pin one, when it comes to async Scala code. Thankfully, we can make any execution context safe by wrapping it in a delegate:

/**
 * Wrapper around an existing ExecutionContext that makes it propagate MDC information.
 */
class PropagatingExecutionContextWrapper(wrapped: ExecutionContext)
  extends ExecutionContext { self =>

  override def prepare(): ExecutionContext = new ExecutionContext {
    // Save the call-site state
    private val context = Local.save()

    def execute(r: Runnable): Unit = self.execute(new Runnable {
      def run(): Unit = {
        // re-assume the captured call site thread locals
        Local.let(context) {
          r.run()
        }
      }
    })

    def reportFailure(t: Throwable): Unit = self.reportFailure(t)
  }

  override def execute(r: Runnable): Unit = wrapped.execute(r)

  override def reportFailure(t: Throwable): Unit = wrapped.reportFailure(t)
}

class TwitterExecutionContextProvider extends ExecutionContextProvider {
  /**
   * Safely wrap any execution context into one that properly passes context
   *
   * @param executionContext
   * @return
   */
  override def of(executionContext: ExecutionContext) =
    new PropagatingExecutionContextWrapper(executionContext)
}

We've taken this trace-wrapping concept and applied it to all kinds of executors, like ExecutorService and ScheduledExecutorService. Technically we don't want to expose the internals of how we wrap traces, so we load an ExecutionContextProvider via the Java service-loading mechanism and provide an API contract so that people can wrap executors without caring how they are wrapped:

/**
 * A provider that loads from the java service mechanism
 */
object ExecutionContextProvider {
  lazy val provider: ExecutionContextProvider = {
    Option(ServiceLoader.load(classOf[ExecutionContextProvider]))
      .map(_.asScala)
      .getOrElse(Nil)
      .headOption
      .getOrElse(throw new MissingExecutionContextException)
  }
}

/**
 * Marker interface to provide contexts with custom logic. This
 * forces users to make sure to use the execution context providers that support request tracing
 * and maybe other tooling
 */
trait ProvidedExecutionContext extends ExecutionContext

/**
 * A context provider contract
 */
trait ExecutionContextProvider {
  def of(context: ExecutionContext): ProvidedExecutionContext
  ...
}

From a caller's perspective, they now do:

implicit val execContext = ExecutionContextProvider.provider.of(scala.concurrent.ExecutionContext.Implicits.global)

This wraps, in this example, the default Scala global context.
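As a quick sketch of why this matters (assuming the incoming request has already populated the SLF4J MDC), any Future scheduled on the wrapped context re-assumes the caller's thread locals, so the trace id is still visible inside the async work:

import org.slf4j.MDC
import scala.concurrent.Future

// Runs on the wrapped execContext above; the MDC traceId set by the
// incoming request is re-assumed on the worker thread running this block.
Future {
  println(s"async work for trace ${MDC.get("traceId")}")
}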

Service to Service dependency and performance tracing

Well, that's great! We have a way to safely and easily pass trace ids, and we've tooled all of our clients to pass the trace id automatically, but this only gives us logging information. We'd really like to leverage the trace information to get more interesting statistics, such as service-to-service dependencies and performance across service hops. Correlated logs are just the beginning of what we can do.

Zipkin is an open source tool that we've discussed here before, so we won't go too much into it, but needless to say, Zipkin hinges on us having proper trace identifiers. It samples incoming requests to determine if things should be traced or not (i.e., sent to Zipkin). By default, we have all our services send 0.1% of their requests to Zipkin to minimize impact on the service.

Let’s look at an example:

In this Zipkin trace we can see that this batch call made a call to Dynamo. The whole call took 6 milliseconds, and 4 of those milliseconds were spent calling Dynamo. We've tooled all of our external client dependencies with Zipkin trace information automatically, using Java dynamic proxies, so that as we upgrade our external dependencies we get tracing on new functions as well.

If we dig further into the trace:

We can now see (highlighted) the trace ID and search our logs for entries related to this trace.

Finding needles in the haystack

We have a way to correlate logs, and get sampled performance and dependency information between services via Zipkin. What we still can’t do yet is trace an individual piece of data flowing through high volume queues and streams.

Some of our services at Curalate process 5 to 10 thousand items a second. It's just not fiscally prudent to log all that information to Loggly or emit unique metrics to our metrics system (Datadog). Still, we want to know, at the event level, where things are in the system, where they passed through, where they got dropped, etc. We want to answer the question:

Where is identifier XYZ.123 in the system and where did it go and come from?

This is difficult to answer with the current tools we’ve discussed.

To solve this problem we have one more system in play: our high-volume auditing system, which lets us write and filter audit events at large scale (100k req/s+). The basic architecture is that services write audit events via an Audit API, and those events are funneled to Kinesis Firehose. The Firehose stream buffers data for either 5 minutes or 128 MB, whichever comes first. When the buffer limit is reached, Firehose dumps newline-separated JSON in a flat file into S3. We have a Lambda function that waits for S3 create events on the bucket, reads the JSON, then transforms the JSON events into Parquet, an efficient columnar storage format. The Parquet file is written back into S3 into a new folder with the naming scheme of

year=YYYY/month=MM/day=DD/hour=HH/minute=mm/<uuid>.parquet

where the minutes are grouped in 5-minute intervals. This partition is then added to Athena, a managed map-reduce around PrestoDB that lets you query large datasets in S3.
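To make the producing side concrete, here is a hedged sketch of the shape of an audit event and the Audit API; AuditEvent and AuditClient are hypothetical names standing in for our internal interfaces:

// Hypothetical shape of an audit event; the real API differs in detail.
case class AuditEvent(
  app: String,                 // emitting service
  message: String,             // what happened
  traceId: String,             // correlates back to logs and Zipkin
  context: Map[String, String] // free-form identifiers, e.g. network, network_id
)

trait AuditClient {
  // Fire-and-forget publish; events are buffered and shipped to Kinesis Firehose.
  def emit(event: AuditEvent): Unit
}

// Example usage inside a data mining service:
def recordMined(audit: AuditClient, currentTraceId: String, networkId: String): Unit =
  audit.emit(AuditEvent(
    app = "datamining-svc",
    message = "mined media item",
    traceId = currentTraceId,
    context = Map("network" -> "instagram", "network_id" -> networkId)
  ))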

What does this have to do with trace ids? Each event emitted comes with a trace id that we can use to tie back to logs, Zipkin, or other correlating identifiers. This means that even if services aren't logging to Loggly due to volume restrictions, we can still see how events trace through the system. Let's look at an example where we find a specific network identifier from Instagram and see when it was data mined and when we added semantic image tags to it (via our vision APIs):

SELECT minute, app, message, timestamp, context
FROM curalateauditevents."audit_events"
WHERE context['network_id'] = '1584258444344170009_249075471'
  AND context['network'] = 'instagram'
  AND day = 18
  AND hour = 22
ORDER BY timestamp DESC
LIMIT 100

This is the Athena query. We’ve included the specific network ID and network we are looking for, as well as a limited partition scope.

Notice the two highlights.

Starting at the second highlight, there is a message indicating that we augmented the piece of data. In our particular pipeline we only augment data under specific circumstances (not every image is analyzed), so it was important to see that some images were dropped and this one was augmented. Now we can definitively say "yes, item ABC was augmented but item DEF was not, and here is why". Awesome.

Moving upwards, the first highlight is how much data was scanned. The particular partition we looked through has 100MB of data, but we only searched through 2MB to find what we wanted (thanks to the Parquet optimization). Athena is priced by how much data you scan, at a cost of $5 per terabyte, so this query was pretty much free at a cost of $0.000004. The total set of files across all the partitions for the past week is roughly 21GB, spanning about 3.5B records. So even if we queried all the data, we'd only pay $.04. In fact, the biggest cost here isn't in storage, queries, or Lambda; it's in Firehose! Firehose charges $0.029 per GB transferred. At this rate we pay 60 cents a week. The boss is going to be OK with that.

However, there are still some issues here. Remember the target scale is upwards of 100k req/s. At that scale we're dealing with a LOT of data through Kinesis Firehose. That's a lot of data into S3, a lot of IO reads to transform to Parquet, and a lot of opportunities to accidentally scan through tons of data in our Athena partitions with poorly written queries that loop over repeated data (even though we limit partitions to a 2-week TTL). We also now have issues of rate limiting with Kinesis Firehose.

On top of that, some services pump so much repeated data that it's not worth seeing all of it all the time. To that end, we need some way to do live filters on the streams. What we've done to solve this problem is leverage dynamically invoked Nashorn JavaScript filters. We load filters from a known remote location every 30 seconds, and if a service is marked for filtering (i.e., it has a really high load and needs to be filtered), then it runs all of its audit events through the filter before they actually get sent to the downstream Firehose. If an event fails the filter, it's discarded. If it passes, the event is annotated with the name of the filter it passed and sent through the stream.

Filters are just YML files for us:

name:"Filtername"expiration:<Optional DateTime. Epoch or string datetime of ISO formats parseable by JODA>js:|function filter(event) {// javascript that returns a boolean}

And an example filter may look like

name:"anton_client_filter"js:|function filter(event) {var client = event.context.get("client_id")return client != null && client == "3136"}

With this filter, only events marked with my client's id will pass through. Some systems don't need to be filtered, so all their events pass through anyway.
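For a rough idea of the mechanics, here is a minimal sketch of evaluating such a filter with the JDK's built-in Nashorn engine via javax.script; the event type and error handling are simplified, and the class name is ours for illustration:

import javax.script.{Invocable, ScriptEngineManager}

class FilterRunner(filterJs: String) {
  // One engine (and scope) per filter, created once and reused for every event.
  private val engine = new ScriptEngineManager().getEngineByName("nashorn")
  engine.eval(filterJs) // defines the filter(event) function in this scope

  // Invoke filter(event); anything other than true means the event is discarded.
  def passes(event: AnyRef): Boolean =
    engine.asInstanceOf[Invocable].invokeFunction("filter", event) == java.lang.Boolean.TRUE
}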

Now we can write queries like

SELECT minute, app, message, timestamp, context
FROM curalateauditevents."audit_events"
WHERE contains(trace_names, 'anton_client_filter')
  AND day = 18
  AND hour = 22
LIMIT 100

to get events that were tagged with my filter in the current partition. From there, we can do other exploratory queries to find related data (either by trace id or by other identifiers related to the data we care about).

Let's look at some graphs that show how dramatic this filtering can be.

Here the purple line is one of our data mining ingestion endpoints. It's pumping a lot of data to Firehose, most of which is repeated over time and so isn't super useful to see in full. The moment the graph drops is when the YML file with a filter for that service was uploaded. The blue line is a downstream service that gets data after debouncing and other processing. Given that its load is a lot lower, we don't care so much that it sends all of its data downstream. You can see the purple line slow to a trickle later on, when the filter kicks in and data starts matching it.

Caveats with Nashorn

While building this system out, we ran into a few interesting caveats when using Nashorn in a high-volume pipeline like this.

The first was that subtle differences in javascript can have massive performance impacts. Let’s look at some examples and benchmark them.

function filter(event) {
  var anton = {
    "136742": true,
    "153353": true
  }
  var mineable = event.context.get("mineable_id")
  return mineable != null && anton[mineable]
}

The JMH benchmarks of running this code is

[info] FiltersBenchmark.testInvoke  thrpt   20     1027.409 ±      29.922  ops/s
[info] FiltersBenchmark.testInvoke   avgt   20  1484234.075 ± 1783689.007  ns/op

What?? Only about 1,000 ops/second.

Let's make some adjustments to the filter, given that our internal system loads the JavaScript into an isolated scope per filter and then re-invokes just the filter function each time (letting us safely create global objects and pay heavy prices for things once):

varanton={"136742":true,"153353":true}functionfilter(event){varmineable=event.context.get("mineable_id")returnmineable!=null&&anton[mineable]}
[info] FiltersBenchmark.testInvoke  thrpt   20  7391161.402 ± 206020.703  ops/s
[info] FiltersBenchmark.testInvoke   avgt   20    14879.890 ±   8087.179  ns/op

Ah, much better! Roughly 7.4 million ops/sec.

If we use java constructs:

function filter(event) {
  var anton = new java.util.HashSet();
  anton.add("136742")
  anton.add("153353")
  var mineable = event.context.get("mineable_id")
  return mineable != null && anton.contains(mineable)
}
[info] FiltersBenchmark.testInvoke  thrpt   20  5662799.317 ± 301113.837  ops/s
[info] FiltersBenchmark.testInvoke   avgt   20    41963.710 ±  11349.277  ns/op

Still dramatically faster than the anonymous-object version, at roughly 5.7 million ops/sec, even though the set is rebuilt on every call.

Something is clearly up with the anonymous object creation in Nashorn. Needless to say, benchmarking is important, especially when these filters are going to be dynamically injected into every single service we have. We need them to be performant, sandboxed, and safe to fail.

For that, we make sure everything runs in its own engine scope, in a separate execution context isolated from the main running code, and is fired off asynchronously so it doesn't block the main calling thread. This is also where we have monitoring and alerting for when someone uploads a non-performant filter, so we can investigate and mitigate quickly.

For example, the discovery of the poorly performing json object came from this alert:

Conclusion

Tracing is hard, and it's incredibly difficult to tool in after the fact if you start building service architectures without it in mind from the get-go. Threading trace identifiers through the system from the beginning sets you up for success in building more interesting debugging infrastructure that isn't otherwise possible. When building larger service ecosystems, it's important to keep in mind how to inspect things at varying levels of granularity. Sometimes building custom tools to help inspect the systems is worth the effort, especially if they help debug complicated escalations or data inconsistencies.

Load Testing for Expected Increases in Traffic with Vegeta


At Curalate, our service and API traffic is fairly tightly coupled to e-commerce traffic, so any increase is reasonably predictable. We expect an increase in request rate towards the beginning of November each year, with traffic peaking at 10x our steady rate on Black Friday and Cyber Monday.

Why Load Test?

Curalate works directly with retail brands to drive traffic to their sites. The holiday shopping period is the most important time of the year for most of them, and we need to ensure that our experiences continue to operate at a high standard throughout.

More generally, though, load testing is critical for services and APIs, especially in cases where load is expected to increase. It uncovers potential points of failure, during business hours, and hopefully prevents people from needing to wake up at 2 a.m. on a weekend.

Creating a Test Plan

In cases of expected load increases, it’s important to understand as much as possible before diving into it. There are a few questions to ask:

  • Is there any data available so I can understand the expected load? Is it a yearly increase - are previous years a good indication? If it’s a brand new launch, what are the expectations?
  • What are the hard and soft dependencies of the service or API that I’m testing? What sort of caching is in place? Does a 10x increase on my service cause a 10x increase on everything downstream, as well?
  • Should we test against the active production environment, or is it feasible to spin up a staging environment with the same scaling behavior?
  • Depending on the breadth of dependencies, it may not be possible to spin up a new duplicated environment.
  • If I test against production, how can I ensure I don’t negatively affect live traffic?
  • Am I expecting an increase in load across services? If there are any core dependencies, what does the combined load look like at peak?
  • How much of a buffer do I provide against the expected peak?
  • Does my service have any rate limiting that I need to bypass or keep in mind? How do I simulate live traffic without being throttled?

Getting Right to It

In our case, there were four main services that we were interested in testing against expected load, separated into on-site (APIs and services that are called directly from our clients' sites) and off-site (our custom-built and Curalate-hosted services). This distinction works well for us because we expected a 10x increase for on-site experiences but only a 2-3x increase for off-site ones - brands focus on driving traffic to their own e-commerce sites.

Now, there are many tools out there for load testing. For our purposes, I used Vegeta, for its robust set of options and extensibility. It was easy to script around to allow a steadily increasing request rate to either a single target or lazily generated targets. The output functionality is also well thought out. It supports top line latency stats along with some basic charting capabilities.

Let’s assume we had a service that we wanted to test up to 1000 RPS, both against a single target, and against multiple targets - to work around any caching in place.

1000 RPS Single and Multi-Target

The setup was fairly simple:

Spin up a couple of AWS EC2 m3.2xlarge instances.

SSH to the instances and create a load_testing folder, and fetch the Vegeta binary.

wget "https://github.com/tsenart/vegeta/releases/download/v6.3.0/vegeta-v6.3.0-linux-386.tar.gz"

Put together a simple, quick script to handle steadily increasing the request rate, and then hold steady at the max rate.

#!/bin/bash
target=$1
maxRate=$2
rateInc=$3
incDuration=$4
startAt=$5
currentRate=$startAt
hitType=$6

while [ $currentRate -le $maxRate ]
do
  if [ $currentRate -eq $maxRate ]
  then
    echo $target | ./vegeta attack -rate=$currentRate > reel-$maxRate-$currentRate-$hitType-test.bin
  else
    echo $target | ./vegeta attack -rate=$currentRate -duration=$incDuration > reel-$maxRate-$currentRate-$hitType-test.bin
  fi
  currentRate=$((currentRate+rateInc))
done

Basically, if it hasn’t yet hit the max rate, run vegeta at the current rate for the specified duration, then increase the rate by the increment, and loop again. If the max rate is hit, don’t specify a duration - run until manually killed. The multi-targets script is similar, but reads from a targets.txt file.

#!/bin/bash
maxRate=$1
rateInc=$2
incDuration=$3
startAt=$4
currentRate=$startAt
hitType=$5

while [ $currentRate -le $maxRate ]
do
  if [ $currentRate -eq $maxRate ]
  then
    ./vegeta attack -rate=$currentRate -targets=targets.txt > reel-$maxRate-$currentRate-$hitType-test.bin
  else
    ./vegeta attack -rate=$currentRate -duration=$incDuration -targets=targets.txt > reel-$maxRate-$currentRate-$hitType-test.bin
  fi
  currentRate=$((currentRate+rateInc))
done

Aside: I was unable to get the -lazy flag to work properly with Vegeta, so I went with brute force and just generated a ton of targets to a file. I’m convinced it could have been more elegant, but sometimes the easy solution works just as well.

With the setup complete, it's as simple as setting up whatever monitoring you want on a display or two and firing off the scripts.

sh ./rate_increasing_multi.sh 1000 50 120s 50 uncached

Which says to increase up to 1000 RPS, 50 at a time, for 2 minutes at each rate, starting at 50 RPS.

For each results file generated, ./vegeta report -inputs "out.txt" will output something like (this example is for 250 RPS)

Requests      [total, rate]            66177, 249.98
Duration      [total, attack, wait]    4m24.783548697s, 4m24.731999487s, 51.54921ms
Latencies     [mean, 50, 95, 99, max]  64.885905ms, 57.516245ms, 107.88721ms, 730.867162ms, 2.309337436s
Bytes In      [total, mean]            943011144, 14249.83
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:66177
Error Set:

Load Testing and Results

As different tests are kicked off and rates increase, it's necessary to keep an eye on monitoring dashboards and any alerts that fire, and to be ready to bail out of the test early. From there, logging should help in diagnosing what failed, and tickets can be filed each step of the way. After those issues are resolved, you can pick testing back up until you hit your goal, and maintain it for long enough to be comfortable with the results.

It should go without saying, but when testing against a live, production environment, it’s always nice to give the current on-call engineers a heads up, and keep them in the loop the entire way through.

As for Curalate’s load testing, on Cyber Monday we experienced record-breaking traffic numbers - even exceeding our 10x estimates slightly - to our services, and the on-call engineers slept soundly through Thanksgiving weekend.

R&D At Curalate: A Case Study of Deep Metric Embedding


At Curalate, we make social sell for hundreds of the world’s largest brands and retailers. Our Fanreel product is a good example of this; it empowers brands to collect, curate, and publish social user-generated photos to their e-commerce site. A vital step in this pipeline is connecting the user generated content (UGC) to the product on our client’s web site. Automating this process requires cutting edge computer vision techniques whose implementation details are not always clear, especially for production use cases. In this post, I review how we leveraged Curalate’s R&D principles to build a visual search engine that identifies which of our clients’ products are in user generated photos. The resulting system allows our clients to quickly connect user generated content to their e-comm site, enabling the UGC to generate revenue immediately upon distribution.

Step 1: Do Your Homework

We start every R&D project by hitting the books and catching up on the relevant research. This lets us understand what is feasible, the (rough) computational costs, and any pitfalls of various techniques. In this case, our goal is to find which products are in any UGC image using only the product images from the client's e-comm site. This is extremely difficult: UGC photos have dramatic lighting conditions, generally contain multiple objects or clutter, and may have undergone non-rigid transformations (especially for garments). Knowing we had a difficult problem on our hands, we did an extensive literature review of papers from leading computer vision conferences, journals, and even arXiv to ensure we had a good understanding of the state of the art.

One approach stood out in the literature review: deep metric learning. Deep metric learning is a deep learning technique that learns an embedding function that, when applied to images of the same product, produces feature vectors that are close together in Euclidean space. This technique is perfect for our use case: we can train the system from existing pairs of UGC and product images in our platform to understand the complex transformations products undergo in UGC photos.

The figure above (from Song et al.) shows a t-SNE visualization of a learned embedding of the Stanford Online Products dataset. Notice that images of similar products are close together: wooden furniture zoomed in on the upper left, and bike parts on the lower right. Once we've learned this embedding function, identifying the products in a UGC image can be achieved by finding the embedding vectors from the client's product photos that are closest to that of the UGC.

Most techniques for deep metric learning start with a deep convolutional neural network trained on ImageNet (i.e., a basenet), remove the final classification layer, add a new layer that projects to the n-dimensional embedding space, and fine-tune the network with an appropriate loss function. One highly cited work is FaceNet by Schroff et al., who propose a loss function that uses triplets of images. Each triplet contains an anchor image, a positive example of the same class as the anchor, and a negative example of a different class. Though more recent work has surpassed FaceNet, in the interest of speed (we are a startup!) we decided to take it for a spin, since a TensorFlow implementation was available online.
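For reference, the triplet loss from the FaceNet paper has the following form, where $f(\cdot)$ is the learned embedding, $a_i$, $p_i$, $n_i$ are the anchor, positive, and negative images of the $i$-th triplet, and $\alpha$ is a margin enforced between positive and negative pairs:

$$\mathcal{L} = \sum_{i} \max\left(0,\; \lVert f(a_i) - f(p_i) \rVert_2^2 - \lVert f(a_i) - f(n_i) \rVert_2^2 + \alpha\right)$$

Minimizing this pulls same-product images together in the embedding space while pushing different products at least a margin apart.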

Step 2: Prototype and Experiment

The second phase of an R&D project at Curalate is the prototype phase. In this phase, we implement our chosen approach as fast as possible and evaluate it on publicly available data as well as our own. As with many things in a startup, speed is key here: we need answers as fast as possible so we know what we need to build. This phase is designed to answer the question: will it work and, if so, how well? In addition, this phase is when we experiment with different implementation details of the techniques we wish to use. Hyperparameter tuning, architecture components, and comparisons between algorithms all happen in this phase of R&D.

The big question we wanted to answer for our deep metric embedding project was: which basenet should we use? The FaceNet paper used GoogLeNet Inception models, but there have been many improvements since their publication. To compare different networks, we measured each network's performance on the Stanford Online Products dataset. We implemented FaceNet's triplet loss in MXNet so we could easily swap out the underlying basenet.

We compared the following networks from the MXNet Model Zoo:

A secondary question we wished to answer with this experiment was how efficiently we could compute the embeddings. To explore this, we also evaluated two smaller, faster networks:

The figure above shows the recall-at-1 accuracy for all basenets. Not surprisingly, the more computationally expensive networks (i.e., ResNet-152 and SENet) have the highest accuracy. SENet, in particular, achieved a recall-at-1 of 71.6%, only two percentage points less than the current state of the art.

One of the exciting results for us was SqueezeNet. Though it only achieved 60% accuracy, this network is extremely small (< 5MB) and fast enough to run on a mobile phone. Thus we could sacrifice some accuracy for a huge savings in computational cost if we needed to.

Step 3: Ship It

The final phase of an R&D project at Curalate is productization. In this phase, we leverage our findings from the prototype and literature phases to design and build a reliable and efficient production system. All code from the prototype phase is discarded or heavily refactored to be more efficient, testable, and maintainable. With deep learning systems, we also build a data pipeline for extracting, versioning, and snapshotting datasets from our current production systems.

For this project, we trained the model on a P6000 GPU rented from Paperspace. We again used MXNet so the resulting model could be deployed directly to our production web services (which are written in Scala). We opted to use ResNet-152 as the basenet to get a high-accuracy result, and deployed the learned network to g2.2xlarge instances on AWS.

The visual search system we built powers our Intelligent Product Tagging feature, which you can see in the video below. Using deep metric embedding, we vastly increased the accuracy of intelligent product tagging compared to non-embedded deep features.

Choosing a Deep Learning library for developing and deploying your App/Service


Interest in deep learning keeps growing and, with it at peak hype right now, a lot of people are looking for the best deep learning library to build their new app or bring their company into the modern age. There are many deep learning toolkits to choose from, ranging from long-used, well-supported, robust academic libraries to new state-of-the-art, industry-backed platforms.

At Curalate, we've been working on deep learning problems since 2014, which means we've had the chance to watch the deep learning community and its open source libraries grow. We have also had the fortunate (unfortunate?) experience of using a few of the deep learning libraries in our production services and applications, and along the way we have learned a lot about what to look for in a deep learning library when building reliable, production-ready applications and services. In this post, I'll share the lessons we've learned in hopes it will help you in your search for the perfect deep learning library match. You might even find that your best fit is using more than one!

Important factors

The specific needs of your application/service

The platform you are developing on and deploying to.

Do you develop on OSX? Linux? Windows? Do you plan on having your application run in a web browser? On a smartphone? On a massive multi-node GPU cluster? It's not surprising that each of the libraries has prioritized different environments, and some will work much better for your specific situation.

The specific deep net architecture you are trying to implement

If you are just trying to implement a typical, pre-trained classification net, this factor may not be as important for you. Some libraries are more performant and appropriate for certain types of deep nets (LSTMs, RNNs); more on this later.

API language requirements

If you already have a code base written in language A, you probably would like to keep it that way without having to figure out some convoluted way to fit a deep net interface in language B into it. Luckily, it seems that most of the common languages are covered at this point in at least one of the libraries, or in an external community project.

Codebase Quality

Is the code base actively maintained?

How healthy is the project in terms of maintainers? Is there a large group or company committing time and resources to the library's development? If you find a bug or issue with the library, how long is it going to take to get addressed?

Release status of the library itself

Is the library, or a certain feature/API you are going to need, still considered to be in an alpha or beta state? Has the library been used enough to have most of the kinks ironed out?

Ease of Use

Train to production pipeline

Your model training code and production code do not have to run in the same environments or even the same language. Can you train your model with a quick-to-prototype language in a documented, version-controlled, repeatable way so you can research new and different models for your application? Then can you deploy your saved model in a fairly quick and painless fashion? That may be through the same library with a different language API, using a library’s prebuilt production-serving framework, or even converting your model from one library to another that is better suited for your target platform.

Keras support

Does the library have support for being used as a backend for Keras? Keras is not a deep learning library per se, but a library that sits on top of other deep learning libraries and provides a single, easy-to-use, high-level interface for writing and training deep learning models. What it lacks in optimizations it makes up for in approachability: it is great for beginners, with great documentation and a modular, object-oriented design.

Dynamic vs Static computation

Now, we could write a whole blog post on this topic alone, but to keep it brief: do you want to work with a static computation graph API that follows a symbolic programming paradigm, or a dynamic computation graph API that follows an imperative programming paradigm? (A toy sketch contrasting the two styles follows the list below.)

  • Static Computation Graphing
    • You define the deep net once, and use a session to execute ops in the net many times.
    • The library can optimize the net before you use it, so the nets end up being more efficient with memory and speed.
    • Good for fixed-size nets (feed-forward, CNNs).
    • Leads to a more verbose, harder-to-debug, domain-specific-language (DSL) style of API.
    • Offers better control over loading and model management with regard to system resources.
  • Dynamic Computation Graphing
    • Nets are built and rebuilt at runtime, and executed line by line as you define them. This lets you use standard imperative language (think Python) statements, features, and control structures.
    • Tends to be more flexible and useful when the net structure needs to change at runtime, as in RNNs.
    • Makes debugging easy, since an error is not thrown in a single call to execute the net after it's compiled, but at the specific line in the dynamic graph at run time.
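To make the distinction concrete, here is a toy Scala sketch, not tied to any real deep learning library: in the symbolic style you build a graph of placeholder ops once and execute it later with different inputs, while in the imperative style every line computes a value immediately.

object GraphStyles extends App {
  // --- Static / symbolic style: build a graph first, execute it later ---
  sealed trait Node { def eval(feed: Map[String, Double]): Double }
  case class Input(name: String) extends Node {
    def eval(feed: Map[String, Double]): Double = feed(name)
  }
  case class Mul(a: Node, b: Node) extends Node {
    def eval(feed: Map[String, Double]): Double = a.eval(feed) * b.eval(feed)
  }
  case class Add(a: Node, b: Node) extends Node {
    def eval(feed: Map[String, Double]): Double = a.eval(feed) + b.eval(feed)
  }

  // Define the graph once (nothing is computed yet)...
  val graph = Add(Mul(Input("x"), Input("w")), Input("b"))
  // ...then "run a session" many times with different feeds.
  println(graph.eval(Map("x" -> 2.0, "w" -> 3.0, "b" -> 1.0))) // 7.0
  println(graph.eval(Map("x" -> 5.0, "w" -> 3.0, "b" -> 1.0))) // 16.0

  // --- Dynamic / imperative style: each line computes a value immediately ---
  def forward(x: Double, w: Double, b: Double): Double = {
    val h = x * w // computed right here; easy to print, inspect, or branch on
    h + b
  }
  println(forward(2.0, 3.0, 1.0)) // 7.0
}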

Support

Documentation

How good is the documentation? Are there coding examples that cover most of the use cases you need? Are you used to getting your documentation in a certain style from a specific company?

Community support

How large is the community? Just because a deep learning library is really good does not mean people are actually using it. Are you going to be able to find 3rd party blog posts, code samples, and tutorials using the library? If you run into a problem, what is the chance you are going to find someone on Stack Overflow with the answer to your problem?

Research

Does the research community actively use the library to develop state-of-the-art deep learning models and solutions? A lot of state-of-the-art discoveries made by the academic community require modification to the deep learning libraries themselves and it’s pretty common for research groups to release their source code for conference papers to the public. Most of these new models will be released as pretrained models and listed in a Model Zoo specific to the library. Porting these solutions between libraries is not a trivial task if you are not comfortable reimplementing the research paper.

Performance

Performance with specific network structures

How fast does your planned network structure run on each of the deep learning libraries? Will you be able to train and prototype your models faster on one vs another? If you are deploying to a service, how many requests per second can you expect to run through the library?

Scalability

How well does the library scale when you start providing it with more resources to meet your production load? Can you save money by using a more efficiently scaling library over another? (Cloud GPU instances can be really expensive.)

The Libraries

Caffe, with its unparalleled performance and well-tested C++ codebase, was basically the first mainstream, production-grade deep learning library. Caffe is good for implementing CNNs, image processing, and fine-tuning pre-trained nets. In fact, you can do all of these things while writing little to no code. You just place your training/validation data (mainly pictures) in a specific folder, set up config files for the deep net and its training parameters, and then call a precompiled Caffe binary that trains your net.

Being first to market means that a lot of early research and models were written with Caffe, and the research that built off of that forked and continued to use the same code base. Because of this, you will find a lot of state-of-the-art work, even to this day, still using Caffe despite its limitations. A lot of these models can be found in the Caffe Model Zoo, which is one of the first and largest (if not the largest) model zoos.

But now we have to start talking about its limitations. Caffe was built and designed around its original intended use case: conventional CNN applications. Because of this, Caffe is not very flexible. Overall, it's not very good for RNNs and LSTM networks. Even with its adoption of CMake, building the library can still be a pain (especially in non-Linux environments). It has little support for multiple GPUs (training only) and can only be deployed to a server environment. The configuration files used to define the deep net structure are very cumbersome; the prototxt for ResNet-152 is 6775 lines long!

In Caffe, the deep net is treated as a collection of layers, as opposed to nodes of single tensor operations. A layer can be thought of as a composition of multiple tensor operations. These layers are not very flexible, and there are a lot of them that duplicate similar logic internally. Because Caffe does not support auto-differentiation, if you want to develop new layer types, you have to define the full forward and backward gradient updates. You can define these layers in Caffe's Python interface, but unlike other libraries where the Python interface is accelerated by an underlying C implementation, Caffe's Python layers run in Python.

So should you use Caffe? If you are looking to reimplement a specific model from a 2015-era research paper using existing, open source code, it is not a bad library. If you are looking for raw performance and are not opposed to using a C++ library and API on a GPU server for your service/app, Caffe is still one of the fastest libraries around for fully connected networks.

But because of its limitations and technical debt, much of the community and its efforts have moved on from Caffe in some form or another. Caffe is a special case when it comes to model converters, in that it is the best-supported library, with converters to almost all other deep learning libraries, making it easier to move your work off of it. The creator of Caffe has since been hired by Google to work on their deep learning library TensorFlow, and now by Facebook to create a successor to Caffe in the appropriately named Caffe2.

Torch and PyTorch are related by much more than just their name. Torch was one of the original, academically created deep learning libraries. While it may not have as much research citing it, it still has a very large community around it. Many of the researchers who originally worked on Torch moved to Facebook, and unsurprisingly, Facebook has since developed the successor to Torch in the form of PyTorch. PyTorch and Torch use the same underlying C libraries, TH, THC, THNN, and THCUNN, which gives them very similar performance characteristics. When it comes to typical deep learning architectures, Torch offers some of the fastest (though not the fastest) performance around, with GPU scaling efficiency that matches the best.

Where Torch and PyTorch differ is in their interface, API, and graphing paradigms. Torch was written with a Lua API, which can be a major barrier to entry for most people. While you can do research and development in Lua, it doesn't have the massive community backing and vast open source libraries that Python does, so it can be quite limiting. Torch uses a static graph paradigm like Caffe's. Also like Caffe, it does not have any auto-differentiation capabilities, meaning if you want to implement new tensor operations for your deep net you have to write the backward gradient calculations yourself; on the plus side, it has a pretty substantial model zoo of pre-trained models.

PyTorch was made with the goal of fixing and modernizing Torch's various issues, creating what is probably one of the best currently available libraries for doing research and development. PyTorch, as the name suggests, has a very well designed Python API. It supports both dynamic graph programming and auto-differentiation for all of that easy-to-debug-and-prototype goodness. PyTorch also has its own visualization dashboard called Visdom, which, while more limited than TensorBoard (more on this later), is still very helpful for development.

So should you use Torch or PyTorch? Specifically for research and development of new models, PyTorch is probably the best current option. Even though PyTorch is still very new, most people in the deep learning field would agree that you should use it over classic Torch. Not to say Torch does not have its advantages: because of its age, it has a much larger backlog of research citing it and is more stable than PyTorch, but both of these advantages will fade over time. If you are looking for a library to deploy into any kind of production environment, then you should probably look elsewhere.

TensorFlow, without a doubt, is currently the biggest player in the deep learning field, and for good reason. TensorFlow is Google's attempt to build a single deep learning framework for everything deep learning related. There is very little that TensorFlow does not do well. Because it was created by Google, it was built with massive distributed computing in mind, but it also has mobile development capabilities in the form of TensorFlow Mobile and TensorFlow Lite. Its documentation is also considered one of the best: it covers the multiple API languages that TensorFlow supports, and if you count the interfaces made by 3rd parties in the community, it even has APIs for C#, Haskell, Julia, Ruby, Rust, and Scala. Speaking of that community, TensorFlow has the largest community of any of the deep learning libraries and currently has the most research activity.

From the beginning, TensorFlow was made with a clear static graph API that was easy to use, but as interests and needs in the machine learning field have changed, it recently gained support for dynamic graph functionality in the form of TensorFlow Fold. TensorFlow has Keras support, making it very easy for beginners, and it even has its own custom version built into the Python API.

When Google first released TensorFlow, they also released TensorBoard, a data visualization tool created to help you understand the flow of tensors through your model for debugging, optimization, and simply making sense of the complex and confusing nature of deep learning models. You can use TensorBoard to visualize your TensorFlow model, plot summary metrics about the execution of your model, and show additional data, like images, that pass through it.

Now what about deploying your models once you have finished training them? Well, Google also has a solution for that in TensorFlow Serving, a flexible, high-performance serving system for ML models designed for production environments. It comes in the form of modular C++ libraries, binaries, and Docker/Kubernetes containers that can be used as an RPC server or a set of libraries. There are even Google Cloud ML services set up with it to get your model into production in no time. TensorFlow Serving's main goal is to optimize for throughput with little to no downtime. It includes a built-in scheduler that aims for efficient mini-batching of requests through the model, and it can manage multiple models at once running on shared hardware. Currently the API only supports prediction, but it will support regression, classification, and multi-inference soon.

Now, TensorFlow is not perfect. Both Serving and Fold are still in their early days of development, so they might not be something you want to rely on. All of the APIs outside of the Python API are not covered by TensorFlow's API stability promises. But the biggest issue with TensorFlow, compared to the other libraries, is performance.

There is no real way to get around the issue; TensorFlow is just slower and more of a resource hog than the other libraries. Looking at performance across typical deep net architectures, you can expect other libraries to perform up to twice as fast as TensorFlow at similar batch sizes. You should avoid TensorFlow in general if you need performant recurrent nets (RNNs) or long short-term memory nets (LSTMs). TensorFlow even has the worst scaling efficiency of the group, despite its focus on distributed computing.

So should you use TensorFlow? We wouldn't blame you if you did, and we would probably suggest it for 80% of the possible use cases out there, especially if you are new to the deep learning field and want to work with a library and ecosystem that has solutions for almost everything you could possibly need. But, if you are willing to put in the extra time and effort, you can find a much more performant and equally featured experience with other libraries.

CNTK, the Microsoft Cognitive Toolkit, was originally created by MSR Speech researchers several years ago but has evolved into much more. It is a unified framework for building deep nets, recurrent nets (RNNs), long short-term memory nets (LSTMs), convolutional nets (CNNs), and Deep Structured Semantic Models (DSSMs). It can work for pretty much all types of deep learning applications, from speech and text to vision.

CNTK supports distributed training like TensorFlow and Torch. It even supports a proprietary, commercially licensed, 1-bit Stochastic Gradient Descent algorithm that significantly improves distributed performance. Thanks to CNTK's early focus on language models, it is 5-10 times faster than the other libraries when running RNNs, LSTMs, and similar dynamic network structures.

The biggest reason to use CNTK is if you or your company traditionally work with Microsoft software and products. CNTK is one of the few libraries with first-class support for running on Windows, with additional support for Linux and no support at all for OSX. It has direct support for deploying to a Microsoft Azure production environment and APIs that properly support Microsoft's languages of choice. Its model zoo is even organized in a very "MSDN documentation" fashion.

The main downside to CNTK is that it lacks support from both the general research community and the broader software development community. Microsoft may use it internally for a lot of their services and likely has the resources to keep supporting it, but it is simply having trouble gaining market share (like many of Microsoft's recent endeavors).

So should you use CNTK? If you are used to developing in Visual Studio and need an API for your .NET application, there is probably no better fit. But for most OSX/Linux developers there are options with better all-around support. And if you are doing research and development that is not specific to LSTMs or RNNs, there are more appropriate libraries.

MXNet is one of the newest players in the deep learning field but has been gaining ground fast. Originally created at the University of Washington and Carnegie Mellon University, it has been adopted by both the Apache Software Foundation and Amazon Web Services as their deep learning library of choice, and both have put their development efforts behind it.

MXNet supports almost all of the features the other libraries do. It has the largest selection of officially supported API languages, and it can run on everything from a web browser to a mobile phone to a massive distributed server farm. In fact, Amazon has found that you can get up to 85% scaling efficiency with MXNet. Beyond distributed scaling, MXNet also has some of the best performance on typical deep learning architectures.

MXNet supports both static graph programming and dynamic graph programming, via the raw MXNet and Gluon APIs respectively. Gluon is MXNet's clear, concise, and simple deep learning API, created in collaboration with AWS and Microsoft in the same spirit as Keras, though MXNet also supports Keras itself if you prefer it. MXNet additionally has its own serving framework for getting trained models into production, extra support for running on AWS, and even its own TensorBoard implementation that provides much of the same functionality as the TensorFlow equivalent.
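To show what Gluon's imperative, dynamic-graph style looks like, here is a small, hypothetical training step; the network shape, learning rate, and fake batch are invented for the example.

```python
from mxnet import nd, autograd, gluon

# A tiny two-layer network defined with Gluon; sizes are arbitrary.
net = gluon.nn.Sequential()
net.add(gluon.nn.Dense(64, activation='relu'),
        gluon.nn.Dense(3))
net.initialize()

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

# A fake batch so the sketch runs end to end.
data = nd.random.uniform(shape=(32, 20))
labels = nd.zeros((32,))  # pretend every example is class 0

with autograd.record():   # the graph is recorded as the Python code executes
    output = net(data)
    loss = loss_fn(output, labels)
loss.backward()
trainer.step(batch_size=32)
```

Because the graph is built on the fly inside autograd.record(), ordinary Python control flow (loops, conditionals) can shape the network from batch to batch, which is the main appeal of the dynamic style.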

MXNet does have notable weaknesses that make working with it a little more annoying. The documentation could be much better: the APIs went through a few changes before the 1.0 release, the documentation still reflects this, and it can get confusing in places. In terms of community support, it is neither the worst nor the best, but somewhere in the middle. A notable number of people are using it in research, and there are plenty of usage examples for different net types along with its model zoo.

So should you use MXNet? If you are willing to put in the time and deal with some of the pain points of a younger deep learning library, it is, along with TensorFlow, probably the best option for 80% of use cases. We would suggest it over TensorFlow in particular if performance is a big concern of yours. And if you are looking for the most flexible library, one that gives you as many options as possible in your train-to-production pipeline with a native API for your production code, it is probably the best choice.

The Other Libraries

The six deep learning libraries covered above are by no means the only options available to you. They are just the biggest players and arguably the most relevant for 2018. There are many more to choose from that may better fit your specific needs (deployment destination, non-English documentation/community, hardware, etc.). We will briefly cover them here as a jumping-off point if you want to dig into any of them more deeply.

Theano

  • Python API
  • University of Montreal
  • Future work on the project has stopped; may it rest in peace
  • Watches: 573, Stars: 8041, Forks: 2426, Median Issue Resolution Time: 12 days, Open issues: 19%*
  • Research Citations: 1,080
  • Makes you do a lot of things from scratch, which leads to more verbose code.
  • Single GPU support only
  • Numerous open-source deep learning libraries have been built on top of Theano, including Keras, Lasagne, and Blocks
  • No real reason to use over TensorFlow unless you are working with old code.

Caffe2

  • C++, Python APIs
  • Facebook
  • Watches: 552, Stars: 7631, Forks: 1821, Median Issue Resolution Time: 55 days, Open issues: 33%*
  • Caffe2 is Facebook's second entry into the deep learning library ecosystem.
  • It focuses more on mobile and industrial-strength production applications than on development and research.
  • Where Caffe only supported single-GPU training, Caffe2 is built to utilize multiple GPUs on a single host as well as multiple hosts with one or more GPUs each.

CoreML

  • Swift, Objective-C APIs
  • Apple
  • Closed source
  • Not a full DL library (you cannot use it to train models at the moment); it is mainly focused on deploying pre-trained models optimized for Apple devices
    • If you need to train your own model, you will need to use one of the above libraries
    • Model converters are available for Keras, Caffe, scikit-learn, libSVM, XGBoost, MXNet, and TensorFlow (see the sketch below)
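As a hypothetical example of that converter workflow, coremltools can turn a saved Keras model into a Core ML .mlmodel file; the file and feature names below are placeholders.

```python
import coremltools

# Convert a saved Keras model to Core ML format for use on Apple devices.
# "my_model.h5" and the feature names are placeholders for illustration.
coreml_model = coremltools.converters.keras.convert(
    "my_model.h5",
    input_names=["features"],
    output_names=["scores"],
)
coreml_model.save("MyModel.mlmodel")
```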

Paddle

  • Python API
  • Baidu
  • Watches: 558, Stars: 6580, Forks: 1756, Median Issue Resolution Time: 7 days, Open issues: 24%*
  • One of the newest libraries available
  • Chinese documentation with an English translation
  • Has the potential to become a big player in the market

Neon

  • Python API
  • Intel
  • Watches: 351, Stars: 3437, Forks: 778, Median Issue Resolution Time: 28 days, Open issues: 16%*
  • Written with Intel MKL accelerated hardware in mind (Intel Xeon and Phi processors)

Chainer

  • Python API
  • Preferred Networks
  • Watches: 310, Stars: 3595, Forks: 949, Median Issue Resolution Time: 31 days, Open issues: 13%*
  • Research Citations: 207
  • Dynamic computation graph
  • Smaller company effort with a Japanese and English community

Deeplearning4j

  • Java, Scala APIs
  • Skymind
  • Watches: 792, Stars: 8527, Forks: 4120, Median Issue Resolution Time: 19 days, Open issues: 21%*
  • Written with Java and the JVM in mind
  • Keras Support (Python API)
  • DL4J can take advantage of distributed computing frameworks, including Hadoop and Apache Spark.
  • On multiple GPUs, its performance is on par with Caffe.
  • Can import models from TensorFlow
  • Uses ND4J (NumPy for the JVM)

DyNet

  • C++, Python APIs
  • Carnegie Mellon University
  • Watches: 178, Stars: 2189, Forks: 527, Median Issue Resolution Time: 4 days, Open issues: 16%*
  • Dynamic computation graph
  • Small user community

MatConvNet

  • Matlab APIs
  • Watches: 113, Stars: 959, Forks: 633, Median Issue Resolution Time: 96 days, Open issues: 53%*
  • A MATLAB toolbox implementing Convolutional Neural Networks (CNNs) for computer vision applications

Darknet

  • Python, C APIs
  • Watches: 520, Stars: 6276, Forks 3072, Median Issue Resolution Time: 55 days, Open issues: 78%*
  • Very small open-source effort with a laid-back dev group
  • Not useful for production environments

Leaf

  • Rust API
  • autumnai
  • Watches: 195, Stars: 5229, Forks: 265, Median Issue Resolution Time: 131 days, Open issues: 58%*
  • Support for the lib looks to be dead

TLDR

Choose either TensorFlow or MXNet for probably about 80% of use cases (TensorFlow if you prioritize community support and documentation, MXNet if you need performance). Look at PyTorch if you mainly need something to develop and train new models. If you love Microsoft and are developing for a .NET environment in Windows and Visual Studio, try out CNTK. Look into CoreML if you just need to deploy models to Apple devices specifically, and Deeplearning4j if you really want to keep things JVM-focused.

* Numbers taken at time of writing, expected to change.
