An AI that can make you smile or cry

Andrew Schoen
7 min read · May 31, 2017

If you’re in a quiet place or can put on headphones, I encourage you to press play on the track below. It’s only 32 seconds long. It was written by an AI.

An AI developed by my friend’s company, Jukedeck, was asked to write something “emotive” and “cinematic.” This is what it came up with.

In this piece, I will discuss the intersection of AI and creative expression. If you’re someone who likes to listen to music while reading, the piece below from Jukedeck’s AI is designed to accompany this article (which should be about a 5-minute read from this point forward).

Like essentially every VC, I’ve been thinking a lot about AI. It will impact (in many cases, is already impacting) virtually every industry. While I’ve spent plenty of time thinking about many of the traditional AI-related questions — future of labor, rogue ASI, simulation paradox, etc. — one quandary that is less widely discussed and more near-term is what happens when AI becomes competent at creative expression.

We traditionally think of visual arts, music, and storytelling as uniquely human pastimes. Instinctively, we conceptualize technology as best suited for automating routine processes, but relegate creative expression — so utterly non-routine and unpatterned — to the sole domain of human pursuit.

An AI that successfully generates rich, emotionally expressive content would force us to reflect more deeply, or at least more viscerally, on the nature of human consciousness — certainly more so than an AI that simply automates filing taxes or driving a car. To some extent, automating creative expression is a narrow application of the famous Turing Test (a test, developed by Alan Turing in 1950, of a machine’s ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human).

Here’s my bet for the first AI-generated content that will make you smile or cry: music.

(Probably composed by an AI developed by my friend’s company, Jukedeck.)

Where We Are Today:

Lots of relevant application-layer work shows great promise in commercializing AI in key content areas like computer vision and natural language processing. The technologies that undergird these applications — algorithm generation, modeling, selection, optimization, and hardware implementation — are improving at an accelerating rate.

A few intriguing examples:

  • In computer vision, Deep Dream (a program that uses AI, specifically convolutional neural networks, to find and enhance patterns in images) has created some truly captivating images. Basically, Deep Dream’s computer vision AI, trained to recognize objects within images, is asked what it “thinks” it sees in a picture. The AI’s predictions are then applied back to the image (the image is edited so that the patterns the AI was seeing are actually enhanced). Loop through this process several times, and the results look fairly trippy (a rough code sketch of this loop appears after this list):
Original image (top). Output from a Deep Dream AI trained to recognize buildings and landscapes (bottom).
This dreamscape is what the same Deep Dream AI saw in an image of random white noise (white noise shown in the upper left).
  • In natural language processing and writing, an amazing analysis, leveraging AI-fueled data mining and sentiment analysis, revealed that the emotional arcs of stories are dominated by six basic shapes.
  • Generative AI as applied to storytelling will most likely yield the ability to generate written content in an automated, scalable fashion. The early use cases will primarily be marketing copy, blog and comment content generation, and perhaps eventually turbocharging the conversion of any outline (plus style suggestions around tone, etc.) into long-form prose.
  • Some recent work by a team of fellow Cornellians visually explores what can be done with AI style transfer, in which an AI analyzes a given item to determine its unique characteristics (its “style”) and then modifies other items to match that style (a toy sketch of the underlying optimization also appears after this list). AI style transfer is an area I’m following closely.
In the photos above, a primary image (far left) and a style image (center) are given to an AI. The AI’s job is to modify the primary image so that it fits the characteristics of the style image. The output is displayed on the far right. This type of process isn’t limited to images.
  • It’s nice that AIs can pull off “tricks” like the above, but how might their abilities evolve over time? How might AIs learn to optimize and assemble their capabilities in novel ways? This is where I think genetic algorithms (a well-known metaheuristic in computer science, statistical optimization, and AI) could be a good piece of the technical approach, if not a high-level philosophical one. Here’s a cool demonstration I put together to illustrate how genetic algorithms work (a code sketch of the same loop follows this list):
This video visually illustrates a genetic algorithm optimization technique. This simulation utilizes a basic physics engine and a simple simulated 2D topology to evolve vehicle designs. It first assembles components (shapes like circles and squares) at random and tests the randomly spawned vehicle designs. It then utilizes the performance (distance traveled) of these simulated vehicles as the basis for selecting vehicle designs to be passed on to the next generation (plus a degree of random mutation). Through this process, the algorithm automatically evolves vehicles and optimizes their performance.
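For the technically curious, here is roughly what that Deep Dream loop looks like in code. This is a minimal sketch using PyTorch and a pretrained VGG16 network; the layer index, step size, and iteration count are illustrative picks of mine, not Deep Dream’s actual settings (Google’s implementation uses an Inception network and multi-scale “octaves”).

```python
# Minimal Deep-Dream-style sketch: gradient ascent on the activations of a
# chosen layer, so the patterns the network "sees" get amplified in the image.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained classifier; we only use its convolutional feature layers.
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()

def deep_dream(image_path, layer_index=20, steps=30, lr=0.05):
    img = Image.open(image_path).convert("RGB")
    x = T.Compose([T.Resize(512), T.ToTensor()])(img).unsqueeze(0)
    x.requires_grad_(True)
    submodel = vgg[:layer_index]          # run the network up to one layer

    for _ in range(steps):
        loss = submodel(x).norm()         # "how strongly does it see things?"
        loss.backward()
        with torch.no_grad():
            x += lr * x.grad / (x.grad.abs().mean() + 1e-8)  # enhance patterns
            x.grad.zero_()
            x.clamp_(0, 1)                # keep a valid image
    return x.detach()
```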
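Similarly, here is a toy sketch of the optimization behind the style-transfer images above, in the spirit of the original Gatys et al. formulation: optimize an image so its deep features match the content image while its Gram-matrix statistics (the “style”) match the style image. The layer indices and loss weights below are illustrative assumptions.

```python
# Toy neural style transfer: content loss on deep features, style loss on
# Gram matrices of features from several layers. Assumes batch size 1.
import torch
import torch.nn.functional as F
import torchvision.models as models

vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def features(x, layers=(1, 6, 11, 20, 29)):
    out = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layers:
            out.append(x)
    return out

def gram(f):
    # Channel-to-channel feature correlations: a simple summary of "style".
    _, c, h, w = f.shape
    f = f.view(c, h * w)
    return f @ f.t() / (c * h * w)

def style_transfer(content, style, steps=200, style_weight=1e5):
    x = content.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=0.02)
    content_feats = features(content)
    style_grams = [gram(f) for f in features(style)]
    for _ in range(steps):
        opt.zero_grad()
        feats = features(x)
        content_loss = F.mse_loss(feats[-1], content_feats[-1])
        style_loss = sum(F.mse_loss(gram(f), g)
                         for f, g in zip(feats, style_grams))
        (content_loss + style_weight * style_loss).backward()
        opt.step()
    return x.detach()
```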
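And here is a compact sketch of the evolutionary loop the video demonstrates. A real version would score each genome by running the physics simulation; here a simple stand-in fitness function plays the role of “distance traveled.”

```python
# Genetic algorithm skeleton: random population -> score -> select -> breed
# (crossover + mutation) -> repeat. The fitness function is a stand-in for
# the physics test shown in the video.
import random

GENOME_LEN, POP_SIZE, GENERATIONS = 12, 50, 100

def random_genome():
    return [random.uniform(-1, 1) for _ in range(GENOME_LEN)]

def fitness(genome):
    # Stand-in for "distance traveled": reward genomes near an arbitrary optimum.
    return -sum((g - 0.5) ** 2 for g in genome)

def crossover(a, b):
    cut = random.randrange(1, GENOME_LEN)
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.1):
    return [g + random.gauss(0, 0.1) if random.random() < rate else g
            for g in genome]

population = [random_genome() for _ in range(POP_SIZE)]
for generation in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    survivors = population[: POP_SIZE // 4]          # selection
    children = [mutate(crossover(random.choice(survivors),
                                 random.choice(survivors)))
                for _ in range(POP_SIZE - len(survivors))]
    population = survivors + children                # next generation

print("best fitness:", fitness(population[0]))
```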

So, why do I think music will be the first AI-generated content that will make you smile or cry?

Intuitively, it’s because music strikes the perfect balance: emotionally rich and expressive, yet sufficiently grounded in pattern and math for AI to be viable.

To be clear, I think AI-generated written and visual content will commercialize earlier and more easily than audio content, but likely in a more, well, “commercial” context (marketing content on the NLP side, and AI-powered editing tools embedded in photo and video editing suites such as Adobe’s on the visual side).

For AI-composed music to reach human-level quality, I believe it’ll need to make use of two key advantages:

  1. Scale + Optimization + Feedback Loops: AI can already generate some pretty good-sounding music across a variety of styles. But crossing the threshold from something that sounds 98% great yet lacks real emotional depth to something that truly resonates is extremely challenging. I have a strong suspicion that to reach parity with human composers, AI music generation will need to take an evolutionary approach at first: an AI can generate content en masse, and once that content is released into the wild, its performance determines which pieces survive and form the basis for recombination into future content units and models. This creates a critical feedback loop, in which a genetic-algorithm approach to optimization can be implemented (a sketch of this loop follows this list).
  2. Real-time Responsiveness: One key advantage machine-generated content has relative to human content is on-the-fly adaptability. Non-static content will fundamentally alter content consumption. As the man-machine interface improves, the inputs that drive this responsiveness will become more robust, lower-latency, and more interesting. Imagine game music that adapts on the fly to what the player is doing and seeing. Imagine music that adapts in real time to the individual preferences of a given listener. Looking ahead, VR / AR are particularly interesting in that a lot of new user input methods are being invented or incorporated inside of them: motion and gesture tracking, eye tracking, facial expression tracking, brain wave function, etc. This all increases the user’s usable cognitive output, with rapid sampling rates and low latency. Music software today is capable of taking simple inputs (for example, pressing a single key on a piano) and generating dynamic backup music that adapts to the input. That said, this key-press input method is crude; it is the UX equivalent of a command line interface. With all of the amazing new human input devices being invented (particularly inside VR), I can’t wait to see the GUI equivalent. Because AI-generated music (and other content) can respond dynamically to all kinds of human inputs in real time, it can create truly unparalleled, unique new forms of content to experience (a toy sketch of such a control loop also follows this list).
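To make point 1 concrete, here is a hedged sketch of that feedback loop: generate tracks from parameter vectors, score them by listener engagement, and breed the winners. The parameter names are hypothetical, and a simulated scoring function stands in for real-world listening data.

```python
# Evolutionary feedback loop for generated music: engagement data (simulated
# here) decides which parameter vectors survive and recombine.
import random

def random_track_params():
    # Hypothetical knobs a generative music model might expose.
    return {"tempo": random.randint(60, 180),
            "brightness": random.random(),
            "harmonic_density": random.random()}

def simulated_engagement(params):
    # Stand-in for real listener data: pretend audiences prefer ~120 BPM,
    # moderately bright tracks.
    return -abs(params["tempo"] - 120) / 60 - abs(params["brightness"] - 0.6)

def breed(a, b, mutation_rate=0.1):
    child = {k: random.choice([a[k], b[k]]) for k in a}   # recombination
    if random.random() < mutation_rate:                   # random mutation
        child["tempo"] = min(180, max(60, child["tempo"] + random.randint(-10, 10)))
    return child

population = [random_track_params() for _ in range(100)]
for release_cycle in range(20):
    population.sort(key=simulated_engagement, reverse=True)
    winners = population[:25]                             # "what survives"
    population = winners + [breed(random.choice(winners), random.choice(winners))
                            for _ in range(75)]
```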
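And for point 2, a toy sketch of real-time responsiveness: a control loop that maps a stream of user signals to music parameters on every frame. The signal names and mappings are illustrative assumptions, not any shipping API.

```python
# Toy adaptive-music control loop: read user signals, map them to musical
# parameters, and hand them to a (here, imaginary) synthesizer each frame.
import random
import time

def read_user_signals():
    # Stand-in for real sensors: motion tracking, facial expression, etc.
    return {"motion_intensity": random.random(),    # 0 = still, 1 = frantic
            "valence": random.uniform(-1, 1)}       # -1 = negative mood, +1 = positive

def music_parameters(signals):
    # Crude mapping: faster action -> faster tempo; mood -> major vs. minor.
    tempo = 70 + 90 * signals["motion_intensity"]
    mode = "major" if signals["valence"] >= 0 else "minor"
    return {"tempo_bpm": round(tempo), "mode": mode}

for frame in range(5):                      # a few frames of the loop
    params = music_parameters(read_user_signals())
    print(f"frame {frame}: {params}")       # a real system would drive a synth
    time.sleep(0.1)                         # ~10 Hz control rate for the sketch
```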

While this discussion strictly bifurcates human composition from AI composition, one item I didn’t include is the significant room that will exist for human + AI collaboration in content creation. AI tools may become an integral part of the composition process and will likely be a staple within music creation software. For example, DJs may be able to press an “auto mix” button to blend two songs together or smoothly transition between them. Or, imagine a DJ who knows they want to play Song A and then Song C, but is looking for the perfect Song B to bridge the two; an AI could find it, or even compose it. Or, imagine using AI style transfer (like in the images above) to take a certain song and automatically generate a remix in your favorite genre. There are plenty of synergistic ways for AI to collaborate with and enhance human creative output! (Someone even used AI to design a sauce to go with their favorite dish.)

Conclusions:

For all of human history, cognition has faced an I/O problem: input bandwidth is extremely high (our five senses ingest information at an incredibly fast rate), but output bandwidth is constrained. Typing sucks; it’s extremely slow and inefficient. I’d argue that speaking, singing, dancing, and body language are among our most high-bandwidth outputs, but they’re still nothing compared to our ability to intake, process, and synthesize information. Output, our capacity for expression, remains the critical bottleneck.

We could see technology seriously shift that dynamic during our lifetimes. Computing, broadly speaking, has already started this process (the computer, in Steve Jobs’s famous phrase, as a “bicycle for the mind”), but the curve is just now beginning to accelerate as user output methods become more seamless. Adaptive AI content generation, coupled with new forms of human output capture (e.g., voice, gesture, eye tracking, facial expression tracking, brainwave mapping), could dramatically magnify our ability to express and to create, and foundationally alter the nature of the content we produce and consume.

If AI is progressing this rapidly across a multitude of domains — increasingly able to generate emotionally rich, meaningful content, to synthesize stunning visuals, and even to elucidate the very narratives and story arcs that undergird human engagement — how long before AI is able to simulate a reality that is just as meaningful as the one we currently inhabit? Assuming we’re not in a simulation (or even if we are), how might the AI revolution augment this reality?

All of this is going to take a while. A long while, most likely. With many ups and downs. But it will be absolutely fascinating to watch (and listen)!

Bonus: Here are some other great tracks by Jukedeck’s AI!

Written by Andrew Schoen

Venture Capital Investor at NEA (New Enterprise Associates). Co-Founder of Flicstart. Schwarzman Scholar.