Machine Learning Music Generation Through Deep Neural Networks

Article by Ramya Vidiyala | October 01, 2020

Deep learning has improved many aspects of our lives, in ways both obvious and subtle. It plays a key role in processes such as movie recommendation systems, spam detection, and computer vision. Although there is ongoing debate about deep learning’s black-box nature and the difficulty of training it, there is huge potential for it in a wide variety of fields, including medicine, virtual assistants, and ecommerce.

One fascinating area in which deep learning can play a role is at the intersection of art and technology. To explore this idea further, in this article we will look at machine learning music generation via deep learning processes, a field many assume is beyond the scope of machines (and another interesting area of fierce debate!).

Contents

  • Music Representation for Machine Learning Models
  • Music Dataset
  • Data Processing
  • Model Selection
    • Many-Many RNN
    • Time Distributed Dense Layer
    • Stateful
    • Dropout layers
    • Softmax layer
    • Optimizer
  • Generating Music
  • Summary

 

Music Representation for Machine Learning Models

We will be working with ABC music notation. ABC notation is a shorthand form of musical notation that uses the letters A through G to represent notes, along with other symbols that convey additional information such as sharps, flats, note lengths, the key, and ornamentation.

This form of notation began as an ASCII-based format to make sharing music online easier, and it gives software developers a simple language that is designed for ease of use. Figure 1 is a snapshot of music in ABC notation.

Figure 1: A snapshot of music in ABC Notation

Each line in the first part of the notation begins with a letter followed by a colon. These fields describe various aspects of the tune, such as the index when there is more than one tune in a file (X:), the title (T:), the time signature (M:), the default note length (L:), the type of tune (R:), and the key (K:). The lines following the key designation represent the tune itself.
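For illustration only (a made-up fragment, not taken from the dataset), a short tune in ABC notation might look like this:

X:1
T:Example Tune
M:4/4
L:1/8
R:reel
K:C
CDEF GABc | c2 c2 G2 G2 | cBAG FEDC | C4 C4 |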

 

Music Dataset

In this article we’ll use the open-source ABC version of the Nottingham Music Database. It contains more than 1,000 folk tunes, the vast majority of which have been converted to ABC notation.

 

Data Processing

The data is currently in a character-based categorical format. In the data processing stage, we need to transform it into an integer-based numerical format to prepare it for use with neural networks.

Figure 2: Snapshot of simple data processing

 

Here each character is mapped to a unique integer. This can be achieved with a single line of code. The ‘text’ variable holds the input data as a single string.

 

char_to_idx = { ch: i for (i, ch) in enumerate(sorted(list(set(text)))) }
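We also build the reverse mapping and record the vocabulary size, both of which the later snippets rely on (idx_to_char and vocab_size):

idx_to_char = {i: ch for (ch, i) in char_to_idx.items()}   # reverse lookup for decoding generated output
vocab_size = len(char_to_idx)                               # number of unique characters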

 

To train the model, we convert the entire text into this numerical format using the vocabulary mapping.

 

import numpy as np

T = np.asarray([char_to_idx[c] for c in text], dtype=np.int32)
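The exact batching code isn’t shown here, but as a rough sketch (using illustrative BATCH_SIZE and SEQ_LENGTH values that match the model’s batch input shape below), T can be cut into input/target batches in which the target at each timestep is simply the next character, one-hot encoded for the categorical cross-entropy loss. Because the model below is stateful, each row of a batch continues where the same row of the previous batch left off.

BATCH_SIZE = 16    # illustrative values; tune for your data
SEQ_LENGTH = 64

def make_batches(T, vocab_size, batch_size=BATCH_SIZE, seq_length=SEQ_LENGTH):
    # Assign a contiguous slice of the corpus to each batch row so that
    # consecutive batches line up, as a stateful model requires.
    chars_per_row = len(T) // batch_size
    steps = (chars_per_row - 1) // seq_length
    for s in range(steps):
        X = np.zeros((batch_size, seq_length), dtype=np.int32)
        Y = np.zeros((batch_size, seq_length, vocab_size), dtype=np.float32)
        for row in range(batch_size):
            start = row * chars_per_row + s * seq_length
            X[row] = T[start:start + seq_length]
            for t, nxt in enumerate(T[start + 1:start + 1 + seq_length]):
                Y[row, t, nxt] = 1.0   # one-hot target: the character that follows
        yield X, Y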

 

Model Selection for Machine Learning Music Generation

Traditional machine learning models cannot remember what they have seen before: each input is processed independently. Recurrent Neural Networks (commonly called RNNs), however, can carry information from previous steps forward.

An RNN has a repeating module that takes the input from the previous step and passes its output forward as input to the next step. However, a plain RNN struggles to retain information over long sequences, so our network needs more memory to learn long-term dependencies. This is where Long Short-Term Memory networks (LSTMs) come to the rescue.

LSTMs are a special case of RNNs, with the same chain-like structure, but a different structure inside the repeating module.

Figure 3: The workings of a basic LSTM

 

An RNN is used here because:

  1. The length of the input doesn’t need to be fixed; it can vary from one example to the next.
  2. The network retains a memory of the sequence seen so far.
  3. Various combinations of input and output sequence lengths can be used.

In addition to the general RNN, we’ll customize it for our use case with a few tweaks: we’ll use a ‘character RNN’. In a character RNN, the input, the output, and the output passed forward from step to step are all individual characters.

Figure 4: Overview of a Character RNN

 

Many-Many RNN

As we need an output generated at each timestep, we’ll use a many-many RNN. To implement a many-many RNN, we set the parameter ‘return_sequences’ to True so that an output is produced for each character at each timestep. You can get a better understanding of this by looking at figure 5, below.

Figure 5: Structure of Many-Many RNN

In the figure above, the blue units are the inputs, the yellow units are the hidden units, and the green units are the outputs. This is a simple overview of a many-many RNN. For a more detailed look at RNN sequence types, here’s a helpful resource.
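As a quick, minimal sketch of what ‘return_sequences’ changes (assuming the TensorFlow Keras API; this is an illustration, not part of the original model code), compare the output shapes of an LSTM layer with and without it:

import numpy as np
from tensorflow.keras.layers import LSTM

x = np.random.random((2, 10, 8)).astype('float32')       # (batch, timesteps, features)

print(LSTM(16)(x).shape)                                  # (2, 16): one vector per sequence
print(LSTM(16, return_sequences=True)(x).shape)           # (2, 10, 16): one vector per timestep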

 

Time Distributed Dense Layer

To produce a prediction at each timestep, we create a time distributed dense layer on top of the outputs generated at each timestep. This wraps a single dense layer so that the same weights are applied to the output of every timestep.

 

Stateful

By setting the parameter stateful to True, the internal state at the end of each batch is carried over as the initial state for the following batch, rather than being reset. After combining all these features, our model will look like the overview depicted in figure 6, below.

Figure 6: Overview of the model architecture

 

The code snippet for the model architecture is as follows:

 

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, TimeDistributed, Dense, Activation

model = Sequential()
# Map each character index to a 512-dimensional embedding vector
model.add(Embedding(vocab_size, 512, batch_input_shape=(BATCH_SIZE, SEQ_LENGTH)))
for i in range(3):
    # Stacked LSTM layers: return_sequences keeps an output per timestep,
    # stateful carries the hidden state across consecutive batches
    model.add(LSTM(256, return_sequences=True, stateful=True))
    model.add(Dropout(0.2))
# Apply the same dense + softmax classifier to every timestep's output
model.add(TimeDistributed(Dense(vocab_size)))
model.add(Activation('softmax'))

model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

 

I highly recommend playing around with the layers to improve the performance.

 

Dropout Layers

Dropout layers are a regularization technique that randomly sets a fraction of the input units to zero at each update during training, in order to prevent overfitting. The fraction is determined by the rate parameter passed to the layer (0.2 in the model above).

 

Softmax Layer

The generation of music is a multi-class classification problem, where each class is a unique character from the input data. Hence we add a softmax layer on top of our model and use categorical cross-entropy as the loss function.

This layer gives the probability of each class (character). To pick the next character we can take the one with the largest probability, or, as the generation code below does, sample from the probability distribution.

Figure 7

 

Optimizer

To optimize our model, we use Adaptive Moment Estimation, also called Adam, as it works well for recurrent networks.

Figure 8: Snapshot of the model summary
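Before we move on to generation, here is a rough sketch of a possible training loop (not the exact code used for this project), assuming the make_batches helper sketched earlier and standard Keras calls. The checkpoint naming is chosen to match the weights file we load below.

import os

MODEL_DIR = 'model'                          # assumed checkpoint directory
os.makedirs(MODEL_DIR, exist_ok=True)

for epoch in range(1, 101):                  # the weights file loaded later suggests roughly 100 epochs
    losses = []
    for X, Y in make_batches(T, vocab_size):
        loss, acc = model.train_on_batch(X, Y)
        losses.append(loss)
    model.reset_states()                     # clear the LSTM state between passes over the corpus
    print('Epoch {}: loss {:.4f}'.format(epoch, float(np.mean(losses))))
    if epoch % 10 == 0:
        model.save_weights(os.path.join(MODEL_DIR, 'weights.{}.h5'.format(epoch)))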

 

Generating Music

So far we have created an RNN model and trained it on our input data. This model learned the patterns of the input data during the training phase. Let’s call it the ‘trained model’.

The trained model takes a whole batch of sequences as input. For generating music, however, we want to feed in one character at a time, so the input shape must be a single character, i.e. (1, 1). We therefore create a new model that is identical to the trained model but has a batch input shape of (1, 1), and load the trained model’s weights into it so it reproduces the trained model’s behavior.

 

model2 = Sequential()
# Same architecture as the trained model, but fed one character at a time: batch_input_shape=(1, 1)
model2.add(Embedding(vocab_size, 512, batch_input_shape=(1, 1)))
for i in range(3):
    model2.add(LSTM(256, return_sequences=True, stateful=True))
    model2.add(Dropout(0.2))
model2.add(TimeDistributed(Dense(vocab_size)))
model2.add(Activation('softmax'))

 

We load the weights of the trained model into the new model. The loading itself takes a single line of code.

 

model2.load_weights(os.path.join(MODEL_DIR, 'weights.100.h5'))   # weights saved after epoch 100
model2.summary()

 

Figure 9: Snapshot of the model summary

 

In the process of music generation, the first character is chosen randomly from the set of unique characters; each subsequent character is then generated using the previously generated character, and so on. With this structure, we generate music.

Figure 10: Overview of generation architecture

 

Here is the code snippet that helps us achieve this.

 

sampled = []
for i in range(1024):
    batch = np.zeros((1, 1))
    if sampled:
        # feed the previously generated character back in as the next input
        batch[0, 0] = sampled[-1]
    else:
        # seed the first step with a random character from the vocabulary
        batch[0, 0] = np.random.randint(vocab_size)
    # predict a probability distribution over the next character and sample from it
    result = model2.predict_on_batch(batch).ravel()
    sample = np.random.choice(range(vocab_size), p=result)
    sampled.append(sample)

print(sampled)                                     # generated integer indices
print(''.join(idx_to_char[c] for c in sampled))    # decoded ABC text
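One small, optional addition (an assumption on my part rather than part of the original snippet): since model2 is stateful, clearing its internal state before sampling a new tune keeps separate generations independent of each other.

model2.reset_states()   # clear the LSTM state before generating a new, independent piece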

 

Here are a few generated pieces of music:

 

We generated these pleasant samples of music using the LSTM networks described above. Every generation produces different patterns, but ones that remain similar to the training data. These melodies can be used for a wide variety of purposes:

  • To enhance artists’ creativity through inspiration
  • As a productivity tool to develop new ideas
  • As additional tunes to artists’ compositions
  • To complete an unfinished piece of work
  • As a standalone piece of music

However, this model can still be improved. Our training data consisted of a single instrument, the piano. One way we could enhance the training data is by adding music from multiple instruments. Another would be to increase the variety of genres, rhythms, and time signatures it contains.

At present, our model generates a few false notes, and the music is not exceptional. We could reduce these errors and improve the quality of the music by expanding the training dataset as described above.

 

Summary

In this article, we looked at how to process music for use with neural networks, the inner workings of deep learning models such as RNNs and LSTMs, and how tweaking a model can result in music generation. We can apply these concepts to other forms of generative art as well, such as landscape paintings or human portraits.

Thanks for reading! If you would like to experiment with this custom dataset yourself, you can download the annotated data here and see my code on GitHub.


 

If you’d like to read more of Ramya’s technical articles, be sure to check out the related resources below. You can also sign up to the Lionbridge AI newsletter for technical articles delivered straight to your inbox.

The Author
Ramya Vidiyala

Ramya is a data nerd and a passionate writer who loves exploring and finding meaningful insights from data. She writes articles on her Medium blog about ML and data science where she shares her experiences to help readers understand concepts and solve problems. Reach out to her on Twitter (@ramya_vidiyala) to start a conversation!
