How can I combine a neural network with audio classification, like Classify does?

One thing we have to consider before applying a deep neural network is the size of the data. If the data set is very small, which is the case here (only 18 examples), a very deep neural network may not converge well.

There are several ways to deal with a small data set. One common way is transfer learning (see example here), which leverages a pretrained network and trains only a small portion of the large network. Another is to apply data augmentation to generate more data. A third is to use a shallow neural network on dimension-reduced data. I will demonstrate the last approach, which is also what Classify did in the first place in your example.
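For illustration, the data-augmentation route might look like the sketch below. It perturbs each recording with small pitch and tempo changes; the specific ratios (one semitone, ±10% tempo) are arbitrary choices for the example, not tuned values, and `a` is assumed to be an Audio object.

```mathematica
(* generate extra training examples by perturbing each recording *)
augment[a_Audio] := {
   a,
   AudioPitchShift[a, 2^(1/12)],  (* shift up one semitone *)
   AudioPitchShift[a, 2^(-1/12)], (* shift down one semitone *)
   AudioTimeStretch[a, 0.9],      (* slightly faster *)
   AudioTimeStretch[a, 1.1]       (* slightly slower *)
   };

(* five labeled examples per original recording *)
augmentedData = Flatten[
   Thread[augment[Audio[#]] -> "wind"] & /@ windInstrument];
```

Each original clip becomes five labeled examples, which helps when there are only 18 recordings in total.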

First of all, we convert the audio into spectrograms.

windInstrument = ExampleData[{"Sound", #}] & /@ {"AltoFlute","AltoSaxophone", "BassClarinet", "BassFlute", "Flute","FrenchHorn", "Oboe", "SopranoSaxophone", "TenorTrombone", "Trumpet", "Tuba"};
nonwindInstrument = ExampleData[{"Sound", #}] & /@ {"Cello","CelloPizzicato", "DoubleBass", "DoubleBassPizzicato", "OrganChord","Viola", "Violin"};

audioToSpectrogram[a_] := Module[{data},
  data = SpectrogramArray[a, Automatic, Automatic, HannWindow];
  (* keep the positive-frequency half of the spectrum, flip so low
     frequencies appear at the bottom, and resize to a fixed image size *)
  ImageResize[
   ImageAdjust@
    Image[Reverse@Transpose@Abs[data[[All, 1 ;; Dimensions[data][[2]]/2]]]],
   {266, 277}]
  ]

spectrogramData = 
  audioToSpectrogram /@ Flatten[{windInstrument, nonwindInstrument}];

We then construct a dimension reduction function from our data. Each original 266-by-277 spectrogram image is reduced to a vector of length 17.

dr = DimensionReduction[spectrogramData, 17, 
  Method -> "Linear"]

Now we construct our training data from the dimension-reduced data:

trainingData = 
  RandomSample@
   MapAt[dr[audioToSpectrogram[#]] &, 
    Flatten[{Thread[windInstrument -> "wind"], 
      Thread[nonwindInstrument -> "nonwind"]}], {All, 1}];

Construct and train the neural network; we use only two linear layers:

net = NetChain[{10, Tanh, 2, Tanh, SoftmaxLayer[]}, "Input" -> 17, 
  "Output" -> NetDecoder[{"Class", {"wind", "nonwind"}}]];

trained = 
 NetTrain[net, trainingData, 
  Method -> {"SGD", "L2Regularization" -> 0.1}, 
  MaxTrainingRounds -> 500]

Evaluate on test data:

trained[dr@audioToSpectrogram[ExampleData[{"Sound", #}]]] & /@ {"Clarinet", "Piano", "Bassoon"}
(* {"wind", "nonwind", "wind"} *)

In 11.3, NetEncoder supports Audio objects.


windInstrument = 
  ExampleData[{"Sound", #}] & /@ {"AltoFlute", "AltoSaxophone", 
    "BassClarinet", "BassFlute", "Flute", "FrenchHorn", "Oboe", 
    "SopranoSaxophone", "TenorTrombone", "Trumpet", "Tuba"};
nonwindInstrument = 
  ExampleData[{"Sound", #}] & /@ {"Cello", "CelloPizzicato", 
    "DoubleBass", "DoubleBassPizzicato", "OrganChord", "Viola", 
    "Violin"};

(*Convert Sound to Audio that fits NetTrain's Input*)
trainingData = Join[Thread[Audio /@ windInstrument -> "windInstrument"], 
                    Thread[Audio /@ nonwindInstrument -> "nonwindInstrument"]];


Let's construct the network.

Here I use the "AudioMelSpectrogram" encoder; mel spectrograms and MFCCs ("AudioMFCC") are common features in speech tasks.

net = NetChain[{LongShortTermMemoryLayer[30], 
   LongShortTermMemoryLayer[10], SequenceLastLayer[], 2, 
   SoftmaxLayer[]},
  "Input"  -> NetEncoder["AudioMelSpectrogram"], 
  "Output" -> NetDecoder[{"Class", {"windInstrument", "nonwindInstrument"}}]]


net = NetTrain[net, trainingData];
types = {"Clarinet", "Piano", "Bassoon"};
net[Audio@ExampleData[{"Sound", #}]] & /@ types
(*{"windInstrument","nonwindInstrument","windInstrument"}*)
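Since MFCCs were mentioned above as the other common speech feature, swapping the front end is a one-line change. This is a sketch with the same architecture and the encoder left at its default parameters, not a tuned variant:

```mathematica
(* the same network with an MFCC front end instead of a mel spectrogram *)
netMFCC = NetChain[{LongShortTermMemoryLayer[30],
    LongShortTermMemoryLayer[10], SequenceLastLayer[], 2,
    SoftmaxLayer[]},
   "Input"  -> NetEncoder["AudioMFCC"],
   "Output" -> NetDecoder[{"Class", {"windInstrument", "nonwindInstrument"}}]];

netMFCC = NetTrain[netMFCC, trainingData];
```

MFCCs are a lower-dimensional, decorrelated summary of the mel spectrogram, so training is typically a bit cheaper at some cost in detail.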

More recent work, such as WaveNet and SampleRNN, focuses on the raw audio itself (NetEncoder["Audio"]):

net = NetChain[{ConvolutionLayer[32, 80, "Interleaving" -> True], 
   LongShortTermMemoryLayer[10], SequenceLastLayer[], 2, 
   SoftmaxLayer[]}, "Input" -> NetEncoder["Audio"], 
  "Output" -> 
   NetDecoder[{"Class", {"windInstrument", "nonwindInstrument"}}]]

But this uses so much memory that it is impractical in real applications.

The features from xslittlegrass's answer are similar to NetEncoder["AudioSpectrogram"].
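To see the correspondence, you can inspect the encoder's output directly and compare it with the hand-rolled `audioToSpectrogram` above (a quick check with default encoder parameters; the exact dimensions depend on the clip length):

```mathematica
(* the built-in spectrogram encoder, analogous to audioToSpectrogram *)
specEnc = NetEncoder["AudioSpectrogram"];

(* a {time steps, frequency bins} array of magnitudes *)
Dimensions@specEnc[Audio@ExampleData[{"Sound", "Flute"}]]
```

The main difference is that the encoder emits a variable-length sequence, whereas `audioToSpectrogram` resizes to a fixed image before the dimension reduction step.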

PS:

This approach supports out-of-core training, which means it can be used in real applications:

enc = NetEncoder["AudioMelSpectrogram"];
file = "ExampleData/rule30.wav";
a1 = Import[file];
(*in-core Audio object*)
a2 = Audio[file];
(*out-of-core Audio object*)
Dimensions /@ enc[{a1, a2}]
(*{{215,40},{215,40}}*)
ByteCount /@ {a1, a2}
(*{161128,424}*)