The background of auto-tagging and clustering is well explained in this post; however, the code developed to perform it is not generic. I would rather use an open-source product like RapidMiner to do the same thing with much less effort and much more flexibility.
The second problem is the generic nature of the problem solving. Coming from an oil and gas background, with some experience of handling data, I believe the music doesn’t play well unless domain expertise and the problem are well blended at the initial stage.
I would love to craft something closer to a domain problem, such as extracting the scout information captured during well drilling and clustering it into only those phrases that reflect drilling efficiency, incidents, mistakes, and observations.
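
As a rough sketch of that idea, the Python below groups free-text drilling remarks by hand-picked domain keywords. This is plain keyword matching standing in for real clustering, and everything in it is an assumption made for illustration: the category keywords, the sample remarks, and the function names are hypothetical, and a domain expert would supply the real vocabulary.

```python
import re
from collections import defaultdict

# Hypothetical keyword lists per category (illustrative only;
# a drilling engineer would define the real vocabulary).
CATEGORY_KEYWORDS = {
    "efficiency": {"rop", "npt", "downtime"},
    "incident": {"kick", "stuck", "losses"},
    "mistake": {"wrong", "incorrect", "misread"},
    "observation": {"observed", "noted", "cuttings"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def group_remarks(remarks):
    """Map each category to the remarks sharing at least one of its keywords.

    Remarks that match no category are dropped, which keeps only the
    phrases relevant to the domain problem.
    """
    groups = defaultdict(list)
    for remark in remarks:
        tokens = tokenize(remark)
        for category, keywords in CATEGORY_KEYWORDS.items():
            if tokens & keywords:
                groups[category].append(remark)
    return dict(groups)


# Made-up remarks, not real scout data.
SAMPLE_REMARKS = [
    "ROP dropped after bit change",
    "Stuck pipe at 2450 m, worked free",
    "Crew had lunch",
    "Observed gas in cuttings",
]

if __name__ == "__main__":
    for category, matched in group_remarks(SAMPLE_REMARKS).items():
        print(category, "->", matched)
```

The point of starting from domain keywords is that irrelevant remarks fall away before any clustering happens; a proper clustering step could then run on what remains.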