While I agree that schema:caption looks like a good predicate to use, the schema.org folks have defined it a bit oddly: it’s not listed for the general MediaObject class, but only for the AudioObject, ImageObject and VideoObject classes. A MusicVideoObject or a DataDownload, for example, apparently shouldn’t have a caption.
With the extended representation, I guess that should be fine, because each MediaInfo would be an instance of AudioObject, ImageObject or VideoObject. (Well… what happens for other media types?) But in the basic representation, each MediaInfo is only an instance of MediaObject, which isn’t one of the classes schema:caption is listed for… is that an issue?
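To make the mismatch concrete, here’s a rough Python sketch. The class table and the caption domain are hand-copied from schema.org as I understand it (only the classes mentioned above), and `caption_in_domain` is a made-up helper, not part of any library. Also keep in mind that schema.org expresses this via `domainIncludes`, which is advisory rather than a strict `rdfs:domain`, so this shows the mismatch but doesn’t settle whether it’s a real problem.

```python
# Assumption: a hand-written snapshot of the relevant schema.org classes,
# not fetched from the actual vocabulary.
SUBCLASS_OF = {
    "AudioObject": "MediaObject",
    "ImageObject": "MediaObject",
    "VideoObject": "MediaObject",
    "MusicVideoObject": "MediaObject",
    "DataDownload": "MediaObject",
    "MediaObject": "CreativeWork",
}

# Assumption: the classes schema:caption is listed for (domainIncludes).
CAPTION_DOMAIN = {"AudioObject", "ImageObject", "VideoObject"}


def ancestors(cls):
    """Yield cls followed by each of its superclasses in the table."""
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF.get(cls)


def caption_in_domain(cls):
    """Hypothetical helper: True if cls, or a superclass of it, is in the caption domain."""
    return any(c in CAPTION_DOMAIN for c in ancestors(cls))


for cls in ["ImageObject", "MediaObject", "MusicVideoObject", "DataDownload"]:
    print(cls, caption_in_domain(cls))
```

So under this reading, the extended representation (typed as ImageObject etc.) passes, while the basic one (typed only as MediaObject) doesn’t.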