Document Metadata Extractor
About the Document Metadata Extractor
Extracts metadata from public documents in formats such as pdf, doc, xls, ppt, docx, pptx, and xlsx.
The metadata may contain: author name, username, company name, software version, document path, creation date, etc.
The Metadata Extractor connects to the target URL, downloads the document(s) found, parses them and extracts all metadata identified.
The tool can extract metadata from multiple documents at once: if the target URL points to a web page that links to several documents, all of them are searched for metadata.
- Document(s) URL: This is the URL of the document(s) that will be downloaded and parsed for metadata. If the URL points to a web page which contains links to multiple documents, all of them will be downloaded and parsed.
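The link-discovery step described above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation: the names `DocumentLinkParser` and `find_document_links` are hypothetical, and the sketch assumes that documents are identified purely by the file extension in the link path. Downloading the page and the documents is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Extensions taken from the list of supported document types above.
DOCUMENT_EXTENSIONS = (".pdf", ".doc", ".xls", ".ppt", ".docx", ".pptx", ".xlsx")


class DocumentLinkParser(HTMLParser):
    """Collects absolute URLs of <a href> links that point to documents."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.document_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Compare only the path so a query string does not hide the extension.
        path = urlparse(absolute).path.lower()
        if path.endswith(DOCUMENT_EXTENSIONS) and absolute not in self.document_urls:
            self.document_urls.append(absolute)


def find_document_links(html, base_url):
    """Return the document URLs linked from an HTML page, in page order."""
    parser = DocumentLinkParser(base_url)
    parser.feed(html)
    return parser.document_urls


if __name__ == "__main__":
    # Hypothetical page content; example.com is a placeholder domain.
    sample_page = """
    <a href="/files/report.pdf">Annual report</a>
    <a href="slides.PPTX?download=1">Slides</a>
    <a href="/about.html">About</a>
    """
    for url in find_document_links(sample_page, "https://example.com/docs/"):
        print(url)
```

Matching on the extension is a simplification: a real crawler would probably also check the `Content-Type` header, since documents can be served from URLs without a recognizable extension.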
What is document metadata?
Whenever you create or modify a document (PDF, Office, etc.), the editor application automatically embeds information about it inside the file: the document author, creation date, modification date, the type and version of the editor software (e.g. Microsoft Office 2013), the path on disk where it was saved, company name, etc.
The metadata that gets saved is not standardized. It depends on the application that creates or edits the document, on the type of document, and on whether the author manually removed it.
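As an illustration of where such metadata lives, the following Python sketch reads the core properties of an Office Open XML file (docx, pptx, xlsx). These files are ZIP archives, and Office typically stores author and date fields in a part named `docProps/core.xml`. The sketch assumes that conventional location; the function names and the sample values ("Jane Doe", "jdoe") are made up for the demonstration, and other formats such as PDF or legacy doc need different parsers.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Conventional location of the core-properties part (an assumption of this sketch).
CORE_PROPERTIES_PATH = "docProps/core.xml"


def read_core_properties(document_bytes):
    """Return {property name: value} for the core properties of an OOXML file."""
    with zipfile.ZipFile(io.BytesIO(document_bytes)) as archive:
        if CORE_PROPERTIES_PATH not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read(CORE_PROPERTIES_PATH))
    properties = {}
    for element in root:
        # Tags look like "{namespace}creator"; keep only the local name.
        name = element.tag.rsplit("}", 1)[-1]
        if element.text and element.text.strip():
            properties[name] = element.text.strip()
    return properties


def build_sample_document():
    """Build a minimal in-memory package with made-up values, for the demo only."""
    core_xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<cp:coreProperties '
        'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
        'xmlns:dc="http://purl.org/dc/elements/1.1/" '
        'xmlns:dcterms="http://purl.org/dc/terms/">'
        '<dc:creator>Jane Doe</dc:creator>'
        '<cp:lastModifiedBy>jdoe</cp:lastModifiedBy>'
        '<dcterms:created>2015-03-01T10:00:00Z</dcterms:created>'
        '</cp:coreProperties>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(CORE_PROPERTIES_PATH, core_xml)
    return buffer.getvalue()


if __name__ == "__main__":
    for name, value in read_core_properties(build_sample_document()).items():
        print(f"{name}: {value}")
```

A file whose author stripped the metadata may still contain the part with empty elements, in which case the function returns an empty or partial dictionary.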
How to find public documents exposed on websites?
The easiest way to find URLs of public documents is to use search engines such as Google, Bing, Yahoo, etc. They already have this information because they have crawled most public websites.
For instance, to find various types of documents with Google, you can use search expressions such as the following:
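A minimal Python sketch that builds such expressions by combining Google's `site:` and `filetype:` operators, one query per document type. The function name `build_search_queries` is hypothetical and `example.com` is a placeholder for the target domain.

```python
# Document types taken from the list of supported formats above.
DOCUMENT_TYPES = ("pdf", "doc", "xls", "ppt", "docx", "pptx", "xlsx")


def build_search_queries(domain, document_types=DOCUMENT_TYPES):
    """Return one search expression per document type, restricted to a domain."""
    return [f"site:{domain} filetype:{doc_type}" for doc_type in document_types]


if __name__ == "__main__":
    # example.com is a placeholder; replace it with the domain of interest.
    for query in build_search_queries("example.com"):
        print(query)
```

Each printed line, for example `site:example.com filetype:pdf`, can be pasted into the search box as-is.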
How can document metadata be used by attackers?
Attackers can use the metadata embedded in documents in multiple scenarios. Here are some examples:
- Author names can be used to mount phishing attacks against the company's employees.
- Usernames can be used to attempt brute-force authentication attacks against the company's externally facing applications (webmail, VPN, blogs, etc.).
- Software type and version are useful for mapping the technologies used internally by an organization. Further attacks can be tailored to these technologies.
- A recent document creation or modification date could indicate that the author still works for the company.
- Other custom metadata may reveal additional interesting information.