Is the origin of a file traceable? If it is, how can I sanitise it?

The short answer is it depends:

  • If the file contained your name, address, telephone number, and social security number it would not be very difficult to trace it back to you ...

  • A lot of applications leave identifying information of some kind - known as Metadata - in files in addition to the obvious visible data in the file itself.

  • Metadata can usually be removed from files (the removal method depends on the type of file).

  • Uploading a file will send only the primary data stream, and leave alternate data streams and filesystem-resident metadata behind.

  • As pointed out by Andrew Morton, some organisations make small grammatical (or other) changes to each copy of a document before it gets distributed.

    By doing this copies can be tracked to particular individuals if the copy gets stolen (or passed on). This, of course, is very difficult to defeat.

  • Read on for more information about the kind of sensitive and hidden data that can be associated with different kinds of files and how to clean (sanitise) them.


Are plain text files safe to use?

As pointed out by Uwe Ziegenhagen, even Windows plain text files (as well as any other file type) on an NTFS file system can potentially contain metadata, in the form of Alternate Data Streams. See also How To Use NTFS Alternate Data Streams.

Alternate data streams allow files to be associated with more than one data stream. For example, a file such as text.txt can have an ADS with the name of text.txt:secret.txt (of form filename:ads) that can only be accessed by knowing the ADS name or by specialized directory browsing programs.

Alternate streams are not detectable in the original file's size but are lost when the original file (i.e. text.txt) is deleted, or when the file is copied or moved to a partition that doesn't support ADS (e.g. a FAT partition, a floppy disk, or a network share). While ADS is a useful feature, it can also easily eat up hard disk space if unknown either through being forgotten or not being detected.

This feature is only supported if files are on an NTFS drive.

Source UltraEdit File Open Dialog.
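
As a rough illustration of the filename:ads form described above, here is a minimal Python sketch. The helper names (ads_path, write_stream, read_stream) are made up for this example. It assumes Windows with the file on an NTFS volume; on other file systems the same calls simply create an ordinary file with a colon in its name.

```python
import os
import tempfile


def ads_path(filename: str, stream: str) -> str:
    """Build the 'filename:ads' form used to address an alternate stream."""
    return f"{filename}:{stream}"


def write_stream(filename: str, stream: str, data: bytes) -> None:
    # On NTFS under Windows this writes to the alternate stream of `filename`.
    # Elsewhere it just creates a separate file whose name contains a colon.
    with open(ads_path(filename, stream), "wb") as f:
        f.write(data)


def read_stream(filename: str, stream: str) -> bytes:
    with open(ads_path(filename, stream), "rb") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        target = os.path.join(folder, "text.txt")
        with open(target, "w") as f:
            f.write("visible contents\n")
        write_stream(target, "secret.txt", b"hidden contents")
        # The reported size covers the primary stream only.
        print(os.path.getsize(target))
        print(read_stream(target, "secret.txt"))
```

On an NTFS volume the reported size of text.txt should not change after the second write, which is why the stream goes unnoticed in a normal directory listing.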


Viewing and Deleting Alternate Data Streams

Notes:

  • Any file on an NTFS file system can have an alternate data stream attached to it (not just text files).
  • For more information about the potential security issues associated with alternate data streams see Hidden Threat: Alternate Data Streams

Notepad and Word can be used (from the command line) to open and read alternate data streams. See this answer NTFS alternate data streams by nishi for more information.

UltraEdit can open alternate data streams from within the program itself.

AlternateStreamView can be used to delete alternate data streams:

AlternateStreamView is a small utility that allows you to scan your NTFS drive, and find all hidden alternate streams stored in the file system.

After scanning and finding the alternate streams, you can extract these streams into the specified folder, delete unwanted streams, or save the streams list into a text, HTML, CSV or XML file.

Source AlternateStreamView by Nirsoft


How about images?

As pointed out by Scott, images can also contain concealed data (a file, message, another image, or a video) using steganography:

Steganography includes the concealment of information within computer files. In digital steganography, electronic communications may include steganographic coding inside of a transport layer, such as a document file, image file, program or protocol.

Media files are ideal for steganographic transmission because of their large size. For example, a sender might start with an innocuous image file and adjust the color of every 100th pixel to correspond to a letter in the alphabet, a change so subtle that someone not specifically looking for it is unlikely to notice it.

Source steganography

This, of course, is very difficult to remove.
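
To make the "every 100th pixel" idea from the quote concrete, here is a minimal sketch. It is a least-significant-bit variant of my own choosing rather than the exact scheme in the quote: each message bit is stored in the lowest bit of every step-th byte of a raw pixel buffer. The function names and the raw-bytes carrier are assumptions for illustration; real tools work on actual image formats and usually encrypt or scatter the payload.

```python
def embed(pixels: bytes, message: bytes, step: int = 100) -> bytearray:
    """Hide `message` in the lowest bit of every `step`-th byte of `pixels`."""
    # Flatten the message into bits, most significant bit first.
    bits = [(byte >> i) & 1 for byte in message for i in range(7, -1, -1)]
    if bits and (len(bits) - 1) * step >= len(pixels):
        raise ValueError("carrier too small for message")
    out = bytearray(pixels)
    for n, bit in enumerate(bits):
        idx = n * step
        # Clear the lowest bit, then set it to the message bit.
        out[idx] = (out[idx] & 0xFE) | bit
    return out


def extract(pixels: bytes, length: int, step: int = 100) -> bytes:
    """Recover `length` bytes previously hidden by embed()."""
    result = bytearray()
    for b in range(length):
        value = 0
        for i in range(8):
            value = (value << 1) | (pixels[(b * 8 + i) * step] & 1)
        result.append(value)
    return bytes(result)


if __name__ == "__main__":
    carrier = bytes(range(256)) * 40          # stand-in for raw pixel data
    stego = embed(carrier, b"hi")
    print(extract(stego, 2))
```

Each touched byte changes by at most 1, which is why the alteration is invisible to the eye, and why re-encoding or resizing the image is about the only practical way to disturb it.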

See also Steganography - A Data Hiding Technique and Steganography Software


What about Excel spreadsheets or Word documents?

By default office documents contain personal information:

  • This information can be removed, see the link below.

Word:

  • Consider using a plain text file, created with Notepad or another editor, instead of a Word document

Spreadsheet:

  • Consider using a CSV file, created with Excel and saved as CSV, or create a CSV directly with another program such as Notepad.

Word documents can contain the following types of hidden data and personal information:

  • Comments, revision marks from tracked changes, versions, and ink annotations

    If you collaborated with other people to create your document, your document might contain items such as revision marks from tracked changes, comments, ink annotations, or versions. This information can enable other people to see the names of people who worked on your document, comments from reviewers, and changes that were made to your document.

  • Document properties and personal information

    Document properties, also known as metadata, include details about your document such as author, subject, and title. Document properties also include information that is automatically maintained by Office programs, such as the name of the person who most recently saved a document and the date when a document was created. If you used specific features, your document might also contain additional kinds of personally identifiable information (PII), such as e-mail headers, send-for-review information, routing slips, and template names.

  • Headers, footers, and watermarks

    Word documents can contain information in headers and footers. Additionally, you might have added a watermark to your Word document.

  • Hidden text

    Word documents can contain text that is formatted as hidden text. If you do not know whether your document contains hidden text, you can use the Document Inspector to search for it.

  • Document server properties

    If your document was saved to a location on a document management server, such as a Document Workspace site or a library based on Microsoft Windows SharePoint Services, the document might contain additional document properties or information related to this server location.

  • Custom XML data

    Documents can contain custom XML data that is not visible in the document itself. The Document Inspector can find and remove this XML data.

Note:

  • The Word Document Inspector won't detect white-colored text or images with steganography (a concealed file, message, image, or video).

Source Remove hidden data and personal information by inspecting documents
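
If you want to see some of these document properties for yourself, here is a small sketch. It assumes the newer Office Open XML formats (.docx, .xlsx), which are ZIP archives that normally hold the core properties in docProps/core.xml; it does not apply to the legacy binary .doc/.xls formats. The function name and field labels are my own, and this only reads the properties, it does not replace the Document Inspector.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part of an Office Open XML file.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Label (hypothetical, for display) -> element holding identifying data.
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def core_properties(document) -> dict:
    """Return identifying core properties found in a .docx/.xlsx file.

    `document` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(document) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    found = {}
    for label, tag in FIELDS.items():
        node = root.find(tag, NS)
        if node is not None and node.text:
            found[label] = node.text
    return found
```

Calling core_properties("report.docx") on a file of your own (the filename here is just a placeholder) returns a dictionary of whatever author and timestamp fields are present. Comments, tracked changes and custom XML live in other parts of the archive and are not covered by this sketch.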


What if I use a PDF file, obtained from someone else?

PDFs are not safe:

  • They can contain viruses, see Can a PDF file contain a virus?

  • They can contain JavaScript. If the JavaScript were to "phone home" every time the PDF was opened, there could be a nice trail including your IP address.

  • PDFs can also contain hidden information:

    PDF has also been frequently used as a distribution format for files originally created in Microsoft Office because hidden data and metadata can be sanitized (or redacted) during the conversion process.

    Despite this common use of PDF documents, users who distribute these files often underestimate the possibility that they might contain hidden data or metadata. This document identifies the risks that can be associated with PDF documents and gives guidance that can help users reduce the unintentional release of sensitive information.

Source Hidden Data and Metadata in Adobe PDF Files:
Publication Risks and Countermeasures, a document written by the NSA


How can I check the PDF file to make sure it doesn't contain any sensitive information?

You can follow the advice given by the NSA to sanitise your PDF.

  • I have summarised the basic steps you need to follow.
  • Detailed step by step instructions with screen shots are available from the link below.

This paper describes procedures for sanitizing PDF documents for static publication. Sanitization for the purpose of this document means removing hidden data and dynamic content not intended for publication (for example, the username of the author or interim editing comments embedded in the file but not visible on any pages).

Hidden data includes:

  • Metadata

  • Embedded Content and Attached Files

  • Scripts

  • Hidden Layers

  • Embedded Search Index

  • Stored Interactive Form Data

  • Reviewing and Commenting

  • Hidden Page, Image, and Update Data

  • Obscured Text and Images

  • PDF (Non-Displayed) Comments

  • Unreferenced Data

...

Detailed Sanitization Procedure

  1. Sanitize Source File

    If the application that generated the source file has a sanitization utility, it should be applied before converting to PDF.

  2. Configure Security Settings

    • Ensure that all applicable Acrobat updates have been downloaded and installed
    • Disable JavaScript
    • Verify that the trust manager settings are set appropriately

  3. Run Preflight

    Preflight ensures that the file contents are compatible with the destination version, and applies ‘fixups’ as necessary.

  4. Run the PDF Optimizer

    • If the PDF file contains other attached files, a warning message appears. Click ‘OK’ to continue. The attached files will be removed during PDF optimization.
    • Document tags pose a hidden data risk. This procedure (specifically the checked option for ‘Discard document tags’) removes them from the sanitized PDF.

  5. Run the Examine Document Utility

    • This helps to find text hidden behind objects as well as any other areas that might have been missed in the previous steps.

Source Hidden Data and Metadata in Adobe PDF Files:
Publication Risks and Countermeasures, a document written by the NSA
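
As a quick first look before (not instead of) the procedure above, you can count a few PDF name tokens that correspond to the hidden-data categories in the list. This is only a rough heuristic sketch of my own, not part of the NSA procedure: the marker list is a small assumed selection, and the scan will miss anything stored inside compressed object streams or written with hex-escaped names.

```python
import re

# PDF name tokens loosely matching the hidden-data categories listed above.
# This selection is an assumption for illustration, not an exhaustive list.
MARKERS = [
    b"/JavaScript",    # scripts
    b"/JS",            # scripts (the action's code entry)
    b"/OpenAction",    # action run when the document is opened
    b"/EmbeddedFile",  # attached files
    b"/Metadata",      # XMP metadata stream
    b"/Info",          # document information dictionary (author, dates)
    b"/AcroForm",      # interactive form data
]


def scan_pdf(data: bytes) -> dict:
    """Count occurrences of each marker in the raw bytes of a PDF."""
    counts = {}
    for name in MARKERS:
        # Require that the name is not followed by another letter or digit,
        # so a longer name that merely starts with the marker does not match.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        hits = len(re.findall(pattern, data))
        if hits:
            counts[name.decode("ascii")] = hits
    return counts


if __name__ == "__main__":
    sample = (
        b"%PDF-1.4\n"
        b"1 0 obj << /Type /Catalog /OpenAction 2 0 R >> endobj\n"
        b"2 0 obj << /S /JavaScript /JS (app.alert(1)) >> endobj\n"
        b"trailer << /Info 3 0 R >>\n"
    )
    print(scan_pdf(sample))
```

An empty result does not prove a file is clean; a non-empty one just tells you where to look first when running the optimizer and Examine Document steps.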


But I have antivirus software!

Even antivirus software is not guaranteed to catch everything. See zero day exploit:

A zero-day (also known as zero-hour or 0-day) vulnerability is a previously undisclosed computer-software vulnerability that hackers can exploit to adversely affect computer programs, data, additional computers or a network.

It is known as a "zero-day" because once the flaw becomes known, the software's author has zero days in which to plan and advise any mitigation against its exploitation (for example, by advising workarounds or by issuing patches)

Source zero day


What about my USB drive? Do I need to worry about that?

You cannot guarantee your USB flash drive is safe.

USB peripherals, such as thumb drives, can be reprogrammed to steal the contents of anything written to the drive and to spread the firmware-modifying code to any PCs it touches. The net result could be a self-replicating virus that spreads through sparing thumb drives, much like the rudimentary viruses that spread by floppy disk decades ago.

Source Why your USB device is a security risk


It depends on the file type. For example, all Microsoft Office applications (Word, Excel, etc) store the following information in the file:

  • computer name (where the file was saved)
  • name of the Author (by default, name of the person to whom Microsoft Office is registered, but this can be easily changed)
  • date when the file was created
  • date when the file was last saved

The above information is usually called file metadata.

If you save the document as a plain-text file, i.e. document.TXT (opens with Notepad), then none of this metadata will be saved in the file.

Tread carefully :)