How should I more securely store files obtained by mirroring a website?

The file types you listed and the goals you presented add up to a huge attack surface. Against a sophisticated adversary, that greatly increases the chance of exploitation. PDF and PPT/PPTX files are particularly problematic. If you cannot limit yourself to far fewer file types, you will need to isolate your activities, using either privilege separation or remote virtualization.

Privilege separation

This approach applies if you need to do this work on your local computer. While it would be difficult to sandbox every application you will be using individually, you can create a new user on your computer with few privileges:

  • Disable access to su and sudo from the new user, and do not use them as that user.
  • Do not su to the less-privileged user from root, to avoid TTY pushback attacks (input injected into root's terminal via the TIOCSTI ioctl).
  • Use iptables to disable network access for that user.
  • Set resource limits to reduce the amount of damage an exploited application can do.
  • Use Wayland instead of Xorg if possible, or Xorg with systemd-logind to run it as non-root.
  • Enable the Secure Attention Key and use it when you finish and switch to a new session.
  • Scan for and remove all unnecessary setuid and setgid files, as well as files with capabilities set via setcap.
  • Use an auditing framework like auditd to monitor potentially malicious activities.
  • Apply general system hardening such as sysctl tweaks or hardening patches.
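
As a rough illustration of the setuid/setgid scan mentioned in the list, here is a minimal Python sketch. The function name find_setid_files is made up for this example. It only inspects the mode bits of regular files, so it does not detect file capabilities (those are stored separately, in an extended attribute), and it silently skips anything it cannot stat. Treat it as a starting point rather than a complete audit.

```python
import os
import stat


def find_setid_files(root):
    """Yield (path, kind) for regular files under root that have the
    setuid and/or setgid bit set. Hypothetical helper for illustration."""
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are not followed
                st = os.lstat(path)
            except OSError:
                continue  # vanished or unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue
            kinds = []
            if st.st_mode & stat.S_ISUID:
                kinds.append("setuid")
            if st.st_mode & stat.S_ISGID:
                kinds.append("setgid")
            if kinds:
                yield path, "+".join(kinds)
```

You would typically run this as root over a directory such as /usr and review each reported file by hand before removing the bit.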

Depending on the sophistication of the adversary, this may not be enough. Even against an adversary of moderate capability it is rather incomplete, but it is a starting point. You have to assume (most likely correctly) that the applications you use to open these files are vulnerable to arbitrary code execution, so the question becomes "How can I safely run untrusted code?", which is of course extremely broad.
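
To make the resource-limit item a little more concrete, here is a minimal Python sketch that starts a program with lowered limits on CPU time, address space, output file size, and core dumps. The names run_limited and _lower and all of the default values are assumptions chosen for this example. Resource limits only bound how much damage a process can do; they are not a sandbox and do nothing about file or network access.

```python
import resource
import subprocess


def _lower(limit, value):
    """Set both soft and hard limit to value, never above the current hard limit."""
    _soft, hard = resource.getrlimit(limit)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(limit, (value, value))


def run_limited(argv, cpu_seconds=10, memory_bytes=1024 ** 3,
                max_file_bytes=64 * 1024 ** 2):
    """Run argv with reduced resource limits (illustrative defaults)."""
    def apply_limits():
        # Runs in the child process just before exec.
        _lower(resource.RLIMIT_CPU, cpu_seconds)
        _lower(resource.RLIMIT_AS, memory_bytes)
        _lower(resource.RLIMIT_FSIZE, max_file_bytes)
        _lower(resource.RLIMIT_CORE, 0)

    return subprocess.run(argv, preexec_fn=apply_limits,
                          stdin=subprocess.DEVNULL,
                          capture_output=True, check=False)
```

For limits that apply to every process of the low-privilege user, a system-wide mechanism is a better fit than a per-command wrapper like this one.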

Virtual Private Servers

An easier solution would be to use a VPS. You can run applications remotely on the VPS rather than on your own computer and interact with them over SSH. Even if the VPS is completely compromised, the attack surface is reduced to that of your SSH client and your terminal, which is fairly small. As you will not be able to view images directly over SSH (at least not safely), you may want to convert them to a very simple (and difficult to exploit) image format before transferring them to your local computer for viewing. An example of such a format is PPM, the portable pixmap format. This also works for PDF files, as they can be readily converted into image files using various utilities.
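
To show why a format like PPM is hard to exploit, here is a minimal sketch of a strict parser for the binary ("P6") variant. The function name parse_ppm and the max_pixels cap are assumptions made for this example. It is deliberately stricter than the format allows: it rejects header comments and any maxval other than 255. The point is that the whole format is a short text header followed by raw RGB bytes, so a viewer has very little to get wrong.

```python
import re

# "P6", width, height, maxval, then exactly one whitespace byte before the raster.
_HEADER = re.compile(rb"P6\s+(\d{1,5})\s+(\d{1,5})\s+(\d{1,5})\s")


def parse_ppm(data, max_pixels=50_000_000):
    """Return (width, height, raster_bytes) for a strict binary PPM image.
    Raises ValueError on anything unexpected."""
    m = _HEADER.match(data)
    if m is None:
        raise ValueError("not a plain P6 header (comments are rejected here)")
    width, height, maxval = (int(g) for g in m.groups())
    if maxval != 255:
        raise ValueError("only 8-bit samples are accepted in this sketch")
    if width == 0 or height == 0 or width * height > max_pixels:
        raise ValueError("unreasonable image dimensions")
    raster = data[m.end():]
    # Three bytes (R, G, B) per pixel, no padding, no trailing data.
    if len(raster) != width * height * 3:
        raise ValueError("raster size does not match the header")
    return width, height, raster
```

Running a check like this on the local side before opening the file in a viewer means the viewer only ever sees input whose size and layout have already been validated.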

The host of a VPS is able to access and modify anything on the VPS. If this is an issue for you (for example, if the files are extremely sensitive, or integrity is of utmost importance), you may not want to use a VPS. In your case this is unlikely to be an issue: the website you downloaded from is (presumably) already public, so there should be no confidentiality concern. You can increase the confidentiality and integrity of the data by using a dedicated server instead, although that would be more expensive.

You should keep a local backup of these files in case the VPS is shut down, so that you can restore them to another VPS later. The local files should be stored in an "inert" form that will not be susceptible to exploitation of any indexers or thumbnail generators you may have on your system. This can be done, for example, by putting the files in an archive such as a tar file.
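
As a minimal sketch of the archiving step, the following Python packs a mirrored directory into an uncompressed tar file and lists its contents without extracting anything. The function names pack_mirror and list_archive are made up for this example. Whether a particular indexer or thumbnailer on your system looks inside tar archives is something you should verify yourself.

```python
import os
import tarfile


def pack_mirror(src_dir, archive_path):
    """Write src_dir (recursively) into an uncompressed tar archive."""
    top = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(archive_path, "w") as tar:
        # arcname keeps paths in the archive relative to the top-level folder
        tar.add(src_dir, arcname=top)


def list_archive(archive_path):
    """Return the member names without extracting any file contents."""
    with tarfile.open(archive_path, "r") as tar:
        return tar.getnames()
```

Once the archive has been written and checked, the unpacked copy can be deleted so that only the tar file remains on the local disk.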

As you will need to use command-line utilities rather than graphical ones, you will need to find ways to work with these files over SSH. Some examples of what you can do remotely on a VPS:

  • PNG, TIFF, JPEG, and (non-animated) GIF can be converted to PPM and then viewed safely on your local computer.
  • XLS/XLSX can be converted to CSV, which is easy to read on the command line. There is a good vim plugin for CSV files.
  • RTF, DOC, DOCX, and ODF can be converted to images, which can be viewed securely on your local computer.
  • I am not aware of any way to view PPT/PPTX on the command line, though you can analyze them.
  • HTML, CSS, and JS can be viewed in a text editor or run on a remote text-based browser.
  • PHP can be viewed in a text editor or executed using a command-line PHP interpreter.
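
For the CSV case, here is a minimal Python sketch that prints a CSV file as aligned columns. Since your terminal is part of the remaining attack surface, it also replaces every non-printable character (including the escape character that starts terminal control sequences) with "?" before anything is displayed. The names sanitize and format_csv and the max_width default are assumptions made for this example.

```python
import csv
import io


def sanitize(cell, max_width):
    """Replace non-printable characters and truncate overly long cells."""
    clean = "".join(ch if ch.isprintable() else "?" for ch in cell)
    if len(clean) > max_width:
        clean = clean[: max_width - 1] + "~"
    return clean


def format_csv(text, max_width=20):
    """Return CSV text rendered as aligned, sanitized columns."""
    reader = csv.reader(io.StringIO(text, newline=""))
    rows = [[sanitize(c, max_width) for c in row] for row in reader]
    if not rows:
        return ""
    ncols = max(len(r) for r in rows)
    widths = [0] * ncols
    for row in rows:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], len(cell))
    lines = []
    for row in rows:
        padded = row + [""] * (ncols - len(row))
        line = " | ".join(c.ljust(w) for c, w in zip(padded, widths))
        lines.append(line.rstrip())
    return "\n".join(lines)
```

The same sanitizing step is worth applying to any other untrusted text you print to the terminal, such as HTML or PHP source.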