Homebrew Temporal Upscaler
futonkris
May 17, 2026

-Why an Upscaler
When we hear about AI these days, it is usually about LLMs: this new frontier model has this many parameters, or performs this well on that benchmark. And whilst LLMs are cool and all, I just find them boring. So this got me thinking: what is a fun and interesting form of AI, or more broadly ML?
I landed on two ideas: a temporal upscaler, and a stock market predictor trained with sentiment analysis (I know stock market ML is really overdone and generally pointless, but I am more curious about the sentiment analysis portion; a sneak peek of what is to come, I suppose).
-What is Super Resolution
Anyone who has dabbled even a little in PC gaming knows what upscaling is. DLSS, FSR and XeSS are, in my opinion, truly transformative uses of AI; the ability to beat native rendering at times with just a quarter of the pixel count blew my mind. Super resolution works by taking a lower resolution input and outputting it at a higher resolution. It does this with a neural network trained on a huge collection of sharp images: each image is downscaled, the network re-upscales it, and the result is compared against the original, which lets the model learn the patterns of a high resolution image. Do this enough times and the model becomes good enough to take a low resolution input and scale it up several times over with no clear visual loss (think frontier models like DLSS 4.5 or FSR 4).
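To make the training setup a bit more concrete, here is a toy sketch of how a single low/high resolution training pair could be produced. The box-average downscale and the names `box_downscale` and `make_training_pair` are assumptions for illustration only; real pipelines typically use bicubic resampling on full colour images.

```python
def box_downscale(image, factor):
    """Downscale a 2D grayscale image (list of rows) by averaging factor x factor blocks."""
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image dimensions must be divisible by factor")
    low_res = []
    for y in range(0, height, factor):
        row = []
        for x in range(0, width, factor):
            block = [image[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        low_res.append(row)
    return low_res


def make_training_pair(high_res, factor=3):
    """Return (network input, training target) for one sharp image."""
    return box_downscale(high_res, factor), high_res


# A tiny 6x6 "sharp" image becomes a 2x2 input with the original as the target.
hr = [[float((x + y) % 256) for x in range(6)] for y in range(6)]
lr, target = make_training_pair(hr, factor=3)
print(len(lr), len(lr[0]), len(target), len(target[0]))
```

The network only ever sees `lr` as input; the loss compares its upscaled output against `target`.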
-The Idea
So here goes my attempt at "replicating" those models. The premise was simple: I have a bunch of 720p clips captured from my Steam Deck that I wanted to upscale to 4K. Since these are video clips, I wanted to ensure temporal stability, so the model would be:
- Temporal (Prev frame, current frame, future frame)
- 3x Upscale
- CNN based (a Transformer seemed like it would need more data and more training/inference time, and I have no experience with them)
- Optical flow computation (both RAFT and manually computed)
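As a quick sanity check on the list above, this small sketch shows why 3x is the right factor for 720p to 4K, and how (previous, current, next) frame triplets could be indexed. Clamping at the first and last frame is an assumption for illustration; the helper names are hypothetical.

```python
def output_resolution(width, height, scale=3):
    """Resolution after upscaling by an integer factor."""
    return width * scale, height * scale


def frame_triplets(num_frames):
    """Yield (prev, current, next) frame indices, clamped at the clip boundaries."""
    for i in range(num_frames):
        yield max(i - 1, 0), i, min(i + 1, num_frames - 1)


print(output_resolution(1280, 720))  # 720p times 3 in each dimension
print(list(frame_triplets(4)))
```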
After a brief and somewhat lazy search, I settled on the 24k REDS triplets as my dataset. It was 720p like my source files and seemed like decent enough data to train a small model. Next came hardware: in the early phases of this project I trained on my 9070 XT, but eventually had to switch to a RunPod VM with an RTX Pro 6000 when a driver update nuked my ROCm and PyTorch support (thank you AMD).
-First Attempts
Initially I opted for a simple single-direction, manually computed optical flow, but as the generated output felt lacklustre, I decided to give it some more oomph by switching to RAFT (a pretrained model for computing flow) and making the flow bi-directional. The rationale was simple: RAFT would be a better fit given my lack of data and my fairly rudimentary attempt at calculating flow, and bi-directional flow gives me more temporal stability (less ghosting, flickering and artefacts, and maybe smoother motion). So without further ado, here goes my first iteration, without degradation.
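For context, one common way flow is used in video super resolution is to warp the neighbouring frames onto the current frame before the network fuses them. The sketch below shows that idea with a nearest-neighbour backward warp on a tiny grayscale frame. It is a toy under stated assumptions (the function name, the `(u, v)` flow layout and the edge clamping are all made up for illustration), not the actual alignment code from this project.

```python
def warp_nearest(frame, flow):
    """Backward-warp a 2D frame with a per-pixel flow field.

    flow[y][x] = (u, v) means output pixel (x, y) is sampled from
    frame[y + v][x + u], rounded to the nearest pixel and clamped to the frame.
    """
    height, width = len(frame), len(frame[0])
    warped = []
    for y in range(height):
        row = []
        for x in range(width):
            u, v = flow[y][x]
            src_x = min(max(int(round(x + u)), 0), width - 1)
            src_y = min(max(int(round(y + v)), 0), height - 1)
            row.append(frame[src_y][src_x])
        warped.append(row)
    return warped


# Bi-directional use: warp both the previous and the next frame onto the current one.
prev_frame = [[1, 2, 3], [4, 5, 6]]
next_frame = [[7, 8, 9], [10, 11, 12]]
flow_to_prev = [[(1, 0)] * 3 for _ in range(2)]   # sample one pixel to the right
flow_to_next = [[(-1, 0)] * 3 for _ in range(2)]  # sample one pixel to the left
aligned_prev = warp_nearest(prev_frame, flow_to_prev)
aligned_next = warp_nearest(next_frame, flow_to_next)
print(aligned_prev, aligned_next)
```

In a real model the flow would come from RAFT (or the manual estimate) and the sampling would be bilinear rather than nearest-neighbour.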


And yeah... this was not an impressive showing. The model failed to really sharpen the characters' bodies or the foliage, and somehow even preserved compression artefacts. The only noticeable wins were a sharper UI, finer lines in the armour and temporal stability. My model's lack of performance was further put on display when I upscaled the same video with ESRGAN and it was completely blown out of the water. The only real edge my model had over ESRGAN was temporal stability (which makes sense given ESRGAN's purely spatial nature) and the preservation of some fine detail.


-Degradation Run
Looking into the differences between ESRGAN and my model, I found out that ESRGAN (at least in its Real-ESRGAN form) heavily degrades its training data so that the model learns to deal with compression artefacts, smoothed-out details and so on. Not able to think of anything else to try, I figured I would also add a degradation step back into the pipeline (a mild one, given my relatively small dataset), and 7 hours later my model had generated new upscaled footage.
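To illustrate what a mild degradation step might look like, here is a toy sketch that blurs the low resolution input, adds a little noise and then quantises the values as a crude stand-in for compression. Every choice here (the 3x3 box blur, the noise level, the quantisation step and the function names) is an assumption for illustration; real degradation pipelines typically apply actual JPEG or video codec compression.

```python
import random


def box_blur(image):
    """3x3 box blur on a 2D grayscale image, clamping at the edges."""
    height, width = len(image), len(image[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    src_y = min(max(y + dy, 0), height - 1)
                    src_x = min(max(x + dx, 0), width - 1)
                    total += image[src_y][src_x]
            row.append(total / 9.0)
        blurred.append(row)
    return blurred


def degrade(image, noise_sigma=2.0, quant_step=8, seed=0):
    """Blur, add Gaussian noise, then quantise (a rough proxy for compression)."""
    rng = random.Random(seed)  # seeded so a given frame degrades reproducibly
    degraded = []
    for row in box_blur(image):
        new_row = []
        for value in row:
            noisy = value + rng.gauss(0.0, noise_sigma)
            quantised = round(noisy / quant_step) * quant_step
            new_row.append(float(min(max(quantised, 0), 255)))
        degraded.append(new_row)
    return degraded


clean = [[float((x * 40 + y * 10) % 256) for x in range(4)] for y in range(4)]
dirty = degrade(clean)
print(dirty[0])
```

The degraded frame is what the network sees as input during training, whilst the loss is still computed against the clean high resolution target.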

-Evaluation
Whilst my model's sharpness and fidelity problems still persist, the compression issues were largely mitigated and the image even looked a bit clearer. That said, UI sharpness did seem to falter compared to the previous run, but overall it was a much cleaner image.
From the beginning, the two clearest problems were my model architecture and my dataset. REDS is fairly small, but more importantly it is real-life footage, not gameplay (one could argue the CNN is really just learning what a 3x super resolution looks like, so the type of video does not matter, but I thought it would make a discernible difference). Given that I really could not find any large datasets, I decided to make my own. I pulled YouTube videos at 4K, downscaled them to 1440p (to get rid of some YouTube compression artefacts) and simply cut the videos up into REDS-style folders.
Next I had to make model changes, namely adding a discriminator and a GAN loss. The short version of how a GAN works is that a generator (in my case, my existing model) outputs an image, and another network (the discriminator) tries to guess which images are "fake" and which are real. The idea is that whenever the discriminator wins, the generator is pushed to create new, plausible detail instead of a smoothed-out pixel average (which is what my current model produces).
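The tug of war between the two networks can be written down as two losses. The sketch below uses the standard binary cross-entropy GAN formulation on plain probabilities; it is an assumption for illustration, and the exact loss used in this project may differ. The function names are hypothetical.

```python
import math


def bce(prob, label):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    eps = 1e-7  # keep log() away from zero
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def discriminator_loss(real_probs, fake_probs):
    """The discriminator wants real images scored near 1 and generated ones near 0."""
    real_term = sum(bce(p, 1.0) for p in real_probs) / len(real_probs)
    fake_term = sum(bce(p, 0.0) for p in fake_probs) / len(fake_probs)
    return real_term + fake_term


def generator_adversarial_loss(fake_probs):
    """The generator wants its outputs scored as real (near 1)."""
    return sum(bce(p, 1.0) for p in fake_probs) / len(fake_probs)


# If the discriminator confidently spots the fakes, the generator's loss is high.
caught = generator_adversarial_loss([0.1, 0.2])
fooled = generator_adversarial_loss([0.9, 0.8])
print(caught > fooled)
```

In practice this adversarial term is usually added, with a small weight, on top of the pixel loss the generator was already trained with.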
-Second Run
So here we go again. I retrained the model with the new dataset and the benchmark numbers increased a bit (PSNR went from 36 dB to 39 dB), but there was an incredibly obvious drawback: my upscaled video now had very clear purple tinting/artefacting on grey surfaces (the sky). This could be because the data is not diverse enough, or because my degradation pipeline shifted the hues of the training data. To fix this I could handle colour correction, add more diverse data or train without degradation, but at this point I am 30 USD in and counting, so onwards we go.
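For reference, PSNR is just mean squared error on a log scale. Here is a minimal sketch of the standard formula for 8-bit images, shown on tiny grayscale arrays; the function name and the test values are made up for illustration.

```python
import math


def psnr(reference, output, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized 2D images."""
    squared_errors = [
        (ref_px - out_px) ** 2
        for ref_row, out_row in zip(reference, output)
        for ref_px, out_px in zip(ref_row, out_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [[0.0, 0.0]]
better = [[0.0, 2.0]]  # MSE = 2
worse = [[2.0, 2.0]]   # MSE = 4
gain_db = psnr(reference, better) - psnr(reference, worse)
print(round(gain_db, 2))
```

Because of the log scale, halving the mean squared error adds about 3 dB. PSNR only measures per-pixel error though, so it says nothing about issues like a colour tint looking wrong to the eye.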
My discriminator architecture was ripped straight from ESRGAN (the U-Net with spectral normalisation used by Real-ESRGAN, which outputs a per-pixel real/fake map) with 4 million parameters. Once my generator had finished training, I moved on to GAN training. To spare you the long and horrendously boring details, my model improved slightly. It managed to commit more to texture elements, but the fidelity I was looking for was not present (expected, but still disappointing).




Looking at the difference between epoch 1 and epoch 15, we can see better reconstruction of subtle textures and a less harsh upscale. The fidelity still fails to approach native, which I believe is now a depth problem (ESRGAN was trained on much more data than my model, with a more thorough degradation pipeline and obviously a longer training stage). So where does this leave us? I made a model that beats bicubic upscaling (not very difficult to do), fails its fidelity goals (compared to ESRGAN) but preserves temporal stability to a decent extent. I will leave all the comparison images here for you guys to decide.





To view the images in full resolution, or to watch the upscaled videos, do check out the GitHub page.