Spatiotemporal Vision Transformer-based Network for Deepfake Detection
KTH, School of Electrical Engineering and Computer Science (EECS).
2024 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Alternative title
Spatiotemporal Vision Transformer-baserat nätverk för Deepfake detektering (Swedish)
Abstract [en]

Manipulated videos and images, known as deepfakes, are more common than ever. They can be used to spread disinformation on social media and to harm individuals. With more advanced deep learning methods on the horizon producing deepfakes that are ever harder to detect, progress in detection methods is crucial. This thesis therefore presents three variants of a novel deep learning model aimed at improving deepfake detection. The proposed models consist of a spatiotemporal convolutional neural network used as a feature extractor and a vision transformer used as a classifier. The performance, robustness, and knowledge transferability of the models were tested. The models were trained and evaluated on multiple datasets and compared to a state-of-the-art baseline model. Robustness was tested by applying several perturbation techniques: Gaussian noise, blur, increased brightness, and a changed compression rate. Knowledge transferability was tested by training the models on one dataset and evaluating them on another. Lastly, the impact of the size of the temporal domain was investigated. The best-performing proposed model achieved results competitive with the state of the art in terms of performance and robustness. It also showed slightly better knowledge transferability, although neither model performed well when tested on unseen manipulation techniques. In summary, this thesis presents a novel spatiotemporal deep learning model that achieves performance similar to current state-of-the-art methods when applied to deepfake detection. However, the proposed combined model did not improve on existing methods.
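
The abstract specifies the architecture only at a high level: a spatiotemporal convolutional network as a feature extractor, followed by a vision transformer as a classifier. The following is a minimal PyTorch sketch of such a design; the layer sizes, clip length, token layout, and pooling choices are illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch: 3D-CNN feature extractor + transformer classifier.
# All hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn

class SpatiotemporalViTDetector(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8, depth=4, num_classes=2):
        super().__init__()
        # 3D CNN: convolves over (time, height, width), so features
        # mix information across frames as well as within them.
        self.cnn = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, embed_dim, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.BatchNorm3d(embed_dim),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # keep time, pool space to 4x4
        )
        # Transformer encoder treats each spatiotemporal feature location
        # as a token; a learned [CLS] token aggregates the sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)  # real vs. fake

    def forward(self, clip):
        # clip: (batch, channels=3, frames, height, width)
        feats = self.cnn(clip)                     # (B, C, T, 4, 4)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, T*16, C)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)
        encoded = self.transformer(tokens)
        return self.head(encoded[:, 0])            # classify from [CLS] token

# Example: a batch of two 8-frame 112x112 clips.
logits = SpatiotemporalViTDetector()(torch.randn(2, 3, 8, 112, 112))
print(logits.shape)  # torch.Size([2, 2])
```

The point of this arrangement is that the 3D convolutions already fuse information across frames before the transformer attends over the resulting spatiotemporal tokens, which distinguishes it from classifying each frame independently with a plain ViT.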

Abstract [sv]

Manipulated videos and images, known as deepfakes, are more common than ever. They can be used to spread disinformation on social media and to harm individuals. With more advanced deep learning methods on the horizon and deepfakes becoming harder to detect, progress in deepfake detection is crucial for identifying manipulated videos. This thesis therefore presents variants of a novel deep learning model aimed at improving deepfake detection. The proposed model consists of a spatiotemporal convolutional neural network used as a feature extractor and a vision transformer used as a classifier. The model's performance, robustness, and knowledge transferability were tested. The models were trained and evaluated on several datasets and compared against a state-of-the-art baseline model. The robustness of the models was tested by applying several perturbation techniques: Gaussian noise, blur, increased brightness, and a changed compression rate. Their knowledge transferability was tested by evaluating the model when trained on one dataset and tested on another. The best-performing proposed model achieved results competitive with the state of the art when performance and robustness were compared. It also showed a slight advantage in knowledge transferability, but neither model performed well when tested on unseen manipulation techniques. In summary, this thesis presents a novel spatiotemporal deep learning model that achieves performance similar to current state-of-the-art methods when applied to deepfake detection.
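
For the robustness protocol, both abstracts name four perturbations: Gaussian noise, blur, increased brightness, and a changed compression rate. A hedged sketch of how such perturbations could be applied to evaluation frames with torchvision follows; all parameter values (noise sigma, kernel size, brightness factor, JPEG quality) are assumptions for the example, not the thesis's settings.

```python
# Sketch of the four perturbations used in the robustness evaluation.
# Parameter values are illustrative assumptions only.
import io

import torch
from PIL import Image
from torchvision.transforms import functional as TF

def gaussian_noise(frame, sigma=0.05):
    # frame: float tensor in [0, 1], shape (3, H, W)
    return (frame + sigma * torch.randn_like(frame)).clamp(0.0, 1.0)

def blur(frame, kernel_size=5):
    return TF.gaussian_blur(frame, kernel_size=kernel_size)

def brighten(frame, factor=1.5):
    return TF.adjust_brightness(frame, brightness_factor=factor)

def recompress(frame, quality=30):
    # Simulate a changed compression rate via a JPEG encode/decode round trip.
    buffer = io.BytesIO()
    TF.to_pil_image(frame).save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return TF.to_tensor(Image.open(buffer))

# Apply each perturbation frame-by-frame to one clip of shape (T, 3, H, W).
clip = torch.rand(8, 3, 112, 112)
for perturb in (gaussian_noise, blur, brighten, recompress):
    degraded = torch.stack([perturb(frame) for frame in clip])
    print(perturb.__name__, degraded.shape)  # each: torch.Size([8, 3, 112, 112])
```

Evaluating a trained detector on clips degraded this way, at several severity levels, gives the kind of robustness comparison against the baseline that the abstract describes.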

Place, publisher, year, edition, pages
2024, p. 49.
Series
TRITA-EECS-EX ; 2024:1002
Keywords [en]
Vision Transformer, Deepfake, Convolutional Neural Network, Deep learning
Keywords [sv]
Vision Transformer, Deepfake, Convolutional Neural Network, Deep learning
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-361653
OAI: oai:DiVA.org:kth-361653
DiVA, id: diva2:1947149
Available from: 2025-03-27. Created: 2025-03-25. Last updated: 2025-03-27. Bibliographically approved.

Open Access in DiVA

fulltext (1446 kB), 43 downloads
File information
File name: FULLTEXT02.pdf. File size: 1446 kB. Checksum (SHA-512):
e6ee0d84ff444699b7c27baa77c703d396dc6aacd9cab5c03c6ad55fc61012f1039fc015d51468722c7f60533dce975cda9662bb9417bb379457deab4bbeabec
Type: fulltext. Mimetype: application/pdf


Total: 43 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 275 hits