From Pixels to Prices with ViTMAE: Integrating Real Estate Images through Masked Autoencoder Vision Transformers (ViTMAE) with Conventional Real Estate Data for Enhanced Automated Valuation
KTH, School of Electrical Engineering and Computer Science (EECS).
2024 (English)
Independent thesis, advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Från pixlar till priser med ViTMAE : Integrering av bostadsbilder genom Masked Autoencoder Vision Transformers (ViTMAE) med konventionell fastighetsdata för förbättrad automatiserad värdering (Swedish)
Abstract [en]

This Master’s thesis investigates the integration of Vision Transformers (ViTs) with Masked Autoencoder pre-training (ViTMAE) into real estate valuation, addressing the challenge of effectively analyzing visual information from real estate images. The integration aims to enhance the accuracy and efficiency of valuation, a task traditionally dependent on realtor expertise. The research involved developing a model that combines ViTMAE-extracted visual features from real estate images with traditional property data. Focusing on residential properties in Sweden, the study used a dataset of images and metadata from online real estate listings. An adapted ViTMAE model, accessed via the Hugging Face library, was trained on the dataset for feature extraction, and the extracted features were then integrated with metadata to create a multimodal valuation model. Results indicate that including ViTMAE-extracted image features improves prediction accuracy: the multimodal approach, merging visual features with traditional metadata, was more accurate than metadata-only models. The thesis contributes to real estate valuation by showing the potential of advanced image processing techniques to improve valuation models. It lays the groundwork for future research on more refined holistic valuation models that incorporate a wider range of factors beyond visual data.
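
The fusion step described in the abstract, where image features are combined with listing metadata, might look roughly like the sketch below. It is only an illustration under stated assumptions, not the thesis's implementation: the embedding values are made-up placeholders standing in for ViTMAE encoder outputs, and the function names (`pool_image_embeddings`, `fuse_features`), the metadata fields, and the choice of mean pooling followed by concatenation are all hypothetical.

```python
from statistics import fmean


def pool_image_embeddings(embeddings):
    """Average several per-image embedding vectors into one per-listing vector.

    Assumption: mean pooling is used here only for illustration; the record
    does not state how multiple images per listing are aggregated.
    """
    if not embeddings:
        raise ValueError("at least one image embedding is required")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same length")
    return [fmean([e[i] for e in embeddings]) for i in range(dim)]


def fuse_features(image_embeddings, metadata, metadata_keys):
    """Concatenate the pooled image vector with numeric metadata fields."""
    pooled = pool_image_embeddings(image_embeddings)
    tabular = [float(metadata[key]) for key in metadata_keys]
    return pooled + tabular


# Hypothetical listing: two images, each already encoded as a 3-dimensional
# embedding (a real encoder would produce far larger vectors).
listing_images = [[0.5, 0.25, 1.0], [1.5, 0.75, 0.0]]
# Hypothetical metadata fields, chosen only for the example.
listing_metadata = {"living_area_m2": 72.0, "rooms": 3, "build_year": 1965}
metadata_keys = ["living_area_m2", "rooms", "build_year"]

features = fuse_features(listing_images, listing_metadata, metadata_keys)
print(features)
```

The resulting vector could then be passed to any regression model. Training that model only on the metadata part of the vector would give the metadata-only baseline that the abstract compares against.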

Abstract [sv] (English translation)

This degree project investigates the integration of Vision Transformers (ViTs) with Masked Autoencoder pre-training (ViTMAE) into housing valuation, addressing the challenge of effectively analyzing visual information from housing listings. The integration aims to improve the accuracy and efficiency of property valuation, a task traditionally dependent on a physical inspection by a realtor. The work involved developing a model that combines image information extracted with ViTMAE from property images with traditional property data. Focusing on residential properties in Sweden, the study used a database of images and metadata from housing listings. The adapted ViTMAE model, available via the Hugging Face library, was trained on this database to extract image information, which was then integrated with metadata to create a comprehensive valuation model. The results indicate that including ViTMAE-extracted image information improves the accuracy of housing valuation models. The multimodal method, which combines visual data and traditional metadata, showed improved accuracy compared with models that use metadata only. This thesis contributes to housing valuation by demonstrating the potential of advanced image analysis to improve valuation models. It lays the groundwork for future research into more refined holistic valuation models that include a broader range of factors beyond visual data.

Place, publisher, year, edition, pages
2024. p. 51
Series
TRITA-EECS-EX ; 2024:50
Keywords [en]
Computer Vision, Transformer, Self-supervised Learning, Masked Autoencoder, Automated Real Estate Appraisal
Keywords [sv]
Computer Vision, Transformer, Self-supervised Learning, Masked Autoencoder, Automated Housing Valuation
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-345806
OAI: oai:DiVA.org:kth-345806
DiVA, id: diva2:1853073
External cooperation
Echo State AB
Supervisors
Examiners
Available from: 2024-05-08 Created: 2024-04-20 Last updated: 2025-01-27
Bibliographically approved

Open Access in DiVA

fulltext (4092 kB), 242 downloads
File information
File name: FULLTEXT01.pdf
File size: 4092 kB
Checksum (SHA-512):
375b28923fca4f478d57af539a2b9ae2c4891a2968c7e4b99e98c583ff26df16cf31745a12f489c9929c7d23406ed232ac75feffd95510127278ab6ef97edb78
Type: fulltext
Mimetype: application/pdf

Total: 242 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 376 hits