A Multimodal Geospatially-Aware Image Retrieval Framework for Non-Geotagged Image Localization Using Contrastive Vision-Language Learning

S. Pratap Singh
Dr. Ch. Bindu Madhuri
Dr. P. Satheesh

Abstract

The large-scale growth of digital image collections across mobile platforms, online media, and public repositories has created significant demand for intelligent retrieval systems capable of understanding visual content together with its geographic context. Existing image retrieval approaches mainly rely on semantic feature similarity and often neglect spatial relationships, reducing their effectiveness for geospatial reasoning and location inference tasks. This work presents GeoCLIP-BLIP, a multimodal framework for retrieving and localizing non-geotagged images through combined semantic and geographic representation learning. The proposed approach integrates CLIP to extract semantic visual embeddings, a lightweight geographic encoding module to capture spatial information from coordinate data, and BLIP to generate descriptive captions that improve interpretability. Using a geo-referenced image database, the framework identifies visually related samples and estimates the probable geographic location of an input query image through similarity-based ranking. Retrieved results are further presented through an interactive map interface for intuitive spatial visualization. Experimental evaluation shows that the proposed framework achieves better retrieval relevance and geographic consistency than conventional CLIP-based retrieval methods. By combining semantic feature extraction, spatial embedding fusion, and caption-based explanation, GeoCLIP-BLIP provides an efficient solution for multimodal geospatial image retrieval and localization of non-geotagged images.
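
The retrieval and localization step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the geographic encoding, the fusion scheme, or the ranking details, so everything below is an assumption chosen for illustration. The sketch assumes semantic embeddings have already been computed (for example with CLIP), ranks database images by cosine similarity, encodes coordinates as 3-D unit vectors, and estimates the query location as the similarity-weighted mean of the top-k neighbors on the sphere. The names `latlon_to_unit`, `unit_to_latlon`, and `retrieve_and_localize`, the parameter `k`, and the toy data are all hypothetical.

```python
import numpy as np

def latlon_to_unit(lat_deg, lon_deg):
    """Encode latitude/longitude in degrees as 3-D unit vectors (assumed encoding)."""
    lat = np.radians(lat_deg)
    lon = np.radians(lon_deg)
    return np.stack(
        [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)],
        axis=-1,
    )

def unit_to_latlon(vec):
    """Convert a 3-D vector back to (latitude, longitude) in degrees."""
    vec = vec / np.linalg.norm(vec)
    lat = np.degrees(np.arcsin(np.clip(vec[2], -1.0, 1.0)))
    lon = np.degrees(np.arctan2(vec[1], vec[0]))
    return float(lat), float(lon)

def retrieve_and_localize(query_emb, db_embs, db_latlon, k=3):
    """Rank database images by cosine similarity and estimate the query location.

    query_emb: (d,) semantic embedding of the non-geotagged query image.
    db_embs:   (n, d) semantic embeddings of geo-referenced database images.
    db_latlon: (n, 2) latitude/longitude in degrees for each database image.
    Returns the indices of the top-k matches and the estimated (lat, lon).
    """
    q = query_emb / np.linalg.norm(query_emb)
    d = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = d @ q                          # cosine similarity to every database image
    top = np.argsort(-sims)[:k]           # indices of the k most similar images
    weights = np.clip(sims[top], 0.0, None)
    if weights.sum() == 0.0:              # fall back to uniform weights
        weights = np.ones_like(weights)
    geo = latlon_to_unit(db_latlon[top, 0], db_latlon[top, 1])
    mean_vec = (weights[:, None] * geo).sum(axis=0)
    return top, unit_to_latlon(mean_vec)

# Toy data (hypothetical): four database images with 3-D stand-in embeddings.
db_embs = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
db_latlon = np.array([[48.85, 2.35], [48.86, 2.29], [40.71, -74.00], [35.68, 139.69]])
query_emb = np.array([1.0, 0.05, 0.0])

top_idx, (est_lat, est_lon) = retrieve_and_localize(query_emb, db_embs, db_latlon, k=2)
```

Averaging on the unit sphere rather than directly on latitude and longitude avoids wrap-around problems near the ±180° meridian. A learned geographic encoder or a fused semantic-spatial embedding, as the abstract suggests, would replace the fixed encoding used here.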

Article Details

Section
Articles