Abdirashid Omar

Building FINCPA: Turning Financial Law into Compliance Data

2026-06-04T00:00:00+09:00

FINCPA was built as a hackathon prototype for the JB Financial Group Fin AI Challenge. The product idea was simple to explain but tricky to implement:

Can we help review financial advertising without letting a language model become the legal judge?

That question matters because financial compliance is not only a text-generation problem. If an AI system says “this ad looks fine,” a reviewer needs to know why. Which rule was checked? Which legal clause supports it? Which disclosure was missing? Which field in the ad caused the finding? Could the decision be reproduced tomorrow?

The architecture we proposed answers those questions with a compiler-style workflow. The law is converted into structured compliance data first. The runtime input is converted into a structured representation second. A deterministic engine compares the two. Only after that does the LLM help with explanation and conservative rewrite suggestions.

The Main Design Choice

A weaker design would be:

user ad → LLM → compliance decision

That is easy to demo, but hard to trust. It makes the model responsible for interpreting law, applying rules, and explaining itself in one step.

FINCPA uses a different flow:

law → structured rules
ad → structured facts
rules + facts → deterministic finding
finding + evidence → LLM explanation
reviewer → final approval

This separation is the whole project. The LLM is useful, but it is not the first judge. The system first asks: what can we prove from structured legal data and structured input fields?

Scope: One Legal Chapter First

The prototype focuses on Chapter 4 of the Financial Consumer Protection Act, especially customer-facing conduct and financial-product advertising. Instead of trying to cover every regulation in Korea, the project builds one defensible legal core:

one official law PDF,
one chapter scope,
clause-level parsing,
obligation decomposition,
rule candidate compilation,
MVP rule freeze,
runtime review examples.

That may sound narrow, but narrow is good for a hackathon prototype. A smaller scope lets the team show traceability. Every rule can point back to source text.

Building the Legal Dataset

The first layer is deterministic parsing. The source PDF is trimmed to the Chapter 4 range, then segmented into article and paragraph-level records. The current parse produces 60 clause records across the Chapter 4 scope.

In abstract form, we can think of the source law as a sequence of clauses:

\[C = \{c_1, c_2, \ldots, c_n\}\]

where each clause record contains metadata such as article reference, paragraph marker, source text, normalized text, and source path.

For FINCPA:

\[n = 60\]

The important point is that this layer does not ask an LLM to make a compliance judgment. It only creates clean legal units.

From Clauses to Obligations

Legal clauses are often dense. One paragraph can contain several operational requirements, exceptions, or prohibited behaviors. So the next step decomposes clauses into smaller obligation units.

We can describe that as a mapping:

\[g(c_i) = \{o_{i1}, o_{i2}, \ldots, o_{im}\}\]

where each $o_{ij}$ is an obligation or rule-relevant unit derived from clause $c_i$.

In the current FINCPA pipeline:

\[|O| = 109\]

This is where the legal text starts becoming useful for computation. Instead of treating the law as paragraphs, the system treats it as operational units: required disclosure, prohibited expression, required process, required record, or required response.

Compiling Candidate Rules

Once obligations exist, the system can compile rule candidates. Each candidate connects a legal obligation to things the runtime system can actually inspect.

A useful rule row needs at least:

legal basis,
product scope,
channel scope,
rule family,
logic type,
detection target,
candidate SIR fields,
evaluation hint.

At this point the project has 109 Layer 3 rule/SIR candidates.

But candidates are not enough. A product demo needs a frozen MVP. So Layer 4 applies a deterministic selection rule:

\[R_{mvp} = \{r_k \in R_{candidate} \mid ready\_for\_v1(r_k)=yes\}\]

That produces:

\[|R_{mvp}| = 76\]

and a frozen SIR schema with:

\[|F_{sir}| = 29\]

This is the moment the project becomes a real data system. The rule pack is no longer just notes about the law. It is a machine-readable compliance artifact.
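The Layer 4 freeze can be sketched in a few lines. This is a minimal illustration, not the repository's code: the `RuleCandidate` class, the `rule_id` values, and the `freeze_mvp` helper are hypothetical names, and only the `ready_for_v1` flag comes from the selection rule above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleCandidate:
    rule_id: str       # hypothetical identifier, for illustration only
    legal_basis: str   # citation back to the source clause
    ready_for_v1: str  # "yes" or "no", as in the selection rule


def freeze_mvp(candidates):
    """Keep only candidates flagged ready_for_v1 == "yes", preserving order."""
    return [r for r in candidates if r.ready_for_v1 == "yes"]


candidates = [
    RuleCandidate("R-001", "Article 22 paragraph 3", "yes"),
    RuleCandidate("R-002", "Article 22 paragraph 4", "no"),
    RuleCandidate("R-003", "example-only citation", "yes"),
]
mvp_rules = freeze_mvp(candidates)
print([r.rule_id for r in mvp_rules])  # ['R-001', 'R-003']
```

Because the filter is a pure function of the candidate rows, rerunning it on the same Layer 3 output always yields the same frozen rule pack.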

What Is SIR?

SIR means Structured Intermediate Representation. It is the normalized form of the user input.

A financial ad is messy text:

Fast personal loan. 3% guaranteed. Apply in seconds.

The runtime needs something more explicit:

{
  "product_type": "loan",
  "seller_identity": "present",
  "loan_conditions": "present",
  "loan_rate_basis": "not_evidenced",
  "loan_interest_timing": "not_evidenced",
  "prohibited_claim_signal": "present"
}

SIR is the bridge between human language and legal rules. It lets the engine ask precise questions:

Is the product type known?
Is the seller identified?
Are required cost disclosures present?
Is a prohibited certainty phrase detected?
Is the required warning missing?

Deterministic Review

Each rule can be evaluated against the SIR fields. A simple required-presence rule can be written as:

\[fail(r_k, x) = \begin{cases} 1, & \text{if required field } f_k \text{ is not evidenced in } x \\ 0, & \text{otherwise} \end{cases}\]

For prohibited-presence rules:

\[fail(r_k, x) = \begin{cases} 1, & \text{if prohibited signal } p_k \text{ is present in } x \\ 0, & \text{otherwise} \end{cases}\]

The final decision can be simplified as:

\[d(x) = \begin{cases} \text{non-compliant}, & \sum_k fail(r_k, x) > 0 \\ \text{review}, & \sum_k uncertain(r_k, x) > 0 \\ \text{compliant}, & \text{otherwise} \end{cases}\]

The actual system keeps more detail than this, including applicable rule counts, missing SIR fields, triggered citations, failed rule IDs, escalation flags, and reviewer packets.
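A small sketch of the two rule types and the decision order follows. The rule dictionaries, the `logic` names, and the choice to treat a field that was never extracted as "uncertain" are assumptions made for this example; the real engine may define uncertainty differently.

```python
def evaluate_rule(rule, sir):
    """Return "fail", "uncertain", or "pass" for one rule against SIR fields."""
    value = sir.get(rule["field"])
    if value is None:
        # Assumption: a field that was never extracted is deferred to a reviewer.
        return "uncertain"
    if rule["logic"] == "required_presence":
        return "fail" if value == "not_evidenced" else "pass"
    if rule["logic"] == "prohibited_presence":
        return "fail" if value == "present" else "pass"
    raise ValueError(f"unknown logic type: {rule['logic']}")


def decide(rules, sir):
    """Any failure wins, then any uncertainty, otherwise compliant."""
    outcomes = {rule["id"]: evaluate_rule(rule, sir) for rule in rules}
    failed = [rid for rid, outcome in outcomes.items() if outcome == "fail"]
    uncertain = [rid for rid, outcome in outcomes.items() if outcome == "uncertain"]
    if failed:
        decision = "non_compliant"
    elif uncertain:
        decision = "review"
    else:
        decision = "compliant"
    return {"decision": decision, "failed_rules": failed, "uncertain_rules": uncertain}


# Hypothetical rules that reuse field names from the SIR example above.
rules = [
    {"id": "R-A", "logic": "required_presence", "field": "loan_rate_basis"},
    {"id": "R-B", "logic": "prohibited_presence", "field": "prohibited_claim_signal"},
    {"id": "R-C", "logic": "required_presence", "field": "seller_identity"},
]
sir = {
    "seller_identity": "present",
    "loan_rate_basis": "not_evidenced",
    "prohibited_claim_signal": "present",
}
result = decide(rules, sir)
print(result["decision"], result["failed_rules"])  # non_compliant ['R-A', 'R-B']
```

The returned dictionary keeps the failed rule IDs next to the label, which is the minimum a reviewer needs to trace a finding back to its rule.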

Example: Guaranteed Investment Return

One runtime example tests an investment ad with a guaranteed-return signal. The engine marks the case as non-compliant and escalates it. It triggers rules connected to advertising obligations, including missing investment-warning evidence and a prohibited claim signal.

The useful part is not only the label. The useful part is the evidence trail:

final decision: non_compliant
escalation: true
applicable rules: 3
failed rules: 2
missing SIR field: investment_warning
triggered citations: Financial Consumer Protection Act Article 22 paragraphs 3 and 4

This is what makes the system reviewable. A compliance officer does not have to accept a black-box sentence. They can inspect which field was missing, which rule failed, and which legal citation was attached.

Why Dashboards Matter

The repository includes two dashboard views.

The first dashboard is for the legal compilation pipeline. It shows how a clause moves through parsing, Layer 1 metadata, Layer 2 obligation decomposition, Layer 3 rule/SIR candidate design, and the Layer 4 freeze.

The second dashboard is for runtime review. It shows a new input moving through:

prompt/input,
runtime schema,
SIR extraction,
active rules,
triggered law,
final result.

This was important for the hackathon because the project has many JSON and JSONL artifacts. A dashboard makes the pipeline explainable to judges, teammates, and future reviewers.

Where the LLM Fits

The LLM is still useful. It can turn a technical review result into language a human can act on:

reviewer summary,
plain-language rationale,
remediation actions,
conservative rewrite suggestion.

But the LLM receives the structured finding after the deterministic engine has already created it. That makes the system safer:

\[LLM\_input = \{original\_text, SIR, failed\_rules, citations, evidence\}\]

The model explains the decision. It does not invent the legal basis.

Limitations

FINCPA is a prototype, not a production compliance system. The current scope focuses on Chapter 4 and does not yet include the full stack of presidential decrees, supervisory regulations, enforcement cases, and product-specific advertising guidance.

The SIR extractor is also an MVP. Real deployment would need better entity extraction, stronger Korean financial-language coverage, more robust OCR/document input, reviewer feedback loops, and ongoing legal updates.

Still, the prototype shows the right shape: legal data should be traceable, rules should be auditable, and LLMs should be placed where they help without becoming the hidden judge.

Conclusion

FINCPA is a data engineering story disguised as a compliance AI project.

The core work is not just building a dashboard or calling an API. The core work is turning law into structured data, turning ads into structured facts, comparing them deterministically, and packaging the result so a human reviewer can make the final decision with evidence.

That is the lesson I like most from this project: in high-stakes domains, good AI starts with good structure.

You can explore the code, dashboards, and artifacts here:

GitHub: rashiedomar/FinTech
Legal compilation dashboard
Runtime flow dashboard

ArtStyleNet: Finding Similar Artworks with Deep Features

2026-06-03T00:00:00+09:00

ArtStyleNet is a small computer vision project about a subjective question:

Can a model find paintings that feel stylistically similar?

Art recommendation is not the same as object classification. If we classify a painting only by artist name, we miss the softer visual signals that make artworks feel related: color palette, texture, brushstroke density, composition, contrast, and mood. ArtStyleNet explores that space using deep visual features.

The Dataset

The project uses the Dacon Artist Classification dataset: 5,910 paintings from 51 artists. The metadata includes artist, genre, nationality, and years, but the main pipeline focuses on images.

That choice is important. The project asks whether visual appearance alone can reveal useful style groupings.

The input is an artwork image:

\[X \in \mathbb{R}^{H \times W \times 3}\]

The goal is not only to predict an artist label, but to represent the image in a feature space where stylistic similarity can be compared.

Feature Extraction

The first step uses a pretrained ResNet50 model as a feature extractor. Instead of training a full model from scratch, the project uses ImageNet-pretrained visual filters and removes the final classification layer.

In simple form:

\[z = f_{\theta}(X)\]

where $X$ is the painting and $z$ is the extracted feature vector. This vector is not a human description, but it captures visual patterns learned by the CNN: edges, shapes, textures, color transitions, and higher-level image structure.

For style similarity, this is useful because we can compare two paintings by comparing their feature vectors:

\[\text{sim}(i,j) = \frac{z_i^\top z_j}{\|z_i\|_2 \|z_j\|_2}\]

High similarity means the two images are close in the model's visual representation.
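The similarity formula translates directly into code. The sketch below uses plain Python lists and tiny three-dimensional vectors so it runs without any dependencies; real ResNet50 features are much longer, and the painting names and the `most_similar` helper are made up for illustration.

```python
import math


def cosine_similarity(z_i, z_j):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(z_i, z_j))
    norm_i = math.sqrt(sum(a * a for a in z_i))
    norm_j = math.sqrt(sum(b * b for b in z_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_i * norm_j)


def most_similar(query, gallery, top_k=2):
    """Rank gallery items by cosine similarity to the query, best first."""
    scored = [(name, cosine_similarity(query, z)) for name, z in gallery.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Toy feature vectors, not real model outputs.
gallery = {
    "painting_a": [1.0, 0.0, 1.0],
    "painting_b": [0.9, 0.1, 1.1],
    "painting_c": [0.0, 1.0, 0.0],
}
ranked = most_similar([1.0, 0.0, 1.0], gallery)
print([name for name, _ in ranked])  # ['painting_a', 'painting_b']
```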

Reducing the Feature Space

Deep features are high-dimensional, so the project applies PCA before clustering. PCA finds directions of maximum variance:

\[Z_{\text{PCA}} = ZW_k\]

where $Z$ is the feature matrix and $W_k$ keeps the top $k$ principal components.

This step makes the feature space easier to model and visualize. It also removes some noise, which matters because artistic style is subtle: the system should not cluster paintings only because of small irrelevant pixel-level differences.

Clustering Style

After PCA, the project applies LDA-style topic modeling to group artworks into latent visual topics. In text analysis, LDA groups documents by word topics. Here, the same idea is used more creatively: artworks can be treated as having mixtures of latent style components.

For an artwork $i$, the model can assign a topic distribution:

\[\theta_i = [p(t_1|i), p(t_2|i), \ldots, p(t_K|i)]\]

The dominant topic is:

\[t_i^* = \arg\max_k p(t_k|i)\]

This makes it possible to count how many artworks belong strongly to each topic, inspect representative artworks, and see which artists have more stylistic variability.
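As a minimal sketch of that counting step, the snippet below takes per-artwork topic distributions (invented numbers, three topics) and derives the dominant topic and the per-topic counts. The artwork names and the `dominant_topic` helper are hypothetical.

```python
from collections import Counter


def dominant_topic(theta):
    """Index of the largest probability in one artwork's topic distribution."""
    return max(range(len(theta)), key=lambda k: theta[k])


# Toy topic distributions; each row sums to 1.
topic_distributions = {
    "artwork_1": [0.70, 0.20, 0.10],
    "artwork_2": [0.15, 0.25, 0.60],
    "artwork_3": [0.55, 0.30, 0.15],
}
dominant = {name: dominant_topic(theta) for name, theta in topic_distributions.items()}
topic_counts = Counter(dominant.values())
print(dominant)  # {'artwork_1': 0, 'artwork_2': 2, 'artwork_3': 0}
```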

What the Project Shows

The useful part of ArtStyleNet is the pipeline shape:

load paintings,
extract visual features with ResNet50,
reduce dimensions with PCA,
discover latent clusters,
visualize topic distributions and representative artworks.

This is a good foundation for content-based recommendation. A user may not know the artist or period they want, but they might know the kind of visual feeling they like. A feature-based recommender can start from one artwork and retrieve visually similar ones.

Limitations

The project is exploratory. ResNet50 was trained on natural images, not art history, so its features are useful but not perfect. PCA is linear, while artistic style may be nonlinear. LDA is also borrowed from text topic modeling, so future work could compare this pipeline against alternatives such as t-SNE or UMAP for the projection step, and autoencoders, contrastive learning, or CLIP embeddings for the features.

Metadata could also improve the system. Genre, nationality, artist period, and year may help separate true style similarity from accidental visual similarity.

Conclusion

ArtStyleNet is a compact project, but it asks a good question: how can deep learning help people explore art visually?

The answer is not to replace human taste. The answer is to build a visual search layer: a model that maps paintings into a feature space, groups related works, and helps users discover artworks they may not have found by name or category alone.

You can read the code and fork the project here:

GitHub: rashiedomar/ArtStyleNET

From CCTV to Safer Crosswalk Timing

2026-06-02T00:00:00+09:00

Traffic signals are often designed around a simplified assumption: pedestrians cross at a standard walking speed. But that assumption is not equally safe for everyone.

In South Korea, elderly pedestrians can walk much more slowly than the walking speed assumed in many standard crossing calculations. The project in rashiedomar/crosswalk-cctv starts from that public-safety problem and turns it into a computer vision system: detect the crosswalk region from overhead CCTV, use that region as the area of interest, and eventually estimate pedestrian crossing speed so signal timing can adapt to slower walkers.

The full research roadmap has four phases:

Crosswalk segmentation from CCTV.
Pedestrian detection, tracking, and speed estimation.
Adaptive timing control and safety validation.
Edge deployment on Jetson-style hardware.

This article focuses on Phase 1, which is already complete: building a robust crosswalk segmentation model that transfers from first-person-view data to overhead CCTV. The result is strong: 98.5% IoU on CCTV validation images, using only 241 manually labeled CCTV images plus 1,000 high-confidence pseudo-labels selected from 5,926 unlabeled AI-Hub CCTV images.

That number matters, but the more interesting story is how the project got there.

Why Crosswalk Segmentation Comes First

Adaptive signal timing needs a reliable definition of where crossing happens. Before estimating pedestrian speed, the system must know the crosswalk region:

where pedestrians enter,
where they leave,
which pixels belong to the legal crossing zone,
and which moving objects should be ignored because they are outside the crosswalk.

If the crosswalk mask is wrong, every downstream step becomes fragile. A pedestrian tracker may detect people, but without a reliable region of interest it cannot tell whether a person is actually crossing, waiting, passing near the curb, or walking on the sidewalk.

So Phase 1 is a segmentation problem:

Given an image $X \in \mathbb{R}^{H \times W \times 3}$, predict a per-pixel probability mask, which is thresholded into a binary mask when needed:

\[\hat{Y} \in [0,1]^{H \times W}\]

where each pixel $\hat{Y}_{ij}$ estimates whether pixel $(i,j)$ belongs to the crosswalk.

The target mask is:

\[Y_{ij} = \begin{cases} 1, & \text{if pixel } (i,j) \text{ is crosswalk} \\ 0, & \text{otherwise} \end{cases}\]

Once the mask is reliable, the later pedestrian-speed pipeline can restrict analysis to the crosswalk area:

\[\text{ROI}(X) = X \odot \hat{Y}\]

where $\odot$ means element-wise masking, with the same mask applied to each of the three color channels.
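A dependency-free sketch of the masking step is shown here. It thresholds the probability mask first and then zeroes out everything outside the crosswalk, which is one common way to apply the formula; the 0.5 threshold and the `apply_mask` name are assumptions for this example.

```python
def apply_mask(image, mask, threshold=0.5):
    """Zero out pixels whose crosswalk probability is below the threshold.

    image: H x W x 3 nested lists; mask: H x W probabilities in [0, 1].
    """
    return [
        [
            list(pixel) if prob >= threshold else [0, 0, 0]
            for pixel, prob in zip(image_row, mask_row)
        ]
        for image_row, mask_row in zip(image, mask)
    ]


# A 1 x 2 toy image: the first pixel is inside the crosswalk, the second is not.
image = [[[10, 20, 30], [40, 50, 60]]]
mask = [[0.9, 0.1]]
roi = apply_mask(image, mask)
print(roi)  # [[[10, 20, 30], [0, 0, 0]]]
```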

That sounds simple, but the camera viewpoint changes everything.

The Domain Gap

The first model was trained on first-person-view crosswalk images. In that domain, crosswalks are usually seen from a driver or street-level perspective. The stripes are large, close, and often occupy a predictable region of the image.

The target domain is overhead CCTV. In CCTV footage, crosswalks are smaller, farther away, angled, partially occluded by cars or buses, affected by rain or night lighting, and often surrounded by road markings that look similar.

The FPV model performed well on its own domain, reaching 93.05% IoU on the FPV test set. But when tested directly on CCTV, its prediction confidence collapsed.

This is the classic domain adaptation problem. The model did not only learn “crosswalkness.” It also learned the appearance distribution of the training domain:

\[P_{\text{train}}(X, Y) \neq P_{\text{test}}(X, Y)\]

In the source domain:

\[(X_s, Y_s) \sim P_s\]

In the target CCTV domain:

\[(X_t, Y_t) \sim P_t\]

The task is the same, but the data distribution changes:

\[P_s(X) \neq P_t(X)\]

That mismatch is enough to make a high-performing source model fail. The project therefore uses the FPV model as a starting point, not as the final detector.

Stage 1: FPV Baseline

The first stage trained a U-Net model with a ResNet34 encoder on 3,300 FPV crosswalk images. U-Net is a natural starting point because it combines encoder features with decoder upsampling and skip connections.

For segmentation, the model learns a function:

\[f_\theta(X) = \hat{Y}\]

where $\theta$ are the model parameters.

The basic evaluation metric is Intersection over Union:

\[\text{IoU}(Y,\hat{Y}) = \frac{|Y \cap \hat{Y}|} {|Y \cup \hat{Y}|}\]

For binary masks, this can be written as:

\[\text{IoU} = \frac{TP} {TP + FP + FN}\]

where $TP$ is the number of correctly predicted crosswalk pixels, $FP$ is the number of false crosswalk pixels, and $FN$ is the number of missed crosswalk pixels.
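The pixel-count form of IoU is easy to check by hand. The sketch below counts $TP$, $FP$, and $FN$ over two small binary masks; returning 1.0 when both masks are empty is a convention chosen for this example, not something the source specifies.

```python
def iou(target, predicted):
    """Intersection over Union for two binary masks given as nested lists."""
    tp = fp = fn = 0
    for target_row, predicted_row in zip(target, predicted):
        for y, y_hat in zip(target_row, predicted_row):
            if y == 1 and y_hat == 1:
                tp += 1
            elif y == 0 and y_hat == 1:
                fp += 1
            elif y == 1 and y_hat == 0:
                fn += 1
    union = tp + fp + fn
    # Both masks empty: treated as perfect agreement in this sketch.
    return tp / union if union else 1.0


target = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
predicted = [
    [0, 1, 1, 1],
    [0, 0, 1, 0],
]
print(iou(target, predicted))  # 3 TP, 1 FP, 1 FN -> 0.6
```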

The FPV baseline reached:

best validation IoU: 92.44%
test IoU: 93.05%
training epochs: 30

This is a strong result inside the FPV domain. But it does not solve the CCTV problem. A model can look excellent on a source domain while still being unreliable in the deployment domain.

That is why the next stage matters.

Testing the Source Model on CCTV

When the FPV-trained model was tested on overhead CCTV frames, the model struggled. On one CCTV set of 175 frames, the mean confidence was only about 0.044. Night frames were especially weak, with an average confidence close to 0.0016, while day frames averaged about 0.106.

The AI-Hub CCTV evaluation showed the same pattern. Across 5,926 CCTV images, mean confidence was only about 0.0545, with most images below 10% confidence.

This failure is useful. It proves that the deployment domain needs its own adaptation step.

The lesson is not “the FPV model is bad.” The lesson is that viewpoint matters. A first-person crosswalk and an overhead CCTV crosswalk are visually different objects from the perspective of the model.

The visual comparison makes the problem even clearer. FPV examples tend to show crosswalk stripes as near-field objects with strong perspective expansion. CCTV examples compress the same structure into a distant, oblique region of the frame. The lane markings, bus lanes, reflections, headlights, and road arrows become distractors.

This is one reason crosswalk segmentation is a better first target than pedestrian speed estimation. If the system cannot first localize the crosswalk under viewpoint shift, any attempt to estimate speed inside the scene will mix true crossing motion with irrelevant road and sidewalk motion.

Stage 2: CCTV Adaptation

The second stage adapts the model to CCTV. The key difficulty is annotation cost. Manually labeling CCTV segmentation masks is slow, and the project starts with only 241 labeled CCTV images:

201 training samples
40 validation samples

The first CCTV adaptation pass fine-tuned a DeepLabV3 model with a ResNet50 backbone. DeepLabV3 is useful here because atrous convolution and multi-scale context help segmentation models capture structure across different receptive fields.

The loss combines binary cross-entropy and Dice-style overlap:

\[\mathcal{L} = \lambda_{\text{BCE}}\mathcal{L}_{\text{BCE}} + \lambda_{\text{Dice}}\mathcal{L}_{\text{Dice}}\]

Binary cross-entropy handles pixel-wise classification:

\[\mathcal{L}_{\text{BCE}} = - \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i) \right]\]

Dice loss focuses on region overlap:

\[\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_i y_i\hat{y}_i + \epsilon} {\sum_i y_i + \sum_i \hat{y}_i + \epsilon}\]
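The combined loss can be written out on flattened pixel lists as a sanity check. This is a plain-Python sketch rather than the training code: the equal weights of 0.5, the clamping constant, and the function names are assumptions, since the source does not state the values of $\lambda_{\text{BCE}}$ and $\lambda_{\text{Dice}}$.

```python
import math


def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over flattened pixels."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def dice_loss(y_true, y_pred, eps=1e-6):
    """One minus the smoothed Dice coefficient."""
    intersection = sum(y * p for y, p in zip(y_true, y_pred))
    denominator = sum(y_true) + sum(y_pred)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def combined_loss(y_true, y_pred, lambda_bce=0.5, lambda_dice=0.5):
    """Weighted sum of BCE and Dice; the weights here are placeholders."""
    return lambda_bce * bce_loss(y_true, y_pred) + lambda_dice * dice_loss(y_true, y_pred)


y_true = [1, 0, 1, 0]
good_prediction = [0.9, 0.1, 0.8, 0.2]
bad_prediction = [0.2, 0.8, 0.3, 0.7]
print(combined_loss(y_true, good_prediction) < combined_loss(y_true, bad_prediction))  # True
```

BCE penalizes each pixel independently, while Dice rewards overlap of the whole region, which helps when crosswalk pixels are a small fraction of the frame.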

The first CCTV fine-tuning pass reached:

best validation IoU: 88.9%
final train IoU: 97.0%
epochs: 30

That is already usable, but the gap between train and validation IoU suggests a familiar problem: the labeled dataset is small. The model can fit the 201 training images, but the target CCTV distribution is wider than those labels cover.

The project therefore uses semi-supervised learning.

Data Preparation Matters

Before the model can learn anything useful, the project has to convert video and image sources into a consistent segmentation dataset.

That preparation step is not glamorous, but it is where many applied computer vision projects succeed or fail. Crosswalk segmentation needs image-mask pairs that agree spatially. If a frame is resized, cropped, padded, or augmented, the mask must receive the same transformation.

The dataset workflow in the repository is notebook-driven:

extract frames from video sources,
prepare FPV training images,
visualize FPV and CCTV differences,
train the FPV baseline,
test the FPV model on CCTV,
fine-tune and pseudo-label CCTV data.

The important design decision is keeping the experiment stages separate. The project does not hide everything inside one giant script. It preserves the story of the research:

Build a source-domain baseline.
Measure how badly it transfers.
Add limited labeled target data.
Use target-domain unlabeled data.
Compare iteration 1 and iteration 2.

That structure makes the result easier to trust. If someone only reports the final 98.5% IoU, we do not know how much was learned from source data, how much came from manual labels, and how much came from pseudo-labels. Here, each stage has its own artifacts.

Architecture Choice

The final adaptation uses DeepLabV3 with a ResNet50 backbone. That is a reasonable choice for CCTV crosswalk segmentation because the object has both local and global structure.

The local structure is the stripe pattern. A model must identify repeated white bars, edges, and road-paint texture.

The global structure is the crosswalk polygon. A crosswalk is not just a set of stripes; it is a coherent region across the road. It has orientation, width, continuity, and geometric plausibility.

DeepLab-style models help because they can combine fine visual cues with broader context. In practice, the model needs to answer questions like:

Are these stripes part of a legal crosswalk or just lane markings?
Does the predicted region form a plausible crossing zone?
Does the mask remain stable under shadows, vehicles, and camera angle?
Can the model see the full crosswalk even when part of it is occluded?

For this phase, a binary segmentation model is the right abstraction. The output is not a bounding box, because a crosswalk is not always rectangular in image coordinates. The output is not a classification label, because the exact region matters for later speed estimation. A dense mask is the useful representation.

Pseudo-Labeling

Pseudo-labeling is the bridge between scarce manual labels and abundant unlabeled data.

Let the labeled set be:

\[\mathcal{D}_L = \{(x_i, y_i)\}_{i=1}^{n}\]

and the unlabeled set be:

\[\mathcal{D}_U = \{u_j\}_{j=1}^{m}\]

After training an initial model $f_{\theta_1}$ on $\mathcal{D}_L$, we generate predicted masks for the unlabeled images:

\[\tilde{y}_j = f_{\theta_1}(u_j)\]

But we should not trust every prediction. The project filters pseudo-labels using a confidence score and geometric validation.

The idea is:

\[q_j = \frac{c_j + g_j}{2}\]

where:

$c_j$ is prediction confidence,
$g_j$ is geometric validity,
and $q_j$ is the final pseudo-label quality score.

The geometric check is important because crosswalk masks have physical structure. They should not occupy almost none of the frame, and they should not cover almost the whole frame. The project uses a reasonable crosswalk-area constraint, roughly checking whether the predicted mask occupies a plausible percentage of the image.

In simplified form:

\[g_j = \begin{cases} 1, & r_{\min} \leq \frac{|\tilde{y}_j|}{H W} \leq r_{\max} \\ 0, & \text{otherwise} \end{cases}\]

The repository describes the accepted area ratio range as about 5% to 40% of the frame.
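The quality score is simple enough to sketch directly. The bounds of 5% and 40% follow the range described above; the function names and the toy masks are hypothetical, and the real geometric check may include more than an area ratio.

```python
def geometric_validity(mask, r_min=0.05, r_max=0.40):
    """Return 1.0 if the mask covers a plausible share of the frame, else 0.0."""
    total = sum(len(row) for row in mask)
    if total == 0:
        return 0.0
    ratio = sum(sum(row) for row in mask) / total
    return 1.0 if r_min <= ratio <= r_max else 0.0


def quality_score(confidence, mask):
    """Average of prediction confidence and geometric validity."""
    return (confidence + geometric_validity(mask)) / 2.0


# 4 x 5 toy masks: one covers 20% of the frame, the other covers all of it.
plausible_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
full_mask = [[1] * 5 for _ in range(4)]

print(round(quality_score(0.976, plausible_mask), 3))  # 0.988
print(round(quality_score(0.976, full_mask), 3))       # 0.488
```

With a confidence of 0.976 and a valid geometry, the score is (0.976 + 1) / 2 = 0.988. A confident mask that covers the whole frame is pulled down to 0.488.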

From 5,926 unlabeled CCTV images, the system selected the top 1,000 high-confidence pseudo-labels using a threshold of 0.7. The top pseudo-label scores were extremely high:

top score: 0.988
median score: 0.988
confidence score: about 0.976
geometric validation: 1.0 for the top samples

This step increases the effective training set from 241 CCTV images to 1,241 images:

\[\mathcal{D}_{\text{train}} = \mathcal{D}_L \cup \{(u_j, \tilde{y}_j): q_j > \tau\}\]

The key is not just adding more data. It is adding more target-domain data: new lighting, roads, camera angles, vehicles, road widths, shadows, and crosswalk geometries.
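The selection step (threshold first, then keep the best-scoring masks) could look like the sketch below. The image IDs and scores are invented, and `select_pseudo_labels` is a hypothetical helper; only the 0.7 threshold and the top-1,000 cap come from the text.

```python
def select_pseudo_labels(scored, threshold=0.7, top_k=1000):
    """Keep items with quality above the threshold, best first, at most top_k.

    scored: list of (image_id, quality_score) pairs.
    """
    accepted = [(image_id, q) for image_id, q in scored if q > threshold]
    accepted.sort(key=lambda item: item[1], reverse=True)
    return accepted[:top_k]


# Toy scores for four unlabeled frames.
scored = [("frame_a", 0.988), ("frame_b", 0.65), ("frame_c", 0.90), ("frame_d", 0.71)]
selected = select_pseudo_labels(scored, threshold=0.7, top_k=2)
print(selected)  # [('frame_a', 0.988), ('frame_c', 0.9)]
```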

Why Confidence Alone Is Not Enough

One nice detail in this project is that pseudo-label selection is not based on raw model confidence alone.

Raw confidence can be misleading. A segmentation model can be confidently wrong, especially after domain shift. It may draw a large mask over a road area, produce a tiny blob near a bright lane marking, or hallucinate a crosswalk where none exists. If we accept those masks because the probabilities are high, pseudo-labeling becomes error amplification.

That is why geometric validation matters. Crosswalks have expected shape and scale. They usually occupy a meaningful but limited part of the frame. They are often elongated and road-aligned. A simple geometry rule will not solve all errors, but it can reject many obviously bad pseudo-labels.

This gives the pseudo-labeling step two checks:

Does the model believe the mask?
Does the mask look physically plausible?

That is a strong pattern for applied AI. Confidence should be combined with domain knowledge whenever possible.

Iteration 2: Training with Pseudo-Labels

The second iteration retrains on the combined dataset:

241 manually labeled CCTV images
1,000 pseudo-labeled CCTV images
1,241 total target-domain examples

The learning rate is reduced to stabilize adaptation. The model has already learned the rough CCTV crosswalk concept, so the next stage is refinement and generalization.

The result is the main achievement of Phase 1:

Iteration 1 validation IoU: 88.9%
Iteration 2 validation IoU: 98.5%
improvement: +9.6 percentage points
inference time: 12.98 ms
throughput: 77.03 FPS

The final visualization shows why the result is convincing. The predicted mask aligns closely with the manual ground-truth crosswalk region, even under overhead perspective and real CCTV conditions.

This is exactly what a Phase 1 module should provide: a stable, fast crosswalk mask that later components can use as the spatial foundation for pedestrian tracking.

Reading the Result Carefully

The final result figure is impressive, but it is worth reading it like a researcher instead of only like a scoreboard.

The top row shows an original CCTV image, a manual ground-truth mask, and a model prediction. The prediction nearly overlaps the target crosswalk region, and the figure reports an IoU around 0.99 for that example.

The bottom plots tell the training story. Iteration 1 improves quickly but plateaus below the final result. After pseudo-labeling is added, iteration 2 starts from a much stronger place and pushes validation IoU close to the training IoU. This means the additional target-domain data did not only help the model memorize. It helped it generalize better across the validation samples.

The result is also practical because the prediction is spatially clean. A noisy mask with many disconnected blobs would be hard to use for tracking. A clean crosswalk polygon can be post-processed into a stable region of interest.

Why the Result Works

The performance jump is not magic. It comes from three design choices working together.

First, transfer learning gives the model a useful starting point. The FPV source task still teaches stripes, road texture, crosswalk shape, and segmentation boundaries. Even if the viewpoint is wrong, the learned representation is better than random initialization.

Second, small labeled CCTV fine-tuning anchors the model to the target viewpoint. The 241 manual masks teach the overhead camera geometry.

Third, pseudo-labeling expands the target domain. The extra 1,000 pseudo-labels expose the model to many more CCTV conditions without requiring full manual annotation.

The training process can be summarized as:

\[\theta_s = \arg\min_{\theta} \sum_{(x,y)\in\mathcal{D}_s} \mathcal{L}(f_\theta(x), y)\]

for the source FPV model, then:

\[\theta_1 = \arg\min_{\theta} \sum_{(x,y)\in\mathcal{D}_L} \mathcal{L}(f_\theta(x), y)\]

for the first CCTV adaptation, and finally:

\[\theta_2 = \arg\min_{\theta} \left[ \sum_{(x,y)\in\mathcal{D}_L} \mathcal{L}(f_\theta(x), y) + \sum_{(u,\tilde{y})\in\mathcal{D}_P} w(u)\mathcal{L}(f_\theta(u), \tilde{y}) \right]\]

where $\mathcal{D}_P$ is the pseudo-labeled set and $w(u)$ can be interpreted as a confidence weight. Even when the implementation uses selected pseudo-labels rather than explicit continuous weights, the idea is the same: high-confidence pseudo-labels are allowed to influence training.

This is data-efficient domain adaptation.

Real-Time Requirement

A research model is not enough for an adaptive signal system. The system must run fast enough to support live CCTV.

At 512 by 512 resolution, the final model runs at about 77 FPS, corresponding to roughly 12.98 ms per frame. That exceeds a 30 FPS real-time target:

\[\text{FPS} = \frac{1000}{t_{\text{ms}}}\]

With $t_{\text{ms}} = 12.98$:

\[\text{FPS} \approx \frac{1000}{12.98} \approx 77.0\]

That speed matters because Phase 2 will add more computation: pedestrian detection, tracking, trajectory smoothing, and speed estimation. If segmentation already consumes the full time budget, the later system will fail. A fast segmentation module leaves room for the rest of the pipeline.

In deployment terms, this means the segmentation model can run as a front-end perception module. It does not need to recompute a brand-new crosswalk mask at every frame if the camera is fixed. A practical system could compute or update the mask periodically, stabilize it across time, and then use it as a fixed region for tracking. That would save compute for pedestrian detection and trajectory estimation.

For fixed CCTV, the crosswalk itself is mostly static. The hard part is not that the crosswalk moves; it is that lighting, shadows, weather, vehicles, and occlusions change. So the segmentation module can be used in two ways:

as a one-time or periodic crosswalk locator,
and as a robustness check when the scene changes significantly.

This is important for edge deployment. On a Jetson-like device, every millisecond matters.

Toward Speed Estimation

Once the crosswalk mask is stable, Phase 2 can estimate walking speed.

For a tracked pedestrian, suppose their position in image coordinates at time $t$ is:

\[p_t = (x_t, y_t)\]

A tracker such as DeepSORT can maintain an identity across frames:

\[\mathcal{T}_k = \{p_{t_1}, p_{t_2}, \ldots, p_{t_n}\}\]

To estimate physical speed, image movement must be converted into meters. With a calibration function $H$ or a pixel-to-meter scale, positions can be projected into ground-plane coordinates:

\[P_t = H(p_t)\]

Then walking speed can be estimated as:

\[v = \frac{\|P_{t_b} - P_{t_a}\|_2} {t_b - t_a}\]
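A minimal sketch of the projection and two-point speed estimate, assuming $H$ is a 3 by 3 planar homography from pixels to ground-plane meters. The matrix below is a made-up pure scale of 0.02 m per pixel; a real deployment would need a calibrated $H$.

```python
import math


def project(H, p):
    """Apply a 3x3 homography H to a pixel point p = (x, y)."""
    x, y = p
    X = H[0][0] * x + H[0][1] * y + H[0][2]
    Y = H[1][0] * x + H[1][1] * y + H[1][2]
    W = H[2][0] * x + H[2][1] * y + H[2][2]
    return (X / W, Y / W)


def speed_mps(H, p_a, t_a, p_b, t_b):
    """Ground-plane distance between two observations divided by elapsed seconds."""
    if t_b <= t_a:
        raise ValueError("t_b must be later than t_a")
    return math.dist(project(H, p_a), project(H, p_b)) / (t_b - t_a)


# Hypothetical calibration: 1 pixel = 0.02 m, no perspective distortion.
H_scale = [[0.02, 0.0, 0.0],
           [0.0, 0.02, 0.0],
           [0.0, 0.0, 1.0]]

v = speed_mps(H_scale, (100, 200), 0.0, (100, 260), 1.5)
print(f"estimated speed: {v:.2f} m/s")
```

Here a 60-pixel displacement over 1.5 seconds corresponds to about 0.8 m/s under the assumed scale.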

The signal timing problem is then straightforward. If the crosswalk length is $L$ and a pedestrian walks at speed $v$, the required crossing time is:

\[T_{\text{required}} = \frac{L}{v} + T_{\text{safety}}\]

The safety issue appears when the assumed design speed is too high:

\[T_{\text{standard}} = \frac{L}{v_{\text{standard}}}\]

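If the pedestrian's actual speed is $v = v_{\text{elderly}} < v_{\text{standard}}$, then $L / v_{\text{elderly}} > L / v_{\text{standard}}$, and since $T_{\text{safety}} \ge 0$: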

\[T_{\text{required}} > T_{\text{standard}}\]

That is the entire motivation of the project in one inequality. Slower pedestrians need more crossing time. A vision system can estimate when that extra time is needed.
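The two timing formulas above can be checked with a small worked example. The crosswalk length, speeds, and safety margin below are illustrative numbers only; they are not taken from the project or from any signal-timing standard.

```python
def crossing_time(length_m, speed_mps, safety_s=0.0):
    """T = L / v + T_safety."""
    if speed_mps <= 0:
        raise ValueError("speed must be positive")
    return length_m / speed_mps + safety_s


def extra_time_needed(length_m, observed_speed_mps, standard_speed_mps, safety_s):
    """How much longer than the standard assumption this pedestrian needs."""
    required = crossing_time(length_m, observed_speed_mps, safety_s)
    standard = crossing_time(length_m, standard_speed_mps)
    return max(0.0, required - standard)


# Illustrative values: 20 m crosswalk, 1.0 m/s design speed,
# 0.8 m/s observed speed, 3 s safety margin.
extra = extra_time_needed(20.0, 0.8, 1.0, 3.0)
print(f"extra crossing time: {extra:.1f} s")
```

With these numbers the standard assumption gives 20 s, the slower pedestrian needs about 28 s, and the gap is about 8 s.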

The next research challenge is not only measuring speed, but measuring it reliably. A pedestrian may pause, turn, walk diagonally, start late, or be occluded by a vehicle. A simple two-point speed estimate can be noisy. A stronger version would estimate speed over a track:

\[v_k = \frac{1}{n-1} \sum_{i=1}^{n-1} \frac{\|P_{t_{i+1}} - P_{t_i}\|_2} {t_{i+1}-t_i}\]

Then the system can smooth speed over time:

\[\bar{v}_t = \beta v_t + (1-\beta)\bar{v}_{t-1}\]

This kind of smoothing prevents one noisy frame from causing unstable signal decisions.
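Both formulas are short enough to sketch directly. The code assumes positions are already in ground-plane meters and that the smoothed value is initialized with the first measurement, which the text leaves unspecified.

```python
import math


def track_speed(points, times):
    """Mean of per-segment speeds along a track of ground-plane points."""
    if len(points) < 2 or len(points) != len(times):
        raise ValueError("need at least two points with matching timestamps")
    segment_speeds = [
        math.dist(points[i + 1], points[i]) / (times[i + 1] - times[i])
        for i in range(len(points) - 1)
    ]
    return sum(segment_speeds) / len(segment_speeds)


def smooth(speeds, beta):
    """Exponential smoothing: v_bar_t = beta * v_t + (1 - beta) * v_bar_{t-1}."""
    smoothed = []
    previous = None
    for v in speeds:
        previous = v if previous is None else beta * v + (1 - beta) * previous
        smoothed.append(previous)
    return smoothed


track = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
stamps = [0.0, 1.0, 2.0, 3.0]
print(track_speed(track, stamps))
print(smooth([1.0, 1.0, 3.0], beta=0.5))
```

A smaller `beta` reacts more slowly to a single outlier frame, at the cost of lagging behind genuine changes in walking speed.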

What Makes This Project Strong

The strongest part of this project is that it does not jump directly to signal control. It builds the perception foundation first.

A weaker version of the project would try to detect pedestrians everywhere in the frame and estimate speed immediately. But without a crosswalk mask, the system would have too much ambiguity. People on sidewalks, people waiting near the curb, cyclists, reflections, and vehicles could all interfere with the logic.

This project begins with the spatial prior:

Where is the crosswalk?

Once that is known, later modules can ask better questions:

Is a person inside the crosswalk?
How long have they been crossing?
Are they moving slower than expected?
Are they likely to remain in the crosswalk when the signal changes?
How much extra green time is needed?

This is good system design. Perception, tracking, speed estimation, and control are separated into stages.
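The first two questions in that list reduce to simple lookups once a mask exists. The sketch below assumes the mask is a row-major binary grid and that each track entry is a timestamp plus the pedestrian's foot position in pixel coordinates; both representations are assumptions for illustration.

```python
def inside_mask(mask, point):
    """True if the pixel point (x, y) falls on a crosswalk pixel of the mask."""
    x, y = point
    row, col = int(round(y)), int(round(x))
    if 0 <= row < len(mask) and 0 <= col < len(mask[0]):
        return mask[row][col] == 1
    return False


def time_in_crosswalk(mask, track):
    """Seconds between the first and last in-mask observation of a track."""
    times = [t for t, point in track if inside_mask(mask, point)]
    return times[-1] - times[0] if times else 0.0


crosswalk = [[0, 0, 0, 0],
             [0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 0, 0, 0]]
walker = [(0.0, (0, 0)), (0.5, (1, 1)), (1.0, (2, 2)), (1.5, (3, 3))]
print(time_in_crosswalk(crosswalk, walker))
```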

Limitations

The Phase 1 result is strong, but it should be interpreted carefully.

First, the final model is evaluated on the available CCTV validation set. More cities, weather conditions, camera heights, road geometries, and nighttime scenes would be needed before claiming broad deployment readiness.

Second, pseudo-labeling works best when the first model is already good enough. If the first fine-tuned model produces systematically wrong masks, pseudo-labeling can amplify errors.

Third, segmentation accuracy does not guarantee tracking accuracy. Phase 2 still needs robust pedestrian detection and identity tracking under occlusion, crowding, shadows, vehicles, and low-light conditions.

Fourth, signal timing is a control problem, not only a vision problem. It must consider traffic rules, pedestrian signals, vehicle flow, fairness, and safety margins.

These limitations do not weaken the project. They define the next research steps.

What I Would Improve Next

If I were extending this project, I would keep the staged design and add a few evaluation layers.

First, I would create a held-out CCTV benchmark split by condition: day, night, rain, heavy traffic, low traffic, bus occlusion, wide intersection, narrow intersection, and unusual camera angle. A single validation number is useful, but condition-specific metrics tell us where the model is fragile.

Second, I would add temporal stability metrics. Since CCTV is video, the mask should not flicker frame by frame. Even if per-frame IoU is high, unstable edges can hurt downstream tracking. A simple temporal consistency score could compare consecutive predictions:

\[\text{TC}_t = \text{IoU}(\hat{Y}_t, \hat{Y}_{t-1})\]

High temporal consistency would mean the crosswalk region remains stable unless the scene truly changes.
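A minimal sketch of that score on binary masks stored as nested lists. Treating two empty masks as perfectly consistent (IoU of 1.0) is a convention chosen here, not something the text specifies.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two same-sized binary masks."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    return intersection / union if union else 1.0


def temporal_consistency(masks):
    """TC_t = IoU(mask_t, mask_{t-1}) for every consecutive pair of predictions."""
    return [iou(masks[t], masks[t - 1]) for t in range(1, len(masks))]


frames = [
    [[1, 1], [0, 0]],
    [[1, 1], [0, 0]],   # unchanged prediction
    [[1, 0], [0, 0]],   # one pixel flickers off
]
print(temporal_consistency(frames))
```

A sustained drop in this sequence would suggest either a real scene change or a flickering prediction worth inspecting.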

Third, I would add uncertainty maps. If the model is uncertain near the boundary or under occlusion, later stages should know that. A tracker can behave differently when the crosswalk ROI is high-confidence versus partially uncertain.

Fourth, I would connect the segmentation output to a small tracking prototype. Even a simple YOLO + DeepSORT baseline inside the predicted ROI would validate that Phase 1 provides the right representation for Phase 2.

Fifth, I would document failure cases as carefully as successes. The best projects show where the model breaks. Night reflections, bus occlusion, unusual crosswalk paint, construction zones, and wet roads are not edge cases for deployment; they are normal urban conditions.

Lessons

There are a few lessons I would take from this work.

The first lesson is that domain shift is real. A model that works in FPV does not automatically work in CCTV.

The second lesson is that small labeled datasets can still be powerful if used strategically. The 241 CCTV labels were enough to bootstrap a useful target-domain model.

The third lesson is that pseudo-labeling is most useful when filtered. The important contribution is not generating 5,926 masks. It is selecting the 1,000 masks that are confident and geometrically plausible.

The fourth lesson is that real-time performance should be measured early. A safety system cannot wait until the end to discover that it is too slow.

Conclusion

The crosswalk CCTV project is a strong example of applied computer vision research because it connects model design to a real public-safety workflow.

The system starts with a concrete social problem: elderly pedestrians may need more crossing time than standard signal assumptions provide. It then builds a technical foundation: crosswalk segmentation from CCTV. It handles the domain gap from FPV to overhead camera views, uses a small manually labeled CCTV set, expands it with confidence-filtered pseudo-labels, and reaches 98.5% IoU at 77 FPS.

That makes Phase 1 ready to support the next stage: pedestrian tracking and speed estimation inside the detected crosswalk region.

You can read the code, inspect the notebooks, and fork the project here:

GitHub: rashiedomar/crosswalk-cctv

Debugging Vision Agents

2026-06-01T00:00:00+09:00

Vision-language models are becoming very good at describing images, answering visual questions, reading charts, counting objects, and supporting multimodal workflows. But when a model gives an answer, the most important engineering question often comes after the answer:

Why did it say that?

If a Vision AI agent says there are three damaged buildings in a satellite image, did it actually focus on the damaged buildings? If it says the red car is on the left, did it inspect the car or guess from the prompt? If two models disagree, is one missing a visual detail, or are they using different assumptions about the task?

That is the motivation behind Vision Agent Debugger: a small but useful tool for making Vision AI behavior easier to inspect. The project combines a React frontend, a FastAPI backend, CLIP-based heatmap generation, multi-model comparison, reasoning-step extraction, error detection, and cost tracking.

It is not a complete interpretability research framework. It is something more practical: a debugging surface for people who build with vision models and want to see more than a final text response.

The Problem

Most Vision LLM interfaces hide the process. You upload an image, write a prompt, wait for a model, and receive an answer. That is fine for demos, but it is weak for development.

For real projects, especially remote sensing, public data dashboards, field monitoring, damage assessment, surveillance review, medical imaging support, and industrial inspection, we need to ask sharper questions:

Did the model attend to the correct region?
Did the prompt cause the model to focus on the wrong visual concept?
Did two models disagree on the object, count, location, or conclusion?
Did the model fail because of perception, reasoning, or API/configuration issues?
How much did the analysis cost?

Debugging a vision agent means separating the final answer from the evidence path. The answer is only one artifact. A useful debugger should also expose visual focus, reasoning traces, model disagreement, and failure states.

What the Project Builds

Vision Agent Debugger gives a user three core workflows:

Upload an image and prompt.
Generate a visual heatmap for the image-prompt pair.
Compare responses from multiple vision models side by side.

The backend exposes several endpoints:

/api/generate-heatmap: creates a heatmap from an image and prompt.
/api/analyze-image: runs selected models and returns responses, reasoning steps, errors, and cost.
/api/compare-models: compares Gemini, Claude, and GPT-4V-style responses.
/api/debug-agent: combines heatmap generation with model comparison.

The frontend wraps those outputs into a debugging interface: upload panel, prompt box, heatmap overlay, model comparison cards, reasoning step views, error warnings, and total cost display.

The stack is intentionally simple:

React, Vite, Tailwind CSS, and Canvas-style visual overlay patterns on the frontend.
FastAPI on the backend.
CLIP for image-text representation.
Gemini, Claude, and OpenAI-compatible vision calls for multi-model analysis.

This is a good architecture for a prototype because each part has a clear job. The frontend does interaction and presentation. The backend does model orchestration. CLIP creates a visual explanation signal. The model analyzer standardizes responses into a comparable format.

A Small Example

The repository includes a test image with multiple cars in different colors. A prompt might ask something like:

Which cars are visible in the image?

or:

Count the red cars.

The original image is simple, but that is useful for debugging because the expected answer is visually obvious.

The heatmap output gives a spatial debugging layer. In the current prototype, it is a smoothed approximation of which image regions relate most strongly to the prompt, not an exact attention map.

This kind of image is not a final scientific explanation by itself. But as a developer tool, it is useful because it makes the debugging conversation more concrete. Instead of only asking “was the answer correct?”, we can ask “does the visual evidence look aligned with the task?”

The Math Behind the Heatmap

A vision-language debugger needs some way to connect text to image regions. CLIP gives us a useful starting point because it maps images and text into a shared embedding space.

Let an image be $I$, and let the prompt be $t$. CLIP has an image encoder $f_I$ and a text encoder $f_T$:

\[z_I = f_I(I), \qquad z_T = f_T(t)\]

Both embeddings are usually normalized:

\[\hat{z}_I = \frac{z_I}{\|z_I\|_2}, \qquad \hat{z}_T = \frac{z_T}{\|z_T\|_2}\]

The image-text similarity can be measured using cosine similarity:

\[s(I,t) = \hat{z}_I^\top \hat{z}_T\]

That gives one global score. But a heatmap needs spatial scores. For a Vision Transformer such as CLIP ViT-B/32, the image is divided into patches. With a 224 by 224 image and a patch size of 32, we get a 7 by 7 grid:

\[N = 7 \times 7 = 49\]

Ideally, each patch has an embedding $p_i$, normalized in the same way to $\hat{p}_i = p_i / \|p_i\|_2$. A prompt-aware patch score can be written as:

\[a_i = \frac{\exp(\tau \cdot \hat{p}_i^\top \hat{z}_T)} {\sum_{j=1}^{N} \exp(\tau \cdot \hat{p}_j^\top \hat{z}_T)}\]

Here, $a_i$ is the attention-like weight for patch $i$, and $\tau$ is a scale factor (an inverse temperature) that controls how sharp the distribution becomes: a larger $\tau$ concentrates the weight on the best-matching patches.
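The patch score is a softmax over scaled cosine similarities, which can be sketched without any model at all. The 2-D vectors below are toy stand-ins for real CLIP patch and text embeddings, and the function names are illustrative.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def patch_weights(patch_embeddings, text_embedding, tau):
    """Softmax over tau * cosine(patch_i, text) across all patches."""
    z_t = l2_normalize(text_embedding)
    logits = [
        tau * sum(a * b for a, b in zip(l2_normalize(p), z_t))
        for p in patch_embeddings
    ]
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy stand-ins, not CLIP outputs
toy_text = [1.0, 0.0]
soft = patch_weights(toy_patches, toy_text, tau=1.0)
sharp = patch_weights(toy_patches, toy_text, tau=10.0)
print(soft, sharp)
```

Raising `tau` from 1 to 10 moves most of the weight onto the first patch, the one aligned with the text vector.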

After computing patch scores, we can reshape them into a 7 by 7 map:

\[A = \text{reshape}(a_1,\ldots,a_{49})\]

Then we normalize the map:

\[M = \frac{A - \min(A)}{\max(A) - \min(A) + \epsilon}\]

Finally, we resize $M$ to the original image size, convert it into a color map, and overlay it on the image:

\[O = (1-\alpha)I + \alpha C(M)\]

where $C(M)$ is the colored heatmap and $\alpha$ is the overlay opacity.

That is the clean mathematical version. The current repository implementation is an early engineering approximation: it uses CLIP embeddings and generates a smoothed spatial map rather than extracting full transformer patch attention end to end. That is okay for a prototype, but it is important to say clearly. A production research version should replace the approximation with patch-level CLIP similarity, attention rollout, Grad-CAM-style gradients, or segmentation-aware region scoring.

The important idea is still the same: the debugger turns a hidden image-text relationship into a visible spatial artifact.
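The reshape, normalize, resize, and overlay steps from the equations above can be sketched in plain Python. This is not the repository's implementation: nearest-neighbor resizing and the blue-to-red color ramp are simplifications chosen here, and all function names are hypothetical.

```python
def reshape(values, rows, cols):
    """Turn a flat list of rows * cols scores into a 2-D grid."""
    if len(values) != rows * cols:
        raise ValueError("wrong number of values")
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


def min_max(grid, eps=1e-8):
    """M = (A - min A) / (max A - min A + eps)."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    return [[(v - lo) / (hi - lo + eps) for v in row] for row in grid]


def upsample_nearest(grid, height, width):
    """Resize a small grid to height x width by nearest-neighbor lookup."""
    rows, cols = len(grid), len(grid[0])
    return [[grid[y * rows // height][x * cols // width] for x in range(width)]
            for y in range(height)]


def colorize(m):
    """Toy color map C(M): blue for 0, red for 1 (an assumption, not a standard map)."""
    return (255.0 * m, 0.0, 255.0 * (1.0 - m))


def overlay(image, heat, alpha):
    """O = (1 - alpha) * I + alpha * C(M), applied per pixel and per channel."""
    return [[tuple((1 - alpha) * c + alpha * h for c, h in zip(pixel, colorize(m)))
             for pixel, m in zip(image_row, heat_row)]
            for image_row, heat_row in zip(image, heat)]


scores = [float(i) for i in range(49)]       # placeholder patch scores
heat_full = upsample_nearest(min_max(reshape(scores, 7, 7)), 224, 224)
gray = [[(100.0, 100.0, 100.0)] * 2 for _ in range(2)]
blended = overlay(gray, [[0.0, 1.0], [0.5, 0.5]], alpha=0.5)
print(len(heat_full), len(heat_full[0]), blended[0][0])
```

A smoother result would use bilinear interpolation and a perceptual color map, but the data flow stays the same.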

Model Comparison as a Debugging Tool

Heatmaps help with visual focus, but they do not solve the whole problem. Vision agents also fail through reasoning, phrasing, counting, ambiguity, or hallucination.

That is why the project includes multi-model comparison.

For each model $m$, the debugger stores a response object:

\[R_m = (y_m, S_m, e_m, c_m)\]

where:

$y_m$ is the model answer.
$S_m$ is the extracted reasoning-step list.
$e_m$ is the error state.
$c_m$ is the estimated cost.

Once all models return, the tool can compare them:

\[\mathcal{R} = \{R_{\text{Gemini}}, R_{\text{Claude}}, R_{\text{GPT}}\}\]

The current implementation checks API failures and simple contradiction patterns. For example, if one model says an object is visible and another says no object is present, that disagreement becomes a warning.

This is simple, but the workflow is powerful. A disagreement is often the most useful signal during debugging. If three models agree, the answer may still be wrong, but confidence rises. If one model disagrees strongly, the user knows where to inspect.
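To make the response object and the warning logic concrete, here is a minimal sketch. The field names, the phrase list, and the contradiction rule are assumptions made for illustration; the repository's actual patterns may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical phrases that signal "nothing is there"; not taken from the repo.
ABSENCE_PHRASES = ("no object", "not visible", "cannot see", "nothing")


@dataclass
class ModelResponse:
    """R_m = (y_m, S_m, e_m, c_m) plus the model name."""
    model: str
    answer: str
    steps: List[str] = field(default_factory=list)
    error: Optional[str] = None
    cost: float = 0.0


def says_absent(answer: str) -> bool:
    text = answer.lower()
    return any(phrase in text for phrase in ABSENCE_PHRASES)


def find_warnings(responses: List[ModelResponse]) -> List[str]:
    """Report API failures and presence-versus-absence disagreements."""
    warnings = [f"{r.model}: API error: {r.error}" for r in responses if r.error]
    usable = [r for r in responses if not r.error]
    absent = [r.model for r in usable if says_absent(r.answer)]
    present = [r.model for r in usable if not says_absent(r.answer)]
    if absent and present:
        warnings.append(f"contradiction: {present} describe content, {absent} report absence")
    return warnings


batch = [
    ModelResponse("model-a", "There are two red cars on the left."),
    ModelResponse("model-b", "No object is present in the image."),
    ModelResponse("model-c", "", error="missing API key"),
]
for warning in find_warnings(batch):
    print(warning)
```

Keyword matching like this is brittle, which is part of why a semantic comparison is the natural next step.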

Future versions could make this much stronger by scoring semantic disagreement between answers:

\[d_{ij} = 1 - \cos(g(y_i), g(y_j))\]

where $g(\cdot)$ is a sentence embedding model. Large disagreement scores could trigger deeper review, especially for high-stakes visual tasks.
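The disagreement score itself is easy to sketch. In the code below a bag-of-words count vector stands in for $g(\cdot)$ so the example runs without a model; a real sentence embedding would capture paraphrases that this stand-in misses.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for g(.): word counts instead of a learned sentence embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a)   # Counter returns 0 for missing keys
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disagreement(answer_i, answer_j):
    """d_ij = 1 - cos(g(y_i), g(y_j))."""
    return 1.0 - cosine(embed(answer_i), embed(answer_j))


print(disagreement("three red cars", "three red cars"))
print(disagreement("three red cars", "two red cars"))
print(disagreement("red car", "blue truck"))
```

Pairs whose score exceeds a chosen threshold could then be routed to a reviewer.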

Reasoning Steps

The debugger also extracts reasoning steps from model responses. It looks for numbered lists, bullet points, or sentence structure, then turns the output into a sequence:

\[S_m = [s_1, s_2, \ldots, s_k]\]

This does not reveal the model’s true internal reasoning. It reveals the explanation the model produced. That distinction matters. But even explanation traces are useful when debugging prompts and workflows.
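A minimal sketch of that extraction, assuming a list-first strategy with a sentence-split fallback. The regular expressions are illustrative choices, not the repository's actual rules.

```python
import re

# Matches "1. text", "2) text", "- text", "* text", or "• text".
_LIST_ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+(.*\S)\s*$")


def extract_steps(response: str):
    """Return list items if the response has any, otherwise its sentences."""
    items = []
    for line in response.splitlines():
        match = _LIST_ITEM.match(line)
        if match:
            items.append(match.group(1))
    if items:
        return items
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    return [s for s in sentences if s]


numbered = "1. Look at the image.\n2. Find the red cars.\n3. Count them: two."
plain = "I see several cars. Two of them are red."
print(extract_steps(numbered))
print(extract_steps(plain))
```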

For example, suppose a model answers a counting task incorrectly. The extracted steps may show whether it:

described the image generally,
identified the correct object type,
confused color or position,
skipped occluded instances,
or made a counting error at the end.

That helps the developer decide what to fix. The prompt might need more constraints. The image might need cropping. The model might need a better visual grounding step before reasoning.

Error Detection

The project’s error detector handles two practical cases:

API errors, such as missing keys or failed model calls.
Contradictions between model responses.

This may sound small, but it is a useful engineering layer. In multimodal apps, the failure mode is often not “the model is bad.” Sometimes the API key is missing. Sometimes one provider returns an error. Sometimes the frontend shows an incomplete response. Sometimes cost or token limits change behavior.

By making errors visible, the tool turns silent failure into an inspectable state.

Cost Tracking

Cost is part of debugging too. If a system calls three large vision models every time a user uploads an image, the workflow may become expensive quickly.

The debugger estimates cost per model and total cost:

\[C_{\text{total}} = \sum_{m \in \mathcal{M}} c_m\]

This makes it easier to compare accuracy, latency, and cost together. A cheap model may be enough for broad image descriptions. A stronger model may be worth the price for detailed visual reasoning. A debugger should help the developer see those tradeoffs directly.
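As a small illustration of the cost sum, the sketch below estimates a per-model cost from token counts and adds the results. The token-based pricing scheme, the model names, and every price are placeholders invented for the example; real provider pricing differs and changes over time.

```python
def estimate_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of one call under a simple per-1k-token pricing assumption."""
    return (input_tokens / 1000.0) * price_in_per_1k \
        + (output_tokens / 1000.0) * price_out_per_1k


# Placeholder usage and prices, chosen only to make the arithmetic visible.
calls = {
    "model-a": estimate_cost(1000, 500, 0.01, 0.03),
    "model-b": estimate_cost(1000, 400, 0.002, 0.004),
    "model-c": estimate_cost(1200, 600, 0.005, 0.015),
}
total_cost = sum(calls.values())             # C_total = sum of c_m over models
for name, cost in calls.items():
    print(f"{name}: ${cost:.4f}")
print(f"total: ${total_cost:.4f}")
```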

What I Like About This Project

The strongest part of Vision Agent Debugger is not that each component is perfect. It is that the project has the right shape.

A good Vision AI debugging tool should combine four views:

what the image contains,
where the model may be looking,
what the model says,
and how different models disagree.

This repo already has those pieces. It is easy to imagine extending it into a stronger research or production tool.

For example:

Replace the current heatmap approximation with patch-level CLIP similarity.
Add bounding-box or segmentation overlays.
Add prompt version tracking.
Add semantic disagreement scoring across model outputs.
Add task-specific metrics for counting, detection, chart reading, or remote sensing.
Save debug sessions so users can compare failures over time.
Add side-by-side original, heatmap, model answer, and ground-truth annotation.

For remote sensing, the same idea becomes even more interesting. A user could upload a flood image, satellite crop, or urban change pair and ask: did the model focus on the flooded area, the changed buildings, the roads, or the irrelevant background?

That is where debugging becomes research infrastructure.

Conclusion

Vision agents should not be treated as magic boxes. If they are going to help with real visual decisions, we need tools that expose where they looked, what they answered, how they explained themselves, where they failed, and how much the process cost.

Vision Agent Debugger is a practical first step in that direction. It combines CLIP heatmaps, model comparison, reasoning traces, error detection, and cost tracking into one workflow. It is honest as a prototype and useful as a foundation.

The next step is to make the visual explanations more faithful and the disagreement detection more semantic. But even now, the project points in the right direction: Vision AI systems become more useful when their behavior can be inspected.

You can read the code, run the app, and fork the project here:

GitHub: rashiedomar/vision-agent-debugger