How we cut web fuzzing FPs by 50% using Machine Learning

As much as we enjoy our work as security practitioners, no one’s giving us back the hours lost to the void of false positives.
Between the pressure to deliver actionable remediation plans, the noise many tools create, and our personal quality standards, things can get… frustrating - fast.
How do we know this?
From three sources:
Talking to our customers almost daily
Eating our own dog food (aka using our own product)
Listening to our colleagues in offensive security services, who battle-test Pentest-Tools.com in every engagement (and don't hold back on feedback).
That’s why detection accuracy has always been our north star across every part of our product. If you’re a customer, you’ve seen this in our monthly updates, in our benchmarks, and, most importantly, in your findings.
Most recently, we focused our obsession with quality on a key capability in our web vulnerability tools: web fuzzing.
Like you, we battled the complexities of dynamic content, multilingual sites, and the endless variations of "not found."
So we reimagined web application classification, moving beyond endless, simple rules to build something smarter: our own Machine Learning classifier that can discern signal from noise - fast and at scale. You can read all about how the initiative started in the first part of this story.
This new capability led to a game-changing 50% reduction in fuzzing false positives and a whopping 92% accuracy (up from 75%).
We’re excited to share the behind-the-scenes details with you and see what it looks like when you take it for a spin!
The frustrating reality: when good tools go noisy
We don’t know a single security specialist who hasn’t felt the sting of traditional web fuzzing and scanning tools. While they promise speed, the truth is often a different story.
Traditional web fuzzing and scanning tools rely on rule-based or regex-heavy classifiers. And, while these methods are fast, they’re brittle. They often mislabel pages, fail to detect dynamic content, and don’t generalize well across different web stacks or languages.
That creates several painful challenges:
Missed sensitive content (false negatives)
Too many hours wasted reviewing false positives
Time spent manually adjusting fuzzing tool filters
These problems are bound to push anyone’s buttons. But it doesn’t have to be this way.
This is why we turned to ML to inject more intelligence and automation into the heart of our scanning pipeline.
Our answer: a smarter, more nuanced classification model
Since it’s not our style to simply complain about a problem, we focused on building a better solution – one that truly understood the nuances of web application vulnerability assessment.
First, we stepped back and looked at the core of the issue:
How do we actually understand what a web server is throwing back at us during a fuzzing session?
Instead of relying on blunt, easily fooled rules, we reimagined the problem as a Machine Learning classification task.
We trained an ML model to differentiate between various types of responses, each with its own implication during a security assessment:
The "a-ha!" moment (HIT)
This is the gold! Content that screams potential value or directly matches what we were looking for. Imagine the satisfaction of instantly pinpointing login portals, admin dashboards, exposed code snippets, configuration files, backups, development artifacts, debugging information, accidental data leaks, or even those oh-so-critical exposed API keys. Our model is trained to recognize these "wins" with high precision, even considering redirects to give you the full picture.
The "dead end" (MISS)
These are the 404 pages in all their countless forms. Whether it's a standard not found page or a creatively disguised "resource not here" message, our model quickly identifies these and helps you move on.
The "maybe worth a second look" (PARTIAL HIT)
These are the tricky ones – responses that aren't immediately a hit or a miss, but could still hold clues. Think of firewall responses, generic landing pages, or unrelated content that could become relevant in a broader attack scenario. Our model flags these for potential later investigation, preventing you from completely overlooking something potentially valuable.
The "Needs More Eyes" (INCONCLUSIVE)
The modern web is dynamic. Lots of content loads with JavaScript. Our model is smart enough to recognize when a page needs to be rendered with a browser like Chrome to truly understand its content. This crucial insight prevents misclassification and ensures you don't dismiss something important simply because its initial HTML is misleading.
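To make these four buckets concrete, here's a minimal sketch of what such a response taxonomy could look like in code - the names are our illustration for this article, not the actual implementation:

from dataclasses import dataclass
from enum import Enum

class Label(Enum):
    HIT = "hit"                    # valuable content: portals, backups, leaked keys
    MISS = "miss"                  # a 404 in any of its countless disguises
    PARTIAL_HIT = "partial_hit"    # firewall pages, generic landing pages
    INCONCLUSIVE = "inconclusive"  # needs browser rendering before judging

@dataclass
class ClassifiedResponse:
    url: str
    status_code: int
    label: Label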
What this means for you: less grind, more impact
So, we built this intelligent classifier. But what does that actually mean for your day-to-day work?
It boils down to this: more automation where it counts, leading to tangible benefits across the board.
Smarter automation powered by Machine Learning
Consultants who use Pentest-Tools.com for web app assessments reclaim significant chunks of their time. By letting our ML model handle the initial, often tedious, result classification, they can drastically reduce the hours spent reviewing raw scan data and jump straight to crafting insightful, client-ready reports - the ones that prove their expertise and experience.
Our Machine Learning classifier also helps MSPs/MSSPs deliver consistently high-quality results across all clients without their teams getting bogged down in manual triage. With greater efficiency comes the ability to handle more engagements - without sacrificing accuracy.
For internal security teams, our model makes navigating the complexities of large infrastructures less daunting. While initial vulnerability triage runs efficiently in the background - across thousands of assets - these time-pressed teams can focus on in-depth analysis and remediation of the truly critical issues that expose their organizations.
Real numbers, real impact on risk
This isn't just theoretical. We put our ML classifier to the test in real-world scenarios, comparing its performance against our previous, non-ML methods.
The results speak for themselves:
We saw a 50% reduction in the volume of fuzzing results from our Website Vulnerability Scanner. That's half the noise you have to comb through!
Our URL Fuzzer also experienced a significant 20% decrease in irrelevant findings.
Beyond just volume, the accuracy skyrocketed.
Our model boosted the F1-score from 75% (our previous best) to an impressive 92%, meaning fewer false positives and a much higher precision in identifying actual vulnerabilities.
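(For reference: F1 is the harmonic mean of precision and recall, F1 = 2 × P × R / (P + R), so it only climbs when a model is both precise and thorough; the weighted variant averages the per-class F1 scores by how often each class appears.)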
This translates directly into tangible advantages for you and your team:
Use this data to demonstrate clear, measurable improvements in your organization's risk posture and to prove the value of your assessments.
Focus your team's energy on addressing actual vulnerabilities, not chasing down ghost exposures.
Build stronger trust with stakeholders by providing them with cleaner, more accurate, and ultimately more reliable data about web app security risks.
Seamless integration, immediate value
This powerful HTML classifier isn't some standalone tool you need to wrestle with. It's baked right into Pentest-Tools.com, working seamlessly through the Website Vulnerability Scanner and URL Fuzzer you already know. We're also gradually adding it to more of our pentesting tools where error or 404 page detection is crucial.
There's no complicated setup – just smarter, more accurate scanning from the moment you run your next assessment.
Under the hood: how we made it so smart
Building a truly intelligent system isn't magic; it's meticulous engineering and a healthy dose of data science obsession.
Here's a peek at how we crafted our ML-powered HTML classifier to achieve that 92% accuracy.
Cleaning up the mess: our preprocessing power
Think of raw HTML as a messy kitchen after a cooking frenzy. It's full of tags, irrelevant bits, and inconsistencies.
To feed our ML model the good stuff, we built a preprocessing pipeline that meticulously transforms this chaos into order.
This pipeline works by:
Extracting and normalizing tags: we pull out the essential HTML tags and ensure they're presented in a consistent format.
Removing the unnecessary: we intelligently filter out irrelevant elements that would only confuse the ML model.
Handling the curveballs: our preprocessing is designed to carefully handle edge cases and unexpected structures, ensuring a smooth and consistent input for the classifier, no matter how quirky the original HTML.
Keeping things private: we took extra care to anonymize any references to specific domains or resources during this process.
Ensuring a fair learning environment: to minimize bias and ensure our model learns generalizable patterns, we meticulously curated our training data, using clever techniques (heuristics) to remove duplicates and ensure a diverse range of examples. We wanted our model to learn from a balanced and representative dataset.
The result of this heavy-duty preprocessing?
Not only does it dramatically speed up the training and the real-time analysis (inference), but it also guides our model away from latching onto misleading patterns. It focuses on the core semantic meaning of the HTML.
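To give you a feel for it, here's a simplified sketch of such a pipeline - not our production code, and we're assuming BeautifulSoup for parsing:

import re
from bs4 import BeautifulSoup

def preprocess(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")

    # Remove elements that carry no semantic signal for classification
    for element in soup(["script", "style", "noscript"]):
        element.decompose()

    # Extract the visible text plus a normalized view of the tag structure
    text = soup.get_text(separator=" ", strip=True)
    tags = " ".join(tag.name for tag in soup.find_all(True))

    # Anonymize references to specific domains and resources
    text = re.sub(r"https?://\S+", "<URL>", text)

    return f"{tags}\n{text}"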
The brains of the operation: our ML models
For the core intelligence, we didn't just pick any off-the-shelf solution.
We fine-tuned powerful LLaMA models (specifically versions 3.1 and 3.2, with 8 billion and 3 billion parameters respectively).
To make these large language models efficient for our needs, we used clever techniques like parameter-efficient LoRA training combined with quantization.
Think of it as giving a brilliant but resource-intensive expert the specific training they need for the task, while also making them incredibly efficient in their work.
This approach allows our model to understand the nuances of security-relevant content across various languages and a multitude of web formats without requiring massive computational power.
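In Hugging Face terms, the recipe looks roughly like this - a sketch with hypothetical hyperparameters (our actual training setup differs, and we frame it here as sequence classification for simplicity):

import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit precision to keep memory needs low (quantization)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.2-3B",  # the 3B variant; the 3.1 8B model works the same way
    num_labels=4,               # HIT, MISS, PARTIAL_HIT, INCONCLUSIVE
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

# Train only small low-rank adapter matrices instead of all the weights (LoRA)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model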
The proof in the pudding: our performance metrics
We're not just making claims; we have the numbers to back them up.
As we mentioned earlier, our classifier achieved a 92% weighted F1 score, compared to the 75% of our previous methods in the Website Scanner.
Its INCONCLUSIVE class also lets our scanners bring in tools like a headless Chrome browser for a more accurate classification - a capability that was almost impossible with purely static analysis.
Seeing beyond the code: qualitative insights
While the numbers are compelling, our analysis went deeper. We discovered edge cases that traditional, rule-based systems could only dream of handling.
Speaking your language (literally)
One of the most impressive capabilities of our ML model is its multilingual understanding. It can recognize patterns indicative of a "Not Found" page in numerous languages – something that would require an ever-growing and complex set of specific rules in a traditional system.
Our model correctly identified many such pages in the wild - including examples it had never seen during training.
This ability to understand context beyond specific keywords in English is a significant leap in handling the diverse landscape of the modern web.
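For contrast, here's what a rule-based system has to maintain just to keep up - an ever-growing, hand-curated pattern table (the translations below are our illustration, not our actual rules):

import re

# One entry per language, per phrasing variant - and it still breaks the
# first time a site phrases "not found" in a way nobody anticipated.
NOT_FOUND_PATTERNS = {
    "en": [r"not found", r"no longer exists"],
    "fr": [r"page non trouvée", r"introuvable"],
    "de": [r"seite nicht gefunden"],
    "es": [r"página no encontrada"],
    "pt": [r"página não encontrada"],
    "ja": [r"ページが見つかりません"],
}

def looks_like_not_found(text: str) -> bool:
    return any(
        re.search(pattern, text, re.IGNORECASE)
        for patterns in NOT_FOUND_PATTERNS.values()
        for pattern in patterns
    )

The fine-tuned model generalizes instead of enumerating - no table to maintain, no language left out.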
Decoding the "soft" signals
"Not Found" doesn't always come in a standard 404 package. Sometimes it's a "200 OK" with a JSON body saying "error," or a blank page with an image.
Our model has learned to recognize a very large number of these subtle cues, the "soft 404s" that often slip past traditional scanners. Here is a small sample of real-life examples we discovered while building this ML classifier:
500 - Internal Error
response_code=500, response_content='{"status":false,"error":"Internal Server Error","message":"This page dosent exist!"}'
No content, just an image
… <img alt="File has been removed" src="404-remove.png"/> …
400 - Error Code
response_code=400, response_content='"No such file"'
300 - Multiple Choices
<html><head><title>300 Multiple Choices</title></head><body><h1>Multiple Choices</h1>The document name you requested (<code>/.eslintrc</code>) could not be found on this server.However, we found documents with names similar to the one you requested.
Invalid file name
response_code=200, response_content='file=api/v1/seriesInvalid filename!'
Oops
response_code=200, response_content='Ooops.'
The famous “I’m a teapot” response
We also found a very nice easter egg: servers returning HTTP 418 "I'm a teapot" - the status code from the April Fools' RFCs of 1998 and 2014.
This ability to understand the intent behind the response, regardless of the status code and phrasing, drastically improves the accuracy of our fuzzing efforts.
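To tie it together, here's how the "Ooops." example above flows through classification - a hedged sketch that reuses the Label enum and preprocess() from earlier, with run_model() standing in for the call into the fine-tuned classifier (hypothetical, not a public API):

def classify(status_code: int, html: str) -> Label:
    # The model sees the normalized body alongside the status code, so a
    # friendly "200 OK" is still judged by what the page actually says.
    text = preprocess(html)
    return run_model(f"status={status_code}\n{text}")  # hypothetical model call

label = classify(200, "Ooops.")
assert label == Label.MISS  # a soft 404, despite the 200 status code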
The verdict: smarter scanning for a saner security workflow
Through this deep dive, we've shared not just what we achieved, but also a glimpse into how we tackled the persistent problem of noisy fuzzing results.
Along the way, we learned some valuable lessons that we believe are worth sharing.
The power of proof
We'll admit, even we were a little skeptical about the real-world impact of applying Machine Learning to this challenge. Seeing a tangible 50% reduction in false positives in our Website Scanner wasn't just a metric – it was a powerful validation.
It proved that with the right approach, significant, quantifiable improvements in detection accuracy are not just possible, but achievable.
Leveraging existing brilliance
Reinventing the wheel rarely makes sense.
Fine-tuning existing, powerful models like LLaMA gave us a significant head start, especially when it came to understanding the nuances of multiple languages and providing context based on security knowledge right out of the box. This saved us considerable time and resources compared to building something entirely from scratch.
The modern web demands more
We also gained a deeper appreciation for the complexities of modern web applications. The rise of JavaScript-heavy sites and Single-Page Applications (SPAs) means that static analysis alone often falls short. This realization led to the crucial "INCONCLUSIVE" classification, guiding our scanners to employ browser-based rendering when necessary for accurate analysis – a capability we believe is essential for truly effective fuzzing today.
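In practice, that fallback can be as simple as this sketch (we use Selenium's headless Chrome here for illustration; our scanners' internals differ):

from selenium import webdriver

def render_with_chrome(url: str) -> str:
    # Only invoked when the classifier returns INCONCLUSIVE, so the
    # expensive browser rendering runs for a small fraction of responses.
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source  # the DOM after JavaScript has executed
    finally:
        driver.quit()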
Why Machine Learning (and not AI)
While security automation has moved beyond a nice-to-have to become a fundamental necessity, we believe in handling it with the security mindset’s sharpest weapon: our critical thinking.
Since you're probably wondering why we haven't talked about AI so far, let's address that.
Just like you, we’re also saturated with often-overhyped promises of general AI. And so, we intentionally avoid this term because AI can often mask a lack of real substance - and that’s not us.
We're not chasing hype; we're building solutions that deliver real accuracy.
Our approach to automation at Pentest-Tools.com is grounded in genuine engineering.
We prefer to focus on rigorously trained Machine Learning models that deliver demonstrable results - because automation without precision creates more work, not less.
This journey to 92% accuracy isn't the finish line, but a significant milestone in our relentless pursuit of higher quality and more actionable security insights for those on the front lines.
Ready to experience the difference that smarter web vulnerability scanning can make?
Take Pentest-Tools.com for a spin and see our ML classifier in action!
We think you'll like what you find (and what you don't!).