Question:

How to find out how many pages a website has?


I've been given a website to translate, so I'd like to know how many pages it contains. I mean only the HTML files.

I've tried Xenu's Link Sleuth and DRKSpider, but I can't figure out how to read their reports, and I don't think they're giving me the page count I need. Please point me to any other method or application. Please help.

Thanks in advance. God bless you.


2 ANSWERS


  1. The problem here is lack of information.  You simply can't definitively know exactly how many HTML files they have unless you have read access to the filesystem their webserver is using.  Once you've got that, there are several approaches.  If it's a UNIX or Linux based file system, and you have login access, you can run this in their base webserver directory (you may need to run it in multiple locations depending on their setup):

    find . -type f | grep -i '\.html$' | wc -l

    You can also count them manually with "find .", or over FTP, but either way you'll obviously need some sort of access.

    Without access, you're just guessing.  And there are a lot of ways to guess very well, but most of them depend on things like:

    1) The site having links to all its HTML pages

    2) People visiting the entire range of pages

    3) Pages being static (dynamic pages can be difficult to track, depending)

    If they give you the HTTP logs, you can look through them to find unique requests.  With the default Apache log format and Linux/UNIX, you can do something like this:

    cut -d " " -f 7 access_log | grep -i '\.html$' | sort | uniq | wc -l

    If they've got Webalizer installed (or some other sort of HTTP log analysis), it probably has exactly the sorts of things you're looking for, although we'd need to know what they were using to generate the data.

    And you could always set up some sort of spider to crawl their site, too, but that requires some knowledge about how their site is set up and where to start.  And what platforms you have access to.
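    If you do go the spider route, wget can do that without saving anything to disk. This is only a sketch: example.com is a placeholder for the real site, and the count is only as good as the site's internal linking (condition 1 above).

```shell
# Crawl the site recursively without downloading files (--spider),
# logging every URL wget touches. example.com is a placeholder.
wget --spider --recursive --level=inf --no-verbose \
     --output-file=spider.log http://example.com/

# Pull the URLs out of the log, de-duplicate, and count the HTML ones.
grep -oE 'https?://[^ ]+\.html' spider.log | sort -u | wc -l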

    But any way you slice it, these last few techniques aren't fool-proof.  They may have some web pages that simply aren't linked to directly on their site, or aren't frequently visited (thereby not showing up in recent logs).  For any kind of definitive answer, you need to get at least read access to their webserver.

    DaveE


  2. FTP to the site and count the html files.
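    As a sketch of that, assuming the site allows FTP logins and you have lftp installed (the hostname and credentials below are placeholders), lftp's recursive "find" command lists every file on the server, so you can count the HTML ones without clicking through directories:

```shell
# Placeholders: substitute the real host, user, and password.
# lftp's built-in `find` recursively lists every path on the server;
# grep -ci counts the case-insensitive .html matches.
lftp -u user,password -e 'find; quit' ftp.example.com | grep -ci '\.html$'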

Question Stats

This question has 2 answers.
