
Identify the key properties of a web crawler describe in

Use Crawler Java Assignment

Review, fix and run the crawler.

Add code for the additional requirements.

Make sure your crawler does the following.

Test your crawler only on the data in:


Make sure that your crawler is not allowed to get out of this directory!!! Yes, there is a robots.txt file that must be used. Note that it is in a non-standard location.
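One way to enforce both constraints is to run every candidate URL through a single scope check before it enters the frontier. The sketch below is only an illustration: the class name `CrawlScope`, the example root `http://example.org/course/data`, and the sample robots.txt text are all assumptions, since the real directory and robots.txt location are not given here. As a simplification it honors only `Disallow` rules listed under `User-agent: *`, and it expects the caller to fetch the robots.txt text from wherever the file actually lives.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CrawlScope {
    private final URI root;                                   // directory the crawler must stay inside
    private final List<String> disallowed = new ArrayList<>(); // path prefixes from robots.txt

    /** rootDir is an absolute URL (hypothetical here); robotsTxt is the already-downloaded file text. */
    public CrawlScope(String rootDir, String robotsTxt) {
        // Force a trailing '/', so that "/data" does not also match "/database".
        String r = rootDir.endsWith("/") ? rootDir : rootDir + "/";
        this.root = URI.create(r).normalize();
        String agent = null;
        for (String line : robotsTxt.split("\\R")) {
            String s = line.replaceAll("#.*", "").trim();      // drop comments
            int colon = s.indexOf(':');
            if (colon < 0) continue;
            String key = s.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = s.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                agent = value;
            } else if (key.equals("disallow") && "*".equals(agent) && !value.isEmpty()) {
                disallowed.add(value);                          // simplification: "*" groups only
            }
        }
    }

    /** True if the URL is inside the root directory and not excluded by robots.txt. */
    public boolean allows(String url) {
        URI u;
        try {
            u = URI.create(url).normalize();                    // resolves "." and ".." segments
        } catch (IllegalArgumentException e) {
            return false;                                       // malformed URL: treat as out of scope
        }
        if (u.getScheme() == null || u.getHost() == null || u.getPath() == null) return false;
        if (!u.getScheme().equalsIgnoreCase(root.getScheme())) return false;
        if (!u.getHost().equalsIgnoreCase(root.getHost())) return false;
        if (!u.getPath().startsWith(root.getPath())) return false;
        for (String prefix : disallowed) {
            if (u.getPath().startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        CrawlScope scope = new CrawlScope("http://example.org/course/data",
                "User-agent: *\nDisallow: /course/data/private/\n");
        System.out.println(scope.allows("http://example.org/course/data/index.html"));
        System.out.println(scope.allows("http://example.org/course/data/../other/x.html"));
    }
}
```

Normalizing before the prefix comparison matters: a link such as `../other/x.html` would otherwise pass a plain string-prefix test while actually pointing outside the directory.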

The required inputs to your program are N, the limit on the number of pages to retrieve, and a list of stop words (of your choosing) to exclude.

Perform case-insensitive matching.
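A minimal sketch of stop-word exclusion with case-insensitive matching follows. The class name `StopWordFilter` and the sample words are hypothetical, and the sketch assumes the page text has already been split into tokens. Both the stop words and the tokens are lower-cased with `Locale.ROOT`, so the comparison does not depend on the machine's default locale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordFilter {
    private final Set<String> stopWords = new HashSet<>();

    public StopWordFilter(Collection<String> words) {
        for (String w : words) {
            stopWords.add(normalize(w));     // store stop words in normalized form
        }
    }

    /** Case folding used for every comparison. */
    static String normalize(String word) {
        return word.toLowerCase(Locale.ROOT);
    }

    /** Returns the normalized tokens that are not stop words, in their original order. */
    public List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            String n = normalize(t);
            if (!n.isEmpty() && !stopWords.contains(n)) {
                kept.add(n);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        StopWordFilter f = new StopWordFilter(Arrays.asList("The", "and"));
        System.out.println(f.filter(Arrays.asList("THE", "Crawler", "And", "robots")));
    }
}
```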

You can assume that there are no errors in the input. Your code should be robust to errors in the Web pages you are searching. If an error is encountered, you may simply skip the page where it occurs.
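The page limit and the skip-on-error rule can be combined in one crawl loop, sketched below under stated assumptions. `BoundedCrawl` and the `LinkFetcher` interface are hypothetical names, with `LinkFetcher` standing in for whatever download-and-parse code the provided crawler uses. The sketch also assumes that a page which fails to load does not count toward N; the assignment text does not settle that point.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BoundedCrawl {
    /** Hypothetical hook: returns a page's outgoing links, or throws if the page is unusable. */
    public interface LinkFetcher {
        List<String> fetchLinks(String url) throws Exception;
    }

    /** Breadth-first crawl from seed that retrieves at most maxPages pages. */
    public static List<String> crawl(String seed, int maxPages, LinkFetcher fetcher) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();          // URLs already queued, to avoid revisits
        List<String> retrieved = new ArrayList<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && retrieved.size() < maxPages) {
            String url = frontier.poll();
            List<String> links;
            try {
                links = fetcher.fetchLinks(url);
            } catch (Exception e) {
                // Robustness rule: report the problem and move on to the next page.
                System.err.println("Skipping " + url + ": " + e.getMessage());
                continue;
            }
            retrieved.add(url);                      // only successful pages count toward the limit
            if (links == null) continue;             // treat a missing link list as "no links"
            for (String link : links) {
                if (seen.add(link)) {                // add() is false if the URL was already queued
                    frontier.add(link);
                }
            }
        }
        return retrieved;
    }
}
```

Passing the fetcher in as an interface keeps the loop testable without network access: a test can supply a map-backed fetcher that throws for selected URLs.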

1. Identify the key properties of a web crawler. Describe in detail how each of these properties is implemented in your code.

2. Use your crawler to list the URLs of all pages in the test data and report all outgoing links of the test data. [10 points] Display the contents of the tag</p>
<p style="text-align: justify;">3. Implement duplicate detection, and report any URLs that refer to already-seen content.</p>
<p style="text-align: justify;">4. Use your crawler to list all broken links within the test data.</p>
<p style="text-align: justify;">5. How many graphic files are included in the test data?</p>
<p style="text-align: justify;">6. Have your crawler save the words from each page of type .txt, .htm, or .html. Make sure that you do not save HTML markup. Explain your definition of "word". In this process, give each page a unique document ID.</p>
<p style="text-align: justify;">Implement stemming.</p>
<p style="text-align: justify;">7. Report the 20 most common words, each with its document frequency. Are these words or stemmed words?</p>
<p><strong>Attachment:-</strong> <a href="https://secure.tutorsglobe.com/Atten_files/409_crawler_project.zip" target="_blank">crawler_project.zip</a></p></p> </div>
</div> </div> </div> </div> </div> </div>
<footer class="site-footer"> <div class="container"> <p>©TutorsGlobe All rights reserved 2022-2023.</p> </div> </footer>
</div> <!-- /pageWrap -->
</form> </body> </html>