
Fixed an issue in the lustre input plugin with jobstats only being reported for one job per storage target #5771

Merged
merged 1 commit into influxdata:master on May 6, 2019

Conversation

frroberts
Copy link
Contributor

The current Lustre input plugin will only report jobstats for one job per storage target. This can be tested by running the included tests with the current version of the input plugin.

With this fix all jobs active on each storage target are recorded.
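The idea of the fix can be illustrated with a minimal sketch (not the actual plugin code): scan every "- job_id:" entry in a job_stats dump and report each one, instead of stopping after the first match. The dump below follows the standard Lustre job_stats text layout; the helper name `jobIDs` is illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// jobIDs collects every job_id from a Lustre job_stats dump. The bug this
// PR fixes amounted to returning only the first such entry per target.
func jobIDs(dump string) []string {
	var ids []string
	for _, line := range strings.Split(dump, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, "- job_id:") {
			ids = append(ids, strings.TrimSpace(strings.TrimPrefix(trimmed, "- job_id:")))
		}
	}
	return ids
}

func main() {
	dump := "job_stats:\n" +
		"- job_id:          job_a\n" +
		"  write_bytes:     { samples: 1, unit: bytes, sum: 4096 }\n" +
		"- job_id:          job_b\n" +
		"  write_bytes:     { samples: 2, unit: bytes, sum: 8192 }"
	fmt.Println(jobIDs(dump)) // [job_a job_b] — both jobs, not just the first
}
```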

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

@danielnelson danielnelson added this to the 1.10.4 milestone Apr 27, 2019
@danielnelson danielnelson added the fix pr to fix corresponding bug label Apr 27, 2019
@danielnelson danielnelson modified the milestones: 1.10.4, 1.11.0 Apr 27, 2019
@danielnelson
Copy link
Contributor

I don't see your GitHub username as having signed the CLA; could you do it again?

One more favor, if possible: I'd like to add a README for this plugin. Would you be able to run:

telegraf --input-filter lustre2 --test

@frroberts
Copy link
Contributor Author

I should have a CCLA through my employer, CSC - IT Center for Science LTD

And I ran the command on both a metadata and an object storage server in our active Lustre system. The outputs are quite long (~1500 and ~800 lines). Do you want all of it, or should I create a representative subset?

@danielnelson
Copy link
Contributor

Just a representative subset will be good. On the CLA, I might need you to sign the ICLA as well, but let me get back to you on that.

@frroberts
Copy link
Contributor Author

Attached are representative outputs from one metadata (mds) and one object (oss) storage server.

subset_mds.txt
subset_oss.txt

@danielnelson danielnelson merged commit 8abf8c1 into influxdata:master May 6, 2019
hwaastad pushed a commit to hwaastad/telegraf that referenced this pull request Jun 13, 2019
@shawnahall71
Copy link

This change seems to have broken the ability to specify which proc files to process. In my case, I would like to disable the job_stats proc files and only process the others. Here's the relevant portion of my telegraf configuration:

# Read metrics from local Lustre service on OST, MDS
[[inputs.lustre2]]
  # An array of /proc globs to search for Lustre stats
  # If not specified, the default will work on Lustre 2.5.x
  ost_procfiles = [
    "/proc/fs/lustre/obdfilter/*/stats",
    "/proc/fs/lustre/osd-ldiskfs/*/stats",
    # "/proc/fs/lustre/obdfilter/*/job_stats",
  ]
  mds_procfiles = [
    "/proc/fs/lustre/mdt/*/md_stats",
    # "/proc/fs/lustre/mdt/*/job_stats",
  ]

When I use this config with Telegraf 1.11.2 (which includes this merged pull request), uncommenting the default inputs.lustre2 configuration (even without commenting out job_stats) produces an error:

# telegraf -config /etc/telegraf/telegraf.conf.test -test 
2019-07-10T12:52:08Z I! Starting Telegraf 1.11.2
2019-07-10T12:52:08Z E! [telegraf] Error running agent: Error parsing /etc/telegraf/telegraf.conf.test, line 3009: field corresponding to `Ost_procfiles' is not defined in lustre2.Lustre2

When I revert to Telegraf 1.10.4, which predates this merge, it works as expected:

# telegraf -config /etc/telegraf/telegraf.conf.test -test
2019-07-10T13:01:29Z I! Starting Telegraf 1.10.4
> cpu,cpu=cpu0,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=98.00000004470348,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=1.999999996041879,usage_user=0 1562763690000000000
> cpu,cpu=cpu1,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=98.0392157292421,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=1.9607843164109648,usage_user=0 1562763690000000000
> cpu,cpu=cpu2,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1562763690000000000
> cpu,cpu=cpu3,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=98.0392157292421,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=1.9607843164109648,usage_user=0 1562763690000000000
> cpu,cpu=cpu4,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=98.0392157292421,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=1.9607843164109648,usage_user=0 1562763690000000000
> cpu,cpu=cpu5,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=94.1176471662425,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=1.9607843214322427,usage_steal=0,usage_system=1.9607843178655968,usage_user=1.9607843214322427 1562763690000000000
> cpu,cpu=cpu6,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=96.0784314584842,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=1.9607843164109648,usage_user=1.9607843149843065 1562763690000000000
> cpu,cpu=cpu7,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1562763690000000000
> cpu,cpu=cpu8,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1562763690000000000
> cpu,cpu=cpu9,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1562763690000000000
> cpu,cpu=cpu10,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1562763690000000000
> cpu,cpu=cpu11,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=98.0000000372529,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=2.0000000034924597,usage_user=0 1562763690000000000
> cpu,cpu=cpu12,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1562763690000000000
> cpu,cpu=cpu13,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1562763690000000000
> cpu,cpu=cpu14,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=98.0392157292421,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=1.9607843164109648,usage_user=0 1562763690000000000
> cpu,cpu=cpu15,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1562763690000000000
> cpu,cpu=cpu-total,host=hpclbo00 usage_guest=0,usage_guest_nice=0,usage_idle=98.87359194657252,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0.876095115754831,usage_user=0.2503128910482447 1562763690000000000
> mem,host=hpclbo00 active=62780149760i,available=121160527872i,available_percent=89.99860804344,buffered=312258560i,cached=125641707520i,commit_limit=84458754048i,committed_as=5131505664i,dirty=217088i,free=1769193472i,high_free=0i,high_total=0i,huge_page_size=2097152i,huge_pages_free=0i,huge_pages_total=0i,inactive=61960777728i,low_free=0i,low_total=0i,mapped=366833664i,page_tables=8921088i,shared=2399195136i,slab=3043733504i,swap_cached=4153344i,swap_free=16910118912i,swap_total=17146310656i,total=134624890880i,used=6901731328i,used_percent=5.126638382312203,vmalloc_chunk=35113034797056i,vmalloc_total=35184372087808i,vmalloc_used=2496950272i,wired=0i,write_back=0i,write_back_tmp=0i 1562763690000000000
> swap,host=hpclbo00 free=16910118912i,total=17146310656i,used=236191744i,used_percent=1.3775076676179872 1562763690000000000
> swap,host=hpclbo00 in=5046272i,out=234377216i 1562763690000000000
> system,host=hpclbo00 load1=0.64,load15=1.4,load5=1.08,n_cpus=16i,n_users=1i 1562763690000000000
> system,host=hpclbo00 uptime=9459500i 1562763690000000000
> system,host=hpclbo00 uptime_format="109 days, 11:38" 1562763690000000000
> lustre2,host=hpclbo00,name=lustreb-OST0002 cache_access=123120213i,cache_hit=25947190i,cache_miss=97559888i,read_bytes=134929587335168i,read_calls=123616552i,write_bytes=89420750561807i,write_calls=24129027i 1562763690000000000
> lustre2,host=hpclbo00,name=lustreb-OST0003 cache_access=118437662i,cache_hit=14497478i,cache_miss=104385527i,read_bytes=131436306927616i,read_calls=119764309i,write_bytes=90039578607203i,write_calls=23623436i 1562763690000000000
> lustre2,host=hpclbo00,name=lustreb-OST0004 cache_access=115649287i,cache_hit=23159646i,cache_miss=92797258i,read_bytes=125015997210624i,read_calls=116213700i,write_bytes=77198517324719i,write_calls=20875544i 1562763690000000000
> lustre2,host=hpclbo00,name=lustreb-OST0005 cache_access=114102041i,cache_hit=23505459i,cache_miss=91084552i,read_bytes=121714593878016i,read_calls=115034217i,write_bytes=82191539574894i,write_calls=21949600i 1562763690000000000
> lustre2,host=hpclbo00,name=lustreb-OST0006 cache_access=119626413i,cache_hit=19527210i,cache_miss=100433697i,read_bytes=136805938708480i,read_calls=120691366i,write_bytes=87532199845233i,write_calls=23370151i 1562763690000000000
> lustre2,host=hpclbo00,name=lustreb-OST0000 cache_access=138117149i,cache_hit=40481815i,cache_miss=97947712i,read_bytes=149839448764416i,read_calls=138522648i,write_bytes=86155537532047i,write_calls=22975409i 1562763690000000000
> lustre2,host=hpclbo00,name=lustreb-OST0001 cache_access=125596425i,cache_hit=25840974i,cache_miss=100043275i,read_bytes=135576462835712i,read_calls=126334658i,write_bytes=80814717739044i,write_calls=21773289i 1562763690000000000

Let me know if I should open an issue instead of commenting on this PR.

@danielnelson
Copy link
Contributor

@shawnahall71 This is probably related to the changes we made to TOML parsing, similar to #5980. Could you open a new issue for this? We will address it in 1.11.3.
