Как реализовать сито из Eratosthenes с помощью многопоточного С#?

Я пытаюсь реализовать Сито Эратосфена, используя Mutithreading. Вот моя реализация:

Это займет 30 секунд, чтобы произвести выход, есть ли способ ускорить это?

Ответы

Ответ 1

Отредактировано:

Мой ответ на вопрос: да, вы можете определенно использовать параллельную библиотеку задач (TPL), чтобы найти простые числа на один миллиард быстрее. Данный код в вопросе медленный, потому что он неэффективно использует память или многопроцессорность, а конечный результат также неэффективен.

Итак, , а не только многопроцессорная обработка, существует огромное количество вещей, которые вы можете сделать, чтобы ускорить сито Эратосфенеза, следующим образом:

Вы просеиваете все числа, даже и нечетные, которые используют больше памяти (один миллиард байт для вашего диапазона в один миллиард) и медленнее из-за к ненужной обработке. Просто используя тот факт, что два только четное простое, поэтому матрица представляет только нечетные простые числа. половина требований к памяти и уменьшить количество составных число отбирает операции более чем в два раза, так что операция может занять около 20 секунд на вашей машине для простых чисел млрд.
Часть причины, что составное число, отбирающее такой огромный массив памяти настолько медленный, что он значительно превышает кеш процессора так, чтобы многие обращения к памяти были в основной памяти в несколько случайный способ означает, что отбраковка заданного составного числа представление может принимать более ста циклов ЦП, тогда как если все они были в кеше L1, это займет всего один цикл и в кэш L2 - всего около четырех циклов; не все обращения к худшему времена, но это определенно замедляет обработку. Немного упакованный массив для представления основных кандидатов уменьшит использование памяти в восемь раз и сделать наихудший доступ менее общий. Хотя на доступ к вычислительным ресурсам отдельные биты, вы увидите, что есть чистая прибыль, поскольку время экономия в сокращении среднего времени доступа к памяти будет больше, чем это расходы. Простым способом реализации этого является использование BitArray а не массив bool. Написание собственных битовых обращений с использованием сдвиг и "и" будут более эффективными, чем использование Класс BitArray. Вы найдете небольшую экономию, используя BitArray и другой фактор из двух, выполняющих собственные битовые операции для одного с точностью до десяти или двенадцати секунд это изменение.
Ваш вывод количества найденных простых чисел не очень эффективен, поскольку он требуется доступ к массиву и условие if для каждого кандидата. После того, как у вас есть ситовый буфер в виде массива массивных пакетов бит, вы можете сделать это намного эффективнее с подсчетом Look Up Таблица (LUT), которая устраняет условие if и требует только двух массив получает доступ к за бит упакованного слова. Делая это, время счет становится незначительной частью работы по сравнению со временем отбирать составные числа, для дальнейшей экономии, чтобы возможно, восемь секунд для подсчета простых чисел до одного миллиарда.
Дальнейшее сокращение числа обработанных первичных кандидатов может быть результатом применения факторизации колес, который удаляет коэффициенты простых чисел 2, 3 и 5 от обработки и корректировка метода упаковки битов также может повысить эффективность диапазон бит буфера заданного размера в два раза больше. Это может уменьшить количество операций отбраковки композитных чисел с помощью еще один огромный фактор - более трех раз, хотя ценой дополнительной вычислительной сложности.
В целях дальнейшего сокращения использования памяти, делая доступ к памяти даже более эффективны и готовят путь для многопроцессорной обработки на страницу сегмент, можно разделить работу на страницы, которые не больше чем размеры кеша L1 или L2. Это требует, чтобы один держал базу таблица простых чисел всех простых чисел до квадратного корня из максимума основного кандидата и пересчитывает начальные параметры адреса каждый из базовых простых чисел, используемых для отбраковки по заданному сегменту страницы, но это еще более эффективно, чем использование огромных массивов отбраковки. добавленная польза для реализации этой сегментации страниц заключается в том, что тогда не нужно заранее указывать верхний предел просеивания, но скорее, может просто расширить базовые простые числа по мере необходимости, страницы обрабатываются. Со всеми оптимизациями до этого момента, вы, вероятно, можете произвести подсчет простых чисел до одного миллиарда в около 2,5 секунд.
Наконец, можно добавить последние штрихи к многопроцессорной обработке страницы сегменты с использованием TPL или Threads, которые используют размер буфера около размер кеша L2 (на ядро) приведет к увеличению коэффициента усиления коэффициент два на двухъядерном не Hyper Threaded (HT) старше процессор как Intel E7500 Core2Duo для выполнения времени, чтобы найти количество простых чисел до одного миллиарда около 1,25 секунды или около того.

Я реализовал многопоточное сито Eratosthenes как ответ на другой поток, чтобы показать, что нет никакого преимущества для Сита Аткина над ситом Эратосфен. Он использует параллельную библиотеку задач (TPL), как в задачах и TaskFactory, поэтому для этого требуется, по крайней мере, DotNet Framework 4. Я еще раз подстроил этот код, используя все описанные выше оптимизации, как альтернативный ответ на тот же вопрос. Я повторно разместил этот измененный код здесь с добавленными комментариями и более легким для чтения форматированием следующим образом:

  using System;
  using System.Collections;
  using System.Collections.Generic;
  using System.Linq;
  using System.Threading;
  using System.Threading.Tasks;

  class UltimatePrimesSoE : IEnumerable<ulong> {
    #region const and static readonly field's, private struct and classes

    //one can get single threaded performance by setting NUMPRCSPCS = 1
    static readonly uint NUMPRCSPCS = (uint)Environment.ProcessorCount + 1;
    //the L1CACHEPOW can not be less than 14 and is usually the two raised to the power of the L1 or L2 cache
    const int L1CACHEPOW = 14, L1CACHESZ = (1 << L1CACHEPOW), MXPGSZ = L1CACHESZ / 2; //for buffer ushort[]
    const uint CHNKSZ = 17; //this times BWHLWRDS (below) times two should not be bigger than the L2 cache in bytes
    //the 2,3,57 factorial wheel increment pattern, (sum) 48 elements long, starting at prime 19 position
    static readonly byte[] WHLPTRN = { 2,3,1,3,2,1,2,3,3,1,3,2,1,3,2,3,4,2,1,2,1,2,4,3,
                                       2,3,1,2,3,1,3,3,2,1,2,3,1,3,2,1,2,1,5,1,5,1,2,1 }; const uint FSTCP = 11;
    static readonly byte[] WHLPOS; static readonly byte[] WHLNDX; //look up wheel position from index and vice versa
    static readonly byte[] WHLRNDUP; //to look up wheel rounded up index positon values, allow for overflow in size
    static readonly uint WCRC = WHLPTRN.Aggregate(0u, (acc, n) => acc + n); //small wheel circumference for odd numbers
    static readonly uint WHTS = (uint)WHLPTRN.Length; static readonly uint WPC = WHTS >> 4; //number of wheel candidates
    static readonly byte[] BWHLPRMS = { 2,3,5,7,11,13,17 }; const uint FSTBP = 19; //big wheel primes, following prime
    //the big wheel circumference expressed in number of 16 bit words as in a minimum bit buffer size
    static readonly uint BWHLWRDS = BWHLPRMS.Aggregate(1u, (acc, p) => acc * p) / 2 / WCRC * WHTS / 16;
    //page size and range as developed from the above
    static readonly uint PGSZ = MXPGSZ / BWHLWRDS * BWHLWRDS; static readonly uint PGRNG = PGSZ * 16 / WHTS * WCRC;
    //buffer size (multiple chunks) as produced from the above
    static readonly uint BFSZ = CHNKSZ * PGSZ, BFRNG = CHNKSZ * PGRNG; //number of uints even number of caches in chunk
    static readonly ushort[] MCPY; //a Master Copy page used to hold the lower base primes preculled version of the page
    struct Wst { public ushort msk; public byte mlt; public byte xtr; public ushort nxt; }
    static readonly byte[] PRLUT; /*Wheel Index Look Up Table */ static readonly Wst[] WSLUT; //Wheel State Look Up Table
    static readonly byte[] CLUT; // a Counting Look Up Table for very fast counting of primes

    class Bpa { //very efficient auto-resizing thread-safe read-only indexer class to hold the base primes array
      byte[] sa = new byte[0]; uint lwi = 0, lpd = 0; object lck = new object();
      public uint this[uint i] {
        get {
          if (i >= this.sa.Length) lock (this.lck) {
              var lngth = this.sa.Length; while (i >= lngth) {
                var bf = (ushort[])MCPY.Clone(); if (lngth == 0) {
                  for (uint bi = 0, wi = 0, w = 0, msk = 0x8000, v = 0; w < bf.Length;
                      bi += WHLPTRN[wi++], wi = (wi >= WHTS) ? 0 : wi) {
                    if (msk >= 0x8000) { msk = 1; v = bf[w++]; } else msk <<= 1;
                    if ((v & msk) == 0) {
                      var p = FSTBP + (bi + bi); var k = (p * p - FSTBP) >> 1;
                      if (k >= PGRNG) break; var pd = p / WCRC; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
                      for (uint wrd = kd * WPC + (uint)(kn >> 4), ndx = wi * WHTS + kn; wrd < bf.Length; ) {
                        var st = WSLUT[ndx]; bf[wrd] |= st.msk; wrd += st.mlt * pd + st.xtr; ndx = st.nxt;
                      }
                    }
                  }
                }
                else { this.lwi += PGRNG; cullbf(this.lwi, bf); }
                var c = count(PGRNG, bf); var na = new byte[lngth + c]; sa.CopyTo(na, 0);
                for (uint p = FSTBP + (this.lwi << 1), wi = 0, w = 0, msk = 0x8000, v = 0;
                    lngth < na.Length; p += (uint)(WHLPTRN[wi++] << 1), wi = (wi >= WHTS) ? 0 : wi) {
                  if (msk >= 0x8000) { msk = 1; v = bf[w++]; } else msk <<= 1; if ((v & msk) == 0) {
                    var pd = p / WCRC; na[lngth++] = (byte)(((pd - this.lpd) << 6) + wi); this.lpd = pd;
                  }
                } this.sa = na;
              }
            } return this.sa[i];
        }
      }
    }
    static readonly Bpa baseprms = new Bpa(); //the base primes array using the above class

    struct PrcsSpc { public Task tsk; public ushort[] buf; } //used for multi-threading buffer array processing

    #endregion

    #region private static methods

    static int count(uint bitlim, ushort[] buf) { //very fast counting method using the CLUT look up table
      if (bitlim < BFRNG) {
        var addr = (bitlim - 1) / WCRC; var bit = WHLNDX[bitlim - addr * WCRC] - 1; addr *= WPC;
        for (var i = 0; i < 3; ++i) buf[addr++] |= (ushort)((unchecked((ulong)-2) << bit) >> (i << 4));
      }
      var acc = 0; for (uint i = 0, w = 0; i < bitlim; i += WCRC)
        acc += CLUT[buf[w++]] + CLUT[buf[w++]] + CLUT[buf[w++]]; return acc;
    }

    static void cullbf(ulong lwi, ushort[] b) { //fast buffer segment culling method using a Wheel State Look Up Table
      ulong nlwi = lwi;
      for (var i = 0u; i < b.Length; nlwi += PGRNG, i += PGSZ) MCPY.CopyTo(b, i); //copy preculled lower base primes.
      for (uint i = 0, pd = 0; ; ++i) {
        pd += (uint)baseprms[i] >> 6;
        var wi = baseprms[i] & 0x3Fu; var wp = (uint)WHLPOS[wi]; var p = pd * WCRC + PRLUT[wi];
        var k = ((ulong)p * (ulong)p - FSTBP) >> 1;
        if (k >= nlwi) break; if (k < lwi) {
          k = (lwi - k) % (WCRC * p);
          if (k != 0) {
            var nwp = wp + (uint)((k + p - 1) / p); k = (WHLRNDUP[nwp] - wp) * p - k;
          }
        }
        else k -= lwi; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
        for (uint wrd = (uint)kd * WPC + (uint)(kn >> 4), ndx = wi * WHTS + kn; wrd < b.Length; ) {
          var st = WSLUT[ndx]; b[wrd] |= st.msk; wrd += st.mlt * pd + st.xtr; ndx = st.nxt;
        }
      }
    }

    static Task cullbftsk(ulong lwi, ushort[] b, Action<ushort[]> f) { // forms a task of the cull buffer operaion
      return Task.Factory.StartNew(() => { cullbf(lwi, b); f(b); });
    }

    //iterates the action over each page up to the page including the top_number,
    //making an adjustment to the top limit for the last page.
    //this method works for non-dependent actions that can be executed in any order.
    static void IterateTo(ulong top_number, Action<ulong, uint, ushort[]> actn) {
      PrcsSpc[] ps = new PrcsSpc[NUMPRCSPCS]; for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s] = new PrcsSpc {
        buf = new ushort[BFSZ],
        tsk = Task.Factory.StartNew(() => { })
      };
      var topndx = (top_number - FSTBP) >> 1; for (ulong ndx = 0; ndx <= topndx; ) {
        ps[0].tsk.Wait(); var buf = ps[0].buf; for (var s = 0u; s < NUMPRCSPCS - 1; ++s) ps[s] = ps[s + 1];
        var lowi = ndx; var nxtndx = ndx + BFRNG; var lim = topndx < nxtndx ? (uint)(topndx - ndx + 1) : BFRNG;
        ps[NUMPRCSPCS - 1] = new PrcsSpc { buf = buf, tsk = cullbftsk(ndx, buf, (b) => actn(lowi, lim, b)) };
        ndx = nxtndx;
      } for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s].tsk.Wait();
    }

    //iterates the predicate over each page up to the page where the predicate paramenter returns true,
    //this method works for dependent operations that need to be executed in increasing order.
    //it is somewhat slower than the above as the predicate function is executed outside the task.
    static void IterateUntil(Func<ulong, ushort[], bool> prdct) {
      PrcsSpc[] ps = new PrcsSpc[NUMPRCSPCS];
      for (var s = 0u; s < NUMPRCSPCS; ++s) {
        var buf = new ushort[BFSZ];
        ps[s] = new PrcsSpc { buf = buf, tsk = cullbftsk(s * BFRNG, buf, (bfr) => { }) };
      }
      for (var ndx = 0UL; ; ndx += BFRNG) {
        ps[0].tsk.Wait(); var buf = ps[0].buf; var lowi = ndx; if (prdct(lowi, buf)) break;
        for (var s = 0u; s < NUMPRCSPCS - 1; ++s) ps[s] = ps[s + 1];
        ps[NUMPRCSPCS - 1] = new PrcsSpc {
          buf = buf,
          tsk = cullbftsk(ndx + NUMPRCSPCS * BFRNG, buf, (bfr) => { })
        };
      }
    }

    #endregion

    #region initialization

    /// <summary>
    /// the static constructor is used to initialize the static readonly arrays.
    /// </summary>
    static UltimatePrimesSoE() {
      WHLPOS = new byte[WHLPTRN.Length + 1]; //to look up wheel position index from wheel index
      for (byte i = 0, acc = 0; i < WHLPTRN.Length; ++i) { acc += WHLPTRN[i]; WHLPOS[i + 1] = acc; }
      WHLNDX = new byte[WCRC + 1]; for (byte i = 1; i < WHLPOS.Length; ++i) {
        for (byte j = (byte)(WHLPOS[i - 1] + 1); j <= WHLPOS[i]; ++j) WHLNDX[j] = i;
      }
      WHLRNDUP = new byte[WCRC * 2]; for (byte i = 1; i < WHLRNDUP.Length; ++i) {
        if (i > WCRC) WHLRNDUP[i] = (byte)(WCRC + WHLPOS[WHLNDX[i - WCRC]]); else WHLRNDUP[i] = WHLPOS[WHLNDX[i]];
      }
      Func<ushort, int> nmbts = (v) => { var acc = 0; while (v != 0) { acc += (int)v & 1; v >>= 1; } return acc; };
      CLUT = new byte[1 << 16]; for (var i = 0; i < CLUT.Length; ++i) CLUT[i] = (byte)nmbts((ushort)(i ^ -1));
      PRLUT = new byte[WHTS]; for (var i = 0; i < PRLUT.Length; ++i) {
        var t = (uint)(WHLPOS[i] * 2) + FSTBP; if (t >= WCRC) t -= WCRC; if (t >= WCRC) t -= WCRC; PRLUT[i] = (byte)t;
      }
      WSLUT = new Wst[WHTS * WHTS]; for (var x = 0u; x < WHTS; ++x) {
        var p = FSTBP + 2u * WHLPOS[x]; var pr = p % WCRC;
        for (uint y = 0, pos = (p * p - FSTBP) / 2; y < WHTS; ++y) {
          var m = WHLPTRN[(x + y) % WHTS];
          pos %= WCRC; var posn = WHLNDX[pos]; pos += m * pr; var nposd = pos / WCRC; var nposn = WHLNDX[pos - nposd * WCRC];
          WSLUT[x * WHTS + posn] = new Wst {
            msk = (ushort)(1 << (int)(posn & 0xF)),
            mlt = (byte)(m * WPC),
            xtr = (byte)(WPC * nposd + (nposn >> 4) - (posn >> 4)),
            nxt = (ushort)(WHTS * x + nposn)
          };
        }
      }
      MCPY = new ushort[PGSZ]; foreach (var lp in BWHLPRMS.SkipWhile(p => p < FSTCP)) {
        var p = (uint)lp;
        var k = (p * p - FSTBP) >> 1; var pd = p / WCRC; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
        for (uint w = kd * WPC + (uint)(kn >> 4), ndx = WHLNDX[(2 * WCRC + p - FSTBP) / 2] * WHTS + kn; w < MCPY.Length; ) {
          var st = WSLUT[ndx]; MCPY[w] |= st.msk; w += st.mlt * pd + st.xtr; ndx = st.nxt;
        }
      }
    }

    #endregion

    #region public class

    // this class implements the enumeration (IEnumerator).
    //    it works by farming out tasks culling pages, which it then processes in order by
    //    enumerating the found primes as recognized by the remaining non-composite bits
    //    in the cull page buffers.
    class nmrtr : IEnumerator<ulong>, IEnumerator, IDisposable {
      PrcsSpc[] ps = new PrcsSpc[NUMPRCSPCS]; ushort[] buf;
      public nmrtr() {
        for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s] = new PrcsSpc { buf = new ushort[BFSZ] };
        for (var s = 1u; s < NUMPRCSPCS; ++s) {
          ps[s].tsk = cullbftsk((s - 1u) * BFRNG, ps[s].buf, (bfr) => { });
        } buf = ps[0].buf;
      }
      ulong _curr, i = (ulong)-WHLPTRN[WHTS - 1]; int b = -BWHLPRMS.Length - 1; uint wi = WHTS - 1; ushort v, msk = 0;
      public ulong Current { get { return this._curr; } } object IEnumerator.Current { get { return this._curr; } }
      public bool MoveNext() {
        if (b < 0) {
          if (b == -1) b += buf.Length; //no yield!!! so automatically comes around again
          else { this._curr = (ulong)BWHLPRMS[BWHLPRMS.Length + (++b)]; return true; }
        }
        do {
          i += WHLPTRN[wi++]; if (wi >= WHTS) wi = 0; if ((this.msk <<= 1) == 0) {
            if (++b >= BFSZ) {
              b = 0; for (var prc = 0; prc < NUMPRCSPCS - 1; ++prc) ps[prc] = ps[prc + 1];
              ps[NUMPRCSPCS - 1u].buf = buf;
              ps[NUMPRCSPCS - 1u].tsk = cullbftsk(i + (NUMPRCSPCS - 1u) * BFRNG, buf, (bfr) => { });
              ps[0].tsk.Wait(); buf = ps[0].buf;
            } v = buf[b]; this.msk = 1;
          }
        }
        while ((v & msk) != 0u); _curr = FSTBP + i + i; return true;
      }
      public void Reset() { throw new Exception("Primes enumeration reset not implemented!!!"); }
      public void Dispose() { }
    }

    #endregion

    #region public instance method and associated sub private method

    /// <summary>
    /// Gets the enumerator for the primes.
    /// </summary>
    /// <returns>The enumerator of the primes.</returns>
    public IEnumerator<ulong> GetEnumerator() { return new nmrtr(); }

    /// <summary>
    /// Gets the enumerator for the primes.
    /// </summary>
    /// <returns>The enumerator of the primes.</returns>
    IEnumerator IEnumerable.GetEnumerator() { return new nmrtr(); }

    #endregion

    #region public static methods

    /// <summary>
    /// Gets the count of primes up the number, inclusively.
    /// </summary>
    /// <param name="top_number">The ulong top number to check for prime.</param>
    /// <returns>The long number of primes found.</returns>
    public static long CountTo(ulong top_number) {
      if (top_number < FSTBP) return BWHLPRMS.TakeWhile(p => p <= top_number).Count();
      var cnt = (long)BWHLPRMS.Length;
      IterateTo(top_number, (lowi, lim, b) => { Interlocked.Add(ref cnt, count(lim, b)); }); return cnt;
    }

    /// <summary>
    /// Gets the sum of the primes up the number, inclusively.
    /// </summary>
    /// <param name="top_number">The uint top number to check for prime.</param>
    /// <returns>The ulong sum of all the primes found.</returns>
    public static ulong SumTo(uint top_number) {
      if (top_number < FSTBP) return (ulong)BWHLPRMS.TakeWhile(p => p <= top_number).Aggregate(0u, (acc, p) => acc += p);
      var sum = (long)BWHLPRMS.Aggregate(0u, (acc, p) => acc += p);
      Func<ulong, uint, ushort[], long> sumbf = (lowi, bitlim, buf) => {
        var acc = 0L; for (uint i = 0, wi = 0, msk = 0x8000, w = 0, v = 0; i < bitlim;
            i += WHLPTRN[wi++], wi = wi >= WHTS ? 0 : wi) {
          if (msk >= 0x8000) { msk = 1; v = buf[w++]; } else msk <<= 1;
          if ((v & msk) == 0) acc += (long)(FSTBP + ((lowi + i) << 1));
        } return acc;
      };
      IterateTo(top_number, (pos, lim, b) => { Interlocked.Add(ref sum, sumbf(pos, lim, b)); }); return (ulong)sum;
    }

    /// <summary>
    /// Gets the prime number at the zero based index number given.
    /// </summary>
    /// <param name="index">The long zero-based index number for the prime.</param>
    /// <returns>The ulong prime found at the given index.</returns>
    public static ulong ElementAt(long index) {
      if (index < BWHLPRMS.Length) return (ulong)BWHLPRMS.ElementAt((int)index);
      long cnt = BWHLPRMS.Length; var ndx = 0UL; var cycl = 0u; var bit = 0u; IterateUntil((lwi, bfr) => {
        var c = count(BFRNG, bfr); if ((cnt += c) < index) return false; ndx = lwi; cnt -= c; c = 0;
        do { var w = cycl++ * WPC; c = CLUT[bfr[w++]] + CLUT[bfr[w++]] + CLUT[bfr[w]]; cnt += c; } while (cnt < index);
        cnt -= c; var y = (--cycl) * WPC; ulong v = ((ulong)bfr[y + 2] << 32) + ((ulong)bfr[y + 1] << 16) + bfr[y];
        do { if ((v & (1UL << ((int)bit++))) == 0) ++cnt; } while (cnt <= index); --bit; return true;
      }); return FSTBP + ((ndx + cycl * WCRC + WHLPOS[bit]) << 1);
    }

    #endregion
  }

Вышеприведенный код будет перечислять простые цифры до одного миллиарда примерно на 1,55 секунды на четырех ядрах (восемь потоков, включая HT) i7-2700K (3,5 ГГц), а ваш E7500 будет, возможно, до четырех раза медленнее из-за меньшего количества потоков и немного меньше тактовой частоты. Примерно три четверти того времени - это просто время, чтобы запустить метод перечисления MoveNext() и текущее свойство, поэтому я предоставляю публичные статические методы "CountTo", "SumTo" и "ElementAt" для вычисления числа или суммы простых чисел в диапазон и n-й нулевой базис, соответственно, без использования перечисления. Использование статического метода UltimatePrimesSoE.CountTo(1000000000) дает 50847534 примерно через 0,32 секунды на моей машине, поэтому на Intel E7500 не должно занимать больше времени примерно 1,28 секунды.

EDIT_ADD: Интересно, что этот код работает на 30% быстрее в 32-разрядном режиме x86, чем в 64-битном режиме 64, вероятно, из-за того, что он избегает незначительных дополнительных накладных расходов, расширяя номера uint32 до ulong. Все приведенные выше тайминги предназначены для 64-битного режима. END_EDIT_ADD

В почти 300 (плотных) строках кода эта реализация непростая, но стоимость выполнения всех описанных оптимизаций делает этот код настолько эффективным. Еще не так много других строк кода, что другой ответ Аарона Мургатроида; хотя его код менее плотный, его код также примерно в четыре раза медленнее. Фактически, почти все время выполнения расходуется в финальном "цикле" моего частного частного метода "cullbf" моего кода, который содержит только четыре оператора плюс плюс проверка состояния диапазона; весь остальной код просто поддерживает повторные приложения этого цикла.

Причины, по которым этот код быстрее, чем другой ответ, по тем же причинам, что этот код быстрее, чем ваш код, кроме того, что он оптимизирует Step (1) только для обработки нечетных первичных кандидатов. Его использование многопроцессорности почти полностью неэффективно, поскольку только в 30% -ном преимуществе, а не в четырехкратном, которое должно быть возможным на истинном четырехъядерном ЦП при правильном применении, поскольку оно накладывается на премьер, а не на все простые числа на небольших страницах, а его использование небезопасного доступа к массиву указателей как метода устранения вычислительной стоимости DotNet для проверки привязки массива к каждому циклу фактически замедляет код по сравнению с использованием массивов напрямую, включая проверку границ, поскольку компилятор DotNet Just In Time (JIT) производит довольно неэффективный код для доступа указателя. Кроме того, его код перечисляет простые числа, так же, как мой код может делать, в котором перечисление имеет 10 циклов тактового цикла процессора за пересчитанное правое, что также немного хуже в его случае, поскольку он использует встроенные итераторы С#, которые несколько меньше эффективный, чем мой интерфейс IEnumerator "roll-your-own". Однако для максимальной скорости мы должны полностью избегать перечисления; однако даже его метод экземпляра "Count" использует цикл "foreach", который означает перечисление.

Таким образом, этот ответный код дает простые ответы примерно в 25 раз быстрее, чем ваш код вопроса на вашем процессоре E7500 (гораздо чаще в процессоре с большим количеством ядер/потоков) использует гораздо меньше памяти и не ограничивается меньшим количеством простых диапазон около 32-битного диапазона, но с учетом повышенной сложности кода.

Ответ 2

Моя реализация с многопоточным (требуется .NET 4.0):

using System;
using System.Collections;
using System.Collections.Generic;
using System.Threading.Tasks;

namespace PrimeGenerator
{
    // The block element type for the bit array, 
    // use any unsigned value. WARNING: UInt64 is 
    // slower even on x64 architectures.
    using BitArrayType = System.UInt32;

    // This should never be any bigger than 256 bits - leave as is.
    using BitsPerBlockType = System.Byte;

    // The prime data type, this can be any unsigned value, the limit
    // of this type determines the limit of Prime value that can be
    // found. WARNING: UInt64 is slower even on x64 architectures.
    using PrimeType = System.UInt32;

    /// <summary>
    /// Calculates prime number using the Sieve of Eratosthenes method.
    /// </summary>
    /// <example>
    /// <code>
    ///     var lpPrimes = new Eratosthenes(1e7);
    ///     foreach (UInt32 luiPrime in lpPrimes)
    ///         Console.WriteLine(luiPrime);
    /// </example>
    public class Eratosthenes : IEnumerable<PrimeType>
    {
        #region Constants

        /// <summary>
        /// Constant for number of bits per block, calculated based on size of BitArrayType.
        /// </summary>
        const BitsPerBlockType cbBitsPerBlock = sizeof(BitArrayType) * 8;

        #endregion

        #region Protected Locals

        /// <summary>
        /// The limit for the maximum prime value to find.
        /// </summary>
        protected readonly PrimeType mpLimit;

        /// <summary>
        /// True if the class is multi-threaded
        /// </summary>
        protected readonly bool mbThreaded;

        /// <summary>
        /// The current bit array where a set bit means
        /// the odd value at that location has been determined
        /// to not be prime.
        /// </summary>
        protected BitArrayType[] mbaOddNotPrime;

        #endregion

        #region Initialisation

        /// <summary>
        /// Create Sieve of Eratosthenes generator.
        /// </summary>
        /// <param name="limit">The limit for the maximum prime value to find.</param>
        /// <param name="threaded">True if threaded, false otherwise.</param>
        public Eratosthenes(PrimeType limit, bool threaded)
        {
            // Check limit range
            if (limit > PrimeType.MaxValue - (PrimeType)Math.Sqrt(PrimeType.MaxValue))
                throw new ArgumentOutOfRangeException();

            mbThreaded = threaded;
            mpLimit = limit;

            FindPrimes();
        }

        /// <summary>
        /// Create Sieve of Eratosthenes generator in multi-threaded mode.
        /// </summary>
        /// <param name="limit">The limit for the maximum prime value to find.</param>
        public Eratosthenes(PrimeType limit)
            : this(limit, true)
        {
        }

        #endregion

        #region Private Methods

        /// <summary>
        /// Calculates compartment indexes for a multi-threaded operation.
        /// </summary>
        /// <param name="startInclusive">The inclusive starting index.</param>
        /// <param name="endExclusive">The exclusive ending index.</param>
        /// <param name="threads">The number of threads.</param>
        /// <returns>An array of thread elements plus 1 containing the starting and exclusive ending indexes to process for each thread.</returns>
        private PrimeType[] CalculateCompartments(PrimeType startInclusive, PrimeType endExclusive, ref int threads)
        {
            if (threads == 0) threads = 1;
            if (threads == -1) threads = Environment.ProcessorCount;
            if (threads > endExclusive - startInclusive) threads = (int)(endExclusive - startInclusive);

            PrimeType[] liThreadIndexes = new PrimeType[threads + 1];
            liThreadIndexes[threads] = endExclusive;
            PrimeType liIndexesPerThread = (endExclusive - startInclusive) / (PrimeType)threads;
            for (PrimeType liCount = 0; liCount < threads; liCount++)
                liThreadIndexes[liCount] = liCount * liIndexesPerThread;

            return liThreadIndexes;
        }

        /// <summary>
        /// Executes a simple for loop in parallel but only creates
        /// a set amount of threads breaking the work up evenly per thread,
        /// calling the body only once per thread, this is different
        /// to the .NET 4.0 For method which calls the body for each index.
        /// </summary>
        /// <typeparam name="ParamType">The type of parameter to pass to the body.</typeparam>
        /// <param name="startInclusive">The starting index.</param>
        /// <param name="endExclusive">The exclusive ending index.</param>
        /// <param name="parameter">The parameter to pass to the body.</param>
        /// <param name="body">The body to execute per thread.</param>
        /// <param name="threads">The number of threads to execute.</param>
        private void For<ParamType>(
            PrimeType startInclusive, PrimeType endExclusive, ParamType parameter,
            Action<PrimeType, PrimeType, ParamType> body,
            int threads)
        {
            PrimeType[] liThreadIndexes = CalculateCompartments(startInclusive, endExclusive, ref threads);

            if (threads > 1)
                Parallel.For(
                    0, threads, new System.Threading.Tasks.ParallelOptions(),
                    (liThread) => { body(liThreadIndexes[liThread], liThreadIndexes[liThread + 1], parameter); }
                );
            else
                body(startInclusive, endExclusive, parameter);
        }

        /// <summary>
        /// Finds the prime number within range.
        /// </summary>
        private unsafe void FindPrimes()
        {
            // Allocate bit array.
            mbaOddNotPrime = new BitArrayType[(((mpLimit >> 1) + 1) / cbBitsPerBlock) + 1];

            // Cache Sqrt of limit.
            PrimeType lpSQRT = (PrimeType)Math.Sqrt(mpLimit);

            int liThreads = Environment.ProcessorCount;
            if (!Threaded) liThreads = 0;

            // Fix the bit array for pointer access
            fixed (BitArrayType* lpbOddNotPrime = &mbaOddNotPrime[0])
            {
                IntPtr lipBits = (IntPtr)lpbOddNotPrime;

                // Scan primes up to lpSQRT
                for (PrimeType lpN = 3; lpN <= lpSQRT; lpN += 2)
                {
                    // If the current bit value for index lpN is cleared (prime)
                    if (
                            (
                                lpbOddNotPrime[(lpN >> 1) / cbBitsPerBlock] &
                                ((BitArrayType)1 << (BitsPerBlockType)((lpN >> 1) % cbBitsPerBlock))
                            ) == 0
                        )
                    {
                        // If multi-threaded
                        if (liThreads > 1)
                        {
                            // Leave it cleared (prime) and mark all multiples of lpN*2 from lpN*lpN as not prime
                            For<PrimeType>(
                                0, ((mpLimit - (lpN * lpN)) / (lpN << 1)) + 1, lpN,
                                (start, end, n) =>
                                {
                                    BitArrayType* lpbBits = (BitArrayType*)lipBits;
                                    PrimeType lpM = n * n + (start * (n << 1));
                                    for (PrimeType lpCount = start; lpCount < end; lpCount++)
                                    {
                                        // Set as not prime
                                        lpbBits[(lpM >> 1) / cbBitsPerBlock] |=
                                            (BitArrayType)((BitArrayType)1 << (BitsPerBlockType)((lpM >> 1) % cbBitsPerBlock));

                                        lpM += n << 1;
                                    }
                                },
                                liThreads);
                        }
                        else
                        {
                            // Leave it cleared (prime) and mark all multiples of lpN*2 from lpN*lpN as not prime
                            for (PrimeType lpM = lpN * lpN; lpM <= mpLimit; lpM += lpN<<1)
                                // Set as not prime
                                lpbOddNotPrime[(lpM >> 1) / cbBitsPerBlock] |=
                                    (BitArrayType)((BitArrayType)1 << (BitsPerBlockType)((lpM >> 1) % cbBitsPerBlock));
                        }
                    }
                }
            }
        }

        /// <summary>
        /// Gets a bit value by index.
        /// </summary>
        /// <param name="bits">The blocks containing the bits.</param>
        /// <param name="index">The index of the bit.</param>
        /// <returns>True if bit is set, false if cleared.</returns>
        private bool GetBitSafe(BitArrayType[] bits, PrimeType index)
        {
            return (bits[index / cbBitsPerBlock] & ((BitArrayType)1 << (BitsPerBlockType)(index % cbBitsPerBlock))) != 0;
        }

        #endregion

        #region Public Properties

        /// <summary>
        /// Gets whether this class is multi-threaded or not.
        /// </summary>
        public bool Threaded
        {
            get
            {
                return mbThreaded;
            }
        }

        /// <summary>
        /// Get the limit for the maximum prime value to find.
        /// </summary>
        public PrimeType Limit
        {
            get
            {
                return mpLimit;
            }
        }

        /// <summary>
        /// Returns the number of primes found in the range.
        /// </summary>
        public PrimeType Count
        {
            get
            {
                PrimeType lptCount = 0;
                foreach (PrimeType liPrime in this)
                    lptCount++;
                return lptCount;
            }
        }

        /// <summary>
        /// Determines if a value in range is prime or not.
        /// </summary>
        /// <param name="test">The value to test for primality.</param>
        /// <returns>True if the value is prime, false otherwise.</returns>
        public bool this[PrimeType test]
        {
            get
            {
                if (test > mpLimit) throw new ArgumentOutOfRangeException();
                if (test <= 1) return false;
                if (test == 2) return true;
                if ((test & 1) == 0) return false;
                return !GetBitSafe(mbaOddNotPrime, test >> 1);
            }
        }

        #endregion

        #region Public Methods

        /// <summary>
        /// Gets the enumerator for the primes.
        /// </summary>
        /// <returns>The enumerator of the primes.</returns>
        public IEnumerator<PrimeType> GetEnumerator()
        {
            // Two always prime.
            yield return 2;

            // Start at first block, second MSB.
            int liBlock = 0;
            byte lbBit = 1;
            BitArrayType lbaCurrent = mbaOddNotPrime[0] >> 1;

            // For each value in range stepping in incrments of two for odd values.
            for (PrimeType lpN = 3; lpN <= mpLimit; lpN += 2)
            {
                // If current bit not set then value is prime.
                if ((lbaCurrent & 1) == 0)
                    yield return lpN;

                // Move to NSB.
                lbaCurrent >>= 1;

                // Increment bit value.
                lbBit++;

                // If block is finished.
                if (lbBit == cbBitsPerBlock) 
                {
                    // Move to first bit of next block.
                    lbBit = 0;
                    liBlock++;
                    lbaCurrent = mbaOddNotPrime[liBlock];
                }
            }
        }

        #endregion

        #region IEnumerable<PrimeType> Implementation

        /// <summary>
        /// Gets the enumerator for the primes.
        /// </summary>
        /// <returns></returns>
        IEnumerator IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }

        #endregion
    }
}

Многопоточность работает путем потоковой обработки самого внутреннего цикла, таким образом, нет проблем с блокировкой данных, поскольку несколько потоков работают с подмножеством массива и не перекрываются для каждого выполненного задания.

Скорее всего, он может генерировать все простые числа до 1000 000 000 на процессоре AMD Phenom II X4 965 за 5,8 секунды. Специальные реализации, такие как Аткинс, быстрее, но это быстро для сита Эратосфена.

Как реализовать сито из Eratosthenes с помощью многопоточного С#?

Ответы

Ответ 1

Ответ 2

Ответ 3